Since you mentioned that the entire /opt/sge6 directory is empty, it
sounds like the NFS server (i.e. the StarCluster master) was not
available at some point?
What instance type are you using for the master?
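To confirm whether the NFS mounts actually dropped, a quick check on one of
the affected nodes would be something like the following (just a sketch; it
assumes the default "master" hostname and the stock NFSv3 setup shown in the
mount output below):

# showmount -e master                # is the master still exporting /home and /opt/sge6?
# rpcinfo -p master | grep nfs       # is the NFS service on the master answering at all?
# mount | grep -E '/home|/opt/sge6'  # are the shares still mounted on this node?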
Note that by default /home & /opt/sge6 are NFS mounts, and thus if the
nodes are doing lots of I/O it can cause issues (NFS doesn't scale
well, but 30 nodes should be OK unless there really is a lot of I/O
traffic).
# mount
...
master:/home on /home type nfs (rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)
master:/opt/sge6 on /opt/sge6 type nfs (rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)
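If a share did drop, then once the master is reachable again you can usually
recover a node by re-mounting and restarting execd by hand, roughly like this
(a sketch only; the sgeexecd script path assumes the default cell name
"default" under /opt/sge6 and may differ on your install):

# mount -t nfs master:/home /home            # re-mount by hand if it is not in /etc/fstab
# mount -t nfs master:/opt/sge6 /opt/sge6
# /opt/sge6/default/common/sgeexecd start    # restart the execution daemon on the node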
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
On Tue, Oct 29, 2013 at 8:43 PM, Ryan Golhar
<ngsbioinformatics_at_gmail.com> wrote:
> Hi all - I came across a weird problem that I experience every once in a
> while, and recently more and more often. I've created a 30-node spot cluster
> using StarCluster. I started a bunch of jobs on all the nodes and SGE shows
> all the jobs running. I come back an hour or two later to check on the
> cluster and only half the nodes are listed as running jobs in qstat. qhost
> shows the nodes as down. I can log into the nodes and sure enough, sge_execd
> is not running. On some of the nodes I can start the service manually; on
> others, the entire /opt/sge6 directory is empty. I have no idea why this
> would be the case, especially since they were running jobs to begin with.
> Has anyone else seen this?
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>