Re: sge node stops running jobs
I didn't know they were NFS mounted. All my nodes are m1.xlarge. They
aren't doing any I/O over NFS, just to local scratch space on the ephemeral
drives. They do read some program files over NFS, but not much. If it were
an NFS problem, I would expect to see an NFS error or 'ls' hanging, but the
directory lists just fine and is empty. The fact that it's NFS is weird and
doesn't fit the symptoms. Maybe the mount got removed somehow and 'ls' is
showing the local directory? Hmmm. I'll have to investigate further next
time it happens.
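
For next time, a rough sketch of how I could tell a dropped mount apart
from an empty NFS export. It assumes the nodes expose a Linux-style
/proc/mounts; the helper name is_nfs_mount is made up for illustration and
is not part of StarCluster or SGE.

```shell
#!/bin/sh
# Hypothetical helper (not part of StarCluster or SGE): succeed if the given
# path is listed as an NFS mount point in a /proc/mounts-style table.
is_nfs_mount() {
    # $1 = mount point to look for
    # $2 = mounts table to read (assumed default: /proc/mounts)
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# The two directories that are NFS mounted from the master by default.
for dir in /home /opt/sge6; do
    if is_nfs_mount "$dir"; then
        echo "$dir: NFS mount present"
    else
        echo "$dir: not an NFS mount, so ls is showing the local directory"
    fi
done
```

If /opt/sge6 shows up as not mounted on a broken node, the empty listing
is just the local mount point, which would fit what I'm seeing.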
On Tue, Oct 29, 2013 at 10:18 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
> Since you mentioned that the entire /opt/sge6 directory is empty, it
> sounds like the NFS server (i.e. the StarCluster master) was not
> available at some point?
>
> What instance type are you using for the master?
>
> Note that by default /home & /opt/sge6 are NFS mounts, and thus if the
> nodes are doing lots of I/O, then it can cause issues (NFS doesn't
> scale well, but 30 nodes should be OK unless there really is lots of
> I/O traffic).
>
> # mount
> ...
> master:/home on /home type nfs (rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)
> master:/opt/sge6 on /opt/sge6 type nfs (rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Tue, Oct 29, 2013 at 8:43 PM, Ryan Golhar
> <ngsbioinformatics_at_gmail.com> wrote:
> > Hi all - I came across a weird problem that I experience every once in a
> > while, and recently more and more. I've created a 30-node spot cluster
> > using starcluster. I started a bunch of jobs on all the nodes and sge
> > shows all the jobs running. I come back an hour or two later and check
> > on the cluster, and only half the nodes are listed as running jobs using
> > qstat. qhost shows the nodes as down. I can log into the nodes and sure
> > enough, sge_execd is not running. On some of the nodes I can start the
> > service manually; on others, the entire /opt/sge6 directory is empty. I
> > have no idea why this would be the case, especially since they were
> > running jobs to begin with. Has anyone else seen this?
Received on Wed Oct 30 2013 - 00:02:26 EDT