StarCluster - Mailing List Archive

sge node stops running jobs

From: Ryan Golhar <no email>
Date: Tue, 29 Oct 2013 20:43:39 -0400

Hi all - I've run into a weird problem that I see every once in a while,
and more and more often lately. I created a 30-node spot cluster using
StarCluster, started a bunch of jobs on all the nodes, and SGE showed all
the jobs running. When I come back an hour or two later and check on the
cluster, qstat lists only half the nodes as running jobs, and qhost shows
the missing nodes as down. I can log into those nodes and, sure enough,
sge_execd is not running. On some of the nodes I can start the service
manually; on others, the entire /opt/sge6 directory is empty. I have no
idea why this would happen, especially since the nodes were running jobs
to begin with. Has anyone else seen this?
Received on Tue Oct 29 2013 - 20:43:40 EDT