StarCluster - Mailing List Archive

Re: jobs on slave nodes disappear

From: Justin Riley <no email>
Date: Tue, 03 Jan 2012 12:44:43 -0500

Hi Liang,

Ahhh, I think I know what's happening. You're submitting the jobs as
root from root's home folder, which is not NFS-shared. This is why
you're not seeing any results, output files, error files, etc. unless
the job happens to get scheduled on the master node, which is the node
you're currently logged into.

If you switch to the CLUSTER_USER and submit the job from CLUSTER_USER's
$HOME folder, you'll probably find everything works fine. This is because
CLUSTER_USER's $HOME is NFS-shared across the cluster by default, so no
matter which node the job gets scheduled to, you'll see the results from
any node in the cluster.
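
A minimal sketch of how one might check whether the submission directory
is on an NFS mount before calling qsub. The helper name is_nfs_shared is
made up for illustration, it assumes a Linux node where /proc/mounts is
readable, and it does not handle mount points containing spaces. It is
not part of StarCluster or SGE.

```shell
# is_nfs_shared DIR [MOUNTS_FILE]
# Hypothetical helper (not a StarCluster or SGE command).
# Succeeds if the longest mount point that contains DIR has a
# filesystem type of nfs or nfs4. MOUNTS_FILE defaults to /proc/mounts.
is_nfs_shared() {
    nfs_dir=$1
    nfs_mounts=${2:-/proc/mounts}
    nfs_best_len=0
    nfs_best_type=
    while read -r nfs_dev nfs_mnt nfs_type nfs_rest; do
        case "$nfs_dir/" in
            "${nfs_mnt%/}/"*)
                # Keep the most specific (longest) matching mount point;
                # on a tie, the later entry in the file wins.
                if [ "${#nfs_mnt}" -ge "$nfs_best_len" ]; then
                    nfs_best_len=${#nfs_mnt}
                    nfs_best_type=$nfs_type
                fi
                ;;
        esac
    done < "$nfs_mounts"
    case "$nfs_best_type" in
        nfs|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on the master node (single.sh is the job script from
# this thread):
#   is_nfs_shared "$PWD" && qsub single.sh
```

If the check fails for the current directory (as it likely would for
root's home folder in this case), switch to the CLUSTER_USER, cd to that
user's $HOME, and submit from there.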

HTH,

~Justin


On 12/31/2011 03:18 PM, liang cheng wrote:
> Hi Justin,
>
> Thanks for your reply. There's no error log or output log, even when I
> use the "-e" or "-o" option.
>
> I created a cluster with one master and 10 slaves. I made a minor change
> on the master node and used "starcluster createimage i-xxxx AAA BBB",
> where "i-xxxx" is the instance id of the master. After I got ami-yyyy, I
> ran "starcluster start ami-yyyy". I found that all jobs submitted to
> slave nodes finish instantly, as you can see in the log I sent earlier.
> The jobs on the master node run normally.
>
> I haven't used the "restart" command but will give it a try.
>
> -Liang
>
> On Sat, Dec 31, 2011 at 12:03 PM, Justin Riley <jtriley_at_mit.edu
> <mailto:jtriley_at_mit.edu>> wrote:
>
> Hi Liang,
>
> Is this happening consistently even after restarting the cluster using
> "starcluster restart mycluster"? Also, is there anything in your
> job(s) error logs? Given the output you provided these would most
> likely be located in the directory you submitted the job from and
> should be named something like "single.sh.e23".
>
> ~Justin
>
>
> On 12/30/2011 08:58 PM, liang cheng wrote:
>> Greetings !
>
>> I created a StarCluster cluster on EC2 and use qsub to submit jobs.
>> It used to work well. Since this afternoon, after I requested
>> additional EC2 instances from Amazon, the following issue has appeared.
>
>> Only the jobs submitted to the master node are executed. Other
>> jobs disappear almost immediately. Some diagnostics are below. Any
>> help is appreciated!
>
>> Happy New Year !
>
>
>> root_at_master:/# qacct -j 23
>> ==============================================================
>> qname        all.q
>> hostname     node006
>> group        root
>> owner        root
>> project      NONE
>> department   defaultdepartment
>> jobname      single.sh out 3
>> jobnumber    23
>> taskid       undefined
>> account      sge
>> priority     0
>> qsub_time    Sat Dec 31 01:38:32 2011
>> start_time   Sat Dec 31 01:38:39 2011
>> end_time     Sat Dec 31 01:38:39 2011
>> granted_pe   NONE
>> slots        1
>> failed       0
>> exit_status  0
>> ru_wallclock 0
>> ru_utime     0.010
>> ru_stime     0.010
>> ru_maxrss    2276
>> ru_ixrss     0
>> ru_ismrss    0
>> ru_idrss     0
>> ru_isrss     0
>> ru_minflt    2648
>> ru_majflt    0
>> ru_nswap     0
>> ru_inblock   0
>> ru_oublock   272
>> ru_msgsnd    0
>> ru_msgrcv    0
>> ru_nsignals  0
>> ru_nvcsw     12
>> ru_nivcsw    3
>> cpu          0.020
>> mem          0.000
>> io           0.000
>> iow          0.000
>> maxvmem      0.000
>> arid         undefined
>
>> =========================
>
>> Thanks, -Liang
>
>
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Tue Jan 03 2012 - 12:44:46 EST
