StarCluster - Mailing List Archive

Re: Eqw errors in SGE with default starcluster configuration

From: Justin Riley <no email>
Date: Tue, 07 Feb 2012 11:44:10 -0500

Yes, the issue is most likely that /root is not, and will never be,
NFS-shared on the cluster. I'd recommend, as Dustin suggested, logging
in as the CLUSTER_USER (defined in your config) and retrying your
script(s) from $HOME, which *is* NFS-shared.

For example, assuming CLUSTER_USER=sgeadmin:

$ starcluster sshmaster yourcluster -u sgeadmin
sgeadmin_at_master $ qsub ....
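
Assuming your job script and its data live somewhere under sgeadmin's
home directory (~/lme and run.sh below are just illustrative names),
that would look roughly like:

sgeadmin_at_master $ cd ~/lme
sgeadmin_at_master $ qsub -cwd run.sh

The -cwd flag makes SGE run the job from the submit directory, which
only works if that directory exists on every node, i.e. if it lives on
the NFS share.

As for schedd_job_info: you should be able to change it with
"qconf -msconf", which opens the same scheduler configuration that
"qconf -ssconf" prints in your $EDITOR. Set schedd_job_info to true,
save, and "qstat -j <jobid>" should then report why a job is being
left pending.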

HTH,

~Justin

P.S. - Please join the list if you can; otherwise I have to manually
log in and approve each message you post.

On 02/07/2012 11:37 AM, Dustin Machi wrote:
> It looks to me like either a) the script you are submitting is
> specifically looking in /root/lme (or setting that as the working
> directory), or b) you are submitting jobs as the root user and your
> script is looking in ~/lme. I don't think /root is shared via NFS
> across the cluster nodes, but /home for standard users is shared.
> If you run this as a normal user and ensure you are looking in
> /home/username/lme instead of /root/, I think it would work. I'm
> guessing you don't run as root on your department's cluster.
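>
> For example, something like this on the master (assuming
> CLUSTER_USER=sgeadmin and using the /root/lme path from the error
> below) would copy the data onto the NFS-shared home:
>
>   # cp -r /root/lme /home/sgeadmin/lme
>   # chown -R sgeadmin:sgeadmin /home/sgeadmin/lme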
>
> Dustin
>
>
>
> On Feb 7, 2012, at 11:27 AM, Josh Moore wrote:
>
>> Ah ok, the problem is obvious now:
>>
>> 02/07/2012 16:20:31|worker|master|W|job 1.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1568]: error: can't chdir to /root/lme: No such file or directory
>> 02/07/2012 16:20:31|worker|master|W|rescheduling job 1.1
>> 02/07/2012 16:20:31|worker|master|W|job 3.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1569]: error: can't chdir to /root/lme: No such file or directory
>> 02/07/2012 16:20:31|worker|master|W|rescheduling job 3.1
>>
>> The directory structure isn't there on the other nodes. Like I
>> said, this script runs fine on my department's cluster, which
>> runs SGE, without any extra action to set up directories on the
>> nodes. What do I need to configure so that this will replicate
>> the directory structure etc.?
>>
>> Best, Josh
>>
>> On Tue, Feb 7, 2012 at 11:02 AM, Dustin Machi
>> <dmachi_at_vbi.vt.edu> wrote:
>>
>> I'd check out /opt/sge6/default/spool/qmaster/messages to see if
>> there is anything useful about what is happening there. It will
>> generally tell you why it's not queuing an additional job. Are the
>> parallel environments set up the same between your two clusters?
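>>
>> (qconf -spl lists the parallel environments on each cluster, and
>> qconf -sp <pe_name> prints a given one, so diffing that output
>> between the two clusters should show any differences.)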
>>
>> Dustin
>>
>> On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:
>>
>>> I tried submitting a bunch of jobs using qsub with a script that
>>> works fine on another (non-Amazon) cluster's SGE configuration.
>>> But on a cluster configured with StarCluster, only the first 8
>>> jobs enter the queue without error, and all of those are
>>> immediately executed on the master node. Even if I delete one of
>>> the jobs on the master node, another one never takes its place.
>>> The cluster has 8 c1.xlarge nodes (so 8 cores each). Here is the
>>> output of qconf -ssconf:
>>>
>>> algorithm                          default
>>> schedule_interval                  0:0:15
>>> maxujobs                           0
>>> queue_sort_method                  load
>>> job_load_adjustments               np_load_avg=0.50
>>> load_adjustment_decay_time         0:7:30
>>> load_formula                       np_load_avg
>>> schedd_job_info                    false
>>> flush_submit_sec                   0
>>> flush_finish_sec                   0
>>> params                             none
>>> reprioritize_interval              0:0:0
>>> halftime                           168
>>> usage_weight_list                  cpu=1.000000,mem=0.000000,io=0.000000
>>> compensation_factor                5.000000
>>> weight_user                        0.250000
>>> weight_project                     0.250000
>>> weight_department                  0.250000
>>> weight_job                         0.250000
>>> weight_tickets_functional          0
>>> weight_tickets_share               0
>>> share_override_tickets             TRUE
>>> share_functional_shares            TRUE
>>> max_functional_jobs_to_schedule    200
>>> report_pjob_tickets                TRUE
>>> max_pending_tasks_per_job          50
>>> halflife_decay_list                none
>>> policy_hierarchy                   OFS
>>> weight_ticket                      0.010000
>>> weight_waiting_time                0.000000
>>> weight_deadline                    3600000.000000
>>> weight_urgency                     0.100000
>>> weight_priority                    1.000000
>>> max_reservation                    0
>>> default_duration                   INFINITY
>>>
>>> I can't figure out how to change schedd_job_info to true to
>>> find out more about the error message...
