StarCluster - Mailing List Archive

Re: Eqw errors in SGE with default starcluster configuration

From: Dustin Machi <no email>
Date: Tue, 7 Feb 2012 11:37:32 -0500

It looks to me like either (a) the script you are submitting is specifically looking in /root/lme (or setting that as its working directory), or (b) you are submitting jobs as the root user and your script is looking in ~/lme. I don't think /root is shared via NFS across the cluster nodes, but /home for standard users is. If you run this as a normal user and make sure you are working in /home/username/lme instead of /root/, I think it will work. I'm guessing you don't run as root on your department's cluster.
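A rough sketch of what I mean (the user name, directory, and script name here are just placeholders; adjust them to whatever you actually use):

    # on the master, switch to a non-root account whose home is under /home
    # (e.g. the sgeadmin user, or any other regular user)
    su - sgeadmin
    mkdir -p ~/lme && cd ~/lme        # /home is NFS-shared to all nodes
    qsub -cwd myjob.sh                # -cwd makes SGE run the job from this shared directory

If some jobs are already stuck in Eqw, you can clear the error state with qmod -cj <job_id> once the directory problem is fixed, or just qdel them and resubmit.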

Dustin



On Feb 7, 2012, at 11:27 AM, Josh Moore wrote:

> Ah ok, the problem is obvious now:
>
> 02/07/2012 16:20:31|worker|master|W|job 1.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1568]: error: can't chdir to /root/lme: No such file or directory
> 02/07/2012 16:20:31|worker|master|W|rescheduling job 1.1
> 02/07/2012 16:20:31|worker|master|W|job 3.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1569]: error: can't chdir to /root/lme: No such file or directory
> 02/07/2012 16:20:31|worker|master|W|rescheduling job 3.1
>
> The directory structure isn't there on the other nodes. Like I said, this script runs fine on my department's cluster, which runs SGE, without any extra action to set up directories on the nodes. What do I need to configure so that the directory structure is replicated across the nodes?
>
> Best,
> Josh
>
> On Tue, Feb 7, 2012 at 11:02 AM, Dustin Machi <dmachi_at_vbi.vt.edu> wrote:
> I'd check out /opt/sge6/default/spool/qmaster/messages to see if there is anything useful about what is happening there. It will generally tell you why it's not queuing an additional job. Are the parallel environments set up the same between your two clusters?
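>
> For example, something along these lines (the "orte" PE name is just a guess at what your cluster uses; qconf -spl will list the real names on each cluster):
>
>     tail -50 /opt/sge6/default/spool/qmaster/messages   # recent qmaster/scheduler log entries
>     qconf -spl                                           # list configured parallel environments
>     qconf -sp orte                                       # dump one PE's settings to compare between clusters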
>
> Dustin
>
> On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:
>
> > I tried submitting a bunch of jobs using qsub with a script that works fine on another (non-Amazon) cluster's SGE configuration. But on a cluster configured with StarCluster, only the first 8 jobs enter the queue without error, and all of those are immediately executed on the master node. Even if I delete one of the jobs on the master node, another one never takes its place. The cluster is 8 c1.xlarge nodes (8 cores each). Here is the output of qconf -ssconf:
> >
> > algorithm default
> > schedule_interval 0:0:15
> > maxujobs 0
> > queue_sort_method load
> > job_load_adjustments np_load_avg=0.50
> > load_adjustment_decay_time 0:7:30
> > load_formula np_load_avg
> > schedd_job_info false
> > flush_submit_sec 0
> > flush_finish_sec 0
> > params none
> > reprioritize_interval 0:0:0
> > halftime 168
> > usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
> > compensation_factor 5.000000
> > weight_user 0.250000
> > weight_project 0.250000
> > weight_department 0.250000
> > weight_job 0.250000
> > weight_tickets_functional 0
> > weight_tickets_share 0
> > share_override_tickets TRUE
> > share_functional_shares TRUE
> > max_functional_jobs_to_schedule 200
> > report_pjob_tickets TRUE
> > max_pending_tasks_per_job 50
> > halflife_decay_list none
> > policy_hierarchy OFS
> > weight_ticket 0.010000
> > weight_waiting_time 0.000000
> > weight_deadline 3600000.000000
> > weight_urgency 0.100000
> > weight_priority 1.000000
> > max_reservation 0
> > default_duration INFINITY
> >
> > I can't figure out how to change schedd_job_info to true to find out more about the error message...
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster_at_mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
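P.S. On the schedd_job_info question: qconf -msconf opens the scheduler configuration in your editor; change schedd_job_info from false to true and save, and qstat -j <job_id> will then include a "scheduling info" section explaining why a job is still waiting. Roughly:

    qconf -msconf        # set schedd_job_info true, then save and quit
    qstat -j 1           # job id 1 is just an example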
Received on Tue Feb 07 2012 - 11:37:44 EST