StarCluster - Mailing List Archive

Re: Eqw errors in SGE with default starcluster configuration

From: Josh Moore <no email>
Date: Tue, 7 Feb 2012 11:27:06 -0500

Ah ok, the problem is obvious now:

02/07/2012 16:20:31|worker|master|W|job 1.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1568]: error: can't chdir to /root/lme: No such file or directory
02/07/2012 16:20:31|worker|master|W|rescheduling job 1.1
02/07/2012 16:20:31|worker|master|W|job 3.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1569]: error: can't chdir to /root/lme: No such file or directory
02/07/2012 16:20:31|worker|master|W|rescheduling job 3.1

The directory structure isn't there on the other nodes. Like I said, this
script runs fine on my department's cluster, which also runs SGE, without any
extra steps to set up directories on the nodes. What do I need to configure
so that the working directory is available on every node?
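
A minimal sketch of one workaround, assuming StarCluster's default setup in
which /home is NFS-shared across all nodes (/root/lme comes from the log
above; the sgeadmin user and the myjob.sh script name are illustrative):

# /root is local to the master, so /root/lme does not exist on the
# compute nodes. Moving the job directory under the NFS-shared /home
# makes the same path visible cluster-wide.
mv /root/lme /home/sgeadmin/lme
cd /home/sgeadmin/lme
qsub -cwd myjob.sh    # -cwd runs the job from the (now shared) submit directory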

Best,
Josh

On Tue, Feb 7, 2012 at 11:02 AM, Dustin Machi <dmachi_at_vbi.vt.edu> wrote:

> I'd check out /opt/sge6/default/spool/qmaster/messages to see if there is
> anything useful about what is happening there. It will generally tell you
> why it's not queuing an additional job. Are the parallel environments set
> up the same between your two clusters?
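>
> A quick way to compare them (a sketch; "orte" is StarCluster's default PE
> name, so the PE on your department's cluster will likely be named
> differently):
>
> qconf -spl        # list all parallel environments on each cluster
> qconf -sp orte    # dump one PE's definition (slots, allocation_rule, etc.)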
>
> Dustin
>
> On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:
>
> > I tried submitting a bunch of jobs using qsub with a script that works
> > fine on another (non-Amazon) cluster's SGE configuration. But on a
> > cluster configured with StarCluster, only the first 8 enter the queue
> > without error, and all of those are immediately executed on the master
> > node (the nodes are c1.xlarge, so 8 cores each). Even if I delete one of
> > the jobs running on the master node, another never takes its place. The
> > cluster has 8 c1.xlarge nodes. Here is the output of qconf -ssconf:
> >
> > algorithm default
> > schedule_interval 0:0:15
> > maxujobs 0
> > queue_sort_method load
> > job_load_adjustments np_load_avg=0.50
> > load_adjustment_decay_time 0:7:30
> > load_formula np_load_avg
> > schedd_job_info false
> > flush_submit_sec 0
> > flush_finish_sec 0
> > params none
> > reprioritize_interval 0:0:0
> > halftime 168
> > usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
> > compensation_factor 5.000000
> > weight_user 0.250000
> > weight_project 0.250000
> > weight_department 0.250000
> > weight_job 0.250000
> > weight_tickets_functional 0
> > weight_tickets_share 0
> > share_override_tickets TRUE
> > share_functional_shares TRUE
> > max_functional_jobs_to_schedule 200
> > report_pjob_tickets TRUE
> > max_pending_tasks_per_job 50
> > halflife_decay_list none
> > policy_hierarchy OFS
> > weight_ticket 0.010000
> > weight_waiting_time 0.000000
> > weight_deadline 3600000.000000
> > weight_urgency 0.100000
> > weight_priority 1.000000
> > max_reservation 0
> > default_duration INFINITY
> >
> > I can't figure out how to change schedd_job_info to true to find out
> > more about the error message...
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster_at_mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
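
To answer the schedd_job_info question quoted above, one approach (a sketch;
run on the master as root, and the sconf.txt filename is illustrative):

qconf -msconf    # opens the scheduler configuration in $EDITOR;
                 # change "schedd_job_info false" to "schedd_job_info true"

# Or non-interactively: dump, edit, and reload the configuration:
qconf -ssconf > sconf.txt
sed -i 's/^schedd_job_info .*/schedd_job_info true/' sconf.txt
qconf -Msconf sconf.txt

With schedd_job_info enabled, "qstat -j <jobid>" then reports why the
scheduler is leaving a pending job in the qw state.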
Received on Tue Feb 07 2012 - 11:27:09 EST