Note
Condor is only available on StarCluster 11.10 Ubuntu-based AMIs and above. To see a list of available AMIs, run the listpublic command:
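$ starcluster listpublic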
To configure a Condor pool on your cluster, first define the Condor plugin in your config file:
[plugin condor]
setup_class = starcluster.plugins.condor.CondorPlugin
After defining the plugin in your config, add the condor plugin to the list of plugins in one of your cluster templates:
[cluster smallcluster]
plugins = condor
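The plugin will run the next time you start a cluster from this template. For example, assuming you name the cluster mycluster as in the examples below:

$ starcluster start -c smallcluster mycluster

If the cluster is already up, the runplugin command applies the plugin without restarting:

$ starcluster runplugin condor mycluster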
Condor jobs cannot be submitted by the root user. Instead, log in to the cluster as the normal CLUSTER_USER:
$ starcluster sshmaster mycluster -u myuser
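Once logged in, you can sanity-check the pool before submitting anything. These are standard Condor commands rather than anything StarCluster-specific: condor_version prints the installed Condor version and condor_status lists the execution slots in the pool:

$ condor_version
$ condor_status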
Warning
The “parallel” universe currently does not work. This should be resolved in a future release.
To submit a job, you must first create a job script. Below is a simple example describing a job that sleeps for 5 minutes:
Universe = vanilla
Executable = /bin/sleep
Arguments = 300
Log = sleep.log
Output = sleep.out
Error = sleep.error
Queue
The above job runs /bin/sleep with 300 as its first argument. Condor's own messages are logged to $PWD/sleep.log, and the job's standard output and standard error are saved to $PWD/sleep.out and $PWD/sleep.error respectively, where $PWD is the directory from which the job was originally submitted. Save the job script to a file, say job.txt, and use the condor_submit command to submit the job:
$ condor_submit job.txt
Submitting job(s).
1 job(s) submitted to cluster 8.
From the output above we see the job has been submitted as job 8 (Condor refers to each submission as a "cluster"). Let's submit this job once more to verify that Condor can distribute multiple jobs across the cluster:
$ condor_submit job.txt
Submitting job(s).
1 job(s) submitted to cluster 9.
The last job was submitted as job 9. The next step is to monitor these jobs until they’re finished.
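As an aside, both jobs could have been queued with a single condor_submit call; the Queue command at the end of a submit file accepts a count (standard Condor submit syntax):

Queue 2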
To monitor the status of your Condor jobs, use the condor_q command:
$ condor_q
-- Submitter: master : <10.220.226.138:52585> : master
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  8.0    myuser        12/12 21:31   0+00:06:40 R  0   0.0  sleep 300
  9.0    myuser        12/12 21:31   0+00:05:56 R  0   0.0  sleep 300

2 jobs; 0 idle, 2 running, 0 held
From the output above we see that both jobs are currently running (R in the ST column). To find out which cluster nodes the jobs are running on, pass the -run option:
$ condor_q -run
-- Submitter: master : <10.220.226.138:52585> : master
 ID      OWNER          SUBMITTED     RUN_TIME HOST(S)
  8.0    myuser        12/12 21:31   0+00:05:57 master
  9.0    myuser        12/12 21:31   0+00:05:13 node001
Here we see that job 8 is running on the master and job 9 is running on node001. If your job is taking too long to run, you can diagnose the issue by passing the -analyze option to condor_q:
$ condor_q -analyze
This will give you verbose output showing which scheduling conditions failed and why.
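You can also restrict the analysis to a single job by passing its ID, for example job 8 from above:

$ condor_q -analyze 8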
In some cases you may need to cancel queued or running jobs, either because of an error in your job script or simply because you wish to change job parameters. Whatever the case may be, you can cancel jobs by passing their job IDs to condor_rm:
$ condor_rm 9
Cluster 9 has been marked for removal.
The above example removes job 9 from the Condor queue.
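Since a single submission can contain multiple jobs, you can also remove one job within a cluster using its full cluster.process ID, or remove all of your queued jobs at once with the -all option (both are standard condor_rm usage):

$ condor_rm 9.0
$ condor_rm -all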