StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: Jennifer Staab <no email>
Date: Wed, 01 Oct 2014 16:53:27 -0400

Ed and Hugh have it right. As long as it is a Python module, you can use
the Python package installer plugin as Hugh suggested. If it is a
library or set of scripts you created for your own personal use, you will
need to propagate that code yourself. And as Ed suggested, you could
set up an AMI with all the software you need, and/or take an AMI of the
master node (since it is already set up the way you need it) and use that
for your worker nodes.
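
Whichever route you take, it may help to confirm afterwards that the module
really imports on every node. The following is only a rough sketch, not
something taken from StarCluster itself: it assumes passwordless ssh from the
master to each node, that `python` on each node is the interpreter your jobs
use, and that the node names you pass in (for example master and node001)
match your cluster. The function name check_module and the RUN_ON variable
are made up for illustration.

```shell
#!/bin/sh
# Report whether a Python module can be imported on each listed node.
# Usage: check_module MODULE NODE [NODE...]
# RUN_ON is a hypothetical hook naming the command used to reach a node;
# it defaults to ssh. Node names are whatever your cluster uses.
check_module() {
    module=$1
    shift
    for node in "$@"; do
        # The remote command is passed as one quoted string so that the
        # remote shell sees: python -c 'import <module>'
        if ${RUN_ON:-ssh} "$node" "python -c 'import $module'" >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: MISSING $module"
        fi
    done
}
```

For example, `check_module numpy master node001 node002` prints one line per
node, so any node still missing the library stands out.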

Also, from past (bad) experience: if you are going to take an AMI of a
running EC2 instance, stop it first, then take the AMI. If your instance
isn't EBS-backed, don't stop it or try to take the AMI -- my fear is that
taking the AMI will somehow reboot it (or that it will need to be rebooted
to function after the AMI is taken) and you could lose everything. Recall
that if you are using instance/ephemeral storage, stopping the instance
makes all instance/ephemeral storage disappear, so even if it is an
EBS-backed instance, be sure to save anything important that lives in
instance/ephemeral storage before stopping it. One final note: if you stop
a StarCluster instance, it tends to drop all your mounts (like mounted EBS
volumes), so you will likely need to remount all mounted EBS volumes upon
restarting it. I believe Hugh attributed this to the
"self.detach_volumes()" line in the "stop_cluster" definition in
"cluster.py" of the StarCluster source code (thanks again Hugh --
remounting EBS volumes is a total pain). To figure out what you have
mounted before stopping the master node, run 'df -h' or 'lsblk' on
the master node.
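
If you want a record to remount from later, one option is to save the
device-to-mount-point pairs to a file before stopping. This is a minimal
sketch under a few assumptions: it reads the POSIX-format output of `df -P`
on standard input, keeps only file systems whose device lives under /dev/,
and will mis-split mount points that contain spaces. The function name
list_dev_mounts and the output file name are just illustrative.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "device mountpoint" for every
# file system backed by a device under /dev/ (skips tmpfs, NFS, etc.).
# Assumes mount points contain no spaces.
list_dev_mounts() {
    awk 'NR > 1 && $1 ~ /^\/dev\// { print $1, $6 }'
}
```

Running `df -P | list_dev_mounts > ~/mounts-before-stop.txt` on the master
node before stopping it leaves a short list to work from when remounting.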

Good Luck.

-Jennifer

On 10/1/14 1:48 PM, greg wrote:
> Hi everyone,
>
> I found the bug. Apparently a library I installed in Python is only
> available on the master node. What's a good way to install Python
> libraries so they're available on all nodes? I guess virtualenv, but I'm
> hoping for something simpler :-)
>
> -Greg
>
> On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>> For software and scripts, you can log in to each node and check that the
>> software is installed and that you can view/find the scripts. Also, you
>> might check that the user you qsub'ed the jobs with has the correct
>> permissions to run the scripts and software and to write output.
>>
>> An easier way is to use qstat -j <JOBID>, where <JOBID> is the job ID of
>> one of the jobs with EQW status. It and/or the .o/.e files you set when you
>> submitted the qsub job will give you the file location to read the error
>> messages. If you didn't set .o and .e files in your qsub call (using the -e
>> and -o options), I believe it defaults to files named from the job name and
>> job ID, with .o (for output) and .e (for error). I believe Chris talked
>> about this in his reply. This is how I discovered my scripts weren't
>> shared: the .e file indicated the scripts couldn't be found.
>>
>> Also, as Chris said, you can use qconf to change attributes of the SGE
>> setup. I have done this before on a StarCluster cluster: log in to the
>> master node and, as long as you have admin privileges, run the qconf
>> command there.
>>
>> Good Luck.
>>
>> -Jennifer
>>
>>
>> greg <margeemail_at_gmail.com> wrote:
>>
>> Thanks Jennifer! Being completely new to StarCluster, how can I
>> check that my scripts are available to all nodes?
>>
>> -Greg
>>
>> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>>> Greg -
>>>
>>> Maybe check that your software and scripts are available to all nodes. I
>>> have had Starcluster throw a bunch of EQW's when I accidentally didn't
>>> have all the software and script components loaded in a directory that
>>> was NFS'ed for all the nodes of my cluster and/or individually loaded on
>>> all nodes of the cluster.
>>>
>>> And as Chris just stated use:
>>> qstat -j <JOBID> ==> gives complete info on that job
>>> qstat -j <JOBID> | grep error (looks for errors in job)
>>>
>>> When you get the error debugged you can use:
>>> qmod -cj <JOBID> (will clear error state and restart job - like Eqw )
>>>
>>> Good Luck.
>>>
>>> -Jennifer
>>>
>>>
>>>
>>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>> 'EQW' is a combination of multiple message states (e)(q)(w). The
>>>> standard "qw" is familiar to everyone, the E indicates something bad at
>>>> the job level.
>>>>
>>>> There are multiple levels of debugging, starting with easy and getting
>>>> more cumbersome. Almost all require admin or sudo level access.
>>>>
>>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>>> is in EQW state, that should provide a bit more information about what
>>>> went wrong.
>>>>
>>>> After that you look at the .e and .o STDERR/STDOUT files from the script
>>>> if any were created
>>>>
>>>> After that you can use sudo privs to go into
>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file, there
>>>> are also per-node messages files you can look at as well.
>>>>
>>>> The next level of debugging after that usually involves setting the
>>>> sge_execd parameter KEEP_ACTIVE=true, which triggers a behavior where SGE
>>>> will stop deleting the temporary files associated with a job life cycle.
>>>> Those files live down in the SGE spool at location
>>>> <executionhost>/active_jobs/<jobID>/ -- and they are invaluable in
>>>> debugging nasty subtle job failures.
>>>>
>>>> EQW should be easy to troubleshoot though - it indicates a fatal error
>>>> right at the beginning of the job dispatch or execution process. No
>>>> subtle things there
>>>>
>>>>
>>>> And if your other question was about nodes being allowed to submit jobs
>>>> -- yes, you have to configure this. It can be done during SGE install
>>>> time or any time afterwards by doing "qconf -as <nodename>" from any
>>>> account with SGE admin privs. I have no idea if StarCluster does this
>>>> automatically or not, but I'd expect that it probably does. If not, it's
>>>> an easy fix.
>>>>
>>>> -Chris
>>>>
>>>>
>>>> greg wrote:
>>>>> Hi guys,
>>>>>
>>>>> I'm afraid I'm still stuck on this. Besides my original question,
>>>>> which I'm still not sure about, does anyone have any general advice
>>>>> on debugging an EQW state? The same software runs fine in our local
>>>>> cluster.
>>>>>
>>>>> thanks again,
>>>>>
>>>>> Greg
Received on Wed Oct 01 2014 - 16:53:37 EDT
This archive was generated by hypermail 2.3.0.
