StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: greg <no email>
Date: Thu, 2 Oct 2014 08:26:50 -0400

Thanks, Jennifer. I'm actually saving all my install steps as a giant
bash script so I can reinstall easily whenever I restart, so I'm not
too worried about losing my work.

Regarding the Python stuff, yes, it's homegrown code or extremely old
libraries that are probably no longer on PyPI. So I ended up going with a
virtualenv in my home directory and it seems to be working fine.
Something like this:

# Set up a virtualenv for Python installs (since normally installed
# packages won't be shared across nodes)
cd /home/sgeadmin/custom/bin/
virtualenv --always-copy --system-site-packages python_for_msg
# Put it on the PATH in .bashrc and activate it for the current shell
echo 'export PATH=/home/sgeadmin/custom/bin/python_for_msg/bin:$PATH' >> ~/.bashrc
export PATH=/home/sgeadmin/custom/bin/python_for_msg/bin:$PATH

install Python modules ...
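
For example, with that virtualenv first on the PATH, module installs are just
pip calls (the package names and paths below are placeholders for whatever
your pipeline actually needs):

pip install numpy
pip install /home/sgeadmin/custom/src/old_internal_lib.tar.gz   # local tarball for code not on PyPI

Since /home is NFS-shared to the worker nodes by StarCluster, the virtualenv
(created with --always-copy) and everything pip installs into it should be
visible cluster-wide.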

-Greg

On Wed, Oct 1, 2014 at 4:53 PM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
> Ed and Hugh have it right. As long as it is a Python module, you can use the
> Python package installer plugin as Hugh suggested. If it is a library or
> scripts you created for your own personal use, you will need to propagate
> that code yourself. And as Ed suggested, perhaps set up an AMI with all the
> software you need and/or take an AMI of the master node (since it is already
> set up the way you need it) and use that for your worker nodes.
>
> Also, from past (bad) experience: if you are going to take an AMI of a running
> EC2 instance, stop it first and then take the AMI. If your instance isn't
> EBS-backed, don't stop it or try to take the AMI -- my fear is that in taking
> the AMI it will somehow reboot (or need to be rebooted to function after the
> AMI is taken) and you could lose everything. Recall that if you are using
> instance/ephemeral storage, stopping the instance makes all instance/ephemeral
> storage disappear, so even if it is an EBS-backed instance, be sure to save
> anything important from instance/ephemeral storage before stopping it. And a
> final note: if you stop a StarCluster EC2 instance, it tends to drop all your
> mounts (like mounted EBS volumes), so you will likely need to remount all
> mounted EBS volumes upon restarting it. I believe Hugh attributed this to the
> "self.detach_volumes()" line in the "stop_cluster" definition in "cluster.py"
> of the StarCluster source code (thanks again, Hugh -- remounting EBS volumes
> is a total pain). To figure out what you have mounted before stopping the
> master node, run 'df -h' or 'lsblk' on the master node.
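>
> For example, a minimal sketch of the record-and-remount step (the device and
> mount point below are placeholders for whatever 'lsblk' shows on your master
> node):
>
> # before stopping: record what is mounted and where
> df -h > ~/mounts_before_stop.txt
> lsblk >> ~/mounts_before_stop.txt
>
> # after restarting (and re-attaching the volume in the EC2 console or via
> # the AWS CLI), remount by hand, e.g.:
> sudo mkdir -p /data
> sudo mount /dev/xvdf /data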
>
> Good Luck.
>
> -Jennifer
>
>
> On 10/1/14 1:48 PM, greg wrote:
>>
>> Hi everyone,
>>
>> I found the bug. Apparently a Python library I installed is only
>> available on the master node. What's a good way to install Python
>> libraries so they're available on all nodes? I guess virtualenv, but I'm
>> hoping for something simpler :-)
>>
>> -Greg
>>
>> On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>>>
>>> For software and scripts, you can log in to each node and check that the
>>> software is installed and that you can view/find the scripts. Also, you
>>> might check that the user you qsub'ed the jobs as has the correct
>>> permissions to run the scripts and software and to write the output.
>>>
>>> An easier way is to use qstat -j <JOBID>, where <JOBID> is the job ID of
>>> one of the jobs with Eqw status. It and/or the .o/.e files you set when
>>> you submitted the qsub job will give you the file locations where the
>>> error messages are written. If you didn't set .o and .e files in your
>>> qsub call (using the -o and -e options), I believe it defaults to files
>>> named after the job name or job ID, with extensions .o (for output) and
>>> .e (for error). I believe Chris talked about this in his reply. This is
>>> how I discovered the scripts weren't shared -- the .e file indicated the
>>> scripts couldn't be found.
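>>>
>>> For example, a qsub call that puts the output and error files somewhere
>>> predictable (the script and log directory names are just placeholders):
>>>
>>> qsub -o /home/sgeadmin/logs/myjob.out -e /home/sgeadmin/logs/myjob.err myjob.sh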
>>>
>>> Also, as Chris said, you can use qconf to change attributes of the SGE
>>> setup. I have done this before on a StarCluster cluster: log in to the
>>> master node and, as long as you have admin privileges, you can use the
>>> qconf command to change the SGE configuration.
>>>
>>> Good Luck.
>>>
>>> -Jennifer
>>>
>>> Sent from my Verizon Wireless 4G LTE DROID
>>>
>>>
>>> greg <margeemail_at_gmail.com> wrote:
>>>
>>> Thanks Jennifer! Being completely new to StarCluster, how can I
>>> check that my scripts are available to all nodes?
>>>
>>> -Greg
>>>
>>> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>>>>
>>>> Greg -
>>>>
>>>> Maybe check that your software and scripts are available to all nodes. I
>>>> have had StarCluster throw a bunch of Eqw states when I accidentally
>>>> didn't have all the software and script components in a directory that
>>>> was NFS-shared to all the nodes of my cluster, or individually installed
>>>> on every node of the cluster.
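>>>>
>>>> As a rough sketch, one way to check from the master node that a script is
>>>> visible on every node (the node names and script path are placeholders
>>>> for your own):
>>>>
>>>> for n in node001 node002 node003; do
>>>>     ssh $n "ls -l /home/sgeadmin/scripts/myscript.sh"
>>>> done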
>>>>
>>>> And as Chris just stated, use:
>>>> qstat -j <JOBID>               ==> gives complete info on that job
>>>> qstat -j <JOBID> | grep error  ==> looks for errors in the job
>>>>
>>>> When you have the error debugged you can use:
>>>> qmod -cj <JOBID>               ==> clears the error state (e.g. Eqw) so the job can run again
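>>>>
>>>> For example, to list every job currently stuck in Eqw (in plain qstat
>>>> output the state is the fifth column):
>>>>
>>>> qstat -u '*' | awk '$5 == "Eqw" {print $1}'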
>>>>
>>>> Good Luck.
>>>>
>>>> -Jennifer
>>>>
>>>>
>>>>
>>>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>>>
>>>>> 'Eqw' is a combination of multiple state flags: (E)(q)(w). The standard
>>>>> "qw" (queued, waiting) is familiar to everyone; the E indicates
>>>>> something bad at the job level.
>>>>>
>>>>> There are multiple levels of debugging, starting with easy and getting
>>>>> more cumbersome. Almost all require admin or sudo-level access.
>>>>>
>>>>> The first-pass debug method is to run "qstat -j <jobID>" on the job that
>>>>> is in Eqw state; that should provide a bit more information about what
>>>>> went wrong.
>>>>>
>>>>> After that, look at the .e and .o (STDERR/STDOUT) files from the script,
>>>>> if any were created.
>>>>>
>>>>> After that, you can use sudo privileges to go into
>>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file; there
>>>>> are also per-node messages files you can look at as well.
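>>>>>
>>>>> For example (the per-node path depends on your spool setup, and node001
>>>>> is just a placeholder hostname):
>>>>>
>>>>> sudo less $SGE_ROOT/$SGE_CELL/spool/qmaster/messages
>>>>> sudo less $SGE_ROOT/$SGE_CELL/spool/node001/messages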
>>>>>
>>>>> The next level of debugging after that usually involves setting the
>>>>> sge_execd parameter KEEP_ACTIVE=true, which triggers a behavior where
>>>>> SGE stops deleting the temporary files associated with a job's life
>>>>> cycle. Those files live down in the SGE spool at
>>>>> <executionhost>/active.jobs/<jobID>/ -- and they are invaluable in
>>>>> debugging nasty, subtle job failures.
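>>>>>
>>>>> A minimal sketch of turning that on (KEEP_ACTIVE is set through the
>>>>> execd_params entry of the configuration; qconf opens it in an editor):
>>>>>
>>>>> sudo -i
>>>>> qconf -mconf          # edit the global configuration
>>>>> # in the editor, add or extend the line:
>>>>> #   execd_params    KEEP_ACTIVE=true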
>>>>>
>>>>> Eqw should be easy to troubleshoot, though -- it indicates a fatal error
>>>>> right at the beginning of the job dispatch or execution process. No
>>>>> subtle things there.
>>>>>
>>>>>
>>>>> And if your other question was about nodes being allowed to submit jobs
>>>>> -- yes, you have to configure this. It can be done at SGE install time
>>>>> or any time afterwards by running "qconf -as <nodename>" from any
>>>>> account with SGE admin privileges. I have no idea whether StarCluster
>>>>> does this automatically or not, but I'd expect that it probably does.
>>>>> If not, it's an easy fix.
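>>>>>
>>>>> For example, run on the master node ("node001" is just a placeholder
>>>>> node name):
>>>>>
>>>>> qconf -as node001     # allow node001 to act as a submit host
>>>>> qconf -ss             # list the configured submit hosts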
>>>>>
>>>>> -Chris
>>>>>
>>>>>
>>>>> greg wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm afraid I'm still stuck on this, besides my original question,
>>>>>> which I'm still not sure about. Does anyone have any general advice
>>>>>> on debugging an Eqw state? The same software runs fine on our local
>>>>>> cluster.
>>>>>>
>>>>>> thanks again,
>>>>>>
>>>>>> Greg
>>>>>
>>>>> _______________________________________________
>>>>> StarCluster mailing list
>>>>> StarCluster_at_mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>>>
>
Received on Thu Oct 02 2014 - 08:26:51 EDT
This archive was generated by hypermail 2.3.0.
