StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: Jacob barhak <no email>
Date: Wed, 1 Oct 2014 15:38:23 -0500

Hi Greg,

There is a StarCluster plugin for installing packages with pip. Also, you may wish to check out an Anaconda AMI, which comes with many libraries and conda already installed.
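
For example, the PyPkgInstaller plugin can install pip packages on every
node when the cluster starts. A minimal sketch of the config sections,
assuming the plugin that ships with recent StarCluster releases (package
names are just placeholders):

[plugin pypkginstaller]
setup_class = starcluster.plugins.pypkginstaller.PyPkgInstaller
packages = numpy, scipy

[cluster smallcluster]
plugins = pypkginstaller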

Here is a link to instructions on setting up an Anaconda AMI:
http://continuum.io/blog/starcluster-anaconda

These two paths should give you enough options to continue.

        Jacob

Sent from my iPhone

On Oct 1, 2014, at 12:48 PM, greg <margeemail_at_gmail.com> wrote:

> Hi everyone,
>
> I found the bug. Apparently a Python library I installed is only
> available on the master node. What's a good way to install Python
> libraries so they're available on all nodes? I guess virtualenv, but I'm
> hoping for something simpler :-)
>
> -Greg
>
> On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>> For software and scripts, you can log in to each node and check that
>> the software is installed and that you can view/find the scripts. Also,
>> check that the user you submitted the jobs with (via qsub) has the
>> correct permissions to run the scripts and software and to write output.
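>>
>> A quick sanity check (paths here are hypothetical; run as the user who
>> submitted the jobs):
>>
>> ls -l /home/sgeadmin/scripts/myscript.sh     # readable/executable?
>> ls -ld /home/sgeadmin/output/                # writable by this user?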
>>
>> An easier way is to use qstat -j <JOBID>, where <JOBID> is the job ID
>> of one of the jobs with EQW status. That, and/or the .o/.e files you set
>> when you submitted the qsub job, will point you to the files where the
>> error messages are written. If you didn't set .o and .e files in your
>> qsub call (using the -o and -e options), I believe it defaults to files
>> named after the job ID or job name with extension .o (for output) and
>> .e (for error). I believe Chris talked about this in his reply. This is
>> how I discovered my scripts weren't shared: the .e file indicated the
>> scripts couldn't be found.
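>>
>> For example (file names assume the default <jobname>.o<jobid> /
>> <jobname>.e<jobid> pattern in your home directory; substitute your own
>> job name and ID):
>>
>> qstat -j <JOBID> | grep -i error
>> cat ~/myjob.sh.e<JOBID>
>> cat ~/myjob.sh.o<JOBID>
>>
>> or set the locations explicitly when submitting:
>>
>> qsub -o /path/to/out.log -e /path/to/err.log myjob.sh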
>>
>> Also, as Chris said, you can use qconf to change attributes of the SGE
>> setup. I have done this before on a StarCluster cluster: log in to the
>> master node, and as long as you have admin privileges you can use the
>> qconf command to change attributes of the SGE setup.
>>
>> Good Luck.
>>
>> -Jennifer
>>
>> Sent from my Verizon Wireless 4G LTE DROID
>>
>>
>> greg <margeemail_at_gmail.com> wrote:
>>
>> Thanks Jennifer! Being completely new to StarCluster, how can I check
>> that my scripts are available on all nodes?
>>
>> -Greg
>>
>> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>>> Greg -
>>>
>>> Maybe check that your software and scripts are available to all nodes.
>>> I have had StarCluster throw a bunch of EQW states when I accidentally
>>> didn't have all the software and script components loaded in a
>>> directory that was NFS-shared to all the nodes of my cluster (or
>>> installed individually on every node).
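>>>
>>> One way to verify (cluster and node names here are placeholders; this
>>> assumes the default StarCluster setup where /home is NFS-shared from
>>> the master):
>>>
>>> starcluster sshnode mycluster node001
>>> df -h /home                      # shared filesystem mounted?
>>> ls /home/sgeadmin/scripts/       # scripts visible from this node?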
>>>
>>> And as Chris just stated use:
>>> qstat -j <JOBID> ==> gives complete info on that job
>>> qstat -j <JOBID> | grep error (looks for errors in job)
>>>
>>> Once you have debugged the error you can use:
>>> qmod -cj <JOBID> (clears the error state and re-queues the job, e.g.
>>> one stuck in Eqw)
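>>>
>>> If many jobs are stuck, a small loop can clear them all at once (a
>>> sketch only; the state column position assumes default qstat output):
>>>
>>> for j in $(qstat | awk '$5 == "Eqw" {print $1}'); do qmod -cj $j; done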
>>>
>>> Good Luck.
>>>
>>> -Jennifer
>>>
>>>
>>>
>>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>>
>>>> 'EQW' is a combination of multiple message states: (E)(q)(w). The
>>>> standard "qw" is familiar to everyone; the E indicates something bad
>>>> happened at the job level.
>>>>
>>>> There are multiple levels of debugging, starting easy and getting
>>>> more cumbersome. Almost all require admin or sudo-level access.
>>>>
>>>> The first-pass debug method is to run "qstat -j <jobID>" on the job
>>>> that is in EQW state; that should provide a bit more information about
>>>> what went wrong.
>>>>
>>>> After that, look at the .e and .o (STDERR/STDOUT) files from the
>>>> script, if any were created.
>>>>
>>>> After that, you can use sudo privileges to go into
>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file;
>>>> there are also per-node messages files you can check.
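>>>>
>>>> For example (the exact spool layout varies by install; these paths
>>>> assume classic spooling under $SGE_ROOT/$SGE_CELL):
>>>>
>>>> sudo less $SGE_ROOT/$SGE_CELL/spool/qmaster/messages
>>>> sudo less $SGE_ROOT/$SGE_CELL/spool/<nodename>/messages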
>>>>
>>>> The next level of debugging after that usually involves setting the
>>>> sge_execd parameter KEEP_ACTIVE=true, which tells SGE to stop deleting
>>>> the temporary files associated with a job's life cycle. Those files
>>>> live down in the SGE spool at <executionhost>/active.jobs/<jobID>/ --
>>>> and they are invaluable for debugging nasty, subtle job failures.
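>>>>
>>>> A rough sketch of that workflow ("qconf -mconf" opens the global
>>>> configuration in $EDITOR; KEEP_ACTIVE goes under execd_params):
>>>>
>>>> sudo -i
>>>> qconf -mconf                   # set: execd_params  KEEP_ACTIVE=true
>>>> # re-run the failing job, then inspect (spool path assumed):
>>>> ls $SGE_ROOT/$SGE_CELL/spool/<executionhost>/active.jobs/<jobID>/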
>>>>
>>>> EQW should be easy to troubleshoot, though: it indicates a fatal
>>>> error right at the beginning of the job dispatch or execution process.
>>>> Nothing subtle there.
>>>>
>>>>
>>>> And if your other question was about nodes being allowed to submit
>>>> jobs -- yes, you have to configure this. It can be done at SGE install
>>>> time or any time afterwards by running "qconf -as <nodename>" from any
>>>> account with SGE admin privileges. I have no idea whether StarCluster
>>>> does this automatically, but I'd expect that it probably does. If not,
>>>> it's an easy fix.
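>>>>
>>>> For example, from an admin account on the master (the node name is a
>>>> placeholder):
>>>>
>>>> qconf -as node001     # add node001 as a submit host
>>>> qconf -ss             # list submit hosts to verify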
>>>>
>>>> -Chris
>>>>
>>>>
>>>> greg wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I'm afraid I'm still stuck on this, beyond my original question
>>>>> (which I'm still not sure about). Does anyone have any general advice
>>>>> on debugging an EQW state? The same software runs fine on our local
>>>>> cluster.
>>>>>
>>>>> thanks again,
>>>>>
>>>>> Greg
>>>>
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster_at_mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Wed Oct 01 2014 - 16:38:27 EDT