StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Justin Riley <no email>
Date: Mon, 05 Dec 2011 18:14:52 -0500


Also, just for drill, try running:

$ source /etc/profile

before trying qconf in case your environment was altered at some point
in your SSH session.

~Justin
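
For reference, putting those two steps together might look like this (the /opt/sge6 install path matches the sge_o_path shown later in this thread; the exact arch directory can differ by AMI, so treat the paths as illustrative):

$ source /etc/profile
$ which qconf          # should resolve under /opt/sge6, e.g. /opt/sge6/bin/lx24-amd64/qconf
$ qconf -sconf | head  # quick check that qconf can actually reach the qmaster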

On 12/05/2011 06:13 PM, Justin Riley wrote:
> Amir,
>
> qconf is included in the StarCluster AMIs, so there must be some
> other issue you're facing. Also, I wouldn't recommend installing
> the gridengine packages from Ubuntu, as they're most likely not
> compatible with StarCluster's bundled version in /opt/sge6, as
> you're seeing.
>
> With that said, which AMI are you using, and what does "echo $PATH"
> look like when you log in as root (via sshmaster)?
>
> ~Justin
>
>
> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
>> So I tried this and couldn't run qconf because it was not
>> installed. I then tried installing it using apt-get, specifying
>> "default" for the cell name and "master" for the master name,
>> which is the default for the SGE cluster created by StarCluster.
>
>> However now when I want to use qconf, it says:
>
>> root_at_master:/data/stanford/aligned# qconf -msconf
>> error: commlib error: got select error (Connection refused)
>> unable to send message to qmaster using port 6444 on host "master": got send error
>
>
>> Any idea how I could configure it to work?
>
>
>> Many thanks, Amir
>
>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
>
>>> Hi Amirhossein,
>>>
>>> I was working on a few other things, and I just saw your
>>> message -- I have to spend less time on mailing list
>>> discussions these days due to the number of things that I
>>> needed to develop and/or fix, and I am also working on a new
>>> patch release of OGS/Grid Engine 2011.11. Luckily, I just found
>>> the mail that exactly solves the issue you are encountering:
>>>
>>> http://markmail.org/message/zdj5ebfrzhnadglf
>>>
>>>
>>> For more info, see the "job_load_adjustments" and
>>> "load_adjustment_decay_time" parameters in the Grid Engine
>>> manpage:
>>>
>>>
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
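
For reference, those are scheduler-level settings, so they would be changed on the master with qconf. A minimal sketch, assuming root access on the master; the mem_free adjustment and decay time below are illustrative values, not a tested recommendation:

$ qconf -ssconf | egrep 'job_load_adjustments|load_adjustment_decay_time'
$ qconf -msconf   # opens the scheduler configuration in an editor, e.g. set:
                  #   job_load_adjustments        mem_free=12G
                  #   load_adjustment_decay_time  0:10:00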
>>>
>>> Rayson
>>>
>>> =================================
>>> Grid Engine / Open Grid Scheduler
>>> http://gridscheduler.sourceforge.net/
>>>
>>> Scalable Grid Engine Support Program
>>> http://www.scalablelogic.com/
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
>>> To: Rayson Ho <raysonlogin_at_yahoo.com>
>>> Cc: Justin Riley <justin.t.riley_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>>> Sent: Friday, December 2, 2011 6:36 PM
>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>
>>>
>>> Dear Rayson,
>>>
>>> Did you have a chance to test your solution on this?
>>> Basically, all I want is to prevent a job from running on an
>>> instance if it does not have the memory required for the job.
>>>
>>> I would very much appreciate your help!
>>>
>>> Many thanks, Amir
>>>
>>>
>>>
>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
>>>
>>>> Amir,
>>>>
>>>>
>>>> You can use qhost to list all the nodes and the resources that
>>>> each node has.
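
For example (column names are from a typical qhost listing; -F shows individual resource values per host):

$ qhost                 # MEMTOT / MEMUSE columns give total and used memory on each node
$ qhost -F mem_free     # current value of the mem_free resource on each host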
>>>>
>>>>
>>>> I have an answer to the memory issue, but I have not had
>>>> time to properly type up a response and test it.
>>>>
>>>>
>>>>
>>>> Rayson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
>>>> To: Justin Riley <justin.t.riley_at_gmail.com>
>>>> Cc: Rayson Ho <rayrayson_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>>>> Sent: Monday, November 21, 2011 1:26 PM
>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>>
>>>> Hi Justin,
>>>>
>>>> Many thanks for your reply. I don't have any issue with
>>>> multiple jobs running per node if there is enough memory for
>>>> them. But since I know the nature of my jobs, I can
>>>> predict that only one per node should be running. How can I
>>>> see how much memory SGE thinks each node has? Is there
>>>> a way to list that?
>>>>
>>>> Regards, Amir
>>>>
>>>>
>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>>>>
>>>>> Hi Amir,
>>>>>
>>>>> Sorry to hear you're still having issues. This is really
>>>>> more of an SGE issue than anything, but perhaps Rayson
>>>>> can give better insight into what's going on. It seems
>>>>> you're using 23GB nodes and 12GB jobs. Just for drill, does
>>>>> 'qhost' show each node having 23GB? It definitely seems like
>>>>> there's a boundary issue here, given that two of your jobs
>>>>> together approach the total memory of the machine
>>>>> (23GB). Is it your goal to have only one job per
>>>>> node?
>>>>>
>>>>> ~Justin
>>>>>
>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I even wrote the queue submission script myself, adding
>>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but
>>>>>> sometimes two jobs are randomly sent to one node that
>>>>>> does not have enough memory for both, and they start
>>>>>> running. I think SGE should check the instance
>>>>>> memory and not run multiple jobs on a machine when the
>>>>>> total memory requirement of the jobs is above the
>>>>>> memory available on the node (or maybe there is a bug in
>>>>>> the current check).
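
One common way to get that behavior in SGE (not specific to StarCluster, and untested here) is to make mem_free a consumable complex, so the scheduler subtracts each job's request from the host's capacity instead of relying only on the load value. A rough sketch; the host name and 23G value are illustrative:

$ qconf -mc            # edit the mem_free line so its consumable column is YES, e.g.
                       #   mem_free   mf   MEMORY   <=   YES   YES   0   0
$ qconf -me node001    # per exec host, set complex_values mem_free=23G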
>>>>>>
>>>>>> Amir
>>>>>>
>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>>>>
>>>>>>> Hi Justin,
>>>>>>>
>>>>>>> I'm using a third-party tool to submit the jobs, but I
>>>>>>> am setting the hard limit.
>>>>>>> For all my jobs I have something like this for the job
>>>>>>> description:
>>>>>>>
>>>>>>> [root_at_master test]# qstat -j 1
>>>>>>> ==============================================================
>>>>>>> job_number:                 1
>>>>>>> exec_file:                  job_scripts/1
>>>>>>> submission_time:            Tue Nov 8 17:31:39 2011
>>>>>>> owner:                      root
>>>>>>> uid:                        0
>>>>>>> group:                      root
>>>>>>> gid:                        0
>>>>>>> sge_o_home:                 /root
>>>>>>> sge_o_log_name:             root
>>>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>>>> sge_o_shell:                /bin/bash
>>>>>>> sge_o_workdir:              /data/test
>>>>>>> sge_o_host:                 master
>>>>>>> account:                    sge
>>>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>>>> *hard resource_list:        h_vmem=12000M*
>>>>>>> mail_list:                  root_at_master
>>>>>>> notify:                     FALSE
>>>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>>>> jobshare:                   0
>>>>>>> hard_queue_list:            all.q
>>>>>>> env_list:
>>>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>>>>> script_file:                /bin/sh
>>>>>>> verify_suitable_queues:     2
>>>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>>>>
>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
>>>>>>> instances, which I think have about 23GB of memory. The issue
>>>>>>> that I see is that too many of the jobs are submitted. I guess
>>>>>>> I need to set mem_free too? (The problem is that the tool I'm
>>>>>>> using does not seem to have a way to set that...)
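
If the submitting tool really has no way to pass -l options, SGE also reads default requests from sge_request files (a per-user ~/.sge_request, or the cluster-wide one under $SGE_ROOT/default/common). A sketch with illustrative values, untested here:

$ cat >> ~/.sge_request <<'EOF'
-l mem_free=12G,h_vmem=12G
EOF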
>>>>>>>
>>>>>>> Many thanks, Amir
>>>>>>>
>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>>>>
>>>>>>>>
>>>>>> Hi Amirhossein,
>>>>>>
>>>>>> Did you specify the memory usage in your job script or
>>>>>> at command line and what parameters did you use exactly?
>>>>>>
>>>>>> Doing a quick search, I believe the following will
>>>>>> solve the problem, although I haven't tested it myself:
>>>>>>
>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>>>
>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and
>>>>>> upper bounds for your job's memory requirements.
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> ~Justin
>>>>>>
>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>>>> Dear Star Cluster users,
>>>>>>>
>>>>>>> I'm using Star Cluster to set up an SGE cluster, and when I ran
>>>>>>> my job list, although I had specified the memory usage for each
>>>>>>> job, it submitted too many jobs on my instance and my
>>>>>>> instance started going out of memory and swapping.
>>>>>>> I wonder if anyone knows how I could tell SGE the
>>>>>>> max memory to consider when submitting jobs to each node so that
>>>>>>> it doesn't run the jobs if there is not enough memory
>>>>>>> available on a node.
>>>>>>>
>>>>>>> I'm using the Cluster GPU Quadruple Extra Large
>>>>>>> instances.
>>>>>>>
>>>>>>> Many thanks, Amirhossein Kiani
>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> StarCluster mailing list StarCluster_at_mit.edu
>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>
>>>>
>>>>
>>>> _______________________________________________ StarCluster
>>>> mailing list StarCluster_at_mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>>>
>>>>
>
>
>> _______________________________________________ StarCluster
>> mailing list StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>
> _______________________________________________ StarCluster mailing
> list StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster

Received on Mon Dec 05 2011 - 18:14:56 EST