StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Amirhossein Kiani <no email>
Date: Mon, 5 Dec 2011 16:07:10 -0800

This finally solved my problem. Thank you Rayson and Justin!

Amir

On Dec 5, 2011, at 3:14 PM, Justin Riley wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Also, just for drill, try running:
>
> $ source /etc/profile
>
> before trying qconf in case your environment was altered at some point
> in your SSH session.
>
> ~Justin
>
> On 12/05/2011 06:13 PM, Justin Riley wrote:
>> Amir,
>>
>> qconf is included in the StarCluster AMIs so there must be some
>> other issue you're facing. Also, I wouldn't recommend installing
>> the gridengine packages from Ubuntu, as they're most likely not
>> compatible with StarCluster's bundled version in /opt/sge6, as
>> you're seeing.
>>
>> With that said which AMI are you using and what does "echo $PATH"
>> look like when you login as root (via sshmaster)?
>>
>> ~Justin
>>
>>
>> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
>>> So I tried this and couldn't run qconf because it was not
>>> installed. I then tried installing it using apt-get and
>>> specified "default" for the cell name and "master" for the
>>> master name, which are the defaults for the SGE cluster
>>> created by StarCluster.
>>
>>> However now when I want to use qconf, it says:
>>
>>> root_at_master:/data/stanford/aligned# qconf -msconf
>>> error: commlib error: got select error (Connection refused)
>>> unable to send message to qmaster using port 6444 on host "master": got send error
>>
>>
>>> Any idea how I could configure it to work?
>>
>>
>>> Many thanks, Amir
>>
>>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
>>
>>>> Hi Amirhossein,
>>>>
>>>> I was working on a few other things, and I just saw your
>>>> message -- I have to spend less time on mailing list
>>>> discussions these days due to the number of things that I
>>>> needed to develop and/or fix, and I am also working on a new
>>>> patch release of OGS/Grid Engine 2011.11. Luckily, I just found
>>>> the mail that exactly solves the issue you are encountering:
>>>>
>>>> http://markmail.org/message/zdj5ebfrzhnadglf
>>>>
>>>>
>>>> For more info, see the "job_load_adjustments" and
>>>> "load_adjustment_decay_time" parameters in the Grid Engine
>>>> manpage:
>>>>
>>>>
>>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
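[Editor's note: a minimal sketch of how those two scheduler parameters could be inspected and changed non-interactively with qconf. This assumes a live SGE/OGS qmaster and qconf on PATH; the replacement values shown are illustrative defaults-off settings, not necessarily the exact fix from the linked mail.]

```shell
# Show the current scheduler configuration, including the
# job_load_adjustments and load_adjustment_decay_time lines:
qconf -ssconf

# Disable the artificial load boost that newly started jobs add to a
# host (stock defaults are np_load_avg=0.50 decaying over 0:7:30).
# qconf -Msconf loads a modified scheduler configuration from a file:
TMP=$(mktemp)
qconf -ssconf \
  | sed -e 's/^job_load_adjustments.*/job_load_adjustments              NONE/' \
        -e 's/^load_adjustment_decay_time.*/load_adjustment_decay_time        0:0:00/' \
  > "$TMP"
qconf -Msconf "$TMP"
rm -f "$TMP"
```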
>>>>
>>>>
>>>>
>>>>
>>>> Rayson
>>>>
>>>> =================================
>>>> Grid Engine / Open Grid Scheduler
>>>> http://gridscheduler.sourceforge.net/
>>>>
>>>> Scalable Grid Engine Support Program
>>>> http://www.scalablelogic.com/
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
>>>> To: Rayson Ho <raysonlogin_at_yahoo.com>
>>>> Cc: Justin Riley <justin.t.riley_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>>>> Sent: Friday, December 2, 2011 6:36 PM
>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>>
>>>>
>>>> Dear Rayson,
>>>>
>>>> Did you have a chance to test your solution on this?
>>>> Basically, all I want is to prevent a job from running on an
>>>> instance if it does not have the memory required for the job.
>>>>
>>>> I would very much appreciate your help!
>>>>
>>>> Many thanks, Amir
>>>>
>>>>
>>>>
>>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
>>>>
>>>>> Amir,
>>>>>
>>>>>
>>>>> You can use qhost to list all the nodes and the resources
>>>>> that each node has.
>>>>>
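[Editor's note: a hedged sketch of the invocations that show per-node memory, assuming a running StarCluster SGE cluster; `mem_free` is a stock SGE resource name.]

```shell
qhost              # one row per exec host: arch, CPUs, load, memtot, memuse, swap
qhost -F mem_free  # current value of the mem_free resource on each host
qstat -F mem_free  # the same resource reported per queue instance
```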
>>>>>
>>>>> I have an answer to the memory issue, but I have not had
>>>>> time to properly type up a response and test it.
>>>>>
>>>>>
>>>>>
>>>>> Rayson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
>>>>> To: Justin Riley <justin.t.riley_at_gmail.com>
>>>>> Cc: Rayson Ho <rayrayson_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>>>>> Sent: Monday, November 21, 2011 1:26 PM
>>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>>>
>>>>> Hi Justin,
>>>>>
>>>>> Many thanks for your reply. I don't have any issue with
>>>>> multiple jobs running per node if there is enough memory for
>>>>> them. But since I know the nature of my jobs, I can
>>>>> predict that only one per node should be running. How can I
>>>>> see how much memory SGE thinks each node has? Is there
>>>>> a way to list that?
>>>>>
>>>>> Regards, Amir
>>>>>
>>>>>
>>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>>>>>
>>>>>> Hi Amir,
>>>>>>
>>>>>> Sorry to hear you're still having issues. This is really
>>>>>> more of an SGE issue than anything else, but perhaps Rayson
>>>>>> can give better insight into what's going on. It seems
>>>>>> you're using 23GB nodes and 12GB jobs. Just for drill, does
>>>>>> 'qhost' show each node having 23GB? It definitely seems like
>>>>>> there's a boundary issue here, given that two of your jobs
>>>>>> together approach the total memory of the machine
>>>>>> (23GB). Is your goal to have only one job per node?
>>>>>>
>>>>>> ~Justin
>>>>>>
>>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I even wrote the queue submission script myself, adding
>>>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameter but
>>>>>>> sometimes two jobs are randomly sent to one node that
>>>>>>> does not have enough memory for two jobs and they start
>>>>>>> running. I think SGE should check the instance
>>>>>>> memory and not run multiple jobs on a machine when the
>>>>>>> total memory requirement of the jobs is above the
>>>>>>> memory available on the node (or maybe there is a bug in
>>>>>>> the current check).
>>>>>>>
>>>>>>> Amir
>>>>>>>
>>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>>>>>
>>>>>>>> Hi Justin,
>>>>>>>>
>>>>>>>> I'm using a third-party tool to submit the jobs but I
>>>>>>>> am setting the hard limit.
>>>>>>>> For all my jobs I have something like this for the job
>>>>>>>> description:
>>>>>>>>
>>>>>>>> [root_at_master test]# qstat -j 1
>>>>>>>> ==============================================================
>>>>>>>> job_number:                 1
>>>>>>>> exec_file:                  job_scripts/1
>>>>>>>> submission_time:            Tue Nov 8 17:31:39 2011
>>>>>>>> owner:                      root
>>>>>>>> uid:                        0
>>>>>>>> group:                      root
>>>>>>>> gid:                        0
>>>>>>>> sge_o_home:                 /root
>>>>>>>> sge_o_log_name:             root
>>>>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>>>>> sge_o_shell:                /bin/bash
>>>>>>>> sge_o_workdir:              /data/test
>>>>>>>> sge_o_host:                 master
>>>>>>>> account:                    sge
>>>>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>>>>> *hard resource_list:        h_vmem=12000M*
>>>>>>>> mail_list:                  root_at_master
>>>>>>>> notify:                     FALSE
>>>>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>>>>> jobshare:                   0
>>>>>>>> hard_queue_list:            all.q
>>>>>>>> env_list:
>>>>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>>>>>> script_file:                /bin/sh
>>>>>>>> verify_suitable_queues:     2
>>>>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>>>>>
>>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
>>>>>>>> instances, which I think have about 23GB of memory. The
>>>>>>>> issue that I see is that too many of the jobs are
>>>>>>>> submitted. I guess I need to set mem_free too? (The
>>>>>>>> problem is the tool I'm using does not seem to have a way
>>>>>>>> to set that...)
>>>>>>>>
>>>>>>>> Many thanks, Amir
>>>>>>>>
>>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>> Hi Amirhossein,
>>>>>>>
>>>>>>> Did you specify the memory usage in your job script or
>>>>>>> at command line and what parameters did you use exactly?
>>>>>>>
>>>>>>> Doing a quick search I believe that the following will
>>>>>>> solve the problem although I haven't tested myself:
>>>>>>>
>>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>>>>
>>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and
>>>> upper bounds for your
>>>>>>> job's memory requirements.
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> ~Justin
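[Editor's note: a sketch of making that reservation actually stick, under the assumption of a stock SGE setup (untested here). By default mem_free is only a reported load value, so SGE does not subtract each running job's request from it; marking it consumable changes that. The 12G figures are examples matching the jobs in this thread.]

```shell
# 1. Edit the complex list and set the CONSUMABLE column of mem_free
#    to YES. This opens the list in $EDITOR:
#      qconf -mc
#    changing the mem_free line to something like:
#      mem_free   mf   MEMORY   <=   YES   NO   0   0

# 2. Then submit with both a scheduling reservation (mem_free) and a
#    hard kill limit (h_vmem):
qsub -l mem_free=12G,h_vmem=12G yourjob.sh
```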
>>>>>>>
>>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>>>>> Dear Star Cluster users,
>>>>>>>
>>>>>>>> I'm using Star Cluster to set up an SGE cluster, and when
>>>>>>>> I ran my job list, although I had specified the memory
>>>>>>>> usage for each job, it submitted too many jobs on my
>>>>>>>> instance and my instance started running out of memory
>>>>>>>> and swapping.
>>>>>>>>
>>>>>>>> I wonder if anyone knows how I could tell SGE the max
>>>>>>>> memory to consider when submitting jobs to each node so
>>>>>>>> that it doesn't run the jobs if there is not enough
>>>>>>>> memory available on a node.
>>>>>>>
>>>>>>>> I'm using the Cluster GPU Quadruple Extra Large
>>>>>>>> instances.
>>>>>>>
>>>>>>>> Many thanks, Amirhossein Kiani
>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> StarCluster mailing list
>>>>>>> StarCluster_at_mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.17 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk7dUGwACgkQ4llAkMfDcrnxJgCggl/bdu/yC5LurXC5ybgHCYTa
> d4IAnAmMgw/Qo2f7p/o5aoZyHOLwNFrh
> =2OxU
> -----END PGP SIGNATURE-----
Received on Mon Dec 05 2011 - 19:07:10 EST
This archive was generated by hypermail 2.3.0.
