StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Rayson Ho <no email>
Date: Mon, 5 Dec 2011 13:52:45 -0800 (PST)

Hi Amirhossein,

I was working on a few other things and only just saw your message -- I have to spend less time on mailing-list discussions these days because of the number of things I need to develop and/or fix, and I am also working on a new patch release of OGS/Grid Engine 2011.11. Luckily, I just found the mail that addresses exactly the issue you are encountering:

http://markmail.org/message/zdj5ebfrzhnadglf


For more info, see the "job_load_adjustments" and "load_adjustment_decay_time" parameters in the Grid Engine manpage:


http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
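For reference, this is roughly how those two parameters show up in the scheduler configuration; the sketch below is untested here and the values shown are only the stock defaults, not a recommendation:

    # Dump the current scheduler configuration (read-only):
    qconf -ssconf

    # Open it in $EDITOR to change it:
    qconf -msconf

    # The relevant lines look like:
    #   job_load_adjustments        np_load_avg=0.50
    #   load_adjustment_decay_time  0:7:30
    #
    # job_load_adjustments adds an artificial load to a host as soon as a job
    # is dispatched there, and load_adjustment_decay_time controls how long
    # that artificial load takes to decay, which keeps the scheduler from
    # stacking several large jobs onto the same host in a single scheduling run.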

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/




________________________________
From: Amirhossein Kiani <amirhkiani_at_gmail.com>
To: Rayson Ho <raysonlogin_at_yahoo.com>
Cc: Justin Riley <justin.t.riley_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
Sent: Friday, December 2, 2011 6:36 PM
Subject: Re: [StarCluster] AWS instance runs out of memory and swaps


Dear Rayson,

Did you have a chance to test your solution on this? Basically, all I want is to prevent a job from running on an instance if it does not have the memory required for the job.

I would very much appreciate your help!

Many thanks,
Amir



On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:

>Amir,
>
>
>You can use qhost to list all the nodes and the resources that each node has.
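For example (an untested sketch; the host name below is just a placeholder in StarCluster's default naming, and the output columns vary by Grid Engine version):

    # List every execution host with arch, CPUs, load, total/used memory and swap:
    qhost

    # Show a single resource value, e.g. mem_free, for every host:
    qhost -F mem_free

    # Restrict the listing to one host:
    qhost -h node001 -F mem_free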
>
>
>I have an answer to the memory issue, but I have not had time to properly type up a response and test it.
>
>
>
>Rayson
>
>
>
>
>
>
>
>________________________________
> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
>To: Justin Riley <justin.t.riley_at_gmail.com>
>Cc: Rayson Ho <rayrayson_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>Sent: Monday, November 21, 2011 1:26 PM
>Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>
>Hi Justin,
>
>Many thanks for your reply.
>I don't have any issue with multiple jobs running per node if there is enough memory for them. But since I know the nature of my jobs, I can predict that only one should be running per node.
>How can I see how much memory SGE thinks each node has? Is there a way to list that?
>
>Regards,
>Amir
>
>
>On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>
>> Hi Amir,
>>
>> Sorry to hear you're still having issues. This is really an SGE issue
>> more than anything, but perhaps Rayson can give better insight into
>> what's going on. It seems you're using 23GB nodes and 12GB jobs. Just
>> as a sanity check, does 'qhost' show each node having 23GB? There
>> definitely seems to be a boundary issue here, given that two of your
>> jobs together approach the total memory of the machine (23GB). Is it
>> your goal to have only one job per node?
>>
>> ~Justin
>>
>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>> Dear all,
>>>
>>> I even wrote the queue submission script myself, adding the
>>> mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
>>> are still randomly sent to one node that does not have enough memory
>>> for both, and they start running anyway. I think SGE should check the
>>> instance's memory and not run multiple jobs on a machine when the jobs'
>>> total memory requirement exceeds the memory available on the node (or
>>> maybe there is a bug in the current check).
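A common way to get that behaviour in SGE (a sketch only, not tested here, and not necessarily what the fix linked above does) is to make the memory complex consumable and declare each host's real memory in complex_values, so the scheduler subtracts every running job's request from the host's total; the host name and sizes below are placeholders:

    # 1. Make h_vmem consumable (opens the complex list in $EDITOR).
    #    Change the h_vmem line so the 'consumable' column is YES, e.g.:
    #    h_vmem   h_vmem   MEMORY   <=   YES   YES   1G   0
    qconf -mc

    # 2. Declare how much memory each execution host really has:
    qconf -rattr exechost complex_values h_vmem=23G node001

    # 3. Request memory per job; the scheduler now reserves it per host
    #    and will not dispatch a job that no host can accommodate:
    qsub -l h_vmem=12G yourjob.sh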
>>>
>>> Amir
>>>
>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>
>>>> Hi Justin,
>>>>
>>>> I'm using a third-party tool to submit the jobs, but I am setting the
>>>> hard limit.
>>>> For all my jobs I have something like this for the job description:
>>>>
>>>> [root_at_master test]# qstat -j 1
>>>> ==============================================================
>>>> job_number:                 1
>>>> exec_file:                  job_scripts/1
>>>> submission_time:            Tue Nov  8 17:31:39 2011
>>>> owner:                      root
>>>> uid:                        0
>>>> group:                      root
>>>> gid:                        0
>>>> sge_o_home:                 /root
>>>> sge_o_log_name:             root
>>>> sge_o_path:               
>>>> /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>> sge_o_shell:                /bin/bash
>>>> sge_o_workdir:              /data/test
>>>> sge_o_host:                 master
>>>> account:                    sge
>>>> stderr_path_list:         
>>>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>> *hard resource_list:         h_vmem=12000M*
>>>> mail_list:                  root_at_master
>>>> notify:                     FALSE
>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>> stdout_path_list:         
>>>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>> jobshare:                   0
>>>> hard_queue_list:            all.q
>>>> env_list:                 
>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh
>>>> bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &&
>>>> /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>> script_file:                /bin/sh
>>>> verify_suitable_queues:     2
>>>> scheduling info:            (Collecting of scheduler job information
>>>> is turned off)
>>>>
>>>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>>>> think have about 23GB of memory. The issue I see is that too many of
>>>> the jobs are submitted. I guess I need to set mem_free too? (The
>>>> problem is that the tool I'm using does not seem to have a way to set that...)
>>>>
>>>> Many thanks,
>>>> Amir
>>>>
>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>
>>>>>
>>> Hi Amirhossein,
>>>
>>> Did you specify the memory usage in your job script or on the command
>>> line, and what parameters did you use exactly?
>>>
>>> Doing a quick search, I believe the following will solve the
>>> problem, although I haven't tested it myself:
>>>
>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>
>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
>>> job's memory requirements.
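With the sizes mentioned in this thread (roughly 12GB jobs on 23GB nodes), that would look something like the following (untested):

    # mem_free: only dispatch to a host currently reporting at least 12G free.
    # h_vmem:   hard per-job virtual memory limit of 12G.
    qsub -l mem_free=12G,h_vmem=12G yourjob.sh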
>>>
>>> HTH,
>>>
>>> ~Justin
>>>
>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>> Dear Star Cluster users,
>>>
>>>> I'm using StarCluster to set up an SGE cluster and when I ran my job
>>> list, although I had specified the memory usage for each job, it
>>> submitted too many jobs to my instance, and my instance started
>>> running out of memory and swapping.
>>>> I wonder if anyone knows how I could tell SGE the maximum memory to
>>> consider when submitting jobs to each node, so that it doesn't run
>>> jobs when there is not enough memory available on a node.
>>>
>>>> I'm using the Cluster GPU Quadruple Extra Large instances.
>>>
>>>> Many thanks,
>>>> Amirhossein Kiani
>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>_______________________________________________
>StarCluster mailing list
>StarCluster_at_mit.edu
>http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
>
Received on Mon Dec 05 2011 - 16:52:47 EST
This archive was generated by hypermail 2.3.0.
