StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Justin Riley <no email>
Date: Mon, 21 Nov 2011 11:18:35 -0500

Hi Amir,

Sorry to hear you're still having issues. This is really more of an SGE
issue than anything, but perhaps Rayson can give better insight into
what's going on. It seems you're using 23GB nodes and 12GB jobs. Just to
check: does 'qhost' show each node as having 23GB? There definitely
seems to be a boundary issue here, given that two of your jobs together
approach the total memory of the machine (23GB). Is your goal to have
only one job per node?
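If so, the usual fix is to make mem_free a "consumable" complex so the
scheduler actually subtracts each job's request from a host's available
memory. A sketch of that setup (untested on StarCluster; host name and
23G value are just placeholders for your setup):

```shell
# 1. Mark mem_free as consumable (in the qconf editor, set the
#    "consumable" column for mem_free to YES):
qconf -mc
#    name      shortcut  type    relop  requestable  consumable  default
#    mem_free  mf        MEMORY  <=     YES          YES         0

# 2. Tell SGE how much memory each exec host has to hand out
#    (repeat per node; "node001" is a placeholder):
qconf -rattr exechost complex_values mem_free=23G node001

# 3. Submit with an explicit request; two 12G jobs then cannot be
#    scheduled onto the same 23G node:
qsub -l mem_free=12G,h_vmem=12G yourjob.sh
```

Without the consumable flag, mem_free is only compared against the
host's instantaneous free memory at schedule time, which is why two
jobs can still land on one node before either allocates anything.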

~Justin

On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
> Dear all,
>
> I even wrote the queue submission script myself, adding
> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two
> jobs are randomly sent to a node that does not have enough memory for
> both, and they start running anyway. I think SGE should check the
> instance memory and not run multiple jobs on a machine when the jobs'
> total memory requirement exceeds the memory available on the node (or
> maybe there is a bug in the current check).
>
> Amir
>
> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>
>> Hi Justin,
>>
>> I'm using a third-party tool to submit the jobs but I am setting the
>> hard limit.
>> For all my jobs I have something like this for the job description:
>>
>> [root_at_master test]# qstat -j 1
>> ==============================================================
>> job_number: 1
>> exec_file: job_scripts/1
>> submission_time: Tue Nov 8 17:31:39 2011
>> owner: root
>> uid: 0
>> group: root
>> gid: 0
>> sge_o_home: /root
>> sge_o_log_name: root
>> sge_o_path:
>> /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>> sge_o_shell: /bin/bash
>> sge_o_workdir: /data/test
>> sge_o_host: master
>> account: sge
>> stderr_path_list:
>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>> hard resource_list: h_vmem=12000M
>> mail_list: root_at_master
>> notify: FALSE
>> job_name: SAMPLE.bin_aln-chr1
>> stdout_path_list:
>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>> jobshare: 0
>> hard_queue_list: all.q
>> env_list:
>> job_args: -c,/home/apps/hugeseq/bin/hugeseq_mod.sh
>> bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &&
>> /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>> script_file: /bin/sh
>> verify_suitable_queues: 2
>> scheduling info: (Collecting of scheduler job information
>> is turned off)
>>
>> And I'm using the Cluster GPU Quadruple Extra Large instances, which
>> I think have about 23GB of memory. The issue I see is that too many
>> jobs are submitted. I guess I need to set mem_free too? (The problem
>> is the tool I'm using does not seem to have a way to set that...)
>>
>> Many thanks,
>> Amir
>>
>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>
>>>
> Hi Amirhossein,
>
> Did you specify the memory usage in your job script or at command
> line and what parameters did you use exactly?
>
> Doing a quick search, I believe that the following will solve the
> problem, although I haven't tested it myself:
>
> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>
> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
> job's memory requirements.
>
> HTH,
>
> ~Justin
>
> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>> Dear Star Cluster users,
>
>> I'm using StarCluster to set up an SGE cluster, and when I ran my job
>> list, although I had specified the memory usage for each job, it
>> submitted too many jobs to my instance, and my instance started
>> running out of memory and swapping.
>> I wonder if anyone knows how I could tell SGE the maximum memory to
>> consider when scheduling jobs on each node, so that it doesn't run
>> jobs when there is not enough memory available on the node.
>
>> I'm using the Cluster GPU Quadruple Extra Large instances.
>
>> Many thanks,
>> Amirhossein Kiani
>
>>>
>>
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Mon Nov 21 2011 - 11:18:40 EST
This archive was generated by hypermail 2.3.0.
