Re: AWS instance runs out of memory and swaps
Hi Amir,
Sorry to hear you're still having issues. This is really more of an SGE
issue than anything else, but perhaps Rayson can give better insight into
what's going on. It seems you're using 23GB nodes and 12GB jobs. Just to
check: does 'qhost' show each node as having 23GB? It definitely looks
like a boundary issue here, given that two of your jobs together approach
the total memory of the machine (23GB). Is your goal to have only one job
per node?
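
One more thing that might help (I haven't tested this on a
StarCluster-built grid myself, so treat it as a sketch): out of the box,
SGE doesn't actually reserve memory per job, so requesting
mem_free/h_vmem alone won't stop two 12GB jobs from landing on the same
23GB node. Making h_vmem a consumable complex and giving each exec host a
capacity should let the scheduler enforce it, roughly:

    # mark h_vmem as consumable: set its 'consumable' column to YES and
    # give it a small default (e.g. 1G) in the complex configuration
    $ qconf -mc

    # advertise each node's capacity; 'node001' is just a placeholder
    # for each of your exec hosts, and 22G leaves headroom for the OS
    $ qconf -rattr exechost complex_values h_vmem=22G node001

    # confirm what each host is now advertising
    $ qhost -F h_vmem

With that in place, a job submitted with '-l h_vmem=12G' should only be
scheduled on a node with enough unreserved capacity. And if the
third-party tool can't pass -l options at all, default requests can go in
$SGE_ROOT/$SGE_CELL/common/sge_request (or a per-user ~/.sge_request),
e.g. a line containing '-l h_vmem=12G'.
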
~Justin
On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
> Dear all,
>
> I even wrote the queue submission script myself, adding the
> mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
> are still sent to the same node even though it does not have enough
> memory for both, and they start running anyway. I think SGE should check
> the instance's memory and not run multiple jobs on a machine when their
> total memory requirement exceeds the memory available on that node (or
> maybe there is a bug in the current check).
>
> Amir
>
> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>
>> Hi Justin,
>>
>> I'm using a third-party tool to submit the jobs but I am setting the
>> hard limit.
>> For all my jobs I have something like this for the job description:
>>
>> [root@master test]# qstat -j 1
>> ==============================================================
>> job_number: 1
>> exec_file: job_scripts/1
>> submission_time: Tue Nov 8 17:31:39 2011
>> owner: root
>> uid: 0
>> group: root
>> gid: 0
>> sge_o_home: /root
>> sge_o_log_name: root
>> sge_o_path:
>> /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>> sge_o_shell: /bin/bash
>> sge_o_workdir: /data/test
>> sge_o_host: master
>> account: sge
>> stderr_path_list:
>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>> *hard resource_list: h_vmem=12000M*
>> mail_list: root@master
>> notify: FALSE
>> job_name: SAMPLE.bin_aln-chr1
>> stdout_path_list:
>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>> jobshare: 0
>> hard_queue_list: all.q
>> env_list:
>> job_args: -c,/home/apps/hugeseq/bin/hugeseq_mod.sh
>> bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &&
>> /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>> script_file: /bin/sh
>> verify_suitable_queues: 2
>> scheduling info: (Collecting of scheduler job information
>> is turned off)
>>
>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>> think have about 23GB of memory each. The issue I see is that too many
>> jobs are submitted. I guess I need to set mem_free too? (The problem is
>> that the tool I'm using does not seem to have a way to set that...)
>>
>> Many thanks,
>> Amir
>>
>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>
>>>
> Hi Amirhossein,
>
> Did you specify the memory usage in your job script or on the command
> line, and what parameters did you use exactly?
>
> Doing a quick search, I believe the following will solve the problem,
> although I haven't tested it myself:
>
> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>
> Here, MEM_NEEDED is the free memory a node must have for the job to be
> scheduled on it, and MEM_MAX is the hard limit on the job's virtual
> memory use.
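>
> For example (the numbers here are purely illustrative, assuming a ~23GB
> node and a job that needs around 12GB):
>
> $ qsub -l mem_free=12G,h_vmem=12G yourjob.sh
>
> Note that mem_free on its own only reflects the free memory a host
> reports at scheduling time; it isn't reserved per job unless the complex
> is configured as consumable.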
>
> HTH,
>
> ~Justin
>
> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>> Dear Star Cluster users,
>
>> I'm using StarCluster to set up an SGE cluster, and when I ran my job
>> list, although I had specified the memory usage for each job, it
>> submitted too many jobs to my instance and the instance started running
>> out of memory and swapping.
>> I wonder if anyone knows how I could tell SGE the maximum memory to
>> consider when submitting jobs to each node, so that it doesn't run jobs
>> on a node that does not have enough memory available.
>
>> I'm using the Cluster GPU Quadruple Extra Large instances.
>
>> Many thanks,
>> Amirhossein Kiani
>
>>>
>>
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster@mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Mon Nov 21 2011 - 11:18:40 EST