StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Rayson Ho <no email>
Date: Mon, 21 Nov 2011 10:29:55 -0800 (PST)

Amir,

You can use qhost to list all the nodes and the resources that each node has.

I have an answer to the memory issue, but I have not had time to properly type up a response and test it.

Rayson

________________________________
From: Amirhossein Kiani <amirhkiani_at_gmail.com>
To: Justin Riley <justin.t.riley_at_gmail.com>
Cc: Rayson Ho <rayrayson@gmail.com>; "starcluster@mit.edu" <starcluster@mit.edu>
Sent: Monday, November 21, 2011 1:26 PM
Subject: Re: [StarCluster] AWS instance runs out of memory and swaps

Hi Justin,

Many thanks for your reply. I don't have any issue with multiple jobs running per node if there is enough memory for them. But since I know the nature of my jobs, I can predict that only one per node should be running.

How can I see how much memory SGE thinks each node has? Is there a way to list that?

Regards,
Amir

On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:

> Hi Amir,
>
> Sorry to hear you're still having issues. This is really more of an SGE
> issue than anything, but perhaps Rayson can give better insight into
> what's going on. It seems you're using 23GB nodes and 12GB jobs. Just
> for drill, does 'qhost' show each node having 23GB? It definitely seems
> like there's a boundary issue here, given that two of your jobs together
> approach the total memory of the machine (23GB). Is it your goal to have
> only one job per node?
>
> ~Justin
>
> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>> Dear all,
>>
>> I even wrote the queue submission script myself, adding the
>> mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
>> are randomly sent to one node that does not have enough memory for both,
>> and they start running. I think SGE should check the instance memory and
>> not run multiple jobs on a machine when the jobs' total memory
>> requirement exceeds the memory available on the node (or maybe there is
>> a bug in the current check).
>>
>> Amir
>>
>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>
>>> Hi Justin,
>>>
>>> I'm using a third-party tool to submit the jobs, but I am setting the
>>> hard limit. For all my jobs I have something like this for the job
>>> description:
>>>
>>> [root@master test]# qstat -j 1
>>> ==============================================================
>>> job_number:                 1
>>> exec_file:                  job_scripts/1
>>> submission_time:            Tue Nov  8 17:31:39 2011
>>> owner:                      root
>>> uid:                        0
>>> group:                      root
>>> gid:                        0
>>> sge_o_home:                 /root
>>> sge_o_log_name:             root
>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>> sge_o_shell:                /bin/bash
>>> sge_o_workdir:              /data/test
>>> sge_o_host:                 master
>>> account:                    sge
>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>> *hard resource_list:        h_vmem=12000M*
>>> mail_list:                  root@master
>>> notify:                     FALSE
>>> job_name:                   SAMPLE.bin_aln-chr1
>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>> jobshare:                   0
>>> hard_queue_list:            all.q
>>> env_list:
>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>> script_file:                /bin/sh
>>> verify_suitable_queues:     2
>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>
>>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>>> think have about 23GB of memory. The issue I see is that too many of
>>> the jobs are submitted. I guess I need to set mem_free too? (The
>>> problem is that the tool I'm using does not seem to have a way to set
>>> that...)
>>>
>>> Many thanks,
>>> Amir
>>>
>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>
>>>> Hi Amirhossein,
>>>>
>>>> Did you specify the memory usage in your job script or on the command
>>>> line, and what parameters did you use exactly?
>>>>
>>>> Doing a quick search, I believe that the following will solve the
>>>> problem, although I haven't tested it myself:
>>>>
>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>
>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
>>>> job's memory requirements.
>>>>
>>>> HTH,
>>>>
>>>> ~Justin
>>>>
>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>> Dear Star Cluster users,
>>>>>
>>>>> I'm using Star Cluster to set up an SGE cluster, and when I ran my
>>>>> job list, although I had specified the memory usage for each job, it
>>>>> submitted too many jobs to my instance and my instance started
>>>>> running out of memory and swapping.
>>>>>
>>>>> I wonder if anyone knows how I could tell SGE the maximum memory to
>>>>> consider when submitting jobs to each node, so that it doesn't run
>>>>> jobs if there is not enough memory available on a node.
>>>>>
>>>>> I'm using the Cluster GPU Quadruple Extra Large instances.
>>>>>
>>>>> Many thanks,
>>>>> Amirhossein Kiani
>
> _______________________________________________
> StarCluster mailing list
> StarCluster@mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
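
The qhost command that Rayson and Justin point to answers Amir's question directly: it prints every execution host together with the memory SGE believes it has (MEMTOT) and the memory currently in use (MEMUSE), and -F narrows the output to a named resource. A minimal sketch, run from the master node (the host name node001 is only a placeholder):

$ qhost                    # NCPU, MEMTOT and MEMUSE columns for every node
$ qhost -F mem_free        # current mem_free value reported by each node
$ qhost -h node001         # restrict the listing to a single host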
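
As for why two 12GB jobs still land on the same 23GB node: by default mem_free is only a load value, so the scheduler compares a job's request against the memory that happens to be free at dispatch time and does not subtract what already-dispatched jobs have requested. A common fix, sketched here untested and not necessarily the answer Rayson was preparing, is to make mem_free a consumable and give each host a fixed capacity (the 22G figure and host name are illustrative):

$ qconf -mc
    # in the editor, change the consumable column of the mem_free line
    # from NO to YES, i.e.:  mem_free  mf  MEMORY  <=  YES  YES  0  0
$ qconf -me node001
    # add the line:  complex_values  mem_free=22G
    # and repeat for each execution host

With that in place the scheduler debits 12G of a node's mem_free for every running job that requested it, so a second 12GB job cannot be dispatched to a node with only ~10GB left.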
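
Once mem_free is consumable, Justin's submission line both schedules against and reserves the memory, while h_vmem remains the hard per-job limit that kills a job exceeding it; the 12G value simply matches the jobs in this thread:

$ qsub -l mem_free=12G,h_vmem=12G yourjob.sh

If the real goal is strictly one job per node regardless of size, another route is SGE's host-exclusive scheduling (added in 6.2u3): define a boolean complex, conventionally named exclusive, with relop EXCL and consumable YES, add exclusive=true to each host's complex_values, and submit with qsub -l exclusive=true. Whether the SGE shipped with the StarCluster AMI in question supports this would need checking.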
Received on Mon Nov 21 2011 - 13:29:57 EST
