StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Justin Riley <no email>
Date: Mon, 05 Dec 2011 19:21:36 -0500

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

OK, in this case running 'source /etc/profile' should fix the issue if
it ever happens again - no need to terminate the cluster.
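
A minimal sketch of that check, assuming a POSIX shell and that the SGE
binaries live in /opt/sge6/bin/lx24-amd64 (the directory that appears in the
sge_o_path quoted further down this thread). The helper names path_has_dir
and ensure_sge_on_path are hypothetical and not part of StarCluster:

```shell
# Hypothetical helper: succeed if directory $1 is an entry of the
# colon-separated list $2 (for example "$PATH").
path_has_dir() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: if the SGE bin directory is missing from PATH
# (as can happen after "sudo su" without a login shell), re-read
# /etc/profile and report whether the directory is present afterwards.
ensure_sge_on_path() {
    sge_bin=${1:-/opt/sge6/bin/lx24-amd64}   # assumed install location
    if ! path_has_dir "$sge_bin" "$PATH"; then
        [ -r /etc/profile ] && . /etc/profile
    fi
    path_has_dir "$sge_bin" "$PATH"
}
```

On the master node this could be used as "ensure_sge_on_path && qconf -ssconf"
to confirm that qconf resolves before touching the scheduler configuration.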

In any event, glad you got things working. Would you mind sharing the
exact settings/procedures you used to fix the issue? This should
probably be tunable from StarCluster...

~Justin


On 12/5/11 6:16 PM, Amirhossein Kiani wrote:
> Thanks Justin... I think the issue was that I had run "sudo su" on the
> instance and qconf was not on root's path...
> I tore down my cluster and am creating a new one...
>
> On Dec 5, 2011, at 3:13 PM, Justin Riley wrote:
>
> Amir,
>
> qconf is included in the StarCluster AMIs so there must be some other
> issue you're facing. Also, I wouldn't recommend installing the
> gridengine packages from ubuntu as they're most likely not compatible
> with StarCluster's bundled version in /opt/sge6 as you're seeing.
>
> With that said, which AMI are you using, and what does "echo $PATH" look
> like when you log in as root (via sshmaster)?
>
> ~Justin
>
>
> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
> >>> So I tried this and couldn't run qconf because it was not
> >>> installed. I then tried installing it using apt-get and specified
> >>> default for the cell name and "master" for the master name which
> >>> is the default for the SGE created using StarCluster.
> >>>
> >>> However now when I want to use qconf, it says:
> >>>
> >>> root@master:/data/stanford/aligned# qconf -msconf
> >>> error: commlib error: got select error (Connection refused)
> >>> unable to send message to qmaster using port 6444 on host "master": got send error
> >>>
> >>>
> >>> Any idea how I could configure it to work?
> >>>
> >>>
> >>> Many thanks, Amir
> >>>
> >>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
> >>>
> >>>> Hi Amirhossein,
> >>>>
> >>>> I was working on a few other things, and I just saw your message
> >>>> -- I have to spend less time on mailing list discussions these
> >>>> days due to the number of things that I needed to develop and/or
> >>>> fix, and I am also working on a new patch release of OGS/Grid
> >>>> Engine 2011.11. Luckily, I just found the mail that exactly
> >>>> solves the issue you are encountering:
> >>>>
> >>>> http://markmail.org/message/zdj5ebfrzhnadglf
> >>>>
> >>>>
> >>>> For more info, see the "job_load_adjustments" and
> >>>> "load_adjustment_decay_time" parameters in the Grid Engine
> >>>> manpage:
> >>>>
> >>>>
> >>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Rayson
> >>>>
> >>>> ================================= Grid Engine / Open Grid
> >>>> Scheduler http://gridscheduler.sourceforge.net/
> >>>>
> >>>> Scalable Grid Engine Support Program
> >>>> http://www.scalablelogic.com/
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________ From: Amirhossein Kiani
> >>>> <amirhkiani_at_gmail.com> To: Rayson Ho <raysonlogin_at_yahoo.com> Cc:
> >>>> Justin Riley <justin.t.riley_at_gmail.com>; "starcluster_at_mit.edu"
> >>>> <starcluster_at_mit.edu> Sent: Friday, December 2, 2011 6:36 PM
> >>>> Subject: Re: [StarCluster] AWS instance runs out of memory and
> >>>> swaps
> >>>>
> >>>>
> >>>> Dear Rayson,
> >>>>
> >>>> Did you have a chance to test your solution on this? Basically,
> >>>> all I want is to prevent a job from running on an instance if it
> >>>> does not have the memory required for the job.
> >>>>
> >>>> I would very much appreciate your help!
> >>>>
> >>>> Many thanks, Amir
> >>>>
> >>>>
> >>>>
> >>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
> >>>>
> >>>>> Amir,
> >>>>>
> >>>>>
> >>>>> You can use qhost to list all the nodes and the resources that
> >>>>> each node has.
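
For reference, a hedged sketch of pulling the per-node memory that SGE
reports out of qhost output. It assumes the usual column layout (HOSTNAME
ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS) with two header lines and a
"global" pseudo-host row. The function name sge_node_memory is hypothetical,
and the column order may differ between Grid Engine versions:

```shell
# Hypothetical helper: read qhost-style output on stdin and print
# "<hostname> <MEMTOT>" for every real host.
# Assumes two header lines, then one row per host with MEMTOT in column 5.
sge_node_memory() {
    awk 'NR > 2 && $1 != "global" { print $1, $5 }'
}
```

On the master node this would be run as "qhost | sge_node_memory".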
> >>>>>
> >>>>>
> >>>>> I have an answer to the memory issue, but I have not had time
> >>>>> to properly type up a response and test it.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Rayson
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> ________________________________ From: Amirhossein Kiani
> >>>>> <amirhkiani_at_gmail.com> To: Justin Riley
> >>>>> <justin.t.riley_at_gmail.com> Cc: Rayson Ho
> >>>>> <rayrayson_at_gmail.com>; "starcluster_at_mit.edu"
> >>>>> <starcluster_at_mit.edu> Sent: Monday, November 21, 2011 1:26 PM
> >>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and
> >>>>> swaps
> >>>>>
> >>>>> Hi Justin,
> >>>>>
> >>>>> Many thanks for your reply. I don't have any issue with
> >>>>> multiple jobs running per node if there is enough memory for
> >>>>> them. But since I know about the nature of my jobs, I can
> >>>>> predict that only one per node should be running. How can I
> >>>>> see how much memory SGE thinks each node has? Is there a
> >>>>> way to list that?
> >>>>>
> >>>>> Regards, Amir
> >>>>>
> >>>>>
> >>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
> >>>>>
> >>>>>> Hi Amir,
> >>>>>>
> >>>>>> Sorry to hear you're still having issues. This is really
> >>>>>> more of an SGE issue than anything, but perhaps Rayson
> >>>>>> can give better insight into what's going on. It seems
> >>>>>> you're using 23GB nodes and 12GB jobs. Just for drill, does
> >>>>>> 'qhost' show each node having 23GB? It definitely seems like
> >>>>>> there's a boundary issue here, given that two of your jobs
> >>>>>> together approach the total memory of the machine (23GB).
> >>>>>> Is your goal to have only one job per node?
> >>>>>>
> >>>>>> ~Justin
> >>>>>>
> >>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
> >>>>>>> Dear all,
> >>>>>>>
> >>>>>>> I even wrote the queue submission script myself, adding
> >>>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameter but
> >>>>>>> sometimes two jobs are randomly sent to one node that does
> >>>>>>> not have enough memory for two jobs and they start running.
> >>>>>>> I think the SGE should check on the instance memory and not
> >>>>>>> run multiple jobs on a machine when the memory requirement
> >>>>>>> for the jobs in total is above the memory available in the
> >>>>>>> node (or maybe there is a bug in the current check)
> >>>>>>>
> >>>>>>> Amir
> >>>>>>>
> >>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
> >>>>>>>
> >>>>>>>> Hi Justin,
> >>>>>>>>
> >>>>>>>> I'm using a third-party tool to submit the jobs but I am
> >>>>>>>> setting the hard limit.
> >>>>>>>> For all my jobs I have something like this for the job
> >>>>>>>> description:
> >>>>>>>>
> >>>>>>>> [root@master test]# qstat -j 1
> >>>>>>>> ==============================================================
> >>>>>>>> job_number:                 1
> >>>>>>>> exec_file:                  job_scripts/1
> >>>>>>>> submission_time:            Tue Nov 8 17:31:39 2011
> >>>>>>>> owner:                      root
> >>>>>>>> uid:                        0
> >>>>>>>> group:                      root
> >>>>>>>> gid:                        0
> >>>>>>>> sge_o_home:                 /root
> >>>>>>>> sge_o_log_name:             root
> >>>>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
> >>>>>>>> sge_o_shell:                /bin/bash
> >>>>>>>> sge_o_workdir:              /data/test
> >>>>>>>> sge_o_host:                 master
> >>>>>>>> account:                    sge
> >>>>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
> >>>>>>>> *hard resource_list:        h_vmem=12000M*
> >>>>>>>> mail_list:                  root@master
> >>>>>>>> notify:                     FALSE
> >>>>>>>> job_name:                   SAMPLE.bin_aln-chr1
> >>>>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
> >>>>>>>> jobshare:                   0
> >>>>>>>> hard_queue_list:            all.q
> >>>>>>>> env_list:
> >>>>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
> >>>>>>>> script_file:                /bin/sh
> >>>>>>>> verify_suitable_queues:     2
> >>>>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
> >>>>>>>>
> >>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
> >>>>>>>> instances, which I think have about 23GB of memory. The
> >>>>>>>> issue that I see is that too many of the jobs are
> >>>>>>>> submitted. I guess I need to set mem_free too? (The problem
> >>>>>>>> is that the tool I'm using does not seem to have a way to
> >>>>>>>> set that...)
> >>>>>>>>
> >>>>>>>> Many thanks, Amir
> >>>>>>>>
> >>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>> Hi Amirhossein,
> >>>>>>>
> >>>>>>> Did you specify the memory usage in your job script or at
> >>>>>>> command line and what parameters did you use exactly?
> >>>>>>>
> >>>>>>> Doing a quick search I believe that the following will
> >>>>>>> solve the problem although I haven't tested myself:
> >>>>>>>
> >>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
> >>>>>>>
> >>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds
> >>>>>>> for your job's memory requirements.
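
As an illustration of the command above, here is a small dry-run sketch. The
function name build_qsub_cmd is hypothetical; it only prints the qsub command
line for the given values so it can be inspected before being run on the
master node, and the 12G figures in the usage note are taken from the job
sizes discussed in this thread rather than being recommended settings:

```shell
# Hypothetical helper: print (do not run) a qsub command that requests
# mem_free as the minimum free memory a node should report before the job
# is placed there, and h_vmem as the hard upper limit on the job's
# virtual memory.
build_qsub_cmd() {
    mem_needed=$1   # value for mem_free, e.g. 12G
    mem_max=$2      # value for h_vmem, e.g. 12G
    script=$3       # path to the job script
    printf 'qsub -l mem_free=%s,h_vmem=%s %s\n' "$mem_needed" "$mem_max" "$script"
}
```

For example, "build_qsub_cmd 12G 12G yourjob.sh" prints the command with both
placeholders filled in. As far as I understand, mem_free is a load value that
the scheduler reads when placing a job rather than a reservation, so on its
own it may not stop two jobs from being dispatched to the same node in quick
succession.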
> >>>>>>>
> >>>>>>> HTH,
> >>>>>>>
> >>>>>>> ~Justin
> >>>>>>>
> >>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
> >>>>>>>> Dear Star Cluster users,
> >>>>>>>
> >>>>>>>> I'm using Star Cluster to set up an SGE and when I ran
> >>>>>>>> my job list, although I had specified the memory usage
> >>>>>>>> for each job, it submitted too many jobs on my instance
> >>>>>>>> and my instance started running out of memory and swapping.
> >>>>>>>> I wonder if anyone knows how I could tell the SGE the
> >>>>>>>> max memory to consider when submitting jobs to each node
> >>>>>>>> so that it doesn't run the jobs if there is not enough
> >>>>>>>> memory available on a node.
> >>>>>>>
> >>>>>>>> I'm using the Cluster GPU Quadruple Extra Large
> >>>>>>>> instances.
> >>>>>>>
> >>>>>>>> Many thanks, Amirhossein Kiani
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> StarCluster mailing list StarCluster_at_mit.edu
> >>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> >>>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________ StarCluster
> >>>>> mailing list StarCluster_at_mit.edu
> >>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>> _______________________________________________ StarCluster
> >>> mailing list StarCluster_at_mit.edu
> >>> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7dYA8ACgkQ4llAkMfDcrl3/gCfWl/niHCWOAmdAe9kRF5I6r//
bTQAnjM5LpXxNLrPX7Pr+lXlxkJTkBJN
=9p5j
-----END PGP SIGNATURE-----
Received on Mon Dec 05 2011 - 19:21:38 EST
This archive was generated by hypermail 2.3.0.
