StarCluster - Mailing List Archive

Re: AWS instance runs out of memory and swaps

From: Rayson Ho <no email>
Date: Thu, 5 Jan 2012 12:05:15 -0800 (PST)

Hi Justin,

If you are not referring to the simple issue of qconf not being in the PATH, then the real issue of "AWS instances running out of memory & swap" has to do with Grid Engine's internal accounting, and how it debits the expected resource usage of each job on the nodes. I will write some docs on this part of Grid Engine, so that users of Open Grid Scheduler/Grid Engine and StarCluster will benefit when they need to handle this kind of job scheduling.
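
The accounting mentioned above is tuned through the scheduler configuration. A minimal sketch of where to look, assuming a standard OGS/SGE install with qconf on the PATH (the values shown are only the stock defaults, not recommendations):

    # Print the current scheduler configuration, including the
    # load-adjustment parameters that drive this accounting.
    $ qconf -ssconf

    # Edit the scheduler configuration; the relevant entries look like:
    #   job_load_adjustments        np_load_avg=0.50
    #   load_adjustment_decay_time  0:7:30
    $ qconf -msconf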

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

________________________________
From: Justin Riley <jtriley_at_MIT.EDU>
To: Amirhossein Kiani <amirhkiani_at_gmail.com>
Cc: Rayson Ho <raysonlogin_at_yahoo.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
Sent: Monday, December 5, 2011 7:21 PM
Subject: Re: [StarCluster] AWS instance runs out of memory and swaps

In any event, glad you got things working. Would you mind sharing the exact settings/procedures you used to fix the issue? This should probably be tunable from StarCluster...

~Justin


On 12/5/11 6:16 PM, Amirhossein Kiani wrote:
> Thanks Justin... I think the issue was that I had "sudo su"'ed on the instance and qconf was not on root's PATH...
> I tore down my cluster and am creating a new one...
>
> On Dec 5, 2011, at 3:13 PM, Justin Riley wrote:
>
> Amir,
>
> qconf is included in the StarCluster AMIs so there must be some other
> issue you're facing. Also, I wouldn't recommend installing the
> gridengine packages from Ubuntu as they're most likely not compatible
> with StarCluster's bundled version in /opt/sge6, as you're seeing.
>
> With that said, which AMI are you using, and what does "echo $PATH"
> look like when you log in as root (via sshmaster)?
>
> ~Justin
>
>
> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
> >>> So I tried this and couldn't run qconf because it was not
> >>> installed. I then tried installing it using apt-get, and specified
> >>> "default" for the cell name and "master" for the master name, which
> >>> is the default for the SGE created using StarCluster.
> >>>
> >>> However, now when I want to use qconf, it says:
> >>>
> >>> root_at_master:/data/stanford/aligned# qconf -msconf
> >>> error: commlib error: got select error (Connection refused)
> >>> unable to send message to qmaster using port 6444 on host
> >>> "master": got send error
> >>>
> >>>
> >>> Any idea how I could configure it to work?
> >>>
> >>>
> >>> Many thanks, Amir
> >>>
> >>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
> >>>
> >>>> Hi Amirhossein,
> >>>>
> >>>> I was working on a few other things, and I just saw your message
> >>>> -- I have to spend less time on mailing list discussions these
> >>>> days due to the number of things that I needed to develop and/or
> >>>> fix, and I am also working on a new patch release of OGS/Grid
> >>>> Engine 2011.11. Luckily, I just found the mail that exactly
> >>>> solves the issue you are encountering:
> >>>>
> >>>> http://markmail.org/message/zdj5ebfrzhnadglf
> >>>>
> >>>>
> >>>> For more info, see the "job_load_adjustments" and
> >>>> "load_adjustment_decay_time" parameters in the Grid Engine
> >>>> manpage:
> >>>>
> >>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
> >>>>
> >>>> Rayson
> >>>>
> >>>> =================================
> >>>> Grid Engine / Open Grid Scheduler
> >>>> http://gridscheduler.sourceforge.net/
> >>>>
> >>>> Scalable Grid Engine Support Program
> >>>> http://www.scalablelogic.com/
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
> >>>> To: Rayson Ho <raysonlogin_at_yahoo.com>
> >>>> Cc: Justin Riley <justin.t.riley_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> >>>> Sent: Friday, December 2, 2011 6:36 PM
> >>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
> >>>>
> >>>>
> >>>> Dear Rayson,
> >>>>
> >>>> Did you have a chance to test your solution on this? Basically,
> >>>> all I want is to prevent a job from running on an instance if it
> >>>> does not have the memory required for the job.
> >>>>
> >>>> I would very much appreciate your help!
> >>>>
> >>>> Many thanks, Amir
> >>>>
> >>>>
> >>>>
> >>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
> >>>>
> >>>> Amir,
> >>>>>
> >>>>>
> >>>>> You can use qhost to list all the nodes and the resources that
> >>>>> each node has.
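
A minimal sketch of that listing, run from the master node (qhost is assumed to be on the PATH; the exact columns and values depend on the cluster):

    # One line per execution host: arch, CPUs, load, total/used memory, swap.
    $ qhost

    # Show a specific resource value per host, e.g. how much memory SGE
    # currently considers free on each node.
    $ qhost -F mem_free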
> >>>>>
> >>>>>
> >>>>> I have an answer to the memory issue, but I have not had time
> >>>>> to properly type up a response and test it.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Rayson
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
> >>>>> To: Justin Riley <justin.t.riley_at_gmail.com>
> >>>>> Cc: Rayson Ho <rayrayson_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> >>>>> Sent: Monday, November 21, 2011 1:26 PM
> >>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
> >>>>>
> >>>>> Hi Justin,
> >>>>>
> >>>>> Many thanks for your reply. I don't have any issue with multiple
> >>>>> jobs running per node if there is enough memory for them. But
> >>>>> since I know about the nature of my jobs, I can predict that only
> >>>>> one per node should be running. How can I see how much memory SGE
> >>>>> thinks each node has? Is there a way to list that?
> >>>>>
> >>>>> Regards, Amir
> >>>>>
> >>>>>
> >>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
> >>>>>
> >>>>>> Hi Amir,
> >>>>>>
> >>>>>> Sorry to hear you're still having issues. This is really more
> >>>>>> of an SGE issue than anything, but perhaps Rayson can give
> >>>>>> better insight as to what's going on. It seems you're using 23GB
> >>>>>> nodes and 12GB jobs. Just for drill, does 'qhost' show each node
> >>>>>> having 23GB? It definitely seems like there's a boundary issue
> >>>>>> here given that two of your jobs together approach the total
> >>>>>> memory of the machine (23GB). Is it your goal to have only one
> >>>>>> job per node?
> >>>>>>
> >>>>>> ~Justin
> >>>>>>
> >>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
> >>>>>>> Dear all,
> >>>>>>>
> >>>>>>> I even wrote the queue submission script myself, adding the
> >>>>>>> mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameter, but sometimes two
> >>>>>>> jobs are randomly sent to one node that does not have enough
> >>>>>>> memory for two jobs, and they start running. I think SGE should
> >>>>>>> check the instance memory and not run multiple jobs on a
> >>>>>>> machine when the total memory requirement of the jobs is above
> >>>>>>> the memory available on the node (or maybe there is a bug in
> >>>>>>> the current check).
> >>>>>>>
> >>>>>>> Amir
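
One commonly used way to get the behaviour Amir describes above, i.e. having the scheduler do real bookkeeping of memory rather than only looking at the instantaneous load, is to make mem_free a consumable resource. A hedged sketch of that general approach (not necessarily the exact fix referenced elsewhere in this thread; the 23G figure is only illustrative for these instances):

    # Mark mem_free as consumable in the complex configuration
    # (change the "consumable" column for mem_free from NO to YES):
    $ qconf -mc

    # Then tell SGE how much memory each execution host really has by
    # setting complex_values on that host, e.g.:
    #   complex_values        mem_free=23G
    $ qconf -me master

With that in place, jobs submitted with "-l mem_free=..." should be counted against each node's memory budget, so a second large job should not be dispatched to a node whose remaining mem_free is too small.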
> >>>>>>>
> >>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
> >>>>>>>
> >>>>>>>> Hi Justin,
> >>>>>>>>
> >>>>>>>> I'm using a third-party tool to submit the jobs but I am
> >>>>>>>> setting the hard limit. For all my jobs I have something like
> >>>>>>>> this for the job description:
> >>>>>>>>
> >>>>>>>> [root_at_master test]# qstat -j 1
> >>>>>>>>
> >>>>>>>> ==============================================================
> >>>>>>>> job_number:              1
> >>>>>>>> exec_file:               job_scripts/1
> >>>>>>>> submission_time:         Tue Nov 8 17:31:39 2011
> >>>>>>>> owner:                   root
> >>>>>>>> uid:                     0
> >>>>>>>> group:                   root
> >>>>>>>> gid:                     0
> >>>>>>>> sge_o_home:              /root
> >>>>>>>> sge_o_log_name:          root
> >>>>>>>> sge_o_path:              /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
> >>>>>>>> sge_o_shell:             /bin/bash
> >>>>>>>> sge_o_workdir:           /data/test
> >>>>>>>> sge_o_host:              master
> >>>>>>>> account:                 sge
> >>>>>>>> stderr_path_list:        NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
> >>>>>>>> *hard resource_list:     h_vmem=12000M*
> >>>>>>>> mail_list:               root_at_master
> >>>>>>>> notify:                  FALSE
> >>>>>>>> job_name:                SAMPLE.bin_aln-chr1
> >>>>>>>> stdout_path_list:        NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
> >>>>>>>> jobshare:                0
> >>>>>>>> hard_queue_list:         all.q
> >>>>>>>> env_list:
> >>>>>>>> job_args:                -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
> >>>>>>>> script_file:             /bin/sh
> >>>>>>>> verify_suitable_queues:  2
> >>>>>>>> scheduling info:         (Collecting of scheduler job information is turned off)
> >>>>>>>>
> >>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
> >>>>>>>> instances, which I think have about 23GB of memory. The issue
> >>>>>>>> that I see is that too many jobs are submitted. I guess I need
> >>>>>>>> to set mem_free too? (The problem is that the tool I'm using
> >>>>>>>> does not seem to have a way to set that...)
> >>>>>>>>
> >>>>>>>> Many thanks, Amir
> >>>>>>>>
> >>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>> Hi Amirhossein,
> >>>>>>>
> >>>>>>> Did you specify the memory usage in your job script or at the
> >>>>>>> command line, and what parameters did you use exactly?
> >>>>>>>
> >>>>>>> Doing a quick search, I believe that the following will solve
> >>>>>>> the problem, although I haven't tested it myself:
> >>>>>>>
> >>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
> >>>>>>>
> >>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for
> >>>>>>> your job's memory requirements.
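
As a concrete illustration of that form, with made-up example values for a job that needs roughly 12GB:

    # Request at least 12G free at scheduling time, and cap the job's
    # virtual memory at 12.5G (values are illustrative only):
    $ qsub -l mem_free=12G,h_vmem=12.5G yourjob.sh

Note that h_vmem is typically enforced as a hard limit on the job's virtual memory, while mem_free only influences where the job is scheduled.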
> >>>>>>>
> >>>>>>> HTH,
> >>>>>>>
> >>>>>>> ~Justin
> >>>>>>>
> >>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
> >>>>>>>> Dear Star Cluster users,
> >>>>>>>
> >>>>>>>> I'm using Star Cluster to set up an SGE cluster, and when I
> >>>>>>>> ran my job list, although I had specified the memory usage for
> >>>>>>>> each job, it submitted too many jobs on my instance and my
> >>>>>>>> instance started running out of memory and swapping.
> >>>>>>>>
> >>>>>>>> I wonder if anyone knows how I could tell SGE the max memory
> >>>>>>>> to consider when submitting jobs to each node, so that it
> >>>>>>>> doesn't run the jobs if there is not enough memory available
> >>>>>>>> on a node.
> >>>>>>>>
> >>>>>>>> I'm using the Cluster GPU Quadruple Extra Large instances.
> >>>>>>>>
> >>>>>>>> Many thanks, Amirhossein Kiani
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
>
>

Received on Thu Jan 05 2012 - 15:05:17 EST