Re: AWS instance runs out of memory and swaps
So I tried this but couldn't run qconf because it was not installed.
I then installed it with apt-get, specifying "default" for the cell name and "master" for the master hostname, which are the defaults for the SGE cluster that StarCluster creates.
However now when I want to use qconf, it says:
root_at_master:/data/stanford/aligned# qconf -msconf
error: commlib error: got select error (Connection refused)
unable to send message to qmaster using port 6444 on host "master": got send error
Any idea how I could configure it to work?
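
For reference, StarCluster's own SGE install seems to live under /opt/sge6 (it shows up in the PATH in the qstat output quoted below), so I suspect the apt-get packages point at a different cell/spool directory than the running qmaster. Something like the following might be all that is needed, assuming the standard settings.sh layout -- I have not verified it:

    . /opt/sge6/default/common/settings.sh   # should set SGE_ROOT, SGE_CELL, PATH (and the port variables)
    qconf -msconf                            # should now reach the qmaster on "master"
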
Many thanks,
Amir
On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
> Hi Amirhossein,
>
> I was working on a few other things and only just saw your message -- I have had to spend less time on mailing-list discussions lately because of the amount of development and bug-fixing on my plate, and I am also working on a new patch release of OGS/Grid Engine 2011.11. Luckily, I just found the mail that solves exactly the issue you are encountering:
>
> http://markmail.org/message/zdj5ebfrzhnadglf
>
>
> For more info, see the "job_load_adjustments" and "load_adjustment_decay_time" parameters in the Grid Engine manpage:
>
>
> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
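>
> For illustration only (placeholder values, untested), the idea is to have the scheduler temporarily "charge" each newly started job against the host's free memory, so that a second job is not dispatched before the first one's real usage shows up in the load report. In the scheduler configuration (qconf -msconf) that would look roughly like:
>
>     job_load_adjustments          mem_free=12G
>     load_adjustment_decay_time    0:7:30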
>
> Rayson
>
> =================================
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
>
> ________________________________
> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
> To: Rayson Ho <raysonlogin_at_yahoo.com>
> Cc: Justin Riley <justin.t.riley_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Sent: Friday, December 2, 2011 6:36 PM
> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>
>
> Dear Rayson,
>
> Did you have a chance to test your solution on this? Basically, all I want is to prevent a job from running on an instance if it does not have the memory required for the job.
>
> I would very much appreciate your help!
>
> Many thanks,
> Amir
>
>
>
> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
>
>> Amir,
>>
>>
>> You can use qhost to list all the nodes and the resources each node has.
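>>
>> For example (illustrative output, not from your actual cluster):
>>
>>     $ qhost
>>     HOSTNAME    ARCH         NCPU  LOAD   MEMTOT   MEMUSE   SWAPTO   SWAPUS
>>     master      lx24-amd64      8  0.05    22.5G     1.2G     0.0      0.0
>>     node001     lx24-amd64      8  0.02    22.5G     0.8G     0.0      0.0
>>
>> "qhost -F mem_free" additionally shows the mem_free value the scheduler currently sees for each host.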
>>
>>
>> I have an answer to the memory issue, but I have not had time to properly type up and test a response.
>>
>>
>>
>> Rayson
>>
>> ________________________________
>> From: Amirhossein Kiani <amirhkiani_at_gmail.com>
>> To: Justin Riley <justin.t.riley_at_gmail.com>
>> Cc: Rayson Ho <rayrayson_at_gmail.com>; "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>> Sent: Monday, November 21, 2011 1:26 PM
>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>
>> Hi Justin,
>>
>> Many thanks for your reply.
>> I don't have any issue with multiple jobs running per node if there is enough memory for them, but given the nature of my jobs I can predict that only one should run per node.
>> How can I see how much memory SGE thinks each node has? Is there a way to list that?
>>
>> Regards,
>> Amir
>>
>>
>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>>
>>> Hi Amir,
>>>
>>> Sorry to hear you're still having issues. This is really more of an SGE
>>> issue than anything, but perhaps Rayson can give better insight into
>>> what's going on. It seems you're using 23GB nodes and 12GB jobs. Just to
>>> check: does 'qhost' show each node having 23GB? It definitely looks like
>>> a boundary issue, given that two of your jobs together approach the total
>>> memory of the machine (23GB). Is it your goal to have only one job per
>>> node?
>>>
>>> ~Justin
>>>
>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>> Dear all,
>>>>
>>>> I even wrote the queue submission script myself, adding the
>>>> mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
>>>> are still sent to a node that does not have enough memory for both, and
>>>> they start running anyway. I think SGE should check the instance's
>>>> memory and not run multiple jobs on a machine when the jobs' total
>>>> memory requirement exceeds the memory available on the node (or maybe
>>>> there is a bug in the current check).
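>>>>
>>>> (From what I have read, mem_free is only a measured load value by
>>>> default; to have the scheduler reserve memory per running job it
>>>> apparently has to be made a consumable complex, e.g. via "qconf -mc",
>>>> changing the mem_free line to something like the following -- untested
>>>> on my side:
>>>>
>>>>     #name      shortcut  type    relop  requestable  consumable  default  urgency
>>>>     mem_free   mf        MEMORY  <=     YES          YES         0        0
>>>>
>>>> and then setting complex_values mem_free=<total RAM> for each execution
>>>> host with "qconf -me <hostname>".)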
>>>>
>>>> Amir
>>>>
>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>>
>>>>> Hi Justin,
>>>>>
>>>>> I'm using a third-party tool to submit the jobs, but I am setting the
>>>>> hard limit.
>>>>> For all my jobs I have something like this for the job description:
>>>>>
>>>>> [root_at_master test]# qstat -j 1
>>>>> ==============================================================
>>>>> job_number:                 1
>>>>> exec_file:                  job_scripts/1
>>>>> submission_time:            Tue Nov 8 17:31:39 2011
>>>>> owner:                      root
>>>>> uid:                        0
>>>>> group:                      root
>>>>> gid:                        0
>>>>> sge_o_home:                 /root
>>>>> sge_o_log_name:             root
>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>> sge_o_shell:                /bin/bash
>>>>> sge_o_workdir:              /data/test
>>>>> sge_o_host:                 master
>>>>> account:                    sge
>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>> *hard resource_list:        h_vmem=12000M*
>>>>> mail_list:                  root_at_master
>>>>> notify:                     FALSE
>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>> jobshare:                   0
>>>>> hard_queue_list:            all.q
>>>>> env_list:
>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>>> script_file:                /bin/sh
>>>>> verify_suitable_queues:     2
>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>>
>>>>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>>>>> think have about 23GB of memory. The issue I see is that too many of
>>>>> the jobs are submitted at once. I guess I need to set mem_free too?
>>>>> (The problem is that the tool I'm using does not seem to have a way to
>>>>> set that...)
>>>>>
>>>>> Many thanks,
>>>>> Amir
>>>>>
>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>>
>>>>>>
>>>> Hi Amirhossein,
>>>>
>>>> Did you specify the memory usage in your job script or at command
>>>> line and what parameters did you use exactly?
>>>>
>>>> Doing a quick search, I believe the following will solve the problem,
>>>> although I haven't tested it myself:
>>>>
>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>
>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
>>>> job's memory requirements.
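>>>>
>>>> For example, for a job that needs roughly 12GB (illustrative values
>>>> only):
>>>>
>>>>     $ qsub -l mem_free=12G,h_vmem=12G yourjob.sh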
>>>>
>>>> HTH,
>>>>
>>>> ~Justin
>>>>
>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>> Dear Star Cluster users,
>>>>
>>>>> I'm using StarCluster to set up an SGE cluster, and when I ran my job
>>>>> list, although I had specified the memory usage for each job, it
>>>>> submitted too many jobs to my instance and the instance started running
>>>>> out of memory and swapping.
>>>>> I wonder if anyone knows how I could tell SGE the maximum memory to
>>>>> consider when submitting jobs to each node, so that it doesn't run jobs
>>>>> if there is not enough memory available on a node.
>>>>
>>>>> I'm using the Cluster GPU Quadruple Extra Large instances.
>>>>
>>>>> Many thanks,
>>>>> Amirhossein Kiani
>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster_at_mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
>>
Received on Mon Dec 05 2011 - 18:07:08 EST