StarCluster - Mailing List Archive

Re: Fwd: StarCluster Digest, Vol 44, Issue 5

From: Rajat Banerjee <no email>
Date: Thu, 23 May 2013 13:14:36 -0400

Hi Sergio,
The reason that a node has not been removed is documented in the ELB
document:

http://star.mit.edu/cluster/docs/0.93.3/manual/load_balancer.html

See the "criteria for removing a node" section.

Does that make sense?
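
As a rough illustration only (this is not the StarCluster code, and the section linked above is the authority), here is a minimal sketch of the kind of check involved. It assumes two criteria: the node must be idle, and it must be far enough into its current hour of uptime. The function name and the 55-minute threshold are hypothetical; the threshold is simply the figure Hugh guessed at further down.

```python
def can_remove_node(is_idle, minutes_up, threshold_minutes=55):
    """Return True if a node looks removable under the assumed criteria.

    Assumptions (not confirmed by this thread): a node must be idle, and it
    must be at least `threshold_minutes` into its current hour of uptime.
    """
    minutes_into_hour = minutes_up % 60
    return is_idle and minutes_into_hour >= threshold_minutes


print(can_remove_node(is_idle=False, minutes_up=58))  # node still running a job
print(can_remove_node(is_idle=True, minutes_up=20))   # idle, but early in the hour
print(can_remove_node(is_idle=True, minutes_up=58))   # idle and late in the hour
```

Under these assumptions, a node that is still running a job is never a removal candidate, however long it has been up.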


On Thu, May 23, 2013 at 8:49 AM, Sergio Mafra <sergiohmafra_at_gmail.com> wrote:

> Hi Hugh,
>
> You're right. My mistake. :(
> But I think the load balancer should tell us how many jobs are running, so
> that you can be sure it won't shut down instances that are still busy.
>
> All the best,
> Sergio
>
>
> On Wed, May 22, 2013 at 11:52 AM, MacMullan, Hugh <
> hughmac_at_wharton.upenn.edu> wrote:
>
>> Hi Sergio:
>>
>> Those jobs aren’t queued, they’re running.
>>
>> Don’t know why nodes aren’t being removed though … has it been 55
>> minutes (I think that’s the default)?
>>
>> Cheers, Hugh
>>
>> *From:* Sergio Mafra [mailto:sergiohmafra_at_gmail.com]
>> *Sent:* Wednesday, May 22, 2013 10:43 AM
>> *To:* MacMullan, Hugh
>> *Cc:* starcluster_at_mit.edu
>> *Subject:* Re: [StarCluster] Fwd: StarCluster Digest, Vol 44, Issue 5
>>
>> Hi all,
>>
>> I just followed the tip from Hugh MacMullan (submit a ghost job so that the
>> stats become available and StarCluster can begin load balancing), but I
>> still got some strange info.
>>
>> At present, the queue has 3 jobs:
>>
>> sgeadmin@master:~/gpo2/PEN_2013/GTBase/caso4$ qstat
>>
>> job-ID  prior    name  user      state  submit/start at      queue          slots  ja-task-ID
>> ---------------------------------------------------------------------------------------------
>>      2  0.50500  c4    sgeadmin  r      05/22/2013 14:30:08  all.q@node001  53
>>      3  0.50500  c5    sgeadmin  r      05/22/2013 14:30:23  all.q@node001  53
>>      4  0.60500  c6    sgeadmin  r      05/22/2013 14:30:38  all.q@node001  54
>>
>> But the StarCluster load balancer doesn't understand that... It says Queued
>> jobs: 0
>>
>> >>> Loading full job history
>> Execution hosts: 5
>> Queued jobs: 0
>> Avg job duration: 0 secs
>> Avg job wait time: 1 secs
>> Last cluster modification time: 2013-05-22 14:33:34
>> >>> Not adding nodes: already at or above maximum (5)
>> >>> Looking for nodes to remove...
>> >>> No nodes can be removed at this time
>> >>> Sleeping...(looping again in 60 secs
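
The two outputs above are consistent once the state column is taken into account: all three jobs are in state `r` (running), and none are in `qw` (queued and waiting), so "Queued jobs: 0" is accurate. A minimal sketch of counting jobs per state from plain `qstat` text follows. The helper name is hypothetical, it assumes the default column layout shown in this thread with the state in the fifth whitespace-separated field, and it is not how StarCluster itself reads the queue.

```python
from collections import Counter


def count_job_states(qstat_output):
    """Count jobs per state from plain `qstat` text output."""
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; skip the header and separator.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        counts[fields[4]] += 1
    return counts


SAMPLE = """\
job-ID  prior    name  user      state  submit/start at      queue          slots
---------------------------------------------------------------------------------
      2  0.50500  c4    sgeadmin  r      05/22/2013 14:30:08  all.q@node001  53
      3  0.50500  c5    sgeadmin  r      05/22/2013 14:30:23  all.q@node001  53
      4  0.60500  c6    sgeadmin  r      05/22/2013 14:30:38  all.q@node001  54
"""

states = count_job_states(SAMPLE)
print("running:", states["r"], "queued:", states["qw"])
```

For the listing in this thread the sketch reports 3 running and 0 queued jobs.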
>>
>> On Tue, May 7, 2013 at 4:37 PM, MacMullan, Hugh <
>> hughmac_at_wharton.upenn.edu> wrote:
>>
>> Sergio:
>>
>> One job needs to run to completion before the stats are available to
>> begin loadbalancing in StarCluster. Like:
>>
>> echo hostname | qsub -o /dev/null -j y
>>
>> It’s funny, I just ran into those same two issues (timezone and no
>> completed job) recently while fiddling with my own AMIs and loadbalancing.
>>
>> Cheers, Hugh
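
The condition Hugh describes can be recognised from the qacct failure quoted further down in this thread. A minimal sketch, assuming those two messages mean only that no job has completed yet, follows. The helper name and marker list are hypothetical and are not part of StarCluster.

```python
# Messages copied from the qacct failure quoted in this thread.
NOT_READY_MARKERS = (
    "no jobs running since startup",
    "accounting: No such file or directory",
)


def accounting_not_ready(qacct_output):
    """Return True if the output suggests no job has completed yet."""
    return any(marker in qacct_output for marker in NOT_READY_MARKERS)


sample = (
    "no jobs running since startup\n"
    "/opt/sge6/default/common/accounting: No such file or directory"
)
print(accounting_not_ready(sample))
```

A caller could use a check like this to suggest submitting a throwaway job, as in the `qsub` one-liner above, instead of only retrying.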
>>
>> *From:* starcluster-bounces_at_mit.edu [mailto:starcluster-bounces_at_mit.edu]
>> *On Behalf Of *Sergio Mafra
>> *Sent:* Tuesday, May 07, 2013 3:30 PM
>> *To:* starcluster_at_mit.edu
>> *Subject:* [StarCluster] Fwd: StarCluster Digest, Vol 44, Issue 5
>>
>> Hi fellows,
>>
>> Any help on this?
>>
>> All the best,
>>
>> Sergio
>>
>> ---------- Forwarded message ----------
>> From: *Sergio Mafra* <sergiohmafra_at_gmail.com>
>> Date: Tue, May 7, 2013 at 4:17 PM
>> Subject: Re: [StarCluster] StarCluster Digest, Vol 44, Issue 5
>> To: Rajat Banerjee <rajatb_at_post.harvard.edu>
>>
>> Hi Rajat,
>>
>> I think the date problem is solved. Now we've got a new one. Check it
>> out:
>>
>> ubuntu@domU-12-31-39-02-19-36:~$ starcluster loadbalance spotcluster
>> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
>> Software Tools for Academics and Researchers (STAR)
>> Please submit bug reports to starcluster_at_mit.edu
>>
>> >>> Starting load balancer (Use ctrl-c to exit)
>> Maximum cluster size: 5
>> Minimum cluster size: 1
>> Cluster growth rate: 1 nodes/iteration
>>
>> >>> Loading full job history
>> *** WARNING - Failed to retrieve stats (1/5):
>> Traceback (most recent call last):
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 515, in get_stats
>>     self.stat = self._get_stats()
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 493, in _get_stats
>>     qacct = '\n'.join(master.ssh.execute(qacct_cmd))
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 538, in execute
>>     msg, command, exit_status, out_str)
>> RemoteCommandFailed: remote command 'source /etc/profile && qacct -j -b 201305071615' failed with status 1:
>> no jobs running since startup
>> /opt/sge6/default/common/accounting: No such file or directory
>> *** WARNING - Retrying in 60s
>>
>> Just to let you know, I'm running MPICH2. This is part of my config file:
>>
>> [cluster NewaveUbuntuHVM]
>> KEYNAME = MasterNode
>> CLUSTER_SIZE = 5
>> CLUSTER_USER = sgeadmin
>> CLUSTER_SHELL = bash
>> MASTER_IMAGE_ID = ami-7f1d8a16
>> NODE_IMAGE_ID = ami-411d8a28
>> NODE_INSTANCE_TYPE = cr1.8xlarge
>> PLUGINS = mpich2
>> VOLUMES = newave
>>
>> All the best,
>>
>> Sergio
>>
>> On Mon, May 6, 2013 at 10:42 AM, Sergio Mafra <sergiohmafra_at_gmail.com>
>> wrote:
>>
>> Hi Rajat,
>>
>> Thanks so much for your help. I'll do as you said and report the results
>> here.
>>
>> All the best,
>>
>> Sergio
>>
>> On Sun, May 5, 2013 at 2:48 PM, Rajat Banerjee <rajatb_at_post.harvard.edu>
>> wrote:
>>
>> Hi Sergio,
>>
>> Sorry for the delayed response. Busy week at work. Adding starcluster
>> alias back, in case this helps other people in the future.
>>
>> Like I said, I'm not sure why your instance is coming up with PDT in the
>> EC2 instance, since from what I remember it would always return UTC.
>>
>> Is it possible for you to download the latest dev version if you haven't
>> tried that already?
>>
>> http://star.mit.edu/cluster/docs/latest/contribute.html
>>
>> Slightly different directions than the one you specified. Then you can
>> modify this file:
>>
>> starcluster/balancers/sge/__init__.py line 466
>>
>> to replace UTC with PDT. Then run the ELB, and it'll run the latest code.
>> Let me know if that works, and we can file a bug and go through the formal
>> process of letting you switch the timezone. I'm guessing that you're pretty
>> familiar with python programming, but if you have more problems then feel
>> free to ask more questions.
>>
>> Best,
>>
>> Rajat
>>
>> On Tue, Apr 30, 2013 at 2:07 PM, Sergio Mafra <sergiohmafra_at_gmail.com>
>> wrote:
>>
>> Hi Rajat,
>>
>> Thanks so much for your kindness in tracking down where this error
>> was. Nice!
>>
>> It's a little hard to understand what is causing it, since I'm running
>> the SC controller as an instance in the same zone (us-east-1d) as the
>> cluster it launched. So both should report the same time format...?
>>
>> What I did was download the code directly from the Git site and build
>> it as described in
>> http://star.mit.edu/cluster/docs/latest/installation.html
>>
>> ???
>>
>> All the best,
>>
>> Sergio
>>
>> On Tue, Apr 30, 2013 at 2:18 PM, Rajat Banerjee <rajatb_at_post.harvard.edu>
>> wrote:
>>
>> Sergio,
>> I looked at the code that is causing your problems. It's this line:
>>
>> return datetime.datetime.strptime(str, "%a %b %d %H:%M:%S UTC %Y")
>>
>> where 'str' is the output of the *remote* call to date. My mac returns
>> this:
>> rbanerjee:~/starcluster/StarCluster/starcluster/balancers/sge $ date
>> Tue Apr 30 13:12:44 EDT 2013
>>
>> That has the layout the format string expects. The oddity is that your
>> remote date call is returning PDT when a literal UTC is expected:
>>
>> ValueError: time data 'Tue Apr 30 07:01:35 PDT 2013' does not match
>> format '%a %b %d %H:%M:%S UTC %Y'
>>
>> I'm not sure why the zone differs, but feel free to tweak the format
>> string in starcluster/balancers/sge/__init__.py line 466 to make the
>> pattern match yours. Do you know why your time zone may be set differently
>> than other AWS instances we've used? Custom images?
>>
>> Best,
>> Rajat
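
The format string above fails because it hard-codes the literal text UTC. A minimal sketch of a more tolerant parser follows, assuming the remote `date` output always has the layout shown in this thread. The function name and regular expression are hypothetical and this is not the StarCluster implementation. It accepts any alphabetic zone abbreviation and returns it alongside a naive datetime, without converting between zones.

```python
import re
from datetime import datetime

# Matches e.g. "Tue Apr 30 07:01:35 PDT 2013"; the zone is any run of letters.
DATE_RE = re.compile(
    r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ([A-Za-z]+) (\d{4})$"
)


def parse_remote_date(text):
    """Return (naive datetime, zone abbreviation) parsed from `date` output."""
    match = DATE_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized date output: %r" % text)
    stamp, zone, year = match.groups()
    # Parse everything except the zone, which strptime cannot treat generically.
    parsed = datetime.strptime("%s %s" % (stamp, year), "%a %b %d %H:%M:%S %Y")
    return parsed, zone


parsed, zone = parse_remote_date("Tue Apr 30 07:01:35 PDT 2013")
print(parsed.isoformat(), zone)
```

An alternative that avoids zone handling altogether would be to have the remote side print a zone-independent value, for example with `date -u`, at the cost of changing the remote command.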
>>
>> On Tue, Apr 30, 2013 at 12:43 PM, <starcluster-request_at_mit.edu> wrote:
>>
>> Send StarCluster mailing list submissions to
>> starcluster_at_mit.edu
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>> or, via email, send a message with subject or body 'help' to
>> starcluster-request_at_mit.edu
>>
>> You can reach the person managing the list at
>> starcluster-owner_at_mit.edu
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of StarCluster digest..."
>>
>> Today's Topics:
>>
>> 1. Unable to mount ebs volume (Jerry Lee, GW/US)
>> 2. LoadBalance (Sergio Mafra)
>>
>>
>> ---------- Forwarded message ----------
>> From: "Jerry Lee, GW/US" <Jerry.Lee_at_genewiz.com>
>> To: <starcluster_at_mit.edu>
>> Cc:
>> Date: Mon, 15 Apr 2013 17:11:38 -0500
>> Subject: [StarCluster] Unable to mount ebs volume
>>
>> Hi,
>>
>> I am a beginner with StarCluster. I created an EBS volume via
>> Amazon AWS and configured it in the config file for use with my cluster,
>> but no matter what I do, it doesn't automatically mount the EBS volume on
>> the cluster. Please help.
>>
>> [cluster jerrycluster]
>> EXTENDS = smallcluster
>> VOLUMES = testdata
>>
>> [volume testdata]
>> VOLUME_ID=vol-a0fe24f9
>> MOUNT_PATH=/data
>>
>> >>> Waiting for cluster to come up... (updating every 30s)
>> >>> Waiting for instances to activate...
>> >>> Waiting for all nodes to be in a 'running' state...
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Waiting for SSH to come up on all nodes...
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Waiting for cluster to come up took 1.847 mins
>> >>> The master node is ec2-54-234-229-206.compute-1.amazonaws.com
>> >>> Setting up the cluster...
>> >>> Configuring hostnames...
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Creating cluster user: None (uid: 1001, gid: 1001)
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Configuring scratch space for user(s): sgeadmin
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Configuring /etc/hosts on each node
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Starting NFS server on master
>> >>> Configuring NFS exports path(s):
>> /home
>> >>> Mounting all NFS export path(s) on 1 worker node(s)
>> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Setting up NFS took 0.073 mins
>> >>> Configuring passwordless ssh for root
>> >>> Configuring passwordless ssh for sgeadmin
>> >>> Shutting down threads...
>> 20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Configuring SGE...
>> >>> Configuring NFS exports path(s):
>> /opt/sge6
>> >>> Mounting all NFS export path(s) on 1 worker node(s)
>> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Setting up NFS took 0.020 mins
>> >>> Installing Sun Grid Engine...
>> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Creating SGE parallel environment 'orte'
>> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Adding parallel environment 'orte' to queue 'all.q'
>> >>> Shutting down threads...
>> 20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Configuring cluster took 1.325 mins
>> >>> Starting cluster took 3.197 mins
>>
>> Thanks,
>>
>> Jerry Lee
>>
>> Jerry Lee
>> Assistant Manager of Global Infrastructure
>> GENEWIZ Inc.
>> 40 Cragwood Road. Suite 201
>> South Plainfield, NJ 07080
>> Phone: 908-222-0711 ext. 3379
>> Fax: 908-333-4511
>> jerry.lee_at_genewiz.com
>> www.genewiz.com
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Sergio Mafra <sergiohmafra_at_gmail.com>
>> To: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>> Cc:
>> Date: Tue, 30 Apr 2013 11:05:45 -0300
>> Subject: [StarCluster] LoadBalance
>>
>> Hi fellows,
>>
>> I'm testing StarCluster version 0.9999 and so far so good.
>> One thing that isn't working is loadbalance. This is what I get:
>>
>> ubuntu@domU-12-31-39-02-19-36:~$ starcluster loadbalance newcam
>> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
>> Software Tools for Academics and Researchers (STAR)
>> Please submit bug reports to starcluster_at_mit.edu
>>
>> >>> Starting load balancer (Use ctrl-c to exit)
>> Maximum cluster size: 3
>> Minimum cluster size: 1
>> Cluster growth rate: 1 nodes/iteration
>>
>> *** WARNING - Failed to retrieve stats (1/5):
>> Traceback (most recent call last):
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 515, in get_stats
>>     self.stat = self._get_stats()
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 487, in _get_stats
>>     now = self.get_remote_time()
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 466, in get_remote_time
>>     return datetime.datetime.strptime(str, "%a %b %d %H:%M:%S UTC %Y")
>>   File "/usr/lib/python2.7/_strptime.py", line 325, in _strptime
>>     (data_string, format))
>> ValueError: time data 'Tue Apr 30 07:01:35 PDT 2013' does not match format '%a %b %d %H:%M:%S UTC %Y'
>> *** WARNING - Retrying in 60s
>> ^CTraceback (most recent call last):
>>   File "/usr/local/bin/starcluster", line 9, in <module>
>>     load_entry_point('StarCluster==0.9999', 'console_scripts', 'starcluster')()
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 313, in main
>>     StarClusterCLI().main()
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 257, in main
>>     sc.execute(args)
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/commands/loadbalance.py", line 90, in execute
>>     lb.run(cluster)
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 576, in run
>>     self.get_stats()
>>   File "<string>", line 2, in get_stats
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 92, in wrap_f
>>     res = func(*arg, **kargs)
>>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 521, in get_stats
>>     time.sleep(self.polling_interval)
>>
>> Any ideas?
>>
>> All Best,
>>
>> Sergio
>>
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
>
>
Received on Thu May 23 2013 - 13:14:58 EDT
This archive was generated by hypermail 2.3.0.
