StarCluster - Mailing List Archive

Re: Fwd: StarCluster Digest, Vol 44, Issue 5

From: Sergio Mafra <no email>
Date: Thu, 23 May 2013 09:49:16 -0300

Hi Hugh,

You're right. My mistake. :(
But I think the load balancer should tell us how many jobs are running, so
you can be pretty sure that it won't shut down instances that are still busy.

All the best,
Sergio
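
A minimal sketch of the kind of check being suggested here, for illustration only
(this is not StarCluster's actual implementation, and the helper name is made up):
count running vs. queued jobs by parsing plain qstat output like the listing
quoted below.

    import subprocess

    def count_jobs():
        """Count running vs. queued SGE jobs from plain `qstat` output."""
        out = subprocess.check_output(["qstat"]).decode()
        running = queued = 0
        for line in out.splitlines()[2:]:   # skip the header and dashed separator
            fields = line.split()
            if len(fields) < 5:
                continue
            state = fields[4]               # SGE state column: 'r', 'qw', etc.
            if "r" in state:
                running += 1
            elif "qw" in state:
                queued += 1
        return running, queued

    if __name__ == "__main__":
        running, queued = count_jobs()
        print("Running jobs: %d / Queued jobs: %d" % (running, queued))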


On Wed, May 22, 2013 at 11:52 AM, MacMullan, Hugh
<hughmac_at_wharton.upenn.edu> wrote:

> Hi Sergio:
>
> Those jobs aren't queued, they're running.
>
> Don't know why nodes aren't being removed though... has it been 55 minutes
> (I think that's the default)?
>
> Cheers, Hugh
>
> From: Sergio Mafra [mailto:sergiohmafra_at_gmail.com]
> Sent: Wednesday, May 22, 2013 10:43 AM
> To: MacMullan, Hugh
> Cc: starcluster_at_mit.edu
> Subject: Re: [StarCluster] Fwd: StarCluster Digest, Vol 44, Issue 5
>
> Hi all,
>
> Just followed the tip from Hugh MacMullan (submit a ghost job to restart
> the stats so StarCluster can begin load balancing), but still got some
> strange info.
>
> At the present time, the queue has 3 jobs:
>
> sgeadmin_at_master:~/gpo2/PEN_2013/GTBase/caso4$ qstat
> job-ID  prior    name  user      state  submit/start at      queue           slots  ja-task-ID
> -----------------------------------------------------------------------------------------------
> 2       0.50500  c4    sgeadmin  r      05/22/2013 14:30:08  all.q_at_node001  53
> 3       0.50500  c5    sgeadmin  r      05/22/2013 14:30:23  all.q_at_node001  53
> 4       0.60500  c6    sgeadmin  r      05/22/2013 14:30:38  all.q_at_node001  54
>
> But the StarCluster load balancer doesn't understand that... It says
> Queued jobs: 0
>
> >>> Loading full job history
> Execution hosts: 5
> Queued jobs: 0
> Avg job duration: 0 secs
> Avg job wait time: 1 secs
> Last cluster modification time: 2013-05-22 14:33:34
> >>> Not adding nodes: already at or above maximum (5)
> >>> Looking for nodes to remove...
> >>> No nodes can be removed at this time
> >>> Sleeping...(looping again in 60 secs)
>
> On Tue, May 7, 2013 at 4:37 PM, MacMullan, Hugh <hughmac_at_wharton.upenn.edu>
> wrote:
>
> Sergio:
>
> One job needs to run to completion before the stats are available to begin
> loadbalancing in StarCluster. Like:
>
> echo hostname | qsub -o /dev/null -j y
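
A minimal sketch of this tip in script form, assuming it runs on the cluster
master with the SGE tools on PATH (this is not part of StarCluster itself); the
accounting path is the one from the qacct error quoted later in this message.

    import os.path
    import subprocess
    import time

    # Path reported in the "No such file or directory" error further down;
    # SGE creates it once the first job has completed and been accounted.
    ACCOUNTING = "/opt/sge6/default/common/accounting"

    # Submit the throwaway "ghost" job suggested above.
    subprocess.check_call("echo hostname | qsub -o /dev/null -j y", shell=True)

    # Wait until the accounting file exists, i.e. qacct (and therefore the
    # load balancer's stats query) has something to read.
    while not os.path.exists(ACCOUNTING):
        time.sleep(10)
    print("Accounting file present; job history is now available")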
>
> It's funny, I just ran into those same two issues (timezone and no
> completed job) recently while fiddling with my own AMIs and loadbalancing.
>
> Cheers, Hugh
>
> From: starcluster-bounces_at_mit.edu [mailto:starcluster-bounces_at_mit.edu] On
> Behalf Of Sergio Mafra
> Sent: Tuesday, May 07, 2013 3:30 PM
> To: starcluster_at_mit.edu
> Subject: [StarCluster] Fwd: StarCluster Digest, Vol 44, Issue 5
>
>
> Hi fellows,
>
> Any help on this?
>
> All the best,
>
> Sergio
>
> ---------- Forwarded message ----------
> From: Sergio Mafra <sergiohmafra_at_gmail.com>
> Date: Tue, May 7, 2013 at 4:17 PM
> Subject: Re: [StarCluster] StarCluster Digest, Vol 44, Issue 5
> To: Rajat Banerjee <rajatb_at_post.harvard.edu>
>
> Hi Rajat,
>
> I think the date problem is over. Now we've got a new one. Check it out:
>
> ubuntu_at_domU-12-31-39-02-19-36:~$ starcluster loadbalance spotcluster
> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster_at_mit.edu
>
> >>> Starting load balancer (Use ctrl-c to exit)
> Maximum cluster size: 5
> Minimum cluster size: 1
> Cluster growth rate: 1 nodes/iteration
>
> >>> Loading full job history
> *** WARNING - Failed to retrieve stats (1/5):
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 515, in get_stats
>     self.stat = self._get_stats()
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 493, in _get_stats
>     qacct = '\n'.join(master.ssh.execute(qacct_cmd))
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 538, in execute
>     msg, command, exit_status, out_str)
> RemoteCommandFailed: remote command 'source /etc/profile && qacct -j -b 201305071615' failed with status 1:
> no jobs running since startup
> /opt/sge6/default/common/accounting: No such file or directory
> *** WARNING - Retrying in 60s
>
> Just to tell you that I'm running MPICH2. This is part of my config file:
>
> [cluster NewaveUbuntuHVM]
> KEYNAME = MasterNode
> CLUSTER_SIZE = 5
> CLUSTER_USER = sgeadmin
> CLUSTER_SHELL = bash
> MASTER_IMAGE_ID = ami-7f1d8a16
> NODE_IMAGE_ID = ami-411d8a28
> NODE_INSTANCE_TYPE = cr1.8xlarge
> PLUGINS = mpich2
> VOLUMES = newave
>
> All the best,
>
> Sergio
>
> On Mon, May 6, 2013 at 10:42 AM, Sergio Mafra <sergiohmafra_at_gmail.com>
> wrote:
>
> Hi Rajat,
>
> Thanks so much for your help. I'll do as you said and report the results
> here.
>
> All the best,
>
> Sergio
>
> On Sun, May 5, 2013 at 2:48 PM, Rajat Banerjee <rajatb_at_post.harvard.edu>
> wrote:
>
> Hi Sergio,
>
> Sorry for the delayed response. Busy week at work. Adding the starcluster
> alias back, in case this helps other people in the future.
>
> Like I said, I'm not sure why your EC2 instance is coming up with PDT,
> since from what I remember it would always return UTC.
>
> Is it possible for you to download the latest dev version if you haven't
> tried that already?
>
> http://star.mit.edu/cluster/docs/latest/contribute.html
>
>
> Slightly different directions than the one you specified. Then you can
> modify this file: starcluster/balancers/sge/__init__.py, line 466, to
> replace UTC with PDT. Then run the ELB, and it'll run the latest code. Let
> me know if that works, and we can file a bug and go through the formal
> process of letting you switch the timezone. I'm guessing that you're pretty
> familiar with Python programming, but if you have more problems then feel
> free to ask more questions.
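
A standalone sketch of the one-line tweak described above (the exact line
number may differ between StarCluster versions); the failing date string is
taken from the traceback quoted further down in this thread.

    import datetime

    # Example `date` output from the affected instance (from the traceback below).
    remote_date = "Tue Apr 30 07:01:35 PDT 2013"

    # Stock pattern in get_remote_time() -- raises ValueError on the string above
    # because "UTC" is matched literally:
    #     datetime.datetime.strptime(remote_date, "%a %b %d %H:%M:%S UTC %Y")

    # Local workaround: make the literal zone match what the instance reports.
    parsed = datetime.datetime.strptime(remote_date, "%a %b %d %H:%M:%S PDT %Y")
    print(parsed)   # 2013-04-30 07:01:35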
>
> Best,
> Rajat
>
> On Tue, Apr 30, 2013 at 2:07 PM, Sergio Mafra <sergiohmafra_at_gmail.com>
> wrote:
>
> Hi Rajat,
>
> Thanks so much for your kindness in finding out where this error was. Nice!
>
> It's a little bit odd to understand what is causing that, since I'm using
> the SC Controller as an instance in the same zone (us-east-1d) as the
> cluster launched by it. So this should be in the same time format...???
>
> What I did was download the code directly from the Git site and install it
> as described in http://star.mit.edu/cluster/docs/latest/installation.html
>
> ???
>
> All the best,
>
> Sergio
>
> On Tue, Apr 30, 2013 at 2:18 PM, Rajat Banerjee <rajatb_at_post.harvard.edu>
> wrote:
>
> Sergio,
> I looked at the code that is causing your problems. It's this line:
>
> return datetime.datetime.strptime(str, "%a %b %d %H:%M:%S UTC %Y")
>
> where 'str' is the output of the *remote* call to date. My mac returns
> this:
>
> rbanerjee:~/starcluster/StarCluster/starcluster/balancers/sge $ date
> Tue Apr 30 13:12:44 EDT 2013
>
> which looks like it would match the time format OK. One oddity is that your
> remote date output is returning PDT when UTC is expected:
>
> ValueError: time data 'Tue Apr 30 07:01:35 PDT 2013' does not match format
> '%a %b %d %H:%M:%S UTC %Y'
>
> Not sure what's causing the problem since it looks mostly right, but feel
> free to tweak the time setting in the code in
> starcluster/balancers/sge/__init__.py, line 466, to make the pattern match
> yours. Do you know why your time zone may be set differently than other AWS
> instances we've used? Custom images?
>
> Best,
> Rajat
>
> On Tue, Apr 30, 2013 at 12:43 PM, <starcluster-request_at_mit.edu> wrote:
>
> Send StarCluster mailing list submissions to
> starcluster_at_mit.edu
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mailman.mit.edu/mailman/listinfo/starcluster
> or, via email, send a message with subject or body 'help' to
> starcluster-request_at_mit.edu
>
> You can reach the person managing the list at
> starcluster-owner_at_mit.edu
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of StarCluster digest..."
>
> Today's Topics:
>
> 1. Unable to mount ebs volume (Jerry Lee, GW/US)
> 2. LoadBalance (Sergio Mafra)
>
>
> ---------- Forwarded message ----------
> From: "Jerry Lee, GW/US" <Jerry.Lee_at_genewiz.com>
> To: <starcluster_at_mit.edu>
> Cc:
> Date: Mon, 15 Apr 2013 17:11:38 -0500
> Subject: [StarCluster] Unable to mount ebs volume
>
> Hi,
>
> I am a beginner with StarCluster. I created an EBS volume via Amazon AWS
> and configured it in the config file for my cluster, but no matter what I
> do, it doesn't automatically mount the EBS volume onto the cluster. Please
> help.
>
> [cluster jerrycluster]
> EXTENDS = smallcluster
> VOLUMES = testdata
>
> [volume testdata]
> VOLUME_ID=vol-a0fe24f9
> MOUNT_PATH=/data
>
> >>> Waiting for cluster to come up... (updating every 30s)
> >>> Waiting for instances to activate...
> >>> Waiting for all nodes to be in a 'running' state...
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for SSH to come up on all nodes...
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for cluster to come up took 1.847 mins
> >>> The master node is ec2-54-234-229-206.compute-1.amazonaws.com
> >>> Setting up the cluster...
> >>> Configuring hostnames...
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Creating cluster user: None (uid: 1001, gid: 1001)
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Configuring scratch space for user(s): sgeadmin
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Configuring /etc/hosts on each node
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Starting NFS server on master
> >>> Configuring NFS exports path(s):
> /home
> >>> Mounting all NFS export path(s) on 1 worker node(s)
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Setting up NFS took 0.073 mins
> >>> Configuring passwordless ssh for root
> >>> Configuring passwordless ssh for sgeadmin
> >>> Shutting down threads...
> 20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Configuring SGE...
> >>> Configuring NFS exports path(s):
> /opt/sge6
> >>> Mounting all NFS export path(s) on 1 worker node(s)
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Setting up NFS took 0.020 mins
> >>> Installing Sun Grid Engine...
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Creating SGE parallel environment 'orte'
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Adding parallel environment 'orte' to queue 'all.q'
> >>> Shutting down threads...
> 20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Configuring cluster took 1.325 mins
> >>> Starting cluster took 3.197 mins
>
> Thanks,
>
> Jerry Lee
>
> Jerry Lee
> Assistant Manager of Global Infrastructure
> GENEWIZ Inc.
> 40 Cragwood Road, Suite 201
> South Plainfield, NJ 07080
> Phone: 908-222-0711 ext. 3379
> Fax: 908-333-4511
> jerry.lee_at_genewiz.com
> www.genewiz.com
>
>
> ---------- Forwarded message ----------
> From: Sergio Mafra <sergiohmafra_at_gmail.com>
> To: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Cc:
> Date: Tue, 30 Apr 2013 11:05:45 -0300
> Subject: [StarCluster] LoadBalance
>
> Hi fellows,
>
> I'm testing StarCluster version 0.999 and so far so good.
>
> One thing that isn't working is loadbalance. This is what I get:
>
> ubuntu_at_domU-12-31-39-02-19-36:~$ starcluster loadbalance newcam
> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster_at_mit.edu
>
> >>> Starting load balancer (Use ctrl-c to exit)
> Maximum cluster size: 3
> Minimum cluster size: 1
> Cluster growth rate: 1 nodes/iteration
>
> *** WARNING - Failed to retrieve stats (1/5):
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 515, in get_stats
>     self.stat = self._get_stats()
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 487, in _get_stats
>     now = self.get_remote_time()
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 466, in get_remote_time
>     return datetime.datetime.strptime(str, "%a %b %d %H:%M:%S UTC %Y")
>   File "/usr/lib/python2.7/_strptime.py", line 325, in _strptime
>     (data_string, format))
> ValueError: time data 'Tue Apr 30 07:01:35 PDT 2013' does not match format '%a %b %d %H:%M:%S UTC %Y'
> *** WARNING - Retrying in 60s
> ^CTraceback (most recent call last):
>   File "/usr/local/bin/starcluster", line 9, in <module>
>     load_entry_point('StarCluster==0.9999', 'console_scripts', 'starcluster')()
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 313, in main
>     StarClusterCLI().main()
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 257, in main
>     sc.execute(args)
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/commands/loadbalance.py", line 90, in execute
>     lb.run(cluster)
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 576, in run
>     self.get_stats()
>   File "<string>", line 2, in get_stats
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 92, in wrap_f
>     res = func(*arg, **kargs)
>   File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/balancers/sge/__init__.py", line 521, in get_stats
>     time.sleep(self.polling_interval)
>
> Any ideas?
>
> All the best,
>
> Sergio
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Thu May 23 2013 - 08:49:20 EDT
This archive was generated by hypermail 2.3.0.
