StarCluster - Mailing List Archive

Re: FW: commlib error

From: Rayson Ho <no email>
Date: Tue, 23 Sep 2014 21:14:42 -0400

There's an issue with the code. I will submit a fix soon.

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Tue, Sep 23, 2014 at 6:06 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
wrote:

> Thanks, Rayson. However, when I pulled the latest git clone and tried this
> out, I got the error "Placement groups may not be used with instances of
> type 't2.micro'".
>
> ------------------------------
> *From:* Rayson Ho [raysonlogin_at_gmail.com]
> *Sent:* Tuesday, September 23, 2014 3:09 PM
> *To:* Amanda Joy Kedaigle
> *Cc:* MacMullan, Hugh; starcluster_at_mit.edu
>
> *Subject:* Re: [StarCluster] FW: commlib error
>
> t2 support was added back in July; I believe you need to use the latest
> git clone, as the code is not in a release yet.
>
> Also, t2 instances only run in a VPC, but if your account is new enough,
> all your instances run in a default VPC automatically. Otherwise, you will
> need to set up a VPC.
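>
> For example, the relevant config settings would look something like this
> (a minimal sketch; the subnet ID is a placeholder, and SUBNET_ID is only
> available in recent git versions):
>
> [cluster mycluster]
> MASTER_INSTANCE_TYPE = t2.micro
> NODE_INSTANCE_TYPE = t2.micro
> SUBNET_ID = subnet-0123abcd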
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
> On Tue, Sep 23, 2014 at 2:56 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
> wrote:
>
>> Thanks for that plugin suggestion! I tried avoiding the master node
>> through "qsub -l", and so far that seems to be helping!
>>
>> As for using a t2.micro or t2.small, those don't seem to be options in
>> StarCluster? They are not listed as instance options in my config file, and
>> when I previously tried an option outside of those listed, I got an error.
>> If anyone knows a way around that, I'd be interested!
>>
>> Thanks to both of you
>> Amanda
>>
>> ------------------------------
>> *From:* MacMullan, Hugh [hughmac_at_wharton.upenn.edu]
>> *Sent:* Tuesday, September 23, 2014 2:50 PM
>> *To:* Amanda Joy Kedaigle
>> *Cc:* starcluster_at_mit.edu
>> *Subject:* RE: [StarCluster] FW: commlib error
>>
>> Amanda:
>>
>>
>>
>> I agree with Rajat's t2 suggestion … even just a t2.micro will help over
>> a t1.micro … and it's cheaper!
>>
>>
>>
>> And are you running jobs on the master as well as the nodes? If so, you
>> could disable that:
>>
>>
>>
>> [cluster mycluster]
>> DISABLE_QUEUE=True
>> PLUGINS = sge
>>
>> [plugin sge]
>> setup_class = starcluster.plugins.sge.SGEPlugin
>> master_is_exec_host = False
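>>
>> (master_is_exec_host = False takes the master out of SGE's execution host
>> list, so queued jobs only land on the compute nodes.)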
>>
>>
>>
>> That might help with stability a good bit.
>>
>>
>>
>> You can also use spot pricing for the master to get a beefier master for
>> a much lower price … but of course with the risk of losing the whole
>> cluster if you are outbid.
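>>
>> (Something like "starcluster start -b 0.10 --force-spot-master mycluster";
>> the bid price is just an example, and check that your StarCluster version
>> supports --force-spot-master.)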
>>
>>
>>
>> Good luck with the project!
>>
>>
>>
>> -Hugh
>>
>>
>>
>> *From:* starcluster-bounces_at_mit.edu [mailto:starcluster-bounces_at_mit.edu] *On
>> Behalf Of *Rajat Banerjee
>> *Sent:* Tuesday, September 23, 2014 1:11 PM
>> *To:* Amanda Joy Kedaigle
>> *Cc:* starcluster_at_mit.edu
>> *Subject:* Re: [StarCluster] FW: commlib error
>>
>>
>>
>> Hi Amanda,
>>
>> I googled your error and found a few pages that suggest that the SGE
>> service on the master node went down:
>>
>>
>> http://verahill.blogspot.com/2012/08/sun-gridengine-commlib-error-got-select.html
>>
>> https://supcom.hgc.jp/english/utili_info/manual/faq.html
>>
>> http://comments.gmane.org/gmane.comp.clustering.gridengine.users/17283
>>
>> If your OpenBLAS command is killing the process on the master, that could
>> cause your issues, according to those authors. Sorry I don't have anything
>> more helpful, but the t2.small is still less than $0.03 per hour now, so
>> that may not increase your costs too much.
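>>
>> (If it is sge_qmaster dying, something like the following on the master
>> should confirm and restart it; the paths are from memory for the
>> StarCluster AMIs, so adjust as needed.)
>>
>> ps -ef | grep sge_qmaster
>> /opt/sge6/default/common/sgemaster start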
>>
>> Raj
>>
>>
>>
>> On Tue, Sep 23, 2014 at 12:55 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
>> wrote:
>>
>> Thanks, Raj. I can communicate with the master node; it just looks
>> like SGE is failing. I restarted the cluster and everything seemed to be
>> working, but then it failed in the same way again.
>>
>>
>>
>> > starcluster listclusters (should list status of all your active
>> > clusters and running nodes)
>>
>>
>>
>> -----------------------------------------------------
>> fraenkelcluster (security group: _at_sc-fraenkelcluster)
>> -----------------------------------------------------
>> Launch time: 2014-09-23 11:59:43
>> Uptime: 0 days, 00:45:58
>> VPC: vpc-c71f0fa5
>> Subnet: subnet-e6b8c8ce
>> Zone: us-east-1c
>> Keypair: fraenkel-keypair
>> EBS volumes:
>>     vol-5e75ba11 on master:/dev/sdz (status: attached)
>> Cluster nodes:
>>      master running i-acc76242 ec2-54-164-81-80.compute-1.amazonaws.com
>>     node001 running i-5177ddbf ec2-54-164-98-38.compute-1.amazonaws.com
>>     node002 running i-9976c077 ec2-54-164-88-184.compute-1.amazonaws.com
>>     node003 running i-9e76c070 ec2-54-164-38-146.compute-1.amazonaws.com
>>     node004 running i-1776c0f9 ec2-54-86-252-119.compute-1.amazonaws.com
>>     node005 running i-1676c0f8 ec2-54-165-66-3.compute-1.amazonaws.com
>> Total nodes: 6
>>
>>
>>
>> > starcluster sshmaster <your cluster name>
>>
>>
>>
>> works just fine; I am ssh'd into the master as the root user.
>>
>>
>>
>> Some more details: I am wondering if this is because my master node is a
>> t1.micro; either it is an older generation and not updated, or it doesn't
>> have enough memory to run the queue. In my initial tests, running
>> thousands of simple jobs, everything worked fine, and the load balancer
>> added and deleted nodes as expected. However, when running slightly more
>> intensive jobs that use the Python module networkx, the jobs give this
>> error and then SGE dies:
>>
>> OpenBLAS : Your OS does not support AVX instructions. OpenBLAS is using
>> Nehalem kernels as a fallback, which may give poorer performance.
>>
>> Killed
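>>
>> (I suppose I could confirm an out-of-memory kill by running something like
>> "dmesg | grep -i kill" on the master; a t1.micro only has about 600MB of
>> RAM.)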
>>
>>
>>
>> I would really like to have a very cheap master node since I expect to
>> keep it running 24/7, but only use the cluster in bursts.
>>
>>
>>
>> On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
>> wrote:
>>
>> Hi,
>>
>> I am trying to run StarCluster's load balancer to keep only one node
>> running until jobs are submitted to the cluster. I know it's an
>> experimental feature, but I'm wondering if anyone has run into this error
>> before, or has any suggestions. The cluster was whittled down to 1
>> node after a weekend of inactivity, and now it seems that when jobs are
>> submitted to the queue, instead of adding nodes, SGE fails.
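>>
>> (For reference, I'm invoking the balancer with something like
>> "starcluster loadbalance -n 1 fraenkelcluster"; flags from memory, see
>> "starcluster loadbalance --help".)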
>>
>> >>> Loading full job history
>> *** WARNING - Failed to retrieve stats (1/5):
>> Traceback (most recent call last):
>>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats
>>     return self._get_stats()
>>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats
>>     qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))
>>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute
>>     msg, command, exit_status, out_str)
>> RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:
>> error: commlib error: got select error (Connection refused)
>> error: unable to send message to qmaster using port 63231 on host "master": got send error
>>
>> Thanks for any help!
>> Amanda
>>
>>
Received on Tue Sep 23 2014 - 21:14:45 EDT
This archive was generated by hypermail 2.3.0.
