StarCluster - Mailing List Archive

Re: FW: commlib error

From: Amanda Joy Kedaigle <no email>
Date: Tue, 23 Sep 2014 22:06:31 +0000

Thanks, Rayson. However, when I got the latest git clone and tried this out, I got the error "Placement groups may not be used with instances of type 't2.micro'".

________________________________
From: Rayson Ho [raysonlogin_at_gmail.com]
Sent: Tuesday, September 23, 2014 3:09 PM
To: Amanda Joy Kedaigle
Cc: MacMullan, Hugh; starcluster_at_mit.edu
Subject: Re: [StarCluster] FW: commlib error

t2 support was added back in July. I believe you need to use the latest git clone, as the code is not in a release yet.

Also, t2 only runs in a VPC, but if your account is new enough, then all your instances run in a default VPC by default. Otherwise, you will need to set up a VPC.

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Tue, Sep 23, 2014 at 2:56 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu> wrote:
Thanks for that plugin suggestion! I tried avoiding the master node through "qsub -l", and so far that seems to be helping!
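The message does not say which resource request was passed to "qsub -l". One common form is a negated hostname request, sketched below. This assumes Grid Engine's "h" shortcut for the "hostname" resource and its "!" negation; the helper name and the script name myjob.sh are hypothetical.

```python
import shlex


def qsub_avoiding_host(script, host="master"):
    """Build a qsub argument list that asks SGE not to run on `host`.

    Assumption: "h" is the shortcut for the "hostname" resource and a
    leading "!" negates the match. Check complex(5) on your cluster.
    """
    return ["qsub", "-l", "h=!" + host, script]


# Print a shell-safe version of the command. The quoting matters because
# "!" is special in interactive bash.
cmd = qsub_avoiding_host("myjob.sh")
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list directly to subprocess.run avoids shell quoting altogether.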

As for using a t2.micro or small, those don't seem to be an option in starcluster? They are not listed as instance options in my config file, and when I previously tried an option outside of those listed, I got an error. If anyone knows a way around that, I'd be interested!

Thanks to both of you
Amanda

________________________________
From: MacMullan, Hugh [hughmac_at_wharton.upenn.edu]
Sent: Tuesday, September 23, 2014 2:50 PM
To: Amanda Joy Kedaigle
Cc: starcluster_at_mit.edu
Subject: RE: [StarCluster] FW: commlib error

Amanda:

I agree with Rajat's t2 suggestion … even just a t2.micro will help over a t1.micro … and it's cheaper!

And are you running jobs on the master as well as the nodes? If so, you could disable that:

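[cluster mycluster]
# Skip StarCluster's built-in SGE setup so the sge plugin below configures the queue
DISABLE_QUEUE = True
PLUGINS = sge

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
# Keep the master out of the execution hosts so jobs run only on the nodes
master_is_exec_host = False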

That might help with stability a good bit.
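As a quick sanity check before restarting the cluster, the plugin setting can be read back with Python's standard configparser. This is a minimal sketch, not part of StarCluster: the function name is hypothetical, the config text is copied from the snippet above, and the fallback of True assumes the plugin treats the master as an execution host when the option is absent.

```python
import configparser

# Config fragment copied from the message above.
CONFIG_TEXT = """
[cluster mycluster]
DISABLE_QUEUE = True
PLUGINS = sge

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
master_is_exec_host = False
"""


def master_is_exec_host(config_text, plugin="sge"):
    """Return the master_is_exec_host flag from a [plugin <name>] section.

    Assumption: when the option is missing, the master is treated as an
    execution host, so the fallback is True.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["plugin " + plugin]
    return section.getboolean("master_is_exec_host", fallback=True)


print("master runs jobs:", master_is_exec_host(CONFIG_TEXT))
```

To check a real config file, read its contents and pass the text in. Values containing "%" may need configparser.ConfigParser(interpolation=None).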

You can also use spot pricing for the master to get a beefier master for a much lower price … but of course with the risk of losing the whole cluster if you are outbid.

Good luck with the project!

-Hugh

From: starcluster-bounces_at_mit.edu On Behalf Of Rajat Banerjee
Sent: Tuesday, September 23, 2014 1:11 PM
To: Amanda Joy Kedaigle
Cc: starcluster_at_mit.edu
Subject: Re: [StarCluster] FW: commlib error

Hi Amanda,
I googled your error and found a few pages that suggest that the SGE service on the master node went down:

http://verahill.blogspot.com/2012/08/sun-gridengine-commlib-error-got-select.html

https://supcom.hgc.jp/english/utili_info/manual/faq.html

http://comments.gmane.org/gmane.comp.clustering.gridengine.users/17283
If your OpenBLAS command is killing the process on the master, that could cause your issues, according to those authors. Sorry I don't have anything more helpful, but the t2.small is still less than $0.03 per hour now, so it may not increase your costs too much.

Raj

On Tue, Sep 23, 2014 at 12:55 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu> wrote:
Thanks, Raj. I can communicate with the master node, it just looks like SGE is failing. I restarted the cluster and everything seemed to be working, but then it just failed in the same way again.

> starcluster listclusters (should list status of all your active clusters and running nodes)


-----------------------------------------------------
fraenkelcluster (security group: @sc-fraenkelcluster)
-----------------------------------------------------
Launch time: 2014-09-23 11:59:43
Uptime: 0 days, 00:45:58
VPC: vpc-c71f0fa5
Subnet: subnet-e6b8c8ce
Zone: us-east-1c
Keypair: fraenkel-keypair
EBS volumes:
    vol-5e75ba11 on master:/dev/sdz (status: attached)
Cluster nodes:
     master running i-acc76242 ec2-54-164-81-80.compute-1.amazonaws.com
    node001 running i-5177ddbf ec2-54-164-98-38.compute-1.amazonaws.com
    node002 running i-9976c077 ec2-54-164-88-184.compute-1.amazonaws.com
    node003 running i-9e76c070 ec2-54-164-38-146.compute-1.amazonaws.com
    node004 running i-1776c0f9 ec2-54-86-252-119.compute-1.amazonaws.com
    node005 running i-1676c0f8 ec2-54-165-66-3.compute-1.amazonaws.com
Total nodes: 6

> starcluster sshmaster <your cluster name>

works just fine; I am ssh'd into the master as the root user.

Some more details: I am wondering if this is because my master node is a t1.micro - either it is an older generation and not updated, or doesn't have enough memory to run the queue? When doing my initial tests, running thousands of simple jobs, it worked fine, and the load balancer added and deleted nodes as expected. However, when running slightly more intensive jobs, including the python module networkx, the jobs give this error and then SGE dies:
OpenBLAS : Your OS does not support AVX instructions. OpenBLAS is using Nehalem kernels as a fallback, which may give poorer performance.
Killed
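A bare "Killed" line is what the shell prints when a process receives SIGKILL, and on a small-memory instance the kernel's out-of-memory killer is one usual suspect. The thread does not confirm that this is what happened here. The sketch below scans kernel log text (for example, saved dmesg output) for OOM kills. The function name is hypothetical and the sample lines are made up in the style of Linux kernel messages; they are not from this cluster.

```python
import re

# Matches kernel lines such as
# "Out of memory: Kill process 2714 (python) score 901 or sacrifice child".
# The exact wording varies between kernel versions ("Kill" vs "Killed").
OOM_LINE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")

# Illustrative sample only, not real output from the cluster in this thread.
SAMPLE_LOG = """\
[1234.567890] python invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[1234.567999] Out of memory: Kill process 2714 (python) score 901 or sacrifice child
[1234.568100] Killed process 2714 (python) total-vm:812345kB, anon-rss:598765kB, file-rss:0kB
"""


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) pairs for OOM kills in log_text."""
    return [(int(pid), name) for pid, name in OOM_LINE.findall(log_text)]


for pid, name in find_oom_kills(SAMPLE_LOG):
    print("OOM killer targeted pid", pid, "running", name)
```

If the master's log shows the job's process or sge_qmaster among the victims, that would point to memory pressure rather than the instance generation.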

I would really like to have a very cheap master node since I expect to keep it running 24/7, but only use the cluster in bursts.

On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu> wrote:
Hi,

I am trying to run starcluster's loadbalancer to keep only one node running until jobs are submitted to the cluster. I know it's an experimental feature, but I'm wondering if anyone has run into this error before, or has any suggestions. The cluster has been whittled down to 1 node after a weekend of inactivity, and now it seems that when jobs are submitted to the queue, instead of adding nodes, SGE fails.

>>> Loading full job history
*** WARNING - Failed to retrieve stats (1/5):
Traceback (most recent call last):
  File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats
    return self._get_stats()
  File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats
    qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))
  File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 63231 on host "master": got send error
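"Connection refused" means nothing was listening on the qmaster port when qhost tried to connect. A small TCP probe can confirm that independently of SGE. This is a minimal sketch: the function name is hypothetical, and the default host and port are taken from the error message above.

```python
import socket


def qmaster_reachable(host="master", port=63231, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False.

    A refused connection, an unresolvable host name, and a timeout are all
    reported as False; this only tests that something is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the master itself (for example, qmaster_reachable("localhost")), a False result would suggest that sge_qmaster is no longer running, which matches the pages Raj links in his reply.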

Thanks for any help!
Amanda

_______________________________________________
StarCluster mailing list
StarCluster_at_mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster


Received on Tue Sep 23 2014 - 18:06:39 EDT
This archive was generated by hypermail 2.3.0.
