StarCluster - Mailing List Archive

Re: FW: commlib error

From: Rajat Banerjee <no email>
Date: Tue, 23 Sep 2014 13:11:01 -0400

Hi Amanda,
I googled your error and found a few pages suggesting that the SGE service on
the master node went down:

http://verahill.blogspot.com/2012/08/sun-gridengine-commlib-error-got-select.html

https://supcom.hgc.jp/english/utili_info/manual/faq.html

http://comments.gmane.org/gmane.comp.clustering.gridengine.users/17283
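One quick way to test the "service went down" theory is to check whether anything is still listening on the qmaster port. The snippet below is only a minimal sketch, assuming Python 3 and assuming it runs on a machine that can resolve the cluster's hostnames (for example, a compute node). The function name `qmaster_reachable` is made up for illustration and is not part of StarCluster or SGE.

```python
import socket


def qmaster_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A False result (for example "Connection refused") suggests that no
    process is listening on that port, which is consistent with the
    qmaster daemon having stopped. It does not prove why it stopped.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts and DNS failures.
        return False
```

With the values from the error message in this thread, the call would be `qmaster_reachable("master", 63231)`. The port number may differ on another cluster.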

If your OpenBLAS command is killing the SGE process on the master, that could
cause your issues, according to those authors. Sorry I don't have anything more
helpful, but a t2.small is still less than $0.03 per hour now, so switching
may not increase your costs too much.
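The bare "Killed" line in the job output is consistent with the Linux out-of-memory (OOM) killer, although nothing in this thread confirms that. As a rough sketch, the helper below scans kernel log lines for OOM kill records. The function name `find_oom_kills` is hypothetical, and the exact kernel log wording varies between kernel versions, so the pattern is an assumption.

```python
import re

# Assumed pattern: many kernels log "Killed process <pid> (<name>) ..."
# when the OOM killer terminates a process. Wording may differ by version.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) for OOM kills found in log_lines."""
    kills = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

On the master, the input could be the output of `dmesg` split into lines. If `sge_qmaster` or your Python jobs show up in the result, the instance is probably short on memory.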

Raj

On Tue, Sep 23, 2014 at 12:55 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
wrote:

> Thanks, Raj. I can communicate with the master node, it just looks like
> SGE is failing. I restarted the cluster and everything seemed to be
> working, but then it just failed in the same way again.
>
> > starcluster listclusters (should list status of all your active clusters and running nodes)
>
> -----------------------------------------------------
> fraenkelcluster (security group: @sc-fraenkelcluster)
> -----------------------------------------------------
> Launch time: 2014-09-23 11:59:43
> Uptime: 0 days, 00:45:58
> VPC: vpc-c71f0fa5
> Subnet: subnet-e6b8c8ce
> Zone: us-east-1c
> Keypair: fraenkel-keypair
> EBS volumes:
>     vol-5e75ba11 on master:/dev/sdz (status: attached)
> Cluster nodes:
>     master running i-acc76242 ec2-54-164-81-80.compute-1.amazonaws.com
>     node001 running i-5177ddbf ec2-54-164-98-38.compute-1.amazonaws.com
>     node002 running i-9976c077 ec2-54-164-88-184.compute-1.amazonaws.com
>     node003 running i-9e76c070 ec2-54-164-38-146.compute-1.amazonaws.com
>     node004 running i-1776c0f9 ec2-54-86-252-119.compute-1.amazonaws.com
>     node005 running i-1676c0f8 ec2-54-165-66-3.compute-1.amazonaws.com
> Total nodes: 6
>
> > starcluster sshmaster <your cluster name>
>
> works just fine; I am ssh'd into master as the root user.
>
> Some more details: I am wondering if this is because my master node is a
> t1.micro. Either it is an older generation and not updated, or it doesn't
> have enough memory to run the queue. When doing my initial tests, running
> thousands of simple jobs, it worked fine, and the load balancer added and
> deleted nodes as expected. However, when running slightly more intensive
> jobs that use the Python module networkx, the jobs give this error and
> then SGE dies:
> OpenBLAS : Your OS does not support AVX instructions. OpenBLAS is using
> Nehalem kernels as a fallback, which may give poorer performance.
> Killed
>
> I would really like to have a very cheap master node since I expect to
> keep it running 24/7, but only use the cluster in bursts.
>
> On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
> wrote:
>
>> Hi,
>>
>> I am trying to run starcluster's loadbalancer to keep only one node
>> running until jobs are submitted to the cluster. I know it's an
>> experimental feature, but I'm wondering if anyone has run into this error
>> before, or has any suggestions. The cluster has been whittled down to 1
>> node after a weekend of inactivity, and now it seems that when jobs are
>> submitted to the queue, instead of adding nodes, SGE fails.
>>
>> >>> Loading full job history
>> *** WARNING - Failed to retrieve stats (1/5):
>> Traceback (most recent call last):
>>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats
>>     return self._get_stats()
>>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats
>>     qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))
>>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute
>>     msg, command, exit_status, out_str)
>> RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:
>> error: commlib error: got select error (Connection refused)
>> error: unable to send message to qmaster using port 63231 on host "master": got send error
>>
>> Thanks for any help!
>> Amanda
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
Received on Tue Sep 23 2014 - 13:11:23 EDT
This archive was generated by hypermail 2.3.0.
