StarCluster - Mailing List Archive

Re: FW: commlib error

From: Rayson Ho <no email>
Date: Tue, 23 Sep 2014 15:09:15 -0400

t2 support was added back in July. I believe you need to use the latest git
clone, as the code is not in a release yet.

Also, t2 instances only run in a VPC, but if your account is new enough, all
your instances run in a default VPC by default. Otherwise, you will need
to set up a VPC.

Rayson
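
To make the VPC point concrete, here is a minimal Python sketch that scans a StarCluster-style config for cluster sections requesting a t2 instance type without naming a subnet. The function name is hypothetical, and the option names `NODE_INSTANCE_TYPE`, `MASTER_INSTANCE_TYPE`, and `SUBNET_ID` are assumptions about the config file; check them against your own config. A flagged cluster is not necessarily broken, since accounts with a default VPC need no explicit subnet.

```python
import configparser


def t2_clusters_without_subnet(config_text):
    """Return names of [cluster ...] sections that ask for a t2 type but set no subnet.

    Assumption: the option names below match the StarCluster config file.
    configparser lower-cases option names, so lookups use lower case.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    flagged = []
    for section in parser.sections():
        if not section.startswith("cluster "):
            continue
        opts = parser[section]
        types = [
            opts.get("node_instance_type", ""),
            opts.get("master_instance_type", ""),
        ]
        uses_t2 = any(t.strip().lower().startswith("t2.") for t in types)
        has_subnet = bool(opts.get("subnet_id", "").strip())
        if uses_t2 and not has_subnet:
            # Only a hint: a default VPC makes an explicit subnet unnecessary.
            flagged.append(section[len("cluster "):])
    return flagged
```

Running it on your config text gives a list of cluster names worth a second look before launching.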

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Tue, Sep 23, 2014 at 2:56 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
wrote:

> Thanks for that plugin suggestion! I tried avoiding the master node
> through "qsub -l", and so far that seems to be helping!
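
For anyone wondering what the `qsub -l` approach can look like, here is a small Python sketch that builds such a command line without running it. The message above does not say which resource request was used; the `hostname=!master` form is an assumption based on Grid Engine's host-name matching in resource requests, and both helper names are hypothetical.

```python
import shlex


def qsub_avoiding_host(script, host="master"):
    """Build a qsub argv that requests any execution host other than `host`.

    Assumption: the scheduler accepts a negated hostname resource request.
    """
    return ["qsub", "-l", "hostname=!{}".format(host), script]


def as_shell(argv):
    """Render the argv as a shell-safe string; the '!' needs quoting in most shells."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(as_shell(qsub_avoiding_host("job.sh")))
```

Building the argument list rather than a raw string keeps the `!` from being mangled by shell history expansion if the command is later run through `subprocess`.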
>
> As for using a t2.micro or small, those don't seem to be an option in
> starcluster? They are not listed as instance options in my config file, and
> when I previously tried an option outside of those listed, I got an error.
> If anyone knows a way around that, I'd be interested!
>
> Thanks to both of you
> Amanda
>
> ------------------------------
> *From:* MacMullan, Hugh [hughmac_at_wharton.upenn.edu]
> *Sent:* Tuesday, September 23, 2014 2:50 PM
> *To:* Amanda Joy Kedaigle
> *Cc:* starcluster_at_mit.edu
> *Subject:* RE: [StarCluster] FW: commlib error
>
> Amanda:
>
>
>
> I agree with Rajat's t2 suggestion … even just a t2.micro will help over a
> t1.micro … and it's cheaper!
>
>
>
> And are you running jobs on the master as well as the nodes? If so, you
> could disable that:
>
>
>
> [cluster mycluster]
> DISABLE_QUEUE = True
> PLUGINS = sge
>
> [plugin sge]
> setup_class = starcluster.plugins.sge.SGEPlugin
> master_is_exec_host = False
>
>
>
> That might help with stability a good bit.
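
As a quick sanity check on a config like the one Hugh suggests, the following Python sketch parses the text and reports whether the master would still execute jobs. The function name is hypothetical, and treating a missing `master_is_exec_host` as "master runs jobs" is an assumption about the default behaviour, not something stated in this thread.

```python
import configparser

# The settings from the message above, as one config string.
SUGGESTED_CONFIG = """
[cluster mycluster]
DISABLE_QUEUE = True
PLUGINS = sge

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
master_is_exec_host = False
"""


def master_runs_jobs(config_text, cluster="mycluster"):
    """Return False if a plugin listed for the cluster sets master_is_exec_host = False."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    plugin_list = parser["cluster " + cluster].get("plugins", "")
    for name in (p.strip() for p in plugin_list.split(",")):
        section = "plugin " + name
        if name and parser.has_section(section) and parser.has_option(
            section, "master_is_exec_host"
        ):
            return parser.getboolean(section, "master_is_exec_host")
    # Assumption: without the setting, the master is also an execution host.
    return True
```

Calling `master_runs_jobs(SUGGESTED_CONFIG)` on the settings above returns `False`.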
>
>
>
> You can also use spot pricing for the master to get a beefier master for a
> much lower price … but of course with the risk of losing the whole cluster
> if you are outbid.
>
>
>
> Good luck with the project!
>
>
>
> -Hugh
>
>
>
> *From:* starcluster-bounces_at_mit.edu [mailto:starcluster-bounces_at_mit.edu] *On
> Behalf Of *Rajat Banerjee
> *Sent:* Tuesday, September 23, 2014 1:11 PM
> *To:* Amanda Joy Kedaigle
> *Cc:* starcluster_at_mit.edu
> *Subject:* Re: [StarCluster] FW: commlib error
>
>
>
> Hi Amanda,
>
> I googled your error and found a few pages that suggest that the SGE service
> on the master node went down:
>
>
> http://verahill.blogspot.com/2012/08/sun-gridengine-commlib-error-got-select.html
>
> https://supcom.hgc.jp/english/utili_info/manual/faq.html
>
> http://comments.gmane.org/gmane.comp.clustering.gridengine.users/17283
>
> If your OpenBLAS command is killing the process on master that could cause
> your issues according to those authors. Sorry I don't have anything more
> helpful, but the t2.small is still less than $.03 per hour now. That may
> not increase your costs too much.
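
To put that hourly figure in monthly terms for a master that stays up 24/7, here is a small worked calculation in Python. The $0.03 value is the upper bound quoted above, not a current price, and the 30-day month and function name are assumptions for illustration.

```python
HOURS_PER_DAY = 24


def monthly_cost(hourly_rate_usd, days=30):
    """Cost in USD of keeping one instance running around the clock for `days` days."""
    return hourly_rate_usd * HOURS_PER_DAY * days


if __name__ == "__main__":
    # 0.03 USD/hour * 24 hours * 30 days
    print("{:.2f}".format(monthly_cost(0.03)))
```

At the quoted bound that comes to about $21.60 per month for the master alone.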
>
> Raj
>
>
>
> On Tue, Sep 23, 2014 at 12:55 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
> wrote:
>
> Thanks, Raj. I can communicate with the master node; it just looks
> like SGE is failing. I restarted the cluster and everything seemed to be
> working, but then it failed in the same way again.
>
>
>
> > starcluster listclusters (should list status of all your active clusters
> and running nodes)
>
>
>
> -----------------------------------------------------
> fraenkelcluster (security group: @sc-fraenkelcluster)
> -----------------------------------------------------
> Launch time: 2014-09-23 11:59:43
> Uptime: 0 days, 00:45:58
> VPC: vpc-c71f0fa5
> Subnet: subnet-e6b8c8ce
> Zone: us-east-1c
> Keypair: fraenkel-keypair
> EBS volumes:
>     vol-5e75ba11 on master:/dev/sdz (status: attached)
> Cluster nodes:
>     master running i-acc76242 ec2-54-164-81-80.compute-1.amazonaws.com
>     node001 running i-5177ddbf ec2-54-164-98-38.compute-1.amazonaws.com
>     node002 running i-9976c077 ec2-54-164-88-184.compute-1.amazonaws.com
>     node003 running i-9e76c070 ec2-54-164-38-146.compute-1.amazonaws.com
>     node004 running i-1776c0f9 ec2-54-86-252-119.compute-1.amazonaws.com
>     node005 running i-1676c0f8 ec2-54-165-66-3.compute-1.amazonaws.com
> Total nodes: 6
>
>
>
> > starcluster sshmaster <your cluster name>
>
>
>
> works just fine; I am ssh'd into master as the root user.
>
>
>
> Some more details: I am wondering if this is because my master node is a
> t1.micro - either it is an older generation and not updated, or doesn't
> have enough memory to run the queue? When doing my initial tests, running
> thousands of simple jobs, it worked fine, and the load balancer added and
> deleted nodes as expected. However, when running slightly more intensive
> jobs, including the python module networkx, the jobs give this error and
> then SGE dies:
>
> OpenBLAS : Your OS does not support AVX instructions. OpenBLAS is using
> Nehalem kernels as a fallback, which may give poorer performance.
>
> Killed
>
>
>
> I would really like to have a very cheap master node since I expect to
> keep it running 24/7, but only use the cluster in bursts.
>
>
>
> On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
> wrote:
>
> Hi,
>
> I am trying to run starcluster's loadbalancer to keep only one node
> running until jobs are submitted to the cluster. I know it's an
> experimental feature, but I'm wondering if anyone has run into this error
> before, or has any suggestions. The cluster has been whittled down to 1
> node after a weekend of inactivity, and now it seems that when jobs are
> submitted to the queue, instead of adding nodes, SGE fails.
>
> >>> Loading full job history
> *** WARNING - Failed to retrieve stats (1/5):
> Traceback (most recent call last):
>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats
>     return self._get_stats()
>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats
>     qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))
>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute
>     msg, command, exit_status, out_str)
> RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:
> error: commlib error: got select error (Connection refused)
> error: unable to send message to qmaster using port 63231 on host "master": got send error
>
> Thanks for any help!
> Amanda
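
The "Connection refused" in that traceback means nothing was accepting connections on the qmaster port when `qhost` ran. A rough way to probe for that from Python is sketched below; the function name is hypothetical, the host name `master` and port 63231 are taken from the error message, and a successful TCP connect only shows that something is listening, not that SGE is healthy.

```python
import socket


def qmaster_reachable(host="master", port=63231, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("qmaster reachable:", qmaster_reachable())
```

Run on the master or a node, a `False` result is consistent with the qmaster service being down, which matches the diagnosis discussed earlier in the thread.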
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Tue Sep 23 2014 - 15:09:18 EDT
This archive was generated by hypermail 2.3.0.
