StarCluster - Mailing List Archive

Bug!

From: Sergio Mafra <no email>
Date: Fri, 9 Aug 2013 09:42:13 -0300

Hi fellows,

Here's the situation:

I was running a 15-node cluster on spot instances for some calculation
work. It seems my bid price was too low for the market, and I lost all
the slave nodes during the night (a quick way to sanity-check the going
rate is shown below).
No problem, since no jobs were running at the time.
Today I tried to add the lost nodes back to the cluster... and boom, here
comes the bug.
After a full restart, the problem went away.
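(For reference, StarCluster itself can print recent spot pricing before
you pick a bid. The instance type below is only an example; it may not
match what decomp actually uses:)

$ starcluster spothistory m1.large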

Can anyone tell me what happened?

I'm a little worried, since this is happening a lot.

All the best,

Sergio

---
ubuntu_at_domU-12-31-39-15-11-FA:~$ starcluster addnode -n 14 decomp
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu
>>> Launching node(s): node001, node002, node003, node004, node005,
node006, node007, node008, node009, node010, node011, node012, node013,
node014
SpotInstanceRequest:sir-db61cc34
SpotInstanceRequest:sir-25ced835
SpotInstanceRequest:sir-83cb3634
SpotInstanceRequest:sir-f3012e34
SpotInstanceRequest:sir-2ab42232
SpotInstanceRequest:sir-ebcf5434
SpotInstanceRequest:sir-743a2e35
SpotInstanceRequest:sir-f461de34
SpotInstanceRequest:sir-78e95e32
SpotInstanceRequest:sir-0f44ee35
SpotInstanceRequest:sir-91364835
SpotInstanceRequest:sir-7dc6b635
SpotInstanceRequest:sir-ae0ac635
SpotInstanceRequest:sir-46e40a35
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for open spot requests to become active...
14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for all nodes to be in a 'running' state...
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 9.552 mins
>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
>>> Configuring hostnames...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring NFS exports path(s):
/home
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - error occurred in job (id=node001): remote command 'source /etc/profile && mount /home' failed with status 32:
mount: master:/home failed, reason given by server: Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 31, in run
    job.run()
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 58, in run
    r = self.method(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 689, in mount_nfs_shares
    self.ssh.execute('mount %s' % path)
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 538, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && mount /home' failed with status 32:
mount: master:/home failed, reason given by server: Permission denied
!!! ERROR - Oops! Looks like you've found a bug in StarCluster
!!! ERROR - Crash report written to: /home/ubuntu/.starcluster/logs/crash-report-25643.txt
!!! ERROR - Please remove any sensitive data from the crash report
!!! ERROR - and submit it to starcluster_at_mit.edu
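(For anyone debugging the same thing before resorting to a restart: my
guess is the master's NFS exports didn't cover the re-added nodes. The
commands below are a hypothetical check, not something I actually ran;
node001 and /home are taken from the log above:)

master# cat /etc/exports        # is node001 listed for /home?
master# exportfs -ra            # re-export in case an entry was missing or stale
node001# showmount -e master    # what does the server actually export?
node001# mount /home            # retry the mount by hand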
*
* Moved to a full restart and everything ran fine
*
ubuntu_at_domU-12-31-39-15-11-FA:~$ starcluster restart decomp
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
>>> Rebooting cluster...
>>> Sleeping for 20 seconds...
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 3.998 mins
>>> The master node is ec2-23-23-241-73.compute-1.amazonaws.com
>>> Configuring cluster...
>>> Volume vol-fb714ca1 already attached to master...skipping
>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
>>> Configuring hostnames...
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Mounting EBS volume vol-fb714ca1 on /home...
>>> Creating cluster user: sgeadmin (uid: 1001, gid: 1001)
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user(s): sgeadmin
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Starting NFS server on master
>>> Configuring NFS exports path(s):
/home
>>> Mounting all NFS export path(s) on 14 worker node(s)
14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.083 mins
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for sgeadmin
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Configuring SGE...
>>> Configuring NFS exports path(s):
/opt/sge6
>>> Mounting all NFS export path(s) on 14 worker node(s)
14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.018 mins
>>> Removing previous SGE installation...
>>> Installing Sun Grid Engine...
14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating SGE parallel environment 'orte'
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Adding parallel environment 'orte' to queue 'all.q'
>>> Configuring cluster took 0.779 mins
>>> Restarting cluster took 5.162 mins
Received on Fri Aug 09 2013 - 08:45:36 EDT