StarCluster - Mailing List Archive

Re: Multiple node cluster problems

From: Justin Riley <no email>
Date: Thu, 12 Dec 2013 12:00:13 -0500

Hi Joydeep/Dan,

Is this still an issue for you? I'm not able to reproduce this in
us-east-1 (EC2-classic for me) or us-west-2 (VPC for me). This is a
*major* issue if this change is indeed permanent. What region are you
running into this with?

Thanks for reporting.

~Justin


On Wed, Nov 27, 2013 at 12:29:22PM +0530, Joydeep Sen Sarma wrote:
> Hi Daniel and StarClusterers,
> Qubole uses a fork of StarCluster and our service has been widely affected
> by this problem.
> What happened is that AWS broke their API that returns the user data file
> for nodes launched by starcluster. I believe we can now only get the user
> data file for the first instance in the launch list. As a result - calls
> to get alias (which read the user data file of the node) broke.
> We have filed a case with AWS - but they have been unusually sloppy in
> fixing it. They haven't even acknowledged the problem in their status
> page. Meanwhile - we have been busy coding up workaround hacks.
> If the StarCluster community can independently complain to AWS - that
> might help (perhaps). The workarounds aren't pleasant.
> - Joydeep
>
> On Tue, Nov 26, 2013 at 12:44 AM, Daniel Polhamus <[1]danp_at_metrumrg.com>
> wrote:
>
> Hi all,
> We're seeing issues with using clusters consisting of multiple nodes
> today.  Launch of clusters with >=3 nodes fails, with report of being
> unable to assign aliases to nodes other than "master".  The same problem
> is seen with addnode.  Adding one node is fine, but adding more than one
> gives the alias problem again.  Terminating these clusters fails due to
> the missing alias as well; you have to use the ec2toolkit to shut down
> the offending nodes that were not named.
> I'm on the latest development version, and I've noticed that there's a
> lot of gibberish in the node user data (as viewed through the web
> console) as of today.
> Debug at the end, and thanks for the help.
> Dan
> > starcluster -d start -c testing brokenCluster -s 3
> ... 
> ...
> >>> Waiting for all nodes to be in a 'running' state...
> 2013-11-25 14:13:06,323 cluster.py:734 - DEBUG - existing nodes: {}
> 2013-11-25 14:13:06,323 cluster.py:742 - DEBUG - adding node i-323f504f to self._nodes list
> 2013-11-25 14:13:06,839 cluster.py:742 - DEBUG - adding node i-2c3f5051 to self._nodes list
> 2013-11-25 14:13:07,001 node.py:147 - DEBUG - invalid aliases file in user_data:
> 3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> !!! ERROR - instance i-2c3f5051 has no alias
> 2013-11-25 14:13:07,003 cli.py:301 - DEBUG - instance i-2c3f5051 has no alias
> Traceback (most recent call last):
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 274, in main
>     sc.execute(args)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/commands/start.py", line 220, in execute
>     validate_running=validate_running)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1534, in start
>     return self._start(create=create, create_only=create_only)
>   File "<string>", line 2, in _start
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>     res = func(*arg, **kargs)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1557, in _start
>     self.setup_cluster()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1565, in setup_cluster
>     self.wait_for_cluster()
>   File "<string>", line 2, in wait_for_cluster
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>     res = func(*arg, **kargs)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1350, in wait_for_cluster
>     self.wait_for_running_instances()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1305, in wait_for_running_instances
>     nodes = nodes or self.get_nodes_or_raise()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 754, in get_nodes_or_raise
>     nodes = self.nodes
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 744, in nodes
>     if n.is_master():
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 898, in is_master
>     return self.alias == 'master' or self.alias.endswith("-master")
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 150, in alias
>     "instance %s has no alias" % self.id)
> BaseException: instance i-2c3f5051 has no alias
> --
> Daniel G Polhamus, PhD
> Metrum Research Group, LLC
> _______________________________________________
> StarCluster mailing list
> [3]StarCluster_at_mit.edu
> [4]http://mailman.mit.edu/mailman/listinfo/starcluster
>
> References
>
> Visible links
> 1. mailto:danp_at_metrumrg.com
> 2. http://self.id/
> 3. mailto:StarCluster_at_mit.edu
> 4. http://mailman.mit.edu/mailman/listinfo/starcluster
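
For anyone reading this thread later, here is a rough Python sketch of the failure mode described above: an alias lookup that depends on the node's user data breaks as soon as that data comes back empty or unreadable. This is not StarCluster's code. The function name, the exception class, the 'alias' tag fallback, and the idea that aliases are stored one per node and picked by launch index are all assumptions made for illustration; only the error message is taken from the log above.

```python
class MissingAliasError(Exception):
    """Raised when no alias can be determined for an instance (hypothetical)."""


def resolve_alias(instance_id, tags, alias_lines, launch_index):
    """Return an alias for one instance, or raise MissingAliasError.

    instance_id  -- EC2 instance id, used only for the error message
    tags         -- dict of instance tags (may already hold an 'alias')
    alias_lines  -- aliases parsed from user data, one per launched node;
                    empty if the user data could not be read (assumed layout)
    launch_index -- position of this instance in the launch request
    """
    # Prefer a previously stored tag: it does not depend on user data.
    alias = tags.get("alias")
    if alias:
        return alias
    # Otherwise fall back to the alias list carried in user data.
    if 0 <= launch_index < len(alias_lines) and alias_lines[launch_index]:
        return alias_lines[launch_index]
    raise MissingAliasError("instance %s has no alias" % instance_id)


if __name__ == "__main__":
    aliases = ["master", "node001", "node002"]

    # Healthy case: user data is readable for every node.
    print(resolve_alias("i-00000002", {}, aliases, 1))

    # User data unreadable, but a tag was stored earlier.
    print(resolve_alias("i-00000003", {"alias": "node002"}, [], 2))

    # User data unreadable and no tag: the lookup fails.
    try:
        resolve_alias("i-00000002", {}, [], 1)
    except MissingAliasError as exc:
        print("ERROR - %s" % exc)
```

The point of the sketch is only that a lookup with a single source of truth (user data) has no way to recover once that source is unavailable for nodes after the first one; whether a tag-based fallback is practical for a given cluster is a separate question.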

Received on Thu Dec 12 2013 - 12:00:19 EST
This archive was generated by hypermail 2.3.0.