StarCluster - Mailing List Archive

Re: [Starcluster] StarCluster timeout problem

From: Nasser Alansari <no email>
Date: Wed, 31 Mar 2010 03:59:12 +1100

Hi Justin,

Thanks for the quick response.

I've downloaded the development version and noticed many new
changes/features compared to version 0.91.

However, I've found 2 bugs.

Bug-1:
If no volume is specified for a cluster in the config file, StarCluster
crashes with:

    TypeError: 'NoneType' object is not iterable

Cause:
After tracking down the problem: "self.VOLUMES" in "cluster.py" has a
value of None, and two functions (setup_ebs, setup_nfs) in
"clustersetup.py" try to iterate over it with a for loop.

A Quick-Fix:
In cluster.py, I added the following after line 154:

    if not volumes:
        self.VOLUMES = []

I'm sure there is a better way to fix this.
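
One possible alternative, sketched here under assumptions: normalize the value once, where the cluster object is built, so that anything looping over it always gets a list. The `Cluster` class and the `volume_names` helper below are simplified, hypothetical stand-ins and not the actual StarCluster code.

```python
class Cluster(object):
    """Simplified stand-in for the cluster class (hypothetical)."""

    def __init__(self, volumes=None):
        # Treat a missing volumes setting as "no volumes" so that
        # callers can always iterate over self.VOLUMES.
        self.VOLUMES = list(volumes) if volumes else []


def volume_names(cluster):
    """Hypothetical helper that loops over the volumes the way
    setup_ebs/setup_nfs are described as doing."""
    return [vol for vol in cluster.VOLUMES]
```

With this, a cluster configured without volumes yields an empty loop instead of a TypeError.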



Bug-2:
Start a cluster tagged as "bug2", then stop it.
Then start another cluster also tagged as "bug2"; StarCluster will
crash.

Cause:
StarCluster tries to SSH (ssh.py:49) into the terminated instances from the
first cluster, and those terminated instances have no hostname.

Quick-fix:
In cluster.py (self.nodes), I added the following after line 278 to filter
out the terminated instances:

    if node.state == 'terminated':
        continue
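
The same filtering idea as a standalone sketch. `FakeInstance` and `live_instances` are hypothetical names used only for illustration; the only assumption carried over from the fix is that each instance object exposes a `state` attribute.

```python
class FakeInstance(object):
    """Hypothetical stand-in for an EC2 instance object with a `state` attribute."""

    def __init__(self, instance_id, state, dns_name=None):
        self.id = instance_id
        self.state = state
        self.dns_name = dns_name


def live_instances(instances):
    """Return only the instances that have not been terminated."""
    return [inst for inst in instances if inst.state != 'terminated']
```

Filtering before any SSH attempt means leftover instances from an earlier cluster with the same tag are never contacted.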


Again, thanks for your great effort
Nasser


On Tue, Mar 30, 2010 at 5:39 AM, Justin Riley <jtriley_at_mit.edu> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Nasser
>
> I've cc'd the starcluster mailing list, hope you don't mind.
>
> BTW, I'd like to invite you to join the starcluster mailing list. It's a
> good place to keep up with things and submit issues
> such as these. You can join the list here:
>
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
> Thanks for reporting this issue. I've made a quick-fix change in the
> development version of the code on github by bumping the timeout to 5
> sec. This still might not help you if the latency is really bad.
>
> My current thinking on this is to 'throttle' the timeout time the longer
> it takes for the cluster to appear to be up. So, at first it would
> attempt a 5 second timeout, and then incrementally raise it up to 15
> seconds as necessary. After a maximum of 15 seconds and enough retries,
> it would likely just error out.
>
> This is on my list for the next version.
>
> Thanks for reporting!
>
> ~Justin
>
>
>
> > Problem:
> > I've installed & configured StarCluster correctly. However, when I
> > try to start it with "startcluster -s", everything goes fine until it
> > reaches the line ">>> Waiting for cluster to start...", and that's when
> > it runs forever (infinite loop), even after all the instances are in the
> > "running" state.
> >
> > Solution:
> > After debugging, I found out that the socket timeout value (0.25) in:
> >
> > File: starcluster/ec2utils.py
> > Function: is_ssh_up()
> > Line: s.settimeout(0.25)
> >
> > is too small for my connection, due to a latency issue.
> >
> > So, as a quick fix, I commented out that line and everything works fine.
> >
> > A bigger value would solve this.
> >
> > Thanks for your great work and keep it up
> > Nasser
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.14 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAkuw8/YACgkQ4llAkMfDcrmlBwCePfX/zZoQjqlh9dQS7xo4geQm
> wn4AoJHE0/AdvRbAMB4EIz5yvompZsRt
> =kjHp
> -----END PGP SIGNATURE-----
>
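
The throttled timeout described in the quoted message could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the 5-second step, and the number of retries at the maximum are hypothetical and not taken from the StarCluster code; only the 5 and 15 second bounds come from the message.

```python
import socket


def timeout_schedule(start=5.0, maximum=15.0, step=5.0, retries_at_max=3):
    """Yield timeouts that grow from `start` up to `maximum`,
    then repeat `maximum` a fixed number of times."""
    timeout = start
    while timeout < maximum:
        yield timeout
        timeout += step
    for _ in range(retries_at_max):
        yield maximum


def is_port_open(host, port, timeout):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
    sock.close()
    return True


def wait_for_ssh(host, port=22, schedule=None):
    """Try each timeout in turn; return True on the first success,
    or False once the schedule is exhausted."""
    if schedule is None:
        schedule = timeout_schedule()
    for timeout in schedule:
        if is_port_open(host, port, timeout):
            return True
    return False
```

A caller would treat a False result from `wait_for_ssh` as the point to error out rather than loop forever.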
Received on Tue Mar 30 2010 - 12:59:13 EDT
This archive was generated by hypermail 2.3.0.
