StarCluster - Mailing List Archive

Re: [Starcluster] StarCluster timeout problem

From: Justin Riley <no email>
Date: Tue, 30 Mar 2010 14:54:57 -0400

Hi Nasser,

Have you joined the mailing list yet? I'm getting your messages but have
to approve them through the admin. If you could join, we'd love to have you.

I'm glad you're testing the development version for me; this is very useful.

Concerning the bugs you found:

Bug1: I knew about this bug and had fixed it at home but forgot to
commit. It's committed to github now.

Bug2: Also knew about this bug, fix is committed to github.

If you could pull the latest changes from github and test these changes
out for me that would be really helpful.

Thanks a lot for reporting these bugs and please don't hesitate to let
me know about other issues you find.

~Justin


On 03/30/2010 12:59 PM, Nasser Alansari wrote:
> Hi Justin,
>
> Thanks for the quick response.
>
> I've downloaded the development version and noticed many new
> changes/features compared to version 0.91.
>
> However, I've found 2 bugs.
>
> Bug-1:
> If no volume is specified for a cluster in the config file, starcluster
> crashes with:
>
> TypeError: 'NoneType' object is not iterable
>
>
> Cause:
> After tracking down the problem: "self.VOLUMES" in "cluster.py" has a
> value of None, and two functions (setup_ebs, setup_nfs) in
> "clustersetup.py" try to iterate over it in a for loop.
>
> A Quick-Fix:
> cluster.py (line 154): I've added the following after line 154:
>
>     if not volumes:
>         self.VOLUMES = []
>
>
> I'm sure there is a better way to fix this.
>
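The quick-fix above amounts to normalizing a missing setting to an empty collection before anything iterates over it. A minimal sketch of that pattern follows, assuming a simplified stand-in class: `ClusterSketch` and `volume_names` are hypothetical names for illustration, not StarCluster's actual code.

```python
class ClusterSketch:
    """Hypothetical stand-in for a cluster object; not StarCluster's real class."""

    def __init__(self, volumes=None):
        # Normalize None (no volumes configured) to an empty list once,
        # so every later loop over self.VOLUMES simply runs zero times.
        self.VOLUMES = list(volumes) if volumes else []

    def volume_names(self):
        # If self.VOLUMES were left as None, this loop would raise
        # "TypeError: 'NoneType' object is not iterable".
        return [name for name in self.VOLUMES]
```

Doing the normalization at construction time keeps the check in one place, instead of guarding each loop in the setup functions separately.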
>
>
> Bug-2:
> Start a cluster tagged "bug2", then stop it. Then start another cluster,
> also tagged "bug2", and starcluster will crash.
>
> Cause:
> StarCluster tries to SSH (ssh.py:49) into the terminated instances from
> the first cluster, and those terminated instances have no hostname.
>
> Quick-fix:
> In cluster.py (self.nodes): I've added the following after line 278 to
> filter out the terminated instances:
>
>     if node.state == 'terminated':
>         continue
>
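The filter described above can be sketched on its own, away from the cluster code. This is an illustration only: the `Instance` record and the `active_nodes` helper are hypothetical names, and real EC2 instance objects carry many more fields than the three assumed here.

```python
from collections import namedtuple

# Hypothetical minimal instance record, for illustration only.
Instance = namedtuple("Instance", ["id", "state", "dns_name"])


def active_nodes(instances):
    """Return the instances that have not been terminated.

    Terminated instances from an earlier cluster with the same tag have
    no hostname, so they are skipped before any SSH connection is tried.
    """
    return [node for node in instances if node.state != "terminated"]


# Example: one leftover instance from the first "bug2" cluster,
# one live instance from the second.
instances = [
    Instance("i-old", "terminated", ""),
    Instance("i-new", "running", "node001.example.com"),
]
live = active_nodes(instances)
```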
>
>
> Again, thanks for your great effort.
> Nasser
>
>
> On Tue, Mar 30, 2010 at 5:39 AM, Justin Riley <jtriley_at_mit.edu> wrote:
>
> Hi Nasser,
>
> I've cc'd the starcluster mailing list, hope you don't mind.
>
> BTW, I'd like to invite you to join the starcluster mailing list. It's a
> good place to keep up with things and submit issues
> such as these. You can join the list here:
>
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
> Thanks for reporting this issue. I've made a quick-fix change in the
> development version of the code on github by bumping the timeout to 5
> sec. This still might not help you if the latency is really bad.
>
> My current thinking is to 'throttle' the timeout, raising it the longer
> it takes for the cluster to appear to be up. So, at first it would
> attempt a 5 second timeout, and then incrementally raise it up to 15
> seconds as necessary. After a maximum of 15 seconds and enough retries,
> it would likely just error out.
>
> This is on my list for the next version.
>
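The throttling idea described above can be sketched roughly as follows. This is only one possible reading of it, not the planned implementation: the function name `wait_until_up`, the intermediate 10 second step, and the number of retries at the maximum are all assumptions, and `probe` stands for any callable that takes a timeout in seconds and returns True once the cluster responds.

```python
import time


def wait_until_up(probe, timeouts=(5, 10, 15), retries_at_max=3, pause=0.0):
    """Call probe(timeout) with escalating timeouts until it succeeds.

    The timeouts are tried in order; the last (largest) one is then
    retried `retries_at_max` more times. If every attempt fails, a
    TimeoutError is raised instead of looping forever.
    """
    schedule = list(timeouts) + [timeouts[-1]] * retries_at_max
    for timeout in schedule:
        if probe(timeout):
            return True
        time.sleep(pause)  # optional delay between attempts
    raise TimeoutError("cluster did not come up after %d attempts" % len(schedule))
```

Passing the probe in as a parameter keeps the retry schedule separate from the connection check, so the schedule can be exercised with a fake probe.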
> Thanks for reporting!
>
> ~Justin
>
>
>
>> Problem:
>> I've installed and configured StarCluster correctly. However, when I
>> try to start it with "starcluster -s", everything goes fine until it
>> reaches the line ">>> Waiting for cluster to start...", and that is
>> when it runs forever (infinite loop), even after all the instances are
>> in the "running" state.
>
>> Solution:
>> After debugging, I found that the socket timeout value (0.25) in:
>
>> File: starcluster/ec2utils.py
>> Function: is_ssh_up()
>> Line: s.settimeout(0.25)
>
>> is too small for my connection, due to a latency issue.
>
>> So, as a quick fix, I've commented out that line and everything works fine.
>
>> A bigger value would solve this.
>
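For reference, a port check with a configurable timeout might look like the sketch below. It is not the actual is_ssh_up() from starcluster/ec2utils.py: the function name `is_port_open` and its parameters are hypothetical, and the 5 second default simply mirrors the quick-fix value mentioned earlier in this thread.

```python
import socket


def is_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    A very small timeout (such as 0.25) can fail on a high-latency link
    even when the remote SSH daemon is already accepting connections.
    """
    try:
        # create_connection applies the timeout to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution errors.
        return False
```

Making the timeout a parameter lets callers on slow connections raise it without editing the function body.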
>> Thanks for your great work and keep it up.
>> Nasser
Received on Tue Mar 30 2010 - 14:54:59 EDT