StarCluster - Mailing List Archive

Re: [Starcluster] create_image.py

From: Dan Yamins <no email>
Date: Fri, 8 Jan 2010 15:15:33 -0500

On Fri, Jan 8, 2010 at 1:46 PM, Mark J. Pearrow <mjp_at_mit.edu> wrote:

> Hi Dan,
>
> Thanks for that pointer. I was able to create my own AMI from the
> starcluster x64 base, aptitude upgrade it, upload, register, and instantiate
> it via the instructions you reference. SGE worked properly on it at that
> point. So that seemed quite hopeful.
>
> Once I launched the new instance, and customized it a bit (adding a
> couple of new repositories for apt and installing some applications), then creating
> a new ec2 volume, uploading and registering it,
>

Yes, I'm aware of this problem. I'm sorry I didn't include a warning about it
in my previous email. I have two thoughts about this:

1) If the original bug in create_image.py could be figured out (by Justin)
then perhaps this second bug is caused by the same or similar problem and
could be fixed then as well.

2) On the other hand: I think you're trying to do something that is
basically ALWAYS a no-no: rebundling an already-rebundled AMI. I've
almost never been able to get an Amazon AMI that I've re-re-bundled from a
previous re-bundled AMI to work stably. (It may have worked one of the times
I tried it.) I've had this problem with multiple AMIs, not just
starcluster's AMIs, and not just with SGE.

I've written to the EC2 mailing lists about this several times but have
never received a response. I know it sounds sort of hokey and mysterious
to put it this way, but something degrades when you do a rebundling, and
that degradation seems to get progressively worse as you iteratively
rebundle. Anecdotally, I feel like the problem always has something to do
with the startup procedures, especially SSL / SSH handling, but I just
don't understand enough about either SSH setup or server startup to
pinpoint the problems more specifically. [I really wish I could pin down
an Amazon AWS engineer in person about this problem and force them to go
through a few cases of why this occurs and show me how to fix them.]

Maybe Justin can look into this more closely and fix it ... or maybe someone
at Alestic would understand the problem.

But for now, probably your best option is to build the whole image from
scratch every time you want to add to or modify it. I know that sounds
annoying, and it is ... but I have never found another solution.

Dan

> I logged into it and I'm back in the same boat: system hangs at the ">>>
> Installing Sun Grid Engine..." message. When I looked at the ps listing for
> that instance, I could see that there was a "source /etc/profile && qconf
> -Aq /tmp/pe.txt" running. But at that point, qconf won't work since sge_*
> isn't running.
>
> hummmm.
>
> mjp
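
On the hang quoted above, where qconf is started before any SGE daemon is
up: one possible workaround is to wait for the master daemon before running
that command. The following is only a rough sketch, not what create_image.py
or StarCluster actually does. The function names are made up for
illustration, and it assumes the missing daemon is sge_qmaster, that pgrep
and bash are available on the instance, and that /tmp/pe.txt is the file
shown in the ps listing.

```python
import subprocess
import time


def process_running(name):
    """Return True if pgrep finds a process whose name is exactly `name`."""
    try:
        result = subprocess.run(
            ["pgrep", "-x", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # pgrep is not installed; treat the process as not running.
        return False
    return result.returncode == 0


def wait_until(predicate, timeout=60.0, interval=2.0):
    """Poll predicate() until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def add_queue_when_ready(queue_file="/tmp/pe.txt", timeout=60.0):
    """Run the qconf command from the ps listing once sge_qmaster is up.

    Assumption: sge_qmaster is the daemon qconf needs to talk to.
    """
    if not wait_until(lambda: process_running("sge_qmaster"), timeout):
        raise RuntimeError("sge_qmaster never started; not running qconf")
    cmd = "source /etc/profile && qconf -Aq " + queue_file
    return subprocess.run(["bash", "-c", cmd], check=True)
```

With something like this, a missing daemon shows up as an error after the
timeout instead of an install step that hangs forever. It does not explain
why the daemon fails to start on a re-rebundled image.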
>
>
>
>
> On Jan 8, 2010, at 9:55 AM, Dan Yamins wrote:
>
>
> However, when I followed the directions for rebundling an AMI directly from
> Amazon's AWS site (
> http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/creating-an-image.html)
> I was able to create very stable working AMIs from the starcluster base
> images (both 32-bit and 64-bit). Have you tried these directions? If
> not, they might work better than create_image.py.
>
>
> _______________________________________________
> Starcluster mailing list
> Starcluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
Received on Fri Jan 08 2010 - 15:15:34 EST
