StarCluster - Mailing List Archive

Re: starcluster starts but not all nodes added as exec nodes

From: Justin Riley <no email>
Date: Wed, 06 Apr 2011 15:04:10 -0400


Jeff,

That's great news! I'm glad it's working well for you so far. Let me
know if you discover issues during your testing in the next few days.
I've also updated the README with the latest dependencies.

~Justin

On 04/06/2011 02:23 PM, Jeff White wrote:
> Justin,
>
> I don't use the installer. I just untar the package and set my PATH and
> PYTHONPATH appropriately. I didn't notice the updated boto requirement
> (README.rst needs an update).
> After updating boto, it works fantastically. I tested 25 and then 50
> nodes. Only 9 minutes to create a 50-node cluster of small instances.
> And I verified they are all registered as exec nodes.
> So far so good. We'll be giving it more rigorous testing over the next
> few days.
>
> Thanks again for your effort in fixing the issue and making it so much
> faster.
>
> jeff
>
> On Wed, Apr 6, 2011 at 10:49 AM, Justin Riley <jtriley_at_mit.edu> wrote:
>
> Hi Jeff,
>
> Did you reinstall after pulling the latest code using "python setup.py
> install"? If so, what version of boto do you have installed? You can
> check this with:
>
> % python -c 'import boto; print boto.Version'
> 2.0b4
>
> The version should be 2.0b4 as above.
>
> ~Justin
>
> On 04/06/2011 01:04 PM, Jeff White wrote:
>> Hi Justin,
>
>> Thanks much for your effort on this. I got this error upon running
>> 'starcluster -s 25 start jswtest'. I have not altered my config file
>> from the one I sent you previously.
>
>> PID: 5530 config.py:515 - DEBUG - Loading config
>> PID: 5530 config.py:108 - DEBUG - Loading file: /home/jsw/.starcluster/config
>> PID: 5530 config.py:515 - DEBUG - Loading config
>> PID: 5530 config.py:108 - DEBUG - Loading file: /home/jsw/.starcluster/config
>> PID: 5530 awsutils.py:54 - DEBUG - creating self._conn w/ connection_authenticator kwargs = {'path': '/', 'region': None, 'port': None, 'is_secure': True}
>> PID: 5530 start.py:167 - INFO - Using default cluster template: smallcluster
>> PID: 5530 cluster.py:1310 - INFO - Validating cluster template settings...
>> PID: 5530 cli.py:184 - DEBUG - Traceback (most recent call last):
>>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cli.py", line 160, in main
>>     sc.execute(args)
>>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/commands/start.py", line 175, in execute
>>     scluster._validate(validate_running=validate_running)
>>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py", line 1322, in _validate
>>     self._validate_instance_types()
>>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py", line 1458, in _validate_instance_types
>>     self.__check_platform(node_image_id, node_instance_type)
>>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py", line 1419, in __check_platform
>>     image_is_hvm = (image.virtualization_type == "hvm")
>> AttributeError: 'Image' object has no attribute 'virtualization_type'
>
>> PID: 5530 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
>> PID: 5530 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-jsw.log
>> PID: 5530 cli.py:131 - ERROR - Look for lines starting with PID: 5530
>> PID: 5530 cli.py:132 - ERROR - Please submit this file, minus any private information,
>> PID: 5530 cli.py:133 - ERROR - to starcluster_at_mit.edu
>
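[The failing line reads image.virtualization_type directly, so an Image object returned by an older boto, which never sets that attribute, raises AttributeError. The actual fix was simply upgrading to boto 2.0b4 as described above; purely as an illustration, a defensive check that tolerates the missing attribute might look like the following hypothetical helper (not StarCluster's code):

    def image_is_hvm(image):
        # Older boto Image objects may not define virtualization_type at all;
        # getattr sidesteps the AttributeError seen in the traceback above and
        # simply treats such images as non-HVM.
        return getattr(image, "virtualization_type", None) == "hvm"
]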
>
>
>> On Wed, Apr 6, 2011 at 8:09 AM, Justin Riley <justin.t.riley_at_gmail.com> wrote:
>
>> Jeff/Joseph,
>
>> Sorry for taking so long to follow up with this, but I believe I've
>> fixed this issue for good and you should now be able to launch 50+
>> node clusters without issue. My original feeling was that the SGE
>> install script was at fault; however, after several hours of digging I
>> discovered that ssh-keyscan was failing when there were a large number
>> of nodes. Long story short, this meant that passwordless ssh wasn't
>> being set up fully for all nodes, and so the SGE installer script could
>> not connect to those nodes to add them to the queue. I found a much
>> better way to populate the known_hosts file with all the nodes using
>> paramiko instead of ssh-keyscan, which is much faster in this case.
>
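[A minimal sketch of the paramiko approach described above, for reference only; the hostnames and known_hosts path are assumptions, and this is an illustration rather than StarCluster's actual implementation:

    import paramiko

    def fetch_host_key(host, port=22):
        # Open an SSH transport just long enough to read the server's host key.
        transport = paramiko.Transport((host, port))
        try:
            transport.start_client()
            return transport.get_remote_server_key()
        finally:
            transport.close()

    known_hosts = paramiko.HostKeys()
    for host in ["master", "node001", "node002"]:  # assumed node hostnames
        key = fetch_host_key(host)
        known_hosts.add(host, key.get_name(), key)
    known_hosts.save("/root/.ssh/known_hosts")     # assumed known_hosts path

Unlike shelling out to ssh-keyscan, this keeps everything in-process and fails per-host rather than silently dropping nodes.]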
>> If you haven't already, please re-run 'python setup.py install' after
>> pulling the latest code to test out the latest changes. I've also
>> updated StarCluster to perform the setup on all nodes concurrently
>> using a thread pool, so you should notice it's much faster for larger
>> clusters. Please let me know if you have issues.
>
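[To illustrate the thread pool idea, a sketch only; setup_node is a hypothetical stand-in for the real per-node setup work, and StarCluster's own implementation differs:

    from multiprocessing.pool import ThreadPool

    def setup_node(hostname):
        # Stand-in for the real per-node work (ssh in, mount NFS, install SGE, ...).
        return hostname

    nodes = ["node%03d" % i for i in range(1, 26)]
    pool = ThreadPool(processes=10)        # 10 worker threads shared by all nodes
    results = pool.map(setup_node, nodes)  # blocks until every node has been handled
    pool.close()
    pool.join()
]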
>> Thanks,
>
>> ~Justin
>
>> On Wed, Mar 16, 2011 at 1:37 PM, Kyeong Soo (Joseph) Kim
>> <kyeongsoo.kim_at_gmail.com> wrote:
>> > Justin,
>> > Please, find attached the said file.
>> >
>> > Regards,
>> > Joseph
>> >
>> >
>> > On Wed, Mar 16, 2011 at 4:38 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>> Joseph,
>
>> Great, thanks. Can you also send me the /opt/sge6/ec2_sge.conf file, please?
>
>> ~Justin
>
>> On 03/16/2011 12:29 PM, Kyeong Soo (Joseph) Kim wrote:
>>> >>> Hi Justin,
>>> >>>
>>> >>> Please find attached the gzipped tar file of the logfiles under the
>>> >>> install_logs directory.
>>> >>>
>>> >>> Note that the configuration is for a 25-node (1 master and 24 slaves) cluster.
>>> >>>
>>> >>> Below is the time-sorted listing of log files under the same directory:
>>> >>>
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 13:23 execd_install_node024_2011-03-16_13:23:11.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node023_2011-03-16_11:13:37.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node022_2011-03-16_11:13:36.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node021_2011-03-16_11:13:36.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node020_2011-03-16_11:13:32.log
>>> >>> -rw-r--r-- 1 kks kks  18K 2011-03-16 11:13 execd_install_master_2011-03-16_11:13:10.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node017_2011-03-16_11:13:27.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node018_2011-03-16_11:13:27.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node019_2011-03-16_11:13:28.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node016_2011-03-16_11:13:26.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node014_2011-03-16_11:13:25.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node015_2011-03-16_11:13:26.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node012_2011-03-16_11:13:24.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node013_2011-03-16_11:13:25.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node010_2011-03-16_11:13:23.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node011_2011-03-16_11:13:24.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node008_2011-03-16_11:13:22.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node009_2011-03-16_11:13:22.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node006_2011-03-16_11:13:21.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node007_2011-03-16_11:13:21.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node004_2011-03-16_11:13:20.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node005_2011-03-16_11:13:20.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node003_2011-03-16_11:13:19.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node001_2011-03-16_11:13:18.log
>>> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node002_2011-03-16_11:13:19.log
>>> >>> -rw-r--r-- 1 kks kks 3.1K 2011-03-16 11:13 execd_install_master_2011-03-16_11:13:17.log
>>> >>> -rw-r--r-- 1 kks kks 8.4K 2011-03-16 11:13 qmaster_install_master_2011-03-16_11:13:05.log
>>> >>>
>>> >>> As you can see, the installation of the master has been duplicated, and it
>>> >>> ended up with master and node001~node023; the top-most log, for node024,
>>> >>> was for the manual addition through the "addnode" command later (i.e., 1
>>> >>> hour 10 mins after).
>>> >>>
>>> >>> Even with this slimmed-down configuration (compared to the original
>>> >>> 125-node one), the chances that all nodes were properly installed (i.e.,
>>> >>> 25 out of 25) were about 50% (last night and this morning, I tried about
>>> >>> 10 times to set up a total of five 25-node clusters).
>>> >>>
>>> >>> Regards,
>>> >>> Joseph
>>> >>>
>>> >>>
>>> >>> On Wed, Mar 16, 2011 at 3:57 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>>> >>> Hi Jeff/Joseph,
>>> >>>
>>> >>> I just requested to up my EC2 instance limit so that I can test things
>>> >>> out at this scale and see what the issue is. In the meantime, would you
>>> >>> mind sending me any logs found in /opt/sge6/default/common/install_logs
>>> >>> and also the /opt/sge6/ec2_sge.conf for a failed run?
>>> >>>
>>> >>> Also, if this happens again, you could try reinstalling SGE manually,
>>> >>> assuming all the nodes are up:
>>> >>>
>>> >>> $ starcluster sshmaster mycluster
>>> >>> $ cd /opt/sge6
>>> >>> $ ./inst_sge -m -x -auto ./ec2_sge.conf
>>> >>>
>>> >>> ~Justin
>>> >>>
>>> >>> On 03/15/2011 06:30 PM, Kyeong Soo (Joseph) Kim wrote:
>>> >>>>>> Hi Jeff,
>>> >>>>>>
>>> >>>>>> I experienced the same thing with my 50-node configuration (c1.xlarge).
>>> >>>>>> Out of 50 nodes, only 29 were successfully identified by SGE.
>>> >>>>>>
>>> >>>>>> Regards,
>>> >>>>>> Joseph
>>> >>>>>>
>>> >>>>>> On Sat, Mar 5, 2011 at 10:15 PM, Jeff White <jeff_at_decide.com> wrote:
>>> >>>>>>> I can frequently reproduce an issue where 'starcluster start' completes
>>> >>>>>>> without error, but not all nodes are added to the SGE pool, which I verify
>>> >>>>>>> by running 'qconf -sel' on the master. The latest example I have is creating
>>> >>>>>>> a 25-node cluster, where only the first 12 nodes are successfully installed.
>>> >>>>>>> The remaining instances are running and I can ssh to them but they aren't
>>> >>>>>>> running sge_execd. There are only install log files for the first 12 nodes
>>> >>>>>>> in /opt/sge6/default/common/install_logs. I have not found any clues in the
>>> >>>>>>> starcluster debug log or the logs inside master:/opt/sge6/.
>>> >>>>>>>
>>> >>>>>>> I am running starcluster development snapshot 8ef48a3 downloaded on
>>> >>>>>>> 2011-02-15, with the following relevant settings:
>>> >>>>>>>
>>> >>>>>>> NODE_IMAGE_ID=ami-8cf913e5
>>> >>>>>>> NODE_INSTANCE_TYPE = m1.small
>>> >>>>>>>
>>> >>>>>>> I have seen this behavior with the latest 32-bit and 64-bit starcluster
>>> >>>>>>> AMIs. Our workaround is to start a small cluster and progressively add nodes
>>> >>>>>>> one at a time, which is time-consuming.
>>> >>>>>>>
>>> >>>>>>> Has anyone else noticed this and have a better workaround or an idea for a
>>> >>>>>>> fix?
>>> >>>>>>>
>>> >>>>>>> jeff
>>> >>>>>>>
>>> >>>>>>>