StarCluster - Mailing List Archive

Re: starcluster starts but not all nodes added as exec nodes

From: Jeff White <no email>
Date: Wed, 6 Apr 2011 11:23:25 -0700

Justin,

I don't use the installer. I just untar the package and set my PATH and
PYTHONPATH appropriately. I didn't notice the updated boto requirement
(README.rst needs an update).
After updating boto, it works fantastically. I tested 25 and then 50 nodes;
it took only 9 minutes to create a 50-node cluster of small instances, and I
verified that they are all registered as exec nodes.
So far so good. We'll be giving it more rigorous testing over the next few
days.

Thanks again for your effort in fixing the issue and making it so much
faster.

jeff

On Wed, Apr 6, 2011 at 10:49 AM, Justin Riley <jtriley_at_mit.edu> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Jeff,
>
> Did you reinstall after pulling the latest code using "python setup.py
> install"? If so, what version of boto do you have installed? You can
> check this with:
>
> % python -c 'import boto; print boto.Version'
> 2.0b4
>
> The version should be 2.0b4 as above.
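For scripted setups, the same check can be wrapped in a small guard. This is only a sketch (the strict equality test is a simplification; anything newer than 2.0b4 would presumably also work):

    import boto

    REQUIRED = "2.0b4"  # the version Justin asks for above
    if boto.Version != REQUIRED:
        raise RuntimeError("expected boto %s, found %s" % (REQUIRED, boto.Version))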
>
> ~Justin
>
> On 04/06/2011 01:04 PM, Jeff White wrote:
> > Hi Justin,
> >
> > Thanks much for your effort on this. I got this error upon running
> > 'starcluster -s 25 start jswtest'. I have not altered my config file
> > from the one I sent you previously.
> >
> > PID: 5530 config.py:515 - DEBUG - Loading config
> > PID: 5530 config.py:108 - DEBUG - Loading file:
> > /home/jsw/.starcluster/config
> > PID: 5530 config.py:515 - DEBUG - Loading config
> > PID: 5530 config.py:108 - DEBUG - Loading file:
> > /home/jsw/.starcluster/config
> > PID: 5530 awsutils.py:54 - DEBUG - creating self._conn w/
> > connection_authenticator kwargs = {'path': '/', 'region': None, 'port':
> > None, 'is_secure': True}
> > PID: 5530 start.py:167 - INFO - Using default cluster template: smallcluster
> > PID: 5530 cluster.py:1310 - INFO - Validating cluster template settings...
> > PID: 5530 cli.py:184 - DEBUG - Traceback (most recent call last):
> > File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cli.py", line
> > 160, in main
> > sc.execute(args)
> > File
> > "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/commands/start.py",
> > line 175, in execute
> > scluster._validate(validate_running=validate_running)
> > File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py",
> > line 1322, in _validate
> > self._validate_instance_types()
> > File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py",
> > line 1458, in _validate_instance_types
> > self.__check_platform(node_image_id, node_instance_type)
> > File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py",
> > line 1419, in __check_platform
> > image_is_hvm = (image.virtualization_type == "hvm")
> > AttributeError: 'Image' object has no attribute 'virtualization_type'
> >
> > PID: 5530 cli.py:129 - ERROR - Oops! Looks like you've found a bug in
> > StarCluster
> > PID: 5530 cli.py:130 - ERROR - Debug file written to:
> > /tmp/starcluster-debug-jsw.log
> > PID: 5530 cli.py:131 - ERROR - Look for lines starting with PID: 5530
> > PID: 5530 cli.py:132 - ERROR - Please submit this file, minus any
> > private information,
> > PID: 5530 cli.py:133 - ERROR - to starcluster_at_mit.edu
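The AttributeError above comes from cluster.py assuming the boto Image object exposes virtualization_type, which only newer boto releases (such as the 2.0b4 mentioned above) provide. Purely as an illustration, and not the fix that was actually applied (upgrading boto resolved it), a defensive variant of that check could read:

    # Hypothetical defensive version of the line in __check_platform;
    # getattr() tolerates boto Image objects without virtualization_type.
    image_is_hvm = getattr(image, "virtualization_type", None) == "hvm"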
> >
> >
> >
> > On Wed, Apr 6, 2011 at 8:09 AM, Justin Riley <justin.t.riley_at_gmail.com> wrote:
> >
> > Jeff/Joseph,
> >
> > Sorry for taking so long to follow up with this, but I believe I've
> > fixed this issue for good, and you should now be able to launch 50+
> > node clusters without issue. My original feeling was that the SGE
> > install script was at fault; however, after several hours of digging I
> > discovered that ssh-keyscan was failing when there were a large number
> > of nodes. Long story short, this meant that passwordless SSH wasn't
> > being set up fully for all nodes, so the SGE installer script could
> > not connect to those nodes to add them to the queue. I found a much
> > better way to populate the known_hosts file with all the nodes using
> > paramiko instead of ssh-keyscan, which is much faster in this case.
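As a rough illustration of the approach Justin describes (not StarCluster's actual code), paramiko can fetch each node's host key directly and write out a known_hosts file; the hostnames and output path below are placeholders:

    import paramiko

    def collect_host_keys(hostnames, port=22):
        keys = paramiko.HostKeys()
        for host in hostnames:
            transport = paramiko.Transport((host, port))
            try:
                transport.start_client()                 # SSH key exchange only
                key = transport.get_remote_server_key()  # node's public host key
                keys.add(host, key.get_name(), key)
            finally:
                transport.close()
        return keys

    # e.g. collect_host_keys(["node001", "node002"]).save("/root/.ssh/known_hosts")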
> >
> > If you haven't already, please re-run 'python setup.py install' after
> > pulling the latest code to test out the latest changes. I've also
> > updated StarCluster to perform the setup on all nodes concurrently
> > using a thread pool, so you should notice it's much faster for larger
> > clusters. Please let me know if you have issues.
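A minimal sketch of the thread-pool idea, assuming a stub stands in for the real per-node work (the names here are illustrative, not StarCluster's implementation):

    from multiprocessing.pool import ThreadPool

    def setup_node(node):
        # stand-in for the real per-node work (host keys, NFS, SGE execd, ...)
        pass

    def setup_all_nodes(nodes, workers=20):
        pool = ThreadPool(processes=workers)
        try:
            pool.map(setup_node, nodes)  # run the setup on all nodes concurrently
        finally:
            pool.close()
            pool.join()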
> >
> > Thanks,
> >
> > ~Justin
> >
> > On Wed, Mar 16, 2011 at 1:37 PM, Kyeong Soo (Joseph) Kim
> > <kyeongsoo.kim_at_gmail.com> wrote:
> > > Justin,
> > > Please, find attached the said file.
> > >
> > > Regards,
> > > Joseph
> > >
> > >
> > > On Wed, Mar 16, 2011 at 4:38 PM, Justin Riley <jtriley_at_mit.edu> wrote:
> > Joseph,
> >
> > Great, thanks! Can you also send me the /opt/sge6/ec2_sge.conf
> > file, please?
> >
> > ~Justin
> >
> > On 03/16/2011 12:29 PM, Kyeong Soo (Joseph) Kim wrote:
> >> >>> Hi Justin,
> >> >>>
> >> >>> Please find attached the gzipped tar file of the log files under
> >> >>> the install_logs directory.
> >> >>>
> >> >>> Note that the configuration is for a 25-node (1 master and 24
> >> >>> slaves) cluster.
> >> >>>
> >> >>> Below is the time-sorted listing of log files under the same
> >> >>> directory:
> >> >>>
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 13:23
> >> >>> execd_install_node024_2011-03-16_13:23:11.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node023_2011-03-16_11:13:37.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node022_2011-03-16_11:13:36.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node021_2011-03-16_11:13:36.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node020_2011-03-16_11:13:32.log
> >> >>> -rw-r--r-- 1 kks kks 18K 2011-03-16 11:13
> >> >>> execd_install_master_2011-03-16_11:13:10.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node017_2011-03-16_11:13:27.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node018_2011-03-16_11:13:27.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node019_2011-03-16_11:13:28.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node016_2011-03-16_11:13:26.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node014_2011-03-16_11:13:25.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node015_2011-03-16_11:13:26.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node012_2011-03-16_11:13:24.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node013_2011-03-16_11:13:25.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node010_2011-03-16_11:13:23.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node011_2011-03-16_11:13:24.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node008_2011-03-16_11:13:22.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node009_2011-03-16_11:13:22.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node006_2011-03-16_11:13:21.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node007_2011-03-16_11:13:21.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node004_2011-03-16_11:13:20.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node005_2011-03-16_11:13:20.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node003_2011-03-16_11:13:19.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node001_2011-03-16_11:13:18.log
> >> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
> >> >>> execd_install_node002_2011-03-16_11:13:19.log
> >> >>> -rw-r--r-- 1 kks kks 3.1K 2011-03-16 11:13
> >> >>> execd_install_master_2011-03-16_11:13:17.log
> >> >>> -rw-r--r-- 1 kks kks 8.4K 2011-03-16 11:13
> >> >>> qmaster_install_master_2011-03-16_11:13:05.log
> >> >>>
> >> >>> As you can see, the installation of the master was duplicated, and it
> >> >>> ended up with master and node001~node023; the top-most log, for
> >> >>> node024, was for the manual addition through the "addnode" command
> >> >>> later (i.e., 1 hour 10 mins after).
> >> >>>
> >> >>> Even with this slimmed-down configuration (compared to the original
> >> >>> 125-node one), the chance that all nodes were properly installed
> >> >>> (i.e., 25 out of 25) was about 50% (last night and this morning, I
> >> >>> tried about 10 times to set up a total of five 25-node clusters).
> >> >>>
> >> >>> Regards,
> >> >>> Joseph
> >> >>>
> >> >>>
> >> >>> On Wed, Mar 16, 2011 at 3:57 PM, Justin Riley <jtriley_at_mit.edu> wrote:
> >> >>> Hi Jeff/Joseph,
> >> >>>
> >> >>> I just requested to up my EC2 instance limit so that I can test things
> >> >>> out at this scale and see what the issue is. In the meantime, would you
> >> >>> mind sending me any logs found in /opt/sge6/default/common/install_logs
> >> >>> and also the /opt/sge6/ec2_sge.conf for a failed run?
> >> >>>
> >> >>> Also, if this happens again you could try reinstalling SGE manually,
> >> >>> assuming all the nodes are up:
> >> >>>
> >> >>> $ starcluster sshmaster mycluster
> >> >>> $ cd /opt/sge6
> >> >>> $ ./inst_sge -m -x -auto ./ec2_sge.conf
> >> >>>
> >> >>> ~Justin
> >> >>>
> >> >>> On 03/15/2011 06:30 PM, Kyeong Soo (Joseph) Kim wrote:
> >> >>>>>> Hi Jeff,
> >> >>>>>>
> >> >>>>>> I experienced the same thing with my 50-node configuration
> >> >>>>>> (c1.xlarge). Out of 50 nodes, only 29 were successfully identified
> >> >>>>>> by SGE.
> >> >>>>>>
> >> >>>>>> Regards,
> >> >>>>>> Joseph
> >> >>>>>>
> >> >>>>>> On Sat, Mar 5, 2011 at 10:15 PM, Jeff White <jeff_at_decide.com> wrote:
> >> >>>>>>> I can frequently reproduce an issue where 'starcluster start'
> >> >>>>>>> completes without error, but not all nodes are added to the SGE
> >> >>>>>>> pool, which I verify by running 'qconf -sel' on the master. The
> >> >>>>>>> latest example I have is creating a 25-node cluster, where only
> >> >>>>>>> the first 12 nodes are successfully installed. The remaining
> >> >>>>>>> instances are running and I can ssh to them, but they aren't
> >> >>>>>>> running sge_execd. There are only install log files for the first
> >> >>>>>>> 12 nodes in /opt/sge6/default/common/install_logs. I have not
> >> >>>>>>> found any clues in the starcluster debug log or the logs inside
> >> >>>>>>> master:/opt/sge6/.
> >> >>>>>>>
> >> >>>>>>> I am running starcluster development snapshot 8ef48a3 downloaded
> >> >>>>>>> on 2011-02-15, with the following relevant settings:
> >> >>>>>>>
> >> >>>>>>> NODE_IMAGE_ID=ami-8cf913e5
> >> >>>>>>> NODE_INSTANCE_TYPE = m1.small
> >> >>>>>>>
> >> >>>>>>> I have seen this behavior with the latest 32-bit and 64-bit
> >> >>>>>>> starcluster AMIs. Our workaround is to start a small cluster and
> >> >>>>>>> progressively add nodes one at a time, which is time-consuming.
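Since the thread mentions the "addnode" command, this workaround can be scripted; a sketch, assuming the single-argument "addnode <cluster_tag>" form adds one node per invocation, with the cluster tag and node count as examples only:

    import subprocess

    def grow_cluster(cluster_tag, extra_nodes):
        # add one node per call: slow, but sidesteps the partial SGE install
        for _ in range(extra_nodes):
            subprocess.check_call(["starcluster", "addnode", cluster_tag])

    # e.g. grow_cluster("mycluster", 20)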
> >> >>>>>>>
> >> >>>>>>> Has anyone else noticed this, and does anyone have a better
> >> >>>>>>> workaround or an idea for a fix?
> >> >>>>>>>
> >> >>>>>>> jeff
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> _______________________________________________
> >> >>>>>>> StarCluster mailing list
> >> >>>>>>> StarCluster_at_mit.edu
> >> >>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>> _______________________________________________
> >> >>>>>> StarCluster mailing list
> >> >>>>>> StarCluster_at_mit.edu
> >> >>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> >> >>>
> >> >>>>
> >
> > >>
> > >
> > > _______________________________________________
> > > StarCluster mailing list
> > > StarCluster_at_mit.edu
> > > http://mailman.mit.edu/mailman/listinfo/starcluster
> > >
> > >
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.17 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk2cp6sACgkQ4llAkMfDcrkEPQCeKo3XQ8ilWlv89E76NTReqBaz
> k68AoIur9985wTYnBKP4+cnKkKwMyL9i
> =QBz+
> -----END PGP SIGNATURE-----
>
Received on Wed Apr 06 2011 - 14:23:28 EDT
This archive was generated by hypermail 2.3.0.
