StarCluster - Mailing List Archive

Re: starcluster starts but not all nodes added as exec nodes

From: Kyeong Soo (Joseph) Kim
Date: Wed, 16 Mar 2011 17:37:04 +0000

Justin,
Please find attached the requested file.

Regards,
Joseph


On Wed, Mar 16, 2011 at 4:38 PM, Justin Riley <jtriley_at_mit.edu> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Joseph,
>
> Great, thanks. Can you also send me the /opt/sge6/ec2_sge.conf file, please?
>
> ~Justin
>
> On 03/16/2011 12:29 PM, Kyeong Soo (Joseph) Kim wrote:
>> Hi Justin,
>>
>> Please find attached the gzipped tar file of the log files under the
>> install_logs directory.
>>
>> Note that the configuration is for a 25-node cluster (1 master and 24 slaves).
>>
>> Below is the time-sorted listing of the log files in that directory:
>>
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 13:23 execd_install_node024_2011-03-16_13:23:11.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node023_2011-03-16_11:13:37.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node022_2011-03-16_11:13:36.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node021_2011-03-16_11:13:36.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node020_2011-03-16_11:13:32.log
>> -rw-r--r-- 1 kks kks  18K 2011-03-16 11:13 execd_install_master_2011-03-16_11:13:10.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node017_2011-03-16_11:13:27.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node018_2011-03-16_11:13:27.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node019_2011-03-16_11:13:28.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node016_2011-03-16_11:13:26.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node014_2011-03-16_11:13:25.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node015_2011-03-16_11:13:26.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node012_2011-03-16_11:13:24.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node013_2011-03-16_11:13:25.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node010_2011-03-16_11:13:23.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node011_2011-03-16_11:13:24.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node008_2011-03-16_11:13:22.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node009_2011-03-16_11:13:22.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node006_2011-03-16_11:13:21.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node007_2011-03-16_11:13:21.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node004_2011-03-16_11:13:20.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node005_2011-03-16_11:13:20.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node003_2011-03-16_11:13:19.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node001_2011-03-16_11:13:18.log
>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node002_2011-03-16_11:13:19.log
>> -rw-r--r-- 1 kks kks 3.1K 2011-03-16 11:13 execd_install_master_2011-03-16_11:13:17.log
>> -rw-r--r-- 1 kks kks 8.4K 2011-03-16 11:13 qmaster_install_master_2011-03-16_11:13:05.log
>>
>> As you can see, the execd installation on the master was run twice, and the
>> cluster ended up with only master and node001~node023. The top-most log, for
>> node024, comes from a manual addition through the "addnode" command later
>> (i.e., 1 hour 10 mins after).
>>
>> Even with this slimmed-down configuration (compared to the original
>> 125-node one), the chance that all nodes are properly installed (i.e., 25
>> out of 25) was about 50%: last night and this morning, it took about 10
>> attempts to bring up a total of five 25-node clusters.
>>
>> Regards,
>> Joseph
>>
>>
>> On Wed, Mar 16, 2011 at 3:57 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>> Hi Jeff/Joseph,
>>
>> I just requested an increase to my EC2 instance limit so that I can test
>> things out at this scale and see what the issue is. In the meantime, would
>> you mind sending me any logs found in /opt/sge6/default/common/install_logs
>> and also the /opt/sge6/ec2_sge.conf for a failed run?
>>
>> Also, if this happens again, you could try reinstalling SGE manually,
>> assuming all the nodes are up:
>>
>> $ starcluster sshmaster mycluster
>> $ cd /opt/sge6
>> $ ./inst_sge -m -x -auto ./ec2_sge.conf
>>
>> ~Justin
>>
>> On 03/15/2011 06:30 PM, Kyeong Soo (Joseph) Kim wrote:
>>>>> Hi Jeff,
>>>>>
>>>>> I experienced the same thing with my 50-node configuration (c1.xlarge).
>>>>> Of the 50 nodes, only 29 were successfully registered with SGE.
>>>>>
>>>>> Regards,
>>>>> Joseph
>>>>>
>>>>> On Sat, Mar 5, 2011 at 10:15 PM, Jeff White <jeff_at_decide.com> wrote:
>>>>>> I can frequently reproduce an issue where 'starcluster start' completes
>>>>>> without error, but not all nodes are added to the SGE pool, which I verify
>>>>>> by running 'qconf -sel' on the master. The latest example I have is creating
>>>>>> a 25-node cluster, where only the first 12 nodes are successfully installed.
>>>>>> The remaining instances are running and I can ssh to them but they aren't
>>>>>> running sge_execd. There are only install log files for the first 12 nodes
>>>>>> in /opt/sge6/default/common/install_logs. I have not found any clues in the
>>>>>> starcluster debug log or the logs inside master:/opt/sge6/.
>>>>>>
>>>>>> I am running starcluster development snapshot 8ef48a3 downloaded on
>>>>>> 2011-02-15, with the following relevant settings:
>>>>>>
>>>>>> NODE_IMAGE_ID=ami-8cf913e5
>>>>>> NODE_INSTANCE_TYPE = m1.small
>>>>>>
>>>>>> I have seen this behavior with the latest 32-bit and 64-bit starcluster
>>>>>> AMIs. Our workaround is to start a small cluster and progressively add nodes
>>>>>> one at a time, which is time-consuming.
>>>>>>
>>>>>> Has anyone else noticed this, and does anyone have a better workaround
>>>>>> or an idea for a fix?
>>>>>>
>>>>>> jeff
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> StarCluster mailing list
>>>>>> StarCluster_at_mit.edu
>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>>
>>>>>>
>>
>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.17 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk2A550ACgkQ4llAkMfDcrlR2gCeOoYMzl9U+z1owIq98JHBgLHi
> IngAniUwV6nq/hN6/TfxCBu1d2/MO5Ru
> =tXep
> -----END PGP SIGNATURE-----
>
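
The check described in the thread (comparing the hosts SGE knows about
against the hosts the cluster should contain) can be sketched in a few lines
of Python. This is only an illustrative sketch, not part of StarCluster: the
function names are hypothetical, and it assumes hosts are named "master" and
node001..nodeNNN and that log files follow the execd_install_<host>_<date>
pattern seen in the listing above. It also assumes 'qconf -sel' prints one
short host name per line.

```python
import re

# Assumed pattern, taken from the file names in the listing above, e.g.
# "execd_install_node017_2011-03-16_11:13:27.log".
LOG_NAME = re.compile(r"^execd_install_(?P<host>[A-Za-z0-9]+)_\d{4}-\d{2}-\d{2}_")


def expected_hosts(cluster_size):
    """Return 'master' plus node001..nodeNNN for a cluster of the given size."""
    return ["master"] + ["node%03d" % i for i in range(1, cluster_size)]


def hosts_with_execd_log(filenames):
    """Collect the host names that have at least one execd install log."""
    hosts = set()
    for name in filenames:
        match = LOG_NAME.match(name)
        if match:
            hosts.add(match.group("host"))
    return hosts


def parse_qconf_sel(output):
    """Turn the text printed by 'qconf -sel' (one host per line) into a set."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def missing_hosts(cluster_size, found):
    """List expected hosts that are absent from `found`, in cluster order."""
    return [h for h in expected_hosts(cluster_size) if h not in found]
```

To use it, pass either the file names from
/opt/sge6/default/common/install_logs to hosts_with_execd_log(), or the
output of 'qconf -sel' to parse_qconf_sel(), and feed the resulting set to
missing_hosts() together with the cluster size (25 in this thread). The
returned names are the nodes that would still need to be added.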



Received on Wed Mar 16 2011 - 13:37:05 EDT
This archive was generated by hypermail 2.3.0.