StarCluster - Mailing List Archive

Re: grid engine not initialize on gpu hvm image

From: Jesse Lu <no email>
Date: Mon, 27 Aug 2012 14:59:11 -0700

Okay, figured out that using ami-999d49f0 for non-HVM master
and ami-4583572c for HVM nodes makes SGE work well. It's my fault for not
looking at the available public starcluster images carefully enough.
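
For anyone finding this thread later, here is a rough sketch of what the relevant cluster template in the StarCluster config file might look like. Only the two AMI IDs come from this message; the section name, key name, cluster size, and instance types are placeholders I am assuming for illustration, so adjust them to your own setup.

```ini
# Sketch only: section name, KEYNAME, CLUSTER_SIZE and instance types are assumed.
[cluster gpucluster]
KEYNAME = mykey
CLUSTER_SIZE = 3
CLUSTER_USER = sgeadmin
# Non-HVM public StarCluster AMI for the master (from this message)
MASTER_IMAGE_ID = ami-999d49f0
MASTER_INSTANCE_TYPE = m1.large
# HVM public StarCluster AMI for the GPU nodes (from this message)
NODE_IMAGE_ID = ami-4583572c
NODE_INSTANCE_TYPE = cg1.4xlarge
```

The point of splitting MASTER_IMAGE_ID and NODE_IMAGE_ID is that the SGE install copied from the master then lands on nodes built from a compatible image.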



On Mon, Aug 27, 2012 at 2:26 PM, Jesse Lu <jesselu_at_stanford.edu> wrote:

> Sorry for the spam, but here's another follow-up.
>
> I found that this only happens when I use a non-HVM EBS AMI for the
> master, but an HVM EBS AMI for the nodes.
>
> This is probably because StarCluster copies the sge install from the
> master to the nodes, and this doesn't play nice when the nodes are CentOS
> based but the master is Ubuntu based.
>
> Any ideas for a work-around?
>
>
> On Mon, Aug 27, 2012 at 2:07 PM, Jesse Lu <jesselu_at_stanford.edu> wrote:
>
>> Follow-up,
>>
>> Here are the contents of the installation log file (for grid engine)
>>
>> cat
>> /opt/sge6/default/common/install_logs/execd_install_node001_2012-08-27_14:04:29.log
>>
>>
>> Your $SGE_ROOT directory: /opt/sge6
>>
>>
>> Using cell: >default<
>>
>>
>>
>>
>>
>> Using local execd spool directory
>> [/opt/sge6/default/spool/exec_spool_local]
>>
>> Creating local configuration for host >node001<
>> sgeadmin@node001 modified "node001" in configuration list
>> Local configuration for host >node001< created.
>>
>> Host >master< already in submit host list!
>> Host >node001< already in submit host list!
>>
>>
>> starting sge_execd
>>
>>
>> No modification because "node001" already exists in "hostlist" of
>> "hostgroup"
>> root@node001 modified "@allhosts" in host group list
>> root@node001 modified "all.q" in cluster queue list
>>
>> got select error: Connection refused
>> got select error: closing "node001/execd/1"
>> Execd on host node001 is not started!
>>
>>
>> On Mon, Aug 27, 2012 at 1:37 PM, Jesse Lu <jesselu_at_stanford.edu> wrote:
>>
>>> ami-12b6477b produces the following error on cluster startup
>>>
>>> !!! ERROR - command 'cd /opt/sge6 && TERM=rxvt ./inst_sge -x -noremote
>>> -auto ./ec2_sge.conf' failed with status 1
>>>
>>> I'm guessing the sge6 installation is faulty? Can anyone help? Thanks!
>>>
>>> Jesse
>>>
>>
>>
>
Received on Mon Aug 27 2012 - 17:59:13 EDT
This archive was generated by hypermail 2.3.0.
