StarCluster - Mailing List Archive

Fwd: Fwd: Integration of MPICH2 plugin with SGE

From: Sergio Mafra <no email>
Date: Mon, 19 Aug 2013 15:47:55 -0300

---------- Forwarded message ----------
From: Sergio Mafra <sergiohmafra_at_gmail.com>
Date: Mon, Aug 19, 2013 at 3:47 PM
Subject: Re: [StarCluster] Fwd: Integration of MPICH2 plugin with SGE
To: Hyokun Yun <yun3_at_purdue.edu>


Hi Hyokun,

Here we go:

1. This indicates that your application is trying to use a node in the
cluster that OGE did not grant to this job.
2. OGE works well, but I guess that it is geared more toward OpenMPI (the
default MPI of StarCluster). Which version of MPICH2 are you using? Is it
the latest one, 1.4? Did you compile your app against this version?
3. MIT StarCluster changed the default allocation strategy from
$round_robin to $fill_up in the latest release.
4. The only thing that could be related to the AMI is the MPICH2 version.
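
To check points 1 and 3, it may help to look at what `qconf -sp orte` prints for the parallel environment. The following is only a minimal sketch, not something StarCluster ships: the helper names `parse_pe_config` and `integration_warnings` are hypothetical, and the sample text is an illustrative PE definition, not output captured from this cluster. It flags `control_slaves` not being TRUE (which tight integration needs so that tasks on granted nodes are accepted) and notes when `allocation_rule` is `$fill_up`, the setting this thread suggests changing.

```python
def parse_pe_config(text):
    """Parse `qconf -sp <pe>` style text into a dict of attribute -> value."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Each line is "<attribute><whitespace><value>".
        parts = line.split(None, 1)
        config[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return config


def integration_warnings(config):
    """Return human-readable warnings for settings discussed in this thread."""
    warnings = []
    if config.get("control_slaves", "").upper() != "TRUE":
        warnings.append("control_slaves is not TRUE; slave tasks may be rejected")
    if config.get("allocation_rule") == "$fill_up":
        warnings.append("allocation_rule is $fill_up; consider trying $round_robin")
    return warnings


# Illustrative sample only (assumed values, not taken from a real cluster).
SAMPLE_PE = """\
pe_name            orte
slots              24
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
"""

if __name__ == "__main__":
    for warning in integration_warnings(parse_pe_config(SAMPLE_PE)):
        print("warning:", warning)
```

To use it on a live cluster, one would paste the text printed by `qconf -sp orte` in place of `SAMPLE_PE`; the change itself is still made with `qconf -mp orte`.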

All best,

Sergio


On Mon, Aug 19, 2013 at 3:10 PM, Hyokun Yun <yun3_at_purdue.edu> wrote:

> Sergio,
>
>
> Thanks for the advice!
>
> I have read the document, but why would the daemons reject the task if
> the allocation rule is set to $fill_up?
> Shouldn't OGE work with both choices? The document doesn't say I should
> not use $fill_up.
>
> I think I gave $round_robin a try, but I will try once again and let you
> know whether I had success.
>
> Also, is it possible that this is a problem specific to the AMI I am using?
>
>
> Best,
> Hyokun Yun
>
>
>
> On Mon, Aug 19, 2013 at 10:38 AM, Sergio Mafra <sergiohmafra_at_gmail.com> wrote:
>
>> Hi Hyokun,
>>
>> I'm a user of MPICH2 and OGE.
>>
>> It seems that you're using $fill_up instead of $round_robin. If so, try
>> changing it to $round_robin with $ qconf -mp orte
>> You can learn more here:
>> http://star.mit.edu/cluster/docs/latest/plugins/sge.html#using-the-plugin
>>
>> Let me know if this helps.
>>
>> All best.
>>
>> Sergio
>>
>>
>> On Mon, Aug 19, 2013 at 1:53 AM, Hyokun Yun <yun3_at_purdue.edu> wrote:
>>
>>> Dear starcluster users,
>>>
>>>
>>> I am experiencing a problem using the MPICH2 plugin with SGE.
>>>
>>> I am using the following image: ami-52a0c53b, which uses Ubuntu 12.04.
>>>
>>> When I use the mpich2 plugin, it seems that mpich2 and SGE are not
>>> tightly integrated: when I submit my script using qsub, I get the
>>> following error message.
>>>
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node001" didn't accept task
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node002" didn't accept task
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node003" didn't accept task
>>> error: executing task of job 1 failed: execution daemon on host
>>> "nodef004" didn't accept task
>>>
>>> It runs fine when I simply execute 'mpirun' myself, instead of relying
>>> on SGE.
>>> Also, the same script runs fine when I use OpenMPI instead of
>>> MPICH2. That's why I suspect it is an MPICH2 & SGE integration issue.
>>>
>>> The problem is that I need multi-thread support, and it is by default
>>> disabled in OpenMPI. I also prefer to use MPICH2 instead of OpenMPI.
>>>
>>> I was able to reproduce the problem after restarting the cluster from
>>> scratch. Would any of you please take a look at the problem by trying the
>>> same image with the MPICH2 plugin?
>>>
>>>
>>> Thanks,
>>> Hyokun Yun
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster_at_mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>
>>
>>
>
>
> --
> *Hyokun Yun* ( http://www.stat.purdue.edu/~yun3 )
> Ph.D. Candidate
> Department of Statistics
> Purdue University
>
>
Received on Mon Aug 19 2013 - 14:48:00 EDT
This archive was generated by hypermail 2.3.0.
