StarCluster - Mailing List Archive

Re: Integration of MPICH2 plugin with SGE

From: Hyokun Yun <no email>
Date: Mon, 19 Aug 2013 17:01:08 -0700

Sergio,



Thanks for the pointer! I will try to contact them as well.
However, my setup is pretty much vanilla StarCluster, except that I
installed gcc-4.7 and the boost packages, so I consider it more of a
StarCluster issue than an OGE issue. I believe my problem should be
reproducible by others who are using the same AMI and the mpich2 plugin.



Regarding your earlier message: I am using mpich2 1.4.1, and compiled the
software using it.

Actually, when I ran qconf -mp orte, the allocation rule was set to
$round_robin by default, not $fill_up.
I just re-created the cluster to confirm this.
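
For anyone who wants to repeat this check without opening the editor that qconf -mp starts, here is a minimal sketch. It assumes the parallel environment is named orte, as above, and that `qconf -sp` prints one "field value" pair per line; the helper name `pe_field` is made up for illustration.

```shell
#!/bin/sh
# Print the value of one field from an SGE parallel environment
# definition read on standard input (the layout printed by `qconf -sp`).
# pe_field is a hypothetical helper name, not part of SGE.
pe_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Only query the scheduler when qconf is actually installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -sp orte | pe_field allocation_rule
fi
```

Other fields, such as control_slaves, can be read the same way with `qconf -sp orte | pe_field control_slaves`.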

I am using starcluster 0.93.3, with ami ami-52a0c53b (Ubuntu 12.04), on
cc2.8xlarge machines.


Below is what my qsub file looks like:

#!/bin/csh
#$ -cwd
#$ -pe orte 4
#$ -N ttt
#$ -e ../auto_output/ttt.err
#$ -o ../auto_output/ttt.out
mpirun executable_name > ../auto_logs/ttt.txt
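
To see which hosts and slot counts SGE actually hands to a job like this, a small diagnostic can help. The sketch below is written in POSIX sh rather than csh, and it assumes the usual PE hostfile layout (host, slots, queue, processor range per line). The helper name `pe_hosts_to_hydra` is hypothetical. Whether passing such a host file to mpirun avoids the error discussed in this thread is untested.

```shell
#!/bin/sh
# Convert an SGE PE hostfile (columns: host, slots, queue, processor range)
# into "host:slots" lines, the host file syntax Hydra's mpirun reads via -f.
# pe_hosts_to_hydra is a hypothetical helper name.
pe_hosts_to_hydra() {
    awk 'NF >= 2 { print $1 ":" $2 }' "$1"
}

# Inside a running SGE job, PE_HOSTFILE and NSLOTS are set by the scheduler.
# Outside a job this block is skipped.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "slots granted: ${NSLOTS:-unknown}" >&2
    pe_hosts_to_hydra "$PE_HOSTFILE" >&2
fi
```

Within a job, the output could be saved and passed on explicitly, for example `pe_hosts_to_hydra "$PE_HOSTFILE" > hosts.$JOB_ID` followed by `mpirun -f hosts.$JOB_ID -n "$NSLOTS" executable_name`.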



Below is what I get from mpirun --version:

HYDRA build details:
    Version: 1.4.1
    Release Date: Wed Aug 24 14:40:04 CDT 2011
    CC: gcc -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro
    CXX: c++ -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro
    F77: gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro
    F90: gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro
    Configure options: '--build=x86_64-linux-gnu'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
'--libexecdir=${prefix}/lib/mpich2' '--srcdir=.'
'--disable-maintainer-mode' '--disable-dependency-tracking'
'--disable-silent-rules' '--enable-shared' '--prefix=/usr' '--enable-fc'
'--disable-rpath' '--sysconfdir=/etc/mpich2'
'--includedir=/usr/include/mpich2' '--docdir=/usr/share/doc/mpich2'
'--with-hwloc-prefix=system' '--enable-checkpointing'
'--with-hydra-ckpointlib=blcr' 'build_alias=x86_64-linux-gnu'
'MPICH2LIB_CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4
-Wformat -Wformat-security -g -O2 -fstack-protector
--param=ssp-buffer-size=4 -Wformat -Wformat-security
-Werror=format-security -Wall' 'MPICH2LIB_CXXFLAGS=-g -O2 -fstack-protector
--param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2
-fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security
-Werror=format-security -Wall' 'MPICH2LIB_FFLAGS=-g -O2'
'MPICH2LIB_FCFLAGS=' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro '
'CPPFLAGS=-D_FORTIFY_SOURCE=2 -I/build/buildd/mpich2-1.4.1/src/mpl/include
-I/build/buildd/mpich2-1.4.1/src/mpl/include
-I/build/buildd/mpich2-1.4.1/src/openpa/src
-I/build/buildd/mpich2-1.4.1/src/openpa/src
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/include
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/include
-I/build/buildd/mpich2-1.4.1/src/mpid/common/datatype
-I/build/buildd/mpich2-1.4.1/src/mpid/common/datatype
-I/build/buildd/mpich2-1.4.1/src/mpid/common/locks
-I/build/buildd/mpich2-1.4.1/src/mpid/common/locks
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/include
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/include
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/include
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/include
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
-I/build/buildd/mpich2-1.4.1/src/util/wrappers
-I/build/buildd/mpich2-1.4.1/src/util/wrappers' 'FFLAGS= -g -O2 -O2'
'FC=gfortran' 'CFLAGS= -g -O2 -fstack-protector --param=ssp-buffer-size=4
-Wformat -Wformat-security -g -O2 -fstack-protector
--param=ssp-buffer-size=4 -Wformat -Wformat-security
-Werror=format-security -Wall -O2' 'CXXFLAGS= -g -O2 -fstack-protector
--param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2
-fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security
-Werror=format-security -Wall -O2' '--disable-option-checking' 'CC=gcc'
'LIBS=-lrt -lcr -lpthread '
    Process Manager: pmi
    Launchers available: ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available: hwloc plpa
    Resource management kernels available: user slurm ll lsf sge pbs
    Checkpointing libraries available: blcr
    Demux engines available: poll select



Thanks,
Hyokun Yun


On Mon, Aug 19, 2013 at 11:52 AM, Sergio Mafra <sergiohmafra_at_gmail.com> wrote:

> Hyokun,
>
> Another resource you can take advantage of is this forum dedicated to
> OGE: http://gridengine.org/blog/2011/01/27/gridengine-users-mailing-list/
>
> All best,
>
> Sergio
>
>
> On Mon, Aug 19, 2013 at 1:53 AM, Hyokun Yun <yun3_at_purdue.edu> wrote:
>
>> Dear starcluster users,
>>
>>
>> I am experiencing a problem using MPICH2 plugin with SGE.
>>
>> I am using the following image: ami-52a0c53b which uses Ubuntu 12.04
>>
>> When I use mpich2 plugin, it seems like mpich2 and SGE are not tightly
>> integrated: when I execute my script using qsub, I get the following error
>> message.
>>
>> error: executing task of job 1 failed: execution daemon on host "node001"
>> didn't accept task
>> error: executing task of job 1 failed: execution daemon on host "node002"
>> didn't accept task
>> error: executing task of job 1 failed: execution daemon on host "node003"
>> didn't accept task
>> error: executing task of job 1 failed: execution daemon on host "node004"
>> didn't accept task
>>
>> It runs fine when I simply execute 'mpirun' myself, instead of relying on
>> SGE.
>> Also, the same script runs fine when I use OpenMPI instead of
>> MPICH2. That's why I suspect it is an MPICH2 & SGE integration issue.
>>
>> The problem is that I need multi-thread support, and it is by default
>> disabled in OpenMPI. I also prefer to use MPICH2 instead of OpenMPI.
>>
>> I was able to reproduce the problem when I restarted the cluster from
>> scratch. Would any of you please take a look at the problem by trying the
>> same image with the MPICH2 plugin?
>>
>>
>> Thanks,
>> Hyokun Yun
>>


-- 
Hyokun Yun ( http://www.stat.purdue.edu/~yun3 )
Ph.D. Candidate
Department of Statistics
Purdue University
Received on Mon Aug 19 2013 - 20:01:10 EDT