StarCluster - Mailing List Archive

Re: compiling MPI applications on starcluster

From: Gonçalo Albuquerque <no email>
Date: Mon, 28 Apr 2014 14:06:50 +0200

Hi,

When using AMI ami-6b211202 in us-east I stumbled across the same issue
you're experiencing.

The symbolic links in the alternatives system are mixing MPICH and OpenMPI:

root@master:/etc/alternatives# update-alternatives --display mpi
mpi - auto mode
  link currently points to /usr/include/mpich2
/usr/include/mpich2 - priority 40
  slave libmpi++.so: /usr/lib/libmpichcxx.so
  slave libmpi.so: /usr/lib/libmpich.so
  slave libmpif77.so: /usr/lib/libfmpich.so
  slave libmpif90.so: /usr/lib/libmpichf90.so
  slave mpic++: /usr/bin/mpic++.mpich2
  slave mpic++.1.gz: /usr/share/man/man1/mpic++.mpich2.1.gz
  slave mpicc: /usr/bin/mpicc.mpich2
  slave mpicc.1.gz: /usr/share/man/man1/mpicc.mpich2.1.gz
  slave mpicxx: /usr/bin/mpicxx.mpich2
  slave mpicxx.1.gz: /usr/share/man/man1/mpicxx.mpich2.1.gz
  slave mpif77: /usr/bin/mpif77.mpich2
  slave mpif77.1.gz: /usr/share/man/man1/mpif77.mpich2.1.gz
  slave mpif90: /usr/bin/mpif90.mpich2
  slave mpif90.1.gz: /usr/share/man/man1/mpif90.mpich2.1.gz
/usr/lib/openmpi/include - priority 40
  slave libmpi++.so: /usr/lib/openmpi/lib/libmpi_cxx.so
  slave libmpi.so: /usr/lib/openmpi/lib/libmpi.so
  slave libmpif77.so: /usr/lib/openmpi/lib/libmpi_f77.so
  slave libmpif90.so: /usr/lib/openmpi/lib/libmpi_f90.so
  slave mpiCC: /usr/bin/mpic++.openmpi
  slave mpiCC.1.gz: /usr/share/man/man1/mpiCC.openmpi.1.gz
  slave mpic++: /usr/bin/mpic++.openmpi
  slave mpic++.1.gz: /usr/share/man/man1/mpic++.openmpi.1.gz
  slave mpicc: /usr/bin/mpicc.openmpi
  slave mpicc.1.gz: /usr/share/man/man1/mpicc.openmpi.1.gz
  slave mpicxx: /usr/bin/mpic++.openmpi
  slave mpicxx.1.gz: /usr/share/man/man1/mpicxx.openmpi.1.gz
  slave mpif77: /usr/bin/mpif77.openmpi
  slave mpif77.1.gz: /usr/share/man/man1/mpif77.openmpi.1.gz
  slave mpif90: /usr/bin/mpif90.openmpi
  slave mpif90.1.gz: /usr/share/man/man1/mpif90.openmpi.1.gz
Current 'best' version is '/usr/include/mpich2'.
root@master:/etc/alternatives# update-alternatives --display mpirun
mpirun - auto mode
  link currently points to /usr/bin/mpirun.openmpi
/usr/bin/mpirun.mpich2 - priority 40
  slave mpiexec: /usr/bin/mpiexec.mpich2
  slave mpiexec.1.gz: /usr/share/man/man1/mpiexec.mpich2.1.gz
  slave mpirun.1.gz: /usr/share/man/man1/mpirun.mpich2.1.gz
/usr/bin/mpirun.openmpi - priority 50
  slave mpiexec: /usr/bin/mpiexec.openmpi
  slave mpiexec.1.gz: /usr/share/man/man1/mpiexec.openmpi.1.gz
  slave mpirun.1.gz: /usr/share/man/man1/mpirun.openmpi.1.gz
Current 'best' version is '/usr/bin/mpirun.openmpi'.
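
To check another node or AMI for the same condition without reading the
full listings, one option is to look at where the two links under
/etc/alternatives point and compare them. This is only a minimal sketch:
the function names (mpi_flavor, same_flavor, check_mpi_alternatives) are
made up for illustration, and classifying a target by the substrings
"mpich" and "openmpi" is an assumption based solely on the path names in
the listings above.

```shell
#!/bin/sh
# Sketch only: report whether the compile-time (mpi) and runtime (mpirun)
# alternatives point at the same MPI implementation.

# Classify a path by the naming pattern seen in the --display output.
# Assumption: MPICH paths contain "mpich", Open MPI paths contain "openmpi".
mpi_flavor() {
    case "$1" in
        *mpich*)   echo mpich ;;
        *openmpi*) echo openmpi ;;
        *)         echo unknown ;;
    esac
}

# Compare two link targets; succeed only if both are the same known flavor.
same_flavor() {
    a=$(mpi_flavor "$1")
    b=$(mpi_flavor "$2")
    echo "compile: $a, runtime: $b"
    [ "$a" != unknown ] && [ "$a" = "$b" ]
}

# Read the immediate targets of the two links managed by
# update-alternatives (one level only, no -f) and compare them.
check_mpi_alternatives() {
    same_flavor "$(readlink /etc/alternatives/mpi)" \
                "$(readlink /etc/alternatives/mpirun)"
}
```

Running check_mpi_alternatives on each node returns a non-zero status
when the two groups disagree or when a target cannot be classified.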

You are compiling with MPICH and then trying to run with Open MPI. The
solution is to change the symbolic links with the update-alternatives
command so that both groups point to the same implementation. The runtime
link (mpirun) must be changed on every node of the cluster.
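
As a rough illustration, the change could be scripted along the following
lines. Several things here are assumptions rather than part of the AMI:
selecting Open MPI (rather than MPICH) is a choice made because the "orte"
parallel environment is the SGE/Open MPI integration; the helper names
(run, fix_compile_links, fix_runtime_links) and the APPLY variable are
hypothetical; and the host names passed in are placeholders. The two
target paths are copied from the --display output above.

```shell
#!/bin/sh
# Sketch only: point both alternatives groups at Open MPI.
# Target paths are taken from the update-alternatives --display output.
MPI_TARGET=/usr/lib/openmpi/include
MPIRUN_TARGET=/usr/bin/mpirun.openmpi

# Print each command by default; set APPLY=1 to really run it (as root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Compile-time group (mpicc, mpif90, headers, libmpi.so links):
# changed on the node where the code is built.
fix_compile_links() {
    run update-alternatives --set mpi "$MPI_TARGET"
}

# Runtime group (mpirun, mpiexec): changed on every node.
# Arguments are host names reachable over ssh (placeholders here).
fix_runtime_links() {
    for host in "$@"; do
        run ssh "$host" update-alternatives --set mpirun "$MPIRUN_TARGET"
    done
}
```

For example, "fix_compile_links; fix_runtime_links master node001" prints
the three commands it would run, and the same call with APPLY=1 executes
them. An executable already linked against libmpich.so would still need to
be rebuilt after switching the compile-time links.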

No doubt this will be corrected in upcoming versions of the AMIs.

Regards,

Gonçalo


On Mon, Apr 28, 2014 at 1:09 PM, Torstein Fjermestad
<tfjermestad_at_gmail.com> wrote:

> Dear Justin,
>
> during the compilation, the cluster only consisted of the master node
> which is of instance type c3.large. In order to run a test parallel
> calculation, I added a node of instance type c3.4xlarge (16 processors).
>
> The cluster is created from the following AMI:
> [0] ami-044abf73 eu-west-1 starcluster-base-ubuntu-13.04-x86_64 (EBS)
>
> Executing the application outside the queuing system like
>
> mpirun -np 2 -hostfile hosts ./pw.x -in inputfile.inp
>
> did not change anything.
>
> The output of the command "mpirun --version" is the following:
>
> mpirun (Open MPI) 1.4.5
>
> Report bugs to http://www.open-mpi.org/community/help/
>
> After investigating the matter a little bit, I found that mpif90 is likely
> compiled with an MPI version different from mpirun.
> The first line of the output of the command "mpif90 -v" is the following:
>
> mpif90 for MPICH2 version 1.4.1
>
> Furthermore, the output of the command "ldd pw.x" indicates that pw.x is
> compiled with mpich2 and not with Open MPI. The output is the following:
>
> linux-vdso.so.1 => (0x00007fffd35fe000)
> liblapack.so.3 => /usr/lib/liblapack.so.3 (0x00007ff38fb18000)
> libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007ff38e2f5000)
> *libmpich.so.3 *=> /usr/lib/libmpich.so.3 (0x00007ff38df16000)
> libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> (0x00007ff38dcf9000)
> libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3
> (0x00007ff38d9e5000)
> libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff38d6df000)
> libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1
> (0x00007ff38d4c9000)
> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff38d100000)
> librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff38cef7000)
> libcr.so.0 => /usr/lib/libcr.so.0 (0x00007ff38cced000)
> libmpl.so.1 => /usr/lib/libmpl.so.1 (0x00007ff38cae8000)
> /lib64/ld-linux-x86-64.so.2 (0x00007ff390820000)
> libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0
> (0x00007ff38c8b2000)
> libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff38c6ae000)
>
> The feedback I got from the Quantum Espresso mailing list suggested that
> the cause of the error could be that pw.x (the executable) was not compiled
> with the same version of mpi as mpirun.
> The output of the commands "mpirun --version", "mpif90 -v" and "ldd pw.x"
> above has led me to suspect that this is indeed the case.
>
> I therefore wonder whether it is possible to control which mpi version I
> compile my applications with.
>
> If, with the current mpi installation, the applications are compiled with
> a different mpi version than mpirun, then I will likely have similar
> problems when compiling other applications as well. I would therefore very
> much appreciate it if you could give me some hints on how I can solve this
> problem.
>
> Thanks in advance.
>
> Regards,
> Torstein
>
> On Thu, Apr 24, 2014 at 5:13 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>
>> Hi Torstein,
>>
>> Can you please describe your cluster configuration (ie size, image id(s),
>> instance type(s))? Also, you're currently using the SGE/OpenMPI
>> integration. Have you tried just using mpirun only as described in the
>> first part of:
>>
>>
>> http://star.mit.edu/cluster/docs/latest/guides/sge.html#submitting-openmpi-jobs-using-a-parallel-environment
>>
>> Also, what does 'mpirun --version' show?
>>
>> ~Justin
>>
>> On Thu, Apr 17, 2014 at 07:19:28PM +0200, Torstein Fjermestad wrote:
>> > Dear all,
>> >
>> > I recently tried to compile an application (Quantum Espresso,
>> > [1]http://www.quantum-espresso.org/) to be used for parallel
>> > computations on StarCluster. The installation procedure of the
>> > application consists of the standard "./configure + make" steps. At
>> > the end of the output from ./configure, the statement "Parallel
>> > environment detected successfully.\ Configured for compilation of
>> > parallel executables." appears.
>> >
>> > The compilation with "make" completes without errors. I then run the
>> > application in the following way:
>> >
>> > I first write a submit script (submit.sh) with the following content:
>> >
>> > cp /path/to/executable/pw.x .
>> > mpirun ./pw.x -in input.inp
>> >
>> > I then submit the job to the queueing system with the following
>> > command:
>> >
>> > qsub -cwd -pe orte 16 ./submit.sh
>> >
>> > However, in the output of the calculation, the following line is
>> > repeated 16 times:
>> >
>> > Parallel version (MPI), running on 1 processors
>> >
>> > It therefore seems like the program runs 16 single-processor
>> > calculations that all write to the same output.
>> >
>> > I wrote about this problem to the mailing list of Quantum Espresso,
>> > and I got the suggestion that perhaps the mpirun belonged to a
>> > different MPI library than pw.x (a particular package of Quantum
>> > Espresso) was compiled with.
>> >
>> > I compiled pw.x on the same cluster as I executed mpirun. Are there
>> > several versions of openMPI on the AMIs provided by StarCluster? In
>> > that case, how can I choose the correct one?
>> >
>> > Perhaps the problem has a different cause. Does anyone have
>> > suggestions on how to solve it?
>> >
>> > Thanks in advance for your help.
>> >
>> > Yours sincerely,
>> > Torstein Fjermestad
>> >
>> > References
>> >
>> > Visible links
>> > 1. http://www.quantum-espresso.org/
>>
>> > _______________________________________________
>> > StarCluster mailing list
>> > StarCluster_at_mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
>
Received on Mon Apr 28 2014 - 08:07:12 EDT