StarCluster - Mailing List Archive

Re: Error when running mpich2 plugin using Starcluster

From: Subbarao Kota <no email>
Date: Mon, 23 Apr 2012 20:12:24 -0400

Hi Justin, thanks so much.

I was able to use one of the 32-bit public AMIs. The MPICH2 plugin appears to have worked, as does passwordless SSH between the nodes. I was also able to run the benchmarks from the head node, but I couldn't confirm whether a benchmark actually executed across the cluster or only on the head node. Below is an example log from a STREAM run, executed as root on the master node. I passed only master,node001 to mpirun's -host argument but set the number of processes (-np) to 4, just to check.

The output reports that the benchmark ran on 4 nodes, yet the cluster has only the master and one worker node. How can I confirm that the benchmark really ran on all the nodes?
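
For reference, a minimal way to check where each rank actually lands is a tiny MPI program that prints its rank and hostname; a sketch follows (the file name mpi_where.c is made up for illustration):

    /* mpi_where.c - print the host each MPI rank runs on (sketch) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this rank's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total ranks */
        MPI_Get_processor_name(name, &len);    /* host we run on */
        printf("rank %d of %d on host %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }

It compiles and launches the same way as the benchmark:

    $ mpicc mpi_where.c -o mpi_where
    $ mpirun -np 4 -host master,node001 ./mpi_where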

Please advise,
Subbarao Kota.




SS-MBP:~ sinsub$ vim ~/.starcluster/config
SS-MBP:~ sinsub$ starcluster start t1-micro-AMI-cluster
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.1)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Using default cluster template: t1-micro-trial-cluster
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 2-node cluster...
>>> Creating security group @sc-t1-micro-AMI-cluster...Reservation:r-d7f0c3b7
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 1.167 mins
>>> The master node is ec2-107-20-59-39.compute-1.amazonaws.com
>>> Setting up the cluster...
>>> Configuring hostnames...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating cluster user: ec2-user (uid: 1001, gid: 1001)
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user: ec2-user
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring NFS...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.610 mins
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for ec2-user
>>> Installing Sun Grid Engine...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating SGE parallel environment 'orte'
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Adding parallel environment 'orte' to queue 'all.q'
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Running plugin mpich2
>>> Creating MPICH2 hosts file
>>> Configuring MPICH2 profile
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting MPICH2 as default MPI on all nodes
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> MPICH2 is now ready to use
>>> Use mpicc, mpif90, mpirun, etc. to compile and run your MPI apps
>>> Configuring cluster took 1.256 mins
>>> Starting cluster took 2.448 mins

The cluster is now ready to use. To login to the master node
as root, run:

    $ starcluster sshmaster t1-micro-AMI-cluster

When you are finished using the cluster and wish to
terminate it and stop paying for service:

    $ starcluster terminate t1-micro-AMI-cluster

NOTE: Terminating an EBS cluster will destroy all EBS
volumes backing the nodes.

Alternatively, if the cluster uses EBS instances, you can
use the 'stop' command to put all nodes into a 'stopped'
state:

    $ starcluster stop t1-micro-AMI-cluster

NOTE: Any data stored in ephemeral storage (usually /mnt)
will be lost!

This will shutdown all nodes in the cluster and put them in
a 'stopped' state that preserves the EBS volumes backing the
nodes. A 'stopped' cluster may then be restarted at a later
time, without losing data on the local disks, by passing the
-x option to the 'start' command:

    $ starcluster start -x t1-micro-AMI-cluster

This will start all 'stopped' EBS instances and reconfigure
the cluster.

SS-MBP:~ sinsub$
SS-MBP:~ sinsub$
SS-MBP:~ sinsub$ starcluster sshmaster t1-micro-AMI-cluster
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.1)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

The authenticity of host 'ec2-107-20-59-39.compute-1.amazonaws.com (107.20.59.39)' can't be established.
RSA key fingerprint is 1a:40:4d:db:b9:f8:25:0d:3c:e7:05:8d:a4:66:c4:c3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-107-20-59-39.compute-1.amazonaws.com,107.20.59.39' (RSA) to the list of known hosts.
          _ _ _
__/\_____| |_ __ _ _ __ ___| |_ _ ___| |_ ___ _ __
\ / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|
/_ _\__ \ || (_| | | | (__| | |_| \__ \ || __/ |
  \/ |___/\__\__,_|_| \___|_|\__,_|___/\__\___|_|

StarCluster Ubuntu 11.10 AMI
Software Tools for Academics and Researchers (STAR)
Homepage: http://web.mit.edu/starcluster
Documentation: http://web.mit.edu/starcluster/docs/latest
Code: https://github.com/jtriley/StarCluster
Mailing list: starcluster@mit.edu

This AMI Contains:

  * Custom-Compiled Atlas, Numpy, Scipy, etc
  * Open Grid Scheduler (OGS) queuing system
  * Condor workload management system
  * OpenMPI compiled with Open Grid Scheduler support
  * IPython 0.12 with parallel support
  * and more! (use 'dpkg -l' to show all installed packages)

Open Grid Scheduler/Condor cheat sheet:

  * qstat/condor_q - show status of batch jobs
  * qhost/condor_status - show status of hosts, queues, and jobs
  * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./jobscript.sh)
  * qdel/condor_rm - delete batch jobs (e.g. qdel 7)
  * qconf - configure Open Grid Scheduler system

Current System Stats:

  System load:  0.06              Processes:           72
  Usage of /:   30.3% of 9.84GB   Users logged in:     0
  Memory usage: 9%                IP address for eth0: 10.211.33.29
  Swap usage:   0%

root@master:~# cat /etc/hosts
127.0.0.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Added by cloud-init
127.0.1.1 domU-12-31-39-0A-22-D3.compute-1.internal domU-12-31-39-0A-22-D3
10.211.33.29 master
10.211.7.62 node001


root@master:/home/ec2-user# mpirun -np 4 -host master,node001 ./stream_exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 16160 microseconds.
   (= 16160 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          1393.6590     0.0247     0.0230     0.0297
Scale:         1661.8098     0.0236     0.0193     0.0255
Add:           1504.8405     0.0340     0.0319     0.0399
Triad:         1526.1842     0.0335     0.0315     0.0377
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
No. of nodes 4; nodes with errors: 0
Minimum Copy MB/s 1393.66
Average Copy MB/s 1651.20
Maximum Copy MB/s 1740.92
Minimum Scale MB/s 1661.81
Average Scale MB/s 1685.83
Maximum Scale MB/s 1702.75
Minimum Add MB/s 1504.84
Average Add MB/s 1520.84
Maximum Add MB/s 1549.79
Minimum Triad MB/s 1517.50
Average Triad MB/s 1598.18
Maximum Triad MB/s 1776.34

root@master:/home/ec2-user#
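
A note on reading the summary above: in the MPI build of STREAM, "No. of nodes 4" appears to count MPI ranks rather than physical machines, and with two hosts given to -host, MPICH2's launcher typically places the 4 ranks round-robin across master and node001 (two per machine). A quick, compile-free way to see the actual placement is to launch a command that prints each rank's hostname, e.g.:

    root@master:~# mpirun -np 4 -host master,node001 hostname

If both master and node001 appear in the output, the run really spanned both machines.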





Subbarao Kota.

On Feb 22, 2012, at 11:54 AM, Justin Riley wrote:

>
> Also, use the 'listpublic' command to get a list of StarCluster
> supported AMIs which you can use in the node_image_id setting. Keep in
> mind that 32bit AMIs can only be used with 32bit-compatible instance
> types and similarly for 64bit AMIs. StarCluster will let you know when
> you've specified an incompatible AMI/instance type combo.
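
For reference, listpublic is run from the local shell, the same place the cluster was started from:

    $ starcluster listpublic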
>
> ~Justin
>
> On 02/22/2012 11:52 AM, Justin Riley wrote:
>> Subbarao,
>>
>> You're missing a required option in the config file. Open your
>> StarCluster config ($HOME/.starcluster/config) and search for
>> "t1-micro-trial-cluster". In that [cluster] section you are
>> missing 'node_image_id' setting. This setting specifies the AMI id
>> to use for all worker nodes (also applies for master node if
>> master_image_id is not set).
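
For illustration, a [cluster] section with node_image_id set might look like the sketch below; the key name, AMI id, and instance type are placeholders, not values from this thread:

    [cluster t1-micro-trial-cluster]
    # placeholder values; pick a real AMI id from 'starcluster listpublic'
    KEYNAME = mykey
    CLUSTER_SIZE = 2
    CLUSTER_USER = ec2-user
    NODE_INSTANCE_TYPE = t1.micro
    NODE_IMAGE_ID = ami-xxxxxxxx
    # assumes a [plugin mpich2] section is defined elsewhere in the config
    PLUGINS = mpich2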
>>
>> HTH,
>>
>> ~Justin
>>
>> On 02/20/2012 03:18 PM, Subbarao Kota wrote:
>>> Hello star cluster:
>>
>>> Can you please help me fix this error? Where should I set the
>>> node_image_id option in the config file?
>>
>>> $ starcluster runplugin mpich2 t1-micro-trial-cluster
>>> StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.1)
>>> Software Tools for Academics and Researchers (STAR)
>>> Please submit bug reports to starcluster@mit.edu
>>
>>> !!! ERROR - missing required option node_image_id in section
>>> "cluster t1-micro-trial-cluster"
>>
>>
>>> Thanks.
>>
Received on Mon Apr 23 2012 - 20:12:18 EDT