Classification: UNCLASSIFIED
Caveats: FOUO
Dustin,
Thanks again for your comments. I stumble around a lot, but I'm hoping I'm finally stumbling in the right direction. I will try the /mnt approach one more time, putting the HYCOM executable on /mnt instead of on my EBS volume /sharedWork, and if that doesn't improve things, I'll start learning how to stripe disks.
Tom Oppe
-----Original Message-----
From: Dustin Machi [mailto:dmachi_at_vbi.vt.edu]
Sent: Thursday, December 20, 2012 11:11 AM
To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
Cc: starcluster_at_mit.edu
Subject: Re: [StarCluster] Terminating a cluster does not remove its security group (UNCLASSIFIED)
There are a lot of variables here that could contribute to poor write performance. Generally, if the volume you are using is less than 1 TB (the max size of a volume), it may be shared with other users of the underlying physical hardware. The provisioned I/O instances address this to some extent, but the limits are per instance type (not all instance types let you configure I/O performance, but they all rate what their 'default' performance is). Cluster compute instances (cc2.8xlarge), for example, don't have optimized EBS, but they have high bandwidth anyway (10GigE), so they can handle more (and you can still allocate provisioned IOPS on the volumes themselves). You can see the comparisons here:
http://aws.amazon.com/ec2/instance-types/
For your workload, I'd think you would want cc2.8xlarge or hi1.4xlarge.
With the latter, there is 1024 GB of SSD-based disk, which eliminates the need for an additional volume at all. However, that SSD is ephemeral, so make sure to copy the data off before you restart things!
If you use serial I/O and make sure the SSDs (probably on /mnt) are where you read/write, you eliminate most of the networking in terms of data I/O, and your MPI communications will occur over 10GigE instead of GigE.
In terms of buffering, that would be something in the code (MPI serial on master) that is serializing the data to disk. You'd probably have to search Google, as I'm no MPI expert, and given your benchmarking situation I'm not sure you could change it anyway unless it's a configurable parameter of the MPI setup.
I don't think you want to use GlusterFS here. While aggregate write performance can be improved when writing from many clients simultaneously, it is a much more complicated setup that probably won't net you enough improvement to be worth it.
I see your options as either a) using one of the above instance types and making sure the ephemeral disk (which is SSD) is your shared volume, or b) writing a StarCluster plugin that can set up and configure a striped RAID volume on your local disks (your original question from this thread).
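For reference, the striping in option b) usually boils down to a few commands per node. A rough sketch (the device names /dev/xvdb and /dev/xvdc are assumptions; check what your instance type actually exposes before running anything):

```shell
#!/bin/bash
# Sketch: stripe two ephemeral disks into one RAID-0 array (run as root).
# Device names below are assumptions -- verify them on your instance first.
umount /mnt                           # /mnt is typically the first ephemeral disk
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0                    # make a filesystem on the new array
mount /dev/md0 /mnt                   # remount the striped array at /mnt
```

A StarCluster plugin would just run those steps on each node before the job starts.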
Dustin
On 20 Dec 2012, at 11:31, Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
wrote:
>
> Dustin,
>
> HYCOM does use fairly big input files. When I do "ls -l" in the
> /input directory, I get a total of 15319088 KB, so 14.61 GB is the
> total size of all input files. However, reading the input files at
> the start of the run seems to go pretty quickly. It is the writing of
> several big binary output files that seems to go slowly. The model
> simulates one day, and at every simulated hour there is a file of size
> 3798728704 bytes written. At the 23rd hour, in addition to this
> hourly file, there is a huge file of size 57455771648 bytes written.
> Then there is a final hourly file of size 3798728704 bytes written.
> For my fastest run using 336 MPI processes on a standard EBS volume,
> here is the size and time of last modification for each of these large
> output files:
>
> 3798728704 Dec 16 03:49 archv.0001_001_01.a
> 3798728704 Dec 16 03:58 archv.0001_001_02.a
> 3798728704 Dec 16 04:07 archv.0001_001_03.a
> 3798728704 Dec 16 04:17 archv.0001_001_04.a
> 3798728704 Dec 16 04:25 archv.0001_001_05.a
> 3798728704 Dec 16 04:34 archv.0001_001_06.a
> 3798728704 Dec 16 04:48 archv.0001_001_07.a
> 3798728704 Dec 16 04:56 archv.0001_001_08.a
> 3798728704 Dec 16 05:04 archv.0001_001_09.a
> 3798728704 Dec 16 05:13 archv.0001_001_10.a
> 3798728704 Dec 16 05:21 archv.0001_001_11.a
> 3798728704 Dec 16 05:29 archv.0001_001_12.a
> 3798728704 Dec 16 05:38 archv.0001_001_13.a
> 3798728704 Dec 16 05:46 archv.0001_001_14.a
> 3798728704 Dec 16 05:54 archv.0001_001_15.a
> 3798728704 Dec 16 06:03 archv.0001_001_16.a
> 3798728704 Dec 16 06:11 archv.0001_001_17.a
> 3798728704 Dec 16 06:19 archv.0001_001_18.a
> 3798728704 Dec 16 06:28 archv.0001_001_19.a
> 3798728704 Dec 16 06:36 archv.0001_001_20.a
> 3798728704 Dec 16 06:44 archv.0001_001_21.a
> 3798728704 Dec 16 06:53 archv.0001_001_22.a
> 3798728704 Dec 16 07:01 archv.0001_001_23.a
> 57455771648 Dec 16 07:32 archm.0001_001_12.a
> 3798728704 Dec 16 07:36 archv.0001_002_00.a
>
> Notice that each hourly time step, including writing the "archv.*.a"
> file, takes 8-10 minutes (except at hour 6, where extra calculations
> are made), and in fact the writes of the first 23 "archv.*.a" files go
> pretty quickly (15-30 seconds each, as I have timed them). But
> something disastrous happens at the write of the big "archm.*.a" file,
> which is only 15.125 times larger than an "archv.*.a" file. It is
> written 31 minutes after the previous "archv.0001_001_23.a" file.
> Assuming 9 minutes of that is inter-hour computation, it took 22
> minutes to write. I have seen it take much longer. This run used
> serial I/O, in which all write I/O is done by rank 0. For a parallel
> I/O run, in which selected MPI processes participate in writing these
> files using MPI-2 I/O calls, the situation is much worse. I have seen
> the "archm.*.a" file take more than an hour to write in parallel I/O
> mode. The HYCOM developer says that there is no computation at all
> between the writing of the "archm.*.a" file and the last "archv.*.a"
> file, so the 4 minutes between them is pure write time for the last
> "archv.*.a" file. Something in the writing of the "archm.*.a" file
> must have hurt the I/O write performance of the last "archv.*.a" file.
>
> Your suggestion of keeping the input files on the EBS volume, and just
> redirecting the output files to /mnt is good, but I'm not sure how to
> do that. For HYCOM, the output files always appear in the same
> directory as the input files. As a benchmarker of HYCOM, I am not
> really allowed to modify the source code to redirect the output files
> elsewhere. Hence I had to stand on my head to locate the input files
> on /mnt. I will ask the HYCOM developer if it is possible to have the
> input and output files in different directories or filesystems.
>
> Your suggestion of using buffered I/O and increasing the size of the
> buffer chunks is fine, but again I am not sure how to do that with
> binary sequential access files. In my SGE script file, I have:
>
> export FORT_BUFFERED='YES'
>
> as an Intel Fortran compiler-related environment variable, but I think
> it affects only ASCII files, not the big binary files "archv.*.a" and
> "archm.*.a". Can I set the buffer size myself? As mentioned above, I
> am suspicious that the I/O write performance deteriorates at the
> writing of the "archm.*.a" file and affects the last "archv.*.a" file.
> Perhaps the buffer chunk size is getting reset to a smaller value
> when the big file is written.
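> For what it's worth, I may also try the related Intel runtime
> variables, if I understand the documentation correctly (the values
> below are just guesses on my part):

```shell
# Sketch, assuming the Intel Fortran runtime honors these variables;
# the block size and buffer count below are example values, not tuned.
export FORT_BUFFERED=yes        # enable buffered I/O for Fortran units
export FORT_BLOCKSIZE=1048576   # bytes per I/O block (example value)
export FORT_BUFFERCOUNT=4       # number of buffers per unit (example value)
```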
>
> You mentioned using GlusterFS with 64k buffer chunks. I am using
> either standard EBS volumes or what Amazon calls "provisioned IOPS"
> EBS volumes. I think they are NFS mounted and shared among the nodes.
> Do you think GlusterFS would be faster? Can you point me to
> information about how to ask for a GlusterFS file system of 200-300 GB
> size and how to specify the buffer chunk size?
>
> Amazon warns that "cc2.8xlarge" instances cannot take full advantage
> of provisioned IOPS EBS volumes and that I should use "m2.4xlarge"
> instances instead, but I think instances of that type are slower than
> "cc2.8xlarge" instances. Perhaps the "m2.4xlarge" instances are in
> clusters that have a separate network for I/O, so that MPI messages
> and I/O traffic travel on separate networks.
>
> Thank you for the information. As you can see, I need to know more
> about different file systems available on AWS, how to set buffer chunk
> sizes, and most of all, how the big files are being written. All I
> know is that in serial I/O mode, rank 0 uses collective MPI operations
> to gather data from all the other MPI processes and then writes the
> data to one big file. Perhaps some parameter associated with the MPI
> collective operation needs to be resized.
>
> Tom Oppe
>
>
> -----Original Message-----
> From: Dustin Machi [mailto:dmachi_at_vbi.vt.edu]
> Sent: Thursday, December 20, 2012 9:12 AM
> To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
> Cc: starcluster_at_mit.edu
> Subject: Re: [StarCluster] Terminating a cluster does not remove its
> security group (UNCLASSIFIED)
>
> Comments Inline.
>
>> I'm running a HYCOM job now using the /mnt disk on the master node.
>> The problem is that big files are being written. One of the files is
>> 57GB, and 24 other files of size 4GB are also written. The job is
>> running more slowly than when run on an EBS volume, but that may be
>> because the MPI executable is on an EBS volume and is being run from
>> there.
>
> When you say there are 153GB of data written, is that the output, or
> is that the size of the input files you are copying to /mnt/? The
> total time to copy 153GB from the NFS share to each individual node,
> plus the time to do the actual work, may in fact be longer than the
> original reading from the NFS share. I wasn't really suggesting that
> any of the input files be relocated, or that the MPI binary be moved
> (I don't think the latter would have much of an impact at all, aside
> from initial launch). I was only suggesting that your MPI scripts
> write their output to /mnt/ and then copy those output files back
> over to the shared filesystem so you can get the final results.
>
> If the input files are relatively static, you could do something like
> this:
>
> 1. Create a volume and put all the data on it.
> 2. Make a snapshot of this volume.
> 3. For each of the nodes, create a new volume from the snapshot in #2,
> and mount this.
>
> At this point all the nodes will have all the data on their EBS volume
> at launch and no copying is necessary.
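> With the EC2 command-line tools the workflow is roughly as follows (a
> sketch; the volume/snapshot/instance IDs and the availability zone
> are placeholders, not real values):

```shell
# Sketch of steps 1-3 with the EC2 API tools. Every ID below is a
# placeholder -- substitute the values from your own account.
ec2-create-snapshot vol-aaaaaaaa                           # snapshot the data volume
ec2-create-volume --snapshot snap-bbbbbbbb -z us-east-1a   # one new volume per node
ec2-attach-volume vol-cccccccc -i i-dddddddd -d /dev/sdf   # attach it to a node
```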
>
> When the data is updated, you can either a) update the data, create a
> new snapshot and relaunch the cluster so volumes are generated from
> the new snapshot, or b) Use rsync to update the base volume from a new
> copy whenever a node boots. This will make the node startup a little
> bit slower, but should be pretty fast if only a subset of the data has
> changed.
>
> Another question, in regard to the write functionality of your MPI
> program, is what size chunks it tries to write with. Buffering the
> output and writing in larger chunks, instead of a lot of tiny writes,
> will probably perform better for you. I use GlusterFS as a shared
> filesystem in my setup, and the key for it is to write in 64k chunks
> (as opposed to the more common 4k chunks that things like rsync use
> by default).
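> You can get a feel for how much chunk size matters on a given
> filesystem with a quick dd comparison (a sketch; point TARGET at the
> filesystem you want to measure, e.g. /mnt or your EBS mount):

```shell
# Write the same 64 MB once in 4 KB chunks and once in 64 KB chunks,
# then compare the throughput figures dd reports at the end of each run.
# TARGET defaults to /tmp for safety; override it to test other mounts.
TARGET=${TARGET:-/tmp}
dd if=/dev/zero of="$TARGET/chunk4k"  bs=4k  count=16384
dd if=/dev/zero of="$TARGET/chunk64k" bs=64k count=1024
```

> (Remember to remove the two test files afterwards.)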
>
>> I gave up trying to login to each node of the cluster using
>> "starcluster sshnode cluster <node>", mainly because I was getting
>> ssh errors that were preventing me from logging in to several nodes.
>> The error message was something like "you may be in a
>> man-in-the-middle attack".
>>
> This is normal for ssh. When you ssh into a machine you've never
> ssh'd into before, it doesn't know whether the host's signature is
> valid, so it warns you before saving the signature for the host you
> are connecting to. Thereafter, if you connect to a machine with the
> same name, it will warn you if the signature doesn't match what you
> have on file (which might indicate a man-in-the-middle attack). You
> can delete the signature in ~/.ssh/known_hosts by removing the line
> with the same hostname.
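> For a single stale entry, ssh-keygen can remove it for you. A small
> self-contained demonstration (the hostname node001 and the throwaway
> key are just for illustration; in practice you would run
> ssh-keygen -R against ~/.ssh/known_hosts):

```shell
# Demo: put a host key for "node001" into a scratch known_hosts file,
# then remove it with ssh-keygen -R. All files here are throwaways.
KH=$(mktemp)
KEY=$(mktemp -u)
ssh-keygen -q -t rsa -b 2048 -N '' -f "$KEY"        # throwaway host key
printf 'node001 %s\n' "$(cut -d' ' -f1,2 < "$KEY.pub")" > "$KH"
ssh-keygen -R node001 -f "$KH"                      # delete node001's entry
grep -c node001 "$KH" || true                       # prints 0: entry is gone
```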
>
> In general, I would try to do some benchmarking to identify where the
> bottleneck is. Is it the input reads or the output writes that are
> slow?
>
> Dustin
>
>> My technique is very kludgy, but this is what I did. I wrote a bash
>> script to copy four HYCOM input files that each MPI process needs to
>> be able to read to the /mnt disk. /sharedWork is the name of my
>> standard EBS volume.
>>
>> root_at_master > more copyit
>> #!/bin/bash
>> cd /sharedWork/ABTP/ded/hycom/input
>> /bin/cp blkdat.input limits ports.input /mnt
>> /bin/cp patch.input_00336 /mnt/patch.input
>> ls /mnt
>> exit
>>
>> Then I wrote a small Fortran MPI program that executes this script
>> with a system call. Since there are 16 MPI processes to an instance
>> (cc2.8xlarge), I only use one MPI process on each node to do the
>> copying.
>>
>> root_at_master > more copy.f
>> program copy
>> include 'mpif.h'
>> call MPI_Init (ierror)
>> call MPI_Comm_rank (MPI_COMM_WORLD, myid, ier)
>> if (mod(myid, 16) .eq. 0) then
>> call system ('/sharedWork/copyit')
>> endif
>> call MPI_Finalize (ierror)
>> stop
>> end
>>
>> I compiled "copy.f" into an executable called "copy" using
>> "mpiifort".
>> Then in my HYCOM batch script, I run this small MPI job before
>> running the real job:
>>
>> mpiexec -n 336 /sharedWork/copy
>>
>> Unfortunately, I think I made an error in that the HYCOM executable
>> is being run from /sharedWork, the EBS volume.
>>
>> mpiexec -n 336 /sharedWork/hycom
>>
>> The job is running slowly, more slowly than if all files were on
>> /sharedWork. For the next run, I think I will modify the "copyit"
>> script to copy the HYCOM executable to /mnt also for each instance
>> and change the mpiexec line to:
>>
>> mpiexec -n 336 /mnt/hycom
>>
>> HYCOM needs a lot of input files, and they are all available on /mnt
>> corresponding to the master node (rank 0). However, most of the
>> input files are read only by rank 0 and only four input files need to
>> be read by all MPI processes.
>>
>> I'm surprised this works at all. It is mind-bending for me. If you
>> have a better way for running on /mnt, let me know your suggestion.
>>
>> Above all, thank you very much. I think once the executable is
>> placed on each /mnt disk for each node, maybe the run will go much
>> faster.
>>
>> Tom Oppe
>>
>> -----Original Message-----
>> From: Dustin Machi [mailto:dmachi_at_vbi.vt.edu]
>> Sent: Thursday, December 20, 2012 5:53 AM
>> To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
>> Cc: Paolo Di Tommaso; starcluster_at_mit.edu
>> Subject: Re: [StarCluster] Terminating a cluster does not remove its
>> security group (UNCLASSIFIED)
>>
>>> I'm looking at using the /mnt local disk of the "master" node and
>>> copying input files to each node's /mnt disk for those files that
>>> each MPI process needs to read. It's very cumbersome. I have to
>>> login to each node.
>>
>> It is a little cumbersome to do so, but you shouldn't have to login
>> to each node. Just submit/run an SGE job that copies/rsyncs the NFS
>> volume mounted on each machine to /mnt/. Is your problem a read
>> problem or a write problem?
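>> Such a job might look roughly like this (a sketch; the paths and the
>> per-host submission are assumptions about your setup):

```shell
#!/bin/bash
# stage_inputs.sh -- sketch of an SGE job script that copies the NFS
# input directory onto the local /mnt disk of whichever node runs it.
# The source and destination paths are assumptions about your layout.
#$ -N stage-inputs
#$ -cwd
rsync -a /sharedWork/ABTP/ded/hycom/input/ /mnt/input/
```

>> Submitted once per node (e.g. with qsub -l hostname=nodeXXX, where
>> nodeXXX is a placeholder), so each node stages its own copy.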
>>
>> Dustin
>>
Received on Thu Dec 20 2012 - 12:17:23 EST