StarCluster - Mailing List Archive

Re: CG1 plus StarCluster Questions

From: Rayson Ho <no email>
Date: Thu, 24 May 2012 16:08:06 -0400

On Thu, May 24, 2012 at 12:33 PM, Justin Riley <justin.t.riley_at_gmail.com> wrote:
> I like the idea of restricting GPU access to GPU jobs. Obviously we'll
> want to extend this trick to support restricting access to multiple GPUs
> for a single job...

Agreed!


> My only question with this whole consumable + load sensor setup for GPUs
> is how does the job script know which device(s) to use? Ideally there
> should be a variable passed down by SGE in the job's environment that
> communicates which device(s) to use. Does this stuff do that or no? It's
> hard to tell from the generic howtos on consumable vs load sensor...

There are multiple hooks in Grid Engine that would allow us to do this
sort of stuff...
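
For example (purely a sketch - none of this is in the shipped integration
yet, and the /tmp path is just a made-up convention), a starter_method
wrapper could hand a prolog-recorded allocation to the job through its
environment:

  #!/bin/sh
  # hypothetical starter_method: read the devices a prolog reserved for
  # this job and export them before exec'ing the real job script
  GPU_DEVICES=$(cat /tmp/gpu_alloc.$JOB_ID 2>/dev/null)
  export GPU_DEVICES
  exec "$@"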

>
> In any case here's my view on how gpu jobs *should* work:
>
> 1. User uses qsub to submit a job requesting one or more gpus:
>
> $ qsub -l gpus=2 mygpujob.sh
>
> 2. SGE determines which nodes have 'N' gpus available and starts the job
> with a variable, say $GPU_DEVICES, in the job's environment that
> communicates which devices the job has access to. This $GPU_DEVICES
> variable could, for example, provide a space-separated list of device
> numbers to use. For example, for a 2-GPU job:
>
> GPU_DEVICES="0 1"

NVIDIA has the CUDA_VISIBLE_DEVICES environment variable for exactly this purpose:

Specific GPUs can be made invisible with the CUDA_VISIBLE_DEVICES
environment variable. Visible devices should be included as a
comma-separated list in terms of the system-wide list of devices. For
example, to use only devices 0 and 2 from the system-wide list of
devices, set CUDA_VISIBLE_DEVICES equal to "0,2" before launching the
application. The application will then enumerate these devices as
device 0 and device 1.
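
A job script could then translate whatever allocation variable we settle
on (using Justin's proposed space-separated $GPU_DEVICES here, which is
an assumption - SGE doesn't set it today) into the comma-separated form
CUDA expects:

  #!/bin/sh
  #$ -l gpus=2
  # GPU_DEVICES is assumed to arrive as e.g. "0 1";
  # CUDA_VISIBLE_DEVICES wants "0,1"
  export CUDA_VISIBLE_DEVICES=$(echo "$GPU_DEVICES" | tr ' ' ',')
  ./my_gpu_program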


> Is this possible? It's also not clear to me whether SGE will execute
> multiple jobs on a single GPU or whether jobs have exclusive access to
> the GPU(s) they run on?

William apparently uses lock files instead of multiple queue instances,
so running multiple GPU jobs per host is already supported by his method:

http://gridengine.org/pipermail/users/2012-May/003617.html
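
(For reference, the lock-file idea boils down to something like the
following in a prolog - my own rough sketch, not William's actual script,
with made-up lock and allocation paths:)

  #!/bin/sh
  # claim the first free device by atomically creating a per-GPU lock
  # directory, and record the winner where the job can pick it up
  for dev in 0 1; do
      if mkdir /var/lock/gpu$dev 2>/dev/null; then
          echo $dev > /tmp/gpu_alloc.$JOB_ID
          exit 0
      fi
  done
  # no free GPU: error handling / re-queueing is omitted in this sketch
  exit 1

The epilog would then remove the lock directory again.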

Rayson

================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

>
> ~Justin
>
>
> On Thu, May 24, 2012 at 08:46:40AM -0700, Ron Chen wrote:
>> There is also an SGE way of handling it, which is to use the prolog & epilog to put the GPUs in exclusive mode and set device permissions.
>>
>> 1. With this setup, 2 queue instances are needed per CG1 host. A queue instance is a container for job execution (but not a queue), and each logically owns a GPU board.
>>
>> 2. SGE attaches an extra GID to each job's processes for identification (that is how we find out which process belongs to which job), so in the prolog we can set the /dev/nvidiaX device permissions to only allow processes in that GID to use the device (a rough sketch follows below).
>>
>> 3. In the epilog, reset the permissions.
>>
>> This method was contributed by William Hay on the Grid Engine mailing list.
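>>
>> A rough sketch of what the prolog in step 2 could look like (only an
>> illustration - William's actual script is linked elsewhere in this
>> thread, and the spool-file name for the extra GID may differ):
>>
>>   #!/bin/sh
>>   # read the additional group id SGE attached to this job's processes
>>   GID=$(cat $SGE_JOB_SPOOL_DIR/addgrpid)
>>   # restrict the device to that group (device number hard-coded here;
>>   # it would really come from whatever step allocated the GPU)
>>   chgrp $GID /dev/nvidia0
>>   chmod 660 /dev/nvidia0
>>
>> The epilog would then chgrp/chmod the device back to its defaults.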
>>
>>  -Ron
>>
>>
>>
>> ----- Original Message -----
>> From: Rayson Ho <raysonlogin_at_gmail.com>
>> To: Scott Le Grand <varelse2005_at_gmail.com>
>> Cc: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>> Sent: Thursday, May 24, 2012 12:11 AM
>> Subject: Re: [StarCluster] CG1 plus StarCluster Questions
>>
>> Hi Scott,
>>
>> We just rely on the internal resource accounting of Grid Engine - i.e.
>> GPU jobs need to explicitly request GPU devices when the users qsub
>> the job (qsub -l gpu=1 JobScript.sh), and then Grid Engine will make
>> sure that at most 2 such jobs are scheduled on a host with 2 GPU
>> devices.
>>
>>
>> We also have the cgroups integration in Open Grid Scheduler/Grid
>> Engine 2011.11 update 1 (ie. OGS/GE 2011.11u1), but we are not taking
>> advantage of the device whitelisting controller yet:
>>
>> http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html
>>
>> http://www.kernel.org/doc/Documentation/cgroups/devices.txt
>>
>> Rayson
>>
>> ================================
>> Open Grid Scheduler / Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>> On Thu, May 24, 2012 at 12:04 AM, Scott Le Grand <varelse2005_at_gmail.com> wrote:
>> > This is really cool but it raises a question: how does the sensor avoid race
>> > conditions for executables that don't immediately grab the GPU?
>> >
>> >
>> > On Wed, May 23, 2012 at 8:46 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> >>
>> >> Hi Justin & Scott,
>> >>
>> >> I played with a spot CG1 instance last week, and I was able to use
>> >> both NVML & OpenCL APIs to pull information from the GPU devices.
>> >>
>> >> Open Grid Scheduler's GPU load sensor (which uses the NVML API) is for
>> >> monitoring the health of GPU devices, and it is very similar to the
>> >> GPU monitoring product from Bright Computing - but note that Bright
>> >> has a very nice GUI (and we are not planning to compete against
>> >> Bright, so most likely the Open Grid Scheduler project will not try to
>> >> implement a GUI front-end for our GPU load sensor).
>> >>
>> >>
>> >> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
>> >>
>> >>
>> >> Note that with the information from the GPU load sensor, we can use
>> >> the StarCluster load balancer to shut down nodes that have unhealthy
>> >> GPUs - i.e. GPUs that are too hot, have too many ECC errors, etc.
>> >>
>> >> However, we don't put the GPUs in exclusive mode yet - which is what
>> >> Scott has already done and which is on our to-do list - and with that
>> >> the Open Grid Scheduler/Grid Engine GPU integration will be more
>> >> complete.
>> >>
>> >>
>> >> Anyway, here's how to compile & run the gpu load sensor:
>> >>
>> >>  * URL:
>> >> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
>> >>
>> >>  * you can compile it in STANDALONE mode by adding -DSTANDALONE
>> >>
>> >>  * if you don't run it in standalone mode, you can press ENTER to
>> >> simulate the internal Grid Engine load sensor environment
>> >>
>> >> % cc gpu_sensor.c -o gpu_ls -I/usr/local/cuda/CUDAToolsSDK/NVML/ \
>> >>      -L/usr/lib/nvidia-current/ -lnvidia-ml
>> >> % ./gpu_ls
>> >>
>> >> begin
>> >> ip-10-16-21-185:gpu.0.name:Tesla M2050
>> >> ip-10-16-21-185:gpu.0.busId:0000:00:03.0
>> >> ip-10-16-21-185:gpu.0.fanspeed:0
>> >> ip-10-16-21-185:gpu.0.clockspeed:270
>> >> ip-10-16-21-185:gpu.0.memfree:2811613184
>> >> ip-10-16-21-185:gpu.0.memused:6369280
>> >> ip-10-16-21-185:gpu.0.memtotal:2817982464
>> >> ip-10-16-21-185:gpu.0.utilgpu:0
>> >> ip-10-16-21-185:gpu.0.utilmem:0
>> >> ip-10-16-21-185:gpu.0.sbiteccerror:0
>> >> ip-10-16-21-185:gpu.0.dbiteccerror:0
>> >> ip-10-16-21-185:gpu.1.name:Tesla M2050
>> >> ip-10-16-21-185:gpu.1.busId:0000:00:04.0
>> >> ip-10-16-21-185:gpu.1.fanspeed:0
>> >> ip-10-16-21-185:gpu.1.clockspeed:270
>> >> ip-10-16-21-185:gpu.1.memfree:2811613184
>> >> ip-10-16-21-185:gpu.1.memused:6369280
>> >> ip-10-16-21-185:gpu.1.memtotal:2817982464
>> >> ip-10-16-21-185:gpu.1.utilgpu:0
>> >> ip-10-16-21-185:gpu.1.utilmem:0
>> >> ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0
>> >> ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0
>> >> ip-10-16-21-185:gpu.1.sbiteccerror:0
>> >> ip-10-16-21-185:gpu.1.dbiteccerror:0
>> >> end
>> >>
>> >> And we can also use NVIDIA's SMI (nvidia-smi):
>> >>
>> >> % nvidia-smi
>> >> Sun May 20 01:21:42 2012
>> >> +------------------------------------------------------+
>> >> | NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
>> >>
>> >> |-------------------------------+----------------------+----------------------+
>> >> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>> >> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>> >> |===============================+======================+======================|
>> >> | 0.  Tesla M2050               | 0000:00:03.0  Off    |       0           0  |
>> >> |  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
>> >> |-------------------------------+----------------------+----------------------|
>> >> | 1.  Tesla M2050               | 0000:00:04.0  Off    |       0           0  |
>> >> |  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
>> >> |-------------------------------+----------------------+----------------------|
>> >> | Compute processes:                                               GPU Memory |
>> >> |  GPU  PID     Process name                                       Usage      |
>> >> |=============================================================================|
>> >> |  No running compute processes found                                         |
>> >> +-----------------------------------------------------------------------------+
>> >>
>> >> And note that Ganglia also has a plugin for NVML:
>> >>
>> >> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>> >>
>> >>
>> >> And the complex (resource attribute) setup is for internal accounting
>> >> inside Grid Engine - i.e. it tells GE how many GPU cards there are and
>> >> how many are in use.
>> >> We can set up a consumable resource and let Grid Engine do the
>> >> accounting... ie. we model GPUs like any other consumable resources,
>> >> eg. software licenses, disk space, etc... and we can then use the same
>> >> techniques to manage the GPUs:
>> >>
>> >> http://gridscheduler.sourceforge.net/howto/consumable.html
>> >> http://gridscheduler.sourceforge.net/howto/loadsensor.html
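>> >>
>> >> For example (just a sketch - the howtos above have the real details;
>> >> the "gpu" name and the node hostname are placeholders), a consumable
>> >> for a 2-GPU CG1 node would look roughly like this:
>> >>
>> >>   # add a "gpu" resource to the complex (qconf -mc opens an editor):
>> >>   #   name  shortcut  type  relop  requestable  consumable  default  urgency
>> >>   #   gpu   gpu       INT   <=     YES          YES         0        0
>> >>   % qconf -aattr exechost complex_values gpu=2 <cg1-node-hostname>
>> >>
>> >> and jobs then request it with "qsub -l gpu=1 ...", exactly as above.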
>> >>
>> >> Rayson
>> >>
>> >> ================================
>> >> Open Grid Scheduler / Grid Engine
>> >> http://gridscheduler.sourceforge.net/
>> >>
>> >> Scalable Grid Engine Support Program
>> >> http://www.scalablelogic.com/
>> >>
>> >>
>> >>
>> >> On Wed, May 23, 2012 at 4:31 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>> >> > Just curious, were you able to get the GPU consumable resource/load
>> >> > sensor to work with the SC HVM/GPU AMI? I will eventually experiment
>> >> > with this myself when I create new 12.X AMIs soon but it would be
>> >> > helpful to have a condensed step-by-step overview if you were able to
>> >> > get things working. No worries if not.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > ~Justin
>> >>
>> >>
>> >>
>> >> --
>> >> ==================================================
>> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> http://gridscheduler.sourceforge.net/
>> >
>> >
>>
>>
>>
>> --
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>



-- 
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
