Rayson/Ron/Scott,
I like the idea of restricting GPU access to GPU jobs. Obviously we'll
want to extend this trick to support restricting access to multiple GPUs
for a single job...
My only question with this whole consumable + load sensor setup for GPU
is how does the job script know which device(s) to use? Ideally there
should be a variable passed down by SGE in the job's environment that
communicates which device(s) to use. Does this stuff do that or no? It's
hard to tell from the generic howtos on consumables vs load sensors...
In any case here's my view on how gpu jobs *should* work:
1. User uses qsub to submit a job requesting one or more gpus:
$ qsub -l gpus=2 mygpujob.sh
2. SGE determines which nodes have 'N' gpus available and starts the job
with a variable, say $GPU_DEVICES, in the job's environment that
communicates which devices the job has access to. This $GPU_DEVICES
variable could, for example, provide a space-separated list of device
numbers to use. For example, for a 2-GPU job:
GPU_DEVICES="0 1"
Also the device(s) would ideally be restricted to the job as Ron
mentioned.
3. The user's job script then uses the $GPU_DEVICES variable from SGE
to run their code on the appropriate GPU(s).
Is this possible? It's also not clear to me whether SGE will execute
multiple jobs on a single GPU or whether jobs have exclusive access to
the GPU(s) they run on?
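
To make step 3 concrete, here is a minimal sketch of what a job could do
with such a variable. It assumes the hypothetical GPU_DEVICES variable
proposed above (nothing in this thread confirms that SGE sets it) and maps
it onto CUDA_VISIBLE_DEVICES, the comma-separated variable the CUDA runtime
reads to decide which devices a process can see. The function name is made
up for illustration:

```python
def cuda_env_from_gpu_devices(environ):
    """Return a copy of environ with CUDA_VISIBLE_DEVICES set from GPU_DEVICES.

    GPU_DEVICES is the hypothetical space-separated variable proposed in
    this email; CUDA_VISIBLE_DEVICES is the real CUDA variable and expects
    a comma-separated list of device numbers.
    """
    devices = environ.get("GPU_DEVICES", "").split()
    if not devices:
        raise RuntimeError("GPU_DEVICES is empty or unset; no GPU assigned")
    for dev in devices:
        if not dev.isdigit():
            raise ValueError("unexpected device id: %r" % dev)
    env = dict(environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(devices)
    return env


if __name__ == "__main__":
    # Stand-in for the environment SGE would hand to a 2-GPU job.
    job_env = cuda_env_from_gpu_devices({"GPU_DEVICES": "0 1"})
    print(job_env["CUDA_VISIBLE_DEVICES"])
```

A job wrapper could pass the returned environment to the real GPU binary
when launching it, so the code itself never needs to know about SGE.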
~Justin
On Thu, May 24, 2012 at 08:46:40AM -0700, Ron Chen wrote:
> There is also a SGE way of handling it, which is to use prolog & epilog to set GPUs in exclusive mode & permission.
>
> 1. With this setup, 2 queue instances are needed per CG1 host. A queue instance is a container for job execution (but not a queue), and each logically owns a GPU board.
>
> 2. SGE attaches an additional GID to each job's processes for identification (we need to find out which process belongs to which job), so in the prolog we can set the /dev/nvidiaX device permissions to allow only processes in that GID to use the device.
>
> 3. In the epilog, reset permission.
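
A rough sketch of steps 2 and 3, not William Hay's actual scripts. It
assumes the prolog already knows the job's additional GID and the device
path belonging to the queue instance (how it learns them is left out), that
it runs with enough privilege to change the device node, and that the epilog
restores a site-chosen default group and mode. The function names are
hypothetical:

```python
import os
import stat
import tempfile


def grant_device_to_job(device_path, job_gid):
    """Prolog side: only the owner and members of job_gid may open the device."""
    os.chown(device_path, -1, job_gid)  # -1 leaves the owner unchanged
    os.chmod(device_path, 0o660)


def release_device(device_path, default_gid, default_mode):
    """Epilog side: restore the group and mode the site normally uses."""
    os.chown(device_path, -1, default_gid)
    os.chmod(device_path, default_mode)


if __name__ == "__main__":
    # Demo on a temporary file standing in for /dev/nvidiaX, so it runs
    # without root and without touching a real device node.
    fd, fake_device = tempfile.mkstemp()
    os.close(fd)
    try:
        grant_device_to_job(fake_device, os.getegid())
        print(oct(stat.S_IMODE(os.stat(fake_device).st_mode)))
        release_device(fake_device, os.getegid(), 0o666)
        print(oct(stat.S_IMODE(os.stat(fake_device).st_mode)))
    finally:
        os.remove(fake_device)
```
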
>
> This method was contributed by William Hay on the Grid Engine mailing list.
>
> -Ron
>
>
>
> ----- Original Message -----
> From: Rayson Ho <raysonlogin_at_gmail.com>
> To: Scott Le Grand <varelse2005_at_gmail.com>
> Cc: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Sent: Thursday, May 24, 2012 12:11 AM
> Subject: Re: [StarCluster] CG1 plus StarCluster Questions
>
> Hi Scott,
>
> We just rely on the internal resource accounting of Grid Engine - ie.
> GPU jobs need to explicitly request GPU devices when users qsub the
> job, eg. qsub -l gpu=1 JobScript.sh - and then Grid Engine will make
> sure that at most 2 such jobs are scheduled on a host with 2 GPU
> devices.
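
As a toy model of that accounting (not Grid Engine's scheduler code): a
consumable behaves like a per-host counter, so a host with 2 GPUs admits
jobs until the requested total reaches 2. The class and method names are
made up for illustration:

```python
class HostConsumable:
    """Toy per-host counter for a consumable such as gpu=2."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def try_start(self, request):
        """Reserve `request` units; return False if the job must wait."""
        if request < 1:
            raise ValueError("request must be at least 1")
        if self.in_use + request > self.capacity:
            return False
        self.in_use += request
        return True

    def finish(self, request):
        """Give back the units a finished job was holding."""
        self.in_use = max(0, self.in_use - request)


if __name__ == "__main__":
    host = HostConsumable(capacity=2)
    # Three jobs each asking for one GPU, as in `qsub -l gpu=1`.
    for job in ("job1", "job2", "job3"):
        print(job, "started" if host.try_start(1) else "waiting")
```

In this model the counter only tracks how many GPUs are in use, not which
ones, so by itself it cannot tell a job which device to open.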
>
>
> We also have the cgroups integration in Open Grid Scheduler/Grid
> Engine 2011.11 update 1 (ie. OGS/GE 2011.11u1), but we are not taking
> advantage of the device whitelisting controller yet:
>
> http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html
>
> http://www.kernel.org/doc/Documentation/cgroups/devices.txt
>
> Rayson
>
> ================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
> On Thu, May 24, 2012 at 12:04 AM, Scott Le Grand <varelse2005_at_gmail.com> wrote:
> > This is really cool but it raises a question: how does the sensor avoid race
> > conditions for executables that don't immediately grab the GPU?
> >
> >
> > On Wed, May 23, 2012 at 8:46 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
> >>
> >> Hi Justin & Scott,
> >>
> >> I played with a spot CG1 instance last week, and I was able to use
> >> both NVML & OpenCL APIs to pull information from the GPU devices.
> >>
> >> Open Grid Scheduler's GPU load sensor (which uses the NVML API) is for
> >> monitoring the health of GPU devices, and it is very similar to the
> >> GPU monitoring product from Bright Computing - but note that Bright
> >> has a very nice GUI (and we are not planning to compete against
> >> Bright, so most likely the Open Grid Scheduler project will not try to
> >> implement a GUI front-end for our GPU load sensor).
> >>
> >>
> >> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
> >>
> >>
> >> Note that with the information from the GPU load sensor, we can use
> >> the StarCluster load balancer to shut down nodes that have unhealthy
> >> GPUs - ie. GPUs that are too hot, have too many ECC errors, etc.
> >>
> >> However, we don't put the GPUs in exclusive mode yet - which
> >> is what Scott has already done, and it is on our ToDo list - doing so
> >> would make the Open Grid Scheduler/Grid Engine-GPU integration more
> >> complete.
> >>
> >>
> >> Anyway, here's how to compile & run the gpu load sensor:
> >>
> >> * URL:
> >> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
> >>
> >> * you can compile it in STANDALONE mode by adding -DSTANDALONE
> >>
> >> * if you don't run it in standalone mode, you can press ENTER to
> >> simulate the internal Grid Engine load sensor environment
> >>
> >> % cc gpu_sensor.c -o gpu_ls -I/usr/local/cuda/CUDAToolsSDK/NVML/
> >> -L/usr/lib/nvidia-current/ -lnvidia-ml
> >> % ./gpu_ls
> >>
> >> begin
> >> ip-10-16-21-185:gpu.0.name:Tesla M2050
> >> ip-10-16-21-185:gpu.0.busId:0000:00:03.0
> >> ip-10-16-21-185:gpu.0.fanspeed:0
> >> ip-10-16-21-185:gpu.0.clockspeed:270
> >> ip-10-16-21-185:gpu.0.memfree:2811613184
> >> ip-10-16-21-185:gpu.0.memused:6369280
> >> ip-10-16-21-185:gpu.0.memtotal:2817982464
> >> ip-10-16-21-185:gpu.0.utilgpu:0
> >> ip-10-16-21-185:gpu.0.utilmem:0
> >> ip-10-16-21-185:gpu.0.sbiteccerror:0
> >> ip-10-16-21-185:gpu.0.dbiteccerror:0
> >> ip-10-16-21-185:gpu.1.name:Tesla M2050
> >> ip-10-16-21-185:gpu.1.busId:0000:00:04.0
> >> ip-10-16-21-185:gpu.1.fanspeed:0
> >> ip-10-16-21-185:gpu.1.clockspeed:270
> >> ip-10-16-21-185:gpu.1.memfree:2811613184
> >> ip-10-16-21-185:gpu.1.memused:6369280
> >> ip-10-16-21-185:gpu.1.memtotal:2817982464
> >> ip-10-16-21-185:gpu.1.utilgpu:0
> >> ip-10-16-21-185:gpu.1.utilmem:0
> >> ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0
> >> ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0
> >> ip-10-16-21-185:gpu.1.sbiteccerror:0
> >> ip-10-16-21-185:gpu.1.dbiteccerror:0
> >> end
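
A small sketch of how a script could consume a report like the one above.
It assumes only the line format visible in the output (host:name:value
between "begin" and "end"); the function names and the choice of threshold
are made up for illustration. Note the split is limited to two colons
because values such as busId contain colons themselves:

```python
SAMPLE = """\
begin
ip-10-16-21-185:gpu.0.name:Tesla M2050
ip-10-16-21-185:gpu.0.busId:0000:00:03.0
ip-10-16-21-185:gpu.0.dbiteccerror:0
ip-10-16-21-185:gpu.1.name:Tesla M2050
ip-10-16-21-185:gpu.1.dbiteccerror:0
end
"""


def parse_load_report(text):
    """Return {host: {name: value}} for lines between 'begin' and 'end'."""
    report = {}
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == "begin":
            inside = True
        elif line == "end":
            inside = False
        elif inside and line:
            # maxsplit=2 keeps colons inside the value (e.g. busId) intact.
            host, name, value = line.split(":", 2)
            report.setdefault(host, {})[name] = value
    return report


def gpus_over_threshold(values, metric, limit):
    """Return sorted GPU indices whose gpu.<N>.<metric> exceeds limit."""
    bad = set()
    for name, value in values.items():
        parts = name.split(".")
        if len(parts) == 3 and parts[0] == "gpu" and parts[2] == metric:
            if int(value) > limit:
                bad.add(int(parts[1]))
    return sorted(bad)


if __name__ == "__main__":
    report = parse_load_report(SAMPLE)
    for host, values in report.items():
        print(host, gpus_over_threshold(values, "dbiteccerror", 0))
```
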
> >>
> >> And we can also use NVidia's SMI:
> >>
> >> % nvidia-smi
> >> Sun May 20 01:21:42 2012
> >> +------------------------------------------------------+
> >> | NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
> >> |-------------------------------+----------------------+----------------------+
> >> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
> >> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
> >> |===============================+======================+======================|
> >> | 0.  Tesla M2050               | 0000:00:03.0  Off    |         0          0 |
> >> |  N/A   N/A  P1    Off /  Off  |   0%    6MB / 2687MB |    0%     Default    |
> >> |-------------------------------+----------------------+----------------------|
> >> | 1.  Tesla M2050               | 0000:00:04.0  Off    |         0          0 |
> >> |  N/A   N/A  P1    Off /  Off  |   0%    6MB / 2687MB |    0%     Default    |
> >> |-------------------------------+----------------------+----------------------|
> >> | Compute processes:                                               GPU Memory |
> >> |  GPU  PID     Process name                                       Usage      |
> >> |=============================================================================|
> >> |  No running compute processes found                                         |
> >> +-----------------------------------------------------------------------------+
> >>
> >> And note that Ganglia also has a plugin for NVML:
> >>
> >> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
> >>
> >>
> >> And the complex setup is for internal accounting inside Grid Engine -
> >> ie. it tells GE how many GPU cards there are and how many are in use.
> >> We can set up a consumable resource and let Grid Engine do the
> >> accounting... ie. we model GPUs like any other consumable resources,
> >> eg. software licenses, disk space, etc... and we can then use the same
> >> techniques to manage the GPUs:
> >>
> >> http://gridscheduler.sourceforge.net/howto/consumable.html
> >> http://gridscheduler.sourceforge.net/howto/loadsensor.html
> >>
> >> Rayson
> >>
> >>
> >>
> >>
> >> On Wed, May 23, 2012 at 4:31 PM, Justin Riley <jtriley_at_mit.edu> wrote:
> >> > Just curious, were you able to get the GPU consumable resource/load
> >> > sensor to work with the SC HVM/GPU AMI? I will eventually experiment
> >> > with this myself when I create new 12.X AMIs soon but it would be
> >> > helpful to have a condensed step-by-step overview if you were able to
> >> > get things working. No worries if not.
> >> >
> >> > Thanks!
> >> >
> >> > ~Justin
> >>
> >>
> >>
> >> --
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >
> >
>
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
Received on Thu May 24 2012 - 12:34:06 EDT