StarCluster - Mailing List Archive

Re: CG1 plus StarCluster Questions

From: Scott Le Grand <no email>
Date: Wed, 23 May 2012 21:04:58 -0700

This is really cool but it raises a question: how does the sensor avoid
race conditions for executables that don't immediately grab the GPU?



On Wed, May 23, 2012 at 8:46 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:

> Hi Justin & Scott,
>
> I played with a spot CG1 instance last week, and I was able to use
> both NVML & OpenCL APIs to pull information from the GPU devices.
>
> Open Grid Scheduler's GPU load sensor (which uses the NVML API) is for
> monitoring the health of GPU devices, and it is very similar to the
> GPU monitoring product from Bright Computing. Note, however, that
> Bright has a very nice GUI, and we are not planning to compete against
> Bright, so the Open Grid Scheduler project will most likely not
> implement a GUI front-end for our GPU load sensor:
>
> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
>
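> As a rough idea, here is a from-memory sketch (not the actual
> gpu_sensor.c) of the kind of per-device NVML calls such a health
> sensor makes; it compiles the same way as gpu_sensor.c further below:
>
>     /* Sketch only: query basic GPU health metrics via NVML. */
>     #include <stdio.h>
>     #include <nvml.h>
>
>     int main(void)
>     {
>         unsigned int i, count = 0, temp = 0, fan = 0;
>         unsigned long long sbit = 0, dbit = 0;
>         nvmlMemory_t mem = {0};
>         nvmlUtilization_t util = {0};
>         char name[64];
>
>         if (nvmlInit() != NVML_SUCCESS)
>             return 1;
>         nvmlDeviceGetCount(&count);
>         for (i = 0; i < count; i++) {
>             nvmlDevice_t dev;
>             if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
>                 continue;
>             nvmlDeviceGetName(dev, name, sizeof(name));
>             nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
>             nvmlDeviceGetFanSpeed(dev, &fan);
>             nvmlDeviceGetMemoryInfo(dev, &mem);
>             nvmlDeviceGetUtilizationRates(dev, &util);
>             /* "volatile" counters = errors since the driver was last loaded */
>             nvmlDeviceGetTotalEccErrors(dev, NVML_SINGLE_BIT_ECC,
>                                         NVML_VOLATILE_ECC, &sbit);
>             nvmlDeviceGetTotalEccErrors(dev, NVML_DOUBLE_BIT_ECC,
>                                         NVML_VOLATILE_ECC, &dbit);
>             printf("gpu.%u:%s temp=%uC fan=%u%% memused=%llu "
>                    "utilgpu=%u%% sbitecc=%llu dbitecc=%llu\n",
>                    i, name, temp, fan, mem.used, util.gpu, sbit, dbit);
>         }
>         nvmlShutdown();
>         return 0;
>     }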
>
> Note that with the information from the GPU load sensor, we can use
> the StarCluster load balancer to shut down nodes that have unhealthy
> GPUs - i.e. GPUs that are too hot, have too many ECC errors, etc.
>
> However, we don't put the GPUs in exclusive mode yet - which is what
> Scott has already done, and it is on our ToDo list. With that in
> place, the Open Grid Scheduler/Grid Engine GPU integration will be
> more complete.
>
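> (Setting compute-exclusive mode is basically a one-liner per GPU with
> nvidia-smi; the exact mode numbers/names vary by driver generation, so
> treat this as a sketch and check nvidia-smi -h on the node:
>
> % nvidia-smi -i 0 -c 3    # 3 = EXCLUSIVE_PROCESS on this driver family; needs root
> % nvidia-smi -i 1 -c 3
>
> It can also be set programmatically via NVML's nvmlDeviceSetComputeMode().)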
>
> Anyway, here's how to compile & run the gpu load sensor:
>
> * URL:
> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
>
> * you can compile it in STANDALONE mode by adding -DSTANDALONE
>
> * if you don't run it in standalone mode, you can press ENTER to
> simulate the internal Grid Engine load sensor environment (a sketch of
> that protocol follows the sample output below)
>
> % cc gpu_sensor.c -o gpu_ls -I/usr/local/cuda/CUDAToolsSDK/NVML/ \
>       -L/usr/lib/nvidia-current/ -lnvidia-ml
> % ./gpu_ls
>
> begin
> ip-10-16-21-185:gpu.0.name:Tesla M2050
> ip-10-16-21-185:gpu.0.busId:0000:00:03.0
> ip-10-16-21-185:gpu.0.fanspeed:0
> ip-10-16-21-185:gpu.0.clockspeed:270
> ip-10-16-21-185:gpu.0.memfree:2811613184
> ip-10-16-21-185:gpu.0.memused:6369280
> ip-10-16-21-185:gpu.0.memtotal:2817982464
> ip-10-16-21-185:gpu.0.utilgpu:0
> ip-10-16-21-185:gpu.0.utilmem:0
> ip-10-16-21-185:gpu.0.sbiteccerror:0
> ip-10-16-21-185:gpu.0.dbiteccerror:0
> ip-10-16-21-185:gpu.1.name:Tesla M2050
> ip-10-16-21-185:gpu.1.busId:0000:00:04.0
> ip-10-16-21-185:gpu.1.fanspeed:0
> ip-10-16-21-185:gpu.1.clockspeed:270
> ip-10-16-21-185:gpu.1.memfree:2811613184
> ip-10-16-21-185:gpu.1.memused:6369280
> ip-10-16-21-185:gpu.1.memtotal:2817982464
> ip-10-16-21-185:gpu.1.utilgpu:0
> ip-10-16-21-185:gpu.1.utilmem:0
> ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0
> ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0
> ip-10-16-21-185:gpu.1.sbiteccerror:0
> ip-10-16-21-185:gpu.1.dbiteccerror:0
> end
>
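> A note on the non-standalone mode: it is simply Grid Engine's standard
> load sensor protocol - the sensor blocks on stdin, prints one
> begin/end block of host:metric:value lines for each newline it reads,
> and exits on "quit". A stripped-down sketch of that loop (not the real
> gpu_sensor.c):
>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>         char line[256], host[256];
>
>         gethostname(host, sizeof(host));
>         while (fgets(line, sizeof(line), stdin) != NULL) {
>             if (strncmp(line, "quit", 4) == 0)
>                 break;
>             printf("begin\n");
>             /* a real sensor prints one line per metric, e.g. the gpu.* values above */
>             printf("%s:gpu.0.placeholder_metric:0\n", host);
>             printf("end\n");
>             fflush(stdout);   /* execd reads the report once it is flushed */
>         }
>         return 0;
>     }
>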
> And we can also use NVIDIA's SMI tool:
>
> % nvidia-smi
> Sun May 20 01:21:42 2012
> +------------------------------------------------------+
> | NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
> |-------------------------------+----------------------+----------------------+
> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
> |===============================+======================+======================|
> | 0.  Tesla M2050               | 0000:00:03.0  Off    |         0         0  |
> | N/A   N/A   P1    Off / Off   |   0%    6MB / 2687MB |    0%     Default    |
> |-------------------------------+----------------------+----------------------|
> | 1.  Tesla M2050               | 0000:00:04.0  Off    |         0         0  |
> | N/A   N/A   P1    Off / Off   |   0%    6MB / 2687MB |    0%     Default    |
> |-------------------------------+----------------------+----------------------|
> | Compute processes:                                               GPU Memory |
> |  GPU  PID     Process name                                       Usage      |
> |=============================================================================|
> |  No running compute processes found                                         |
> +-----------------------------------------------------------------------------+
>
> And note that Ganglia also has a plugin for NVML:
>
> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>
>
> And the complex setup is for internal accounting inside Grid Engine -
> i.e. it tells GE how many GPU cards there are and how many are in use.
> We can set up a consumable resource and let Grid Engine do the
> accounting: we model GPUs like any other consumable resource - e.g.
> software licenses, disk space, etc. - and then use the same techniques
> to manage the GPUs (a minimal example follows these links):
>
> http://gridscheduler.sourceforge.net/howto/consumable.html
> http://gridscheduler.sourceforge.net/howto/loadsensor.html
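>
> For example (the complex name and job script here are placeholders,
> following the consumable howto above), a per-host GPU consumable looks
> roughly like this:
>
> % qconf -mc      # add one line to the complex list:
> #name  shortcut  type  relop  requestable  consumable  default  urgency
> gpu    gpu       INT   <=     YES          YES         0        0
>
> % qconf -aattr exechost complex_values gpu=2 ip-10-16-21-185   # 2 GPUs on this host
> % qsub -l gpu=1 gpu_job.sh                                     # each job consumes one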
>
> Rayson
>
> ================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> On Wed, May 23, 2012 at 4:31 PM, Justin Riley <jtriley_at_mit.edu> wrote:
> > Just curious, were you able to get the GPU consumable resource/load
> > sensor to work with the SC HVM/GPU AMI? I will experiment with this
> > myself when I create new 12.X AMIs soon, but it would be helpful to
> > have a condensed step-by-step overview if you were able to get things
> > working. No worries if not.
> >
> > Thanks!
> >
> > ~Justin
>
>
>
> --
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
Received on Thu May 24 2012 - 00:05:20 EDT
