StarCluster - Mailing List Archive

Re: New Grid Engine Hadoop Integration HOWTO

From: Rayson Ho <no email>
Date: Fri, 1 Jun 2012 15:59:29 -0400

Hi Paul,

I started a new mail thread because our setup is different from what
you have: with the Hadoop Grid Engine integration documented in the
HOWTO, we are not using Hadoop with R. With your setup, in the rmr
integration, R invokes Hadoop directly, and thus it has to bypass the
batch queuing capabilities of Grid Engine.

I believe with your setup, the best thing to do is to change the
Hadoop StarCluster Plugin, and see if you can create the needed
mapred-site.xml file before the Hadoop daemons are brought up.

I believe you can parse /proc/cpuinfo or run Grid Engine's loadcheck
on each node to get the number of processors.

% loadcheck
...
num_proc 4
m_socket 1
m_core 2
m_topology SCTTCTT
load_short 0.00
load_medium 0.00
load_long 0.00
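
As a rough illustration, the num_proc value could be pulled out of
output like the above with a few lines of Python. This is only a
sketch: the function name parse_num_proc is made up for the example,
and it assumes the output is plain "key value" lines as shown.

```python
def parse_num_proc(loadcheck_output: str) -> int:
    """Return the num_proc value from loadcheck-style 'key value' lines."""
    for line in loadcheck_output.splitlines():
        parts = line.split()
        # Only consider lines of the form "num_proc <integer>".
        if len(parts) == 2 and parts[0] == "num_proc":
            return int(parts[1])
    raise ValueError("num_proc not found in loadcheck output")


# Sample text copied from the loadcheck output shown in this message.
sample = """\
num_proc 4
m_socket 1
m_core 2
m_topology SCTTCTT
load_short 0.00
load_medium 0.00
load_long 0.00
"""

print(parse_num_proc(sample))
```

On a real node the text would come from running loadcheck itself, for
example through Python's subprocess module, assuming the loadcheck
binary is on the PATH of the user running the plugin.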

Then add that to the mapred-site.xml file in the following XML format:

<property>
 <name>mapred.tasktracker.map.tasks.maximum</name>
 <value> No. of Processors </value>
</property>
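
A minimal sketch of generating that file content from Python, for
instance from inside a modified plugin. The function name
build_mapred_site is hypothetical, and the sketch assumes the file
holds only this one property inside Hadoop's usual <configuration>
root element; a real mapred-site.xml will normally carry other
properties that would need to be preserved.

```python
import xml.etree.ElementTree as ET


def build_mapred_site(num_proc: int) -> str:
    """Build mapred-site.xml text with the map task limit set to num_proc."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "mapred.tasktracker.map.tasks.maximum"
    ET.SubElement(prop, "value").text = str(num_proc)
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(configuration, encoding="unicode")


# Example: a node reporting 4 processors.
print(build_mapred_site(4))
```

The resulting string would then have to be written to mapred-site.xml
on each node before the Hadoop daemons are started.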

Rayson

================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/



On Fri, Jun 1, 2012 at 3:31 PM, Paul McDonagh <mcdonaghpd_at_gmail.com> wrote:
> Thanks Rayson,
>
> This and the previous email are a couple of really good suggestions. I'll try 'em out and see what happens.
>
> Best,
> Paul.
>
> On Jun 1, 2012, at 14:52, Rayson Ho wrote:
>
>> If you are running Hadoop on StarCluster, you may also be interested
>> in this new method contributed by Prakashan Korambath of UCLA.
>>
>> http://gridscheduler.sourceforge.net/howto/GridEngineHadoop.html
>>
>> The difference between the original SGE 6.2u5 method vs the new one is
>> that with Prakashan's approach, Grid Engine is used for resource
>> allocation, and the Hadoop job scheduler/Job Tracker is used to handle
>> all the MapReduce operations. A Hadoop cluster is created on demand
>> with Prakashan's approach, but in the original SGE 6.2u5 method Grid
>> Engine replaces the Hadoop job scheduler.
>>
>> As standard Grid Engine PEs are used in this new approach, one can
>> call "qrsh -inherit" and use Grid Engine's method to start Hadoop
>> services on remote nodes, and thus get full job control, job
>> accounting, and cleanup at terminate benefits like any other tight PE
>> jobs!
>>
>> Rayson



Received on Fri Jun 01 2012 - 15:59:30 EDT