StarCluster - Mailing List Archive

Re: Is StarCluster still under active development?

From: Tony Robinson <no email>
Date: Fri, 1 Apr 2016 16:08:09 +0100

Hi Raj and all,

I think that there is another problem as well, one that I haven't
tracked down yet. I have three sorts of jobs, all of which should occur
in the same numbers, but when I measure what's in the cache, one job
name is massively under-represented.

We have:

lookback_window = 3

which means we pull in three hours of history (by default). How about
we just call qacct every 5 mins, or whenever the qacct buffer is empty? I
don't think every 5 mins is a big overhead, and the "if empty" means that
we can power up a new cluster and it'll just be a bit slower before it
populates the job stats (but not that much slower, as it's parsing an
empty buffer). Also, I don't see the need to be continually
recalculating stats - these could be computed each time qacct is called
and stored. If this is going to break something then do let me know.
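
A minimal sketch of that refresh policy, under stated assumptions: the
class name, the fetch_records callable (standing in for "run qacct and
parse it") and the record layout are all hypothetical, not StarCluster's
actual code. It refreshes every five minutes or whenever nothing is
cached, and computes the average duration once per refresh.

```python
import time


class JobStatsCache:
    """Refresh job stats on a timer, or at once if nothing is cached yet."""

    def __init__(self, fetch_records, refresh_interval=300, clock=time.time):
        # fetch_records is a hypothetical callable that runs qacct and
        # returns a list of dicts with numeric 'start' and 'end' times.
        self.fetch_records = fetch_records
        self.refresh_interval = refresh_interval  # seconds (5 minutes)
        self.clock = clock
        self.records = []
        self.avg_duration = None  # computed once per refresh, then reused
        self.last_refresh = None

    def needs_refresh(self):
        # "If empty": an empty buffer always triggers a new pull.
        if not self.records or self.last_refresh is None:
            return True
        return self.clock() - self.last_refresh >= self.refresh_interval

    def refresh(self):
        self.records = list(self.fetch_records())
        self.last_refresh = self.clock()
        durations = [r["end"] - r["start"] for r in self.records]
        self.avg_duration = sum(durations) / len(durations) if durations else None

    def get_avg_duration(self):
        if self.needs_refresh():
            self.refresh()
        return self.avg_duration
```

With this shape, a freshly started cluster simply re-polls until the
first jobs have finished, and callers between refreshes read the stored
value rather than recomputing it.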

I don't know when I'll next get time for this but when I get it working
I'll report back my findings (I have an AWS cluster where nodes are
brought up or down every few minutes so there is plenty of data to try
this out on).


Tony

On 01/04/16 15:44, Rajat Banerjee wrote:
> I see what you are saying, but I did not test the case with a range of
> taskids, so I did not see the problem you mentioned. The cache may
> have been a premature optimization to avoid doing large pulls from
> jobstat once every 30-60 seconds. When Justin and I were designing it,
> it seemed wise to cache some amount of SGE's output instead of doing a
> full pull every time, and it got very slow when there were >100 jobs.
>
> Raj
>
> On Sat, Mar 26, 2016 at 3:29 PM, Tony Robinson <tonyr_at_speechmatics.com
> <mailto:tonyr_at_speechmatics.com>> wrote:
>
> Following up (and hopefully not talking to myself), I've found at
> least one problem with jobstats[]. The code says:
>
> if l.find('jobnumber') != -1:
>     job_id = int(l[13:len(l)])
> ...
> hash = {'jobname': jobname, 'queued': qd, 'start': start, 'end': end}
> self.jobstats[job_id % self.jobstat_cachesize] = hash
>
> So it doesn't take into account array jobs which have a range of
> taskid - it just counts one instance.
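
One way to count every array task is to key each record on the pair
(job number, task id) instead of on the job number alone. The sketch
below is only an illustration: parse_qacct is a hypothetical helper, and
it assumes qacct-style output in which each record is a run of
"key value" lines introduced by a line of '=' characters, with a taskid
field that is not numeric for non-array jobs.

```python
def parse_qacct(text):
    """Return {(job_id, task_id): fields} from qacct-style 'key value' lines."""
    stats = {}
    record = {}

    def flush():
        # Store the record collected so far, if it describes a job.
        if "jobnumber" in record:
            task = record.get("taskid", "undefined")
            task_id = int(task) if task.isdigit() else 0  # 0: not an array task
            stats[(int(record["jobnumber"]), task_id)] = dict(record)
        record.clear()

    for line in text.splitlines():
        if line.startswith("=="):  # assumed record separator
            flush()
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    flush()  # the last record has no trailing separator
    return stats
```

Because the key includes the task id, two tasks of the same array job
no longer overwrite each other, which is what happens when the slot is
chosen from the job number alone.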
>
> That explains why the estimated job duration is wrong. The most
> obvious solution is just to get rid of the cache. If compute
> time is a problem, keep ru_wallclock, since at the moment most time is
> spent converting time formats.
>
> I'm also working on a gridEngine scheduler that works well with
> starcluster, that is, it keeps the most recently booted nodes (and
> master) the most loaded, so when you get to nearly the end of the
> hour there are nodes free that you can take down. I've just got this
> going. Next I'd like to distribute the load evenly across all
> nodes that are up (these are vCPUs, and a lightly loaded node runs
> much faster) unless they are near the end of the hour, and in that
> case make sure the ones nearest the end are empty. I'm happy to go
> into details but I fear there aren't that many users of
> starcluster who really care about getting things going
> efficiently for short running jobs (or the above bug would have
> been fixed) so I'm talking to myself.
>
>
> Tony
>
>
> On 25/03/16 19:56, Tony Robinson wrote:
>> Hi Rajat,
>>
>> The main issue that I have with the load balancer is that sometimes
>> bringing up a node or taking down a node fails, and this causes
>> the loadbalancer to fall over. This is almost certainly an
>> issue with boto - I just haven't looked into it enough.
>>
>> I'm working on the loadbalancer right now. I'm running a few
>> different sorts of jobs; some take half a minute, some take five
>> minutes. It takes me about five minutes to bring a node up, so
>> load balancing is quite a hard task, and certainly what's there at
>> the moment isn't optimal.
>>
>> In your master's thesis you had a go at anticipating the future
>> load based on the queue, although I see no trace of this in the
>> current code. What seems like the most obvious approach to me
>> is to look at what's running and in the queue and see if it's all
>> going to complete within some specified period. If it is, then
>> fine, if not assume you are going to bring n nodes up (start at
>> n=1) and then see if it'll complete, if not then increment n.
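
The incremental search described above can be sketched as follows. This
is an idealised model, not the load balancer's code: nodes_needed and
all of its parameters are hypothetical, work is measured in slot-seconds
and treated as perfectly divisible, and each new node is assumed to
contribute nothing until boot_time seconds have passed (about five
minutes in the setup described here).

```python
def nodes_needed(remaining_work, current_slots, slots_per_node,
                 window, boot_time=300, max_new_nodes=20):
    """Smallest number of extra nodes so the work fits in `window` seconds.

    remaining_work is the estimated slot-seconds of running plus queued
    jobs. Returns None if even max_new_nodes extra nodes are not enough.
    """
    for n in range(max_new_nodes + 1):
        # Existing slots work for the whole window.
        capacity = current_slots * window
        # New nodes only help for the part of the window after booting.
        capacity += n * slots_per_node * max(0, window - boot_time)
        if capacity >= remaining_work:
            return n
    return None
```

Starting the loop at n = 0 covers the "if it is, then fine" case. The
estimate of remaining_work depends on a correct average job duration,
so any under-reporting there makes this search bring up too few nodes.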
>>
>> I've got a version of this running but it isn't complete because
>> avg_job_duration() consistently under-reports. I'm doing some
>> debugging; it seems that jobstats[] has a bug. I have three types
>> of job, a start, middle and end, and as they are all run in
>> sequence, jobstats[] should have equal numbers of each. It
>> doesn't.
>>
>> This is a weekend (with unreliable time) activity for me. If
>> you or anyone else wants to help:
>>
>> a) getting avg_job_duration() working which probably means
>> fixing jobstats[]
>> b) getting a clean simple predictive load balancer working
>>
>> then please contact me.
>>
>>
>> Tony
>>
>> On 25/03/16 17:17, Rajat Banerjee wrote:
>>> I'll fix any issues with the load balancer if they come up.
>>
>>
>> --
>> Speechmatics is a trading name of Cantab Research Limited
>> We are hiring: www.speechmatics.com/careers
>> <https://www.speechmatics.com/careers>
>> Dr A J Robinson, Founder, Cantab Research Ltd
>> Phone direct: 01223 794096, office: 01223 794497
>> Company reg no GB 05697423, VAT reg no 925606030
>> 51 Canterbury Street, Cambridge, CB4 3QG, UK
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu <mailto:StarCluster_at_mit.edu>
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>


Received on Fri Apr 01 2016 - 11:08:18 EDT
This archive was generated by hypermail 2.3.0.
