Following up (and hopefully not talking to myself), I've found at least
one problem with jobstats[]. The code says:
    if l.find('jobnumber') != -1:
        job_id = int(l[13:len(l)])
    ...
    hash = {'jobname': jobname, 'queued': qd, 'start': start,
            'end': end}
    self.jobstats[job_id % self.jobstat_cachesize] = hash
So it doesn't take into account array jobs, which have a range of
taskid values under a single jobnumber. Every task of an array job maps
to the same slot, so only one instance is counted.
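To make the collision concrete, here is a minimal sketch of the same indexing scheme. It is not the StarCluster code: the cache size of 200, the helper name slot_for and the sample job numbers are all assumptions for illustration.

```python
# Hypothetical stand-in for the cache; the size is an assumption.
JOBSTAT_CACHESIZE = 200


def slot_for(job_id, cachesize=JOBSTAT_CACHESIZE):
    """Slot index as in the quoted code: job number modulo cache size."""
    return job_id % cachesize


# (job_id, task_id) pairs: one array job with three tasks, one plain job.
records = [(1234, 1), (1234, 2), (1234, 3), (1235, None)]

jobstats = [None] * JOBSTAT_CACHESIZE
for job_id, task_id in records:
    # Every task of job 1234 lands in the same slot and overwrites the last.
    jobstats[slot_for(job_id)] = {"job_id": job_id, "task_id": task_id}

stored = [entry for entry in jobstats if entry is not None]
print(len(records), "records parsed,", len(stored), "kept")
```

Only the last task of the array job survives, so any average computed from the cache sees one sample where there should be three.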
That explains why the estimated job duration is wrong. The most
obvious solution is just to get rid of the cache. If compute time is a
problem, use ru_wallclock directly, since at the moment most of the
time is spent converting time formats.
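A rough sketch of what I mean, assuming qacct-style text where records are separated by lines of "=" characters, non-array jobs report a taskid of "undefined", and ru_wallclock is a number of seconds (possibly with a trailing "s"). The function name parse_qacct and the sample text are made up for illustration.

```python
def parse_qacct(text):
    """Return {(job_id, task_id): wallclock_seconds} from qacct-style text."""
    durations = {}
    record = {}

    def flush():
        # Store the finished record, keyed by job number AND task id.
        if "jobnumber" in record and "ru_wallclock" in record:
            key = (int(record["jobnumber"]), record.get("taskid", "undefined"))
            # Assumption: the value is in seconds, optionally suffixed "s".
            durations[key] = float(record["ru_wallclock"].rstrip("s"))
        record.clear()

    for line in text.splitlines():
        if line.startswith("====="):
            flush()
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in ("jobnumber", "taskid", "ru_wallclock"):
            record[parts[0]] = parts[1].strip()
    flush()  # the last record has no trailing separator
    return durations


SAMPLE = """\
==============================================================
jobname      middle
jobnumber    1234
taskid       1
ru_wallclock 30
==============================================================
jobname      middle
jobnumber    1234
taskid       2
ru_wallclock 40
==============================================================
jobname      end
jobnumber    1235
taskid       undefined
ru_wallclock 300
"""

durations = parse_qacct(SAMPLE)
average = sum(durations.values()) / len(durations)
print(len(durations), "tasks, average", round(average, 1), "seconds")
```

Keying on (jobnumber, taskid) keeps every array task, and no date strings are parsed at all.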
I'm also working on a gridEngine scheduler that works well with
starcluster: it keeps the most recently booted nodes (and the master)
the most loaded, so that when nodes get to nearly the end of their hour
there are free ones you can take down. I've just got this going. Next
I'd like to distribute the load evenly across all nodes that are up
(these are vCPUs, and a lightly loaded node runs much faster) unless
they are near the end of the hour, in which case the ones nearest the
end should be kept empty. I'm happy to go into details, but I fear
there aren't many starcluster users who really care about running
short jobs efficiently (or the above bug would have been fixed), so
I'm talking to myself.
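As a rough illustration of the ordering idea only (this is not the scheduler itself), the sketch below assumes hourly billing counted from each node's launch time. The function names, node names and the 10 minute window are all hypothetical.

```python
from datetime import datetime, timedelta


def minutes_left_in_hour(launch_time, now):
    """Minutes until the node's current billed hour ends."""
    elapsed = (now - launch_time).total_seconds()
    return (3600 - elapsed % 3600) / 60.0


def fill_order(launch_times):
    """Node names, most recently booted first; these get work first."""
    return sorted(launch_times, key=lambda name: launch_times[name], reverse=True)


def removal_candidates(launch_times, running_jobs, now, window_minutes=10):
    """Idle nodes whose billed hour ends within window_minutes."""
    return [
        name
        for name in sorted(launch_times)
        if running_jobs.get(name, 0) == 0
        and minutes_left_in_hour(launch_times[name], now) <= window_minutes
    ]


now = datetime(2016, 3, 26, 12, 0, 0)
launch_times = {
    "node001": now - timedelta(minutes=55),
    "node002": now - timedelta(minutes=20),
    "node003": now - timedelta(minutes=5),
}
running_jobs = {"node003": 4, "node002": 1}  # node001 is idle

order = fill_order(launch_times)
candidates = removal_candidates(launch_times, running_jobs, now)
print(order, candidates)
```

Because new work goes to the newest nodes first, the oldest node drains naturally and is idle when its hour is nearly up.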
Tony
On 25/03/16 19:56, Tony Robinson wrote:
> Hi Rajat,
>
> The main issue that I have with the load balancer is that sometimes
> bringing up a node or taking down a node fails and this causes the
> loadbalancer to fall over. This is almost certainly an issue with
> boto - I just haven't looked into it enough.
>
> I'm working on the loadbalancer right now. I'm running a few
> different sorts of jobs: some take half a minute, some take five
> minutes. It takes me about five minutes to bring a node up, so load
> balancing is quite a hard task; what's there at the moment certainly
> isn't optimal.
>
> In your master's thesis you had a go at anticipating the future load
> based on the queue, although I see no trace of this in the current
> code. What seems like the most obvious approach to me is to look at
> what's running and in the queue and see if it's all going to complete
> within some specified period. If it is, then fine; if not, assume you
> are going to bring n nodes up (start at n=1) and then see if it'll
> complete; if not, then increment n.
>
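The loop described above can be sketched as follows. This is only a back-of-envelope model under stated assumptions, not the version I have running: remaining work is measured in slot-seconds, every new node contributes nothing until boot_time has passed, and all names and numbers here are hypothetical.

```python
def nodes_needed(remaining_work, current_slots, slots_per_node,
                 period, boot_time, max_nodes=20):
    """Smallest n for which the work is estimated to finish within period.

    remaining_work is in slot-seconds; period and boot_time are in seconds.
    n = 0 is the "it all completes anyway" case; after that n goes 1, 2, ...
    """
    for n in range(max_nodes + 1):
        # Existing slots work for the whole period; new nodes only once booted.
        capacity = (current_slots * period
                    + n * slots_per_node * max(0, period - boot_time))
        if capacity >= remaining_work:
            return n
    return max_nodes  # give up growing beyond the cap


# 100 queued jobs of about 300 s each, 8 slots up now, 8 slots per new node,
# a 30 minute target and a 5 minute boot time.
needed = nodes_needed(remaining_work=100 * 300, current_slots=8,
                      slots_per_node=8, period=1800, boot_time=300)
print(needed)
```

The estimate is only as good as the job duration fed into remaining_work, which is why the avg_job_duration() problem matters.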
> I've got a version of this running but it isn't complete because
> avg_job_duration() consistently under-reports. I'm doing some
> debugging, and it seems that jobstats[] has a bug: I have three types
> of job, a start, middle and end, and as they are all run in sequence
> jobstats[] should have equal numbers of each. It doesn't.
>
> This is a weekend (with unreliable time) activity for me. If you or
> anyone else wants to help:
>
> a) getting avg_job_duration() working which probably means fixing
> jobstats[]
> b) getting a clean simple predictive load balancer working
>
> then please contact me.
>
>
> Tony
>
> On 25/03/16 17:17, Rajat Banerjee wrote:
>> I'll fix any issues with the load balancer if they come up.
>
>
> --
> Speechmatics is a trading name of Cantab Research Limited
> We are hiring: www.speechmatics.com/careers
> <https://www.speechmatics.com/careers>
> Dr A J Robinson, Founder, Cantab Research Ltd
> Phone direct: 01223 794096, office: 01223 794497
> Company reg no GB 05697423, VAT reg no 925606030
> 51 Canterbury Street, Cambridge, CB4 3QG, UK
Received on Sat Mar 26 2016 - 15:29:53 EDT