Sorry for the spam, I should note that I actually ran this command with a
timeout:
ar.wait_interactive(interval=1.0, timeout=1000)
63999/66230 tasks finished after 2181 s
done
And the task did eventually finish after a few more minutes.
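For what it's worth, polling the AsyncResult directly also works if I don't want
to block with a timeout; a minimal sketch using the AsyncResult attributes
(progress, ready, elapsed), with an arbitrary 30-second poll interval:

import time

# ar is the AsyncResult returned by lview.map(...)
while not ar.ready():
    print("%d/%d tasks finished after %.0f s" % (ar.progress, len(ar), ar.elapsed))
    time.sleep(30)  # arbitrary poll interval
print("done after %.0f s" % ar.elapsed)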
On Thu, Jul 30, 2015 at 6:34 PM Christopher Clearfield <
chris.clearfield_at_system-logic.com> wrote:
> There shouldn't be too much I/O unless I'm missing something.
>
> In iPython, I read the data from an HDF store on each node (once), then
> instantiate a class on each node with the data:
>
> %%px
> store = pd.HDFStore(data_file, 'r')
> rows = store.select('results', ['cv_score_mean > 0'])
> rows = rows.sort('cv_score_mean', ascending=False)
> rows['results_index'] = rows.index
>
> # This doesn't take too long.
> model_analytics = ResultsAnalytics(rows, store['data_model'])
> ---
> ## This dispatch takes between 1.5 min and 5 min
> ## 66K jobs
> ar = lview.map(lambda x: model_analytics.generate_prediction_heuristic(x),
> rows_index)
> ---
> ar.wait_interactive(interval=1.0)
>
> 63999/66230 tasks finished after 2181 s
> done
>
> So the whole run takes a while, though each job itself is relatively short.
> But I don't understand why CPU isn't the limiting factor.
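>
> If dispatching 66K tiny tasks one message at a time is part of the overhead, I
> may try batching them with the chunksize argument (a rough sketch; I'm assuming
> lview is a LoadBalancedView and each call stays cheap):
>
> # Same 66K calls, but ~100 indices per message, so the scheduler
> # routes ~660 messages instead of ~66K.
> ar = lview.map(lambda x: model_analytics.generate_prediction_heuristic(x),
>                rows_index, chunksize=100)
> ar.wait_interactive(interval=1.0)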
>
> Rajat, thanks for recommending dstat.
>
> Best,
> Chris
>
> On Thu, Jul 30, 2015 at 10:52 AM Jacob Barhak <jacob.barhak_at_gmail.com>
> wrote:
>
>> Hi Christopher,
>>
>> Do you have a lot of I/O? For example writing and reading many files to
>> the same NFS location?
>>
>> This may explain things.
>>
>> Jacob
>> On Jul 30, 2015 2:34 AM, "Christopher Clearfield" <
>> chris.clearfield_at_system-logic.com> wrote:
>>
>>> Hi All,
>>> I'm running a set of about 60K relatively short jobs; the whole set takes
>>> about 30 minutes to run. This is through IPython parallel.
>>>
>>> Yet my CPU utilization levels are relatively low:
>>>
>>> queuename qtype resv/used/tot. load_avg arch states
>>> ---------------------------------------------------------------------------------
>>> all.q_at_master BIP 0/0/2 0.98 linux-x64
>>> ---------------------------------------------------------------------------------
>>> all.q_at_node001 BIP 0/0/8 8.01 linux-x64
>>> ---------------------------------------------------------------------------------
>>> all.q_at_node002 BIP 0/0/8 8.07 linux-x64
>>> ---------------------------------------------------------------------------------
>>> all.q_at_node003 BIP 0/0/8 7.96 linux-x64
>>>
>>> (I disabled the IPython engines on master because I was having heartbeat
>>> timeout issues with the worker engines on my nodes, which explains why the
>>> load on master is so low.)
>>>
>>> But I'm only seeing ~8% CPU utilization on the nodes. Is that expected?
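>>>
>>> I can also poll the engines directly from the client; a rough sketch of what
>>> I have in mind (it assumes psutil is installed on the engines):
>>>
>>> from IPython.parallel import Client  # "from ipyparallel import Client" on newer versions
>>>
>>> rc = Client()
>>> dview = rc[:]  # DirectView over all engines
>>>
>>> def engine_cpu():
>>>     # CPU percent of this engine's process, sampled over one second
>>>     import os
>>>     import psutil
>>>     return psutil.Process(os.getpid()).cpu_percent(interval=1)
>>>
>>> # engine id -> CPU percent
>>> print(dict(zip(rc.ids, dview.apply_sync(engine_cpu))))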
>>>
>>> Thanks,
>>> Chris
>>>
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster_at_mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster