Re: [Starcluster] Load Balancer Problems
Hey Rajat,
Just a quick update on the testing progress: I'm currently running a job and
it seems to be working as expected. We did get one error that didn't seem to
change anything: ssh.py:248 - ERROR - command source /etc/profile && qacct
-j -b 201008021652 failed with status 1. Otherwise the balancer looks to be
working great.
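
For what it's worth, I'd guess status 1 from qacct just means no accounting
records matched the time window yet, so it's probably safe for the balancer
to treat it as benign. A rough sketch of that idea (the helper below is my
assumption, not the actual ssh.py code):

    # Sketch: run the accounting query and treat exit status 1 as "no
    # records yet" rather than a hard error. Behavior is assumed.
    import subprocess

    def get_qacct(begin_time):
        cmd = "source /etc/profile && qacct -j -b %s" % begin_time
        proc = subprocess.Popen(cmd, shell=True,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, err = proc.communicate()
        if proc.returncode == 1:
            return None  # likely no finished jobs in the window yet
        if proc.returncode != 0:
            raise RuntimeError("qacct failed: %s" % err.strip())
        return out
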
Best,
Amaro Taylor
RES Group, Inc.
1 Broadway • Cambridge, MA 02142 • U.S.A.
Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
amaro.taylor_at_resgroupinc.com
Disclaimer: The information contained in this email message may be
confidential. Please be careful if you forward, copy or print this message.
If you have received this email in error, please immediately notify the
sender and delete the message.
On Mon, Aug 2, 2010 at 12:59 PM, Amaro Taylor
<amaro.taylor_at_resgroupinc.com> wrote:
> Hey Guys,
>
> As far as the node idle time goes, I think we just misinterpreted what was
> happening. The modulus statement was what we wanted.
>
> Thanks
>
> Amaro Taylor
>
>
> On Mon, Aug 2, 2010 at 12:30 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>
>> Raj,
>>
>> > 2. What is your preference for how long a job should stay idle before
>> > being killed?
>>
>> I think you meant *node*, not *job*...
>>
>> > I usually don't check how long it has been idle. If it
>> > is idle now and the queue is empty, then kill it. I could add code to
>> > check how long it has been idle, if it seems useful. Is there a use
>> > case?
>>
>> Also, the node must be up for the "majority of the hour" before it can
>> be considered for removal. This gives the queue time to stabilize and
>> also saves money, since you pay for the full instance hour anyway.
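>>
>> Something along these lines is what I have in mind (the names here are
>> hypothetical, just to illustrate the hourly check, not the actual
>> balancer code):
>>
>>     # Sketch: a node only becomes a removal candidate once it has used
>>     # the majority of its current billed instance hour.
>>     MIN_MINUTES_INTO_HOUR = 45
>>
>>     def past_hour_threshold(node, now):
>>         uptime = now - node.launch_time  # both are datetime objects
>>         total_minutes = uptime.days * 1440 + uptime.seconds // 60
>>         # modulus gives minutes into the current (partial) hour
>>         return total_minutes % 60 >= MIN_MINUTES_INTO_HOUR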
>>
>> As far as the "code to check how long a node has been idle" goes, I'm
>> not sure I understand the use case/context either. Mind bringing the
>> list up to date on this discussion?
>>
>> ~Justin
>>
>> On 08/02/2010 02:38 PM, Rajat Banerjee wrote:
>> > Hey Amaro,
>> > Cool, thanks. I called Brian and got info regarding the array of jobs.
>> > I checked in some test code that works fine on my (simple) cluster
>> > with qsub -t 1-20:1. I'd appreciate it if you'd test and let me know
>> > how it goes. Just committed to github:
>> >
>> > http://github.com/rqbanerjee/StarCluster/commit/17998a68feab3d1440aa5d9edc2e74697e43ef54
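>> >
>> > The gist of the change is roughly this (an illustration of the idea,
>> > not the committed code):
>> >
>> >     # Sketch: an array job's "tasks" field like "1-20:1" represents
>> >     # 20 queued tasks, so expand the range instead of counting 1.
>> >     def count_tasks(tasks_field):
>> >         if not tasks_field:
>> >             return 1
>> >         if "-" not in tasks_field:
>> >             return 1  # a single task id
>> >         start, rest = tasks_field.split("-")
>> >         end, step = rest.split(":") if ":" in rest else (rest, "1")
>> >         return len(range(int(start), int(end) + 1, int(step)))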
>> >
>> > Making requests during a business day has its rewards :)
>> >
>> > Regarding the host that had been inactive for a short time:
>> > 1. If the "tasks" field was properly recognized, as it is now, the
>> > queue should be recognized as full, and that node probably wouldn't
>> > have been killed.
>> > 2. What is your preference for how long a job should stay idle before
>> > being killed? I usually don't check how long it has been idle. If it
>> > is idle now and the queue is empty, then kill it (roughly the rule
>> > sketched below). I could add code to check how long it has been idle,
>> > if it seems useful. Is there a use case?
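>> >
>> > In other words, the current rule is just this (illustrative names,
>> > not the actual code):
>> >
>> >     def should_kill(node, queue):
>> >         # no idle-duration tracking: idle right now plus an empty
>> >         # queue is enough to mark the node for removal
>> >         return node.is_idle() and not queue.has_pending_jobs()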
>> >
>> > Thanks,
>> > Rajat