StarCluster - Mailing List Archive

Re: Error Report

From: Justin Riley <no email>
Date: Tue, 23 Aug 2011 17:21:00 -0400


Apparently I didn't read your original post carefully enough - you
already used the restart command. In either case it should be safe to
call restart again once your connection improves if that was the issue.
Are you getting the "Connection reset by peer" error consistently?

Also, concerning your jobs, which I'm assuming you're using SGE for, are
you still having issues? If so, it's hard to tell what might be the
issue without more details. Have you checked that the jobs are not in an
error state with "qstat -f", and have you tried logging onto the
execution host running the jobs to check that the processes are running,
the expected files are being generated, etc? Just some starting points...


On 08/23/2011 05:13 PM, Justin Riley wrote:
> Hi Josh,
> My apologies for the delay. The error you submitted could be related
> to:
> 1. A spotty Internet connection
> 2. DoS attacks on your instances' SSH daemons
> In the first case there's not much I can do; you really need to have
> a solid Internet connection for StarCluster to work. I've played
> around with auto-reconnecting but it's a hack and likely to break in
> other more extravagant ways so I've been hesitant to add it in.
> However, if you lose your connection during a 'start' command there's
> no need to destroy the cluster, just run restart instead:
> $ starcluster restart mycluster
> This will simply reboot the instances and reconfigure the cluster
> all over again rather than terminating the instances and wasting
> instance hours.
> I'm working on a solution for the second case, which is basically to
> restrict SSH access to only those IP addresses that attempt to connect
> with valid credentials. Essentially StarCluster would:
> 1. Figure out your current IP
> 2. Modify the security group permissions if necessary to allow SSH
>    access from your current IP
> This would happen for each new IP you try to
> start/sshmaster/sshnode/etc from.
> HTH,
> ~Justin
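
The two steps described above for restricting SSH access (find the current IP, then add a security group rule only if necessary) could be sketched roughly as follows. This is only an illustration of the "if necessary" check, not StarCluster's actual implementation: the function name ssh_rule_needed is hypothetical, the list of allowed CIDR blocks is assumed to have been read from the security group already, and both the IP lookup and the call that modifies the group are left out.

```python
import ipaddress


def ssh_rule_needed(current_ip, allowed_cidrs):
    """Return the single-host CIDR to authorize for SSH (port 22),
    or None if an existing rule already covers current_ip.

    Hypothetical helper: allowed_cidrs is assumed to be the list of
    CIDR blocks already permitted for SSH in the security group.
    """
    addr = ipaddress.ip_address(current_ip)
    for cidr in allowed_cidrs:
        # strict=False tolerates blocks written with host bits set
        if addr in ipaddress.ip_network(cidr, strict=False):
            return None
    # /32 for IPv4, /128 for IPv6: exactly one host
    return f"{addr}/{addr.max_prefixlen}"
```

Note that if the group still contains an open rule such as 0.0.0.0/0, this check reports that nothing needs adding, so the open rule would have to be removed before the restriction has any effect.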
> On 08/05/2011 03:05 PM, josh katz wrote:
>> This error occurred after I had submitted 10 jobs, 2 of which were
>> never completed. So I deleted them and tried again, but those jobs
>> also never finished. Then when I tried to restart the cluster
>> this error appeared and asked me to send it. Thus this email.
>> Thanks, Josh


Received on Tue Aug 23 2011 - 17:21:02 EDT