StarCluster - Mailing List Archive

Re: Crash report

From: Daniel Povey <no email>
Date: Wed, 6 Feb 2013 00:17:06 -0500

Also, somehow this cluster got into a weird state, with two copies of
node001:

Cluster nodes:
     master running i-b0a5cec0 ec2-204-236-252-51.compute-1.amazonaws.com
    node001 running i-5c3e542c ec2-54-235-230-217.compute-1.amazonaws.com
    node001 running i-063e5476 ec2-23-20-247-62.compute-1.amazonaws.com
    node002 running i-5a32582a ec2-23-23-20-49.compute-1.amazonaws.com
    node003 running i-5c32582c ec2-54-242-192-21.compute-1.amazonaws.com
    node004 running i-da741eaa ec2-23-22-42-142.compute-1.amazonaws.com
    node005 running i-dc741eac ec2-50-16-179-158.compute-1.amazonaws.com
    node006 running i-a06515d0 ec2-50-19-184-152.compute-1.amazonaws.com
    node007 running i-a26515d2 ec2-54-234-70-30.compute-1.amazonaws.com
    node008 running i-c4493ab4 ec2-54-242-116-109.compute-1.amazonaws.com
    node009 running i-c6493ab6 ec2-107-22-61-85.compute-1.amazonaws.com
    node010 running i-c8493ab8 ec2-23-20-134-170.compute-1.amazonaws.com
Total nodes: 12
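
For what it's worth, the only unambiguous handle on the stray node001 is
its instance ID, since the alias now matches two instances. A minimal
sketch of clearing it by hand, assuming the EC2 API tools are installed
and configured with credentials (the ID is the second node001 above):

    # Double-check which instance is the stray one before terminating.
    ec2-describe-instances i-063e5476

    # Terminate just that instance; the alias node001 is ambiguous,
    # so the instance ID is the only safe handle here.
    ec2-terminate-instances i-063e5476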

Also, some nodes (e.g. 002, 003, 004) were not listed in the @allhosts
hostgroup in the queue.
Possibly this is because I was running the load balancer? It didn't seem
to be working quite right; it wasn't actually removing nodes.
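
For the record, the hostgroup can be inspected and patched by hand with
qconf. A minimal sketch, assuming StarCluster's stock SGE setup, run as
the SGE admin user on the master (the node names are the ones missing
above):

    # Show which hosts SGE currently has in the @allhosts hostgroup.
    qconf -shgrp @allhosts

    # Re-add each node that dropped out of the group.
    qconf -aattr hostgroup hostlist node002 @allhosts
    qconf -aattr hostgroup hostlist node003 @allhosts
    qconf -aattr hostgroup hostlist node004 @allhosts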

Dan



On Wed, Feb 6, 2013 at 12:14 AM, Daniel Povey <dpovey@gmail.com> wrote:

> BTW, I manually removed it from the queue using qconf -mhgrp @allhosts
> before I called the rn command (because I wanted to make sure no jobs were
> running on the nodes I was removing, and I wasn't sure whether the rn
> command would wait). Not sure whether this could have caused the crash.
>
> Dan
>
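
Regarding the drain question in the quoted message: a safer manual
sequence is to disable the node's queue instance first, let any running
jobs finish, and only then pull the host out of @allhosts. A minimal
sketch, assuming StarCluster's default all.q:

    # Stop the scheduler from placing new jobs on the node.
    qmod -d all.q@node001

    # Watch until no jobs remain on that queue instance.
    qstat -f -q all.q@node001

    # Once empty, it is safe to edit the hostgroup and remove the node.
    qconf -mhgrp @allhosts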
>
Received on Wed Feb 06 2013 - 00:17:07 EST