Also, somehow this cluster got into a weird state, with two copies of
node001:
Cluster nodes:
     master running i-b0a5cec0 ec2-204-236-252-51.compute-1.amazonaws.com
    node001 running i-5c3e542c ec2-54-235-230-217.compute-1.amazonaws.com
    node001 running i-063e5476 ec2-23-20-247-62.compute-1.amazonaws.com
    node002 running i-5a32582a ec2-23-23-20-49.compute-1.amazonaws.com
    node003 running i-5c32582c ec2-54-242-192-21.compute-1.amazonaws.com
    node004 running i-da741eaa ec2-23-22-42-142.compute-1.amazonaws.com
    node005 running i-dc741eac ec2-50-16-179-158.compute-1.amazonaws.com
    node006 running i-a06515d0 ec2-50-19-184-152.compute-1.amazonaws.com
    node007 running i-a26515d2 ec2-54-234-70-30.compute-1.amazonaws.com
    node008 running i-c4493ab4 ec2-54-242-116-109.compute-1.amazonaws.com
    node009 running i-c6493ab6 ec2-107-22-61-85.compute-1.amazonaws.com
    node010 running i-c8493ab8 ec2-23-20-134-170.compute-1.amazonaws.com
Total nodes: 12
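
For what it's worth, one way to clean up the duplicate might be to check
which of the two instances SGE actually has registered, and then terminate
the stale one directly in EC2. This is only a rough sketch, assuming the
SGE client tools and the old ec2-api-tools are installed; the instance id
below is just the first of the two duplicates, used as a placeholder, not
a claim about which one is actually stale:

    # List the execution hosts SGE knows about, and their current status:
    qconf -sel
    qhost

    # Once you've confirmed which i-... instance is the stale duplicate,
    # terminate it directly (placeholder instance id shown):
    ec2-terminate-instances i-5c3e542c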
Also, some nodes (e.g. 002, 003, 004) were not listed in @allhosts in the
queue.
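
To check and fix that by hand, something like the following should work
(rough sketch, assuming the standard SGE client tools; node002 is just an
example, and SGE may want whatever fully-qualified host name it has
registered):

    # Show the current contents of the @allhosts host group:
    qconf -shgrp @allhosts

    # Add a missing node back to the group's host list:
    qconf -aattr hostgroup hostlist node002 @allhosts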
Could this be because I was running the load balancer? It didn't seem to be
working quite right; it wasn't actually removing nodes when it should have.
Dan
On Wed, Feb 6, 2013 at 12:14 AM, Daniel Povey <dpovey@gmail.com> wrote:
> BTW, I manually removed it from the queue using qconf -mhgrp @allhosts
> before I called the rn command (because I wanted to make sure no jobs were
> running on the nodes I was removing, and I wasn't sure whether the rn
> command would wait). Not sure if this would cause the crash.
>
> Dan
>
>