Re: load balancer stopped working?

Date: Mon, 8 Jun 2015 16:05:08 +0000

Raj has commented in the past that the load balancer does not use the same logic as listclusters:

From that archived message:

The elastic load balancer parses the output of 'qhost' on the cluster:
I don't remember the exact reason for using that instead of the same logic as 'listclusters' above, but here's my guess a few years after the fact:
- Avoids another remote API call to AWS' tagging service to retrieve the tags for all instances within an account. This needs to be called every minute, so a speedy call to your cluster instead of to a remote API is beneficial
- qhost outputs the number of machines correctly configured and able to process work. If a machine shows up in 'listclusters' but not in 'qhost' it's likely not usable to process jobs, and would probably need manual cleanup.
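
To make the qhost point concrete, here is a minimal sketch of pulling host names out of 'qhost' text output. It is not StarCluster's actual parser. The function name and the sample output are my own assumptions, and the sample follows the usual Grid Engine column layout rather than anything captured from this cluster.

```python
def parse_qhost(output):
    """Return the execution-host names listed in `qhost` text output."""
    hosts = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        if set(line.strip()) == {"-"}:
            continue  # dashed separator under the column header
        if fields[0] in ("HOSTNAME", "global"):
            continue  # column header and the "global" pseudo-host
        hosts.append(fields[0])
    return hosts


# Hypothetical sample, NOT taken from the cluster in this thread.
SAMPLE_QHOST = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  linux-x64       4  0.02   14.7G  512.0M     0.0     0.0
node002                 linux-x64       4  0.01   14.7G  301.2M     0.0     0.0
"""

print(parse_qhost(SAMPLE_QHOST))
```

If a parser along these lines came back empty (for example because qhost failed on the master), that might be one way a cluster could look "not running" to the load balancer, though I have not confirmed that is what happens here.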

It appears that the load balancer thinks the cluster is not running, even though listclusters says it is and I can successfully log in using sshmaster. I still can't figure out why this is the case.
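
One way to check for the mismatch Raj describes is to compare the nodes that listclusters reports as running against the hosts that qhost knows about. The sketch below assumes the "alias running instance-id" line format seen in listclusters output. The helper names are made up for illustration, and the qhost host list is hypothetical since I have not pasted qhost output in this thread.

```python
import re

# Matches lines such as "    node002 running i-f4d6470b".
NODE_LINE = re.compile(r"^\s*(\S+)\s+running\s+(i-[0-9a-f]+)\s*$")


def running_nodes(listclusters_output):
    """Map node alias -> instance id for nodes listed as running."""
    nodes = {}
    for line in listclusters_output.splitlines():
        match = NODE_LINE.match(line)
        if match:
            nodes[match.group(1)] = match.group(2)
    return nodes


def missing_from_sge(listclusters_output, qhost_hosts):
    """Aliases that listclusters shows as running but qhost does not list."""
    known = set(qhost_hosts)
    return sorted(a for a in running_nodes(listclusters_output) if a not in known)


# Excerpt of listclusters output in the same line format.
LISTCLUSTERS_EXCERPT = """\
Cluster nodes:
     master running i-609aa79c
    node002 running i-f4d6470b
    node014 running i-52d6b2ad
"""

# Hypothetical qhost host list, for illustration only.
print(missing_from_sge(LISTCLUSTERS_EXCERPT, ["master", "node002"]))
```

Any alias this prints would be a node that is billed as running but, per the quoted explanation, probably not usable for jobs.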

Apologies for the spam.

ubuntu_at_ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
StarCluster - ( (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to

!!! ERROR - cluster dragon-1.3.0 is not running
ubuntu_at_ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config listclusters

dragon-1.3.0 (security group: _at_sc-dragon-1.3.0)
Launch time: 2015-04-26 03:40:22
Uptime: 43 days, 11:42:08
VPC: vpc-849ec2e1
Subnet: subnet-b6901fef
Zone: us-east-1d
Keypair: bean_key
EBS volumes:
    vol-34a33e73 on master:/dev/sdz (status: attached)
    vol-dc7beb9b on master:/dev/sdx (status: attached)
    vol-57148c10 on master:/dev/sdy (status: attached)
    vol-8ba835cc on master:/dev/sdv (status: attached)
    vol-9253ced5 on master:/dev/sdw (status: attached)
Cluster nodes:
     master running i-609aa79c
    node002 running i-f4d6470b
    node014 running i-52d6b2ad
    node016 running i-fb9ae804
    node017 running i-b275084d
    node020 running i-14532eeb
    node021 running i-874b3678
    node022 running i-5abfc2a5
    node023 running i-529ee3ad
    node024 running i-0792eff8
Total nodes: 10

ubuntu_at_ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config sshmaster dragon-1.3.0
StarCluster - ( (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to

The authenticity of host ' (' can't be established.
ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (ECDSA) to the list of known hosts.
On Mon, Jun 8, 2015 at 11:10 AM David Koppstein wrote:

I noticed that my load balancer stopped working -- specifically, it has stopped deleting unnecessary nodes. It's been running fine for about three weeks.

I have a small t2.micro instance load-balancing a cluster of m3.xlarge instances. The cluster is running Ubuntu 14.04 using the shared 14.04 AMI ami-38b99850.

The loadbalancer process is still running (started with nohup CMD &, where CMD is the loadbalancer command below):

ubuntu_at_ip-10-0-0-20:~$ ps -ef | grep load
ubuntu 11784 11730 0 15:04 pts/1 00:00:00 grep --color=auto load
ubuntu 19493 1 0 Apr26 ? 01:25:03 /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0

Queue has been empty for several days.

dkoppstein_at_master:/dkoppstein/150521SG_v1.9_round2$ qstat -u "*"

However, there are about 8 nodes that have been running over the weekend and are not being killed despite -n 1. If anyone has any guesses as to why the loadbalancer might stop working, please let me know so I can prevent this from happening in the future.


