Edit:
It appears that the load balancer thinks the cluster is not running, even
though listclusters says it is and I can successfully log in using
sshmaster. I still can't figure out why this is the case.
Apologies for the spam.
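For what it's worth, one way to cross-check what EC2 itself reports,
independent of StarCluster (this assumes the AWS CLI is set up with the same
credentials; the security group name is taken from the listclusters output
below):

```
# ask EC2 directly for running instances in the cluster's security group
# (region inferred from the cluster's zone, us-east-1d)
aws ec2 describe-instances \
    --region us-east-1 \
    --filters "Name=instance.group-name,Values=@sc-dragon-1.3.0" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
    --output text
```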
```
ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

!!! ERROR - cluster dragon-1.3.0 is not running
```
```
ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config listclusters
-----------------------------------------------
dragon-1.3.0 (security group: @sc-dragon-1.3.0)
-----------------------------------------------
Launch time: 2015-04-26 03:40:22
Uptime: 43 days, 11:42:08
VPC: vpc-849ec2e1
Subnet: subnet-b6901fef
Zone: us-east-1d
Keypair: bean_key
EBS volumes:
    vol-34a33e73 on master:/dev/sdz (status: attached)
    vol-dc7beb9b on master:/dev/sdx (status: attached)
    vol-57148c10 on master:/dev/sdy (status: attached)
    vol-8ba835cc on master:/dev/sdv (status: attached)
    vol-9253ced5 on master:/dev/sdw (status: attached)
Cluster nodes:
     master running i-609aa79c 52.0.250.221
    node002 running i-f4d6470b 52.4.102.101
    node014 running i-52d6b2ad 52.7.159.255
    node016 running i-fb9ae804 54.88.226.88
    node017 running i-b275084d 52.5.86.254
    node020 running i-14532eeb 52.5.111.191
    node021 running i-874b3678 54.165.179.93
    node022 running i-5abfc2a5 54.85.47.151
    node023 running i-529ee3ad 52.1.197.60
    node024 running i-0792eff8 54.172.58.21
Total nodes: 10
```
```
ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config sshmaster dragon-1.3.0
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

The authenticity of host '52.0.250.221 (52.0.250.221)' can't be established.
ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '52.0.250.221' (ECDSA) to the list of known hosts.

          _                 _           _
__/\_____| |_ __ _ _ __ ___| |_   _ ___| |_ ___ _ __
\    / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|
/_  _\__ \ || (_| | |  | (__| | |_| \__ \ ||  __/ |
  \/ |___/\__\__,_|_|  \___|_|\__,_|___/\__\___|_|

StarCluster Ubuntu 13.04 AMI
Software Tools for Academics and Researchers (STAR)
Homepage: http://star.mit.edu/cluster
Documentation: http://star.mit.edu/cluster/docs/latest
Code: https://github.com/jtriley/StarCluster
Mailing list: http://star.mit.edu/cluster/mailinglist.html

This AMI Contains:
* Open Grid Scheduler (OGS - formerly SGE) queuing system
* Condor workload management system
* OpenMPI compiled with Open Grid Scheduler support
* OpenBLAS - Highly optimized Basic Linear Algebra Routines
* NumPy/SciPy linked against OpenBlas
* Pandas - Data Analysis Library
* IPython 1.1.0 with parallel and notebook support
* Julia 0.3pre
* and more! (use 'dpkg -l' to show all installed packages)

Open Grid Scheduler/Condor cheat sheet:
* qstat/condor_q - show status of batch jobs
* qhost/condor_status - show status of hosts, queues, and jobs
* qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)
* qdel/condor_rm - delete batch jobs (e.g. qdel 7)
* qconf - configure Open Grid Scheduler system

Current System Stats:
  System load:  0.0               Processes:           226
  Usage of /:   80.8% of 78.61GB  Users logged in:     2
  Memory usage: 6%                IP address for eth0: 10.0.0.213
  Swap usage:   0%

=> There are 2 zombie processes.

https://landscape.canonical.com/
Last login: Sun Apr 26 04:50:46 2015 from c-24-60-255-35.hsd1.ma.comcast.net
root@master:~#
```
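Once I figure out why loadbalance thinks the cluster is down, I'll restart
the balancer with its output captured to a log so the next failure leaves a
trace. (Since it was originally started with a bare `nohup CMD &`, anything
it printed should still be sitting in nohup.out, which I'll check too.)
Roughly, with an arbitrary log path:

```
# same loadbalance invocation as above, but with stdout/stderr appended
# to a log file (path is just my choice) instead of nohup.out
nohup /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster \
    -c /home/ubuntu/.starcluster/config \
    loadbalance -n 1 -m 20 -w 300 dragon-1.3.0 \
    >> /home/ubuntu/loadbalance.log 2>&1 &
```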
On Mon, Jun 8, 2015 at 11:10 AM David Koppstein <david.koppstein@gmail.com> wrote:
> Hi,
>
> I noticed that my load balancer stopped working -- specifically, it has
> stopped deleting unnecessary nodes. It had been running fine for about
> three weeks.
>
> I have a small t2.micro instance load balancing a cluster of m3.xlarge
> instances. The cluster is running Ubuntu 14.04, using the shared 14.04
> AMI ami-38b99850.
>
> The loadbalancer process is still running (started with nohup CMD &, where
> CMD is the loadbalancer command below):
>
> ```
> ubuntu@ip-10-0-0-20:~$ ps -ef | grep load
> ubuntu   11784 11730  0 15:04 pts/1    00:00:00 grep --color=auto load
> ubuntu   19493     1  0 Apr26 ?        01:25:03 /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
> ```
>
> The queue has been empty for several days:
>
> ```
> dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$ qstat -u "*"
> dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$
> ```
>
> However, there are about 8 nodes that have been running over the weekend
> and are not being killed despite -n 1. If anyone has any guesses as to why
> the loadbalancer might stop working, please let me know so I can prevent
> this from happening in the future.
>
> Thanks,
> David
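In the meantime, the idle nodes can be culled by hand with removenode. I'm
going from memory on the exact syntax, so double-check it against
`starcluster removenode --help`; the node aliases are the ones shown by
listclusters above:

```
# manually remove one idle node (repeat per node); syntax from memory,
# verify with: starcluster removenode --help
/opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config \
    removenode dragon-1.3.0 node002
```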