Thanks, Steve. Indeed, I also noticed that the starcluster rn (removenode)
command wasn't working; it fails at the qconf step:
git:(master) ✗ starcluster -c output/starcluster_config.ini rn -n 8 dragon-1.3.0
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu
*** WARNING - Setting 'AWS_SECRET_ACCESS_KEY' from environment...
*** WARNING - Setting 'AWS_ACCESS_KEY_ID' from environment...
Remove 8 nodes from dragon-1.3.0 (y/n)? y
>>> Running plugin starcluster.plugins.users.CreateUsers
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Removing node024 from SGE
!!! ERROR - Error occured while running plugin
'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && qconf -dconf node024'
!!! ERROR - failed with status 1:
!!! ERROR - can't resolve hostname "node024"
!!! ERROR - can't delete configuration "node024" from list:
!!! ERROR - configuration does not exist
So it looks like the cluster somehow ended up in a state where one machine
was still running and still in the StarCluster security group, but was no
longer configured in SGE to run jobs. If anyone has run into this behavior
before and knows how to prevent it, I'd appreciate the feedback, as the
cost of ten big nodes adds up quickly =)
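For anyone who hits the same state, here's a rough sketch of the manual check
and cleanup I have in mind (assuming the standard OGS/SGE qconf commands on the
master and the AWS CLI for terminating the orphan instance; untested, so adapt
as needed):
```
# On the master: list the hosts SGE actually knows about
qhost                    # execution hosts and their load
qconf -sel               # execution host list
qconf -shgrp @allhosts   # members of the default host group

# If node024 is missing from all of the above, SGE has nothing to delete
# (hence the qconf error), and the orphan instance can be terminated
# directly, e.g. with the AWS CLI using the instance id that
# listclusters reports for node024:
aws ec2 terminate-instances --instance-ids i-0792eff8
```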
Thanks,
David
On Mon, Jun 8, 2015 at 12:04 PM Steve Darnell <darnells_at_dnastar.com> wrote:
> Raj has commented in the past that the load balancer does not use the
> same logic as listclusters:
> http://star.mit.edu/cluster/mlarchives/2585.html
>
> From that archived message:
>
>
>
> The elastic load balancer parses the output of 'qhost' on the cluster:
>
>
> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59
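>
> A quick way to see that difference from a shell (just an illustration, not
> the balancer's actual code) is to compare the hosts qhost reports on the
> master with the instances listclusters reports:
>
> ```
> # on the master: hosts SGE can actually schedule on (skip the 3 header lines)
> qhost | awk 'NR > 3 {print $1}'
>
> # from the StarCluster host: instances in each cluster's security group
> starcluster listclusters
> ```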
>
> I don't remember the exact reason for using that instead of the same
> logic as 'listclusters' above, but here's my guess a few years after the
> fact:
>
> - Avoids another remote API call to AWS' tagging service to retrieve the
> tags for all instances within an account. This needs to be called every
> minute, so a speedy call to your cluster instead of to a remote API is
> beneficial
>
> - qhost outputs the number of machines correctly configured and able to
> process work. If a machine shows up in 'listclusters' but not in 'qhost',
> it's likely not usable to process jobs and would probably need manual
> cleanup.
>
> HTH
>
> Raj
>
>
>
> From: starcluster-bounces_at_mit.edu [mailto:starcluster-bounces_at_mit.edu] On Behalf Of David Koppstein
> Sent: Monday, June 08, 2015 10:24 AM
> To: starcluster_at_mit.edu
> Subject: Re: [StarCluster] load balancer stopped working?
>
>
>
> Edit:
>
> It appears that the load balancer thinks the cluster is not running, even
> though listclusters says it is and I can successfully log in using
> sshmaster. I still can't figure out why this is the case.
>
>
>
> Apologies for the spam.
>
>
>
> ubuntu_at_ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python
> /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config
> loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
>
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>
> Software Tools for Academics and Researchers (STAR)
>
> Please submit bug reports to starcluster_at_mit.edu
>
>
>
> !!! ERROR - cluster dragon-1.3.0 is not running
>
> ubuntu_at_ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python
> /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config
> listclusters
>
>
>
> -----------------------------------------------
>
> dragon-1.3.0 (security group: _at_sc-dragon-1.3.0)
>
> -----------------------------------------------
>
> Launch time: 2015-04-26 03:40:22
>
> Uptime: 43 days, 11:42:08
>
> VPC: vpc-849ec2e1
>
> Subnet: subnet-b6901fef
>
> Zone: us-east-1d
>
> Keypair: bean_key
>
> EBS volumes:
>
> vol-34a33e73 on master:/dev/sdz (status: attached)
>
> vol-dc7beb9b on master:/dev/sdx (status: attached)
>
> vol-57148c10 on master:/dev/sdy (status: attached)
>
> vol-8ba835cc on master:/dev/sdv (status: attached)
>
> vol-9253ced5 on master:/dev/sdw (status: attached)
>
> Cluster nodes:
>
> master running i-609aa79c 52.0.250.221
>
> node002 running i-f4d6470b 52.4.102.101
>
> node014 running i-52d6b2ad 52.7.159.255
>
> node016 running i-fb9ae804 54.88.226.88
>
> node017 running i-b275084d 52.5.86.254
>
> node020 running i-14532eeb 52.5.111.191
>
> node021 running i-874b3678 54.165.179.93
>
> node022 running i-5abfc2a5 54.85.47.151
>
> node023 running i-529ee3ad 52.1.197.60
>
> node024 running i-0792eff8 54.172.58.21
>
> Total nodes: 10
>
>
>
> ubuntu_at_ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python
> /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config
> sshmaster dragon-1.3.0
>
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>
> Software Tools for Academics and Researchers (STAR)
>
> Please submit bug reports to starcluster_at_mit.edu
>
>
>
> The authenticity of host '52.0.250.221 (52.0.250.221)' can't be
> established.
>
> ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.
>
> Are you sure you want to continue connecting (yes/no)? yes
>
> Warning: Permanently added '52.0.250.221' (ECDSA) to the list of known
> hosts.
>
> [StarCluster ASCII-art banner]
>
>
>
> StarCluster Ubuntu 13.04 AMI
>
> Software Tools for Academics and Researchers (STAR)
>
> Homepage: http://star.mit.edu/cluster
>
> Documentation: http://star.mit.edu/cluster/docs/latest
>
> Code: https://github.com/jtriley/StarCluster
>
> Mailing list: http://star.mit.edu/cluster/mailinglist.html
>
>
>
> This AMI Contains:
>
>
>
> * Open Grid Scheduler (OGS - formerly SGE) queuing system
>
> * Condor workload management system
>
> * OpenMPI compiled with Open Grid Scheduler support
>
> * OpenBLAS - Highly optimized Basic Linear Algebra Routines
>
> * NumPy/SciPy linked against OpenBlas
>
> * Pandas - Data Analysis Library
>
> * IPython 1.1.0 with parallel and notebook support
>
> * Julia 0.3pre
>
> * and more! (use 'dpkg -l' to show all installed packages)
>
>
>
> Open Grid Scheduler/Condor cheat sheet:
>
>
>
> * qstat/condor_q - show status of batch jobs
>
> * qhost/condor_status- show status of hosts, queues, and jobs
>
> * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)
>
> * qdel/condor_rm - delete batch jobs (e.g. qdel 7)
>
> * qconf - configure Open Grid Scheduler system
>
>
>
> Current System Stats:
>
>
>
> System load: 0.0 Processes: 226
>
> Usage of /: 80.8% of 78.61GB Users logged in: 2
>
> Memory usage: 6% IP address for eth0: 10.0.0.213
>
> Swap usage: 0%
>
>
>
> => There are 2 zombie processes.
>
>
>
> https://landscape.canonical.com/
>
> Last login: Sun Apr 26 04:50:46 2015 from
> c-24-60-255-35.hsd1.ma.comcast.net
>
> root_at_master:~#
>
>
>
>
>
>
>
> On Mon, Jun 8, 2015 at 11:10 AM David Koppstein <david.koppstein_at_gmail.com>
> wrote:
>
> Hi,
>
>
>
> I noticed that my load balancer stopped working -- specifically, it has
> stopped deleting unnecessary nodes. It's been running fine for about three
> weeks.
>
>
>
> I have a small t2.micro instance load balancing a cluster of m3.xlarge
> instances. The cluster is running Ubuntu 14.04, using the shared 14.04 AMI
> ami-38b99850.
>
>
>
> The loadbalancer process is still running (started with nohup CMD &, where
> CMD is the loadbalancer command below):
>
>
>
> ```
> ubuntu_at_ip-10-0-0-20:~$ ps -ef | grep load
> ubuntu 11784 11730 0 15:04 pts/1 00:00:00 grep --color=auto load
> ubuntu 19493 1 0 Apr26 ? 01:25:03 /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
> ```
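>
> For what it's worth, redirecting the balancer's output to a predictable log
> file (instead of relying on nohup.out) would at least leave something to
> inspect when it misbehaves; a variant of the same command with plain shell
> redirection (log path is arbitrary) would be:
>
> ```
> # same loadbalance invocation, but with stdout/stderr captured to a log file
> nohup /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config \
>     loadbalance -n 1 -m 20 -w 300 dragon-1.3.0 > /home/ubuntu/loadbalance.log 2>&1 &
> ```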
>
>
>
> The queue has been empty for several days.
>
>
>
> ```
>
> dkoppstein_at_master:/dkoppstein/150521SG_v1.9_round2$ qstat -u "*"
>
> dkoppstein_at_master:/dkoppstein/150521SG_v1.9_round2$
>
> ```
>
>
>
> However, there are about 8 nodes that have been running over the weekend
> and are not being killed despite -n 1. If anyone has any guesses as to why
> the load balancer might stop working, please let me know so I can prevent
> this from happening in the future.
>
>
>
> Thanks,
>
> David
>
Received on Mon Jun 08 2015 - 12:37:26 EDT