These are general steps:
Can you ping
lb.ocf.berkeley.edu? If not, then most likely none of the
three master servers are using that IP, and so
keepalived is probably broken.
Find the current keepalived host, then request the HAProxy config that
marathon-lb serves on it (for example, curl mesos0:9090/_haproxy_getconfig
when mesos0 is the current host).
If you get connection refused, then
marathon-lb is probably broken. If you
get back a bad config (missing backends or servers), Marathon is probably
unhealthy (check the
marathon-lb logs; they'll probably indicate they can't
reach Marathon, or similar). If everything here looks good, move on.
Take one of the
server entries from the previous step and try to curl it.
For example, if you saw the line:
server hal_169_229_226_10_31754 169.229.226.10:31754 check inter 60s fall 4
you would run
curl 169.229.226.10:31754 and make sure you get a response.
If you do, then move on. If not, it's most likely that
marathon-lb has a
different world-view than Marathon (maybe Marathon is unhealthy?). Check the
marathon-lb logs.
Most likely at this stage,
nginx is broken on the load balancers. Try to
curl the load balancers on port 80 and 443, and check the nginx logs.
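The backend check in the steps above can be scripted. The following is a minimal sketch, not an official OCF tool: the function names `list_backend_servers` and `check_servers` are made up for illustration, and the five-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash

# Print the address:port of every `server` line in an HAProxy config on stdin.
# (Hypothetical helper name; the third field of a server line is its address.)
list_backend_servers() {
    awk '$1 == "server" { print $3 }'
}

# Curl each address:port read from stdin and report whether it responded.
# The 5-second timeout is an assumption; adjust as needed.
check_servers() {
    local addr
    while read -r addr; do
        if curl --silent --max-time 5 --output /dev/null "$addr"; then
            echo "ok   $addr"
        else
            echo "FAIL $addr"
        fi
    done
}

# Example, assuming mesos0 is the current keepalived host:
#   curl -s mesos0:9090/_haproxy_getconfig | list_backend_servers | check_servers
```

Any address reported as FAIL is a candidate for the "marathon-lb has a different world-view than Marathon" case.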
You can run
ssh lb and see which host you are connected to, but you will probably have
to deal with the host key changing if you do this often.
TODO: is there a better way?
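One possible alternative, offered only as a sketch: ask each master whether it currently holds the load balancer address. The helper name `holds_address` is hypothetical, and the master hostname mesos2 in the usage comment is an assumption (only mesos0 and mesos1 appear on this page).

```shell
#!/usr/bin/env bash

# Succeed if the `ip -o -4 addr show` output on stdin lists the given address.
# In that one-line-per-address format, field 4 is "address/prefixlen".
holds_address() {
    local vip="$1"
    awk -v vip="$vip" '
        { split($4, parts, "/"); if (parts[1] == vip) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example (hostnames are assumptions; adjust to the real masters):
#   vip="$(dig +short lb.ocf.berkeley.edu)"
#   for host in mesos0 mesos1 mesos2; do
#       ssh "$host" ip -o -4 addr show | holds_address "$vip" && echo "$host holds $vip"
#   done
```

This avoids connecting through the shared lb name, so the changing host key should not come up.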
TODO: not sure
marathon-lb is a systemd service running as ocf-lb:
systemctl status ocf-lb
journalctl -eu ocf-lb
systemctl restart ocf-lb
ckuehl@supernova:~$ curl mesos0:9090/_haproxy_getconfig
global
  daemon
  log /dev/log local0
  log /dev/log local1 notice
  maxconn 50000
[...]
If everything is working, you should see a backend for each app exposed on the load balancer, with one or more servers in it. For example, here is a working ocfweb backend with three servers:
backend ocfweb_web_10002
  balance roundrobin
  mode tcp
  server hal_169_229_226_10_31754 169.229.226.10:31754 check inter 60s fall 4
  server pandemic_169_229_226_14_31005 169.229.226.14:31005 check inter 60s fall 4
  server pandemic_169_229_226_14_31419 169.229.226.14:31419 check inter 60s fall 4
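To spot a backend with no servers without reading the whole config, something like the following could help. It is a rough sketch: `count_backend_servers` is a hypothetical name, and it assumes each section keyword (`backend`, `frontend`, and so on) starts its own line, as in normal HAProxy output.

```shell
#!/usr/bin/env bash

# Read an HAProxy config on stdin and print "<backend name> <server count>"
# for each backend, in the order the backends appear.
count_backend_servers() {
    awk '
        $1 == "backend" { name = $2; order[++n] = name; count[name] = 0; next }
        # Any other section keyword ends the current backend.
        $1 == "global" || $1 == "defaults" || $1 == "frontend" || $1 == "listen" { name = ""; next }
        $1 == "server" && name != "" { count[name]++ }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Example, assuming mesos0 is the current keepalived host:
#   curl -s mesos0:9090/_haproxy_getconfig | count_backend_servers
```

A backend printed with a count of 0 corresponds to the "missing servers" symptom.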
ckuehl@supernova:~$ dig leader.mesos @mesos1
[...]
;; QUESTION SECTION:
;leader.mesos.                  IN      A

;; ANSWER SECTION:
leader.mesos.           1       IN      A       220.127.116.11
[...]
To check against the main DNS server (and not the masters), just run the same
query without the @mesos1 argument.
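When comparing answers from several servers, it can help to reduce the dig output to just the addresses. This is a small sketch with a hypothetical helper name, `answer_addresses`; running `dig +short` also prints only the answer data and may be all you need.

```shell
#!/usr/bin/env bash

# Print the address of every A record in the ANSWER SECTION of dig output on stdin.
# Answer lines have the form: name, TTL, class, type, address.
answer_addresses() {
    awk '
        /^;; ANSWER SECTION:/ { in_answer = 1; next }
        /^;;/ || /^$/         { in_answer = 0 }
        in_answer && $4 == "A" { print $5 }
    '
}

# Example: compare a master's answer with the default resolver's answer.
#   dig leader.mesos @mesos1 | answer_addresses
#   dig leader.mesos | answer_addresses
```

If the two commands print different addresses, or one prints nothing, the main DNS server and the masters disagree about the current leader.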
mesos-dns is a systemd service.
systemctl status mesos-dns
journalctl -eu mesos-dns
systemctl restart mesos-dns