At the OCF we have fully migrated all services from Mesos/Marathon to
Kubernetes. In this document we will explain the design of our
Kubernetes cluster while also touching briefly on relevant core concepts. This
page is not a
HOWTO for deploying services or troubleshooting a bad
cluster. Rather, it is meant to explain architectural considerations such that
current work can be built upon. Although, reading this document will help you
both deploy services in the OCF Kubernetes cluster and debug issues when they
Kubernetes is a container orchestration system open sourced by Google. Its main purpose is to schedule services to run on a cluster of computers while abstracting away the existence of the cluster from the services. The design of Kubernetes is loosely based on Google's internal orchestration system Borg. Kubernetes is now maintained by the Cloud Native Computing Foundation, which is a part of the Linux Foundation. Kubernetes can flexibly handle replication, impose resource limits, and recover quickly from failures.
A Kubernetes cluster consists of "master" nodes and "worker" nodes. In short, master nodes share state to manage the cluster and schedule jobs to run on workers. It is considered best practice to run an odd number of masters, and currently our cluster has three masters.
One master is elected as a leader of the cluster. The leader has the ability to
commit writes to the KVS.
etcd then reliably replicates this state across
every master, so that if the leader fails, another master can be elected and no
state will be lost in the process. Do note that the state stored in
scheduling state, service locations, and other cluster metadata; it does not
keep state for the services running on the cluster.
Consider a cluster of N members. When masters form quorum to agree on cluster state, quorum must have at least ⌊N/2⌋+1 members. Every new odd number in a cluster with M > 1 masters adds one more node of fault tolerance. Therefore, adding an extra node to an odd numbered cluster gives us nothing. If interested read more here.
Workers are the brawn in the Kubernetes cluster. While master nodes are
constantly sharing data, managing the control plane (routing inside the
Kubernetes cluster), and scheduling services, workers primarily run
kubelet is the service that executes pods as dictated by the
control plane, performs health checks, and recovers from pod failures should
they occur. Workers also run an instance of
kube-proxy, which forwards
control plane traffic to the correct
In the Kubernetes world, pods are the smallest computing unit. A pod is made up of one or more containers. The difference between a pod and a standalone container is best illustrated by an example. Consider ocfweb; it is composed of several containers—the web container, static container, and worker container. In Kubernetes, together these three containers form one pod, and it is pods that can be scaled up or down. A failure in any of these containers indicates a failure in the entire pod. An astute reader might wonder: if pods can be broken down into containers, how can pods possibly be the smallest unit? Do note that if one wished to deploy a singleton container, it would still need to be wrapped in the pod abstraction for Kubernetes to do anything with it.
While pods are essential for understanding Kubernetes, when writing services we don't actually deal in pods but one further abstraction, deployments, which create pods for us.
Since almost all OCF architecture is bootstapped using Puppet, it was necessary
for us to do the same with Kubernetes. We rely on the
puppetlabs-kubernetes module to handle initial
bootstrapping and bolt OCF specific configurations on top of it.
puppetlabs-kubernetes performs two crucial tasks:
kube-dns, initializes the cluster, and applies a networking backend.
Do note that
puppetlabs-kubernetes is still very much a work in progress. If
you notice an issue in the module you are encouraged to write a patch and send
All the private keys and certs for the PKI are in the puppet private share, in
/opt/puppet/shares/private/kubernetes. We won't go into detail of everything
contained there, but Kubernetes and
etcd communication is authenticated using
client certificates. All the necessary items for workers are included in
os/Debian.yaml, although adding a new master to the cluster requires a manual
re-run of kubetool to generate new
etcd server and
etcd peer certs.
Currently, the OCF has three Kubernetes masters: (1)
autocrat. A Container Networking Interface (
cni) is the last piece
required for a working cluster. The
cni's purpose is to faciltate intra-pod
puppetlabs-kubernetes supports two choices:
flannel. Both solutions work out-the-box, and we've had success with
flannel thus far so we've stuck with it.
One of the challenges with running Kubernetes on bare-metal is getting traffic
into the cluster. Kubernetes is commonly deployed on
so Kubernetes has native support for ingress on these providers. Since we are
on bare-metal, we designed our own scheme for ingressing traffic.
The figure below demonstrates a request made for
For the purpose of simplicity, we assume
deadlock is the current
master, and that
nginx will send this request to
---------------------------------------------------- | Kubernetes Cluster | nginx | | ---------- | Ingress Ocfweb Pod | |autocrat| | Host: Templates --------- --------- | ---------- | ---------> |Worker1| - |Worker1| | | / --------- \ --------- | | / | | nginx | / Ingress | Templates Pod | ------------------- ✘ SSL / / --------- | --------- | REQ --> | deadlock: | ---> - |Worker2| ---> |Worker2| | |keepalived master| \ --------- --------- | ------------------- | | | | nginx | Ingress Grafana Pod | ---------- | --------- --------- | | coup | | |Worker3| |Worker3| | ---------- | --------- --------- | ----------------------------------------------------
All three Kubernetes masters are running an instance of Nginx.
Furthermore, the masters are all running
keepalived. The traffic for any
Kubernetes HTTP service will go through the current
keepalived master, which
holds the virtual IP for all Kubernetes services. The
keepalived master is
randomly chosen but will move hosts in the case of failure.
terminate ssl and pass the request on to a worker running Ingress
Nginx. Right now ingress is running as a NodePort
service on all workers (Note: we can easily change this to be a subset of
workers if our cluster scales such that this is no longer feasible). The
ingress worker will inspect the
Host header and forward the request on to the
appropriate pod where the request is finally processed. Do note that the
target pod is not necessarily on the same worker that routed the traffic.
MetalLB was created so a bare-metal Kubernetes cluster could use
LoadBalancer in Service definitions. The problem is, in
L2 mode, it takes a
pool of IPs and puts your service on a random IP in that pool. How one makes
DNS work in this configuration is completely unspecified. We would need to
dynamically update our DNS, which sounds like a myriad of outages waiting to
L3 mode would require the OCF dedicating a router to Kubernetes.
In our previous Marathon configuration, we gave each service a port on the load
balancer and traffic coming into that port is routed accordingly. First, in
Kubernetes we would emulate this behavior using
NodePort services, and all
Kubernetes documentation discourages this. Second, it's ugly. Every time we add
a new service we need to modify the load balancer configuration in Puppet. With
our Kubernetes configuration we can add unlimited HTTP services without
But wait! The Kubernetes documentation says not to use
NodePort services in
production, and you just said that above too! True, but we only run one
ingress-nginx, to keep us from needing other
services. SoundCloud, a music streaming company that runs massive bare-metal
Kubernetes clusters, also has an interesting blog post about running NodePort