StatefulSets

Statefulness

Kubernetes has a concept of StatefulSets (formerly called PetSets, from the Pets vs. Cattle analogy). Like Deployments, they run multiple Pods, but each Pod gets a stable, unique identity, separating the stateful part from the stateless one. StatefulSets can be used to build full-blown clusters running on Kubernetes, where each cluster member has a distinct identity that cannot easily be replaced, most often combined with the use of persistent storage.

Unfortunately, in the case of PostgreSQL, running a scalable and highly available cluster on Kubernetes is rather complex. So let’s start with a simpler example: an Apache ZooKeeper cluster. While doing so we will use all the concepts that we have learned so far, and add some new ones.

ZooKeeper Basics

Apache ZooKeeper is a distributed, open-source coordination service for distributed applications, used e.g. by Kafka, Solr, Hadoop, and Mesos. ZooKeeper allows you to read, write, and observe updates to data. Data is organized in a file-system-like hierarchy and replicated to all ZooKeeper servers in the ensemble (the set of ZooKeeper servers). All operations on data are atomic and sequentially consistent. ZooKeeper ensures this by using the Zab consensus protocol (ZooKeeper Atomic Broadcast) to replicate a state machine across all servers in the ensemble.

The ensemble uses the Zab protocol to elect a leader, and it cannot write data until that election is complete. Once a leader has been elected, the ensemble uses Zab to replicate every write to a quorum before acknowledging it and making it visible to clients. Assuming no weighted quorums are in use, a quorum is a majority component of the ensemble that contains the current leader. For instance, if the ensemble has three servers, a component containing the leader and one other server constitutes a quorum. If the ensemble cannot achieve a quorum, it cannot write data.

ZooKeeper servers keep their entire state machine in memory and write every mutation to a durable WAL (Write Ahead Log) on storage media. When a server crashes, it can recover its previous state by replaying the WAL. To prevent the WAL from growing without bound, ZooKeeper servers periodically snapshot their in-memory state to storage media. These snapshots can be loaded directly into memory, and all WAL entries that preceded the snapshot may be discarded.

Exercise - Create ZooKeeper cluster services

First, let’s take a look at the required service definitions. You already have the required files checked out:

head -n 12 zookeeper-services.yaml

apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
    - port: 2181
      name: client
  selector:
    app: zk

So far, so normal. type: ClusterIP is the default, so we don’t specify it here, and we label our service and name our port. This service will be used for client access to the cluster, and accordingly it carries the -cs suffix.

Now on to something new:

tail -n 15 zookeeper-services.yaml

apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
    - port: 2888
      name: server
    - port: 3888
      name: leader-election
  clusterIP: None
  selector:
    app: zk

By specifying clusterIP: None we create a so-called headless service, which will allow us to address the individual Pods of our soon-to-be-created StatefulSet, as opposed to a “normal” service, which would redirect to an arbitrary Pod matching the service selector. This service will be used for inter-cluster communication, and it carries the -hs suffix to indicate the headless service configuration.

Now let’s create the services:

kubectl apply -f zookeeper-services.yaml

service/zk-cs created
service/zk-hs created
Tip

Generally it is advisable to create all services first, before creating any Deployment or StatefulSet that will make use of them, as this way all service definitions will also be available within the newly created Pods as convenient environment variables. Check out

kubectl run -it --rm --image busybox environmentcheck

and then inside the container

env | grep ZK_CS | sort; exit
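The output should look roughly like the following sketch; these are the standard Docker-link-style service variables Kubernetes injects, and the ClusterIP will of course differ in your cluster:

ZK_CS_PORT=tcp://10.96.0.123:2181
ZK_CS_PORT_2181_TCP=tcp://10.96.0.123:2181
ZK_CS_PORT_2181_TCP_ADDR=10.96.0.123
ZK_CS_PORT_2181_TCP_PORT=2181
ZK_CS_PORT_2181_TCP_PROTO=tcp
ZK_CS_SERVICE_HOST=10.96.0.123
ZK_CS_SERVICE_PORT=2181
ZK_CS_SERVICE_PORT_CLIENT=2181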

Exercise - Create a ZooKeeper cluster StatefulSet

Now for a start let’s just dive in and create the actual StatefulSet, watching how it creates its Pods:

kubectl apply -f zookeeper-statefulset.yaml; kubectl get pods -w -l app=zk

The StatefulSet controller creates three Pods, and each Pod has a container with a ZooKeeper server. Once the zk-2 Pod is Running and Ready, use Ctrl-C to terminate kubectl.

You will notice that the Pods don’t follow the usual naming scheme: instead of having -<ReplicaSetId>-<PodId> appended to the Deployment’s name, they merely carry an ordinal index indicating their unique identity within the StatefulSet. This is a requirement for the ZooKeeper cluster to work: no two participants may claim the same unique identifier, and the StatefulSet provides an easy means to achieve that.
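While we will only inspect selected parts of zookeeper-statefulset.yaml below, its header presumably looks roughly like the following sketch: serviceName ties the Pods to the headless service, and the remaining values are inferred from the behaviour we observe.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk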

And not only are these unique, reliably named identities available within the individual Pods, they are also reachable via the headless service we created above:

kubectl exec zk-0 -- sh -c 'hostname -f; nc -v -z zk-1.zk-hs 2181'

(nc, or netcat, is used in zero-I/O mode (-z) to check whether the specified port is reachable on the specified host, with verbose (-v) output)

zk-0.zk-hs.default.svc.cluster.local
Connection to zk-1.zk-hs 2181 port [tcp/*] succeeded!

If a Pod ever needs to be recreated, Kubernetes takes care to update the Pod’s DNS entry with the new IP address, so these identifiers remain valid.
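To see the difference between the headless service and the regular client service, you can run a DNS lookup from a throwaway Pod (the name dnscheck is arbitrary); zk-hs should resolve to the individual Pod IPs, while zk-cs resolves to its single ClusterIP:

kubectl run -it --rm --image busybox dnscheck

and then inside the container

nslookup zk-hs; nslookup zk-cs; exit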

Next, before we investigate how this cluster works in detail, let’s first check whether it works at all.

Exercise - Testing the ZooKeeper cluster

The most basic sanity test is to write data to one ZooKeeper server and to read the data from another:

kubectl exec zk-0 -- zkCli.sh create /hello world

[...]
WATCHER::

WatchedEvent state:SyncConnected type:None path:null
Created /hello

kubectl exec zk-1 -- zkCli.sh get /hello 2> /dev/null | tail -n 4

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
world

Feel free to similarly add data on any one server and read it on any other server; it should just work.
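You can also go through the zk-cs client service instead of addressing a particular server; a quick sketch, relying on zkCli.sh accepting a -server host:port argument:

kubectl exec zk-0 -- zkCli.sh -server zk-cs:2181 get /hello 2> /dev/null | tail -n 4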

Also feel free to delete a random single Pod of the StatefulSet, e.g. kubectl delete pod zk-2, and you’ll notice that the StatefulSet immediately recreates it. You will then be able to query the newly recreated Pod just fine for the data you entered before.
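For instance, using only commands we have already seen (press Ctrl-C once zk-2 is Running and Ready again before querying it):

kubectl delete pod zk-2; kubectl get pods -w -l app=zk

kubectl exec zk-2 -- zkCli.sh get /hello 2> /dev/null | tail -n 4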

And we can even delete the whole StatefulSet, recreate it, and still access our data:

kubectl delete statefulsets.apps zk; kubectl get pods -w -l app=zk

When zk-0 is fully terminated, use Ctrl-C to terminate kubectl, and then reapply the definition:

kubectl apply -f zookeeper-statefulset.yaml; kubectl get pods -w -l app=zk

Once the zk-2 Pod is Running and Ready, use Ctrl-C to terminate kubectl, and query our data again:

kubectl exec zk-2 -- zkCli.sh get /hello 2> /dev/null | tail -n 4

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
world

Nice. But how does this work?

As mentioned in the ZooKeeper Basics section, ZooKeeper commits all entries to a durable WAL and periodically writes snapshots of its in-memory state to storage media. Using WALs to provide durability is a common technique for applications that use consensus protocols to achieve a replicated state machine. And of course we keep this durable WAL on persistent storage.

Exercise - Investigating the ZooKeeper cluster

The volumeClaimTemplates field of the zk StatefulSet’s spec specifies a PersistentVolume provisioned for each Pod:

grep -A 8 volumeClaimTemplates zookeeper-statefulset.yaml

volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

The StatefulSet controller generates a PersistentVolumeClaim for each Pod in the StatefulSet. Use the following command to get the StatefulSet’s PersistentVolumeClaims:

kubectl get pvc -l app=zk

NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datadir-zk-0   Bound    pvc-a39a1b88-f708-40c0-a06c-28c4467820a6   1Gi        RWO            default        124m
datadir-zk-1   Bound    pvc-7a6e7f19-95be-4d43-9583-99980a9d9498   1Gi        RWO            default        124m
datadir-zk-2   Bound    pvc-23ad8307-942a-48ad-800f-6b83cc558d58   1Gi        RWO            default        123m

When the StatefulSet recreates its Pods, it remounts the Pods’ PersistentVolumes. The volumeMounts section of the StatefulSet’s container template mounts the PersistentVolumes into the ZooKeeper servers’ data directories:

grep -A 2 volumeMounts zookeeper-statefulset.yaml

volumeMounts:
  - name: datadir
    mountPath: /var/lib/zookeeper

When a Pod in the zk StatefulSet is (re)scheduled, it will always have the same PersistentVolume mounted to the ZooKeeper server’s data directory. Even when the Pods are rescheduled, all the writes made to the ZooKeeper servers' WALs, and all their snapshots, remain durable.
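You can verify the mount and peek at the WAL and snapshot files from inside a running server. The exact paths follow from the data_dir and data_log_dir settings we will look at below, so take this as a sketch:

kubectl exec zk-0 -- df -h /var/lib/zookeeper

kubectl exec zk-0 -- ls /var/lib/zookeeper/data/version-2 /var/lib/zookeeper/data/log/version-2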

However, the servers in a ZooKeeper ensemble require consistent configuration to elect a leader and form a quorum. They also require consistent configuration of the Zab protocol in order for the protocol to work correctly over a network. In our example we achieve consistent configuration by embedding the configuration directly into the manifest:

kubectl get sts zk -o yaml | grep -C 5 ' start-zookeeper '

spec:
  containers:
    - command:
        - sh
        - -c
        - start-zookeeper --servers=3 --data_dir=/var/lib/zookeeper/data --data_log_dir=/var/lib/zookeeper/data/log
          --conf_dir=/opt/zookeeper/conf --client_port=2181 --election_port=3888 --server_port=2888
          --tick_time=2000 --init_limit=10 --sync_limit=5 --heap=512M --max_client_cnxns=60
          --snap_retain_count=3 --purge_interval=12 --max_session_timeout=40000 --min_session_timeout=4000
          --log_level=INFO
      image: k8s.gcr.io/kubernetes-zookeeper:1.0-3.4.10

which you will find reflected in the generated config as well:

kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg

#This file was autogenerated DO NOT EDIT
clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/data/log
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInteval=12
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888

Because there is no terminating algorithm for electing a leader in an anonymous network, Zab requires explicit membership configuration to perform leader election. Each server in the ensemble needs to have a unique identifier, all servers need to know the global set of identifiers, and each identifier needs to be associated with a network address.
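In this setup the unique identifier is the myid file in each server’s data directory; the start-zookeeper script presumably derives it from the Pod’s ordinal, so that zk-0 carries id 1 and matches the server.1 entry above. You can verify this with:

for i in 0 1 2; do echo "myid zk-$i:"; kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done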

Info

As you can imagine, all this automation requires careful image creation in the first place. Furthermore, due to the configuration consistency requirement we will not be able to scale this ZooKeeper cluster once it has been created. However, we will investigate fully scalable clusters later on.

And while this example is by no means complete (see e.g. “Auto purge task is not starting”), it serves to illustrate the general principles.

Exercise - Check ZooKeeper cluster security and safety

Whether an application should be allowed to run as a privileged user inside a container is a matter of debate. If your organization requires that applications run as a non-privileged user, you can use a SecurityContext to control the user that the entry point runs as.

The zk StatefulSet’s Pod template contains a SecurityContext:

grep -A 2 securityContext zookeeper-statefulset.yaml

securityContext:
  runAsUser: 1000
  fsGroup: 1000

In the Pods’ containers, UID 1000 corresponds to the zookeeper user and GID 1000 corresponds to the zookeeper group. Since the runAsUser field of the securityContext is set to 1000, the ZooKeeper process runs as the zookeeper user instead of as root:

kubectl exec zk-0 -- ps auxwwkstart_time

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
zookeep+     1  0.0  0.0   4508   848 ?        Ss   08:08   0:00 sh -c start-zookeeper --servers=3 --data_dir=/var/lib/zookeeper/data --data_log_dir=/var/lib/zookeeper/data/log --conf_dir=/opt/zookeeper/conf --client_port=2181 --election_port=3888 --server_port=2888 --tick_time=2000 --init_limit=10 --sync_limit=5 --heap=512M --max_client_cnxns=60 --snap_retain_count=3 --purge_interval=12 --max_session_timeout=40000 --min_session_timeout=4000 --log_level=INFO
zookeep+     7  0.1  1.9 2988028 73964 ?       Sl   08:08   0:03 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.10.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx512M -Xms512M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
zookeep+  3728  0.0  0.0  34424  2816 ?        Rs   08:37   0:00 ps auxwwkstart_time
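The fsGroup: 1000 setting complements this: Kubernetes mounts the Pod’s volumes with group ownership set to GID 1000, so the non-root zookeeper process can write to its data directory on the PersistentVolume. You can check the ownership of the mount point, for example with:

kubectl exec zk-0 -- ls -ld /var/lib/zookeeper /var/lib/zookeeper/data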

Let’s check the effect of running as a non-privileged user by messing around with the StatefulSet’s readinessProbe and livenessProbe:

grep -A 7 Probe zookeeper-statefulset.yaml

readinessProbe:
  exec:
    command:
      - sh
      - -c
      - "zookeeper-ready 2181"
  initialDelaySeconds: 10
  timeoutSeconds: 5
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "zookeeper-ready 2181"
  initialDelaySeconds: 10
  timeoutSeconds: 5

First we’ll try simply removing the probe executable, which will fail because the zookeeper user may not write to /usr/bin:

kubectl exec -it zk-2 -- sh -c 'ls -al /usr/bin/zookeeper-ready; id; rm -v /usr/bin/zookeeper-ready'

lrwxrwxrwx 1 root root 34 Jun 13  2017 /usr/bin/zookeeper-ready -> /opt/zookeeper/bin/zookeeper-ready
uid=1000(zookeeper) gid=1000(zookeeper) groups=1000(zookeeper)
rm: cannot remove '/usr/bin/zookeeper-ready': Permission denied
command terminated with exit code 1

Now let’s do some damage where we can, and watch the impact:

kubectl exec -it zk-2 -- sh -c 'ls -al /opt/zookeeper/bin/zookeeper-ready; id; rm -v /opt/zookeeper/bin/zookeeper-ready'; kubectl get pods -w -l app=zk

-rwxr-x--- 1 zookeeper zookeeper 900 Jun 13  2017 /opt/zookeeper/bin/zookeeper-ready
uid=1000(zookeeper) gid=1000(zookeeper) groups=1000(zookeeper)
removed '/opt/zookeeper/bin/zookeeper-ready'
NAME   READY   STATUS    RESTARTS   AGE
zk-0   1/1     Running   0          40m
zk-1   1/1     Running   0          40m
zk-2   1/1     Running   0          40m
zk-2   0/1     Running   0          40m
zk-2   0/1     Running   1          40m
zk-2   1/1     Running   1          41m

Once the zk-2 Pod is Running and Ready again, use Ctrl-C to terminate kubectl, and feel free to query our data once more:

kubectl exec zk-2 -- zkCli.sh get /hello 2> /dev/null | tail -n 4

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
world

The container has been restarted (cf. RESTARTS increasing to 1), the removed executable is back in place, and all data has persisted. You can observe the same when you simply kill the running Java process: kubectl exec zk-2 -- pkill java.

By the way, you can observe the probes’ activity in the container logs:

kubectl logs zk-0 --tail 2

2023-12-01 11:59:51,698 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@883] - Processing ruok command from /127.0.0.1:40822
2023-12-01 11:59:51,699 [myid:1] - INFO  [Thread-52:NIOServerCnxn@1044] - Closed socket connection for client /127.0.0.1:40822 (no session established for client)

And one more thing: did you notice where the three Pods have been created, i.e. on which Kubernetes cluster node they are running? Remember how to check for that?

Solution

kubectl get pods -l app=zk -o wide

would be one way,

for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done

another.

You will find that each Pod runs on a different Kubernetes node. This is due to having defined Pod affinity, or rather, in this case, podAntiAffinity:

grep -A 9 affinity zookeeper-statefulset.yaml

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values:
                - zk
        topologyKey: "kubernetes.io/hostname"

which ensures that no two Pods carrying the label app: zk will ever be scheduled onto the same node, just as it should be for a highly available cluster. Nice!
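Note that with requiredDuringSchedulingIgnoredDuringExecution the Kubernetes cluster needs at least three nodes, otherwise some of the Pods will stay Pending. If that is too strict for your environment, a softer variant (a sketch, not part of the checked-out manifest) would use a preferred rule instead:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                  - zk
          topologyKey: "kubernetes.io/hostname"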

Exercise - ZooKeeper cleanup

Once we are done we can (and should) release some of the ever-scarce resources. First of all we can easily delete the ZooKeeper StatefulSet and the corresponding Services:

kubectl delete statefulset zk

and

kubectl delete service zk-cs zk-hs

Of course, we could alternatively have deleted both by executing:

kubectl delete -f zookeeper-statefulset.yaml; kubectl delete -f zookeeper-services.yaml

However, as per the specification, the associated PVCs that were automatically created via volumeClaimTemplates will not necessarily be removed even though the consuming Pods are gone, so let’s check for them and remove them manually if needed:

kubectl get pvc

kubectl delete pvc datadir-zk-0 datadir-zk-1 datadir-zk-2
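Alternatively, since the PVCs carry the app: zk label (which we already used above with kubectl get pvc -l app=zk), a label-based delete works as well:

kubectl delete pvc -l app=zk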