Kubernetes StatefulSets

prerequisites

I recommend you know the basic knowledge of Kubernetes Pods before reading this blog. You can check this blog for details about Kubernetes Pods.

What Is A StatefulSet

StatefulSet is a Kubernetes object designed to manage stateful applications. Like a Deployment, a StatefulSet scales up a set of pods to the desired number that you define in a config file. Pods in a StatefulSet runs the same containers that are defined in the Pod spec inside the StatefulSet spec. Unlike a Deployment, every Pod of a StatefulSet owns a sticky and stable identity. A StatefulSet also provides the guarantee about ordered deployment, deletion, scaling, and rolling updates for its Pods.

A StatefulSet Example

A complete StatefulSet consists of two components:

  • A Headless Service used to control the network ID of the StatefulSet’s Pods

The following example demonstrates how to use a StatefulSet to create a ZooKeeper Server. Please note that the following StatefulSet Spec is simplified for demo purposes. You can check this YAML file for the complete configuration of this ZooKeeper Service.

ZooKeeper Service

A Zookeeper service is a distributed coordination system for distributed applications. It allows you to read, write data, and observe data updates. Data is stored and replicated in each ZooKeeper server and these servers work together as a ZooKeeper Ensemble.

The following picture shows the overview of a ZooKeeper service with five ZooKeeper servers. You can see each server in a ZooKeeper service has a stable network ID for potential leader elections. Moreover, one of the ZooKeeper servers needs to be selected as a leader for managing the service topology and processing write requests. StatefulSets is suitable for running such an application as it guarantees uniqueness for Pods.

Headless Service

A Headless Service is responsible for controlling the network domain for a StatefulSet. The way to create a headless service is to specify clusterIP == None.

The following spec is for creating a Headless Service for the ZooKeeper service. This Headless Service is used to manage Pod Identify for the following StatefulSet.

apiVersion: v1
kind: Service
metadata:
namespace: default
name: zk-hs
labels:
app: zk-hs
spec:
ports:
- port: 2888
name: server
- port: 3888
name: leader-election
clusterIP: None
selector:
app: zk

Unlike a ClusterIP Service or aLoadBalancer Service, a Headless Service does not provide load-balancing. Based on my experience, any request to zk-hs.default.svc.cluster.local is always redirected to the first StatefulSet Pod ( zk-0 in the example). Therefore, A Kubernetes Service that provides load balancing or an Ingress is required if you need to load-balance traffic for your StatefulSet.

StatefulSet Spec

The following spec demonstrates how to use a StatefulSet to run a ZooKeeper service:

apiVersion: apps/v1
kind: StatefulSet
metadata:
namespace: default
name: zk

# StatefulSet spec
spec:
serviceName: zk-hs
selector:
matchLabels:
app: zk # It has to match .spec.template.metadata.labels
replicas: 5
podManagementPolicy: OrderedReady
updateStrategy:
type: RollingUpdate

# volumeClaimTemplates creates a Persistent Volume for each StatefulSet Pods.
volumeClaimTemplates:
- metadata:
name: datadir
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: standard
resources:
requests:
storage: 10Gi

# Pod spec
template:
metadata:
labels:
app: zk
spec:
affinity:
nodeAffinity:
...
podAntiAffinity:
...
# Containers running in each Pod
containers:
- name: k8szk
image: gcr.io/google_samples/k8szk:v3
ports:
- containerPort: 2181
name: client
- containerPort: 2888
name: server
- containerPort: 3888
name: leader-election
env:
...
readinessProbe:
exec:
command:
- "zkOk.sh"
initialDelaySeconds: 10
timeoutSeconds: 5
livenessProbe:
...
volumeMounts:
- name: datadir
mountPath: /var/lib/zookeeper

Metadata

The field metadata contains metadata of this StatefulSet, which includes the name of this StatefulSet and the Namespace it belongs to. You can also put labels and annotations in this field.

Stateful Set Spec and Pod Template

The field spec defines the specification of this StatefulSet and the field spec.template defines a template for creating the Pods this StatefulSet manages.

Pod Selector

Like a Deployment, a StatefulSet uses the field spec.selctor to find which Pods to manage. You can check this doc for details about the usage of Pod Selector.

Replica

The field spec.replica specifies the desired number of Pods for the StatefulSet. It is recommended to run an odd number of Pods for some stateful applications like ZooKeepers, based on the consideration of the efficiency of some operations. For example, a ZooKeeper service marks a data write complete only when more than half of its servers send an acknowledgment back to the leader. Take a six pods ZooKeeper service as an example. The service remains available as long as at least four servers (ceil(6/2 + 1)) are available, which means your service can tolerate the failure of two servers. Nevertheless, it can still tolerate two-servers failure when the server number is lowered down to five. Meanwhile, this also improves write efficiency as now it only needs 3 servers' acknowledgment to complete a write request. Therefore, having the sixth server, in this case, does not give you any additional advantage in terms of write efficiency and server availability.

Pod Identify

A StatefulSet Pod is assigned a unique ID (aka. Pod Name) from its Headless Service when it is created. This ID sticks to the Pod during the life cycle of the StatefulSet. The pattern of constructing ID is ${statefulSetName}-${ordinal}. For example, Kubernetes will create five Pods with five unique IDs zk-0, zk-1, zk-2, zk-3 and zk-4 for the above ZooKeeper service.

The ID of a StatefulSet Pod is also its hostname. The subdomain takes the form ${podID}.${headlessServiceName}.{$namespace}.svc.cluster.local where cluster.local is the cluster domain. For example, the subdomain of the first ZooKeeper Pod is zk-0.zk-hs.default.svc.cluster.local. It is recommended to use a Stateful Pod's subdomain other than its IP to reach the Pod as the subdomain is unique within the whole cluster.

Pod Management Policy

You can choose whether to create/update/delete a StatefulSet’s Pod in order or in parallel by specifying spec.podManagementPolicy == OrderedReady or spec.podManagementPolicy == Parallel. OrderedReady is the default setting and it controls the Pods to be created with the order 0, 1, 2, ..., N and to be deleted with the order N, N-1, ..., 1, 0. In addition, it has to wait for the current Pod to become Ready or terminated prior to terminating or launching the next Pod. Parallel launches or terminates all the Pods simultaneously. It does not rely on the state of the current Pod to lunch or terminate the next Pod.

Update Strategy

There are several rolling update strategies available for StatefulSets. RollingUpdate is the default strategy and it deletes and recreates each Pod for a StatefulSet when a rolling update occurs.

Doing rolling updates for the stateful applications like ZooKeepers is a little bit tricky: Other Pods need enough time to elect a new leader when the StatefulSet Controller is recreating the leader. Therefore, You should consider configuring readnessProbe and readnessProbe.initialDelaySeconds for the containers inside a StatefulSet Pods to delay the new Pod to be ready, thus delaying the rolling update of the next Pod and giving other running Pods enough time to update the service topology. This should give your stateful applications, for example, a ZooKeeper service, enough time to handle the case where a Pod is lost and back.

Pod Affinity

Like a Deployment, the ideal scenario of running a StatefulSet is to distribute its Pods to different nodes in different zones and avoid running multiple Pods in the same node. The spec.template.spec.affinity field allows you to specify node affinity and inter-pod affinity (or anti-affinity) for the SatefulSet Pods. You can check this doc for details about using node/pod affinity in Kubernetes

volumeClaimTemplates

The field spec.volumeClaimTemplates is used to provide stable storage for StatefulSets. As shown in the following picture, the field spec.volumeClaimTemplates creates a Persistent Volume Claim ( datadir-zk-0), a Persistent Volume ( pv-0000), and a 10 GB standard persistent disk for Pod zk-0. These storage settings have the same life cycle as the StatefulSet, which means the storage for a Stateful Pod is stable and persistent. Any StatefulSet Pod will not lose its data whenever it is terminated and recreated.

What Is Next

I recommend you read this blog if you are curious about how to utilize Kubernetes Deployments to run stateless applications in Kubernetes.

I also recommend you read this blog if you are curious about how to utilize Kubernetes Services to load balance traffic to your applications in Kubernetes.

Reference

A software engineer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store