Kubernetes Volumes and Multiple Zones

While working on a recent patch for Kubernetes, I put down a few notes around how Kubernetes handles multiple zones from the storage perspective.

This is not an extensive description of multi-zone support in Kubernetes, but rather a specific explanation of how PersistentVolumes (PVs) are created in the correct zones.

The labels

It all starts with the kubelet adding labels to each node object with information about the zone and region in which the node is placed.

This can be quickly verified with the command:

$ kubectl get node/ip-172-18-12-140.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
    (...)

In addition to that, the PersistentVolumeLabel Admission Controller automatically adds zone labels to PVs as soon as they are created.
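
As a rough illustration, a PV labeled by this admission controller might carry metadata like the fragment below. This is only a sketch: the PV name is hypothetical, and the label values simply reuse the zone and region from the node example above.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-example   # hypothetical name, for illustration only
  labels:
    # Added automatically by the PersistentVolumeLabel admission controller
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
# (rest of the PV omitted)
```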

The scheduler (via the VolumeZonePredicate predicate) will then ensure that pods that claim a given PV are only placed into the same zone as that volume, as volumes cannot be attached across zones.

This approach sounds interesting, but it comes with some problems. For instance, it only prevents pods from being scheduled in the wrong zone. It can’t tell the storage provisioner to provision the PV in a particular zone.

There’s a better solution to address this shortcoming: topology-aware volume provisioning.

Topology-aware volume provisioning

With topology-aware volume provisioning, the PV is only provisioned when a pod requests it. When that happens, the volume is provisioned in the same zone as the pod.
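
In practice, this delayed behavior is requested through the StorageClass by setting volumeBindingMode to WaitForFirstConsumer. The fragment below is a minimal sketch, not a recommended configuration: the class name is hypothetical, the in-tree AWS EBS provisioner and the gp2 type are assumptions chosen to match the zone used in this post, and the allowedTopologies section is optional.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-example        # hypothetical name
provisioner: kubernetes.io/aws-ebs    # assumption: in-tree AWS EBS provisioner
parameters:
  type: gp2                           # assumption: any supported EBS type works
# Delay binding and provisioning until a pod that uses the claim is scheduled
volumeBindingMode: WaitForFirstConsumer
# Optional: restrict provisioning to specific zones
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-east-1d
```

With the default mode (Immediate), the PV is provisioned as soon as the claim is created, before the scheduler has picked a node for the pod.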

The PV NodeAffinity is always set by the storage plugin (or by the external provisioner, in the CSI case). A second scheduler predicate, VolumeBindingChecker, then makes sure pods are only placed on nodes that satisfy that affinity. This predicate looks at the pv.spec.nodeAffinity field rather than at the PV labels.

This is how the field looks in the PV object:

In-tree storage plugin:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east-1d
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-east-1

CSI driver:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-1d