For a variety of reasons, a service or container may need to run only on a specific type of hardware. Maybe one machine has faster disks, another is running a conflicting application, and yet another is part of your bare-metal cluster. Each of these is a good reason to run an application here instead of there, and being able to control where an application runs can make all the difference.
In fleet, dictating where a container runs is called affinity and anti-affinity. In Kubernetes, this is handled by the nodeSelector and nodeAffinity/podAffinity fields under PodSpec.
Affinity and anti-affinity are achieved through fleet-specific options in systemd unit files. These options include:

- MachineID: Run a unit on a specific host.
- MachineOf: Run a unit on the same host as another unit.
- MachineMetadata: Run on a host matching some arbitrary metadata defined in fleet.conf.
- Conflicts: Prevent a unit from running on the same node as some other unit.
- Global: Run a unit on every node.
- Replaces: Run a unit in place of another unit.
These are described in detail in the fleet documentation.
These options let you specify, with arbitrary granularity, how, where, and when to schedule a unit on a fleet cluster.
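As a minimal sketch (the unit name, image, and metadata key are made up for illustration), a unit that must run on an SSD-backed host and must never share a host with any web@ instance could look like this:

[Unit]
Description=Example web service

[Service]
ExecStart=/usr/bin/docker run --rm --name web nginx

[X-Fleet]
# Schedule only onto hosts whose fleet.conf metadata includes disk=ssd
MachineMetadata=disk=ssd
# Never place this unit on the same host as any web@ instance
Conflicts=web@*.service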
Kubernetes implements node affinity with the nodeSelector and nodeAffinity fields in PodSpec. These fields can use both pre-populated metadata and user-defined metadata.
These user-defined labels are set like so:
$ kubectl label nodes some-k8s-node.internal.hostname.ext key=value
And they can be retrieved like so:
$ kubectl get nodes --show-labels
NAME STATUS AGE LABELS
some-k8s-node.internal.hostname.ext Ready 8m beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=instance-type,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=some-us-region-2,failure-domain.beta.kubernetes.io/zone=some-us-region-2c,kubernetes.io/hostname=ip-10-0-0-101.some-us-region-2.internal,key=value
...
nodeSelector is the most basic way to set node affinity in Kubernetes. Given a set of key: value requirements, a pod can be scheduled to run (or not run) on certain nodes.
The general form of this node selection looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    key1: value1
    key2: value2
Let's say we're running a Kubernetes cluster on AWS and we want an nginx pod to run on an m3.medium instance, and only on nodes with the example.com/load-balancer key set to true. We also want a different httpd pod to run only in the us-west-2 region. Here's how that mypods.yaml would look:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    beta.kubernetes.io/instance-type: m3.medium
    example.com/load-balancer: "true"
---
apiVersion: v1
kind: Pod
metadata:
  name: httpd
  labels:
    env: test
spec:
  containers:
  - name: httpd
    image: httpd
  nodeSelector:
    failure-domain.beta.kubernetes.io/region: us-west-2
We create the pods just like usual:
$ kubectl create -f mypods.yaml
pod "nginx" created
pod "wordpress" created
After a while we can see that the httpd pod has been created, but the nginx pod is still Pending:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
httpd 1/1 Running 0 1m
nginx 0/1 Pending 0 1m
This is because our worker nodes don't meet one of the nginx pod's requirements: neither has the example.com/load-balancer key set to true:
$ kubectl get nodes --show-labels
NAME STATUS AGE LABELS
ip-10-0-0-187.us-west-2.compute.internal Ready 47m beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,key=value,kubernetes.io/hostname=ip-10-0-0-187.us-west-2.compute.internal
ip-10-0-0-188.us-west-2.compute.internal Ready 47m beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,kubernetes.io/hostname=ip-10-0-0-188.us-west-2.compute.internal
ip-10-0-0-50.us-west-2.compute.internal Ready,SchedulingDisabled 47m beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,kubernetes.io/hostname=ip-10-0-0-50.us-west-2.compute.internal
If we add the example.com/load-balancer=true label to one of our nodes, the nginx pod will be scheduled onto that node.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
httpd 1/1 Running 0 1m
nginx 0/1 Pending 0 1m
$ kubectl label node \
ip-10-0-0-187.us-west-2.compute.internal \
example.com/load-balancer=true
node "ip-10-0-0-187.us-west-2.compute.internal" labeled
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
httpd 1/1 Running 0 9m 10.2.26.5 ip-10-0-0-187.us-west-2.compute.internal
nginx 1/1 Running 0 9m 10.2.26.6 ip-10-0-0-187.us-west-2.compute.internal
When a node label changes, pods are not moved. We can demonstrate this by changing the example.com/load-balancer key on ip-10-0-0-187.us-west-2.compute.internal to false, then adding the example.com/load-balancer=true label to our other worker node to see what happens:
$ kubectl label nodes --overwrite ip-10-0-0-187.us-west-2.compute.internal example.com/load-balancer=false
node "ip-10-0-0-187.us-west-2.compute.internal" labeled
$ kubectl label nodes ip-10-0-0-188.us-west-2.compute.internal example.com/load-balancer=true
node "ip-10-0-0-188.us-west-2.compute.internal" labeled
$ kubectl get nodes --show-labels
NAME STATUS AGE LABELS
ip-10-0-0-187.us-west-2.compute.internal Ready 1h [...],example.com/load-balancer=false,[...]
ip-10-0-0-188.us-west-2.compute.internal Ready 1h [...],example.com/load-balancer=true,[...]
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
httpd 1/1 Running 0 15m 10.2.26.5 ip-10-0-0-187.us-west-2.compute.internal
nginx 1/1 Running 0 15m 10.2.26.6 ip-10-0-0-187.us-west-2.compute.internal
As you can see, both pods keep running on the same host: nodeSelector is only evaluated at scheduling time, so the running nginx pod isn't moved or unscheduled. New nginx pods, however, would be assigned to the second node, according to the new example.com/load-balancer=true label.
Kubernetes also has a more nuanced way of setting affinity, called nodeAffinity and podAffinity. While the feature is in alpha, these are set as the scheduler.alpha.kubernetes.io/affinity annotation under the Pod metadata, and they take automatic or user-defined metadata to dictate where to schedule pods. affinity differs from nodeSelector in the following ways:
- Schedule a pod based on which other pods are, or are not, running on a node.
- Request, without requiring, that a pod run on a node.
- Specify a set of allowable values instead of a single value requirement (see the snippet after the table below).
| Affinity selector | Requirements met | Requirements not met | Requirements lost |
|---|---|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Runs | Fails | Keeps Running |
| preferredDuringSchedulingIgnoredDuringExecution | Runs | Runs | Keeps Running |
| requiredDuringSchedulingRequiredDuringExecution (unimplemented) | Runs | Fails | Fails |
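To illustrate the last point: allowable values are given as match expressions, using operators like In, NotIn, and Exists. A single term accepting either of two instance types (hypothetical values) looks like this:

{
  "matchExpressions": [
    {
      "key": "beta.kubernetes.io/instance-type",
      "operator": "In",
      "values": ["m3.medium", "m3.large"]
    }
  ]
}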
In addition to affinity/anti-affinity for specific nodes (nodeAffinity), Kubernetes can schedule pods relative to the other pods already running on a node (podAffinity and podAntiAffinity).
Let's take the above example of deploying an nginx and an httpd pod, except now with a more complicated set of requirements:

- nginx cannot run on the same node as httpd.
- httpd should run on a node with the x-web: yes label, but can run anywhere.
- nginx must run on a node with the y-web: yes label, and should fail to schedule otherwise.
Here's how a mypods.yaml expressing those requirements with the alpha affinity annotation looks:
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    nginx: "yes"
  annotations:
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {
                "matchExpressions": [
                  {
                    "key": "y-web",
                    "operator": "In",
                    "values": ["yes", "true"]
                  }
                ]
              }
            ]
          }
        },
        "podAntiAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": [
            {
              "labelSelector": {
                "matchExpressions": [
                  {
                    "key": "httpd",
                    "operator": "Exists"
                  }
                ]
              },
              "topologyKey": "kubernetes.io/hostname"
            }
          ]
        }
      }
spec:
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: httpd
  labels:
    httpd: "yes"
  annotations:
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "preferredDuringSchedulingIgnoredDuringExecution": [
            {
              "weight": 1,
              "preference": {
                "matchExpressions": [
                  {
                    "key": "x-web",
                    "operator": "In",
                    "values": ["yes", "true"]
                  }
                ]
              }
            }
          ]
        }
      }
spec:
  containers:
  - name: httpd
    image: httpd
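To try this out on the cluster from earlier (assuming you can label one of its workers), satisfy nginx's hard requirement first, then create both pods and check where they land:

$ kubectl label node ip-10-0-0-188.us-west-2.compute.internal y-web=yes
$ kubectl create -f mypods.yaml
$ kubectl get pods -o wide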