Introduction
As a container orchestrator, Kubernetes manages the lifecycle of containers in a cluster.
The kubelet will ensure that the containers of the different pods are running and healthy. These pods (defined by a podSpec) are provided through the apiserver.
The kubelet uses different probes to decide how to treat a container and what actions to take. We can use each of these probes to achieve an optimal experience when running our containers.
- To make sure incoming requests do not fail because the application is not yet ready ⇒ Readiness probes
- To restart an unhealthy container that may be stuck, in the hope that a restart resolves the situation ⇒ Liveness probes
- And if an application has a long-running initialization process, to make the kubelet patient and give it enough time to initialize before checking whether it is live and ready ⇒ Startup probes
These probes are defined for each container of the pod. Different checks can be performed on each of these containers:
- httpGet
- tcpSocket
- gRPC (since Kubernetes v1.24 in beta, and GA in v1.27)
- exec
Typically, you will need at least a readinessProbe and a livenessProbe defined for your containers.
We will be using the excellent podinfo to showcase these different probes.
Deploying a microservice
There are different ways you can deploy the podinfo microservice. We won’t go deep into the different scenarios.
For simplicity, and as we want to make changes to the chart, we will clone the repository:
git clone git@github.com:stefanprodan/podinfo.git
And let’s install the chart with its default values:
helm install podinfo -n dev --create-namespace charts/podinfo
Kubernetes probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet either executes code within the container, or makes a network request.
Liveness probes
Liveness probes are used to perform checks on the container, to make sure it is running. If the probe fails, the container is killed by the kubelet and will be restarted following the restartPolicy defined in the pod spec.
This simple kill-and-recreate of the container may help resolve some problems in situations where a restart is needed.
NB: If no restartPolicy is defined, it defaults to Always.
Ideally, we will have 2 different HTTP endpoints for liveness and readiness checks.
- For liveness probes these may be /healthz or /livez
- For readiness probes: /readyz
These are the commonly used endpoints, but you can use different, application-specific endpoints.
The z at the end of /healthz, /livez and /readyz is just to avoid any collisions with existing endpoints.
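For an application exposing such endpoints, HTTP probes could be defined as follows (an illustrative sketch; the port and paths depend on your application — podinfo serves HTTP on 9898):

```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: 9898
  initialDelaySeconds: 1
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 9898
  initialDelaySeconds: 1
  periodSeconds: 10
```

A 2xx or 3xx response counts as a success; any other status code counts as a failure.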
Depending on your application, you may need a check mechanism other than a simple HTTP request. Kubernetes supports TCP checks on a pod port, command execution inside the container, and, more recently, gRPC health checks.
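A TCP check, for instance, only verifies that a socket can be opened on the given port, without inspecting any application response (a minimal sketch, reusing podinfo's HTTP port):

```yaml
livenessProbe:
  tcpSocket:
    port: 9898
  initialDelaySeconds: 1
  periodSeconds: 10
```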
By default, the chart performs checks using commands:
livenessProbe:
exec:
command:
- podcli
- check
- http
- localhost:9898/healthz
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
This instructs the kubelet to:
- First, wait for 1 second
- Then, execute the provided command every 10 seconds
- Use a timeout of 5 seconds for these executions
- Consider the container live after the first command succeeds (probe outcome: Success)
- Consider the container unhealthy and kill it after 3 consecutive failures (probe outcome: Failure)
To simulate an unhealthy container, let’s set the following flag on the chart:
helm upgrade podinfo charts/podinfo --set faults.unhealthy=true
The pod will start and show as running (readiness probes are successful), but /healthz will return 503 errors. After 3 failed checks, the kubelet will decide that the container should be killed and restarted:
kubectl describe po podinfo-777f58f87f-l5rdp
#[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
#[...]
Warning Unhealthy 42s kubelet Liveness probe failed: 2023-04-15T09:59:48.039Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/healthz", "status code": 503}
Warning Unhealthy 30s kubelet Liveness probe failed: 2023-04-15T10:00:00.042Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/healthz", "status code": 503}
Warning Unhealthy 22s kubelet Liveness probe failed: 2023-04-15T10:00:08.029Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/healthz", "status code": 503}
Normal Killing 22s kubelet Container podinfo failed liveness probe, will be restarted
We can also replace the command checks with gRPC checks:
livenessProbe:
grpc:
port: 9999
service: podinfo
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
failureThreshold: 3
Readiness probes
Readiness probes check whether the container is ready to receive requests.
A pod is ready when all of its containers are ready (successful readiness probes for all containers).
Like liveness probes, readiness probes are performed periodically. When the probe fails, the pod IP is removed from the endpoints of all matching Services.
The default chart performs readiness probes using a command:
readinessProbe:
exec:
command:
- podcli
- check
- http
- localhost:9898/readyz
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
Let’s set the unready fault flag to simulate an unready container:
helm upgrade podinfo charts/podinfo --set faults.unready=true
Let’s check the pod:
kubectl get po
NAME READY STATUS RESTARTS AGE
podinfo-85698889c-f8v8r 0/1 Running 0 83s
We can see the pod running but not ready. The 0/1 in the READY column means 0 containers are ready out of the pod’s total (in this case, 1 container).
Describing the pod:
kubectl describe po podinfo-85698889c-f8v8r
#[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
#[...]
Warning Unhealthy 3m22s kubelet Readiness probe failed: 2023-04-15T07:20:35.706Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/readyz", "status code": 503}
Warning Unhealthy 3m21s kubelet Readiness probe failed: 2023-04-15T07:20:36.758Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/readyz", "status code": 503}
Warning Unhealthy 3m13s kubelet Readiness probe failed: 2023-04-15T07:20:44.207Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/readyz", "status code": 503}
Warning Unhealthy 3m3s kubelet Readiness probe failed: 2023-04-15T07:20:54.211Z INFO podcli/check.go:137 check failed {"address": "http://localhost:9898/readyz", "status code": 503}
We can see the failing checks. And since the pod is not ready, traffic is not directed to it.
Checking the service:
kubectl describe svc podinfo
Name: podinfo
Namespace: dev
Labels: app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=podinfo
app.kubernetes.io/version=6.3.5
helm.sh/chart=podinfo-6.3.5
Annotations: meta.helm.sh/release-name: podinfo
meta.helm.sh/release-namespace: dev
Selector: app.kubernetes.io/name=podinfo
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.43.223.176
IPs: 10.43.223.176
Port: http 9898/TCP
TargetPort: http/TCP
Endpoints:
Port: grpc 9999/TCP
TargetPort: grpc/TCP
Endpoints:
Session Affinity: None
Events: <none>
We can see that the two Endpoints fields are empty, which ensures traffic does not reach the pod until it becomes ready.
Startup probes
Startup probes are used for slow-starting containers. They make sure the container has finished its initialization phase before readiness and liveness checks are performed.
The startup probe disables liveness and readiness probes until it succeeds, giving the application time to start. Once the startup probe succeeds, the liveness and readiness probes take over, and traffic is served to the pod as soon as the readiness probe passes.
Let’s create a startup probe for our podinfo deployment:
startupProbe:
exec:
command:
- podcli
- check
- http
- localhost:9898/healthz
failureThreshold: 20
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
In this startupProbe example, we are simply giving the application more time to start by setting failureThreshold to 20 and initialDelaySeconds to 10, which allows up to roughly 10 + 20 × 10 = 210 seconds for startup. The probe can also be more complex, like checking for the existence of a file on the file system.
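Such a file-based check could look like this (hypothetical example; it assumes the application creates /tmp/initialized once its startup work is done):

```yaml
startupProbe:
  exec:
    command:
    - cat
    - /tmp/initialized
  failureThreshold: 20
  periodSeconds: 10
```

The probe fails as long as the file is missing (cat exits non-zero) and succeeds once it appears.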
To sum up
Readiness probes and liveness probes play different roles in the lifecycle of a pod. A common mistake is to have them both perform the same check with the same configuration, which means the container is killed whenever the probes fail. This never gives Kubernetes the chance to simply stop serving traffic to the pod and wait for it to recover.
Ideally, the two probes should be different, as discussed in this article, with a different check for each probe.
If the same endpoint (or command) is used for both probes, the configuration should be different (delays, thresholds).
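For example, if both probes hit the same endpoint, the readiness probe can be configured to react quickly while the liveness probe stays more tolerant (illustrative values, reusing podinfo's /healthz endpoint):

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 9898
  periodSeconds: 5
  failureThreshold: 1   # stop traffic after a single failure
livenessProbe:
  httpGet:
    path: /healthz
    port: 9898
  periodSeconds: 10
  failureThreshold: 6   # only restart after ~1 minute of failures
```

With this setup, a brief hiccup only removes the pod from Service endpoints, and a restart happens only if the failure persists.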