Treat your pods according to their needs - three QoS classes in Kubernetes


One of the features that comes with Kubernetes is its ability to horizontally scale the services running on it and to use the available resources more efficiently. I’ve been hearing that containers are just lightweight virtualization (which is not true), so you can put more apps on the same resources. I can agree that it’s partially true - for some types of workloads containers can use fewer resources, thanks to the Kubernetes scheduler and Linux kernel magic :-)

Two-level scheduling

When you launch a pod in Kubernetes, a really nice and sophisticated piece of software called the scheduler determines which host should run it. It does its magic by taking many factors into consideration, and it’s also highly configurable. Like I said - it’s pretty cool, and if you want to find out more, please watch this video.

So when your pod finally launches on a host, it is under that host’s control. To make this clear we need to understand one simple principle:

Containers are just processes

Yes, they are fancier, but they are still treated as processes by the kernel on the node they’re running on. It means that after they have been placed by the Kubernetes scheduler, their fate is in the hands of the Linux kernel. Its behaviour can be controlled with the proper parameters passed from Kubernetes to the container runtime (e.g. docker-engine).

Not everyone is equal

When it comes to access to precious resources such as compute power and memory, there is a constant fight between all processes running on a particular host - both the ones running in the host namespace (non-containerized) and the ones inside containers. The Linux kernel has many features for handling processes. For CPU it has multiple schedulers, process priorities (statically assigned by an administrator using the nice command, dynamically assigned internally by the kernel) and many more. Memory management is even more complex - most of the kernel code is actually about memory - but only when Google contributed cgroups could we really start limiting process memory (hard RSS limits defined in limits.conf weren’t actually respected…). With cgroups we received many, many more things, of which two are used by containers:

  • cpu controller for cpu prioritization and limiting
  • mem controller for memory reservation and limiting

Thanks to these features we can run nginx with memory limited to 128M and half a CPU core:

docker run -m 128M --cpus 0.5 nginx
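To double-check what the runtime actually applied, you can inspect the container’s HostConfig. A quick sketch - the container name limited-nginx is just an example; memory is reported in bytes and CPU as NanoCpus (0.5 CPU = 500000000):

docker run -d --name limited-nginx -m 128M --cpus 0.5 nginx
# Memory in bytes, CPU in NanoCpus
docker inspect -f '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' limited-nginx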

What does it have to do with Kubernetes?

Three QoS classes

When deploying a pod you can specify two values for both cpu and memory resources:

resources:
  limits:
    cpu: 500m
    memory: 1G
  requests:
    cpu: 200m
    memory: 1G

Requests can be treated as reservations (or soft limits), while limits are hard limits which cannot be exceeded (and requests cannot be set higher than limits). What is important is that these values are optional, and depending on how you set them your pod will be assigned to one of three classes (minimal example snippets follow the list):

  • Guaranteed - if you set only limits OR set limits and requests to the same values (for every container, for both cpu and memory)
  • Burstable - if you set only requests OR set requests lower than limits
  • Best-effort - if you don’t set any of them at all
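For illustration, here is roughly what the three cases look like as per-container resources snippets (minimal, made-up values):

# Guaranteed: requests equal limits (or only limits are set)
resources:
  limits:
    cpu: 500m
    memory: 1G
  requests:
    cpu: 500m
    memory: 1G

# Burstable: requests lower than limits (or only requests are set)
resources:
  limits:
    cpu: 500m
    memory: 1G
  requests:
    cpu: 200m
    memory: 512M

# Best-effort: no resources section at all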

Why should I care about what class my pod is assigned to?

There are a couple of reasons why you should consider specifying resource values.

1. Your pod will be assigned to the default class

If you don’t specify resources, your pod will be assigned to the Best-effort class (unless your namespace has been configured in a special way - see below). And yes - that’s the worst, least prioritized class.

2. Class determines how your pod is treated when the host is running low on resources

Depending on the resource type - cpu or memory - bad things may happen, sometimes even a murder (of a process)!

For cpu, excess compute power will be distributed proportionally to the requests.cpu values assigned to the containers running on a node. So if two processes are fighting for CPU, the one with the higher request gets more of it. And guess what - the one without any value set will get only what’s left (actually the kernel won’t let it starve completely, but it will be significantly slower).
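Under the hood this proportional sharing is done with cgroup cpu shares - Kubernetes maps a requests.cpu of 500m to roughly 512 shares and 250m to roughly 256, so under contention those two containers split spare CPU about 2:1. The path below is only a sketch assuming cgroup v1 and the default kubepods hierarchy; it differs between distributions, cgroup drivers and cgroup v2:

# <pod-uid> and <container-id> are placeholders - look them up on the node
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.shares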

For memory things are more interesting. If a container doesn’t have any requests.memory assigned, it is the first candidate to be terminated by the Linux kernel’s “killer” feature - the Out-of-memory (OOM) killer - when the node runs out of memory. The killer will try to spare pods from the Guaranteed class and will probably save some from Burstable, but Best-effort ones will be chosen first.
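If you suspect a container has already been a victim, the last termination reason is recorded in the pod status - a quick check, assuming MYPOD is your pod and you care about its first container:

kubectl get pod MYPOD -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints OOMKilled if the container was killed by the OOM killer
# (kubectl describe pod MYPOD shows the same under "Last State")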

What class should I assign to my pods?

The best answer I can offer is the classic one - it depends.

However, I would go with the following approach:

  • CPU
    • Guaranteed if you have an app that is “sensitive” to cpu spikes, OR you want to minimize the latency they may cause, OR you don’t want to overcommit CPU resources (the Kubernetes scheduler will track and enforce it)
    • Burstable for most generic workloads, with prioritization done by setting different request<->limit gaps according to requirements
    • Best-effort for non-critical workloads, batch jobs and any workload that spans dozens/hundreds of nodes, since classes are used by the local node scheduler ONLY when demand exceeds supply
  • Memory
    • Guaranteed similarly, for “sensitive” apps like databases, and anything that runs in a StatefulSet, since it’s probably critical (it needs storage after all)
    • Burstable for most generic workloads; I highly recommend also setting limits to minimize OOM kills
    • Best-effort for non-critical workloads, keeping in mind that they will be killed first!

How to set a class

Place a proper resources section under each entry in spec.containers.

For OpenShift users: you can do it from the cli, e.g. for the jenkins deploymentconfig:

oc set resources dc jenkins --limits=cpu=2,memory=1024M --requests=cpu=0.5,memory=1024M
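On plain Kubernetes the equivalent can be done with kubectl set resources - a sketch assuming Jenkins runs as a Deployment named jenkins:

kubectl set resources deployment jenkins --limits=cpu=2,memory=1024M --requests=cpu=0.5,memory=1024M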

How to check which class my pod belongs to

Please check the qosClass field of the pod’s status, e.g.:

kubectl get pod MYPOD -o jsonpath='{.status.qosClass}'
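To get an overview of a whole namespace, custom columns work nicely too:

kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass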

Don’t lose hope - LimitRange to the rescue!

For those who don’t want to define requests and limits for each pod, there’s the LimitRange admission controller, which injects default values for you. Here’s a sample configuration object you can apply to your namespace/project:

apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: you-shall-have-limits
spec:
  limits:
    - type: "Container"
      max:
        cpu: "2"
        memory: "1Gi"
      min:
        cpu: "100m"
        memory: "4Mi"
      default:
        cpu: "500m"
        memory: "200Mi"
      defaultRequest:
        cpu: "200m"
        memory: "100Mi"

Each container in a pod created in a namespace with this LimitRange configuration, if it doesn’t specify its own values, will get:

  • memory requests=100Mi, limits=200Mi
  • cpu requests=200m, limits=500m

The max and min parameters define the maximum and minimum values that will be enforced for containers whose values are specified explicitly.
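To try it out, save the manifest above (e.g. as limitrange.yaml - the file and namespace names here are just examples), apply it and inspect the result:

kubectl apply -f limitrange.yaml -n myproject
kubectl describe limitrange you-shall-have-limits -n myproject
# a pod created in myproject without a resources section now gets the defaults
# injected (requests lower than limits), so it lands in the Burstable class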

Conclusions

Now a quick recap in a few points.

  1. Kubernetes is awesome!
  2. The Kubernetes scheduler operates at the cluster level and the Linux kernel operates at the node/local level
  3. Thanks to the Linux kernel cgroups feature we can easily enforce limits and reservations for the cpu and memory of our containers
  4. There are three QoS classes: Guaranteed, Burstable, Best-effort
  5. Best-effort is the default class and is probably the worst choice for most production workloads
  6. It’s a good idea to choose a class explicitly by setting resources in the pod definition or by using a LimitRange

P.S. Puppies have nothing to do with the article (except there are also three of them and they are absolutely cute) - it’s just a cheap trick to attract you to visit my site ;-)
