Common Reasons for Failed Kubernetes Deployments

A recent series of articles highlighted 10 common reasons for failed Kubernetes deployments. These range from missing or incorrect inputs to exceeded resource limits. In most cases, the 'kubectl describe' command helps to pinpoint the underlying reason.

Invalid inputs for a Kubernetes deployment include specifying a container image that does not exist, or one that is inaccessible due to permission issues. The default registry is Docker Hub, so the registry URL needs to be specified if another registry such as Amazon ECR or Quay.io is used. Private registries require credentials for image access. An image pull can also fail when the requested tag is invalid. This happens, for example, when the image exists but has no 'latest' tag ('latest' is the default tag if none is specified). Network problems can cause pull failures as well. The error messages in all these cases are similar, so deeper inspection is required to pinpoint the exact reason.
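The defaulting rules above can be made concrete with a small sketch. It assumes the usual Docker convention that the first path component counts as a registry host only if it contains a dot or a port, or is 'localhost'. The function name normalize_image is hypothetical, and digest references are deliberately not handled.

```python
def normalize_image(ref: str) -> tuple[str, str, str]:
    """Split an image reference into (registry, repository, tag).

    Simplified sketch: digest references ("name@sha256:...") are rejected.
    """
    if "@" in ref:
        raise ValueError("digest references are out of scope for this sketch")

    # The first path component is treated as a registry host only if it
    # looks like one: it contains a dot or a port, or it is "localhost".
    first, slash, rest = ref.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref

    # A tag is whatever follows a colon in the last path component.
    last = remainder.rsplit("/", 1)[-1]
    if ":" in last:
        remainder, tag = remainder.rsplit(":", 1)
    else:
        tag = "latest"  # default tag when none is given

    # Official Docker Hub images live under the implicit "library/" namespace.
    if registry == "docker.io" and "/" not in remainder:
        remainder = "library/" + remainder
    return registry, remainder, tag
```

Printing the normalized form next to a failed pull makes it easier to see whether the wrong registry or an unintended 'latest' tag was requested.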

Deployment failures in Kubernetes usually lead to the affected Pod not coming up. The 'kubectl describe pod <pod-name>' command prints an event log describing the reasons for the failure. 'kubectl describe' accepts 'pod', 'replicaset' and 'deployment' as resource types. These commands, combined with 'kubectl logs <pod-name>', are key to debugging deployment failures.
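As a rough illustration, the two commands can be wrapped in a helper that a build script might call when a rollout fails. This is a sketch under assumptions: the function names are hypothetical, kubectl is assumed to be on the PATH and already configured for the cluster, and the 'run' parameter exists only so the helper can be exercised without a cluster.

```python
import subprocess


def debug_commands(pod: str, namespace: str = "default") -> list[list[str]]:
    """Return the kubectl invocations used to inspect a failing Pod."""
    return [
        ["kubectl", "describe", "pod", pod, "--namespace", namespace],
        ["kubectl", "logs", pod, "--namespace", namespace],
    ]


def collect_debug_output(pod, namespace="default", run=subprocess.run):
    """Run each command and return its stdout plus stderr, keyed by command line."""
    output = {}
    for argv in debug_commands(pod, namespace):
        # check=False: a failing kubectl call should still be reported, not raised.
        result = run(argv, capture_output=True, text=True, check=False)
        output[" ".join(argv)] = result.stdout + result.stderr
    return output
```

Calling collect_debug_output("<pod-name>") against a real cluster returns the event log and the container logs in one dictionary, which can then be written to the build log.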

If the image pull policy is not set to always pull from the registry, changes might not become visible even after they have been committed and the new image pushed under the same tag. The recommended approach for production is to give each image a unique tag and to reference that tag when pulling the image. Specifying non-existent persistent volumes in the deployment config can also cause deployments to fail.
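The stale-image problem follows from how Kubernetes picks a pull policy when 'imagePullPolicy' is omitted. According to the Kubernetes documentation, the default is 'Always' for the 'latest' tag or a missing tag, and 'IfNotPresent' for any other tag or a digest. A minimal sketch of that rule, with a hypothetical function name:

```python
def default_pull_policy(image: str) -> str:
    """Return the pull policy Kubernetes applies when imagePullPolicy is omitted."""
    # A digest pins the exact image content, so a cached copy can be reused.
    if "@" in image:
        return "IfNotPresent"
    # Look for a tag only in the last path component, so that a registry
    # port such as "localhost:5000/app" is not mistaken for a tag.
    last = image.rsplit("/", 1)[-1]
    tag = last.split(":", 1)[1] if ":" in last else "latest"
    return "Always" if tag == "latest" else "IfNotPresent"
```

With a fixed tag such as 'app:1.0', a node that already holds an image under that tag will not pull it again, which is why re-pushing under an existing tag can leave old code running.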

Two other kinds of invalid input are missing ConfigMaps or Secrets that the application needs at runtime, and invalid Spec objects. A ConfigMap is a map of key-value pairs holding configuration data for the application. Its values can be passed to a container as command-line arguments, as environment variables or as files in a mounted volume. If a referenced ConfigMap is missing, Pod creation stops with the status set to "RunContainerError". Secrets are a mechanism for storing sensitive data such as credentials, and a missing Secret results in a similar error. When a ConfigMap or Secret is mounted as a volume and the mount fails, container creation stops with the status stuck at "ContainerCreating", and the failure is recorded in the event log.

Invalid Kubernetes Spec objects due to indentation mistakes in the YAML or typos are another cause of failures. These can be easily prevented by CLI-based YAML validation and by using the --dry-run flag like this:

kubectl create -f test-application.deploy.yaml --dry-run --validate=true

This requires access to a running Kubernetes cluster. There is ongoing work to remove this dependency and support client-side validation. YAML validation can also be added as a pre-commit hook in the source control system. On newer kubectl releases the bare --dry-run flag is deprecated; use --dry-run=client or --dry-run=server instead.
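A pre-commit hook does not have to call the cluster to catch some indentation mistakes. The sketch below flags tab characters in leading whitespace, which YAML does not allow for indentation. It is only one narrow check, not a substitute for schema validation, and the function name is hypothetical. It may also flag tabs inside block scalars, where they can be legitimate.

```python
def find_tab_indentation(text: str) -> list[int]:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is everything stripped off by lstrip(" \t").
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad
```

A hook could read each staged .yaml file, call this function, and reject the commit when the returned list is not empty.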

Another class of failed Kubernetes deployments involves exceeding resource limits. Pods and containers both have specified limits for CPU and memory. Exceeding these limits results in no Pods being created. Debugging this takes a bit of digging. Running 'kubectl describe deployment <deployment name>' gives the name of the ReplicaSet that Kubernetes attempted to create. Running 'kubectl describe replicaset <replicaset name>' with the ReplicaSet name obtained in the previous step then prints the event log, as in the other cases, along with the error message.
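To see why a Pod spec can be rejected, it helps to compare its limits against the allowed maximum in the same units. The sketch below parses a subset of Kubernetes quantity suffixes (for example '500m' CPU or '5Gi' memory) and reports which resources are over the maximum. The function names and the shape of the input dictionaries are assumptions made for illustration, not a Kubernetes API.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes: binary (Ki, Mi, ...) and decimal (m, k, M, ...).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"), "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6, "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
}


def parse_quantity(value: str) -> Decimal:
    """Convert a quantity such as "500m" or "128Mi" to a plain number."""
    for suffix, factor in _SUFFIXES.items():
        if value.endswith(suffix):
            return Decimal(value[: -len(suffix)]) * factor
    return Decimal(value)  # no suffix: already a plain number


def exceeds_max(limits: dict[str, str], max_allowed: dict[str, str]) -> list[str]:
    """Return the resource names whose container limit is above the allowed maximum."""
    return [
        name for name, value in limits.items()
        if name in max_allowed
        and parse_quantity(value) > parse_quantity(max_allowed[name])
    ]
```

For instance, a container asking for '5Gi' of memory in a namespace that allows at most '100Mi' per container would be reported under 'memory'.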

Deployment failures can also result from exceeding resource quotas, which are a mechanism to limit resource consumption per namespace when teams share a cluster with a fixed number of nodes. Resources include pods, services and deployments as well as the total amount of compute resources. In this case too, the 'kubectl describe' command helps in digging down to the actual error message.
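The quota check itself is simple arithmetic: what is already used in the namespace plus what is requested must not exceed the hard limit. A minimal sketch for object-count quotas, with a hypothetical function name and plain integer counts as input:

```python
def quota_violations(
    hard: dict[str, int], used: dict[str, int], requested: dict[str, int]
) -> dict[str, int]:
    """Return, per resource, by how much a request would overshoot the hard quota."""
    over = {}
    for name, amount in requested.items():
        if name not in hard:
            continue  # this resource is not constrained by the quota
        excess = used.get(name, 0) + amount - hard[name]
        if excess > 0:
            over[name] = excess
    return over
```

A namespace limited to 2 pods that already runs 2 would report an excess of 1 for a request to add one more pod.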

The cluster autoscaler automatically adjusts the size of the Kubernetes cluster when either nodes are underutilized or a pod cannot run due to insufficient resources. If it is not enabled, deployments that request more than the available resources fail with the Pod status stuck at 'Pending'. The event log shows which resource was insufficient.

Unexpected changes in the application’s behavior can cause failures in different ways. A launch error with the message 'CrashLoopBackOff' is usually caused by the application crashing. The application logs can help to figure out the problem. The liveness and readiness probes that Kubernetes uses to detect the health and readiness of a service can also fail because of a misconfiguration or a timeout. For example, the health check URL might have changed in the application, or it might not work as expected because of a database change. Some URLs take a while to respond to the readiness check, which might time out and fail the deployment.
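The Pod statuses mentioned in this summary can be gathered into a small lookup table that points to the likely cause. This is a sketch based only on the cases described here. The names LIKELY_CAUSES and likely_cause are hypothetical, and a real cluster reports more statuses than these four.

```python
# Pod status -> likely cause, restricted to the cases described in this article.
LIKELY_CAUSES = {
    "RunContainerError": "a ConfigMap or Secret referenced by the Pod is missing",
    "ContainerCreating": "a ConfigMap or Secret volume could not be mounted",
    "Pending": "the cluster lacks the CPU or memory the Pod requests",
    "CrashLoopBackOff": "the application crashes on startup; check its logs",
}


def likely_cause(status: str) -> str:
    """Return a hint for a Pod status, or a fallback pointing to the event log."""
    return LIKELY_CAUSES.get(
        status, "not covered here; inspect the event log with kubectl describe"
    )
```

The hint is only a starting point; the event log printed by 'kubectl describe' remains the authoritative source for the actual error.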

The article’s author has open-sourced a script that prints helpful Kubernetes-related information to the build log whenever a build fails.

Community comments

  • Prevalidation is a godsend

    by Lord Fire,

    I just ran into a problem today: the name of a named TCP port cannot exceed 15 characters, for whatever reason.
