
Introduction:
Claiming Kubernetes expertise is easy, but when a production cluster starts throwing CrashLoopBackOff, ImagePullBackOff, or mysterious `Pending` pods, the gap between theory and real-world readiness becomes painfully clear. This article extracts operational troubleshooting patterns from a comprehensive Kubernetes error encyclopedia, focusing on root cause analysis and step‑by‑step commands that transform guesswork into systematic incident response.
Learning Objectives:
- Diagnose and resolve the top 10 production Kubernetes failures using kubectl, `kubelet` logs, and container runtime checks.
- Implement a signal‑based troubleshooting methodology for pod lifecycle, networking, storage, and node health issues.
- Apply Linux/Windows commands and security hardening techniques to prevent recurring pod crashes, DNS timeouts, and admission control denials.
You Should Know:
1. Pod Lifecycle Failures – CrashLoopBackOff & OOMKilled
Step‑by‑step guide:
When a pod crashes repeatedly, Kubernetes enters CrashLoopBackOff. Start by inspecting the previous container’s logs:
Get pod name and namespace:
kubectl get pods -n <namespace>
View logs of the crashed container (--previous is critical):
kubectl logs <pod-name> --previous -n <namespace>
Describe pod for events and exit codes:
kubectl describe pod <pod-name> -n <namespace>
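CrashLoopBackOff is a waiting state: the kubelet delays each restart with an exponential backoff. The sketch below approximates that delay, assuming the documented defaults (10 seconds initially, doubling per restart, capped at 300 seconds). The function name `crashloop_delay` is hypothetical, and real clusters may add jitter or reset the timer after a period of stable running.

```shell
#!/bin/sh
# Approximate wait before restart number N (1-based).
# Assumption: default kubelet backoff of 10s, doubling, capped at 300s.
crashloop_delay() {
    n=$1
    delay=10
    i=1
    while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt 300 ]; then
        delay=300
    fi
    echo "$delay"
}
```

For example, `crashloop_delay 4` gives a rough idea of how long to wait before the fourth restart attempt shows up in `kubectl get pods -w`.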
Common root causes:
- Application exits with code 1 (misconfiguration, missing env vars)
- OOMKilled – memory limit too low
Check OOM details:
Linux: inspect system OOM logs:
journalctl -k | grep -i "oom" | tail -20
For containerized environments (crictl or docker):
crictl logs <container-id> --previous
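The exit code shown by `kubectl describe pod` often narrows the cause before any logs are read. The helper below is a minimal sketch of that mapping. The name `explain_exit_code` and the wording of the hints are illustrative, not part of any Kubernetes tool. It relies on the shell convention that codes above 128 mean "terminated by signal (code - 128)".

```shell
#!/bin/sh
# Map a container exit code to a likely cause (hypothetical helper).
explain_exit_code() {
    code=$1
    case "$code" in
        0)   echo "clean exit" ;;
        1)   echo "application error" ;;
        126) echo "command not executable" ;;
        127) echo "command not found" ;;
        137) echo "SIGKILL, often OOMKilled" ;;
        139) echo "SIGSEGV" ;;
        143) echo "SIGTERM" ;;
        *)
            if [ "$code" -gt 128 ]; then
                echo "killed by signal $((code - 128))"
            else
                echo "application-defined exit code"
            fi
            ;;
    esac
}

# Usage against a live cluster (placeholders as in the commands above):
#   explain_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

A result of 137 alone does not prove an OOM kill. Confirm it against the `Reason: OOMKilled` field in the describe output or the kernel log.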
Fix:
- Increase memory limits in the Deployment (spec.template.spec.containers[].resources.limits.memory)
- Debug the application entrypoint – run an interactive debug pod:
kubectl run debug --image=busybox -it --rm -- sh
Windows equivalent (if using Windows containers on AKS/EKS):
Get pod logs via kubectl (same command):
kubectl logs <pod-name> --previous
Check container events via Docker (if Docker runtime):
docker ps -a | findstr <pod-prefix>
docker logs <container-id>
2. ImagePullBackOff / ErrImagePull – Registry & Authentication Failures
Step‑by‑step guide:
This error means the kubelet cannot pull the container image. Troubleshoot systematically:
Describe the pod and look for "Failed to pull image":
kubectl describe pod <pod-name> | grep -A 10 "Events"
Check if the image exists locally:
crictl images | grep <image-name>
Test the image pull manually on a node (SSH into the node):
sudo crictl pull <registry>/<image>:<tag>
Or with Docker:
sudo docker pull <registry>/<image>:<tag>
Common fixes:
- Wrong image name or tag (typo)
- Missing imagePullSecrets for private registries
Add to pod spec:
imagePullSecrets:
- name: regcred
- Network policies blocking registry access
Create a secret for Docker Hub / ACR / ECR:
kubectl create secret docker-registry regcred \
  --docker-server=<registry-server> \
  --docker-username=<user> \
  --docker-password=<token> \
  --docker-email=<email>
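The three fixes above correspond to recognizable fragments in the "Failed to pull image" event message. The following sketch sorts a message into one of those buckets. The function name `classify_pull_error` and the matched substrings are assumptions based on commonly seen runtime messages. Exact wording differs between registries and runtimes, so treat the result as a hint.

```shell
#!/bin/sh
# Classify an image pull error message (hypothetical helper; the
# substrings matched here are assumptions, not an exhaustive list).
classify_pull_error() {
    msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$msg" in
        *"not found"*|*"manifest unknown"*)
            echo "bad-name-or-tag" ;;
        *unauthorized*|*"authentication required"*|*denied*)
            echo "auth" ;;
        *timeout*|*"no such host"*|*"connection refused"*)
            echo "network" ;;
        *)
            echo "unknown" ;;
    esac
}

# Usage: paste the message from the Events section of `kubectl describe pod`:
#   classify_pull_error "<event message>"
```

Some registries answer "access denied" for repositories that do not exist, so an `auth` result is still worth cross-checking against the image name.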
3. Pending Pods – Resource Shortage & PVC Binding Issues
Step‑by‑step guide:
A pod stays in `Pending` when the scheduler cannot find a suitable node or a PVC is not bound.
Describe pod to see the exact reason:
kubectl describe pod <pending-pod> -n <namespace>
Check node resources:
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
Inspect PVC status:
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
If PVC is pending:
- StorageClass missing or not provisioned
- PersistentVolume insufficient or accessModes mismatch
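On a busy cluster it helps to list every claim that is not `Bound` in one pass. The filter below is a small sketch that reads the output of `kubectl get pvc --all-namespaces --no-headers` on standard input. The function name `list_unbound_pvcs` is hypothetical, and it assumes the default column order (namespace, name, status).

```shell
#!/bin/sh
# Print "namespace/name (status)" for every PVC that is not Bound.
# Assumption: input has the default columns of
# `kubectl get pvc --all-namespaces --no-headers`.
list_unbound_pvcs() {
    awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2 " (" $3 ")" }'
}

# Usage:
#   kubectl get pvc --all-namespaces --no-headers | list_unbound_pvcs
```

Each claim it prints can then be passed to `kubectl describe pvc` to read the provisioning events.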
If the scheduler reports untolerated taints or a node selector mismatch, add a matching toleration or correct the node selector in the pod spec, for example:
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
To drain a node and free resources:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>
4. Node NotReady & DiskPressure – Kubelet and Node Health Checks
Step‑by‑step guide:
When a node becomes NotReady, check the kubelet status and system resources.
SSH into the problematic node:
ssh <node-ip>
Check kubelet service:
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 50 --no-pager
Look for DiskPressure, MemoryPressure, PIDPressure:
kubectl describe node <node-name> | grep -A 5 "Conditions"
Check disk usage:
df -h
Docker storage:
sudo du -sh /var/lib/docker/overlay2
Containerd storage:
sudo du -sh /var/lib/containerd
Mitigate DiskPressure:
Clean unused container images
sudo crictl rmi --prune
Or Docker
sudo docker system prune -a -f
Remove evicted pods
kubectl get pods --all-namespaces | grep Evicted | awk '{print $2 " --namespace=" $1}' | xargs -n 2 kubectl delete pod
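To spot a filesystem approaching eviction before the node reports DiskPressure, the output of `df -P` can be filtered by usage. This is a minimal sketch: the function name `flag_full_filesystems` and the default threshold of 85 percent are assumptions chosen to sit just below the kubelet's default hard-eviction signals (`nodefs.available<10%`, `imagefs.available<15%`). Adjust it if the cluster overrides those.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "mountpoint usage%" for every
# filesystem at or above the threshold (default 85, an assumed value).
# Limitation: mount points containing spaces are not handled.
flag_full_filesystems() {
    threshold=${1:-85}
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}

# Usage on the node:
#   df -P | flag_full_filesystems 85
```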
Windows node (if using Windows Server):
Check kubelet logs
Get-WinEvent -LogName Application | Where-Object {$_.ProviderName -like "*kubelet*"} | Select-Object -First 50
Check disk space
Get-PSDrive C
5. Networking Failures – CoreDNS & kube-proxy & Ingress 502/504
Step‑by‑step guide:
DNS resolution failures and Ingress 5xx errors are among the most subtle production killers.
Test DNS from inside a pod:
kubectl run -it --rm test-dns --image=busybox:1.28 -- nslookup kubernetes.default
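The short name `kubernetes.default` resolves only through the search path in the pod's `/etc/resolv.conf`. Repeating the lookup with a fully qualified name helps separate search-path problems from a CoreDNS outage. The helper below builds that name. `svc_fqdn` is a hypothetical function, and `cluster.local` is assumed as the cluster domain (the default, and the zone used in the Corefile in this section).

```shell
#!/bin/sh
# Build the fully qualified DNS name of a Service.
# Assumptions: namespace defaults to "default", cluster domain to "cluster.local".
svc_fqdn() {
    svc=$1
    ns=${2:-default}
    domain=${3:-cluster.local}
    echo "${svc}.${ns}.svc.${domain}"
}

# Usage:
#   kubectl run -it --rm test-dns --image=busybox:1.28 -- \
#     nslookup "$(svc_fqdn kubernetes default)"
```

If the fully qualified lookup succeeds while the short name fails, the search domains or `ndots` setting in the pod are the more likely culprit.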
Check CoreDNS pods:
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
If DNS times out, increase CoreDNS replicas or adjust ConfigMap:
CoreDNS ConfigMap snippet:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        reload
    }
Troubleshoot Ingress 502 (upstream unhealthy):
Get Ingress controller logs (e.g., nginx-ingress):
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
Check service endpoints (missing endpoints cause 502):
kubectl get endpoints <service-name> -n <namespace>
If endpoints are empty, check that the pod labels match the service selector.
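The selector comparison can be done mechanically. `kubectl get svc -o wide` prints the selector and `kubectl get pods --show-labels` prints pod labels, both as comma-separated `key=value` pairs. The sketch below checks that every selector pair appears among a pod's labels. The function name `selector_matches` is hypothetical, and it handles only equality selectors, which is what a Service uses.

```shell
#!/bin/sh
# Return 0 if every key=value pair of the selector is present in the labels.
# Both arguments are comma-separated lists, e.g. "app=web,tier=frontend".
selector_matches() {
    selector=$1
    labels=",$2,"
    old_ifs=$IFS
    IFS=,
    for pair in $selector; do
        case "$labels" in
            *",$pair,"*) ;;
            *)
                IFS=$old_ifs
                echo "missing: $pair"
                return 1
                ;;
        esac
    done
    IFS=$old_ifs
    return 0
}

# Usage (values copied from the SELECTOR and LABELS columns):
#   selector_matches "app=web,tier=frontend" "app=web,tier=frontend,version=v2"
```

The first missing pair it reports is usually the typo or stale label that leaves the Endpoints object empty.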
Fix Service ClusterIP unreachable:
Verify kube-proxy mode and iptables:
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
On the node, check iptables rules:
sudo iptables -t nat -L -n | grep <cluster-ip>
6. Storage & CSI Volume Mount Failures
Step‑by‑step guide:
These steps apply when a pod cannot mount a PersistentVolume: the PVC is stuck `Pending`, or the pod fails with a `Multi-Attach error` or a `VolumeMount` failure.
Describe pod for volume mount errors:
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 Volumes
Check CSI driver logs (e.g., for EBS, Azure Disk):
kubectl logs -n <csi-namespace> -l app=<csi-driver> --tail=100
Manually check if the volume is attached to a node (cloud provider CLI; AWS example):
aws ec2 describe-volumes --volume-ids vol-xxxxx --query 'Volumes[].Attachments'
Resolve PVC not bound:
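Ensure StorageClass has correct provisioner and reclaimPolicy (gp3 volumes require the EBS CSI driver, ebs.csi.aws.com, rather than the legacy in-tree kubernetes.io/aws-ebs provisioner):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain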
Force unmount stale NFS volume on Linux node:
sudo umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<pvc-name>
7. Security Admission Denials – Pod Security & NetworkPolicy
Step‑by‑step guide:
Modern clusters enforce Pod Security Standards (PSS) or OPA/Gatekeeper. A pod may be rejected with Forbidden: violates PodSecurity.
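Check pod events for admission messages (a pod rejected at creation leaves its events on the owning ReplicaSet instead):
kubectl describe pod <pod-name> -n <namespace> | grep -i "admission"
Check namespace labels for PSS level:
kubectl get namespace <namespace> -o yaml | grep pod-security
If using OPA Gatekeeper, list constraints (Kyverno uses its own ClusterPolicy/Policy resources instead):
kubectl get constraint -A | grep <denial-reason>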
Fix by relaxing PSS (if safe) or patching pod spec:
kubectl label namespace <namespace> pod-security.kubernetes.io/enforce=privileged --overwrite
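Setting `enforce=privileged` switches off all Pod Security checks for the namespace. A less drastic step is to try an intermediate level and preview its effect with a server-side dry run, which reports existing pods that would violate the level without changing the label. The sketch below validates the level and prints the command. `pss_dry_run_cmd` is a hypothetical helper, and it only echoes the command rather than running it.

```shell
#!/bin/sh
# Print a dry-run command for previewing a Pod Security Standards level.
# Valid levels are the three defined by PSS: privileged, baseline, restricted.
pss_dry_run_cmd() {
    ns=$1
    level=$2
    case "$level" in
        privileged|baseline|restricted) ;;
        *)
            echo "unknown Pod Security level: $level" >&2
            return 1
            ;;
    esac
    echo "kubectl label --dry-run=server --overwrite namespace $ns pod-security.kubernetes.io/enforce=$level"
}

# Usage:
#   pss_dry_run_cmd my-namespace baseline
```

Review the printed command, run it, and apply the label without `--dry-run=server` only once the warnings are understood.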
NetworkPolicy blocking traffic – test connectivity:
Start a debug pod that ships with netcat:
kubectl run netshoot --image=nicolaka/netshoot -it --rm -- /bin/bash
Then test from inside the pod:
nc -zv <target-service> <port>
Create a permissive policy for troubleshooting:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}
What Undercode Say:
- Systematic elimination beats guessing – Every production outage teaches that `kubectl describe` and `kubectl logs --previous` should be muscle memory, not last resorts. The difference between a junior and senior engineer is the structured order of checks.
- Kubernetes troubleshooting is multi‑layer – A single `CrashLoopBackOff` can stem from OOM, missing config maps, volume mount race conditions, or even a failed liveness probe. Real readiness requires weaving together pod, node, storage, and networking signals.
Prediction:
As Kubernetes adoption deepens in edge and multi‑cloud environments, AI‑driven root cause analysis tools will become standard – but they will never replace the need for engineers who understand failure patterns at the `kubelet` and `cni` levels. Expect to see certification exams shift from YAML syntax to live‑cluster incident simulations, and observability platforms will auto‑correlate events with the exact commands shown in this guide. Teams that invest in error encyclopedias and chaos engineering will outpace those still relying on tribal knowledge.
Reported By: Firdevs Balaban – Hackers Feeds


