Hands-on troubleshooting Kubernetes applications.
Table of contents:
Preparation | Intro | Poking pods |
---|---|---|
Storage | Network | Security |
Observability | Vaccination | References |
To demonstrate the different issues and failures, as well as how to fix them, I’ve been using the commands and resources shown below.
Before starting, set up:
# create the namespace we'll be operating in:
kubectl create ns vnyc
# in different tmux pane keep an eye on the resources:
watch kubectl -n vnyc get all
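If you don't have watch available, kubectl's built-in --watch flag is a rough substitute:
# streams resource changes rather than redrawing the full view:
kubectl -n vnyc get all --watch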
Using 00_intro.yaml:
kubectl -n vnyc apply -f 00_intro.yaml
kubectl -n vnyc describe deploy/unhappy-camper
THEPOD=$(kubectl -n vnyc get po -l=app=whatever --output=jsonpath={.items[*].metadata.name})
kubectl -n vnyc describe po/$THEPOD
kubectl -n vnyc logs $THEPOD
kubectl -n vnyc exec -it $THEPOD -- sh
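The jsonpath output used for THEPOD above pulls just the pod name out of the list result; the same pattern works for any field. For example, to grab the pod's phase instead (purely illustrative):
kubectl -n vnyc get po -l=app=whatever --output=jsonpath={.items[*].status.phase}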
kubectl -n vnyc delete deploy/unhappy-camper
Using 01_pp_image.yaml:
# let's deploy a confused image and look for the error:
kubectl -n vnyc apply -f 01_pp_image.yaml
kubectl -n vnyc get events | grep confused | grep Error
# fix it by specifying the correct image:
kubectl -n vnyc patch deployment confused-imager \
--patch '{ "spec" : { "template" : { "spec" : { "containers" : [ { "name" : "something" , "image" : "mhausenblas/simpleservice:0.5.0" } ] } } } }'
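If you'd rather not hand-craft the JSON patch, kubectl set image achieves the same fix; the container name something is taken from the patch above:
kubectl -n vnyc set image deploy/confused-imager something=mhausenblas/simpleservice:0.5.0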
kubectl -n vnyc delete deploy/confused-imager
Using 02_pp_oomer.yaml and 02_pp_oomer-fixed.yaml:
# prepare a greedy fellow that will OOM:
kubectl -n vnyc apply -f 02_pp_oomer.yaml
# wait > 5s and then check mem in container:
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath={.items[*].metadata.name}) -c greedymuch -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes
kubectl -n vnyc describe po $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath={.items[*].metadata.name})
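To confirm it really was the OOM killer that restarted the container, you can also query the last termination reason directly; this assumes the container has been killed at least once already. Expect OOMKilled here if the limit was hit:
kubectl -n vnyc get po -l=app=oomer --output=jsonpath='{.items[*].status.containerStatuses[?(@.name=="greedymuch")].lastState.terminated.reason}'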
# fix the issue:
kubectl -n vnyc apply -f 02_pp_oomer-fixed.yaml
# wait > 20s
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath={.items[*].metadata.name}) -c greedymuch -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes
kubectl -n vnyc delete deploy wegotan-oomer
Using 03_pp_logs.yaml:
kubectl -n vnyc apply -f 03_pp_logs.yaml
# nothing to see here:
kubectl -n vnyc describe deploy/hiccup
# but I see it in the logs:
kubectl -n vnyc logs --follow $(kubectl -n vnyc get po -l=app=hiccup --output=jsonpath={.items[*].metadata.name})
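If the container is crash-looping rather than just hiccuping, the logs of the current incarnation may be empty; in that case the previous container's logs usually hold the error:
kubectl -n vnyc logs --previous $(kubectl -n vnyc get po -l=app=hiccup --output=jsonpath={.items[*].metadata.name})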
kubectl -n vnyc delete deploy hiccup
Using 04_storage-failedmount.yaml and 04_storage-failedmount-fixed.yaml:
kubectl -n vnyc apply -f 04_storage-failedmount.yaml
# has the data been written?
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath={.items[*].metadata.name}) -c writer -- cat /tmp/out/data
# has the data been read in?
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath={.items[*].metadata.name}) -c reader -- cat /tmp/in/data
kubectl -n vnyc describe po $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath={.items[*].metadata.name})
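Mount problems also surface as FailedMount events, so filtering events can be quicker than reading the whole describe output:
kubectl -n vnyc get events --field-selector reason=FailedMount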
kubectl -n vnyc apply -f 04_storage-failedmount-fixed.yaml
kubectl -n vnyc delete deploy wheresmyvolume
Using 05_network-wrongsel.yaml and 05_network-wrongsel-fixed.yaml:
kubectl -n vnyc run webserver --image nginx --port 80
kubectl -n vnyc apply -f 05_network-wrongsel.yaml
kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- curl webserver.vnyc
kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- ping webserver.vnyc
kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- ping $(kubectl -n vnyc get po -l=run=webserver --output=jsonpath={.items[*].status.podIP})
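The underlying problem with a wrong selector is that the service has no endpoints to forward traffic to, which you can verify directly:
kubectl -n vnyc get endpoints webserver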
kubectl -n vnyc apply -f 05_network-wrongsel-fixed.yaml
kubectl -n vnyc delete deploy webserver
Other scenarios often found:
Getting a connection refused? You could be hitting the 127.0.0.1 issue; the fix is to make the app listen on 0.0.0.0 rather than on localhost.
Another handy trick: change a pod's labels so that they no longer match the service's selector. That removes the pod from the pool of endpoints the service has to serve traffic to while leaving the pod running, ready for you to kubectl exec -it in.
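For example, to pull one webserver pod out of the service's rotation this way (relying on the run=webserver label that kubectl run set above; the trailing dash removes the label):
kubectl -n vnyc label po $(kubectl -n vnyc get po -l=run=webserver --output=jsonpath={.items[0].metadata.name}) run-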
kubectl -n vnyc create sa prober
kubectl -n vnyc run -it --rm probepod --serviceaccount=prober --restart=Never --image=centos:7 -- sh
# in the container; this will result in a 403, b/c we don't have the necessary permissions:
export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
APISERVERTOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -H "Authorization: Bearer $APISERVERTOKEN" https://kubernetes.default/api/v1/namespaces/vnyc/pods
# in a different tmux pane, verify whether the SA actually is allowed to list pods:
kubectl -n vnyc auth can-i list pods --as=system:serviceaccount:vnyc:prober
# … seems not to be the case, so give sufficient permissions:
kubectl create clusterrole podreader \
--verb=get --verb=list \
--resource=pods
kubectl -n vnyc create rolebinding allowpodprobes \
--clusterrole=podreader \
--serviceaccount=vnyc:prober \
--namespace=vnyc
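With the binding in place, the earlier check should now return yes, and the curl from inside the probe pod succeeds:
kubectl -n vnyc auth can-i list pods --as=system:serviceaccount:vnyc:prober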
# clean up
kubectl delete clusterrole podreader && kubectl delete ns vnyc
References: see kubernetes-security.info.
From metrics (Prometheus and Grafana) to logs (EFK/ELK stack) to tracing (OpenCensus and OpenTracing).
Show Linkerd 2.0 in action using this Katacoda scenario as a starting point.
Show Jaeger 1.6 in action using this Katacoda scenario.
Show chaoskube in action, killing off random pods in the vnyc namespace.
We have the following setup:
+----------------+
| |
+-----> | webserver/pod1 |
| | |
+----------------+ | +----------------+
| | | +----------------+
| appserver/pod1 +--------+ +---------+ | | |
| | | +--+ | +-----> | webserver/pod2 |
+----------------+ | X | | | |
| X | | +----------------+
| X | | +----------------+
v X | | | |
X svc/webserver +--------> | webserver/pod3 |
^ X | | | |
+----------------+ | X | | +----------------+
| | | X | | +----------------+
| appserver/pod2 +--------+ X | | | |
| | +--+ | +-----> | webserver/pod4 |
+----------------+ +----------+ | | |
| +----------------+
| +----------------+
| | |
+-----> | webserver/pod5 |
| |
+----------------+
That is, a webserver running with five replicas along with a service, as well as an appserver running with two replicas that queries said service.
# let's create our victims, that is webservers and appservers:
kubectl create ns vnyc
kubectl -n vnyc run webserver --image nginx --port 80 --replicas 5
kubectl -n vnyc expose deploy/webserver
kubectl -n vnyc run appserver --image centos:7 --replicas 2 -- sh -c "while true; do curl webserver ; sleep 10 ; done"
kubectl -n vnyc logs deploy/appserver --follow
# also keep an eye on the events generated:
kubectl -n vnyc get events --watch
# now release the chaos monkey:
chaoskube \
--interval 30s \
--namespaces 'vnyc' \
--no-dry-run
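While chaoskube is killing pods, watching the service's endpoints gives another useful vantage point; you can see victims drop out and their replacements join (an extra observation step, not part of the original command list):
kubectl -n vnyc get endpoints webserver --watch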
kubectl delete ns vnyc