https://k8s.af/
A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.
- How a couple of characters brought down our site - Skyscanner - blog post 2021
- involved: GitOps, templating, namespace deletion
- impact: global outage
- 10 More Weird Ways to Blow Up Your Kubernetes - Airbnb - KubeCon NA 2020
- involved: MutatingAdmissionWebhook, CPU Limits, OOMKill, kube2iam, HPA
- impact: outages
- Why we switched from fluent-bit to Fluentd in 2 hours - PrometheusKube - blog post 2020
- involved: fluent-bit, missing logs, Fluentd
- impact: lost application logs in production
- Make your services faster by removing CPU limits - Buffer - blog post 2020
- involved: kops, CPU Limit, CPU throttling
- impact: high latency
- The case of the missing packet: An EKS migration tale - MindTickle - blog post 2020
- involved: EKS, AWS CNI Plugin
- impact: frequent connection failures when talking to services outside the cluster
- Kubernetes Networking Problems Due to the Conntrack - loveholidays - blog post 2020
- involved: GKE, conntrack, HAProxy
- impact: high error rate on network-heavy services
- DNS issues in Kubernetes. Public postmortem #1 - Preply - blog post 2020
- involved: conntrack, DNS, CoreDNS-autoscaler
- impact: partial production outage
- CPU limits and aggressive throttling in Kubernetes - Omio - blog post 2020
- involved: GKE, CPU Limit, CPU throttling
- impact: high latency, errors
- When GKE ran out of IP addresses - loveholidays - blog post 2020
- involved: GKE, cluster autoscaler, HPA, Alias IP VPC (VPC Native)
- impact: stuck deployment, blocked autoscaling of both pods and nodes
- Intermittent delays in Kubernetes - MindTickle - blog post 2019
- involved: kops, conntrack DNAT/SNAT, libc, musl
- impact: intermittent delays of 5 seconds in DNS lookups for HTTP/1.1 and HTTP/2
- 10 Weird Ways to Blow Up Your Kubernetes - Airbnb - KubeCon NA 2019
- involved: sidecars, DaemonSet, image registry, JVM, HPA
- impact: outages
- Did Kubernetes Make My p95s Worse? - Airbnb - KubeCon NA 2019
- involved: CPU Limit, CPU throttling, DNS
- impact: high latency
- How we failed to integrate Istio into our platform - Exponea - blog post 2019
- involved: Istio, GKE, proxy injection
- impact: stopped Istio rollout, developers' time spent
- Kubernetes made my latency 10x higher - Adevinta - blog post 2019
- involved: KIAM, DNS, AWS IAM, latency
- impact: service showing up to 10x higher latency than a deployment on EC2
- A Kubernetes failure story (dex) - anonymous Fullstaq client - Dutch Kubernetes meetup slides 2019-06
- involved: etcd, apiserver, dex, custom resources
- impact: broken control plane in production with no access to observability (o11y) tooling due to a broken authentication system; no actual business impact
- How A Cryptocurrency Miner Made Its Way onto Our Internal Kubernetes Clusters - JW Player - Medium post March 2019
- involved: Weave Scope, public AWS ELB
- impact: security issue, cryptominer stealing compute power
- A Kubernetes crime story - Prezi - blog post 2019
- involved: AWS EKS, SNAT, conntrack, Amazon VPC CNI plugin
- impact: delay of 1-3 seconds for outgoing TCP connections