Chaos Engineering on AWS: Fault Injection Simulator Guide
You don’t know your system is resilient until you’ve broken it on purpose.
I believed our payment processing service was fault tolerant. …
106 articles about devops development, tools, and best practices
Gateway API is what Ingress should have been from day one.
I don’t say that lightly. I’ve spent years wrangling Kubernetes Ingress …
If you’re not running scheduled terraform plan, you have drift. You just don’t know it yet.
I learned this the hard way. A colleague made …
Everything I’ve learned building on AWS since 2012, organized by domain.
This is the hub for everything I’ve written about Kubernetes. Whether you’re setting up your first cluster or optimizing a multi-tenant …
I’ve been running Kubernetes in production for years now, and there’s a specific kind of pain that only hits you once you cross the …
99.99% availability sounds great until you realize that’s 4 minutes and 19 seconds of downtime per month. Four minutes. That’s barely …
I mass-deleted requirements.txt files from a monorepo last month. Fourteen of them. Some had unpinned dependencies, some had pins from 2021, one had a …
NGINX Ingress is the Honda Civic of ingress controllers. Boring, reliable, gets the job done. I’ve deployed it on dozens of clusters and …
I deleted roughly 2,000 lines of orchestration code from our payment processing service last year. Replaced it with about 200 lines of Amazon States …
Most Terraform code has zero tests. That’s insane for something managing production infrastructure. We wouldn’t ship application code …
I spent four hours on a Tuesday night debugging a 30-second API call. Four hours. The call touched 12 services: auth, inventory, pricing, three …
If you’re not scanning container images before they hit production, it’s only a matter of time before something ugly shows up in your …
EventBridge is the most underused AWS service. I’ll die on that hill. Teams will build these elaborate Rube Goldberg machines out of SNS topics, …
Operator SDK vs kubebuilder: I pick kubebuilder every time. Operator SDK wraps kubebuilder anyway, adds a layer of abstraction that mostly just gets …
I use both. Terraform for multi-cloud, CDK when it’s pure AWS and the team knows TypeScript. That’s the short answer. But the long answer …
Platform engineering is DevOps done right. Or maybe it’s DevOps with a product mindset. Either way, it’s the recognition that telling …
CPU-based autoscaling is a lie for most web services. There, I said it.
I spent a painful week last year watching an HPA scale our API pods from 3 to …
VPNs are not zero trust. Stop calling them that.
I can’t count how many times I’ve sat in architecture reviews where someone points at a …
I got a call from a startup founder last year. “Our AWS bill just hit $47,000 and we have twelve engineers.” They’d been running for …
I’m going to say something that’ll upset people: if your developers have cluster-admin access in production, you’re running on …
I once inherited a project with a single main.tf that was over 3,000 lines long. No modules. No abstractions. Just one enormous file that deployed an …
ArgoCD won the GitOps war. I’ll say it. Flux is fine—it works, it’s CNCF graduated, it has its fans—but ArgoCD’s UI alone makes it …
I started learning Rust as someone who’d spent years writing Python scripts and Go services for cloud infrastructure. My first reaction was …