Chaos Engineering on AWS: Fault Injection Simulator Guide
You don’t know your system is resilient until you’ve broken it on purpose.
I believed our payment processing service was fault tolerant. We ran multi-AZ. We had health checks. We had auto scaling. We had all the boxes ticked on the Well-Architected review. Then us-east-1b had a networking event on a Tuesday afternoon, and we watched a service that was supposed to gracefully fail over instead fall flat on its face. The load balancer kept routing to unhealthy targets for nearly four minutes because our health check intervals were too generous. The database failover triggered but the application’s connection pool held stale connections for another two minutes after that. Six minutes of degraded service for a payment processor. That’s the kind of thing that gets you a phone call from someone whose title starts with “Chief.”
The postmortem was brutal. Every single failure mode we hit was something we could’ve caught if we’d actually tested failover instead of assuming it worked. That incident is why I now treat chaos engineering practices as non-negotiable for anything running in production.
AWS Fault Injection Simulator — now called AWS Fault Injection Service, though everyone still says FIS — is the tool that changed how I approach this. It lets you run controlled experiments against your AWS infrastructure: stop EC2 instances, fail over RDS clusters, inject network latency, stress CPU on EKS pods, and more. All with guardrails so you don’t accidentally take down production while trying to prove it won’t go down.
Here’s how I use it, and how you should too.
What FIS Actually Does
FIS is a managed service for running fault injection experiments against AWS resources. You define an experiment template — what to break, how to break it, and when to stop — and FIS executes it while monitoring your stop conditions.
The core concepts are straightforward:
- Actions — the faults you inject. Stop instances, fail over databases, disrupt network connectivity, stress CPU, inject API errors.
- Targets — the AWS resources you’re hitting. EC2 instances, ECS tasks, EKS pods, RDS clusters, subnets.
- Stop conditions — CloudWatch alarms that automatically halt the experiment if things go sideways beyond what you intended.
- Experiment templates — reusable definitions that combine all of the above.
The supported action list is extensive. EC2 actions include stopping, rebooting, and terminating instances, plus Spot interruption simulation. ECS gets task-level CPU stress, network blackhole, latency injection, and packet loss. EKS has pod-level equivalents plus node group termination. RDS supports cluster failover and instance reboot. There are also network-level actions for disrupting subnet connectivity, VPC endpoints, and transit gateways, plus API-level fault injection for throttling and internal errors.
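You don't need to memorize the catalog. I check what's available in a given region straight from the CLI:

# List every fault action FIS supports in this region
aws fis list-actions \
  --query 'actions[].id' \
  --output table

# Inspect a single action's parameters and target types
aws fis get-action \
  --id aws:ec2:stop-instances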
The thing that makes FIS different from running aws ec2 stop-instances in a script is the guardrails. Stop conditions tied to CloudWatch alarms mean the experiment automatically rolls back if your error rate or latency crosses a threshold. That’s the difference between chaos engineering and just breaking things.
Setting Up: IAM Role and Permissions
Before you run anything, FIS needs an IAM role with permission to do the things you’re asking it to do. This is where I see people get stuck — they create a role that’s either too broad or too narrow.
Here’s a role trust policy that lets FIS assume it:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "fis.amazonaws.com"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"aws:SourceAccount": "111122223333"
}
}
}
]
}
The aws:SourceAccount condition is important. Without it, an experiment started from another account could use FIS to assume your role — the classic confused-deputy problem. I've seen this missed in tutorials and it's a real security gap.
For the permissions policy, scope it to exactly what your experiments need. If you’re stopping EC2 instances and failing over RDS:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:StopInstances",
"ec2:StartInstances",
"ec2:DescribeInstances"
],
"Resource": "arn:aws:ec2:us-east-1:111122223333:instance/*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/Environment": "staging"
}
}
},
{
"Effect": "Allow",
"Action": [
"rds:FailoverDBCluster",
"rds:DescribeDBClusters"
],
"Resource": "arn:aws:rds:us-east-1:111122223333:cluster:*"
}
]
}
Tag-based conditions on the EC2 permissions mean FIS can only touch instances tagged for your experiment environment. Don’t give FIS blanket ec2:* permissions. That’s how you end up in a different kind of postmortem.
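Wiring the role up from the CLI is two calls. A minimal sketch, assuming the trust and permissions documents above are saved as fis-trust-policy.json and fis-permissions.json (both filenames are my own):

# Create the FIS execution role with the trust policy
aws iam create-role \
  --role-name FISExperimentRole \
  --assume-role-policy-document file://fis-trust-policy.json

# Attach the scoped permissions as an inline policy
aws iam put-role-policy \
  --role-name FISExperimentRole \
  --policy-name FISExperimentPermissions \
  --policy-document file://fis-permissions.json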
Your First Experiment: Stopping EC2 Instances
Start simple. The best first experiment is stopping instances in a single AZ and watching what happens. Save this as az-instance-stop.json:
{
"description": "Stop all prod instances in us-east-1b to simulate AZ failure",
"tags": {
"Name": "AZ-Failure-Simulation"
},
"targets": {
"az-instances": {
"resourceType": "aws:ec2:instance",
"resourceTags": {
"Environment": "staging"
},
"filters": [
{
"path": "Placement.AvailabilityZone",
"values": ["us-east-1b"]
},
{
"path": "State.Name",
"values": ["running"]
}
],
"selectionMode": "ALL"
}
},
"actions": {
"StopInstances": {
"actionId": "aws:ec2:stop-instances",
"description": "Stop instances and restart after 5 minutes",
"parameters": {
"startInstancesAfterDuration": "PT5M"
},
"targets": {
"Instances": "az-instances"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole"
}
Create the template and run it:
# Create the experiment template
aws fis create-experiment-template \
--cli-input-json file://az-instance-stop.json
# Note the template ID from the output, then start the experiment
aws fis start-experiment \
--experiment-template-id EXT1a2b3c4d5e6f7
# Monitor the experiment status
aws fis get-experiment \
--id EXP9z8y7x6w5v4u3
A few things to notice in this template. The selectionMode is ALL, meaning every instance matching the filters gets stopped. You could use COUNT(3) to stop exactly three random instances, or PERCENT(50) to hit half of them. The startInstancesAfterDuration parameter means FIS restarts the instances after five minutes — you don’t have to clean up manually.
The stop condition points to a CloudWatch alarm. If that alarm trips during the experiment, FIS stops immediately and restarts your instances. Set this to something meaningful — your service’s error rate alarm, not just a CPU metric.
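If you don't have a meaningful alarm yet, create it before you create the template. Here's a sketch of the kind of alarm I point stop conditions at, assuming an ALB 5xx metric; the load balancer dimension and threshold are illustrative:

# Stop-condition alarm: fires when target 5xx errors spike
aws cloudwatch put-metric-alarm \
  --alarm-name HighErrorRate \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching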
RDS Failover Testing
Database failover is where I’ve seen the most assumptions go untested. Teams configure Multi-AZ RDS and assume failover is seamless. It isn’t. There’s a DNS propagation delay. Connection pools hold stale connections. Applications that cache the database endpoint instead of resolving it fresh will keep hitting the old primary.
Here’s a template for forcing an RDS cluster failover:
{
"description": "Force Aurora cluster failover to test application resilience",
"tags": {
"Name": "RDS-Failover-Test"
},
"targets": {
"myCluster": {
"resourceType": "aws:rds:cluster",
"resourceArns": [
"arn:aws:rds:us-east-1:111122223333:cluster:my-aurora-cluster"
],
"selectionMode": "ALL"
}
},
"actions": {
"FailoverCluster": {
"actionId": "aws:rds:failover-db-cluster",
"description": "Failover the Aurora cluster",
"targets": {
"Clusters": "myCluster"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:PaymentProcessingErrors"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole"
}
When I first ran this against our staging environment, the application threw connection errors for 45 seconds. The connection pool library we were using had a 30-second timeout on stale connections, and the DNS TTL on the RDS endpoint was another 15 seconds. We’d never have found that without actually triggering the failover.
The fix was straightforward — reduce the connection pool’s idle timeout, enable connection validation on checkout, and make sure the application used the cluster endpoint rather than the instance endpoint. But we only found the problem because we broke it deliberately.
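Two checks worth running before and after the failover, assuming a cluster named my-aurora-cluster; the endpoint hostname below is a placeholder:

# Confirm which instance is currently the writer
aws rds describe-db-clusters \
  --db-cluster-identifier my-aurora-cluster \
  --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier'

# Check the DNS TTL on the cluster endpoint; it's part of your failover window
dig +noall +answer my-aurora-cluster.cluster-abc123def456.us-east-1.rds.amazonaws.com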
Network Disruption Experiments
Network faults are where FIS really shines compared to just stopping instances. You can simulate partial failures — the kind that are hardest to handle because your instances are still running but can’t reach each other or downstream services.
The aws:network:disrupt-connectivity action lets you black-hole traffic for specific subnets:
{
"description": "Disrupt network connectivity for app-tier subnets in AZ-b",
"targets": {
"targetSubnets": {
"resourceType": "aws:ec2:subnet",
"resourceTags": {
"Tier": "application"
},
"filters": [
{
"path": "AvailabilityZone",
"values": ["us-east-1b"]
}
],
"selectionMode": "ALL"
}
},
"actions": {
"DisruptConnectivity": {
"actionId": "aws:network:disrupt-connectivity",
"description": "Block all network traffic for 5 minutes",
"parameters": {
"duration": "PT5M",
"scope": "all"
},
"targets": {
"Subnets": "targetSubnets"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:CriticalServiceHealth"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole"
}
The scope parameter controls what gets blocked. Use all to black-hole everything, availability-zone to block cross-AZ traffic only, or vpc to block traffic entering and leaving the VPC; there are also narrower scopes for cutting off access to S3, DynamoDB, or a specific prefix list. Each scope tests a different failure mode, and the three broad ones are all worth running.
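Only the parameters block changes between variants; the targets and stop conditions stay the same:

"parameters": { "duration": "PT5M", "scope": "all" }
"parameters": { "duration": "PT5M", "scope": "availability-zone" }
"parameters": { "duration": "PT5M", "scope": "vpc" }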
This is the experiment that would’ve caught our AZ failure. If we’d run a network disruption test against a single AZ, we’d have seen the load balancer continuing to route traffic to unreachable targets. We’d have caught the health check interval problem before a real outage taught us the hard way.
Container Fault Injection: ECS and EKS
If you’re running containers, FIS has task-level and pod-level actions that let you inject faults without touching the underlying infrastructure.
For ECS, you can stress CPU, inject network latency, kill processes, or black-hole specific ports on individual tasks. For EKS, you get pod-level equivalents plus the ability to inject custom Chaos Mesh resources directly.
Here’s a CPU stress test for ECS tasks:
{
"description": "CPU stress on order-service ECS tasks",
"targets": {
"orderTasks": {
"resourceType": "aws:ecs:task",
"resourceTags": {
"Service": "order-service"
},
"parameters": {
"cluster": "arn:aws:ecs:us-east-1:111122223333:cluster/production"
},
"selectionMode": "PERCENT(50)"
}
},
"actions": {
"StressCPU": {
"actionId": "aws:ecs:task-cpu-stress",
"description": "Stress CPU to 90% on half the tasks",
"parameters": {
"duration": "PT3M",
"percent": "90"
},
"targets": {
"Tasks": "orderTasks"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:OrderServiceP99Latency"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole"
}
Using PERCENT(50) for the selection mode is deliberate. You want to see how the service behaves when half its capacity is degraded, not when everything’s on fire. That’s a more realistic failure scenario — partial degradation is far more common than total loss, and it’s harder to detect and handle correctly.
For EKS, you can inject Kubernetes-native faults through Chaos Mesh integration. FIS creates the custom resource in your cluster and cleans it up when the experiment ends. This is useful if you’re already using Chaos Mesh but want the guardrails and audit trail that FIS provides.
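Here's a sketch of that action, assuming Chaos Mesh is already installed in the cluster and a target named myEksCluster is defined elsewhere in the template; the pod-kill spec targeting app=order-service is illustrative:

"InjectPodKill": {
  "actionId": "aws:eks:inject-kubernetes-custom-resource",
  "description": "Kill one order-service pod via Chaos Mesh",
  "parameters": {
    "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
    "kubernetesKind": "PodChaos",
    "kubernetesNamespace": "orders",
    "kubernetesSpec": "{\"action\":\"pod-kill\",\"mode\":\"one\",\"selector\":{\"labelSelectors\":{\"app\":\"order-service\"}}}",
    "maxDuration": "PT2M"
  },
  "targets": { "Cluster": "myEksCluster" }
}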
Multi-Action Experiments
Real outages don’t happen one fault at a time. The power of FIS is combining multiple actions into a single experiment to simulate realistic failure scenarios.
FIS has a built-in AZ Availability scenario that simulates a complete AZ power interruption. It stops EC2 instances, pauses instance launches, disrupts network connectivity, and fails over RDS and ElastiCache clusters — all at once. That’s the kind of compound failure that actually happens during an AZ event.
You can build your own multi-action experiments too. Use the startAfter parameter to sequence actions:
{
"description": "Cascading failure: network disruption then instance stops",
"targets": {
"appSubnets": {
"resourceType": "aws:ec2:subnet",
"resourceTags": { "Tier": "application" },
"filters": [
{ "path": "AvailabilityZone", "values": ["us-east-1b"] }
],
"selectionMode": "ALL"
},
"appInstances": {
"resourceType": "aws:ec2:instance",
"resourceTags": { "Environment": "staging" },
"filters": [
{ "path": "Placement.AvailabilityZone", "values": ["us-east-1b"] }
],
"selectionMode": "ALL"
}
},
"actions": {
"DisruptNetwork": {
"actionId": "aws:network:disrupt-connectivity",
"parameters": { "duration": "PT10M", "scope": "all" },
"targets": { "Subnets": "appSubnets" }
},
"StopInstances": {
"actionId": "aws:ec2:stop-instances",
"parameters": { "startInstancesAfterDuration": "PT8M" },
"targets": { "Instances": "appInstances" },
"startAfter": ["DisruptNetwork"]
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:CriticalServiceHealth"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole"
}
The startAfter on StopInstances means it won’t execute until DisruptNetwork completes. This simulates a cascading failure — first the network degrades, then instances go down. Without startAfter, both actions run simultaneously.
Integrating FIS Into Your CI/CD Pipeline
Running experiments manually is a start, but the real value comes from automating them. I run FIS experiments as a post-deployment step in our staging pipeline. Every deployment gets a basic resilience check before it’s promoted to production.
#!/bin/bash
set -euo pipefail
TEMPLATE_ID="EXT1a2b3c4d5e6f7"
# Start the experiment
EXPERIMENT_ID=$(aws fis start-experiment \
--experiment-template-id "$TEMPLATE_ID" \
--query 'experiment.id' \
--output text)
echo "Started experiment: $EXPERIMENT_ID"
# Poll until the experiment completes
while true; do
STATUS=$(aws fis get-experiment \
--id "$EXPERIMENT_ID" \
--query 'experiment.state.status' \
--output text)
case "$STATUS" in
completed)
echo "Experiment completed successfully"
exit 0
;;
stopped)
echo "Experiment stopped — stop condition triggered"
exit 1
;;
failed)
echo "Experiment failed"
exit 1
;;
*)
echo "Status: $STATUS — waiting..."
sleep 10
;;
esac
done
If the experiment triggers a stop condition, the deployment pipeline fails and the release doesn’t proceed. This catches resilience regressions — like someone changing a health check interval or removing a retry policy — before they reach production.
You can also list your experiment templates programmatically to run a suite of experiments:
# List all experiment templates tagged for CI
aws fis list-experiment-templates \
--query "experimentTemplates[?tags.Pipeline=='ci'].[id,description]" \
--output table
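From there it's a short step to running the whole suite. A sketch, assuming the polling script above is saved as run-experiment.sh and adapted to take the template ID as its first argument:

#!/bin/bash
set -euo pipefail
# Run every CI-tagged experiment template, one at a time
for TEMPLATE_ID in $(aws fis list-experiment-templates \
  --query "experimentTemplates[?tags.Pipeline=='ci'].id" \
  --output text); do
  echo "Running experiment template: $TEMPLATE_ID"
  ./run-experiment.sh "$TEMPLATE_ID"
done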
What to Measure During Experiments
Running the experiment is only half the job. You need to know what to watch. Here’s what I track:
- Time to detect — how long before your monitoring notices the fault. If your alarms don’t fire within 60 seconds of an AZ going dark, your SLOs and error budgets are based on fiction.
- Time to mitigate — how long before traffic stops hitting the failed resources. For a load balancer, this is roughly the health check interval multiplied by the unhealthy threshold count, plus any connection draining. Most people are surprised how long that actually takes.
- Time to recover — how long before the system is fully healthy after the fault is resolved. Connection pool recovery, cache warming, DNS propagation — it all adds up.
- Error rate during fault — what percentage of requests failed. Zero is the goal for an AZ failure if you’re running multi-AZ. If you’re seeing errors, your fault tolerance has gaps.
- Blast radius — did the fault affect only the targeted resources, or did it cascade? Cascading failures are the ones that turn a minor incident into a major outage.
Set up a CloudWatch dashboard specifically for chaos experiments. Include your key business metrics alongside infrastructure metrics. The infrastructure might look fine while your customers are seeing errors.
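A minimal version of that dashboard, assuming the same ALB 5xx metric as the stop-condition alarm plus a custom business metric the application publishes (the Payments namespace and CheckoutSuccess metric are my own naming):

# One dashboard, infrastructure and business metrics side by side
aws cloudwatch put-dashboard \
  --dashboard-name chaos-experiments \
  --dashboard-body '{
    "widgets": [
      {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
       "properties": {"title": "Target 5xx during experiment",
         "metrics": [["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
                      "LoadBalancer", "app/my-alb/1234567890abcdef"]],
         "stat": "Sum", "period": 60, "region": "us-east-1"}},
      {"type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
       "properties": {"title": "Checkout success during experiment",
         "metrics": [["Payments", "CheckoutSuccess"]],
         "stat": "Sum", "period": 60, "region": "us-east-1"}}
    ]
  }'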
Guardrails and Safety
I want to be direct about this: chaos engineering done carelessly is just breaking things. The “engineering” part is the controls.
Always set stop conditions. Every experiment template should have at least one CloudWatch alarm as a stop condition. If your error rate exceeds your threshold, the experiment stops automatically. No exceptions.
Start in staging. I know the chaos engineering purists say you should test in production. They’re right, eventually. But start in staging, find the obvious problems, fix them, and then graduate to production. Running your first-ever chaos experiment against production is not brave, it’s reckless.
Use tag-based targeting. Never target resources by ARN in a template you plan to reuse. Tags let you control the blast radius through resource tagging rather than template editing. Tag your staging resources with ChaosReady=true and target that tag.
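Opting resources in is a one-liner per resource type. For EC2, with a placeholder instance ID:

# Opt an instance into chaos experiments
aws ec2 create-tags \
  --resources i-0abc123def456789a \
  --tags Key=ChaosReady,Value=true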
Communicate before you run. Even in staging, tell your team. Nothing erodes trust in chaos engineering faster than someone spending an hour debugging a “mystery outage” that was actually your experiment.
Keep experiments short. Five minutes is plenty for most experiments. You’re testing detection and failover, not endurance. If your system can’t recover from a five-minute AZ outage, a thirty-minute experiment won’t tell you anything new.
Building a Chaos Engineering Practice
FIS is a tool. Chaos engineering is a practice. The tool is useless without the practice around it.
Start with a hypothesis. “Our service will continue serving requests with less than 1% error rate when one AZ loses network connectivity.” That’s testable. “Let’s see what happens when we break stuff” is not chaos engineering — it’s just chaos.
Build a library of experiments. I maintain a Git repository of experiment templates organized by failure mode: AZ failure, database failover, network partition, resource exhaustion, dependency failure. Each template has a README explaining the hypothesis, expected behavior, and what to check.
Run experiments regularly. Monthly at minimum. Quarterly is too infrequent — your infrastructure changes faster than that. I run basic AZ failure experiments after every significant infrastructure change and a full suite monthly.
Track results over time. Your time-to-detect and time-to-recover should improve as you fix issues. If they’re getting worse, something’s regressing. This feeds directly into your SRE fundamentals — resilience isn’t a checkbox, it’s a metric.
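One lightweight way to make those numbers visible over time is publishing them as custom CloudWatch metrics after each run; the ChaosExperiments namespace and Experiment dimension are my own convention:

# Record this run's detection time so the trend is chartable
aws cloudwatch put-metric-data \
  --namespace ChaosExperiments \
  --metric-name TimeToDetectSeconds \
  --dimensions Experiment=az-failure \
  --value 42 \
  --unit Seconds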
Connect findings to automated remediation. Every chaos experiment that reveals a manual recovery step is an opportunity to automate. If you’re SSHing into boxes during an AZ failure, you’ve got work to do.
Where I’ve Landed
That AZ failure I mentioned at the start cost us six minutes of degraded service and a lot of credibility. The chaos engineering practice we built afterward has caught dozens of similar issues before they hit production. Connection pool misconfigurations, health check intervals that were too long, retry policies that amplified failures instead of absorbing them, auto scaling policies that couldn’t react fast enough.
Every one of those was a future incident we prevented.
FIS isn’t the only way to do chaos engineering on AWS, but it’s the most integrated. The guardrails, the IAM integration, the CloudWatch stop conditions, the experiment audit trail — it’s built for teams that need to break things safely. And if you’re running anything that matters on AWS, you need to be breaking things safely.
You don’t know your system is resilient until you’ve proven it. Go prove it.