Infrastructure Drift Detection and Remediation
If you’re not running scheduled terraform plan, you have drift. You just don’t know it yet.
I learned this the hard way. A colleague made a “quick fix” in the AWS console — changed a security group rule to unblock a vendor integration. Totally reasonable in the moment. Nobody updated the Terraform code. Three weeks later, I ran a deploy that included security group changes for a different service. Terraform saw the console change as drift, reverted it, and killed the vendor connection. That vendor connection happened to feed data into our payment processing pipeline. Two hours of downtime, a war room, and a very uncomfortable post-mortem later, we had a new rule: nothing touches production infrastructure outside of code. Ever.
That incident changed how I think about drift. It’s not a theoretical problem. It’s a ticking bomb sitting in the gap between what your code says and what actually exists.
What Drift Actually Is
Drift is any difference between your declared infrastructure state and the real state of your resources. It happens constantly, and it happens for boring reasons:
- Someone clicks something in the console during an incident
- An AWS service applies a default that wasn’t in your config
- Auto-scaling modifies resource attributes
- A different team’s automation touches shared resources
- You ran `terraform apply` locally and forgot to push the state
The problem isn’t that drift happens. The problem is that most teams don’t know it’s happening until something breaks. Your Terraform state might say one thing while reality says another, and you won’t find out until the worst possible moment.
Scheduled Terraform Plan: The Foundation
The single most impactful thing you can do is run terraform plan on a schedule against every environment. Not just when someone opens a PR — continuously.
Here’s a GitHub Actions workflow that runs plan every 6 hours and alerts on drift:
```yaml
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [production, staging]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.x
          # disable the output wrapper so piping and exit codes behave normally
          terraform_wrapper: false
      - name: Terraform Init
        run: terraform init -backend-config="env/${{ matrix.environment }}.hcl"
        working-directory: infrastructure
      - name: Detect Drift
        id: plan
        run: |
          terraform plan -detailed-exitcode -out=drift.plan 2>&1 | tee plan_output.txt
          # $? here would be tee's exit status; PIPESTATUS[0] is terraform's
          echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        working-directory: infrastructure
        continue-on-error: true
      - name: Notify on Drift
        if: steps.plan.outputs.exitcode == '2'
        working-directory: infrastructure
        run: |
          DRIFT_SUMMARY=$(grep -E "Plan:|changed|destroyed" plan_output.txt | head -5)
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\":\"⚠️ Drift detected in ${{ matrix.environment }}:\n${DRIFT_SUMMARY}\"}"
```
Exit code 2 from terraform plan -detailed-exitcode means changes detected. That’s your drift signal. This integrates directly into your CI/CD pipeline and gives you visibility you didn’t have before.
I run this every 6 hours. Some teams do it hourly. The frequency depends on how much console access your org still allows and how paranoid you are. I’m very paranoid.
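Whatever the schedule, it helps to make the exit-code handling explicit rather than scattering `== '2'` checks around. A minimal sketch of a wrapper function (the function name and labels are mine, not Terraform's):

```shell
#!/bin/bash
# Map terraform plan -detailed-exitcode results to drift states:
# 0 = no changes, 1 = plan failed, 2 = changes (drift) detected.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Usage sketch — run plan without letting a nonzero exit kill the script:
# terraform plan -detailed-exitcode >/dev/null 2>&1 || code=$?
# state=$(classify_plan_exit "${code:-0}")
```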
AWS Config Rules for Real-Time Detection
Scheduled plans catch drift, but they’re not instant. AWS Config gives you near-real-time detection for specific resource types. It watches for configuration changes as they happen.
Here’s a Terraform config for Config rules that catch common drift scenarios:
```hcl
# These managed rules assume an AWS Config configuration recorder
# is already enabled in the account.
resource "aws_config_config_rule" "sg_open_access" {
  name = "restricted-ssh"

  source {
    owner             = "AWS"
    source_identifier = "INCOMING_SSH_DISABLED"
  }
}

resource "aws_config_config_rule" "s3_public_access" {
  name = "s3-bucket-public-read-prohibited"

  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }
}

resource "aws_config_config_rule" "encrypted_volumes" {
  name = "encrypted-volumes"

  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }
}
```
These managed rules are a starting point. The real power comes from custom rules when you need to enforce organization-specific policies. But honestly, the managed rules alone would’ve caught the security group change that caused my outage.
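For the custom-rule case, a Config rule is backed by a Lambda that reports each resource as COMPLIANT or NON_COMPLIANT. Here is a minimal sketch of the evaluation side — the required-tags policy and all function names are my illustration, not an AWS-provided rule; a real rule would also report results back via `put_evaluations`:

```python
import json

# Org-specific policy (assumption): every resource must carry these tags
REQUIRED_TAGS = {"Owner", "ManagedBy"}

def evaluate_compliance(configuration_item):
    """Return (compliance_type, annotation) for one configuration item."""
    tags = configuration_item.get("tags") or {}
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        return "NON_COMPLIANT", f"missing tags: {sorted(missing)}"
    return "COMPLIANT", "all required tags present"

def handler(event, context):
    """Entry point for a custom AWS Config rule Lambda (sketch)."""
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]
    compliance, annotation = evaluate_compliance(item)
    return {
        "ComplianceType": compliance,
        "Annotation": annotation,
        "ComplianceResourceId": item.get("resourceId"),
    }
```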
Pair Config with SNS notifications and you’ve got alerts within minutes of someone touching something they shouldn’t:
```hcl
resource "aws_config_delivery_channel" "main" {
  name           = "config-delivery"
  s3_bucket_name = aws_s3_bucket.config.id
  sns_topic_arn  = aws_sns_topic.config_alerts.arn
}
```
Building a Drift Detection Pipeline
Individual tools are fine. A pipeline that ties them together is better. Here’s how I structure drift detection as a proper system rather than a collection of scripts.
The pipeline has three stages: detect, classify, remediate.
```hcl
# Lambda function that processes drift detection results
resource "aws_lambda_function" "drift_processor" {
  filename      = "drift_processor.zip"
  function_name = "drift-processor"
  role          = aws_iam_role.drift_processor.arn
  handler       = "index.handler"
  runtime       = "python3.12"
  timeout       = 300

  environment {
    variables = {
      SLACK_WEBHOOK_URL = var.slack_webhook_url
      PAGERDUTY_API_KEY = var.pagerduty_api_key
      DRIFT_TABLE       = aws_dynamodb_table.drift_log.name
    }
  }
}

resource "aws_dynamodb_table" "drift_log" {
  name         = "drift-detection-log"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "drift_id"
  range_key    = "detected_at"

  attribute {
    name = "drift_id"
    type = "S"
  }

  attribute {
    name = "detected_at"
    type = "S"
  }
}
```
Every drift event gets logged to DynamoDB. This matters more than you’d think — when you’re in a post-mortem asking “how long was this drifted?”, you want that data.
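The processor Lambda itself can stay small. A sketch of what `index.handler` might look like — the item shape matches the table's two keys above, but the event shape and extra fields are my assumptions:

```python
import uuid
from datetime import datetime, timezone

def build_drift_item(environment, resource, severity):
    """Build one DynamoDB item matching the drift-detection-log schema."""
    return {
        "drift_id":    {"S": str(uuid.uuid4())},
        "detected_at": {"S": datetime.now(timezone.utc).isoformat()},
        "environment": {"S": environment},
        "resource":    {"S": resource},
        "severity":    {"S": severity},
    }

def handler(event, context):
    """Log each drift event from the detection pipeline to DynamoDB."""
    import os
    import boto3  # deferred so the pure logic above is testable without AWS
    dynamodb = boto3.client("dynamodb")
    drifts = event.get("drifts", [])
    for drift in drifts:
        item = build_drift_item(drift["environment"], drift["resource"], drift["severity"])
        dynamodb.put_item(TableName=os.environ["DRIFT_TABLE"], Item=item)
    return {"logged": len(drifts)}
```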
Classifying Drift: Not All Drift Is Equal
This is where most teams get it wrong. They detect drift and treat every instance the same way. That leads to alert fatigue fast.
I classify drift into three categories:
Critical — Security-related changes. Open security groups, public S3 buckets, IAM policy modifications. These get PagerDuty alerts and need immediate remediation.
Warning — Functional changes that could cause issues. Modified instance types, changed database parameters, altered load balancer configs. These get Slack notifications and a 24-hour remediation window.
Informational — Cosmetic or expected drift. Tag changes, description updates, AWS-managed attribute modifications. These get logged but don’t alert anyone.
Here’s a script that parses terraform plan output and classifies:
```bash
#!/bin/bash
set -euo pipefail

PLAN_OUTPUT="$1"

CRITICAL_PATTERNS="(aws_security_group|aws_iam|aws_s3_bucket_public|aws_kms)"
WARNING_PATTERNS="(aws_instance|aws_db_instance|aws_lb|aws_ecs)"

critical_count=$(grep -cE "# ${CRITICAL_PATTERNS}" "$PLAN_OUTPUT" || true)
warning_count=$(grep -cE "# ${WARNING_PATTERNS}" "$PLAN_OUTPUT" || true)

if [ "$critical_count" -gt 0 ]; then
  echo "CRITICAL: ${critical_count} security-related drift(s) detected"
  exit 2
elif [ "$warning_count" -gt 0 ]; then
  echo "WARNING: ${warning_count} functional drift(s) detected"
  exit 1
else
  echo "INFO: Only informational drift detected"
  exit 0
fi
```
This classification feeds into your alerting. No more waking someone up at 3am because a tag changed.
Remediation Strategies
Detection without remediation is just anxiety. You need a plan for what happens when drift shows up.
Auto-remediation for known-safe changes: Some drift can be fixed automatically. If someone adds a tag that’s not in your code, you can safely run terraform apply to revert it. I auto-remediate informational drift nightly.
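A sketch of that nightly gate, driven by the classifier's exit code. The `TF_CMD` indirection is my addition so the flow can be dry-run; everything else follows the three-tier scheme:

```shell
#!/bin/bash
# Nightly auto-remediation gate (sketch).
# Classifier exit codes: 0 = informational only, 1 = warning, 2 = critical.
TF_CMD="${TF_CMD:-terraform}"   # override with TF_CMD=echo for a dry run

remediate() {
  case "$1" in
    0) $TF_CMD apply -auto-approve >/dev/null
       echo "auto-remediated informational drift" ;;
    1) echo "warning drift: leaving for the remediation PR flow" ;;
    *) echo "critical drift: paging on-call, no automatic action" ;;
  esac
}

# Usage sketch:
# ./classify_drift.sh plan_output.txt || severity=$?
# remediate "${severity:-0}"
```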
Guided remediation for warnings: For functional drift, I generate a remediation PR automatically. The pipeline runs terraform plan, captures the output, and opens a PR with the plan attached. A human reviews and merges.
````yaml
# Addition to the drift detection workflow.
# Note: ${{ }} expressions are the only substitution Actions performs in
# `body` — shell syntax like $(cat ...) would be passed through literally,
# so the plan output is read into a step output first.
- name: Read Plan Output
  id: plan_text
  if: steps.plan.outputs.exitcode == '2'
  working-directory: infrastructure
  run: |
    {
      echo "plan<<PLAN_EOF"
      cat plan_output.txt
      echo "PLAN_EOF"
    } >> "$GITHUB_OUTPUT"
- name: Create Remediation PR
  if: steps.plan.outputs.exitcode == '2'
  uses: peter-evans/create-pull-request@v6
  with:
    title: "🔧 Drift remediation: ${{ matrix.environment }}"
    body: |
      Automated drift detection found changes in `${{ matrix.environment }}`.
      **Plan output:**
      ```
      ${{ steps.plan_text.outputs.plan }}
      ```
      Review the plan and merge to remediate.
    branch: "drift-remediation/${{ matrix.environment }}-${{ github.run_id }}"
    labels: drift-remediation,automated
````
Manual remediation for critical drift: Security drift gets a PagerDuty alert. Someone investigates immediately. Sometimes the drift is intentional — an incident response that bypassed the normal process. In that case, you update the code to match reality rather than reverting. The key is that someone makes a conscious decision.
Here’s the thing that took me too long to learn: sometimes the right remediation is updating your Terraform code, not reverting the infrastructure. If someone changed an instance type because the old one was causing OOM kills, reverting that change is going to cause the same problem. Import the change, update the code, move on.
```bash
# When drift is intentional, make the code match reality instead of reverting.
# First, check whether accepting the change would force a replacement:
terraform plan -target=aws_instance.api_server | grep "forces replacement"
# If not, edit the config to match the live resource, then apply to reconcile:
terraform apply -target=aws_instance.api_server -auto-approve
# Verify that state and reality now agree
terraform plan -detailed-exitcode
```
Preventing Drift at the Source
Detection and remediation are reactive. Prevention is better.
Lock down console access. I’m not saying remove it entirely — you need it for incidents. But day-to-day changes should go through code. Use SCPs to restrict write access in production accounts:
```hcl
resource "aws_organizations_policy" "restrict_console_changes" {
  name = "restrict-production-console"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyConsoleChanges"
        Effect = "Deny"
        Action = [
          "ec2:AuthorizeSecurityGroupIngress",
          "ec2:RevokeSecurityGroupIngress",
          "s3:PutBucketPolicy",
          "iam:AttachRolePolicy",
          "iam:PutRolePolicy"
        ]
        Resource = "*"
        Condition = {
          StringNotLike = {
            "aws:PrincipalArn" = "arn:aws:iam::*:role/terraform-*"
          }
        }
      }
    ]
  })
}

# The policy only takes effect once attached to the production OU or
# account via aws_organizations_policy_attachment.
```
This SCP allows Terraform’s role to make changes but blocks everyone else from modifying security-sensitive resources directly. It’s opinionated and some people will push back. Hold the line. That security group change that caused my two-hour outage? An SCP like this would’ve prevented it entirely.
Use Terraform modules with strict variable validation. When teams use shared modules, there’s less temptation to go around the process. Make the right thing the easy thing.
Implement policy-as-code checks in CI. Catch drift-prone patterns before they deploy. If critical infrastructure is missing `lifecycle { prevent_destroy = true }`, fail the PR.
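The guard itself is a few lines on any resource you consider critical (the resource here is illustrative):

```hcl
resource "aws_db_instance" "primary" {
  # ... instance configuration ...

  lifecycle {
    # Terraform refuses to plan a destroy of this resource, so a drifted
    # or mistaken plan fails loudly instead of deleting data.
    prevent_destroy = true
  }
}
```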
State File Integrity
Drift isn’t always about infrastructure changing — sometimes it’s your state file that’s wrong. Corrupted state, partial applies, state that got out of sync because two people ran apply at the same time.
State locking is non-negotiable. If you’re not using it, stop reading this and go set it up:
```hcl
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
Beyond locking, I version state files and run periodic state validation:
```bash
#!/bin/bash
set -euo pipefail

# terraform validate checks the configuration (syntax, references),
# not the state — it's a cheap sanity gate before touching state
terraform validate

# -refresh-only compares state to reality without proposing changes.
# Capture the exit code explicitly: under set -e, a bare exit-2 plan
# would kill the script before we could inspect it.
EXIT_CODE=0
terraform plan -refresh-only -detailed-exitcode || EXIT_CODE=$?

if [ "$EXIT_CODE" -eq 2 ]; then
  echo "State refresh detected changes — resources may have been modified or deleted outside Terraform"
  terraform show -json | jq '.values.root_module.resources | length'
fi
```
terraform plan -refresh-only is underrated. It tells you what changed in the real world without proposing any modifications. Run it before every apply in CI to catch surprises early.
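As for versioning the state files themselves: with the S3 backend that is one resource on the state bucket, so any state file can be rolled back to a prior revision (the bucket resource name is an assumption):

```hcl
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```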
Monitoring Drift Over Time
Once you’ve got detection running, you’ll want to track trends. Are certain environments drifting more? Are specific teams causing more drift? Is drift decreasing over time as your processes improve?
I push drift metrics to CloudWatch:
```hcl
resource "aws_cloudwatch_metric_alarm" "drift_frequency" {
  alarm_name          = "high-drift-frequency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "DriftEventsCount"
  namespace           = "InfrastructureDrift"
  period              = 86400
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "More than 10 drift events in 24 hours"
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
}
```
Ten drift events in a day means something systemic is wrong. Either your process has a gap or someone’s actively working around it. Both need attention.
Track drift-to-remediation time too. If you’re detecting drift in minutes but taking days to fix it, your detection is just generating noise. The goal is detection under an hour and remediation under four hours for critical drift.
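One detail worth flagging: `DriftEventsCount` is a custom metric, so nothing emits it unless you do. One way is an extra step in the detection workflow — a sketch that assumes AWS credentials are already configured in the job:

```yaml
- name: Publish Drift Metric
  if: steps.plan.outputs.exitcode == '2'
  run: |
    # No dimensions, so the datapoint matches the dimensionless alarm
    aws cloudwatch put-metric-data \
      --namespace InfrastructureDrift \
      --metric-name DriftEventsCount \
      --value 1
```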
The Cultural Problem
I’ve saved this for last because it’s the hardest part and no amount of tooling fixes it completely.
Drift is a people problem wrapped in a technical problem. Someone made a console change because the Terraform workflow was too slow, or too complicated, or they didn’t have access to the repo. Every drift event is feedback about your developer experience.
When I find drift, I don’t just fix it — I ask why it happened. If the answer is “the deploy pipeline takes 45 minutes,” that’s a pipeline problem, not a discipline problem. If the answer is “I didn’t know how to add this to Terraform,” that’s a documentation and training problem.
Following IaC best practices isn’t just about writing good code. It’s about making the code-based workflow so fast and easy that nobody wants to use the console. Fast plan times. Clear module documentation. Self-service for common changes. Make the right path the path of least resistance.
The teams I’ve seen succeed at eliminating drift aren’t the ones with the strictest policies. They’re the ones where making a change through code is genuinely easier than clicking through the console. That’s the bar.
Where to Start
If you’re reading this and you don’t have any drift detection today, here’s the priority order:
- Set up scheduled `terraform plan` in CI. Today. This catches 80% of drift.
- Enable AWS Config with managed rules for security-critical resources.
- Implement drift classification so you’re not drowning in noise.
- Add auto-remediation for low-risk drift.
- Lock down console access with SCPs.
- Build the cultural feedback loop.
You don’t need all of this on day one. But you need step one on day one. Because right now, somewhere in your infrastructure, something doesn’t match your code. And you won’t find out until it hurts.