I’ve been managing Terraform state across production environments for years now, and if there’s one thing I’m certain of, it’s this: state management is where most Terraform setups fall apart. Not modules. Not provider quirks. State.

The state file is Terraform’s memory. It’s how Terraform knows what it built, what changed, and what to tear down. Lose it, corrupt it, or let two people write to it at the same time, and you’re in for a rough day. I once lost a state file for a networking stack and spent the better part of 6 hours reimporting over 200 resources by hand. VPCs, subnets, route tables, NAT gateways — one at a time. Never again.

So here’s how I handle state in 2026, after plenty of scars. Everything below is what I run in production today — not theory, not what the docs suggest, but what actually works when you’re managing infrastructure across multiple accounts and teams.


Local State Is a Non-Starter

I don’t care if it’s a side project. Don’t use local state. The moment you close your laptop, switch machines, or someone else touches the project, you’re in trouble. Local state doesn’t lock. It doesn’t version. It sits on your filesystem like a ticking bomb.

If you’re working alone on a throwaway experiment, fine. But the second anything matters — remote backend. No exceptions.

I wrote a full walkthrough on How to Store Terraform State in AWS S3 if you want the step-by-step.


The S3 + DynamoDB Backend

This is the bread and butter. S3 stores the state file, DynamoDB handles locking. It’s battle-tested, cheap, and straightforward.

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-lock"
    kms_key_id     = "alias/terraform-state"
  }
}

A few things I’ve learned the hard way about this setup:

Use a hierarchical key path. I go with {env}/{component}/terraform.tfstate. Makes it dead simple to find state files later and to write targeted IAM policies. Don’t just dump everything at the root of the bucket.
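
A sketch of what those targeted policies buy you (bucket name and exact actions are illustrative, not a complete policy): a role that can only touch prod state under the prod/ prefix.

```hcl
# Hypothetical scoped policy: this role can only read/write state under prod/.
data "aws_iam_policy_document" "prod_state_only" {
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::acme-terraform-state/prod/*"]
  }

  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-terraform-state"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["prod/*"]
    }
  }
}
```

With everything dumped at the bucket root, there's no prefix to scope a condition like this to.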

Always encrypt. State files contain secrets in plaintext. Database passwords, API keys, all of it. Set encrypt = true and point it at a dedicated KMS key. This isn’t optional.

Enable bucket versioning. This has saved me more than once. Someone runs a bad apply, state gets mangled — you just roll back to the previous version. Takes 30 seconds instead of 3 hours. I had a colleague accidentally run terraform apply with a completely wrong variable file against a production state. Wiped out half the outputs. With versioning, we grabbed the previous state version from S3, restored it, and were back in business within minutes. Without versioning? That would’ve been a full reimport job.

aws s3api put-bucket-versioning \
  --bucket acme-terraform-state \
  --versioning-configuration Status=Enabled

I also strongly recommend using a backend config file instead of hardcoding values. Keeps things DRY across environments. I covered that in How to Use a Backend Config File for Terraform S3 State Configuration.
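
As a minimal sketch of that pattern (filenames and values illustrative): the backend block in code stays empty, and each environment supplies its own small config file at init time.

```hcl
# prod/backend.hcl — passed in via: terraform init -backend-config=backend.hcl
bucket         = "acme-terraform-state"
key            = "prod/networking/terraform.tfstate"
region         = "us-east-1"
encrypt        = true
dynamodb_table = "terraform-lock"
```

The code itself then only declares `backend "s3" {}` and stays identical across environments.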


State Locking — Don’t Skip This

Two engineers run terraform apply at the same time. Both read the same state. Both try to write back different versions. Congratulations, your state is now garbage.

State locking prevents this. DynamoDB has long been the standard lock backend for S3 (newer Terraform releases can also lock natively in S3 via the backend's use_lockfile option, but the DynamoDB setup below still works fine). You need one table — that’s it.

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

I have a dedicated guide on how to create a DynamoDB lock table for Terraform that walks through the full setup including IAM permissions.

Now, locks get stuck sometimes. A CI pipeline crashes mid-apply, your terminal dies, whatever. You’ll see this:

Error: Error acquiring the state lock

Don’t panic. First, make absolutely sure nobody else is actually running an apply. Then:

terraform force-unlock LOCK_ID_HERE

I’ve written about this exact scenario in Solved: Error Acquiring the State Lock in Terraform. The key thing: don’t just force-unlock blindly. Verify first. I’ve seen people force-unlock while a colleague’s apply was still running. That’s how you get a corrupted state file and a very awkward Slack conversation.
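
One way to do that verification (table name assumed from the setup above): read the lock row out of DynamoDB before touching anything. The Info attribute tells you which operation holds the lock, who started it, and when.

```shell
# Inspect the current lock before force-unlocking. If the Info attribute
# shows a colleague's apply from two minutes ago, walk away from the keyboard.
aws dynamodb scan \
  --table-name terraform-lock \
  --projection-expression "LockID, Info"
```

If the timestamp is hours old and the listed operation is a pipeline run you know crashed, then force-unlock is safe.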


Split Your State. Seriously.

One giant state file for everything? Terrible idea. I’ve seen plans that take 8 minutes because Terraform has to refresh 400 resources just to change a single security group rule. Beyond speed, there’s the blast radius problem. A bad apply against a monolithic state can take down networking, compute, and databases all at once.

Here’s how I split things:

By environment. This is non-negotiable. Dev, staging, prod — separate state files. Always. I don’t want a fat-fingered terraform destroy in a dev terminal anywhere near my production state.

By component. Networking gets its own state. Compute gets its own. Databases, monitoring, IAM — all separate. A change to a CloudWatch alarm shouldn’t force Terraform to evaluate my RDS clusters.

By team ownership. If the platform team owns networking and the app team owns compute, give them separate state files. Independent deploy cadences, independent IAM boundaries.

My typical S3 layout looks like this:

acme-terraform-state/
├── prod/
│   ├── networking/terraform.tfstate
│   ├── compute/terraform.tfstate
│   ├── database/terraform.tfstate
│   └── monitoring/terraform.tfstate
├── staging/
│   ├── networking/terraform.tfstate
│   └── compute/terraform.tfstate
└── dev/
    └── ...

When one state needs to reference another — say, compute needs the VPC ID from networking — use terraform_remote_state:

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_id
}
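
For that to work, the networking configuration has to export the value explicitly — only declared outputs cross the state boundary. Something like this on the networking side (the resource name is an assumption matching the references elsewhere in this post):

```hcl
# In the networking state: anything another state needs must be an output.
output "private_subnet_id" {
  value = aws_subnet.private[0].id
}
```

Forgetting the output is the most common failure mode here: the consuming plan fails with an "unsupported attribute" error on `.outputs`.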

For larger setups, I also split modules into their own Git repos. Keeps versioning clean and lets teams iterate independently. I wrote about that pattern in Splitting Terraform Modules into Separate Git Repositories.


Workspaces — Useful, But Overrated for Env Separation

I know workspaces get recommended a lot for managing environments. I’ve used them. They’re fine for ephemeral stuff — feature branch environments, short-lived test stacks, that kind of thing.

terraform workspace new feature-xyz
terraform workspace select feature-xyz
terraform apply -var-file=feature.tfvars

But for long-lived environments like prod vs. staging? Workspaces are overrated for this. Here’s why:

  • You can’t have different backend configs per workspace. They all share the same backend block.
  • IAM isolation is awkward. You can’t easily say “this role can only touch staging state” when it’s all in the same bucket prefix.
  • The infrastructure shape is often different between environments. Prod has multi-AZ RDS, staging has single-AZ. Prod has a WAF, dev doesn’t. Workspaces assume identical configs with different variables, and that falls apart fast.

What I do instead: directory-per-environment with shared modules.

infra/
├── modules/
│   ├── networking/
│   └── compute/
├── prod/
│   ├── main.tf
│   ├── backend.hcl
│   └── terraform.tfvars
└── staging/
    ├── main.tf
    ├── backend.hcl
    └── terraform.tfvars

Each environment calls the same modules but has its own backend config and variables. Explicit. Clear. No magic.
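
A sketch of what one of those environment roots looks like (module inputs are illustrative): the per-environment differences live in plain sight as module arguments.

```hcl
# prod/main.tf — same module as staging, different inputs.
module "networking" {
  source = "../modules/networking"

  cidr_block = "10.0.0.0/16"
  az_count   = 3 # staging passes 1 here
}
```

Diffing prod/main.tf against staging/main.tf shows you exactly how the environments differ. Workspaces hide that in variable files.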

That said, workspaces absolutely have their place. I covered the details in Managing Multiple Environments with Terraform Workspaces — including when they genuinely make sense.


Disaster Recovery for State

Okay, so you’ve got S3 versioning enabled. Good start. But that’s not a disaster recovery plan. That’s a single-bucket safety net.

Here’s what I actually do:

Cross-account backups. I run a nightly process that copies state files to a completely separate AWS account. If the primary account gets compromised or someone accidentally deletes the bucket, the backup account is untouched.

#!/bin/bash
SOURCE_BUCKET="acme-terraform-state"
BACKUP_BUCKET="acme-terraform-state-dr"
BACKUP_PROFILE="backup-account"

aws s3 sync "s3://${SOURCE_BUCKET}" "s3://${BACKUP_BUCKET}" \
  --profile "${BACKUP_PROFILE}" \
  --sse aws:kms

Simple. Runs in a cron job or a scheduled Lambda. I don’t overthink this — s3 sync does the heavy lifting.

Test your recovery. I do this quarterly. Pull a state file from backup, point a fresh Terraform init at it, run terraform plan. If the plan comes back clean (no changes), your backup is good. If it shows drift, something’s off. Most people skip this step. Don’t be most people. An untested backup is just a file you hope works. I’ve seen teams discover their “backups” were empty objects because the sync job’s IAM role didn’t have KMS decrypt permissions. They found out during an actual incident. Not ideal.
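
The drill itself, roughly (bucket names match the backup script above; run this from a scratch checkout, never your production working directory):

```shell
# Point a fresh init at the DR copy and check it against reality.
terraform init -reconfigure \
  -backend-config="bucket=acme-terraform-state-dr" \
  -backend-config="key=prod/networking/terraform.tfstate" \
  -backend-config="region=us-east-1"

# -detailed-exitcode: 0 = clean (backup is good), 2 = drift, 1 = error.
terraform plan -detailed-exitcode -lock=false
```

The `-detailed-exitcode` flag makes this scriptable — a scheduled job can alert on anything other than exit code 0.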

The nuclear option: reimporting everything. If state is truly gone and backups failed, you’re stuck importing resources one by one:

terraform import aws_vpc.main vpc-0a1b2c3d4e5f
terraform import 'aws_subnet.private[0]' subnet-0a1b2c3d
terraform import aws_security_group.app sg-0a1b2c3d
terraform import aws_db_instance.main mydb-instance

This is miserable. I’ve done it. For a medium-sized stack it took me most of a day. For a large one, multiple days. Prevention is infinitely better than cure here.


Secrets in State — The Elephant in the Room

This catches people off guard. Terraform stores resource attributes in state, and that includes sensitive values. Your RDS master password? It’s in the state file. In plaintext. That API key you passed as a variable? Also there.

KMS encryption on the S3 bucket helps: it encrypts at rest. But anyone with s3:GetObject on the bucket (plus kms:Decrypt on the state key) can still read the state in the clear. So:

  • Lock down IAM on the state bucket. Only CI/CD roles and a small group of senior engineers should have read access.
  • Enable CloudTrail data events on the bucket so you know who accessed what and when.
  • Where possible, don’t manage secrets through Terraform at all. Use Secrets Manager or SSM Parameter Store and pass only the secret’s ARN or name through Terraform, letting the application fetch the value at runtime. Be careful with data sources here: reading a secret’s value through one (like aws_secretsmanager_secret_version) still copies it into state. Let Terraform know where the secret is, not what it is.
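
One concrete pattern that keeps a credential out of state entirely (resource values are illustrative): for RDS, let AWS generate and own the master password in Secrets Manager. Terraform then records the secret's ARN, never the password itself.

```hcl
# RDS generates the master password and stores it in Secrets Manager.
# State holds the secret's ARN — not the password.
resource "aws_db_instance" "main" {
  identifier                  = "mydb-instance"
  engine                      = "postgres"
  instance_class              = "db.t3.medium"
  allocated_storage           = 50
  username                    = "app"
  manage_master_user_password = true
}
```

The application reads the password from Secrets Manager at runtime, so rotating it never touches Terraform at all.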

CI/CD and State

Running Terraform in a pipeline introduces its own set of problems. Here’s the pattern I’ve settled on:

# In CI — init with backend config file
terraform init -backend-config=backend.hcl

# Plan and save the plan file
terraform plan -out=tfplan

# Apply only the saved plan (on merge to main)
terraform apply tfplan

The -out=tfplan bit is critical. Without it, there’s a gap between what the plan showed and what gets applied. Someone could merge another PR between your plan and apply steps, and now you’re applying a stale plan against changed state. The saved plan file eliminates that.

Other things I enforce in CI:

  • TF_IN_AUTOMATION=true — suppresses interactive prompts and adjusts output formatting.
  • OIDC for AWS auth. No long-lived access keys sitting in GitHub secrets. Ever.
  • Plan runs on every PR. Apply only triggers on merge to main.
  • If a plan fails or shows unexpected changes, the pipeline stops. No auto-applying surprises.
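
The OIDC piece, sketched in Terraform (the repo name is a placeholder; GitHub's provider URL and audience are real, and thumbprint values rotate, so check GitHub's current docs before copying): an identity provider plus a role that only GitHub Actions runs on this repo's main branch can assume.

```hcl
# GitHub Actions OIDC provider — no long-lived keys anywhere.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # verify current value
}

# Trust policy: only workflows on acme/infra's main branch may assume the CI role.
data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:acme/infra:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci"
  assume_role_policy = data.aws_iam_policy_document.ci_trust.json
}
```

The branch condition is what enforces "apply only from main" — a PR branch can't assume the role even if someone edits the workflow file.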

When things go wrong in the pipeline and you need to figure out what Terraform is actually doing, crank up the logging. I wrote about that in How to Log Debug in Terraform — it’s saved me hours of guessing.


Where I’ve Landed

After years of iterating on this stuff, my setup is pretty stable: S3 backend with DynamoDB locking, KMS encryption, bucket versioning, cross-account backups, state split by environment and component, directory-per-environment instead of workspaces for anything long-lived, and strict CI/CD pipelines with saved plan files.

It’s not glamorous. None of this is. But it’s the foundation everything else sits on. Get state management wrong and every terraform apply is a gamble. Get it right and deployments become boring. Boring is good. Boring means nothing caught fire today.

If you’re just getting started with Terraform state on AWS, start with the S3 backend and a lock table. Get that working. Then layer on the rest — state splitting, cross-account backups, CI/CD integration — as your infrastructure grows. You don’t need to boil the ocean on day one, but you do need to start with a remote backend. That part’s non-negotiable.