Production & Security

Production Terraform requires a fundamentally different approach than development environments. The stakes are higher, the requirements more complex, and the margin for error much smaller. Security isn’t an afterthought—it needs to be built into every aspect of your Terraform workflow, from how you handle secrets to who can make changes and when.

The patterns in this part address the operational realities of running Terraform in business-critical environments. They’re based on hard-learned lessons about what works at scale, what fails under pressure, and what practices separate reliable infrastructure from systems that break at the worst possible moments.

Secrets Management

Never, ever put secrets directly in your Terraform configuration. I’ve seen too many repositories with database passwords, API keys, and certificates committed to Git. Here’s how to handle secrets properly:

Environment variables for runtime secrets:

export TF_VAR_database_password="$(aws secretsmanager get-secret-value --secret-id prod/db/password --query SecretString --output text)"
export TF_VAR_api_key="$(vault kv get -field=api_key secret/myapp)"

terraform apply

External secret management systems:

# Fetch secrets from AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
  # other configuration...
}

# Use HashiCorp Vault
data "vault_generic_secret" "api_keys" {
  path = "secret/myapp"
}

resource "aws_lambda_function" "api" {
  environment {
    variables = {
      API_KEY = data.vault_generic_secret.api_keys.data["api_key"]
    }
  }
}

Generated secrets that Terraform manages (note that generated values are written to the state file in plain text, which makes state encryption essential):

resource "random_password" "db_password" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/database/password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_password.result
}

resource "aws_db_instance" "main" {
  password = random_password.db_password.result
  # other configuration...
}
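Whichever approach you use, mark secret values as sensitive so Terraform redacts them from plan and apply output. A minimal sketch (the values are still readable in the state file itself, so this complements rather than replaces state encryption):

```hcl
# sensitive = true masks the value in CLI output and plan diffs.
variable "database_password" {
  description = "Master password for the database"
  type        = string
  sensitive   = true
}

output "db_password" {
  value     = var.database_password
  sensitive = true
}
```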

Access Control and IAM

Terraform needs permissions to create and manage resources, but those permissions should be as limited as possible:

Principle of least privilege:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeImages",
        "ec2:DescribeVpcs",
        "ec2:DescribeSubnets",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:CreateTags"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": ["us-west-2", "us-east-1"]
        }
      }
    }
  ]
}

Environment-specific roles:

# Different IAM roles for different environments.
# The role ARN is built from variables rather than looked up with a
# data source: a provider block cannot depend on a data source that
# needs that same provider to be configured first.
variable "account_id" {
  description = "AWS account ID that owns the Terraform roles"
  type        = string
}

provider "aws" {
  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/terraform-${var.environment}"
  }
}

Cross-account access for multi-account strategies:

provider "aws" {
  alias = "production"
  
  assume_role {
    role_arn = "arn:aws:iam::123456789012:role/terraform-production"
  }
}

resource "aws_instance" "prod_web" {
  provider = aws.production
  
  ami           = "ami-12345678"
  instance_type = "t3.large"
}

State File Security

State files contain sensitive information and need special protection:

Encrypt state at rest:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012"
    dynamodb_table = "terraform-locks"
  }
}
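The dynamodb_table referenced by the backend must already exist with a string hash key named LockID. A minimal sketch of that table (the table name here is an assumption matching the backend block above):

```hcl
# State locking requires a DynamoDB table whose partition key is
# exactly "LockID" of type string.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```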

Restrict state file access with a bucket policy that grants read/write only to dedicated Terraform roles:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789012:role/terraform-ci",
          "arn:aws:iam::123456789012:role/terraform-admin"
        ]
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-terraform-state/*"
    }
  ]
}

State file versioning and backup:

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  
  rule {
    id     = "state_file_lifecycle"
    status = "Enabled"
    
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

Testing Strategies

Infrastructure code needs testing just like application code:

Validation and linting:

# Validate syntax and configuration
terraform validate

# Format code consistently
terraform fmt -recursive

# Use tflint for additional checks
tflint --init
tflint

Plan testing to catch issues before apply:

# Generate and review plans
terraform plan -out=tfplan
terraform show -json tfplan | jq '.planned_values'

# Test plans in CI/CD
terraform plan -detailed-exitcode
if [ $? -eq 2 ]; then
  echo "Plan contains changes"
  # Review or auto-approve based on your workflow
fi
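The exit-code check above can be factored into a small helper. This is a hypothetical sketch that maps the documented terraform plan -detailed-exitcode values (0 = no changes, 1 = error, 2 = changes present) to a CI decision:

```shell
# Map terraform plan -detailed-exitcode values to a CI outcome.
handle_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;   # plan succeeded, nothing to apply
    2) echo "changes" ;;      # plan succeeded, a diff is present
    *) echo "error" ;;        # the plan itself failed
  esac
}

# In CI you would run: terraform plan -detailed-exitcode -out=tfplan
# and then:            decision=$(handle_plan_exit $?)
handle_plan_exit 2   # prints "changes"
```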

Integration testing with real resources:

// Example using Terratest (Go)
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "name":       "test-vpc",
            "cidr_block": "10.0.0.0/16",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}

Policy testing with tools like Conftest:

# security.rego
package terraform.security

deny[msg] {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_instance"
    resource.values.instance_type == "t3.2xlarge"
    msg := "Large instance types require approval"
}

deny[msg] {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_security_group_rule"
    resource.values.cidr_blocks[_] == "0.0.0.0/0"
    resource.values.from_port == 22
    msg := "SSH should not be open to the world"
}
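These policies evaluate a plan rendered as JSON. A hedged sketch of the wiring, assuming conftest is installed and the Rego files live in a policy/ directory:

```shell
# Render the plan as JSON and evaluate it against the Rego policies.
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json

# --all-namespaces picks up the terraform.security package above;
# conftest exits non-zero when any deny rule fires, failing the pipeline.
conftest test tfplan.json --policy policy/ --all-namespaces
```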

Compliance and Governance

Enterprise environments need compliance controls and governance:

Resource tagging policies:

# Enforce consistent tagging
locals {
  required_tags = {
    Environment = var.environment
    Project     = var.project_name
    Owner       = var.team_name
    CostCenter  = var.cost_center
    ManagedBy   = "terraform"
  }
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"
  
  tags = merge(local.required_tags, {
    Name = "web-server"
    Role = "webserver"
  })
  
  lifecycle {
    postcondition {
      condition = alltrue([
        for tag in keys(local.required_tags) :
        contains(keys(self.tags), tag)
      ])
      error_message = "All required tags must be present."
    }
  }
}
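A complementary approach with the AWS provider is to apply the required tags at the provider level, so individual resources cannot forget them:

```hcl
# default_tags are merged into every taggable resource the provider
# creates; per-resource tags still win on key collisions.
provider "aws" {
  default_tags {
    tags = local.required_tags
  }
}
```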

Cost controls:

# Prevent expensive resources in non-production
variable "allowed_instance_types" {
  description = "Allowed EC2 instance types"
  type        = list(string)
  default     = ["t3.micro", "t3.small", "t3.medium"]
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = var.instance_type
  
  lifecycle {
    precondition {
      condition = contains(var.allowed_instance_types, var.instance_type)
      error_message = "Instance type ${var.instance_type} is not allowed in this environment."
    }
  }
}
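The same guard can live on the variable itself via a validation block, which rejects bad input at plan time before any resource is evaluated:

```hcl
variable "instance_type" {
  description = "EC2 instance type for the web server"
  type        = string

  validation {
    # Before Terraform 1.9 a validation block may only reference its own
    # variable, so the allowed list is inlined here.
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type is not in the allowed list for this environment."
  }
}
```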

Audit logging:

# Enable CloudTrail for Terraform operations
resource "aws_cloudtrail" "terraform_audit" {
  name           = "terraform-audit"
  s3_bucket_name = aws_s3_bucket.audit_logs.bucket
  
  event_selector {
    read_write_type                 = "All"
    include_management_events       = true
    
    data_resource {
      type   = "AWS::S3::Object"
      values = ["${aws_s3_bucket.terraform_state.arn}/*"]
    }
  }
  
  tags = {
    Purpose = "Terraform audit logging"
  }
}

Disaster Recovery and Backup

Production infrastructure needs disaster recovery planning:

State file backup:

#!/bin/bash
# Backup script for Terraform state
DATE=$(date +%Y%m%d-%H%M%S)
aws s3 cp s3://my-terraform-state/prod/terraform.tfstate \
  s3://my-terraform-backups/state-backups/terraform.tfstate.$DATE

# Keep only last 30 days of backups
aws s3 ls s3://my-terraform-backups/state-backups/ | \
  awk '$1 < "'$(date -d '30 days ago' '+%Y-%m-%d')'" {print $4}' | \
  xargs -I {} aws s3 rm s3://my-terraform-backups/state-backups/{}

Cross-region replication:

resource "aws_s3_bucket_replication_configuration" "terraform_state" {
  role   = aws_iam_role.replication.arn
  bucket = aws_s3_bucket.terraform_state.id
  
  rule {
    id     = "replicate_state"
    status = "Enabled"
    
    destination {
      bucket        = aws_s3_bucket.terraform_state_replica.arn
      storage_class = "STANDARD_IA"
    }
  }
}
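S3 replication only works when versioning is enabled on both buckets, so the replica needs its own versioning resource (the replica bucket name here is an assumption):

```hcl
# Replication requires versioning on source and destination alike.
resource "aws_s3_bucket_versioning" "terraform_state_replica" {
  bucket = aws_s3_bucket.terraform_state_replica.id

  versioning_configuration {
    status = "Enabled"
  }
}
```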

Infrastructure documentation:

# Generate documentation automatically
resource "local_file" "infrastructure_docs" {
  content = templatefile("${path.module}/docs/infrastructure.md.tpl", {
    vpc_id           = aws_vpc.main.id
    subnet_ids       = aws_subnet.private[*].id
    security_groups  = aws_security_group.web.id
    load_balancer    = aws_lb.main.dns_name
  })
  
  filename = "${path.module}/docs/infrastructure.md"
}

Monitoring and Alerting

Monitor your Terraform-managed infrastructure:

Resource drift detection:

#!/bin/bash
# Check for configuration drift
terraform plan -detailed-exitcode -out=drift.tfplan

if [ $? -eq 2 ]; then
  echo "Configuration drift detected!"
  terraform show drift.tfplan
  # Send alert to monitoring system
  curl -X POST "$SLACK_WEBHOOK" -d '{"text":"Terraform drift detected in production"}'
fi

State file monitoring:

resource "aws_cloudwatch_metric_alarm" "state_file_changes" {
  alarm_name          = "terraform-state-changes"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "NumberOfObjects"
  namespace           = "AWS/S3"
  period              = "86400" # S3 storage metrics are reported once per day
  statistic           = "Average"
  threshold           = "1"
  alarm_description   = "This metric monitors terraform state file object count"
  
  dimensions = {
    BucketName  = aws_s3_bucket.terraform_state.bucket
    StorageType = "AllStorageTypes"
  }
}

Security Scanning

Integrate security scanning into your Terraform workflow:

Static analysis with tools like Checkov:

# Install and run Checkov
pip install checkov
checkov -f main.tf --framework terraform

# Example output:
# FAILED for resource: aws_s3_bucket.example
# File: /main.tf:1-5
# Guide: https://docs.bridgecrew.io/docs/s3_1-acl-read-permissions-everyone

Runtime security with policy engines:

# Open Policy Agent policy
package terraform.security

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_security_group_rule"
    resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
    resource.change.after.from_port <= 22
    resource.change.after.to_port >= 22
    msg := sprintf("Security group rule allows SSH from anywhere: %v", [resource.address])
}

What’s Next

Production security and operational practices are what make Terraform suitable for managing business-critical infrastructure. The patterns we’ve covered—secrets management, access control, testing, and monitoring—form the foundation for reliable, secure infrastructure management.

In the next part, we’ll explore team collaboration patterns, including CI/CD integration, code review workflows, and the organizational practices that let multiple teams work together effectively with Terraform while maintaining security and reliability standards.