Production & Security
Production Terraform requires a fundamentally different approach than development environments. The stakes are higher, the requirements more complex, and the margin for error much smaller. Security isn’t an afterthought—it needs to be built into every aspect of your Terraform workflow, from how you handle secrets to who can make changes and when.
The patterns in this part address the operational realities of running Terraform in business-critical environments. They’re based on hard-learned lessons about what works at scale, what fails under pressure, and what practices separate reliable infrastructure from systems that break at the worst possible moments.
Secrets Management
Never, ever put secrets directly in your Terraform configuration. I’ve seen too many repositories with database passwords, API keys, and certificates committed to Git. Here’s how to handle secrets properly:
Environment variables for runtime secrets:
export TF_VAR_database_password="$(aws secretsmanager get-secret-value --secret-id prod/db/password --query SecretString --output text)"
export TF_VAR_api_key="$(vault kv get -field=api_key secret/myapp)"
terraform apply
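The exported TF_VAR_ values need matching variable declarations on the Terraform side; a minimal sketch, with the variables marked sensitive so Terraform redacts them in plan and apply output:
variable "database_password" {
  description = "Database master password, supplied via TF_VAR_database_password"
  type        = string
  sensitive   = true
}

variable "api_key" {
  description = "Application API key, supplied via TF_VAR_api_key"
  type        = string
  sensitive   = true
}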
External secret management systems:
# Fetch secrets from AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
  # other configuration...
}
# Use HashiCorp Vault
data "vault_generic_secret" "api_keys" {
  path = "secret/myapp"
}

resource "aws_lambda_function" "api" {
  environment {
    variables = {
      API_KEY = data.vault_generic_secret.api_keys.data["api_key"]
    }
  }
}
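The vault_generic_secret data source above assumes a configured Vault provider; a minimal sketch, with a placeholder address and the token expected from the VAULT_TOKEN environment variable:
provider "vault" {
  # Placeholder address; authentication typically comes from VAULT_TOKEN
  # or another auth method rather than a hard-coded token.
  address = "https://vault.example.com:8200"
}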
Generated secrets that Terraform manages (keep in mind that values produced by random_password are stored in the state file, so the state protections described later in this part still matter):
resource "random_password" "db_password" {
length = 32
special = true
}
resource "aws_secretsmanager_secret" "db_password" {
name = "prod/database/password"
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = random_password.db_password.result
}
resource "aws_db_instance" "main" {
password = random_password.db_password.result
# other configuration...
}
Access Control and IAM
Terraform needs permissions to create and manage resources, but those permissions should be as limited as possible:
Principle of least privilege:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeImages",
        "ec2:DescribeVpcs",
        "ec2:DescribeSubnets",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:CreateTags"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": ["us-west-2", "us-east-1"]
        }
      }
    }
  ]
}
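Managing that policy with Terraform keeps it reviewable alongside the rest of the code; a sketch, assuming an existing terraform-ci role name and using jsonencode() so the policy document lives in the same configuration:
resource "aws_iam_policy" "terraform_ec2" {
  name = "terraform-ec2-limited"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ec2:DescribeInstances",
        "ec2:DescribeImages",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:CreateTags"
      ]
      Resource = "*"
      Condition = {
        StringEquals = {
          "aws:RequestedRegion" = ["us-west-2", "us-east-1"]
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "terraform_ec2" {
  role       = "terraform-ci" # assumed role name
  policy_arn = aws_iam_policy.terraform_ec2.arn
}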
Environment-specific roles:
# Different IAM roles for different environments
provider "aws" {
  assume_role {
    # Build the role ARN directly; looking it up with a data source would make
    # the provider configuration depend on a data source that itself needs this
    # provider. var.aws_account_id is assumed to be defined alongside var.environment.
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/terraform-${var.environment}"
  }
}
Cross-account access for multi-account strategies:
provider "aws" {
alias = "production"
assume_role {
role_arn = "arn:aws:iam::123456789012:role/terraform-production"
}
}
resource "aws_instance" "prod_web" {
provider = aws.production
ami = "ami-12345678"
instance_type = "t3.large"
}
State File Security
State files contain sensitive information and need special protection:
Encrypt state at rest:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012"
    dynamodb_table = "terraform-locks"
  }
}
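The dynamodb_table setting above assumes a lock table already exists; a minimal sketch of that table, which must use a string attribute named LockID as its hash key:
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}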
Restrict state file access:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789012:role/terraform-ci",
          "arn:aws:iam::123456789012:role/terraform-admin"
        ]
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-terraform-state/*"
    }
  ]
}
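Bucket policies are easier to reason about when public access is blocked outright at the bucket level; a short sketch, assuming the state bucket is managed in the same configuration:
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}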
State file versioning and backup:
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
id = "state_file_lifecycle"
status = "Enabled"
noncurrent_version_expiration {
noncurrent_days = 90
}
}
}
Testing Strategies
Infrastructure code needs testing just like application code:
Validation and linting:
# Validate syntax and configuration
terraform validate
# Format code consistently
terraform fmt -recursive
# Use tflint for additional checks
tflint --init
tflint
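tflint reads its provider-specific rules from a .tflint.hcl file; a minimal sketch (the plugin version shown is illustrative):
plugin "aws" {
  enabled = true
  version = "0.30.0" # illustrative version
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}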
Plan testing to catch issues before apply:
# Generate and review plans
terraform plan -out=tfplan
terraform show -json tfplan | jq '.planned_values'

# Test plans in CI/CD
# -detailed-exitcode returns 0 for no changes, 1 for errors, 2 for pending changes
terraform plan -detailed-exitcode
if [ $? -eq 2 ]; then
  echo "Plan contains changes"
  # Review or auto-approve based on your workflow
fi
Integration testing with real resources:
// Example using Terratest (Go)
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "name":       "test-vpc",
            "cidr_block": "10.0.0.0/16",
        },
    }

    // Destroy the test resources when the test finishes
    defer terraform.Destroy(t, terraformOptions)

    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
Policy testing with tools like Conftest:
# security.rego
package terraform.security

deny[msg] {
  resource := input.planned_values.root_module.resources[_]
  resource.type == "aws_instance"
  resource.values.instance_type == "t3.2xlarge"
  msg := "Large instance types require approval"
}

deny[msg] {
  resource := input.planned_values.root_module.resources[_]
  resource.type == "aws_security_group_rule"
  resource.values.cidr_blocks[_] == "0.0.0.0/0"
  resource.values.from_port == 22
  msg := "SSH should not be open to the world"
}
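Policies like these are evaluated against the JSON form of a plan; a minimal sketch using Conftest, assuming the .rego files live in a local policy/ directory:
# Render the plan as JSON and evaluate it against the Rego policies
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json --policy policy/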
Compliance and Governance
Enterprise environments need compliance controls and governance:
Resource tagging policies:
# Enforce consistent tagging
locals {
  required_tags = {
    Environment = var.environment
    Project     = var.project_name
    Owner       = var.team_name
    CostCenter  = var.cost_center
    ManagedBy   = "terraform"
  }
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"

  tags = merge(local.required_tags, {
    Name = "web-server"
    Role = "webserver"
  })

  lifecycle {
    postcondition {
      condition = alltrue([
        for tag in keys(local.required_tags) :
        contains(keys(self.tags), tag)
      ])
      error_message = "All required tags must be present."
    }
  }
}
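For tags that never vary per resource, the AWS provider's default_tags block cuts down on repeated merge() calls; a brief sketch using the same locals:
provider "aws" {
  region = "us-west-2"

  default_tags {
    # Resource-level tags such as Name are merged on top of these defaults.
    tags = local.required_tags
  }
}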
Cost controls:
# Prevent expensive resources in non-production
variable "allowed_instance_types" {
  description = "Allowed EC2 instance types"
  type        = list(string)
  default     = ["t3.micro", "t3.small", "t3.medium"]
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = var.instance_type

  lifecycle {
    precondition {
      condition     = contains(var.allowed_instance_types, var.instance_type)
      error_message = "Instance type ${var.instance_type} is not allowed in this environment."
    }
  }
}
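The same restriction can also live on the variable itself, so invalid values are rejected as soon as they are supplied; a sketch with the allowed list inlined (referencing another variable inside a validation block requires a recent Terraform release):
variable "instance_type" {
  description = "EC2 instance type for the web tier"
  type        = string
  default     = "t3.micro"

  validation {
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type must be one of t3.micro, t3.small, or t3.medium."
  }
}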
Audit logging:
# Enable CloudTrail for Terraform operations
resource "aws_cloudtrail" "terraform_audit" {
  name           = "terraform-audit"
  s3_bucket_name = aws_s3_bucket.audit_logs.bucket

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["${aws_s3_bucket.terraform_state.arn}/*"]
    }
  }

  tags = {
    Purpose = "Terraform audit logging"
  }
}
Disaster Recovery and Backup
Production infrastructure needs disaster recovery planning:
State file backup:
#!/bin/bash
# Backup script for Terraform state
DATE=$(date +%Y%m%d-%H%M%S)

aws s3 cp s3://my-terraform-state/prod/terraform.tfstate \
  s3://my-terraform-backups/state-backups/terraform.tfstate.$DATE

# Keep only last 30 days of backups (date -d is GNU-specific; macOS needs gdate)
aws s3 ls s3://my-terraform-backups/state-backups/ | \
  awk '$1 < "'$(date -d '30 days ago' '+%Y-%m-%d')'" {print $4}' | \
  xargs -I {} aws s3 rm s3://my-terraform-backups/state-backups/{}
Cross-region replication:
resource "aws_s3_bucket_replication_configuration" "terraform_state" {
role = aws_iam_role.replication.arn
bucket = aws_s3_bucket.terraform_state.id
rule {
id = "replicate_state"
status = "Enabled"
destination {
bucket = aws_s3_bucket.terraform_state_replica.arn
storage_class = "STANDARD_IA"
}
}
}
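Replication also requires versioning on the destination bucket; a sketch of the replica side, with the bucket name and provider alias as assumptions:
resource "aws_s3_bucket" "terraform_state_replica" {
  provider = aws.replica # assumed provider alias for the secondary region
  bucket   = "my-terraform-state-replica"
}

resource "aws_s3_bucket_versioning" "terraform_state_replica" {
  provider = aws.replica
  bucket   = aws_s3_bucket.terraform_state_replica.id

  versioning_configuration {
    status = "Enabled"
  }
}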
Infrastructure documentation:
# Generate documentation automatically
resource "local_file" "infrastructure_docs" {
content = templatefile("${path.module}/docs/infrastructure.md.tpl", {
vpc_id = aws_vpc.main.id
subnet_ids = aws_subnet.private[*].id
security_groups = aws_security_group.web.id
load_balancer = aws_lb.main.dns_name
})
filename = "${path.module}/docs/infrastructure.md"
}
Monitoring and Alerting
Monitor your Terraform-managed infrastructure:
Resource drift detection:
#!/bin/bash
# Check for configuration drift
terraform plan -detailed-exitcode -out=drift.tfplan
if [ $? -eq 2 ]; then
  echo "Configuration drift detected!"
  terraform show drift.tfplan
  # Send alert to monitoring system
  curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"Terraform drift detected in production"}' "$SLACK_WEBHOOK"
fi
State file monitoring:
resource "aws_cloudwatch_metric_alarm" "state_file_changes" {
alarm_name = "terraform-state-changes"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
metric_name = "NumberOfObjects"
namespace = "AWS/S3"
period = "300"
statistic = "Average"
threshold = "1"
alarm_description = "This metric monitors terraform state file changes"
dimensions = {
BucketName = aws_s3_bucket.terraform_state.bucket
StorageType = "AllStorageTypes"
}
}
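An alarm only matters if it reaches a human; a sketch of an SNS topic and subscription (the email address is a placeholder) that the alarm above could reference via alarm_actions = [aws_sns_topic.terraform_alerts.arn]:
resource "aws_sns_topic" "terraform_alerts" {
  name = "terraform-alerts"
}

resource "aws_sns_topic_subscription" "terraform_alerts_email" {
  topic_arn = aws_sns_topic.terraform_alerts.arn
  protocol  = "email"
  endpoint  = "platform-team@example.com" # placeholder address
}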
Security Scanning
Integrate security scanning into your Terraform workflow:
Static analysis with tools like Checkov:
# Install and run Checkov
pip install checkov
checkov -f main.tf --framework terraform
# Example output:
# FAILED for resource: aws_s3_bucket.example
# File: /main.tf:1-5
# Guide: https://docs.bridgecrew.io/docs/s3_1-acl-read-permissions-everyone
Runtime security with policy engines:
# Open Policy Agent policy
package terraform.security

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
  resource.change.after.from_port <= 22
  resource.change.after.to_port >= 22
  msg := sprintf("Security group rule allows SSH from anywhere: %v", [resource.address])
}
What’s Next
Production security and operational practices are what make Terraform suitable for managing business-critical infrastructure. The patterns we’ve covered—secrets management, access control, testing, and monitoring—form the foundation for reliable, secure infrastructure management.
In the next part, we’ll explore team collaboration patterns, including CI/CD integration, code review workflows, and the organizational practices that let multiple teams work together effectively with Terraform while maintaining security and reliability standards.