I once inherited a project with a single main.tf that was over 3,000 lines long. No modules. No abstractions. Just one enormous file that deployed an entire production environment — VPCs, ECS clusters, RDS instances, Lambda functions, IAM roles — all jammed together with hardcoded values and copy-pasted blocks. Changing a security group rule meant scrolling for five minutes and praying you edited the right resource. It was, without exaggeration, the worst Terraform I’ve ever seen.

That experience broke something in me. I spent the next three weeks doing nothing but carving that monolith into modules, and in the process I learned more about module design than I had in the previous two years. This article is everything I wish someone had told me before I started.

If you’re new to Terraform, my Terraform primer covers the fundamentals. What follows assumes you’re comfortable with the basics and ready to think about structure.


Why Most Modules Are Wrong

Most Terraform modules I see on the registry are over-engineered garbage. I don’t say that to be provocative — I say it because I’ve wasted days debugging modules that tried to be everything to everyone and ended up being useful to nobody.

The typical failure mode looks like this: someone creates a module with 47 input variables, 30 of which have defaults, conditional logic everywhere using count and for_each with ternary expressions nested three levels deep, and a README that’s longer than the code itself. The module “supports” every possible configuration, which means it actually supports none of them well.

Good module design starts with a question most people skip: who is this for?

If the answer is “my team, deploying our specific application stack,” then the module should encode your team’s opinions. It shouldn’t expose every knob. It should make the right thing easy and the wrong thing impossible.

If the answer is “the open-source community,” then yes, you need more flexibility. But even then, most registry modules would be better as three focused modules instead of one sprawling one.


The Composition Pattern

The single most important pattern I use is composition over configuration. Instead of building one massive module that does everything, I build small modules that do one thing and compose them together.

Here’s what I mean. Instead of this:

module "application" {
  source = "./modules/mega-app"

  vpc_cidr             = "10.0.0.0/16"
  create_database      = true
  database_engine      = "postgres"
  database_instance    = "db.r6g.large"
  create_cache         = true
  cache_engine         = "redis"
  create_cdn           = false
  enable_waf           = true
  container_image      = "myapp:latest"
  container_cpu        = 512
  container_memory     = 1024
  # ... 40 more variables
}

I do this:

module "network" {
  source   = "./modules/network"
  cidr     = "10.0.0.0/16"
  env_name = var.environment
}

module "database" {
  source          = "./modules/postgres"
  subnet_ids      = module.network.database_subnet_ids
  security_groups = [module.network.database_sg_id]
  instance_class  = "db.r6g.large"
}

module "service" {
  source          = "./modules/ecs-service"
  subnet_ids      = module.network.private_subnet_ids
  security_groups = [module.network.app_sg_id]
  image           = "myapp:latest"
  cpu             = 512
  memory          = 1024
  db_endpoint     = module.database.endpoint
}

The second approach has more lines, but each module is independently testable, independently versionable, and independently understandable. When the database module breaks, I know exactly where to look. When I need to swap Redis for Memcached, I replace one module instead of flipping a boolean in a mega-module and hoping the conditional logic works.

This maps directly to what I wrote about splitting Terraform modules into separate repositories — composition is the prerequisite for that split.


Module Interface Design

The interface of a module — its variables and outputs — matters more than its implementation. I’ve refactored module internals dozens of times without touching a single calling configuration, and that’s only possible because the interface was right from the start.

My rules for variables:

Require what matters, default what doesn’t. If a value changes between environments, it’s a required variable. If it’s the same everywhere, it’s a default. Don’t make people specify things they’ll never change.

variable "instance_class" {
  description = "RDS instance class"
  type        = string
  # No default — this MUST vary by environment
}

variable "backup_retention_days" {
  description = "Number of days to retain backups"
  type        = number
  default     = 7
  # Sane default, override if you need to
}

variable "deletion_protection" {
  description = "Enable deletion protection"
  type        = bool
  default     = true
  # Safe by default
}

Use objects for related values. When you’ve got five variables that always travel together, that’s a struct waiting to happen:

variable "container" {
  description = "Container configuration"
  type = object({
    image  = string
    cpu    = number
    memory = number
    port   = number
  })
}

This is cleaner than container_image, container_cpu, container_memory, container_port as four separate variables. It also makes it obvious these values are related.

Outputs should be references, not computed strings. Don’t output a connection string you’ve assembled — output the host, port, and database name separately. Let the caller compose what they need. You can’t predict every format someone will want.

output "endpoint" {
  description = "Database endpoint hostname"
  value       = aws_db_instance.this.address
}

output "port" {
  description = "Database port"
  value       = aws_db_instance.this.port
}

output "arn" {
  description = "Database instance ARN"
  value       = aws_db_instance.this.arn
}
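
On the calling side, composing those reference outputs is a one-liner. A sketch — the module label and the "appdb" database name are illustrative:

locals {
  # Assemble whatever format this caller needs from the module's
  # individual outputs; another caller might build a psql URI instead.
  jdbc_url = "jdbc:postgresql://${module.database.endpoint}:${module.database.port}/appdb"
}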


Versioning That Actually Works

If you’re using modules from a shared registry or separate repos, versioning isn’t optional. I’ve seen teams pin to main and then wonder why their infrastructure changed when someone merged a PR. That’s not Terraform’s fault — that’s a process failure.

I use semantic versioning for all shared modules. The rules are simple:

  • Patch (1.0.x): Bug fixes, documentation, internal refactors that don’t change behavior
  • Minor (1.x.0): New variables with defaults, new outputs, new optional resources
  • Major (x.0.0): Removed variables, changed defaults, renamed resources (state-breaking changes)

In practice:

module "network" {
  source  = "git::https://github.com/myorg/terraform-aws-network.git?ref=v2.1.0"
  cidr    = "10.0.0.0/16"
}

Never use a branch reference in production. Never. I don’t care how stable you think main is. Pin to a tag.

For managing multiple environments, version pinning becomes even more critical. You want to be able to promote a module version from dev to staging to production deliberately, not accidentally.
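
Concretely, deliberate promotion means each environment's root config references its own tag, and bumping the tag is an explicit PR. A sketch, with illustrative versions:

# environments/dev/main.tf — gets the new version first
module "network" {
  source = "git::https://github.com/myorg/terraform-aws-network.git?ref=v2.2.0"
  cidr   = "10.1.0.0/16"
}

# environments/prod/main.tf — stays on the known-good version
module "network" {
  source = "git::https://github.com/myorg/terraform-aws-network.git?ref=v2.1.0"
  cidr   = "10.0.0.0/16"
}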

The upgrade path matters too. When I release a major version, I include a migration guide in the changelog. Not “see the new variables” — actual step-by-step instructions including any terraform state mv commands needed. If you’re going to break people’s workflows, at least make it easy to fix.


Testing Modules

I’ll be honest: most teams don’t test their Terraform modules, and most of the time they get away with it. But when they don’t get away with it, the blast radius is enormous. A bad module version deployed to production can take down infrastructure in ways that are genuinely hard to recover from.

My testing approach has three layers:

Static analysis catches the obvious stuff. I run terraform validate and tflint in CI on every PR. This is table stakes — if you’re not doing this, start today.

# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.31.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_naming_convention" {
  enabled = true
}

rule "terraform_unused_declarations" {
  enabled = true
}

Plan-level tests verify that a module produces the expected resources without actually creating anything. I use terraform plan with -out and then inspect the plan file. For more structured testing, Terratest works but it’s slow. I’ve been using terraform test (the native testing framework) more lately:

# tests/basic.tftest.hcl
run "creates_vpc_with_correct_cidr" {
  command = plan

  variables {
    cidr     = "10.0.0.0/16"
    env_name = "test"
  }

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block doesn't match input"
  }

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "Expected 3 private subnets"
  }
}

Integration tests actually deploy infrastructure, run checks, and tear it down. These are expensive and slow, so I only run them on merges to main, not on every PR. Terratest is still the best option here:

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestNetworkModule(t *testing.T) {
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/basic",
    })

    // Destroy runs even if the assertions below fail
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID)
}

The key insight: test the module’s contract, not its implementation. I don’t care how many aws_route_table_association resources exist internally. I care that the outputs are correct and the infrastructure works.


A Real Module: ECS Service

Theory is great, but let me show you a module I actually use in production. This deploys an ECS Fargate service with an ALB target group. It’s opinionated — it assumes Fargate, it assumes you’re passing in a load balancer, and it doesn’t try to create the cluster.

# modules/ecs-service/variables.tf
variable "name" {
  description = "Service name"
  type        = string
}

variable "cluster_arn" {
  description = "ECS cluster ARN"
  type        = string
}

variable "container" {
  description = "Container configuration"
  type = object({
    image  = string
    cpu    = number
    memory = number
    port   = number
  })
}

variable "subnet_ids" {
  description = "Subnets for the service"
  type        = list(string)
}

variable "security_group_ids" {
  description = "Security groups for the service"
  type        = list(string)
}

variable "target_group_arn" {
  description = "ALB target group ARN"
  type        = string
}

variable "desired_count" {
  description = "Number of tasks"
  type        = number
  default     = 2
}

variable "environment" {
  description = "Environment variables for the container"
  type        = map(string)
  default     = {}
}

# modules/ecs-service/main.tf
resource "aws_ecs_task_definition" "this" {
  family                   = var.name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.container.cpu
  memory                   = var.container.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([{
    name      = var.name
    image     = var.container.image
    essential = true
    portMappings = [{
      containerPort = var.container.port
      protocol      = "tcp"
    }]
    environment = [
      for k, v in var.environment : { name = k, value = v }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.this.name
        "awslogs-region"        = data.aws_region.current.name
        "awslogs-stream-prefix" = var.name
      }
    }
  }])
}

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = var.name
    container_port   = var.container.port
  }

  lifecycle {
    ignore_changes = [desired_count]
  }
}

resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${var.name}"
  retention_in_days = 30
}

data "aws_region" "current" {}

# modules/ecs-service/outputs.tf
output "service_name" {
  description = "ECS service name"
  value       = aws_ecs_service.this.name
}

output "task_definition_arn" {
  description = "Task definition ARN"
  value       = aws_ecs_task_definition.this.arn
}

output "log_group_name" {
  description = "CloudWatch log group name"
  value       = aws_cloudwatch_log_group.this.name
}

Notice what this module doesn’t do: it doesn’t create the cluster, the VPC, the load balancer, or the DNS record. Those are separate concerns handled by separate modules. The caller wires them together. That’s composition.

The ignore_changes on desired_count is deliberate — I don’t want Terraform fighting with autoscaling. This is the kind of operational opinion that belongs in a module. Anyone on my team who deploys an ECS service gets this behavior automatically, without having to remember it.


Patterns I Keep Coming Back To

After building modules for a few years, certain patterns show up again and again.

The “defaults with escape hatches” pattern. Encode your team’s standards as defaults, but allow overrides for the cases that genuinely need them. Don’t make people fight the module to do something reasonable.

variable "tags" {
  description = "Additional tags to apply"
  type        = map(string)
  default     = {}
}

locals {
  default_tags = {
    ManagedBy   = "terraform"
    Module      = "ecs-service"
  }
  tags = merge(local.default_tags, var.tags)
}

The “data source lookup” pattern. Instead of requiring callers to pass in ARNs and IDs they’d have to look up anyway, do the lookup inside the module when it makes sense:

variable "vpc_name" {
  description = "Name tag of the VPC"
  type        = string
}

data "aws_vpc" "selected" {
  filter {
    name   = "tag:Name"
    values = [var.vpc_name]
  }
}

This trades a string variable for a data source call. Sometimes that’s the right trade — it depends on whether the caller naturally has the ID or the name. Don’t be dogmatic about it.

The “feature flag” pattern — used sparingly. Sometimes you genuinely need optional resources. Use count for simple on/off, but keep it to one or two flags per module. The moment you have five feature flags, you actually have five modules pretending to be one.

variable "enable_autoscaling" {
  description = "Enable autoscaling for the service"
  type        = bool
  default     = false
}

resource "aws_appautoscaling_target" "this" {
  count              = var.enable_autoscaling ? 1 : 0
  max_capacity       = 10
  min_capacity       = var.desired_count
  # derive the cluster name from the cluster ARN the module already receives
  resource_id        = "service/${split("/", var.cluster_arn)[1]}/${aws_ecs_service.this.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}


Module Structure and Documentation

Every module I write follows the same file layout:

modules/ecs-service/
├── main.tf          # Resources
├── variables.tf     # Input variables
├── outputs.tf       # Outputs
├── versions.tf      # Provider and Terraform version constraints
├── README.md        # Usage examples and requirements
└── tests/
    └── basic.tftest.hcl

The versions.tf file is one people skip, and it bites them later:

terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}

Pin the provider to a major version range. Don’t pin to an exact version in a module — that creates conflicts when the caller uses a different patch version. The root module is where you pin exactly.

For documentation, I use terraform-docs to auto-generate the variable and output tables, but I always write the usage example by hand. Auto-generated docs tell you what the inputs are. A hand-written example tells you how to actually use the thing. Both matter.


When Not to Modularize

Not everything needs to be a module. I’ve seen teams go so module-crazy that they wrap a single resource in a module, adding indirection without adding value. If your “module” is just a thin wrapper around aws_s3_bucket with the same variables, you haven’t abstracted anything — you’ve just added a layer of indirection that makes debugging harder.

I reach for a module when:

  • The same pattern appears in three or more places (the rule of three)
  • The configuration encodes operational knowledge that shouldn’t live in someone’s head
  • Multiple resources need to be created together and have internal dependencies
  • I want to enforce standards across teams

I don’t create a module when:

  • It’s a single resource with straightforward configuration
  • The “module” would just pass through every variable to one resource
  • It’s a one-off piece of infrastructure that won’t be replicated

This connects to broader IaC best practices — abstraction should reduce complexity, not relocate it.


The Migration Path

If you’re staring at a monolithic Terraform configuration right now (like that 3,000-line main.tf I inherited), here’s how I’d approach the refactor:

Start with Terraform state management — make sure your state is backed up and you understand terraform state mv. You’ll be using it a lot.

Then pick the most self-contained group of resources. Usually that’s the network layer — VPC, subnets, route tables, NAT gateways. Extract those into a module, run terraform plan, and verify you get “no changes.” If the plan shows destroys and recreates, your state moves aren’t right.
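
On Terraform 1.1 or later, moved blocks can express those state moves declaratively instead of running terraform state mv by hand — a sketch, with illustrative resource addresses:

# Tell Terraform the resources didn't change, only their addresses did.
# The next plan should show moves, not destroy/create pairs.
moved {
  from = aws_vpc.main
  to   = module.network.aws_vpc.main
}

moved {
  from = aws_subnet.private[0]
  to   = module.network.aws_subnet.private[0]
}

The advantage over state mv is that the moves are reviewed in the same PR as the extraction and applied atomically with it.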

Work outward from there. Database next, then compute, then the glue (IAM roles, security groups, DNS). Each extraction should result in a clean terraform plan with no changes.

Don’t try to do it all at once. I spent three weeks on that 3,000-line file, and I did it in small PRs — one module extraction per PR, each one verified with a plan. It was tedious. It was also the only way to do it safely.

The end result was worth it. What had been an untouchable monolith became a set of composable, testable, versionable modules that the whole team could work on simultaneously. Deployments went from “everyone hold your breath” to routine. That’s what good module design buys you — not elegance for its own sake, but the ability to move fast without breaking things.

That 3,000-line file taught me something I keep coming back to: the goal of infrastructure code isn’t to be clever. It’s to be boring. Modules should be so predictable, so well-structured, so obviously correct that deploying infrastructure feels like filling out a form. Save the creativity for the application layer.