Locking and Concurrency

When multiple team members work with the same Terraform configuration, state corruption becomes a real risk. Without proper locking mechanisms, concurrent operations can overwrite each other’s changes, leading to inconsistent state files and potentially dangerous infrastructure modifications.

This part covers state locking strategies, concurrency control patterns, and recovery techniques that ensure safe collaboration in team environments.

State Locking Fundamentals

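Terraform locks the state before any operation that could write to it (plan, apply, destroy, and the state subcommands) and releases the lock when the operation finishes. A second operation against the same state then fails instead of overwriting the first. With the S3 backend, setting dynamodb_table enables locking through a DynamoDB table keyed on LockID (Terraform 1.10 and later can alternatively lock inside the bucket itself with use_lockfile):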

# Backend with locking support
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
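}

The bucket and lock table named in this backend block must exist before terraform init can use them. The following script creates both:

#!/bin/bash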
# scripts/setup-state-locking.sh

set -e

BUCKET_NAME=${1:-"company-terraform-state"}
DYNAMODB_TABLE=${2:-"terraform-locks"}
AWS_REGION=${3:-"us-west-2"}

echo "Setting up Terraform state locking infrastructure..."

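# Create S3 bucket for state storage
# (us-east-1 rejects an explicit LocationConstraint, so omit it there)
if [ "$AWS_REGION" = "us-east-1" ]; then
    aws s3api create-bucket \
        --bucket "$BUCKET_NAME" \
        --region "$AWS_REGION"
else
    aws s3api create-bucket \
        --bucket "$BUCKET_NAME" \
        --region "$AWS_REGION" \
        --create-bucket-configuration LocationConstraint="$AWS_REGION"
fi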

# Enable versioning
aws s3api put-bucket-versioning \
    --bucket "$BUCKET_NAME" \
    --versioning-configuration Status=Enabled

# Enable encryption
aws s3api put-bucket-encryption \
    --bucket "$BUCKET_NAME" \
    --server-side-encryption-configuration '{
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "AES256"
            }
        }]
    }'

# Block public access
aws s3api put-public-access-block \
    --bucket "$BUCKET_NAME" \
    --public-access-block-configuration \
        BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

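# Create DynamoDB table for locking
aws dynamodb create-table \
    --table-name "$DYNAMODB_TABLE" \
    --attribute-definitions AttributeName=LockID,AttributeType=S \
    --key-schema AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST \
    --region "$AWS_REGION"

# Table creation is asynchronous; wait until the table is ready to use
aws dynamodb wait table-exists \
    --table-name "$DYNAMODB_TABLE" \
    --region "$AWS_REGION"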

echo "✅ State locking infrastructure created successfully"
echo "Bucket: $BUCKET_NAME"
echo "DynamoDB Table: $DYNAMODB_TABLE"
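
The DynamoDB lock works because the write that creates the lock item is conditional: it succeeds only if no item with that LockID exists yet, and the check and the write happen as one atomic step. The sketch below models that behaviour in memory so it can be run without AWS. InMemoryLockTable, put_if_absent, and race_for_lock are hypothetical names invented for this illustration; they are not part of Terraform or boto3.

```python
import threading


class ConditionalCheckFailed(Exception):
    """Raised when the lock item already exists (stand-in for DynamoDB's error)."""


class InMemoryLockTable:
    """Toy model of a lock table. This is NOT a real DynamoDB client."""

    def __init__(self):
        self._items = {}
        # The mutex makes check-and-write atomic, which DynamoDB does server-side.
        self._mutex = threading.Lock()

    def put_if_absent(self, lock_id, info):
        # Similar in spirit to a put guarded by attribute_not_exists(LockID).
        with self._mutex:
            if lock_id in self._items:
                raise ConditionalCheckFailed(lock_id)
            self._items[lock_id] = info

    def delete(self, lock_id):
        with self._mutex:
            self._items.pop(lock_id, None)


def race_for_lock(table, lock_id, contenders):
    """Start one thread per contender and return the names that got the lock."""
    winners = []
    winners_mutex = threading.Lock()

    def attempt(name):
        try:
            table.put_if_absent(lock_id, {"Who": name})
        except ConditionalCheckFailed:
            return  # someone else holds the lock
        with winners_mutex:
            winners.append(name)

    threads = [threading.Thread(target=attempt, args=(name,)) for name in contenders]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners


if __name__ == "__main__":
    table = InMemoryLockTable()
    winners = race_for_lock(table, "demo/terraform.tfstate", ["alice", "bob", "ci"])
    print(f"contenders that acquired the lock: {len(winners)}")
```

However the threads are scheduled, exactly one contender ends up holding the lock. Which one wins is not deterministic, which is why a lock record should identify its owner.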

Advanced Locking Strategies

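Terraform's built-in lock covers a single operation on a single state and never expires on its own. For coarser coordination, such as reserving an environment for a multi-step deployment, you can layer an advisory lock on top. The manager below stores its own records with an owner and an expiry time. Terraform itself does not read these records, so every pipeline has to call the manager voluntarily. Keeping them in a separate table (or under distinct LockID values) avoids mixing them with Terraform's native lock items: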

#!/usr/bin/env python3
# scripts/terraform_lock_manager.py

import boto3
import time
import json
import sys
from datetime import datetime, timedelta
from typing import Optional, Dict, Any

class TerraformLockManager:
    def __init__(self, table_name: str, region: str = "us-west-2"):
        self.dynamodb = boto3.resource('dynamodb', region_name=region)
        self.table = self.dynamodb.Table(table_name)
        self.table_name = table_name
    
    def acquire_lock(self, lock_id: str, operation: str, who: str, 
                    timeout_minutes: int = 30) -> bool:
        """Acquire a lock with timeout and metadata"""
        
        lock_info = {
            'LockID': lock_id,
            'Operation': operation,
            'Who': who,
            'Version': '1',
            'Created': datetime.utcnow().isoformat(),
            'Expires': (datetime.utcnow() + timedelta(minutes=timeout_minutes)).isoformat(),
            'Info': json.dumps({
                'operation': operation,
                'user': who,
                'timestamp': datetime.utcnow().isoformat(),
                'timeout_minutes': timeout_minutes
            })
        }
        
        try:
            # Attempt to create lock (will fail if exists)
            self.table.put_item(
                Item=lock_info,
                ConditionExpression='attribute_not_exists(LockID)'
            )
            print(f"✅ Lock acquired: {lock_id}")
            return True
            
        except self.dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
            # Lock already exists, check if expired
            existing_lock = self.get_lock_info(lock_id)
            if existing_lock and self._is_lock_expired(existing_lock):
                print(f"🔄 Existing lock expired, attempting to acquire...")
                return self._force_acquire_lock(lock_id, lock_info)
            
            print(f"❌ Lock already held: {lock_id}")
            if existing_lock:
                self._print_lock_info(existing_lock)
            return False
    
    def release_lock(self, lock_id: str, who: str) -> bool:
        """Release a lock with ownership verification"""
        
        try:
            existing_lock = self.get_lock_info(lock_id)
            if not existing_lock:
                print(f"⚠️  No lock found: {lock_id}")
                return True
            
            # Verify ownership
            if existing_lock.get('Who') != who:
                print(f"❌ Cannot release lock owned by {existing_lock.get('Who')}")
                return False
            
            self.table.delete_item(
                Key={'LockID': lock_id},
                ConditionExpression='Who = :who',
                ExpressionAttributeValues={':who': who}
            )
            
            print(f"✅ Lock released: {lock_id}")
            return True
            
        except Exception as e:
            print(f"❌ Failed to release lock: {e}")
            return False
    
    def get_lock_info(self, lock_id: str) -> Optional[Dict[str, Any]]:
        """Get information about a lock"""
        
        try:
            response = self.table.get_item(Key={'LockID': lock_id})
            return response.get('Item')
        except Exception:
            return None
    
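    def list_locks(self) -> list:
        """List all active locks"""
        
        try:
            # A single scan() call returns at most 1 MB of data, so keep
            # following LastEvaluatedKey until every page has been read
            response = self.table.scan()
            items = response.get('Items', [])
            while 'LastEvaluatedKey' in response:
                response = self.table.scan(
                    ExclusiveStartKey=response['LastEvaluatedKey']
                )
                items.extend(response.get('Items', []))
            return items
        except Exception as e:
            print(f"❌ Failed to list locks: {e}")
            return []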
    
    def force_unlock(self, lock_id: str, reason: str) -> bool:
        """Force unlock (admin operation)"""
        
        existing_lock = self.get_lock_info(lock_id)
        if not existing_lock:
            print(f"⚠️  No lock found: {lock_id}")
            return True
        
        print(f"🚨 Force unlocking {lock_id}")
        self._print_lock_info(existing_lock)
        print(f"Reason: {reason}")
        
        try:
            self.table.delete_item(Key={'LockID': lock_id})
            print(f"✅ Force unlock completed: {lock_id}")
            return True
        except Exception as e:
            print(f"❌ Force unlock failed: {e}")
            return False
    
    def _is_lock_expired(self, lock_info: Dict[str, Any]) -> bool:
        """Check if a lock has expired"""
        
        expires_str = lock_info.get('Expires')
        if not expires_str:
            return False
        
        try:
            expires = datetime.fromisoformat(expires_str)
            return datetime.utcnow() > expires
        except Exception:
            return False
    
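    def _force_acquire_lock(self, lock_id: str, lock_info: Dict[str, Any]) -> bool:
        """Replace an expired lock, but only if it is still expired at write time"""
        
        try:
            # Conditional write: if another process has already replaced the
            # expired lock, its new Expires value lies in the future and this
            # put fails instead of silently stealing the lock. ISO 8601
            # timestamps in the same format compare correctly as strings.
            self.table.put_item(
                Item=lock_info,
                ConditionExpression='attribute_not_exists(LockID) OR Expires < :now',
                ExpressionAttributeValues={':now': datetime.utcnow().isoformat()}
            )
            print(f"✅ Expired lock replaced: {lock_id}")
            return True
        except Exception as e:
            print(f"❌ Failed to replace expired lock: {e}")
            return False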
    
    def _print_lock_info(self, lock_info: Dict[str, Any]):
        """Print formatted lock information"""
        
        print(f"  Lock ID: {lock_info.get('LockID')}")
        print(f"  Operation: {lock_info.get('Operation')}")
        print(f"  Owner: {lock_info.get('Who')}")
        print(f"  Created: {lock_info.get('Created')}")
        print(f"  Expires: {lock_info.get('Expires')}")

def main():
    import argparse
    
    parser = argparse.ArgumentParser(description='Terraform Lock Manager')
    parser.add_argument('--table', required=True, help='DynamoDB table name')
    parser.add_argument('--region', default='us-west-2', help='AWS region')
    
    subparsers = parser.add_subparsers(dest='command', help='Commands')
    
    # Acquire lock
    acquire_parser = subparsers.add_parser('acquire', help='Acquire a lock')
    acquire_parser.add_argument('--lock-id', required=True, help='Lock ID')
    acquire_parser.add_argument('--operation', required=True, help='Operation name')
    acquire_parser.add_argument('--who', required=True, help='User/system acquiring lock')
    acquire_parser.add_argument('--timeout', type=int, default=30, help='Timeout in minutes')
    
    # Release lock
    release_parser = subparsers.add_parser('release', help='Release a lock')
    release_parser.add_argument('--lock-id', required=True, help='Lock ID')
    release_parser.add_argument('--who', required=True, help='User/system releasing lock')
    
    # List locks
    subparsers.add_parser('list', help='List all locks')
    
    # Force unlock
    force_parser = subparsers.add_parser('force-unlock', help='Force unlock (admin)')
    force_parser.add_argument('--lock-id', required=True, help='Lock ID')
    force_parser.add_argument('--reason', required=True, help='Reason for force unlock')
    
    args = parser.parse_args()
    
    if not args.command:
        parser.print_help()
        sys.exit(1)
    
    lock_manager = TerraformLockManager(args.table, args.region)
    
    if args.command == 'acquire':
        success = lock_manager.acquire_lock(
            args.lock_id, args.operation, args.who, args.timeout
        )
        sys.exit(0 if success else 1)
    
    elif args.command == 'release':
        success = lock_manager.release_lock(args.lock_id, args.who)
        sys.exit(0 if success else 1)
    
    elif args.command == 'list':
        locks = lock_manager.list_locks()
        if locks:
            print(f"Active locks ({len(locks)}):")
            for lock in locks:
                print(f"\n{lock['LockID']}:")
                lock_manager._print_lock_info(lock)
        else:
            print("No active locks")
    
    elif args.command == 'force-unlock':
        success = lock_manager.force_unlock(args.lock_id, args.reason)
        sys.exit(0 if success else 1)

if __name__ == "__main__":
    main()
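
A lock that is acquired but never released blocks everyone until it expires, so it helps to tie the release to the end of the work, including the failure path. The sketch below wraps any object that offers acquire_lock and release_lock with the signatures used by TerraformLockManager in a context manager. held_lock, LockNotAcquired, and DemoManager are hypothetical names for this illustration, and DemoManager is only an in-memory stand-in so the example runs without AWS.

```python
from contextlib import contextmanager


class LockNotAcquired(RuntimeError):
    """Raised when the lock is already held by someone else."""


@contextmanager
def held_lock(manager, lock_id, operation, who, timeout_minutes=30):
    """Hold a lock for the duration of a with-block and always release it."""
    if not manager.acquire_lock(lock_id, operation, who, timeout_minutes):
        raise LockNotAcquired(f"could not acquire {lock_id}")
    try:
        yield
    finally:
        # Runs on success and on exceptions raised inside the with-block.
        manager.release_lock(lock_id, who)


class DemoManager:
    """In-memory stand-in with the same two methods (assumption, not the real class)."""

    def __init__(self):
        self.owners = {}

    def acquire_lock(self, lock_id, operation, who, timeout_minutes=30):
        if lock_id in self.owners:
            return False
        self.owners[lock_id] = who
        return True

    def release_lock(self, lock_id, who):
        if self.owners.get(lock_id) != who:
            return False
        del self.owners[lock_id]
        return True


if __name__ == "__main__":
    manager = DemoManager()
    try:
        with held_lock(manager, "env/prod", "apply", "alice"):
            raise RuntimeError("simulated failure during apply")
    except RuntimeError:
        pass
    print(f"locks still held after failure: {len(manager.owners)}")
```

With a real TerraformLockManager instance in place of DemoManager, the body of the with-block would be the Terraform invocation itself, for example a subprocess call that runs terraform apply.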

Workspace-Based Concurrency

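Each workspace has its own state file and its own lock, so operations in different workspaces never block each other. The script below derives a workspace name from the current Git branch and creates or removes that workspace: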

#!/bin/bash
# scripts/workspace-manager.sh

set -e

WORKSPACE_PREFIX=${1:-"feature"}
BRANCH_NAME=${2:-$(git branch --show-current)}
BASE_WORKSPACE=${3:-"default"}

# Generate workspace name from branch
WORKSPACE_NAME="${WORKSPACE_PREFIX}-${BRANCH_NAME//[^a-zA-Z0-9]/-}"

echo "Managing workspace: $WORKSPACE_NAME"

create_workspace() {
    echo "Creating workspace: $WORKSPACE_NAME"
    
    # Create new workspace
    terraform workspace new "$WORKSPACE_NAME" 2>/dev/null || {
        echo "Workspace already exists, selecting it..."
        terraform workspace select "$WORKSPACE_NAME"
    }
    
    # Copy state from base workspace if needed.
    # Caution: the copy tracks the SAME real resources as the base workspace,
    # so apply or destroy from the new workspace changes those resources.
    if [ "$BASE_WORKSPACE" != "default" ] && [ -n "$BASE_WORKSPACE" ]; then
        echo "Copying state from $BASE_WORKSPACE workspace..."
        
        # State can contain secrets, so use a private temp file instead of a
        # predictable path in /tmp
        BASE_STATE_FILE=$(mktemp)
        
        # Switch to base workspace and export state
        terraform workspace select "$BASE_WORKSPACE"
        terraform state pull > "$BASE_STATE_FILE"
        
        # Switch back and import state
        terraform workspace select "$WORKSPACE_NAME"
        
        # Only import if workspace is empty
        if [ "$(terraform state list | wc -l)" -eq 0 ]; then
            terraform state push "$BASE_STATE_FILE"
            echo "✅ State copied from $BASE_WORKSPACE"
        fi
        
        rm -f "$BASE_STATE_FILE"
    fi
    
    echo "✅ Workspace $WORKSPACE_NAME ready"
}

cleanup_workspace() {
    echo "Cleaning up workspace: $WORKSPACE_NAME"
    
    # Destroy resources in the workspace.
    # Caution: if this workspace's state was copied from a base workspace,
    # destroy removes the real resources that the base workspace also tracks.
    terraform workspace select "$WORKSPACE_NAME"
    echo "Destroying resources in workspace..."
    terraform destroy -auto-approve
    
    # Delete the workspace
    terraform workspace select default
    terraform workspace delete "$WORKSPACE_NAME"
    
    echo "✅ Workspace $WORKSPACE_NAME cleaned up"
}

case "${4:-create}" in
    "create")
        create_workspace
        ;;
    "cleanup")
        cleanup_workspace
        ;;
    *)
        echo "Usage: $0 <prefix> <branch> <base_workspace> [create|cleanup]"
        exit 1
        ;;
esac
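
The workspace name comes from a simple substitution: every character in the branch name that is not an ASCII letter or digit becomes a hyphen, and the prefix is prepended. The Python sketch below mirrors that rule so it can be checked in isolation. The function name workspace_name and the sample branch names are made up for the example.

```python
import re


def workspace_name(branch: str, prefix: str = "feature") -> str:
    """Mirror of "${WORKSPACE_PREFIX}-${BRANCH_NAME//[^a-zA-Z0-9]/-}"."""
    # Replace each character outside [a-zA-Z0-9] with a single hyphen.
    sanitized = re.sub(r"[^a-zA-Z0-9]", "-", branch)
    return f"{prefix}-{sanitized}"


if __name__ == "__main__":
    for branch in ["bugfix/JIRA-42_fix.v2", "a/b", "a-b"]:
        print(branch, "->", workspace_name(branch))
```

One consequence of this rule is that distinct branches can map to the same workspace: a/b and a-b both become feature-a-b. If branch names in your repository can differ only in punctuation, appending a short hash of the original name is one way to keep the workspaces apart.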

Lock Monitoring and Alerting

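Stuck or forgotten locks block everyone else, so it pays to watch for them. The monitor below relies on the top-level Created and Expires attributes written by the lock manager above. Terraform's native lock items keep that metadata inside a JSON string in the Info attribute, so they are counted in the totals but skipped by the age and expiry checks: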

#!/usr/bin/env python3
# scripts/lock_monitor.py

import boto3
import json
import time
from datetime import datetime, timedelta
from typing import List, Dict, Any

class LockMonitor:
    def __init__(self, table_name: str, region: str = "us-west-2"):
        self.dynamodb = boto3.resource('dynamodb', region_name=region)
        self.table = self.dynamodb.Table(table_name)
        self.sns = boto3.client('sns', region_name=region)
    
    def check_stale_locks(self, max_age_hours: int = 2) -> List[Dict[str, Any]]:
        """Find locks that have been held too long"""
        
        stale_locks = []
        cutoff_time = datetime.utcnow() - timedelta(hours=max_age_hours)
        
        try:
            response = self.table.scan()
            locks = response.get('Items', [])
            
            for lock in locks:
                created_str = lock.get('Created')
                if created_str:
                    try:
                        created = datetime.fromisoformat(created_str)
                        if created < cutoff_time:
                            stale_locks.append(lock)
                    except ValueError:
                        # Invalid date format, consider it stale
                        stale_locks.append(lock)
            
        except Exception as e:
            print(f"Error checking stale locks: {e}")
        
        return stale_locks
    
    def check_expired_locks(self) -> List[Dict[str, Any]]:
        """Find locks that have expired but not been cleaned up"""
        
        expired_locks = []
        now = datetime.utcnow()
        
        try:
            response = self.table.scan()
            locks = response.get('Items', [])
            
            for lock in locks:
                expires_str = lock.get('Expires')
                if expires_str:
                    try:
                        expires = datetime.fromisoformat(expires_str)
                        if now > expires:
                            expired_locks.append(lock)
                    except ValueError:
                        pass
            
        except Exception as e:
            print(f"Error checking expired locks: {e}")
        
        return expired_locks
    
    def check_lock_conflicts(self) -> List[Dict[str, Any]]:
        """Check for potential lock conflicts"""
        
        conflicts = []
        
        try:
            response = self.table.scan()
            locks = response.get('Items', [])
            
            # Group locks by the first path segment of their LockID
            lock_groups = {}
            for lock in locks:
                lock_id = lock.get('LockID', '')
                
                # Everything before the first '/' is the grouping key. Native
                # S3-backend lock IDs have the form <bucket>/<key>, so those
                # are grouped per bucket; treat the result as a hint only.
                base_path = lock_id.split('/')[0]
                
                if base_path not in lock_groups:
                    lock_groups[base_path] = []
                lock_groups[base_path].append(lock)
            
            # Check for multiple locks on similar resources
            for base_path, group_locks in lock_groups.items():
                if len(group_locks) > 1:
                    conflicts.append({
                        'base_path': base_path,
                        'locks': group_locks,
                        'count': len(group_locks)
                    })
        
        except Exception as e:
            print(f"Error checking lock conflicts: {e}")
        
        return conflicts
    
    def send_alert(self, topic_arn: str, subject: str, message: str):
        """Send SNS alert"""
        
        try:
            self.sns.publish(
                TopicArn=topic_arn,
                Subject=subject,
                Message=message
            )
            print(f"✅ Alert sent: {subject}")
        except Exception as e:
            print(f"❌ Failed to send alert: {e}")
    
    def generate_report(self) -> Dict[str, Any]:
        """Generate comprehensive lock status report"""
        
        stale_locks = self.check_stale_locks()
        expired_locks = self.check_expired_locks()
        conflicts = self.check_lock_conflicts()
        
        try:
            response = self.table.scan()
            total_locks = len(response.get('Items', []))
        except Exception:
            total_locks = 0
        
        report = {
            'timestamp': datetime.utcnow().isoformat(),
            'total_locks': total_locks,
            'stale_locks': len(stale_locks),
            'expired_locks': len(expired_locks),
            'conflicts': len(conflicts),
            'details': {
                'stale_locks': stale_locks,
                'expired_locks': expired_locks,
                'conflicts': conflicts
            }
        }
        
        return report
    
    def run_monitoring_cycle(self, alert_topic_arn: str = None):
        """Run a complete monitoring cycle"""
        
        print(f"🔍 Running lock monitoring cycle at {datetime.utcnow()}")
        
        report = self.generate_report()
        
        # Print summary
        print(f"Total locks: {report['total_locks']}")
        print(f"Stale locks: {report['stale_locks']}")
        print(f"Expired locks: {report['expired_locks']}")
        print(f"Conflicts: {report['conflicts']}")
        
        # Send alerts if configured
        if alert_topic_arn:
            alerts_sent = 0
            
            if report['stale_locks'] > 0:
                message = f"Found {report['stale_locks']} stale Terraform locks:\n\n"
                for lock in report['details']['stale_locks']:
                    message += f"- {lock['LockID']} (Owner: {lock.get('Who', 'Unknown')})\n"
                
                self.send_alert(alert_topic_arn, "Stale Terraform Locks Detected", message)
                alerts_sent += 1
            
            if report['expired_locks'] > 0:
                message = f"Found {report['expired_locks']} expired Terraform locks:\n\n"
                for lock in report['details']['expired_locks']:
                    message += f"- {lock['LockID']} (Expired: {lock.get('Expires', 'Unknown')})\n"
                
                self.send_alert(alert_topic_arn, "Expired Terraform Locks Found", message)
                alerts_sent += 1
            
            if report['conflicts'] > 0:
                message = f"Found {report['conflicts']} potential lock conflicts:\n\n"
                for conflict in report['details']['conflicts']:
                    message += f"- {conflict['base_path']} ({conflict['count']} locks)\n"
                
                self.send_alert(alert_topic_arn, "Terraform Lock Conflicts Detected", message)
                alerts_sent += 1
            
            print(f"📧 Sent {alerts_sent} alerts")
        
        return report

def main():
    import argparse
    
    parser = argparse.ArgumentParser(description='Terraform Lock Monitor')
    parser.add_argument('--table', required=True, help='DynamoDB table name')
    parser.add_argument('--region', default='us-west-2', help='AWS region')
    parser.add_argument('--alert-topic', help='SNS topic ARN for alerts')
    parser.add_argument('--max-age-hours', type=int, default=2, help='Max lock age in hours')
    parser.add_argument('--continuous', action='store_true', help='Run continuously')
    parser.add_argument('--interval', type=int, default=300, help='Check interval in seconds')
    
    args = parser.parse_args()
    
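    monitor = LockMonitor(args.table, args.region)
    
    # generate_report() calls check_stale_locks() without arguments, so bind the
    # --max-age-hours value here; otherwise the flag would have no effect.
    from functools import partial
    monitor.check_stale_locks = partial(monitor.check_stale_locks, args.max_age_hours)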
    
    if args.continuous:
        print(f"🔄 Starting continuous monitoring (interval: {args.interval}s)")
        while True:
            try:
                monitor.run_monitoring_cycle(args.alert_topic)
                time.sleep(args.interval)
            except KeyboardInterrupt:
                print("\n👋 Monitoring stopped")
                break
            except Exception as e:
                print(f"❌ Monitoring error: {e}")
                time.sleep(60)  # Wait before retrying
    else:
        report = monitor.run_monitoring_cycle(args.alert_topic)
        
        # Output report as JSON
        print("\n📊 Full Report:")
        print(json.dumps(report, indent=2, default=str))

if __name__ == "__main__":
    main()
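
Terraform's own lock items do not carry a top-level Created attribute. As far as the S3 backend's format goes, the metadata sits in the Info attribute as a JSON string, and its Created field is an RFC 3339 timestamp that may have nanosecond precision. The sketch below shows one way to compute the age of such an item. The item layout is an assumption based on that format, the sample values are invented, and parse_rfc3339 and native_lock_age are hypothetical helper names.

```python
import json
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

_RFC3339 = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)


def parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, truncating fractions beyond microseconds."""
    match = _RFC3339.match(value)
    if not match:
        raise ValueError(f"not an RFC 3339 timestamp: {value!r}")
    base, fraction, offset = match.groups()
    # Keep at most six fractional digits; pad shorter fractions with zeros.
    micros = int((fraction or "0")[:6].ljust(6, "0"))
    if offset == "Z":
        tz = timezone.utc
    else:
        sign = 1 if offset[0] == "+" else -1
        tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6])))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=tz)


def native_lock_age(item: Dict[str, Any], now: datetime) -> Optional[timedelta]:
    """Return the age of a native lock item, or None if it has no usable Info."""
    raw = item.get("Info")
    if not raw:
        return None  # for example a digest item, which carries no Info
    try:
        created = json.loads(raw).get("Created")
        return now - parse_rfc3339(created) if created else None
    except (ValueError, TypeError, AttributeError):
        return None


if __name__ == "__main__":
    # Invented sample item, shaped like a row read through boto3's Table resource.
    item = {
        "LockID": "company-terraform-state/infrastructure/terraform.tfstate",
        "Info": json.dumps({"Who": "alice@laptop", "Created": "2024-05-01T10:00:00.123456789Z"}),
    }
    now = datetime(2024, 5, 1, 13, 0, 0, tzinfo=timezone.utc)
    print("older than 2 hours:", native_lock_age(item, now) > timedelta(hours=2))
```

A check like this could complement check_stale_locks for items that lack a top-level Created attribute. Note that it works with timezone-aware datetimes, so the value passed as now must be timezone-aware too.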

Recovery Procedures

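For a single stuck lock, terraform force-unlock LOCK_ID is usually the first tool to reach for, because it removes exactly one lock. The script below covers broader table-level scenarios: backing up lock records, clearing the whole table in an emergency, restoring from a backup, and checking table health: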

#!/bin/bash
# scripts/lock-recovery.sh

set -e

DYNAMODB_TABLE=${1:-"terraform-locks"}
AWS_REGION=${2:-"us-west-2"}
BACKUP_DIR=${3:-"lock-backups"}

echo "Terraform Lock Recovery Utility"

backup_locks() {
    echo "Creating backup of all locks..."
    
    mkdir -p "$BACKUP_DIR"
    
    BACKUP_FILE="$BACKUP_DIR/locks-backup-$(date +%Y%m%d-%H%M%S).json"
    
    aws dynamodb scan \
        --table-name "$DYNAMODB_TABLE" \
        --region "$AWS_REGION" \
        --output json > "$BACKUP_FILE"
    
    echo "✅ Locks backed up to: $BACKUP_FILE"
}

force_unlock_all() {
    echo "⚠️  WARNING: This will force unlock ALL Terraform locks!"
    echo "This should only be used in emergency situations."
    read -p "Are you sure? Type 'FORCE_UNLOCK' to continue: " confirmation
    
    if [ "$confirmation" != "FORCE_UNLOCK" ]; then
        echo "Operation cancelled"
        exit 1
    fi
    
    # Backup first
    backup_locks
    
    # Get all lock IDs
    LOCK_IDS=$(aws dynamodb scan \
        --table-name "$DYNAMODB_TABLE" \
        --region "$AWS_REGION" \
        --projection-expression "LockID" \
        --output text \
        --query 'Items[*].LockID.S')
    
    if [ -z "$LOCK_IDS" ]; then
        echo "No locks found to remove"
        return
    fi
    
    # Delete each lock
    for lock_id in $LOCK_IDS; do
        echo "Removing lock: $lock_id"
        aws dynamodb delete-item \
            --table-name "$DYNAMODB_TABLE" \
            --region "$AWS_REGION" \
            --key "{\"LockID\":{\"S\":\"$lock_id\"}}"
    done
    
    echo "✅ All locks forcibly removed"
}

recover_from_backup() {
    # Called with "$@", so $4 is the command name and $5 is the backup file
    BACKUP_FILE=${5:-""}
    
    if [ -z "$BACKUP_FILE" ] || [ ! -f "$BACKUP_FILE" ]; then
        echo "❌ Backup file not found: $BACKUP_FILE"
        exit 1
    fi
    
    echo "Recovering locks from backup: $BACKUP_FILE"
    
    # Clear existing locks first
    echo "Clearing existing locks..."
    force_unlock_all
    
    # Restore from backup
    echo "Restoring locks from backup..."
    
    # Each element of .Items is already in DynamoDB JSON, so it can be passed
    # to put-item unchanged, one compact JSON object per line
    jq -c '.Items[]' "$BACKUP_FILE" | while read -r lock_item; do
        aws dynamodb put-item \
            --table-name "$DYNAMODB_TABLE" \
            --region "$AWS_REGION" \
            --item "$lock_item"
    done
    
    echo "✅ Locks restored from backup"
}

check_lock_health() {
    echo "Checking lock table health..."
    
    # Check table status
    TABLE_STATUS=$(aws dynamodb describe-table \
        --table-name "$DYNAMODB_TABLE" \
        --region "$AWS_REGION" \
        --query 'Table.TableStatus' \
        --output text)
    
    echo "Table status: $TABLE_STATUS"
    
    if [ "$TABLE_STATUS" != "ACTIVE" ]; then
        echo "❌ Table is not active"
        exit 1
    fi
    
    # Count locks
    LOCK_COUNT=$(aws dynamodb scan \
        --table-name "$DYNAMODB_TABLE" \
        --region "$AWS_REGION" \
        --select COUNT \
        --query 'Count' \
        --output text)
    
    echo "Active locks: $LOCK_COUNT"
    
    # Check for expired locks
    CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
    
    aws dynamodb scan \
        --table-name "$DYNAMODB_TABLE" \
        --region "$AWS_REGION" \
        --filter-expression "Expires < :current_time" \
        --expression-attribute-values "{\":current_time\":{\"S\":\"$CURRENT_TIME\"}}" \
        --query 'Items[*].{LockID:LockID.S,Expires:Expires.S,Who:Who.S}' \
        --output table
    
    echo "✅ Lock health check completed"
}

case "${4:-help}" in
    "backup")
        backup_locks
        ;;
    "force-unlock-all")
        force_unlock_all
        ;;
    "recover")
        recover_from_backup "$@"
        ;;
    "health-check")
        check_lock_health
        ;;
    *)
        echo "Usage: $0 <table> <region> <backup_dir> [backup|force-unlock-all|recover|health-check] [backup_file]"
        echo ""
        echo "Commands:"
        echo "  backup           - Create backup of all locks"
        echo "  force-unlock-all - Remove all locks (DANGEROUS)"
        echo "  recover          - Restore locks from backup file"
        echo "  health-check     - Check lock table health"
        exit 1
        ;;
esac
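
Before clearing or restoring a whole table, it helps to see what a backup actually contains. With the S3 backend, the lock table typically holds two kinds of items: lock records, and digest records whose LockID ends in -md5 and which store a checksum of the state file rather than a lock. The sketch below sorts the items of a scan backup into those groups. It assumes the file has the shape produced by aws dynamodb scan --output json; classify_backup is a hypothetical helper and the sample data is invented.

```python
import json


def classify_backup(backup: dict) -> dict:
    """Group the items of a DynamoDB scan backup by what they represent."""
    groups = {"locks": [], "digests": [], "unknown": []}
    for item in backup.get("Items", []):
        # Scan output uses DynamoDB JSON, so a string attribute is {"S": "..."}.
        lock_id = item.get("LockID", {}).get("S")
        if lock_id is None:
            groups["unknown"].append(item)  # keep the raw item for inspection
        elif lock_id.endswith("-md5"):
            groups["digests"].append(lock_id)
        else:
            groups["locks"].append(lock_id)
    return groups


if __name__ == "__main__":
    # Invented sample with the same shape as a scan backup file.
    sample = json.loads("""
    {"Items": [
        {"LockID": {"S": "company-terraform-state/infrastructure/terraform.tfstate"},
         "Info": {"S": "{}"}},
        {"LockID": {"S": "company-terraform-state/infrastructure/terraform.tfstate-md5"},
         "Digest": {"S": "0123456789abcdef"}}
    ], "Count": 2, "ScannedCount": 2}
    """)
    groups = classify_backup(sample)
    print({name: len(entries) for name, entries in groups.items()})
```

For a real backup, load the file with json.load and pass the result to classify_backup. A recovery tool built on this could delete only the entries under locks and leave the digest records alone, since they are not locks.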

What’s Next

Proper locking and concurrency control are essential for safe team collaboration with Terraform. These mechanisms prevent state corruption and ensure that infrastructure changes are applied consistently and safely.

In the next part, we’ll explore disaster recovery strategies that help you recover from state file corruption, accidental deletions, and other catastrophic scenarios that can occur despite the best preventive measures.