VPNs are not zero trust. Stop calling them that.

I can’t count how many times I’ve sat in architecture reviews where someone points at a Site-to-Site VPN or a Client VPN endpoint and says “we’re zero trust.” No. You’ve built a tunnel. A tunnel that, once you’re inside, gives you access to everything on the network. That’s the opposite of zero trust. That’s a castle with a drawbridge and nothing inside but open hallways.

I learned this the hard way. About two years ago I was called in to help with an incident at a mid-size fintech company. An attacker had compromised a developer’s laptop through a phishing email — nothing exotic, just a well-crafted credential harvester. The developer had VPN access. Once the attacker was on the VPN, they had network-level access to every subnet in a flat VPC. They moved laterally from a dev bastion to a staging database, found credentials in environment variables, pivoted to production RDS, and exfiltrated customer records. The whole thing took about four hours.

The VPC had no segmentation. No endpoint policies. Security groups were wide open between subnets because “the services need to talk to each other.” There was no service-level authentication — if you could reach the port, you were in.

That engagement changed how I think about cloud networking. Everything I’m going to walk through here comes from rebuilding that environment and several others since.


What Zero Trust Actually Means on AWS

Zero trust isn’t a product you buy. It’s a design principle: never trust, always verify. Every request — whether it comes from inside your VPC or outside — must be authenticated, authorized, and encrypted. There’s no implicit trust based on network location.

On AWS, this translates to a few concrete things:

  • Identity is the perimeter. IAM policies, not security groups, are your primary access control.
  • Network segmentation is defense in depth, not the primary control.
  • Every service-to-service call is authenticated. Not just “can I reach the port” but “are you allowed to call this specific API?”
  • Least privilege everywhere. Services get exactly the permissions they need and nothing more.
  • Continuous verification. Access isn’t granted once — it’s evaluated on every request.

This is fundamentally different from the traditional model where you build a perimeter, put everything inside it, and trust internal traffic. If you’ve read my piece on security in distributed systems, you’ll recognize the pattern — the network boundary is not your security boundary.


Start With VPC Architecture: Segmentation That Matters

The flat VPC is the enemy. I don’t care how good your security groups are — if every service lives in the same VPC with the same route tables, you’ve made lateral movement trivially easy.

Here’s how I structure VPCs now:

# Separate VPCs per trust boundary
resource "aws_vpc" "workload" {
  cidr_block           = "10.1.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "workload-vpc" }
}

resource "aws_vpc" "data" {
  cidr_block           = "10.2.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "data-vpc" }
}

Workloads and data stores live in separate VPCs. Services that need to talk to each other do so through explicit, authenticated channels — not through shared subnets. Within each VPC, I use private subnets exclusively. No public subnets unless there’s an absolute requirement for internet-facing resources, and even then it’s behind an ALB or CloudFront.
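To make "private subnets exclusively" concrete, here's roughly what the subnet and routing layout looks like — a sketch with illustrative names; note the route table deliberately has no 0.0.0.0/0 route to an internet gateway:

```hcl
data "aws_availability_zones" "available" {
  state = "available"
}

# Private subnets spread across AZs — CIDRs carved from the workload VPC
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.workload.id
  cidr_block        = cidrsubnet(aws_vpc.workload.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "workload-private-${count.index}" }
}

# Route table with no internet gateway route — local VPC traffic only.
# Paths to AWS services come from VPC endpoint associations, not 0.0.0.0/0.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.workload.id

  tags = { Name = "workload-private-rt" }
}

resource "aws_route_table_association" "private" {
  count          = 3
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```

If a workload genuinely needs outbound internet access, that's a deliberate, per-VPC decision — a NAT gateway added explicitly, not a default.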

The default security group gets locked down immediately:

resource "aws_default_security_group" "default" {
  vpc_id = aws_vpc.workload.id
  # No ingress or egress rules — intentionally empty
}

This is non-negotiable. The default security group in a new VPC allows all outbound traffic and all traffic between members of the group. That’s exactly the kind of implicit trust zero trust eliminates.


VPC Endpoints: Keeping AWS API Traffic Off the Internet

Here’s something that surprises people: by default, when your EC2 instance calls the S3 API, that traffic goes out through your NAT gateway to S3’s public endpoint. Yes, it’s TLS-encrypted, and SigV4 signing means your secret key itself never travels with the request — but your API traffic is still leaving your VPC and transiting public IP space. In a zero-trust model, we don’t want that traffic leaving the VPC at all.

VPC endpoints solve this. Gateway endpoints for S3 and DynamoDB, interface endpoints (powered by PrivateLink) for everything else.

# Gateway endpoint for S3 — no cost, no reason not to
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.workload.id
  service_name = "com.amazonaws.eu-west-1.s3"

  tags = { Name = "s3-gateway-endpoint" }
}

resource "aws_vpc_endpoint_route_table_association" "s3" {
  route_table_id  = aws_route_table.private.id
  vpc_endpoint_id = aws_vpc_endpoint.s3.id
}

# Interface endpoint for STS — critical for IAM role assumption
resource "aws_vpc_endpoint" "sts" {
  vpc_id              = aws_vpc.workload.id
  service_name        = "com.amazonaws.eu-west-1.sts"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = { Name = "sts-interface-endpoint" }
}

The critical part most people miss is the endpoint policy. Without a policy, your VPC endpoint allows any principal to access any resource through it. That’s not zero trust — that’s just a private path with no access control.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificBuckets",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-data-bucket",
        "arn:aws:s3:::my-app-data-bucket/*"
      ]
    }
  ]
}

Now even if someone compromises a workload in this VPC, they can only reach specific S3 buckets through the endpoint. Combine this with bucket policies that require aws:sourceVpce and you’ve got a tight loop — the bucket only accepts traffic from the endpoint, and the endpoint only allows traffic to that bucket.
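The bucket side of that loop looks something like this — a deny on any request that didn't arrive through the endpoint. The bucket name and endpoint ID are placeholders, and be aware this also blocks console access and any other path, which is the point — carve out break-glass admin roles before applying it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnlessThroughVpce",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-app-data-bucket",
        "arn:aws:s3:::my-app-data-bucket/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:sourceVpce": "vpce-0abc123def456"
        }
      }
    }
  ]
}
```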

I set up interface endpoints for every AWS service my workloads use: STS, KMS, Secrets Manager, CloudWatch Logs, ECR. Yes, interface endpoints cost money — about $7.50/month per AZ plus data processing. It’s worth it. The alternative is routing API calls through the internet, which is both a security risk and a latency hit.
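Rather than declaring each interface endpoint by hand, I define the service list once and loop over it — a sketch assuming the private subnets and endpoint security group shown earlier:

```hcl
locals {
  # One entry per AWS service the workloads actually call
  interface_services = ["sts", "kms", "secretsmanager", "logs", "ecr.api", "ecr.dkr"]
}

resource "aws_vpc_endpoint" "interface" {
  for_each = toset(local.interface_services)

  vpc_id              = aws_vpc.workload.id
  service_name        = "com.amazonaws.eu-west-1.${each.key}"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = { Name = "${each.key}-endpoint" }
}
```

Note that ECR needs both ecr.api and ecr.dkr, and image layer downloads still go through the S3 gateway endpoint — miss any of the three and pulls fail in confusing ways.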


PrivateLink: Explicit, One-Way Service Connections

PrivateLink is the backbone of zero-trust service communication on AWS. Instead of peering VPCs (which creates bidirectional network access) or using Transit Gateway with broad routing, PrivateLink creates a unidirectional, private connection to a specific service.

I use this pattern constantly: Service A needs to call Service B. Service B exposes itself through a Network Load Balancer. I create a VPC endpoint service in Service B’s VPC and an interface endpoint in Service A’s VPC.

# Service B's side — expose via NLB + endpoint service
resource "aws_vpc_endpoint_service" "service_b" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.service_b.arn]

  allowed_principals = [
    "arn:aws:iam::111111111111:root"  # Only Service A's account
  ]

  tags = { Name = "service-b-endpoint-service" }
}

# Service A's side — consume via interface endpoint
resource "aws_vpc_endpoint" "service_b" {
  vpc_id              = aws_vpc.service_a.id
  service_name        = aws_vpc_endpoint_service.service_b.service_name
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.service_b_endpoint.id]

  tags = { Name = "service-b-consumer-endpoint" }
}

The beauty here is that Service A can reach Service B, but Service B cannot initiate connections back to Service A. There’s no shared network. No route tables to manage. No transitive routing risks. If you’re connecting services across accounts — which you should be doing in a proper multi-account setup — this is the way.

For newer architectures, I’ve been moving toward VPC Lattice instead. Lattice gives you the same private connectivity but adds IAM-based auth policies directly on the service network. It’s PrivateLink with built-in identity verification — exactly what zero trust calls for.


IAM as the Real Perimeter

This is the part where zero trust on AWS diverges most sharply from traditional network security. In a zero-trust model, IAM isn’t just for human users accessing the console. It’s the authentication and authorization layer for every service, every API call, every data access.

Every workload gets its own IAM role with minimal permissions. Not a shared “application role” that can do everything. Not a role with * resources. A role that can do exactly what that specific service needs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:eu-west-1:111111111111:table/orders",
      "Condition": {
        "ForAllValues:StringEquals": {
          "dynamodb:LeadingKeys": ["${aws:PrincipalTag/tenant_id}"]
        }
      }
    }
  ]
}

That condition key is doing real work — it restricts DynamoDB access not just to a specific table, but to rows belonging to a specific tenant. This is attribute-based access control (ABAC), and it’s incredibly powerful for multi-tenant systems.
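For ${aws:PrincipalTag/tenant_id} to resolve, the principal actually has to carry that tag — either as a tag on the role itself or as a session tag passed at assume time. If you go the session-tag route, the trust policy has to permit tagging. A sketch, with an illustrative caller role name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/api-gateway-auth-role"
      },
      "Action": ["sts:AssumeRole", "sts:TagSession"],
      "Condition": {
        "StringLike": {
          "aws:RequestTag/tenant_id": "*"
        }
      }
    }
  ]
}
```

The caller sets tenant_id when it assumes the role, and every downstream permission check evaluates against that value — the tenant boundary travels with the credentials.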

For cross-service authentication, I use STS extensively. Service A assumes a role in Service B’s account to make API calls. The trust policy on Service B’s role specifies exactly which role from Service A can assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/service-a-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "cross-account-service-b"
        }
      }
    }
  ]
}

At the organization level, Service Control Policies (SCPs) and Resource Control Policies (RCPs) form the outer boundary. I use SCPs to enforce that all data access must come from expected networks:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessFromOutsideVPC",
      "Effect": "Deny",
      "Action": ["s3:*", "dynamodb:*", "sqs:*"],
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:SourceVpc": ["vpc-workload123", "vpc-data456"]
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

This creates what AWS calls a “data perimeter” — even if someone has valid credentials, they can’t access resources unless the request originates from an approved VPC. The aws:ViaAWSService condition is important because some AWS services make calls on your behalf (like S3 replication), and you don’t want to break those.


VPC Lattice: The Service Mesh That Gets It Right

I’ve used App Mesh. I’ve used Istio on EKS. They work, but the operational overhead is real — sidecar proxies, control planes, certificate management, the whole circus. VPC Lattice takes a different approach: it’s a managed service network that handles connectivity, authentication, and authorization without any proxies in your workloads.

Here’s what a Lattice service network looks like:

resource "aws_vpclattice_service_network" "main" {
  name      = "zero-trust-network"
  auth_type = "AWS_IAM"

  tags = { Environment = "production" }
}

resource "aws_vpclattice_service" "orders_api" {
  name      = "orders-api"
  auth_type = "AWS_IAM"

  tags = { Service = "orders" }
}

resource "aws_vpclattice_service_network_service_association" "orders" {
  service_identifier         = aws_vpclattice_service.orders_api.id
  service_network_identifier = aws_vpclattice_service_network.main.id
}

The auth_type = "AWS_IAM" is the key. Every request to this service must be SigV4-signed. No valid IAM credentials, no access — regardless of network connectivity. You then attach auth policies written in standard IAM policy JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/payments-service-role"
      },
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "arn:aws:vpc-lattice:eu-west-1:222222222222:service/svc-orders/*",
      "Condition": {
        "StringEquals": {
          "vpc-lattice-svcs:RequestMethod": "GET",
          "vpc-lattice-svcs:SourceVpc": "vpc-workload123"
        }
      }
    }
  ]
}

This policy says: the payments service can invoke GET requests on the orders API, but only from a specific VPC. That’s network location AND identity AND method-level authorization in a single policy. Try doing that with security groups.

The monitoring story is solid too. Lattice access logs give you the source VPC, the authenticated principal, the request path, and the response code for every single request. Pipe those into CloudWatch Logs and you’ve got an audit trail that would make your compliance team weep with joy.
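Turning those access logs on is a single resource — a sketch assuming the service network defined above, with an illustrative log group name:

```hcl
resource "aws_cloudwatch_log_group" "lattice_access" {
  name              = "/vpclattice/zero-trust-network"
  retention_in_days = 90
}

# Every request through the service network gets logged here —
# source VPC, authenticated principal, path, and response code
resource "aws_vpclattice_access_log_subscription" "main" {
  resource_identifier = aws_vpclattice_service_network.main.id
  destination_arn     = aws_cloudwatch_log_group.lattice_access.arn
}
```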


Secrets Management: Nothing in Environment Variables

Remember that fintech breach I mentioned? The attacker found database credentials in environment variables. Plain text. Sitting right there in the ECS task definition. This is depressingly common.

In a zero-trust architecture, secrets management isn’t optional — it’s foundational. Every credential, every API key, every certificate lives in Secrets Manager or Parameter Store, accessed through IAM roles with VPC endpoint policies controlling the path.

resource "aws_secretsmanager_secret" "db_credentials" {
  name       = "production/orders-db"
  kms_key_id = aws_kms_key.secrets.arn
}

resource "aws_secretsmanager_secret_policy" "db_credentials" {
  secret_arn = aws_secretsmanager_secret.db_credentials.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Deny"
      Principal = "*"
      Action    = "secretsmanager:GetSecretValue"
      Resource  = "*"
      Condition = {
        StringNotEquals = {
          "aws:sourceVpce" = aws_vpc_endpoint.secretsmanager.id
        }
      }
    }]
  })
}

That policy denies secret retrieval unless the request comes through the Secrets Manager VPC endpoint. Even if someone exfiltrates an IAM role’s temporary credentials, they can’t use them from outside the VPC to grab secrets.

For database access specifically, I’ve moved entirely to IAM database authentication where possible. RDS supports it for MySQL and PostgreSQL. No passwords to rotate, no credentials to store — the application uses its IAM role to generate a short-lived authentication token:

aws rds generate-db-auth-token \
  --hostname mydb.cluster-abc123.eu-west-1.rds.amazonaws.com \
  --port 3306 \
  --username app_user \
  --region eu-west-1

The token is valid for 15 minutes. Even if it’s intercepted, the window is tiny. Combine this with security groups that only allow database connections from specific application subnets and you’ve got defense in depth that actually means something.
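That last layer is worth spelling out: the database's security group admits connections only from the application subnets, nothing else. A sketch with illustrative names and CIDRs:

```hcl
resource "aws_security_group" "orders_db" {
  name   = "orders-db"
  vpc_id = aws_vpc.data.id

  # MySQL only, and only from the orders application's subnets —
  # no 0.0.0.0/0, no VPC-wide CIDR
  ingress {
    from_port   = 3306
    to_port     = 3306
    protocol    = "tcp"
    cidr_blocks = ["10.1.10.0/24", "10.1.11.0/24"]
  }
}
```

No egress rules, either — a database has no business initiating outbound connections.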


AWS Verified Access: Zero Trust for Human Users

For human access to internal applications, AWS Verified Access replaces the VPN entirely. It evaluates every request against policies that consider user identity, device posture, and context — not just “are you on the network.”

I won’t pretend the setup is trivial, but the security model is right. You define trust providers (your IdP, your device management platform), create access groups with policies, and point endpoints at your internal applications. Users hit a public URL, get authenticated, and Verified Access proxies the request to your private application. No VPN. No network-level access.
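The Terraform skeleton looks roughly like this — a sketch assuming IAM Identity Center as the identity trust provider, with illustrative names. Note the policy_reference_name is what the Cedar policies refer to as context.identity:

```hcl
resource "aws_verifiedaccess_instance" "main" {
  description = "zero-trust human access"
}

resource "aws_verifiedaccess_trust_provider" "idc" {
  policy_reference_name    = "identity"
  trust_provider_type      = "user"
  user_trust_provider_type = "iam-identity-center"
}

resource "aws_verifiedaccess_instance_trust_provider_attachment" "idc" {
  verifiedaccess_instance_id       = aws_verifiedaccess_instance.main.id
  verifiedaccess_trust_provider_id = aws_verifiedaccess_trust_provider.idc.id
}

# Cedar policy document lives alongside the Terraform
resource "aws_verifiedaccess_group" "engineering" {
  verifiedaccess_instance_id = aws_verifiedaccess_instance.main.id
  policy_document            = file("${path.module}/engineering-access.cedar")
}
```

Endpoints then attach to the group, one per internal application.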

The policies use Cedar and can get granular:

permit(principal, action, resource)
when {
    context.http_request.http_method == "GET" &&
    context.identity.groups.contains("engineering") &&
    context.identity.email.address like "*@company.com"
};

This is what zero trust looks like for human access. Not “connect to the VPN and you’re in.” Instead: “prove who you are, prove your device is compliant, and we’ll give you access to exactly the application you need.”


Monitoring: Trust But Verify (Actually, Don’t Trust At All)

Zero trust without monitoring is just a nice architecture diagram. You need to see every request, every denied access attempt, every anomalous pattern.

Here’s my baseline monitoring stack:

# Enable VPC Flow Logs — capture ALL traffic, not just accepted
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-workload123 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs/workload \
  --deliver-logs-permission-arn arn:aws:iam::111111111111:role/flow-logs-role

# Enable CloudTrail for API-level auditing
aws cloudtrail create-trail \
  --name zero-trust-audit \
  --s3-bucket-name audit-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation

VPC Flow Logs tell you who’s talking to whom at the network level. CloudTrail tells you who’s calling which APIs. Together, they give you the full picture.

GuardDuty is non-negotiable. It catches things like unusual API calls, cryptocurrency mining, DNS exfiltration, and credential compromise. Turn it on in every account, every region:

aws guardduty create-detector --enable --finding-publishing-frequency FIFTEEN_MINUTES

For API-level security, I set up CloudWatch alarms on specific patterns: failed authentication attempts, access denied errors on sensitive resources, unusual cross-account role assumptions. The goal isn’t to prevent every attack — it’s to detect and respond fast enough that the blast radius stays small.
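A concrete example of that alerting: a metric filter over the CloudTrail log group counting access-denied errors, with an alarm on top. This assumes the trail also delivers to a CloudWatch Logs group (the CLI above only configured S3); the log group name, threshold, and SNS topic are illustrative:

```hcl
resource "aws_cloudwatch_log_metric_filter" "access_denied" {
  name           = "access-denied-count"
  log_group_name = "/cloudtrail/zero-trust-audit"
  pattern        = "{ ($.errorCode = \"*UnauthorizedOperation\") || ($.errorCode = \"AccessDenied*\") }"

  metric_transformation {
    name      = "AccessDeniedCount"
    namespace = "ZeroTrust"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "access_denied" {
  alarm_name          = "access-denied-spike"
  namespace           = "ZeroTrust"
  metric_name         = "AccessDeniedCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
}
```

A burst of denied requests is usually the first visible sign that stolen credentials are being probed against your guardrails — exactly the signal you want within minutes, not in next quarter's audit.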


Putting It All Together

Here’s what the full zero-trust stack looks like in practice:

  1. Separate VPCs per trust boundary. Workloads, data, and shared services each get their own VPC.
  2. VPC endpoints for all AWS service access. Gateway endpoints for S3/DynamoDB, interface endpoints for everything else. Endpoint policies restrict access to specific resources.
  3. PrivateLink or VPC Lattice for service-to-service communication. No VPC peering. No broad Transit Gateway routing.
  4. IAM everywhere. Every workload has its own role. ABAC for fine-grained access. SCPs for organizational guardrails.
  5. Secrets Manager with VPC endpoint policies. IAM database auth where possible. No credentials in environment variables. Ever.
  6. Verified Access for human users. Kill the VPN.
  7. Flow Logs + CloudTrail + GuardDuty as the monitoring baseline. Alert on anomalies, not just failures.

None of this is theoretical. I’ve built this stack multiple times, and each time the hardest part isn’t the technology — it’s convincing teams to give up the convenience of flat networks and shared credentials. “But it’s all internal traffic” is the most dangerous sentence in cloud security.

The fintech company I mentioned at the start? After we rebuilt their environment with this architecture, they passed their SOC 2 audit for the first time. More importantly, when a similar phishing attempt succeeded six months later, the attacker got onto a developer’s machine and… couldn’t go anywhere. No VPN to pivot through. No flat network to traverse. The compromised credentials couldn’t reach any AWS APIs from outside the VPC. The blast radius was one laptop.

That’s what zero trust buys you. Not invulnerability — there’s no such thing. But containment. When something goes wrong (and it will), the damage stays small and the detection is fast.

Stop building castles. Start building checkpoints.