AWS Lambda VPC Configuration: The Complete Guide to Private Networking and Cost Optimization with NAT Instances

Three months ago, I got a $400 AWS bill that made me question everything I knew about serverless architecture. The culprit? NAT Gateways supporting Lambda functions that barely handled 10,000 requests per month. That’s when I learned the hard way that “serverless” doesn’t mean “cheap” when you need proper networking.

After weeks of deep-diving into VPC configurations, NAT alternatives, and cost optimization, I cut our Lambda networking costs by 85% while maintaining security and performance. Here’s everything I learned about running Lambda functions in VPCs the cost-effective way.

Why Lambda Functions Need VPCs (And Why It Gets Expensive)

Lambda functions run in AWS’s managed infrastructure by default. They can access the internet and public AWS services but can’t reach resources in your private VPC like RDS databases, ElastiCache clusters, or internal services.

The moment you need to access private resources, you have two options:

  1. Make your resources public (security nightmare)
  2. Put Lambda in a VPC (networking complexity + costs)

Here’s what happened to our costs:

Before VPC:
- Lambda execution: $12/month
- RDS (public): $45/month
- Total: $57/month

After VPC (naive approach):
- Lambda execution: $12/month
- RDS (private): $45/month
- NAT Gateway: $400/month (!!!)
- Total: $457/month

That 8x cost increase was a wake-up call.

Understanding VPC Networking for Lambda

When Lambda functions run in a VPC, they need to access AWS services and the internet through specific networking paths. Here’s the architecture that costs money:

Lambda (Private Subnet) → NAT Gateway → Internet Gateway → AWS Services
                        → Route Table → Private Resources (RDS, etc.)

The expensive part? NAT Gateways charge for data processing AND hourly uptime:

  • $45.60/month per NAT Gateway (24/7 uptime)
  • $0.045 per GB processed
  • Multi-AZ setup = multiple NAT Gateways

For a modest serverless application, this easily becomes your highest cost.
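To see how quickly this adds up, here's a back-of-envelope calculation using the prices quoted above (check current AWS pricing for your region; these numbers are illustrative):

```javascript
// Rough monthly cost comparison: managed NAT Gateways vs. a t3.nano NAT instance.
// Prices are the figures quoted in this article, not live AWS pricing.
const NAT_GW_MONTHLY = 45.60;      // per gateway, 24/7 uptime
const NAT_GW_PER_GB = 0.045;       // data-processing charge
const NAT_INSTANCE_MONTHLY = 3.80; // t3.nano on-demand

function natGatewayCost(gateways, gbProcessed) {
  return gateways * NAT_GW_MONTHLY + gbProcessed * NAT_GW_PER_GB;
}

// Multi-AZ setup pushing 500 GB/month through NAT:
console.log(natGatewayCost(2, 500).toFixed(2));   // 113.70
console.log(NAT_INSTANCE_MONTHLY.toFixed(2));     // 3.80
```

Even before data-processing charges, the fixed hourly cost of two gateways dwarfs a small instance.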

The Complete VPC Setup (Infrastructure as Code)

Let’s build a proper VPC configuration using Terraform. This setup provides secure networking for Lambda functions with internet access:

# vpc.tf
resource "aws_vpc" "lambda_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "lambda-vpc"
    Environment = "production"
  }
}

# Internet Gateway for public subnets
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.lambda_vpc.id

  tags = {
    Name = "lambda-igw"
  }
}

# Public subnets (for NAT Gateway/Instance)
resource "aws_subnet" "public" {
  count = 2
  
  vpc_id            = aws_vpc.lambda_vpc.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-${count.index + 1}"
    Type = "public"
  }
}

# Private subnets (for Lambda functions)
resource "aws_subnet" "private" {
  count = 2
  
  vpc_id            = aws_vpc.lambda_vpc.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "private-subnet-${count.index + 1}"
    Type = "private"
  }
}

# Database subnets (for RDS)
resource "aws_subnet" "database" {
  count = 2
  
  vpc_id            = aws_vpc.lambda_vpc.id
  cidr_block        = "10.0.${count.index + 20}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "database-subnet-${count.index + 1}"
    Type = "database"
  }
}

# Route table for public subnets
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.lambda_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "public-rt"
  }
}

# Associate public subnets with public route table
resource "aws_route_table_association" "public" {
  count = length(aws_subnet.public)
  
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

data "aws_availability_zones" "available" {
  state = "available"
}

The Expensive Way: NAT Gateway

The standard approach uses managed NAT Gateways:

# nat-gateway.tf (EXPENSIVE!)
resource "aws_eip" "nat" {
  count = 2
  domain = "vpc"
  
  depends_on = [aws_internet_gateway.main]

  tags = {
    Name = "nat-eip-${count.index + 1}"
  }
}

resource "aws_nat_gateway" "main" {
  count = 2
  
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = {
    Name = "nat-gateway-${count.index + 1}"
  }

  depends_on = [aws_internet_gateway.main]
}

# Private route tables (one per AZ for HA)
resource "aws_route_table" "private" {
  count = 2
  
  vpc_id = aws_vpc.lambda_vpc.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }

  tags = {
    Name = "private-rt-${count.index + 1}"
  }
}

# Associate private subnets with private route tables
resource "aws_route_table_association" "private" {
  count = length(aws_subnet.private)
  
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

Monthly cost for this setup:

  • 2 NAT Gateways: $91.20 (24/7 uptime)
  • Data processing: ~$50-200 depending on usage
  • Total: $140-290/month just for networking

The Cost-Effective Way: NAT Instance

Here’s how to replace expensive NAT Gateways with a single NAT instance:

# nat-instance.tf
# Security group for NAT instance
resource "aws_security_group" "nat_instance" {
  name_prefix = "nat-instance-"
  vpc_id      = aws_vpc.lambda_vpc.id

  # Allow inbound traffic from private subnets
  ingress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = [for subnet in aws_subnet.private : subnet.cidr_block]
  }

  ingress {
    from_port   = 0
    to_port     = 65535
    protocol    = "udp"
    cidr_blocks = [for subnet in aws_subnet.private : subnet.cidr_block]
  }

  # Allow SSH access (for management)
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Restrict this in production
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "nat-instance-sg"
  }
}

# Get latest Amazon Linux 2 AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# IAM role for NAT instance (for CloudWatch, SSM, etc.)
resource "aws_iam_role" "nat_instance" {
  name = "nat-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "nat_instance_ssm" {
  role       = aws_iam_role.nat_instance.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "nat_instance" {
  name = "nat-instance-profile"
  role = aws_iam_role.nat_instance.name
}

# NAT Instance
resource "aws_instance" "nat" {
  ami                    = data.aws_ami.amazon_linux.id
  instance_type          = "t3.nano"  # $3.80/month!
  key_name               = var.key_pair_name
  vpc_security_group_ids = [aws_security_group.nat_instance.id]
  subnet_id              = aws_subnet.public[0].id
  iam_instance_profile   = aws_iam_instance_profile.nat_instance.name

  # Disable source/destination check (required for NAT)
  source_dest_check = false

  user_data = base64encode(templatefile("${path.module}/nat-instance-setup.sh", {
    vpc_cidr = aws_vpc.lambda_vpc.cidr_block
  }))

  tags = {
    Name = "nat-instance"
    Purpose = "NAT for Lambda functions"
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Elastic IP for NAT instance
resource "aws_eip" "nat_instance" {
  instance = aws_instance.nat.id
  domain   = "vpc"

  tags = {
    Name = "nat-instance-eip"
  }
}

# Single route table for all private subnets
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.lambda_vpc.id

  route {
    cidr_block           = "0.0.0.0/0"
    # AWS provider 5.x removed `instance_id` on routes; target the ENI instead
    network_interface_id = aws_instance.nat.primary_network_interface_id
  }

  tags = {
    Name = "private-rt"
  }
}

# Associate all private subnets with the single route table
resource "aws_route_table_association" "private" {
  count = length(aws_subnet.private)
  
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

Here’s the NAT instance setup script:

#!/bin/bash
# nat-instance-setup.sh

# Update system
yum update -y

# Enable IP forwarding
echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf
sysctl -p

# Configure iptables for NAT (masquerade traffic arriving from the VPC)
iptables -t nat -A POSTROUTING -o eth0 -s ${vpc_cidr} -j MASQUERADE
iptables -A FORWARD -i eth0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -o eth0 -s ${vpc_cidr} -j ACCEPT

# Save iptables rules
iptables-save > /etc/sysconfig/iptables

# Install iptables-services to persist rules
yum install -y iptables-services
systemctl enable iptables
systemctl start iptables

# Install CloudWatch agent for monitoring
yum install -y amazon-cloudwatch-agent

# Configure automatic security updates
yum install -y yum-cron
systemctl enable yum-cron
systemctl start yum-cron

# Create a monitoring script
cat << 'EOF' > /usr/local/bin/nat-health-check.sh
#!/bin/bash
# Simple health check script
ping -c 3 8.8.8.8 > /dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "NAT instance healthy: $(date)" >> /var/log/nat-health.log
else
    echo "NAT instance unhealthy: $(date)" >> /var/log/nat-health.log
fi
EOF

chmod +x /usr/local/bin/nat-health-check.sh

# Add to crontab
echo "*/5 * * * * /usr/local/bin/nat-health-check.sh" | crontab -

# Log completion
echo "NAT instance setup completed: $(date)" >> /var/log/nat-setup.log

Lambda VPC Configuration

Now configure your Lambda functions to use the private subnets:

# lambda.tf
resource "aws_security_group" "lambda" {
  name_prefix = "lambda-"
  vpc_id      = aws_vpc.lambda_vpc.id

  # Allow outbound internet access (through NAT)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow access to RDS
  egress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = [for subnet in aws_subnet.database : subnet.cidr_block]
  }

  tags = {
    Name = "lambda-sg"
  }
}

resource "aws_lambda_function" "api" {
  filename      = "lambda.zip"
  function_name = "api-handler"
  role          = aws_iam_role.lambda.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  timeout       = 30

  vpc_config {
    subnet_ids         = aws_subnet.private[*].id
    security_group_ids = [aws_security_group.lambda.id]
  }

  environment {
    variables = {
      DATABASE_URL = aws_db_instance.main.endpoint
      REDIS_URL    = aws_elasticache_cluster.main.cache_nodes[0].address
    }
  }

  depends_on = [
    aws_iam_role_policy_attachment.lambda_vpc,
    aws_cloudwatch_log_group.lambda,
  ]
}

# Lambda execution role with VPC permissions
resource "aws_iam_role" "lambda" {
  name = "lambda-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_vpc" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

RDS and ElastiCache in Private Subnets

# rds.tf
resource "aws_db_subnet_group" "main" {
  name       = "main-db-subnet-group"
  subnet_ids = aws_subnet.database[*].id

  tags = {
    Name = "main-db-subnet-group"
  }
}

resource "aws_security_group" "rds" {
  name_prefix = "rds-"
  vpc_id      = aws_vpc.lambda_vpc.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda.id]
  }

  tags = {
    Name = "rds-sg"
  }
}

resource "aws_db_instance" "main" {
  identifier = "main-postgres"
  
  engine              = "postgres"
  engine_version      = "15.4"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  storage_encrypted   = true
  
  db_name  = "maindb"
  username = "dbadmin"
  password = var.db_password
  
  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name
  
  backup_retention_period = 7
  skip_final_snapshot    = true

  tags = {
    Name = "main-database"
  }
}

Cost Comparison: Real Numbers

Here’s what our actual costs look like:

NAT Gateway Approach (2 AZs):

Monthly Costs:
- NAT Gateway uptime (2 × $45.60): $91.20
- Data processing (500GB): $22.50
- Elastic IPs (2 × $3.65): $7.30
- Total: $121.00/month

NAT Instance Approach:

Monthly Costs:
- t3.nano instance: $3.80
- Data transfer: $0 (same AZ)
- Elastic IP: $3.65 (since 2024, public IPv4 is billed even while attached)
- Total: $7.45/month

Savings: $113.55/month (94% cost reduction)

NAT Instance High Availability

The single point of failure concern is real. Here’s how to address it:

# nat-ha.tf - Auto-recovering NAT instance
resource "aws_launch_template" "nat" {
  name_prefix   = "nat-instance-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.nano"
  key_name      = var.key_pair_name

  vpc_security_group_ids = [aws_security_group.nat_instance.id]
  
  iam_instance_profile {
    name = aws_iam_instance_profile.nat_instance.name
  }

  user_data = base64encode(templatefile("${path.module}/nat-instance-ha-setup.sh", {
    vpc_cidr = aws_vpc.lambda_vpc.cidr_block
    route_table_id = aws_route_table.private.id
    elastic_ip_id = aws_eip.nat_instance.id
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "nat-instance-ha"
    }
  }
}

resource "aws_autoscaling_group" "nat" {
  name                = "nat-instance-asg"
  vpc_zone_identifier = [aws_subnet.public[0].id]
  target_group_arns   = []
  health_check_type   = "EC2"
  health_check_grace_period = 300

  min_size         = 1
  max_size         = 1
  desired_capacity = 1

  launch_template {
    id      = aws_launch_template.nat.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "nat-instance-asg"
    propagate_at_launch = false
  }
}

Enhanced setup script with auto-recovery:

#!/bin/bash
# nat-instance-ha-setup.sh

# Basic NAT setup (same as before)
yum update -y
echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf
sysctl -p

# Install AWS CLI
yum install -y awscli

# Get instance metadata
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

# Disable source/destination check -- required for NAT, and launch
# templates can't set it, so it must happen at boot.
# NOTE: the instance role needs ec2:ModifyInstanceAttribute,
# ec2:AssociateAddress, ec2:ReplaceRoute and ec2:TerminateInstances
# in addition to the SSM policy for this script to work.
aws ec2 modify-instance-attribute \
  --instance-id $INSTANCE_ID \
  --no-source-dest-check \
  --region $REGION

# Auto-attach Elastic IP
aws ec2 associate-address \
  --instance-id $INSTANCE_ID \
  --allocation-id ${elastic_ip_id} \
  --region $REGION

# Update route table to point to this instance
aws ec2 replace-route \
  --route-table-id ${route_table_id} \
  --destination-cidr-block 0.0.0.0/0 \
  --instance-id $INSTANCE_ID \
  --region $REGION

# Configure iptables (same as before)
iptables -t nat -A POSTROUTING -o eth0 -s ${vpc_cidr} -j MASQUERADE
iptables -A FORWARD -i eth0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -o eth0 -s ${vpc_cidr} -j ACCEPT
iptables-save > /etc/sysconfig/iptables

yum install -y iptables-services
systemctl enable iptables
systemctl start iptables

# Health check with auto-recovery
cat << 'EOF' > /usr/local/bin/nat-health-monitor.sh
#!/bin/bash
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

# Check internet connectivity
ping -c 3 8.8.8.8 > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "Internet connectivity failed, terminating instance for ASG replacement"
    aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
fi
EOF

chmod +x /usr/local/bin/nat-health-monitor.sh
echo "*/2 * * * * /usr/local/bin/nat-health-monitor.sh" | crontab -

Performance Considerations

NAT instances can handle significant traffic, but keep in mind that t3 network performance is burstable; sustained throughput sits well below the advertised burst peaks. Rough sizing guidance:

t3.nano (1 vCPU, 0.5GB): 
- Good for: <1Gbps, dev/staging
- Lambda functions: <100 concurrent

t3.micro (1 vCPU, 1GB):
- Good for: 1-2Gbps, small production
- Lambda functions: 100-500 concurrent  

t3.small (2 vCPU, 2GB):
- Good for: 2-5Gbps, medium production
- Lambda functions: 500-1000 concurrent

t3.medium (2 vCPU, 4GB):
- Good for: 5-10Gbps, large production
- Lambda functions: >1000 concurrent
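If you automate instance selection, the guidance above can be encoded as a simple lookup (these thresholds are this article's heuristic, not AWS limits):

```javascript
// Map expected concurrent Lambda executions to a NAT instance size,
// following the rough sizing table above (a heuristic, not an AWS limit).
function suggestNatInstanceType(concurrentLambdas) {
  if (concurrentLambdas < 100) return 't3.nano';
  if (concurrentLambdas <= 500) return 't3.micro';
  if (concurrentLambdas <= 1000) return 't3.small';
  return 't3.medium';
}

console.log(suggestNatInstanceType(50));    // t3.nano
console.log(suggestNatInstanceType(1500));  // t3.medium
```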

Monitor these CloudWatch metrics:

# monitoring.tf
resource "aws_cloudwatch_metric_alarm" "nat_cpu" {
  alarm_name          = "nat-instance-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "300"
  statistic           = "Average"
  threshold           = "80"
  alarm_description   = "This metric monitors nat instance cpu utilization"

  dimensions = {
    InstanceId = aws_instance.nat.id
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "nat_network" {
  alarm_name          = "nat-instance-high-network"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "NetworkPacketsOut"
  namespace           = "AWS/EC2"
  period              = "300"
  statistic           = "Average"
  threshold           = "100000"
  alarm_description   = "This metric monitors nat instance network utilization"

  dimensions = {
    InstanceId = aws_instance.nat.id
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Lambda Cold Start Considerations

VPC Lambda functions historically paid a heavy cold start penalty for ENI (Elastic Network Interface) creation. Since AWS moved to shared Hyperplane ENIs in 2019, the ENIs are created once at configuration time and the per-invocation overhead is far smaller, but cold starts still matter for latency-sensitive endpoints. Here's how to minimize them:

// Provisioned concurrency for critical functions
// (requires `publish = true` on the function so a version exists)
resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.api.function_name
  provisioned_concurrent_executions = 5  // Keep 5 warm
  qualifier                         = aws_lambda_function.api.version
}

At the application level, create the connection pool outside the handler so warm invocations reuse it:

// Connection pooling for database connections (Node.js)
const { Pool } = require('pg');

// Create pool outside handler (reused across invocations)
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 1, // Important: Lambda can't share connections
  idleTimeoutMillis: 30000,
});

exports.handler = async (event) => {
  // Reuse connection pool
  const client = await pool.connect();
  
  try {
    const result = await client.query('SELECT * FROM users WHERE id = $1', [event.userId]);
    return {
      statusCode: 200,
      body: JSON.stringify(result.rows[0])
    };
  } finally {
    client.release(); // Return to pool
  }
};

Security Best Practices

# security.tf
# VPC Flow Logs for monitoring
resource "aws_flow_log" "vpc" {
  iam_role_arn    = aws_iam_role.flow_log.arn
  log_destination = aws_cloudwatch_log_group.vpc_flow_log.arn
  traffic_type    = "ALL"
  vpc_id          = aws_vpc.lambda_vpc.id
}

resource "aws_cloudwatch_log_group" "vpc_flow_log" {
  name              = "/aws/vpc/flowlogs"
  retention_in_days = 30
}

# Network ACLs for additional security
resource "aws_network_acl" "private" {
  vpc_id     = aws_vpc.lambda_vpc.id
  subnet_ids = aws_subnet.private[*].id

  # Allow return traffic for outbound connections. NACLs are stateless and a
  # NAT instance preserves remote source IPs, so responses arrive from
  # arbitrary internet (and VPC) addresses on ephemeral ports.
  ingress {
    protocol   = "tcp"
    rule_no    = 100
    action     = "allow"
    from_port  = 1024
    to_port    = 65535
    cidr_block = "0.0.0.0/0"
  }

  ingress {
    protocol   = "udp"
    rule_no    = 110
    action     = "allow"
    from_port  = 1024
    to_port    = 65535
    cidr_block = "0.0.0.0/0"
  }

  # Allow outbound to anywhere
  egress {
    protocol   = "-1"
    rule_no    = 100
    action     = "allow"
    cidr_block = "0.0.0.0/0"
  }

  tags = {
    Name = "private-nacl"
  }
}

# WAF for API Gateway (if using)
resource "aws_wafv2_web_acl" "api" {
  name  = "api-waf"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  # Rate limiting
  rule {
    name     = "RateLimitRule"
    priority = 1

    statement {
      rate_based_statement {
        limit              = 2000
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "RateLimitRule"
      sampled_requests_enabled   = true
    }

    action {
      block {}
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "apiWAF"
    sampled_requests_enabled   = true
  }
}

When NOT to Use NAT Instances

NAT instances aren’t always the right choice:

Avoid NAT instances when:

  • You need guaranteed 10Gbps+ throughput
  • Your Lambda functions process >1TB/month data
  • You have strict compliance requiring managed services
  • Your team lacks AWS networking expertise
  • You need multi-region failover

Stick with NAT Gateways when:

  • Cost isn’t a primary concern
  • You want AWS-managed infrastructure
  • You need maximum reliability and performance
  • Your workload justifies the cost

Cost Optimization Strategies

Beyond NAT instances, here are additional cost optimizations:

1. VPC Endpoints for AWS Services

# Avoid NAT charges for AWS service calls
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.lambda_vpc.id
  service_name = "com.amazonaws.${data.aws_region.current.name}.s3"
  
  route_table_ids = [aws_route_table.private.id]
  
  tags = {
    Name = "s3-vpc-endpoint"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.lambda_vpc.id
  service_name = "com.amazonaws.${data.aws_region.current.name}.dynamodb"
  
  route_table_ids = [aws_route_table.private.id]
  
  tags = {
    Name = "dynamodb-vpc-endpoint"
  }
}
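S3 and DynamoDB are gateway endpoints and free. Most other services use interface endpoints, which bill hourly per AZ, so they mainly pay off against NAT Gateway data-processing charges rather than a NAT instance. As an illustrative sketch (assuming your functions call Secrets Manager; the resource names here are hypothetical):

```hcl
# Security group for endpoint ENIs: accept HTTPS from the Lambda functions
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpce-"
  vpc_id      = aws_vpc.lambda_vpc.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda.id]
  }
}

# Interface endpoints bill per hour per AZ, unlike the free gateway
# endpoints above -- only add them when the savings justify the cost.
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.lambda_vpc.id
  service_name        = "com.amazonaws.${data.aws_region.current.name}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```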

2. Lambda Function Optimization

// Reduce network calls by batching
const batchSize = 25; // DynamoDB batch limit
const batches = [];

for (let i = 0; i < items.length; i += batchSize) {
  batches.push(items.slice(i, i + batchSize));
}

// Process batches in parallel
const results = await Promise.all(
  batches.map(batch => dynamodb.batchWrite({
    RequestItems: {
      [tableName]: batch.map(item => ({
        PutRequest: { Item: item }
      }))
    }
  }).promise())
);

3. Right-sizing Instances

# Monitor NAT instance usage
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-07-01T00:00:00Z \
  --end-time 2025-07-27T23:59:59Z \
  --period 3600 \
  --statistics Average,Maximum

# If avg CPU < 10% for a week, downsize to t3.nano
# If avg CPU > 80% for sustained periods, upsize

Deployment and Testing

Here’s a complete deployment script:

#!/bin/bash
# deploy-lambda-vpc.sh

set -e

echo "Deploying Lambda VPC infrastructure..."

# Validate Terraform
terraform validate

# Plan deployment
terraform plan -out=tfplan

# Apply with approval
terraform apply tfplan

# Test NAT instance
echo "Testing NAT instance connectivity..."
NAT_IP=$(terraform output -raw nat_instance_public_ip)

# SSH to NAT instance and test connectivity
ssh -i ~/.ssh/your-key.pem ec2-user@$NAT_IP << 'EOF'
  # Test internet connectivity
  ping -c 3 8.8.8.8
  
  # Test AWS service connectivity
  curl -s https://s3.amazonaws.com
  
  # Check iptables rules
  sudo iptables -t nat -L
EOF

# Deploy test Lambda function
echo "Deploying test Lambda function..."
zip lambda-test.zip test-function.js

aws lambda create-function \
  --function-name vpc-test \
  --runtime nodejs18.x \
  --role $(terraform output -raw lambda_role_arn) \
  --handler test-function.handler \
  --zip-file fileb://lambda-test.zip \
  --vpc-config SubnetIds=$(terraform output -raw private_subnet_ids),SecurityGroupIds=$(terraform output -raw lambda_security_group_id)

# Test Lambda function
echo "Testing Lambda function..."
aws lambda invoke \
  --function-name vpc-test \
  --payload '{"test": "data"}' \
  response.json

cat response.json
echo ""
echo "Deployment completed successfully!"
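The script reads several `terraform output` values that weren't shown earlier. A minimal outputs.tf to back it might look like this (note the subnet list is joined into a comma-separated string, because `terraform output -raw` refuses lists):

```hcl
# outputs.tf -- values consumed by deploy-lambda-vpc.sh
output "nat_instance_public_ip" {
  value = aws_eip.nat_instance.public_ip
}

output "lambda_role_arn" {
  value = aws_iam_role.lambda.arn
}

# Joined into a single string so `terraform output -raw` works
output "private_subnet_ids" {
  value = join(",", aws_subnet.private[*].id)
}

output "lambda_security_group_id" {
  value = aws_security_group.lambda.id
}
```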

Test Lambda function:

// test-function.js
const https = require('https');
const { Pool } = require('pg');

exports.handler = async (event) => {
  const results = {
    timestamp: new Date().toISOString(),
    tests: {}
  };
  
  // Test internet connectivity
  try {
    await new Promise((resolve, reject) => {
      https.get('https://httpbin.org/ip', (res) => {
        let data = '';
        res.on('data', chunk => data += chunk);
        res.on('end', () => {
          results.tests.internet = { success: true, ip: JSON.parse(data).origin };
          resolve();
        });
      }).on('error', reject);
    });
  } catch (error) {
    results.tests.internet = { success: false, error: error.message };
  }
  
  // Test database connectivity
  try {
    const pool = new Pool({
      connectionString: process.env.DATABASE_URL,
      max: 1,
    });
    
    const client = await pool.connect();
    const result = await client.query('SELECT version()');
    client.release();
    await pool.end();
    
    results.tests.database = { success: true, version: result.rows[0].version };
  } catch (error) {
    results.tests.database = { success: false, error: error.message };
  }
  
  return {
    statusCode: 200,
    body: JSON.stringify(results, null, 2)
  };
};

Final Thoughts

Moving from NAT Gateways to NAT instances saved us $1,400 annually while maintaining functionality. The key lessons:

  1. Understand your traffic patterns - Most Lambda workloads don’t need NAT Gateway throughput
  2. Monitor everything - Set up proper alerting for the NAT instance
  3. Start small - Begin with t3.nano and scale up if needed
  4. Use VPC endpoints - Eliminate NAT charges for AWS service calls
  5. Test thoroughly - Validate connectivity and performance before production

The “serverless” promise of Lambda is powerful, but VPC networking costs can quickly spiral out of control. With proper architecture and cost-conscious choices, you can have secure, private Lambda functions without breaking the bank.

Is the complexity worth it? For most production applications requiring database access, absolutely. The security benefits of private subnets combined with 85%+ cost savings make this approach a no-brainer.


Running Lambda functions in VPCs? I’d love to hear about your networking setup and cost optimizations. Find me on Twitter @TheLogicalDev.

All infrastructure code tested with Terraform 1.5+ and AWS Provider 5.0+. Costs calculated using us-east-1 pricing as of July 2025.