Your PostgreSQL queries were fast in development. Now in production on RDS, they crawl. Here’s how to find what’s really slowing them down and choose the right instance type.
I’ve debugged hundreds of slow PostgreSQL queries on RDS. The problem is rarely what you think it is, and throwing a bigger instance at it usually doesn’t help. Let me show you how to find the real bottlenecks and pick the right instance for your workload.
First Things First: Enable the Right Tools
Before you can fix slow queries, you need to see them. RDS gives you several tools, but they’re not all enabled by default.
1. Enable Performance Insights
Performance Insights is AWS’s best tool for database performance. It’s free for 7 days of data retention (enough for most debugging).
To enable it:
- Go to your RDS instance in AWS Console
- Click “Modify”
- Under “Performance Insights”, toggle it on
- Apply immediately
What you get:
- Top SQL queries by total time
- Wait events (what queries are waiting for)
- Database load over time
- Host metrics correlated with database metrics
2. Enable pg_stat_statements
This PostgreSQL extension tracks execution statistics of all SQL statements. It’s crucial for finding slow queries.
First, add it to your parameter group:
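In a custom DB parameter group (the default one can't be modified), set `shared_preload_libraries` to include `pg_stat_statements`. It's a static parameter, so the change only takes effect after a reboot:

```
shared_preload_libraries = 'pg_stat_statements'
```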
After the reboot, create the extension:
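Connect with a role that has sufficient privileges (the RDS master user works) and run:

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```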
Now you can query it:
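For example, this pulls the ten statements consuming the most total execution time (column names are for PostgreSQL 13+; on older versions use `total_time` and `mean_time` instead):

```sql
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```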
3. Configure Slow Query Logging
For queries that are exceptionally slow, enable slow query logging:
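The key parameter is `log_min_duration_statement`, in milliseconds; anything that runs longer gets logged. One second is a reasonable starting threshold:

```
log_min_duration_statement = 1000
```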
Access logs via CloudWatch Logs (make sure log exports are enabled for your RDS instance).
Finding the Real Culprits
Now that you have the tools, let’s find what’s actually slow.
Understanding Wait Events
Performance Insights shows wait events - what your queries are waiting for. Here are the common ones:
CPU - Query is actively using CPU
- Good: Query is actually doing work
- Bad: If consistently high, might need compute-optimized instance
IO:DataFileRead - Reading data from disk
- Usually means data isn’t in memory
- Solution: More memory or better indexes
Lock:transactionid - Waiting for another transaction
- Check for long-running transactions
- Look for lock conflicts
Client:ClientRead - Waiting for client to read data
- Application is slow consuming results
- Not a database problem
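You don't need Performance Insights to see wait events, by the way - `pg_stat_activity` exposes them for currently running backends:

```sql
-- Current wait events across active backends (PostgreSQL 9.6+)
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```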
Query Analysis Workflow
Here’s my workflow for analyzing a slow query:
- Get the query execution plan:
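Use the `ANALYZE` and `BUFFERS` options so you get actual row counts and buffer activity, not just the planner's estimates. The table and filter below are stand-ins for your own slow query:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_email = 'someone@example.com';
```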
- Look for these red flags in EXPLAIN output:
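Here's roughly what the output looks like for a query like the one above (the numbers are illustrative):

```
Seq Scan on orders  (cost=0.00..154872.10 rows=1 width=185) (actual time=4183.230..4183.231 rows=0 loops=1)
  Filter: (customer_email = 'someone@example.com'::text)
  Rows Removed by Filter: 5900000
  Buffers: shared hit=2048 read=81239
Planning Time: 0.110 ms
Execution Time: 4183.260 ms
```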
Problems here:
- Sequential scan on large table
- Removed 5.9M rows to find 0 results
- High buffer reads (data not in cache)
- Check table statistics:
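`pg_stat_user_tables` shows how often each table is sequentially scanned versus read through an index, and when autovacuum last analyzed it:

```sql
SELECT relname,
       seq_scan,
       idx_scan,
       n_live_tup,
       n_dead_tup,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;
```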
Common Performance Killers
1. Missing Indexes on Foreign Keys
PostgreSQL doesn't automatically index foreign key columns. This query finds foreign keys that have no supporting index:
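Here's a sketch that covers the common case of single-column foreign keys (multi-column constraints need a slightly more involved check):

```sql
-- Single-column foreign keys with no index led by the FK column
SELECT c.conrelid::regclass AS table_name,
       a.attname            AS fk_column,
       c.conname            AS constraint_name
FROM pg_constraint c
JOIN pg_attribute a
  ON a.attrelid = c.conrelid
 AND a.attnum   = c.conkey[1]
WHERE c.contype = 'f'
  AND array_length(c.conkey, 1) = 1          -- single-column FKs only
  AND NOT EXISTS (
        SELECT 1
        FROM pg_index i
        WHERE i.indrelid = c.conrelid
          AND i.indkey[0] = a.attnum         -- no index starts with this column
      )
ORDER BY 1;
```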
2. Inefficient Pagination
This is slow with large offsets:
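With `OFFSET`, PostgreSQL still reads and discards every skipped row before returning anything (the `events` table here is hypothetical):

```sql
SELECT *
FROM events
ORDER BY created_at DESC
LIMIT 20 OFFSET 100000;   -- scans past 100,000 rows on every request
```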
Do this instead:
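Keyset pagination filters on the last value from the previous page, so each page is a cheap index range scan. This assumes an index on `(created_at, id)`:

```sql
-- created_at and id come from the last row of the previous page
SELECT *
FROM events
WHERE (created_at, id) < ('2024-01-15 10:00:00', 123456)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```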
3. N+1 Queries in Disguise
Look for patterns like this in pg_stat_statements:
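The giveaway is a trivially simple statement with an enormous call count - something like `SELECT * FROM users WHERE id = $1` executed millions of times. Sorting by calls surfaces them (column names are for PostgreSQL 13+):

```sql
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time
FROM pg_stat_statements
WHERE calls > 10000
ORDER BY calls DESC
LIMIT 10;
```

Lots of calls with a tiny mean time usually means the application is looping over rows instead of fetching them in one query.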
4. JSON/JSONB Performance
JSONB operations can be slow without proper indexes:
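A containment query on a JSONB column falls back to checking every row unless there's a GIN index on it (table and column names here are hypothetical):

```sql
-- Slow without an index: every row's document gets examined
SELECT *
FROM orders
WHERE metadata @> '{"status": "shipped"}';

-- A GIN index makes containment (@>) and key-existence (?) lookups fast
CREATE INDEX idx_orders_metadata ON orders USING GIN (metadata);
```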
Choosing the Right RDS Instance Type
After optimizing queries, you might still need to pick a better instance. Here’s how to choose:
Memory-Optimized (r6g, r5) - Best For:
- Working set fits in memory: your shared_buffers plus the OS page cache can hold most of the data you actually touch
- High cache hit ratio needed: check with this query:
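This computes the buffer cache hit ratio across your user tables:

```sql
SELECT round(
         sum(heap_blks_hit)::numeric
         / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0),
         4
       ) AS cache_hit_ratio
FROM pg_statio_user_tables;
```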
If cache hit ratio < 0.99, you probably need more memory.
Real Example: E-commerce database with 50GB of hot data
- r6g.xlarge (32GB RAM) - Good performance
- m6g.2xlarge (32GB RAM) - Similar memory but more expensive
- Winner: r6g.xlarge
Compute-Optimized (c6g, c5) - Best For:
- Complex queries: Lots of JOINs, aggregations, calculations
- CPU-bound workloads: Check with:
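One quick check from inside the database: if most active backends report no wait event, they're executing on CPU rather than waiting on I/O or locks. Correlate that with the CPUUtilization metric in CloudWatch:

```sql
-- Active backends with no wait event are on CPU
SELECT coalesce(wait_event_type, 'CPU') AS waiting_on,
       count(*)                         AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1
ORDER BY 2 DESC;
```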
Real Example: Analytics workload with complex aggregations
- c6g.2xlarge - Handles complex queries well
- r6g.2xlarge - More memory didn’t help (data was already cached)
- Winner: c6g.2xlarge
General Purpose (m6g, m5) - Best For:
- Balanced workloads: Mix of reads/writes
- Unpredictable patterns: When you can’t optimize for one dimension
- Starting point: Good default before you understand your workload
Storage Considerations
Instance type isn’t everything. Storage type matters too:
- gp3: Best for most workloads (3,000 IOPS baseline, can be provisioned up to 16,000)
- io1/io2: When you need guaranteed, consistently high IOPS (expensive)
- gp2: Previous generation; migrate to gp3
Real-World Troubleshooting Example
Let me walk through a real case. Application was timing out, queries taking 30+ seconds.
Step 1: Check Performance Insights
- Top wait event: IO:DataFileRead (60%)
- Top SQL: Complex JOIN query
Step 2: Examine the query:
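The query joined orders against order_items; its shape was roughly this (the column names are stand-ins for the real ones):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.created_at, sum(oi.quantity * oi.unit_price) AS order_total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at > now() - interval '30 days'
GROUP BY o.id, o.created_at;
```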
Output showed:
- Nested Loop Join with 2M+ loops
- Sequential scan on order_items
- Buffer reads: 180,000+
Step 3: Found missing index:
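The join column on order_items had no index, so every one of those nested-loop iterations scanned the table. Creating the index concurrently avoids blocking writes (the column name is inferred from the join condition):

```sql
CREATE INDEX CONCURRENTLY idx_order_items_order_id
    ON order_items (order_id);
```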
Step 4: Results:
- Query time: 30 seconds → 200ms
- Wait events: IO:DataFileRead dropped to 5%
- No instance upgrade needed
Monitoring and Alerts
Set up these CloudWatch alarms:
- Database connections:
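CloudWatch exposes this as the `DatabaseConnections` metric. To know what 80% means for your instance, compare against `max_connections` from inside the database:

```sql
SELECT count(*)                                AS current_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```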
Alert when > 80% of max connections.
- Long-running queries:
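There's no built-in RDS metric for this, so a common approach is a periodic check against `pg_stat_activity`, published as a custom metric or alerted on directly:

```sql
SELECT pid,
       now() - query_start AS duration,
       state,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
```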
- Replication lag (if using read replicas):
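CloudWatch's `ReplicaLag` metric covers this; you can also measure it from inside the replica itself:

```sql
-- Run on the read replica
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```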
Quick Reference: Instance Selection
| Workload Type | First Choice | Alternative | Key Metric |
|---|---|---|---|
| High concurrent reads | r6g (memory) | r5 | Cache hit ratio > 99% |
| Complex analytics | c6g (compute) | c5n | CPU utilization |
| Mixed OLTP | m6g (general) | m5 | Balanced CPU/memory |
| Write-heavy | m6g + gp3 storage | r6g + io2 | IOPS requirements |
| Dev/Test | t4g (burstable) | t3 | Cost optimization |
Key Takeaways
- Enable monitoring first - You can’t fix what you can’t see
- Slow queries are rarely about instance size - Usually it’s missing indexes or bad queries
- Memory matters most - Data in RAM is 1000x faster than disk
- Test with production data - Dev performance means nothing
- Monitor after changes - Performance can degrade over time
The best instance type is the one that handles your specific workload efficiently. Start with general purpose, measure everything, then optimize based on data.
Found a slow query pattern I didn’t cover? Share it @TheLogicalDev - I’m always collecting new ways databases surprise us in production.