Your PostgreSQL queries were fast in development. Now in production on RDS, they crawl. Here’s how to find what’s really slowing them down and choose the right instance type.

I’ve debugged hundreds of slow PostgreSQL queries on RDS. The problem is rarely what you think it is, and throwing a bigger instance at it usually doesn’t help. Let me show you how to find the real bottlenecks and pick the right instance for your workload.

First Things First: Enable the Right Tools

Before you can fix slow queries, you need to see them. RDS gives you several tools, but they’re not all enabled by default.

1. Enable Performance Insights

Performance Insights is AWS’s best tool for database performance. It’s free for 7 days of data retention (enough for most debugging).

To enable it:

  1. Go to your RDS instance in AWS Console
  2. Click “Modify”
  3. Under “Performance Insights”, toggle it on
  4. Apply immediately

What you get:

  • Top SQL queries by total time
  • Wait events (what queries are waiting for)
  • Database load over time
  • Host metrics correlated with database metrics

2. Enable pg_stat_statements

This PostgreSQL extension tracks execution statistics of all SQL statements. It’s crucial for finding slow queries.

First, add it to your parameter group:

-- Check if it's already enabled
SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements';

-- If not loaded, you need to modify your RDS parameter group
-- Add 'pg_stat_statements' to shared_preload_libraries parameter
-- Then reboot your instance (yes, this requires downtime)

After the reboot, create the extension:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Now you can query it:

-- Top 10 slowest queries by total time
SELECT 
    round(total_exec_time::numeric, 2) AS total_time_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_time_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total_time,
    regexp_replace(query, '\s+', ' ', 'g') AS query_text
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
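One habit worth building: pg_stat_statements counters are cumulative, so after you deploy a fix the numbers stay dominated by history. Resetting gives you a clean baseline (on RDS the master user can typically run this):

```sql
-- Clear accumulated statistics so post-fix measurements start fresh
SELECT pg_stat_statements_reset();
```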

3. Configure Slow Query Logging

For queries that are exceptionally slow, enable slow query logging:

-- Check current setting
SHOW log_min_duration_statement;

-- Set to log queries taking more than 1 second.
-- ALTER SYSTEM isn't permitted on RDS; instead set
-- log_min_duration_statement = 1000 in your DB parameter group.
-- It's a dynamic parameter, so the change applies without a reboot.

Access logs via CloudWatch Logs (make sure log exports are enabled for your RDS instance).

Finding the Real Culprits

Now that you have the tools, let’s find what’s actually slow.

Understanding Wait Events

Performance Insights shows wait events - what your queries are waiting for. Here are the common ones:

CPU - Query is actively using CPU

  • Good: Query is actually doing work
  • Bad: If consistently high, might need compute-optimized instance

IO:DataFileRead - Reading data from disk

  • Usually means data isn’t in memory
  • Solution: More memory or better indexes

Lock:transactionid - Waiting for another transaction

  • Check for long-running transactions
  • Look for lock conflicts

Client:ClientRead - Waiting for client to read data

  • Application is slow consuming results
  • Not a database problem
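When Lock:transactionid dominates, you can see exactly who is blocking whom with pg_blocking_pids() (PostgreSQL 9.6+). A quick sketch:

```sql
-- Pair each blocked session with the session(s) holding the lock it wants
SELECT
    blocked.pid    AS blocked_pid,
    blocked.query  AS blocked_query,
    blocking.pid   AS blocking_pid,
    blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```

If the blocking query is idle in transaction, the fix is usually in the application: a transaction opened and never committed.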

Query Analysis Workflow

Here’s my workflow for analyzing a slow query:

  1. Get the query execution plan:
-- First, get the actual query from pg_stat_statements
SELECT query FROM pg_stat_statements 
WHERE query LIKE '%your_table%' 
ORDER BY mean_exec_time DESC LIMIT 5;

-- Then explain it
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM orders 
WHERE customer_id = 12345 
AND created_at > '2025-01-01';

  2. Look for these red flags in EXPLAIN output:
Seq Scan on large_table  (cost=0.00..195842.47 rows=3 width=64) (actual time=1543.201..1543.201 rows=0 loops=1)
   Filter: (status = 'active'::text)
   Rows Removed by Filter: 5942857
   Buffers: shared hit=29558 read=85373

Problems here:

  • Sequential scan on large table
  • Removed 5.9M rows to find 0 results
  • High buffer reads (data not in cache)

  3. Check table statistics:
-- When were statistics last updated?
SELECT 
    schemaname,
    tablename,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE tablename = 'your_table';

-- Check if statistics are way off
SELECT 
    tablename,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2) AS dead_percentage
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_percentage DESC;
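If last_autoanalyze is old or dead_percentage is high, the planner is working from stale numbers. You can refresh them immediately instead of waiting for autovacuum (your_table is a placeholder, as above):

```sql
-- Update planner statistics only; cheap and usually enough
ANALYZE your_table;

-- Or also reclaim dead tuples in the same pass
VACUUM (ANALYZE) your_table;
```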

Common Performance Killers

1. Missing Indexes on Foreign Keys

PostgreSQL doesn’t automatically index foreign keys. This query finds them:

SELECT
    tc.table_schema, 
    tc.table_name, 
    kcu.column_name,
    'CREATE INDEX idx_' || tc.table_name || '_' || kcu.column_name || 
    ' ON ' || tc.table_schema || '.' || tc.table_name || ' (' || kcu.column_name || ');' AS index_ddl
FROM 
    information_schema.table_constraints AS tc 
    JOIN information_schema.key_column_usage AS kcu
      ON tc.constraint_name = kcu.constraint_name
      AND tc.table_schema = kcu.table_schema
    JOIN information_schema.constraint_column_usage AS ccu
      ON ccu.constraint_name = tc.constraint_name
      AND ccu.table_schema = tc.table_schema
WHERE tc.constraint_type = 'FOREIGN KEY'
AND NOT EXISTS (
    SELECT 1 
    FROM pg_indexes 
    WHERE schemaname = tc.table_schema 
    AND tablename = tc.table_name 
    AND indexdef LIKE '%' || kcu.column_name || '%'
);
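One caveat before running the generated DDL: a plain CREATE INDEX takes a lock that blocks writes for the duration of the build. On a busy production table, use the CONCURRENTLY variant (table and column names here are illustrative):

```sql
-- Builds without blocking writes; can't run inside a transaction block
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```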

2. Inefficient Pagination

This is slow with large offsets:

-- DON'T DO THIS
SELECT * FROM posts 
ORDER BY created_at DESC 
LIMIT 20 OFFSET 10000;  -- Scans 10,020 rows!

Do this instead:

-- DO THIS - Keyset pagination
SELECT * FROM posts 
WHERE created_at < '2025-01-15 10:00:00'  -- Last seen timestamp
ORDER BY created_at DESC 
LIMIT 20;
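One wrinkle: if created_at isn't unique, rows sharing a timestamp can be skipped or repeated across pages. Adding the primary key as a tie-breaker fixes that, and PostgreSQL's row comparison keeps it concise (98765 stands in for the id of the last row you returned):

```sql
-- Keyset pagination with a tie-breaker; wants an index on (created_at, id)
SELECT * FROM posts
WHERE (created_at, id) < ('2025-01-15 10:00:00', 98765)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```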

3. N+1 Queries in Disguise

Look for patterns like this in pg_stat_statements:

-- If you see this pattern repeated hundreds of times:
SELECT * FROM users WHERE id = $1;
SELECT * FROM users WHERE id = $2;
SELECT * FROM users WHERE id = $3;

-- Replace with:
SELECT * FROM users WHERE id = ANY(ARRAY[1, 2, 3, ...]);

4. JSON/JSONB Performance

JSONB operations can be slow without proper indexes:

-- Slow without index
SELECT * FROM events 
WHERE data->>'user_id' = '12345';

-- Create a functional index
CREATE INDEX idx_events_user_id ON events ((data->>'user_id'));

-- Or use a GIN index for multiple fields
CREATE INDEX idx_events_data ON events USING gin (data);
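If you only ever query with the @> containment operator, the jsonb_path_ops operator class gives a smaller, faster GIN index than the default:

```sql
-- Smaller GIN index; supports @> but not the ? existence operators
CREATE INDEX idx_events_data_path ON events USING gin (data jsonb_path_ops);

-- Containment queries like this one can use it
SELECT * FROM events WHERE data @> '{"user_id": "12345"}';
```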

Choosing the Right RDS Instance Type

After optimizing queries, you might still need to pick a better instance. Here’s how to choose:

Memory-Optimized (r6g, r5) - Best For:

  • Working set fits in memory: If your shared_buffers + OS cache can hold most accessed data
  • High cache hit ratio needed: Check with this query:
SELECT 
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit)  as heap_hit,
    round(sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0)::numeric, 4) as cache_hit_ratio
FROM pg_statio_user_tables;

If cache hit ratio < 0.99, you probably need more memory.
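Index lookups have their own cache hit ratio worth checking alongside the heap one:

```sql
-- Index cache hit ratio; should also be close to 1.0
SELECT
    sum(idx_blks_read) AS idx_read,
    sum(idx_blks_hit)  AS idx_hit,
    round(sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit) + sum(idx_blks_read), 0)::numeric, 4) AS idx_hit_ratio
FROM pg_statio_user_indexes;
```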

Real Example: E-commerce database with 50GB of hot data

  • r6g.xlarge (32GB RAM) - Good performance
  • m6g.2xlarge (32GB RAM) - Similar memory but more expensive
  • Winner: r6g.xlarge

Compute-Optimized (c6g, c5) - Best For:

  • Complex queries: Lots of JOINs, aggregations, calculations
  • CPU-bound workloads: Check with:
-- In Performance Insights, if CPU waits > 80% consistently
-- Or check top wait events
SELECT wait_event_type, wait_event, count(*) 
FROM pg_stat_activity 
WHERE state = 'active' 
GROUP BY wait_event_type, wait_event 
ORDER BY count(*) DESC;

Real Example: Analytics workload with complex aggregations

  • c6g.2xlarge - Handles complex queries well
  • r6g.2xlarge - More memory didn’t help (data was already cached)
  • Winner: c6g.2xlarge

General Purpose (m6g, m5) - Best For:

  • Balanced workloads: Mix of reads/writes
  • Unpredictable patterns: When you can’t optimize for one dimension
  • Starting point: Good default before you understand your workload

Storage Considerations

Instance type isn’t everything. Storage type matters too:

-- Rough read/write activity proxy (tuple counts, not actual IOPS;
-- use the CloudWatch ReadIOPS/WriteIOPS metrics for true numbers)
SELECT 
    sum(tup_inserted + tup_updated + tup_deleted) as write_ops,
    sum(tup_fetched) as read_ops
FROM pg_stat_database 
WHERE datname = current_database();

  • gp3: Best for most workloads (3,000 IOPS baseline; provisionable up to 16,000, unlike gp2 it isn't burst-based)
  • io1/io2: When you need guaranteed IOPS (expensive)
  • gp2: Legacy, upgrade to gp3

Real-World Troubleshooting Example

Let me walk through a real case. The application was timing out, with queries taking 30+ seconds.

Step 1: Check Performance Insights

  • Top wait event: IO:DataFileRead (60%)
  • Top SQL: Complex JOIN query

Step 2: Examine the query:

EXPLAIN (ANALYZE, BUFFERS) 
SELECT 
    o.id, o.total, c.name, c.email,
    COUNT(oi.id) as item_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at > CURRENT_DATE - INTERVAL '30 days'
AND o.status = 'completed'
GROUP BY o.id, o.total, c.name, c.email;

Output showed:

  • Nested Loop Join with 2M+ loops
  • Sequential scan on order_items
  • Buffer reads: 180,000+

Step 3: Found missing index:

CREATE INDEX idx_order_items_order_id ON order_items (order_id);
-- Equality column first, then the range column
CREATE INDEX idx_orders_status_created ON orders (status, created_at);
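A few days later it's worth confirming the new indexes are actually being used; an index that never gets scanned is pure write overhead:

```sql
-- idx_scan counts uses of each index since the last stats reset
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('orders', 'order_items')
ORDER BY idx_scan;
```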

Step 4: Results:

  • Query time: 30 seconds → 200ms
  • Wait events: IO:DataFileRead dropped to 5%
  • No instance upgrade needed

Monitoring and Alerts

Set up these CloudWatch alarms:

  1. Database connections:
-- Check current connections
SELECT count(*) FROM pg_stat_activity;

-- See connections by state
SELECT state, count(*) 
FROM pg_stat_activity 
GROUP BY state;

Alert when > 80% of max connections.
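That 80% threshold is easy to compute server-side, for example as a quick health check:

```sql
-- Current connection usage as a percentage of max_connections
SELECT
    count(*) AS current_connections,
    current_setting('max_connections')::int AS max_connections,
    round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```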

  2. Long-running queries:
-- Queries running > 5 minutes
SELECT 
    pid,
    now() - query_start AS duration,
    state,
    query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;

  3. Replication lag (if using read replicas):
-- On primary
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
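The LSN columns are opaque on their own; pg_wal_lsn_diff() (PostgreSQL 10+) converts them into a byte count you can alarm on:

```sql
-- Replication lag in bytes per replica, measured on the primary
SELECT
    client_addr,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```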

Quick Reference: Instance Selection

Workload Type         | First Choice      | Alternative | Key Metric
----------------------|-------------------|-------------|-----------------------
High concurrent reads | r6g (memory)      | r5          | Cache hit ratio > 99%
Complex analytics     | c6g (compute)     | c5n         | CPU utilization
Mixed OLTP            | m6g (general)     | m5          | Balanced CPU/memory
Write-heavy           | m6g + gp3 storage | r6g + io2   | IOPS requirements
Dev/Test              | t4g (burstable)   | t3          | Cost optimization

Key Takeaways

  1. Enable monitoring first - You can’t fix what you can’t see
  2. Slow queries are rarely about instance size - Usually it’s missing indexes or bad queries
  3. Memory matters most - Data in RAM is 1000x faster than disk
  4. Test with production data - Dev performance means nothing
  5. Monitor after changes - Performance can degrade over time

The best instance type is the one that handles your specific workload efficiently. Start with general purpose, measure everything, then optimize based on data.


Found a slow query pattern I didn’t cover? Share it @TheLogicalDev - I’m always collecting new ways databases surprise us in production.