A soak test runs at sustained load for 2+ hours. Short tests miss failure modes that only appear over time — a 5-minute test will not reveal a memory leak that takes 90 minutes to exhaust available RAM.

Failure modes unique to soak tests

Connection and file handle exhaustion — temporary resources that reset quickly under short load accumulate under sustained load. Connection pools drain, socket file descriptors are not released fast enough, and new requests start being dropped or queued.
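
One way to spot this kind of accumulation is to sample a process's open file descriptor count at intervals during the run and compare the readings. The following is a minimal sketch, assuming a Linux host (it reads `/proc/<pid>/fd`); the function names are illustrative, not part of any tool mentioned here.

```python
import os
import time


def open_fd_count(pid: int) -> int:
    """Count open file descriptors for a process (Linux only: reads /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(pid: int, samples: int, interval_s: float) -> list:
    """Take `samples` readings of the descriptor count, `interval_s` seconds apart."""
    readings = []
    for i in range(samples):
        readings.append(open_fd_count(pid))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


def fd_growth(readings: list) -> int:
    """Net change in open descriptors between the first and last reading."""
    return readings[-1] - readings[0]
```

A count that keeps climbing across an hour of steady load, rather than levelling off, suggests descriptors are being opened faster than they are released.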

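Cache saturation — once a cache is full it starts evicting entries to make room for new ones. If the working set has grown beyond the cache's capacity, the hit rate falls and a growing share of reads falls through to the database, causing a step-change increase in DB I/O that wasn’t visible at short durations.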
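
The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any particular cache: it assumes an LRU eviction policy, uniformly random reads, and a keyspace that grows steadily during the test (for example, because new records are created under load). All parameter names and values are hypothetical.

```python
import random
from collections import OrderedDict


def simulate_hit_rates(capacity, total_requests, keys_added_every, window, seed=0):
    """Return the cache hit rate for each window of `window` requests."""
    rng = random.Random(seed)
    cache = OrderedDict()  # insertion order doubles as recency order
    hit_rates = []
    hits = 0
    for i in range(total_requests):
        # Assumption: one new key becomes readable every `keys_added_every` requests.
        keyspace = 1 + i // keys_added_every
        key = rng.randrange(keyspace)
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
        if (i + 1) % window == 0:
            hit_rates.append(hits / window)
            hits = 0
    return hit_rates


rates = simulate_hit_rates(
    capacity=100, total_requests=20_000, keys_added_every=20, window=2_000
)
```

With these values the keyspace fits in the cache for the first window, so almost every read is a hit; by the last window the keyspace is roughly ten times the cache capacity and most reads miss. A short test would only ever see the first window.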

Database degradation — sustained write load causes index fragmentation and transaction log bloat, degrading query performance progressively. Storage can fill up if log rotation isn’t configured.

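Infrastructure throttling — burstable AWS instance types (T-class) accumulate CPU credits while running below their baseline utilisation and spend them under load. A short test may run entirely within the credit balance; a soak test can exhaust the credits, at which point an instance in standard credit mode is throttled to its baseline CPU, causing a sudden latency cliff. In unlimited credit mode the instance keeps bursting and the surplus usage is billed instead, so the symptom is cost rather than latency.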
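
A rough estimate of when the credits run out helps in choosing a soak duration. One CPU credit corresponds to one vCPU running at 100% for one minute, so the drain rate follows from the vCPU count and average utilisation. The sketch below assumes constant load and a throttled (not unlimited) instance; the function name and the example figures are illustrative and should be replaced with the values for the actual instance type.

```python
from typing import Optional


def minutes_until_credits_exhausted(
    credit_balance: float,
    credits_earned_per_hour: float,
    vcpus: int,
    avg_utilization: float,
) -> Optional[float]:
    """Estimate minutes until the CPU credit balance reaches zero.

    Returns None when credits are earned at least as fast as they are spent.
    """
    # One credit = one vCPU at 100% for one minute.
    credits_spent_per_hour = vcpus * avg_utilization * 60
    net_drain_per_hour = credits_spent_per_hour - credits_earned_per_hour
    if net_drain_per_hour <= 0:
        return None
    return 60 * credit_balance / net_drain_per_hour


# Hypothetical figures: 40 credits banked, 12 earned per hour,
# 2 vCPUs held at 50% average utilisation.
estimate = minutes_until_credits_exhausted(40, 12, 2, 0.5)
```

With these example figures the balance lasts about 50 minutes, which is why a 5-minute test would never see the throttle.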

Warning

CPU credit exhaustion shows up as a sudden performance drop, not a gradual one. If latency spikes sharply after 30–60 minutes on a T-class instance, check the CPUCreditBalance metric in CloudWatch before investigating the application.
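
The check can be scripted. The sketch below assumes a boto3 CloudWatch client is passed in and reads the `CPUCreditBalance` metric from the `AWS/EC2` namespace using `get_metric_statistics`; the function name, the lookback window, and the instance ID in the usage comment are placeholders.

```python
from datetime import datetime, timedelta, timezone


def latest_cpu_credit_balance(cloudwatch, instance_id, lookback_minutes=30):
    """Return the most recent average CPUCreditBalance, or None if no data."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=300,  # credit metrics are reported at 5-minute granularity
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to arrive in time order.
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   balance = latest_cpu_credit_balance(boto3.client("cloudwatch"), "i-0123456789abcdef0")
```

A balance at or near zero at the moment latency jumped points to throttling rather than an application fault.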

Third-party API quota exhaustion — rate limits that reset every minute or every hour already show up in short tests. Billing-period quotas (monthly call limits on payment gateways, SMS providers, external APIs) surface only once sustained load has accumulated enough calls against them.
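
Before starting a long run it is worth estimating how much of a billing-period quota the test itself will consume. The arithmetic can be sketched as follows; the function names and all figures are hypothetical.

```python
def calls_during_test(requests_per_second, duration_hours, third_party_call_ratio):
    """Estimate third-party calls made over the whole test.

    `third_party_call_ratio` is the fraction of requests that trigger
    one outbound call to the external provider.
    """
    total_requests = requests_per_second * duration_hours * 3600
    return round(total_requests * third_party_call_ratio)


def quota_fraction_used(calls, monthly_quota):
    """Fraction of the monthly quota consumed by `calls`."""
    return calls / monthly_quota


# Hypothetical figures: 50 req/s for 4 hours, 10% of requests call the provider,
# against a quota of 100,000 calls per month.
calls = calls_during_test(50, 4, 0.1)
fraction = quota_fraction_used(calls, 100_000)
```

In this example a single four-hour run makes 72,000 calls and uses roughly 72% of the month's allowance, whereas a 5-minute run at the same rate makes only 1,500.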

What to monitor during a soak test

  • DB connection pool — watch for pool exhaustion; connections should not trend upward over time
  • Heap / GC metrics — memory should stabilise, not grow continuously
  • Cache hit rate — a declining hit rate over time indicates cache saturation
  • CPU credit balance (T-class instances only) — dropping to zero triggers throttling
  • Error rate trend — a flat error rate is healthy; a slowly rising one indicates progressive resource exhaustion
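
Several of the checks above come down to asking whether a metric is trending upward. A minimal sketch of that test, assuming the metric has been sampled as (minute, value) pairs: fit a least-squares line and, if the slope is positive, project when the metric would reach its limit. The function names and sample data are illustrative.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (minutes since test start, metric value)


def slope_per_minute(samples: List[Sample]) -> float:
    """Least-squares slope; needs at least two samples at distinct times."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    return covariance / variance


def minutes_to_limit(samples: List[Sample], limit: float) -> Optional[float]:
    """Project minutes from the last sample until `limit`; None if not rising."""
    slope = slope_per_minute(samples)
    if slope <= 0:
        return None
    latest_value = samples[-1][1]
    return (limit - latest_value) / slope


# Hypothetical heap readings in MB, taken every 10 minutes.
heap_mb = [(0, 500), (10, 550), (20, 600), (30, 650)]
growth = slope_per_minute(heap_mb)
remaining = minutes_to_limit(heap_mb, limit=1100)
```

Here the heap grows by 5 MB per minute and would reach a 1100 MB limit about 90 minutes after the last sample. A linear fit is a simplification: a metric that rises during warm-up and then flattens will look like growth early on, so the projection is only meaningful once the system has been at steady load for a while.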
