gRPC Keepalive Conflict — ENHANCE_YOUR_CALM

Observed on UAT when GRPC_CONNECTION_POOLING_ENABLED=true: a perpetual reconnect loop triggered by mismatched keepalive configuration between the gRPC client and downstream servers.

Symptom

ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings"
rpc error: code = Canceled desc = context canceled

Roughly one error every 100s on UAT: after an initial burst, exponential backoff slows the cycle. This is why GRPC_CONNECTION_POOLING_ENABLED was reverted on UAT.

Root cause

Two settings interact:

Client PermitWithoutStream: true — sends keepalive pings even when there are no active RPCs on the connection
Server EnforcementPolicy.MinTime — rejects clients that ping more frequently than this threshold; gRPC default is 5 minutes

The connection registry uses a default keepalive of 10s. Downstream services set no explicit EnforcementPolicy, so they fall back to gRPC’s defaults: MinTime of 5 minutes and PermitWithoutStream: false. The server therefore sees a ping every 10s on an idle connection, far more often than it permits, and responds with GOAWAY ENHANCE_YOUR_CALM.

Reconnect loop

GOAWAY ENHANCE_YOUR_CALM closes the HTTP/2 connection but does not ban the client. The cycle repeats with exponential backoff (1s, 2s, 4s, 8s…):

1. Client connects → SubConn enters READY
2. 10s passes → client sends keepalive ping on idle connection
3. Server: ping interval < MinTime → sends GOAWAY ENHANCE_YOUR_CALM
4. Client: SubConn → TRANSIENT_FAILURE → reconnects with backoff
5. Server accepts new connection → repeat from step 1

RPCs landing during a backoff window fail with UNAVAILABLE. The retry policy catches most, but adds latency. At 2–3 replicas, if multiple SubConns enter backoff simultaneously, the remaining healthy SubConn absorbs all traffic.

How to replicate

Enable GRPC_CONNECTION_POOLING_ENABLED=true on UAT, then run:

k6 run -e TEST_V03=true \
  -e GRAB_VOLTNET_API_KEY=<key> \
  -e START_RPS=10 -e MAX_RPS=10 -e STAGE_DURATION=60s \
  load-testing/ev/omni_nearby_rate_limit_test.js
 
kubectl logs -n ev deploy/voltnet-omni -f | grep "too_many_pings\|ENHANCE_YOUR_CALM"

Restore with GRPC_CONNECTION_POOLING_ENABLED=false.

Fix

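Option A (quick mitigation): raise DEFAULT_KEEPALIVE_TIME in voltnet-common to 5 * time.Minute. Only voltnet-common and voltnet-omni need rebuilding. This slows the reconnect cycle to roughly every 10 minutes but does not stop it: PermitWithoutStream: true on the client still conflicts with the server’s default PermitWithoutStream: false. With that default, gRPC servers hold pings on connections with no active streams to a 2-hour minimum rather than MinTime, so a 5-minute ping on an idle connection still counts as a violation.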

Option B (fully resolves): add an explicit EnforcementPolicy to every downstream server with MinTime ≤ client keepaliveTime and matching PermitWithoutStream. Requires modifying and redeploying every downstream service. Coordinated values: client keepaliveTime: 30s, server MinTime: 20s, PermitWithoutStream: true on both.
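For Option B, the coordinated values map onto gRPC-Go's keepalive package roughly as follows. This is a configuration sketch, not the actual voltnet code: the package and function names are hypothetical, and the 10s client Timeout is an assumption (the note only fixes Time, MinTime and PermitWithoutStream).

```go
// Package keepalivecfg is a hypothetical home for the shared settings.
package keepalivecfg

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// ServerOption returns the enforcement policy every downstream server would
// pass to grpc.NewServer: accept pings as often as every 20s, including on
// connections with no active RPCs.
func ServerOption() grpc.ServerOption {
	return grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             20 * time.Second,
		PermitWithoutStream: true,
	})
}

// ClientOption returns the matching dial option: ping every 30s, which stays
// above the server's 20s MinTime.
func ClientOption() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                30 * time.Second,
		Timeout:             10 * time.Second, // assumed value, not from the note
		PermitWithoutStream: true,
	})
}
```

Keeping both options in one shared package makes it harder for the client and server values to drift apart again.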

Tip

Rule: server EnforcementPolicy.MinTime must be ≤ client keepaliveTime, and PermitWithoutStream must match on both sides.

Note

Side effect of raising keepalive interval: with keepaliveTime: 5min, a silently dead connection (network partition, OOM-kill without clean socket close) won’t be detected for ~5min10s, versus ~20s at the current 10s interval (both figures are the interval plus the ~10s ping timeout that the 5min10s estimate implies). In practice inside Kubernetes this barely matters: clean pod death sends TCP FIN/RST, which is detected in milliseconds regardless of keepalive. The 30s DNS re-resolution timer is unaffected by keepalive interval.

Zhu Yuechen's Tech Notes
