Observed on UAT when GRPC_CONNECTION_POOLING_ENABLED=true: a perpetual reconnect loop triggered by mismatched keepalive configuration between the gRPC client and downstream servers.
Symptom
ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings"
rpc error: code = Canceled desc = context canceled
~1 error per 100s on UAT (exponential backoff slows the cycle after the initial burst). This is why GRPC_CONNECTION_POOLING_ENABLED was reverted on UAT.
Root cause
Two settings interact:
- Client PermitWithoutStream: true (sends keepalive pings even when there are no active RPCs on the connection)
- Server EnforcementPolicy.MinTime (rejects clients that ping more frequently than this threshold; gRPC default is 5 minutes)
The connection registry uses a default keepalive of 10s. Downstream services have no explicit EnforcementPolicy, so they fall back to gRPC’s default MinTime: 5 minutes. Server sees a ping every 10s on an idle connection → GOAWAY ENHANCE_YOUR_CALM.
Reconnect loop
GOAWAY ENHANCE_YOUR_CALM closes the HTTP/2 connection but does not ban the client. The cycle repeats with exponential backoff (1s, 2s, 4s, 8s…):
- Client connects → SubConn enters READY
- 10s passes → client sends keepalive ping on idle connection
- Server: ping interval < MinTime → sends GOAWAY ENHANCE_YOUR_CALM
- Client: SubConn → TRANSIENT_FAILURE → reconnects with backoff
- Server accepts new connection → repeat from step 1
RPCs landing during a backoff window fail with UNAVAILABLE. The retry policy catches most, but adds latency. At 2–3 replicas, if multiple SubConns enter backoff simultaneously, the remaining healthy SubConn absorbs all traffic.
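To get a feel for how long the backoff windows in this loop can grow, the doubling schedule quoted above (1s, 2s, 4s, 8s…) can be sketched with the standard library alone. This is an illustrative model, not the gRPC implementation: BackoffDelays is a hypothetical helper, the 120s cap is an assumption, and real client backoff typically adds jitter, which is omitted here.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 1 * time.Second   // first delay, as quoted in this note
	maxDelay  = 120 * time.Second // assumption: cap on the delay, not taken from this note
)

// BackoffDelays returns the first n reconnect delays of a doubling backoff
// that starts at base and never exceeds max. Jitter is deliberately left out.
func BackoffDelays(base, max time.Duration, n int) []time.Duration {
	delays := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		if d > max {
			d = max
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	var total time.Duration
	for i, d := range BackoffDelays(baseDelay, maxDelay, 9) {
		total += d
		fmt.Printf("attempt %d: wait %v (cumulative %v)\n", i+1, d, total)
	}
}
```

Each printed wait is a window in which RPCs routed to that SubConn can fail with UNAVAILABLE, so later iterations of the loop cost more per failure even though they occur less often.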
How to replicate
Enable GRPC_CONNECTION_POOLING_ENABLED=true on UAT, then run:
k6 run -e TEST_V03=true \
-e GRAB_VOLTNET_API_KEY=<key> \
-e START_RPS=10 -e MAX_RPS=10 -e STAGE_DURATION=60s \
load-testing/ev/omni_nearby_rate_limit_test.js
kubectl logs -n ev deploy/voltnet-omni -f | grep "too_many_pings\|ENHANCE_YOUR_CALM"
Restore with GRPC_CONNECTION_POOLING_ENABLED=false.
Fix
Option A (quick mitigation): raise DEFAULT_KEEPALIVE_TIME in voltnet-common to 5 * time.Minute. Only voltnet-common and voltnet-omni need rebuilding. Slows the reconnect cycle to every ~10 minutes but does not stop it — PermitWithoutStream: true on the client still conflicts with the server’s default PermitWithoutStream: false.
Option B (fully resolves): add an explicit EnforcementPolicy to every downstream server with MinTime ≤ client keepaliveTime and matching PermitWithoutStream. Requires modifying and redeploying every downstream service. Coordinated values: client keepaliveTime: 30s, server MinTime: 20s, PermitWithoutStream: true on both.
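A minimal sketch of what the coordinated values in Option B could look like with the grpc-go keepalive package. The helper names (clientKeepaliveOption, serverEnforcementOption) are hypothetical, and the 10s client Timeout is an assumption that this note does not state. How the connection registry and each downstream service actually wire these options is not shown here.

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

const (
	clientKeepaliveTime = 30 * time.Second // Option B: client keepaliveTime
	serverMinTime       = 20 * time.Second // Option B: server MinTime (<= client time)
	// Assumption: how long the client waits for a ping ack before closing
	// the connection. Not specified in this note.
	clientKeepaliveTimeout = 10 * time.Second
)

// clientKeepaliveOption is a hypothetical helper returning the dial option
// a client would pass when creating its connection.
func clientKeepaliveOption() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                clientKeepaliveTime,
		Timeout:             clientKeepaliveTimeout,
		PermitWithoutStream: true, // ping even when no RPCs are active
	})
}

// serverEnforcementOption is a hypothetical helper returning the server
// option each downstream service would add.
func serverEnforcementOption() grpc.ServerOption {
	return grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             serverMinTime,
		PermitWithoutStream: true, // must match the client setting
	})
}

func main() {
	// Build a server with the policy to show where the option goes;
	// no services are registered and nothing is served.
	srv := grpc.NewServer(serverEnforcementOption())
	defer srv.Stop()

	_ = clientKeepaliveOption() // would be passed to the client's dial call
	fmt.Printf("client pings every %v, server tolerates pings every %v\n",
		clientKeepaliveTime, serverMinTime)
}
```

Keeping MinTime below the client interval (20s against 30s) leaves headroom for timing skew, so a ping that arrives slightly early is not counted against the client.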
Tip
Rule: server EnforcementPolicy.MinTime must be ≤ client keepaliveTime, and PermitWithoutStream must match on both sides.
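The rule can be expressed as a small check that a deploy script or unit test might run against both sides' settings. This is a sketch using only the standard library: ClientKeepalive, ServerPolicy, and Violations are hypothetical names, not gRPC types, and the scenario values are the ones quoted in this note.

```go
package main

import (
	"fmt"
	"time"
)

// ClientKeepalive and ServerPolicy are hypothetical stand-ins for the two
// sides' settings; they are not the gRPC library types.
type ClientKeepalive struct {
	Time                time.Duration // interval between keepalive pings
	PermitWithoutStream bool          // ping even with no active RPCs
}

type ServerPolicy struct {
	MinTime             time.Duration // shortest ping interval the server tolerates
	PermitWithoutStream bool          // accept pings on idle connections
}

// Scenarios taken from this note.
var (
	defaultServer = ServerPolicy{MinTime: 5 * time.Minute, PermitWithoutStream: false}
	uatClient     = ClientKeepalive{Time: 10 * time.Second, PermitWithoutStream: true}
	optionAClient = ClientKeepalive{Time: 5 * time.Minute, PermitWithoutStream: true}
	optionBClient = ClientKeepalive{Time: 30 * time.Second, PermitWithoutStream: true}
	optionBServer = ServerPolicy{MinTime: 20 * time.Second, PermitWithoutStream: true}
)

// Violations lists every way the pair breaks the rule; an empty result
// means the settings are compatible.
func Violations(c ClientKeepalive, s ServerPolicy) []string {
	var out []string
	if s.MinTime > c.Time {
		out = append(out, fmt.Sprintf(
			"server MinTime %v exceeds client keepalive time %v", s.MinTime, c.Time))
	}
	if c.PermitWithoutStream != s.PermitWithoutStream {
		out = append(out, fmt.Sprintf(
			"PermitWithoutStream differs (client %t, server %t)",
			c.PermitWithoutStream, s.PermitWithoutStream))
	}
	return out
}

func main() {
	fmt.Println("UAT as observed:", Violations(uatClient, defaultServer))
	fmt.Println("Option A:", Violations(optionAClient, defaultServer))
	fmt.Println("Option B:", Violations(optionBClient, optionBServer))
}
```

Under this model the observed UAT settings break both conditions, Option A still breaks the PermitWithoutStream condition, and Option B breaks neither.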
Note
Side effect of raising keepalive interval: with keepaliveTime: 5min, a silently dead connection (network partition, OOM-kill without clean socket close) won’t be detected for ~5min10s instead of ~10s. In practice inside Kubernetes this barely matters: clean pod death sends TCP FIN/RST detected in milliseconds regardless of keepalive. The 30s DNS re-resolution timer is unaffected by keepalive interval.
See also
- grpc-connection-registry: connection registry implementation and the EnableClientSideLoadBalancing() option
- grpc-load-balancing: the core gRPC load balancing problem