grpc-go ignores the TTL returned by CoreDNS. Re-resolution is controlled by two signals in the DNS watcher (source):
- 30-second timer (MinResolutionInterval): fires unconditionally after 30s, regardless of connection state.
- ResolveNow(): called when a SubConn fails, but subject to the same 30s floor. If fewer than 30s have elapsed since the last resolution, it waits out the remaining time.
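
The floor on ResolveNow() can be written as a one-line formula. This is a sketch of the behaviour described above, not a quotation of the grpc-go source; the symbols $t_{\text{last}}$ (time of the previous resolution) and $t_{\text{req}}$ (time ResolveNow() is called) are introduced here purely for illustration.

$$
t_{\text{resolve}} = \max\bigl(t_{\text{req}},\; t_{\text{last}} + 30\,\text{s}\bigr)
$$

For example, a ResolveNow() call arriving 10s after the last resolution waits a further 20s, while a call arriving 45s after it proceeds immediately.
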
For a rolling deploy where existing pods stay healthy, new pods are invisible for up to 30 seconds. No SubConn fails, so ResolveNow() is never triggered; the 30s timer alone drives re-resolution.
This is a known open issue (grpc/grpc#12295, open since 2017):
“DNS is fundamentally unsuited to the kind of dynamic environment you’re describing, because DNS is a polling-based mechanism, whereas what you really want is a push-based mechanism.”
Impact at low replica counts
    Start: Pod A, Pod B (traffic split 50/50)

    Rolling deploy:
      Pod B terminated → Pod C starts and becomes Ready

    Discovery window (up to 30s) where Pod C receives zero traffic:
      100% of traffic → Pod A only
      No SubConn failure → ResolveNow never triggered
      Re-resolution waits on the 30s timer alone

At UAT (2 replicas): one pod absorbs 100% of traffic for up to 30s on every deploy. At prod (4 replicas with maxUnavailable: 1): load shifts from 25% to 33% per pod, which is smaller but structurally the same.
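
The per-pod figures follow from simple arithmetic. As a sketch, assume traffic is spread evenly across the pods a client can currently see, with $N$ the desired replica count and $u$ the number of pods that are either gone or not yet discovered (both symbols are introduced here for illustration):

$$
\text{share per visible pod} = \frac{1}{N - u}
$$

With $N = 2,\ u = 1$ (UAT) this gives $1/1 = 100\%$. With $N = 4,\ u = 1$ (prod) it gives $1/3 \approx 33\%$, up from the steady-state $1/4 = 25\%$.
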
Unplanned restarts (OOM, node eviction, spot interruption) also trigger this window and are outside deploy schedule control.
Mitigations
These are defence-in-depth; none eliminates the 30s window.
PreStop sleep — keeps the departing pod alive and serving during the DNS propagation window. Without it, SIGTERM fires immediately while clients still have the old pod IP cached, causing failures for up to 30s.
    # set on the container
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "15"]
    # set on the pod spec
    terminationGracePeriodSeconds: 45

maxSurge: 1 / maxUnavailable: 0 brings the replacement pod up before terminating the old one, so capacity never drops during a deploy. Prod voltnet-omni currently uses the opposite (maxUnavailable: 1, maxSurge: 0), which removes a pod before its replacement is ready.
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0

| | maxSurge: 1, maxUnavailable: 0 | maxUnavailable: 1, maxSurge: 0 (current prod) |
|---|---|---|
| Capacity during rollout | Always at desired count | Drops to N-1 during transition |
| Peak pod count | N+1 temporarily | Never exceeds N |
| DNS discovery lag exposure | Shorter — new pod starts before old one dies | Longer — old pod removed first |
Requires cluster headroom for 1 extra pod. With connection pooling, each surge pod holds 5 persistent connections (one per downstream service).
Readiness probes — a pod only enters DNS once it passes its readiness probe. Without one, a pod enters DNS before the gRPC listener is up and immediately absorbs traffic it cannot serve.
Tip
The probe must confirm the gRPC port is accepting connections, not just an HTTP health endpoint.
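
A minimal sketch of a readiness probe that checks the gRPC port itself, assuming the server implements the standard gRPC Health Checking Protocol and the cluster supports Kubernetes' built-in grpc probe type. The port number and timing values below are placeholders, not values taken from the service.

```yaml
readinessProbe:
  grpc:
    port: 50051          # assumed gRPC listener port; replace with the real one
  initialDelaySeconds: 5 # placeholder timings, tune per service
  periodSeconds: 5
  failureThreshold: 3
```

If the server does not expose the health service, a tcpSocket probe on the same port at least confirms that the listener is accepting connections.
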
Liveness probes — do not affect DNS directly. When a pod enters a broken state (deadlock, OOM without crash), the liveness probe triggers a restart: pod becomes NotReady, removed from EndpointSlice, stops receiving traffic.
MaxConnectionAge forces periodic reconnects from the server side, and each reconnect calls ResolveNow() on the client. However, ResolveNow() is subject to the same 30s floor, so this does not meaningfully reduce the discovery window.
See also
- grpc-load-balancing — the core gRPC load balancing problem, including Linkerd comparison
- grpc-connection-registry — connection registry implementation and headless service setup
- grpc-keepalive-enhance-your-calm — keepalive conflict that caused UAT instability