gRPC DNS Discovery Lag in Kubernetes

grpc-go ignores the TTL returned by CoreDNS. Re-resolution is controlled by two signals in the DNS watcher (source):

30-second timer (MinResolutionInterval) — fires unconditionally after 30s regardless of connection state
ResolveNow() — called when a SubConn fails, but subject to the same 30s floor: if fewer than 30s have elapsed since the last resolution, it waits out the remaining time

For a rolling deploy where existing pods stay healthy, new pods are invisible for up to 30 seconds. No SubConn fails, so ResolveNow() is never triggered; the 30s timer alone drives re-resolution.

This is a known open issue (grpc/grpc#12295, open since 2017):

“DNS is fundamentally unsuited to the kind of dynamic environment you’re describing, because DNS is a polling-based mechanism, whereas what you really want is a push-based mechanism.”

Impact at low replica counts

Start: Pod A, Pod B — traffic split 50/50
 
Rolling deploy:
  Pod B terminated → Pod C starts and becomes Ready
 
  Discovery window (up to 30s) where Pod C receives zero traffic:
    100% of traffic → Pod A only
    No SubConn failure → ResolveNow never triggered
    Re-resolution waits on the 30s timer alone

At UAT (2 replicas): one pod absorbs 100% of traffic for up to 30s on every deploy. At prod (4 replicas with maxUnavailable: 1): per-pod load rises from 25% to 33% while only three pods are visible, which is smaller but structurally the same.
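The percentages above follow from dividing traffic evenly over the pods a client can see. A minimal Go sketch of that arithmetic, assuming even (round-robin style) distribution; `sharePerPod` is a hypothetical helper name:

```go
package main

import "fmt"

// sharePerPod returns the percentage of traffic each pod receives when load
// is spread evenly over the pods a client currently knows about.
// Even distribution is an assumption made for this illustration.
func sharePerPod(visiblePods int) float64 {
	if visiblePods <= 0 {
		return 0
	}
	return 100.0 / float64(visiblePods)
}

func main() {
	// UAT: 2 replicas, only 1 visible during the discovery window.
	fmt.Printf("UAT:  %.0f%% -> %.0f%%\n", sharePerPod(2), sharePerPod(1))
	// Prod: 4 replicas with maxUnavailable: 1, so 3 visible during the window.
	fmt.Printf("prod: %.0f%% -> %.0f%%\n", sharePerPod(4), sharePerPod(3))
}
```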

Unplanned restarts (OOM, node eviction, spot interruption) also trigger this window and are outside deploy schedule control.

Mitigations

These are defence-in-depth; none eliminates the 30s window.

PreStop sleep — keeps the departing pod alive and serving during the DNS propagation window. Without it, SIGTERM fires immediately while clients still have the old pod IP cached, causing failures for up to 30s.

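spec:                                  # pod template spec
  terminationGracePeriodSeconds: 45    # pod-level; must exceed the preStop sleep
  containers:
    - name: app                        # placeholder container name
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sleep", "15"]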

maxSurge: 1 / maxUnavailable: 0 — brings the replacement pod up before terminating the old one, so capacity never drops during a deploy. Prod voltnet-omni currently uses the opposite (maxUnavailable: 1, maxSurge: 0), which removes a pod before its replacement is ready.

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

| | `maxSurge: 1, maxUnavailable: 0` | `maxUnavailable: 1, maxSurge: 0` (current prod) |
| --- | --- | --- |
| Capacity during rollout | Always at desired count | Drops to N-1 during transition |
| Peak pod count | N+1 temporarily | Never exceeds N |
| DNS discovery lag exposure | Shorter: new pod starts before old one dies | Longer: old pod removed first |

The surge strategy requires cluster headroom for one extra pod. With connection pooling, each surge pod holds 5 persistent connections (one per downstream service).

Readiness probes: a pod only enters DNS once it passes its readiness probe. Without one, a pod can enter DNS before the gRPC listener is up and immediately receive traffic it cannot serve.

Tip

The probe must confirm that the gRPC port is accepting connections, not just that an HTTP health endpoint responds.

Liveness probes: these do not affect DNS directly. When a pod enters a broken state (deadlock, OOM without a crash), the liveness probe triggers a restart: the pod becomes NotReady, is removed from the EndpointSlice, and stops receiving traffic.

MaxConnectionAge: forces periodic reconnects from the server side, each of which triggers ResolveNow() on the client. However, ResolveNow() is subject to the same 30s floor, so this setting does not meaningfully reduce the discovery window.

Zhu Yuechen's Tech Notes
