gRPC Connection Registry and Client-Side LB Setup

Client-side load balancing for gRPC in Kubernetes requires four changes working together: headless services, dns:/// resolver prefix, round_robin policy, and a persistent connection pool. Any one of these alone is insufficient.

Without pooling (current UAT/prod)

kube-proxy picks a pod on each new TCP connection. Dialing per request therefore spreads load across pods, but every RPC pays roughly 77 ms for the TCP and TLS handshakes.

voltnet-omni                              downstream
  outbound RPC → dial TCP+TLS → kube-proxy → random pod
                → RPC
                → conn.Close()

With pooling (dev)

One persistent *grpc.ClientConn per downstream address, holding one sub-connection per pod IP. DNS resolves all pod IPs at startup; round_robin distributes RPCs across the sub-connections.

flowchart LR
    registry["ConnectionRegistry<br/>addr → *grpc.ClientConn"]
    registry -->|conn-A| podA[pod-A]
    registry -->|conn-B| podB[pod-B]
    registry -->|conn-C| podC[pod-C]
    rpc["outbound RPC"] -->|"round_robin policy"| registry
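
To make the rotation concrete, here is a minimal sketch of round-robin selection over a fixed set of pod addresses. It is a toy stand-in for what gRPC's round_robin policy does internally, not the real picker; the `roundRobin` type and the IP addresses are invented for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin cycles through a fixed list of backend addresses.
// Hypothetical type: gRPC's real picker also tracks connection state.
type roundRobin struct {
	addrs []string
	next  atomic.Uint64
}

// pick returns the next address in rotation and is safe for concurrent use.
func (r *roundRobin) pick() string {
	n := r.next.Add(1) - 1
	return r.addrs[n%uint64(len(r.addrs))]
}

func main() {
	rr := &roundRobin{addrs: []string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.pick()) // each address appears twice, in order
	}
}
```

With three pods and six RPCs, each pod receives exactly two, which is the even spread the pooled setup relies on.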

Headless services

clusterIP: None causes DNS to return one A record per pod instead of a single ClusterIP.

spec:
  clusterIP: None  # DNS returns all pod IPs instead of a single VIP

Warning

Pod port, not service port — headless services bypass kube-proxy; the client dials the actual pod port. Standard ClusterIP services handle :80 → :8080 translation; headless services do not. See configmap for per-service port mappings.

Client-side round-robin (voltnet-common)

DialGRPCWithOptions in voltnet-common accepts the options below. DialGRPC is a zero-option wrapper kept for legacy callers.

| Option | Default | Description |
|---|---|---|
| `EnableClientSideLoadBalancing()` | disabled | Adds `dns:///` prefix and `round_robin` policy |
| `WithKeepaliveTime(d)` | `10s` | Ping interval. Must be ≥ server `EnforcementPolicy.MinTime`; see grpc-keepalive-enhance-your-calm |
| `WithKeepaliveTimeout(d)` | `5s` | Time to wait for pong before declaring connection dead |
| `WithPermitWithoutStream(bool)` | `true` | Send keepalive pings when no active streams exist |
| `WithDialTimeout(d)` | `5s` | Hard timeout on initial connection establishment |

The dns:/// prefix is required: without it, gRPC uses the passthrough resolver, which hands the balancer a single address, so there is nothing to balance across. The round_robin policy is also required: without it, gRPC defaults to pick_first and sends every RPC over one connection.

Connection registry (voltnet-omni)

pkg/grpc/registry.go holds a singleton, thread-safe cache of address → *grpc.ClientConn. Connections are established once and reused for the lifetime of the pod.

DialWithPooling handles GRPC_CONNECTION_POOLING_ENABLED transparently — when false it dials per-request; when true it returns a cached connection. All callers use defer cleanup(), so toggling the flag requires no code changes.

The cache key is the address, not the logical service name, which prevents stale connections when a service's address differs between environments.

Graceful shutdown

cmd/main.go registers a shutdown handler that calls Close() on the registry before the process exits. Close() applies a 30-second timeout per connection, so with five downstream services closed one after another, the worst-case shutdown time is 5 × 30 s = 2.5 minutes if every connection stalls.

Note

Shutdown log lines show the address string (e.g. "service": "voltnet-pricing-svc:8080"), not a human-readable name — the loop variable is named service but holds the address.

Feature flag

| Env var | Default | Effect |
|---|---|---|
| `GRPC_CONNECTION_POOLING_ENABLED` | `false` | `true` enables connection reuse; `false` preserves per-request dial |

Rollback: set to false and redeploy — no code changes needed.
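
For illustration, reading the flag with a safe default might look like the sketch below. The `poolingEnabled` helper is hypothetical; treating unset or unparsable values as `false` is an assumption chosen to match the documented default, not confirmed behaviour of voltnet-omni.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// poolingEnabled reads GRPC_CONNECTION_POOLING_ENABLED through getenv.
// Unset or unparsable values fall back to false (assumed behaviour).
func poolingEnabled(getenv func(string) string) bool {
	v, err := strconv.ParseBool(getenv("GRPC_CONNECTION_POOLING_ENABLED"))
	return err == nil && v
}

func main() {
	fmt.Println(poolingEnabled(os.Getenv))
}
```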

Zhu Yuechen's Tech Notes
