## Goal
Verify that omni uses real client-side round-robin (not pick_first) after moving downstream services to headless Kubernetes services.
## Setup

Five downstream services were switched to headless (`clusterIP: None`):

- `voltnet-iam`
- `voltnet-charge-points`
- `voltnet-charging-sessions`
- `voltnet-pricing`
- `voltnet-broccoli`
On omni, the pooled path in `pkg/grpc/registry.go` now uses:

```go
DialGRPCWithOptions(ctx, addr, EnableClientSideLoadBalancing())
```

`EnableClientSideLoadBalancing()` enables:

- the `dns:///` resolver (not one-shot OS resolution)
- the `round_robin` policy (one sub-connection per resolved pod IP)
- keepalive (`10s` ping, `5s` timeout)

The registry caches one `ClientConn` per downstream address, so these sub-connections are long-lived and reused.
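The headless setup can be spot-checked before any load testing: a headless Service resolves to one A record per ready pod, which is exactly what the `dns:///` resolver hands to `round_robin`. A minimal sketch, assuming the deployment is named `omni` and `getent` is available in the image:

```shell
# Each headless Service should resolve to one line per ready pod IP;
# a regular ClusterIP Service would print a single virtual IP instead.
# "deploy/omni" and getent availability are assumptions.
for svc in voltnet-iam voltnet-charge-points voltnet-charging-sessions \
           voltnet-pricing voltnet-broccoli; do
  echo "== ${svc} =="
  kubectl exec deploy/omni -- getent hosts "${svc}" || true  # tolerate dry runs
done
```

Seeing multiple lines per service here confirms the DNS side of the setup independently of gRPC.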
## Verification approach

Two checks were run during a sampling window (default 60 s) while k6 generated concurrent traffic through omni.

```mermaid
flowchart LR
    A["k6 load to omni endpoint"] --> B["verify-grpc-lb.sh"]
    B --> C["Check 1: /proc/net/tcp coverage + local-port reuse"]
    B --> D["Check 2: pod log trace_id distribution"]
```
## Check 1 — Connection reuse and coverage

Method:

- Read the omni pod's `/proc/net/tcp` (the container has no `ss`)
  - `/proc/net/tcp` is the kernel socket-table snapshot for that network namespace
  - each row includes local address/port, remote address/port, and TCP state
- Filter for `ESTABLISHED` (state `01`)
  - `01` means the TCP handshake is complete and traffic can flow
  - this excludes `SYN_SENT`, `TIME_WAIT`, and other transient states that would add noise
- Convert the hex little-endian remote IP to dotted decimal
- For each headless target service:
  - list running pod IPs
  - snapshot peer pod IPs + local ports before the window
  - snapshot again after the window
  - compare
**Note:** The addresses in `/proc/net/tcp` are hex and little-endian (e.g. `0100007F` = `127.0.0.1`), so decoding is required before matching against pod IPs.
Success criteria:

- Coverage: all pod IPs appear as peers
- Reuse: the same local ports before and after (the same underlying TCP connections)
**Tip — why local-port equality matters:** The same local port appearing before and after the window is strong evidence of connection reuse. A stable connection *count* alone can hide close-and-reopen churn.
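The parse-and-decode steps of Check 1 can be sketched in a few lines of shell. The helper reverses the little-endian byte order before the result is matched against pod IPs; column positions follow the standard `/proc/net/tcp` layout (this is a sketch, not the actual verify-grpc-lb.sh):

```shell
# Decode ESTABLISHED peers from /proc/net/tcp (hex, little-endian).
hex_le_to_ip() {                        # e.g. 0100007F -> 127.0.0.1
  local h=$1
  printf '%d.%d.%d.%d' "0x${h:6:2}" "0x${h:4:2}" "0x${h:2:2}" "0x${h:0:2}"
}

# Column 3 is rem_address (ip:port), column 4 is st; 01 = ESTABLISHED.
awk '$4 == "01" {print $3}' /proc/net/tcp |
while IFS=: read -r hexip hexport; do
  printf '%s:%d\n' "$(hex_le_to_ip "$hexip")" "0x${hexport}"
done | sort -u
```

Running this before and after the sampling window and diffing the output gives both the coverage and local-port-reuse signals.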
## Check 2 — Distribution across replicas

Method:

- Count unique `trace_id` values per downstream pod from logs
- Use `kubectl logs --since=${SAMPLE_SECONDS}s` (windowed read)

Why windowed logs:

- Before/after cumulative subtraction breaks under log rotation
- `--since` avoids negative-count artifacts
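The windowed count can be sketched as below; the `app=voltnet-pricing` label selector and the `trace_id=<hex>` log format are assumptions about the real log shape, not taken from the script:

```shell
# Count unique trace_ids per pod within the sampling window.
SAMPLE_SECONDS="${SAMPLE_SECONDS:-60}"
for pod in $(kubectl get pods -l app=voltnet-pricing -o name); do
  n=$(kubectl logs "${pod}" --since="${SAMPLE_SECONDS}s" |
        grep -o 'trace_id=[0-9a-f]*' | sort -u | wc -l)
  echo "${pod} ${n}"
done
```

With real round-robin, the per-pod counts should be roughly equal; a single pod absorbing nearly all traffic points back to `pick_first`.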
## Probe script learning (probe-rpc-connectivity.sh)

The probe validates TCP reachability from inside the source pod.

For each service pod:

- read the `*_RPC`/`*_GRPC` env vars
- parse `host:port`
- run `nc -zw3 host port`
  - `nc -z` is a zero-I/O connect test (open a TCP connection, then close it); `-w3` sets a 3 s timeout
- exit `0` → `OK` (TCP handshake succeeded); non-zero → `FAIL` (refused/timeout)
**Warning:** This proves network-path reachability only. It does not prove gRPC application health, TLS correctness, or protocol-level compatibility.
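The probe's core can be sketched as follows; only the `nc -zw3` semantics come from the script, while the env-var name, port, and `probe` helper are illustrative:

```shell
# Zero-I/O TCP connect probe: nc -z opens and immediately closes the
# connection, -w3 bounds the attempt to 3 seconds.
probe() {
  if nc -z -w3 "$1" "$2" 2>/dev/null; then
    echo "OK   $1:$2"                   # TCP handshake succeeded
  else
    echo "FAIL $1:$2"                   # refused, unreachable, or timed out
  fi
}

PRICING_GRPC="voltnet-pricing:50051"    # assumed value of a *_GRPC env var
probe "${PRICING_GRPC%%:*}" "${PRICING_GRPC##*:}"
```

The `%%:*`/`##*:` parameter expansions split `host:port` without spawning extra processes.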
## Notable findings

- A wrong omni image can silently revert behavior to `pick_first`; a correct rollout is required before judging connection coverage or request distribution
- With `ENABLE_RATE_LIMITING=true` under high RPS, requests may be rejected before downstream gRPC calls are made; disable rate limiting for LB verification runs
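A preflight for both pitfalls might look like this; the deployment name and the use of a plain environment variable for the rate-limit toggle are assumptions:

```shell
# Confirm which image is actually rolled out before trusting any
# LB measurements, then disable rate limiting for the verification run.
# "deploy/omni" and the env-var mechanism are assumptions.
kubectl get deploy omni \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' || true
kubectl set env deploy/omni ENABLE_RATE_LIMITING=false || true
kubectl rollout status deploy/omni || true
```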
## Takeaways

- Headless Services + `dns:///` + `round_robin` + a pooled `ClientConn` give practical client-side LB in gRPC
- Reuse should be proven with local-port stability, not count-only snapshots
- Distribution should be measured with time-windowed logs (`--since`) to avoid rotation artifacts
- TCP probes are useful preflight checks, but they are not gRPC health checks
## See also

- grpc-load-balancing — architecture and options in Kubernetes
- pod-management — `kubectl logs`/`kubectl exec` basics used by the verification script
- parallel-polling-per-entity — concurrency and parallelism pattern in RabbitMQ workers