Goal

Verify that omni uses real client-side round-robin (not pick_first) after moving downstream services to headless Kubernetes services.

Setup

Five downstream services were switched to headless (clusterIP: None):

  • voltnet-iam
  • voltnet-charge-points
  • voltnet-charging-sessions
  • voltnet-pricing
  • voltnet-broccoli

On omni, the pooled path in pkg/grpc/registry.go now uses:

  • DialGRPCWithOptions(ctx, addr, EnableClientSideLoadBalancing())

EnableClientSideLoadBalancing() enables:

  1. dns:/// resolver (not one-shot OS resolution)
  2. round_robin policy (sub-conn per resolved pod IP)
  3. keepalive (10s ping, 5s timeout)

The registry caches one ClientConn per downstream address, so these sub-connections are long-lived and reused.
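
As a rough sketch, the options this helper is described as enabling map onto grpc-go roughly as follows; the service-config string, the insecure credentials, and the exact keepalive values below are assumptions for illustration, not the helper's real contents.

// Package lbsketch illustrates, approximately, the dial options that
// EnableClientSideLoadBalancing() is described as enabling.
package lbsketch

import (
    "context"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/keepalive"
)

// DialWithClientSideLB dials addr with the three behaviors listed above.
func DialWithClientSideLB(ctx context.Context, addr string) (*grpc.ClientConn, error) {
    return grpc.DialContext(ctx,
        // dns:/// keeps an active resolver instead of a one-shot OS lookup;
        // against a headless Service it yields one A record per pod.
        "dns:///"+addr,
        // round_robin opens one sub-connection per resolved pod IP.
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
        // keepalive keeps the pooled sub-connections alive (10s ping, 5s timeout).
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:                10 * time.Second,
            Timeout:             5 * time.Second,
            PermitWithoutStream: true,
        }),
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
}

grpc-go establishes these sub-connections lazily, so they only show up in the socket table once RPCs are actually flowing.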

Verification approach

Two checks were run during a sampling window (default 60s) while k6 generated concurrent traffic through omni.

flowchart LR
    A[k6 load to omni endpoint] --> B[verify-grpc-lb.sh]
    B --> C[Check 1: /proc/net/tcp coverage + local-port reuse]
    B --> D[Check 2: pod log trace_id distribution]

Check 1 — Connection reuse and coverage

Method:

  • Read omni pod /proc/net/tcp (container has no ss)
    • /proc/net/tcp is the kernel socket table snapshot for that network namespace
    • each row includes local address/port, remote address/port, and TCP state
  • Filter ESTABLISHED (state=01)
    • 01 means TCP handshake is complete and traffic can flow
    • this excludes SYN_SENT, TIME_WAIT, and other transient states that would add noise
  • Convert hex little-endian remote IP to dotted decimal
  • For each headless target service:
    • list running pod IPs
    • snapshot peer pod IPs + local ports before window
    • snapshot again after window
    • compare

Note

The addresses in /proc/net/tcp are hex + little-endian (e.g. 0100007F = 127.0.0.1), so decoding is required before matching against pod IPs.
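
A minimal sketch of that read path, assuming the standard /proc/net/tcp field layout (sl, local_address, rem_address, st, ...); the function names are illustrative and not taken from verify-grpc-lb.sh.

// Scan /proc/net/tcp, keep only ESTABLISHED rows (state 01), and decode
// each address from hex so peers can be matched against downstream pod IPs.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// decodeAddr converts a field such as "0100007F:1F90" into ("127.0.0.1", 8080).
// The 32-bit IP is printed in little-endian byte order, so its bytes are
// reversed; the port is plain hex and needs no swapping.
func decodeAddr(field string) (string, int, error) {
    parts := strings.SplitN(field, ":", 2)
    if len(parts) != 2 {
        return "", 0, fmt.Errorf("malformed address %q", field)
    }
    ipRaw, err := strconv.ParseUint(parts[0], 16, 32)
    if err != nil {
        return "", 0, err
    }
    port, err := strconv.ParseUint(parts[1], 16, 16)
    if err != nil {
        return "", 0, err
    }
    ip := fmt.Sprintf("%d.%d.%d.%d",
        byte(ipRaw), byte(ipRaw>>8), byte(ipRaw>>16), byte(ipRaw>>24))
    return ip, int(port), nil
}

func main() {
    f, err := os.Open("/proc/net/tcp")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    sc.Scan() // skip the header row
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        // Row layout: sl, local_address, rem_address, st, ...
        if len(fields) < 4 || fields[3] != "01" { // 01 = ESTABLISHED
            continue
        }
        _, localPort, err := decodeAddr(fields[1])
        if err != nil {
            continue
        }
        peerIP, peerPort, err := decodeAddr(fields[2])
        if err != nil {
            continue
        }
        // localPort feeds the before/after reuse check; peerIP is matched
        // against the headless service's running pod IPs for coverage.
        fmt.Printf("local_port=%d peer=%s:%d\n", localPort, peerIP, peerPort)
    }
}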

Success criteria:

  • Coverage: all pod IPs appear as peers
  • Reuse: same local ports before/after (same underlying TCP connections)

Tip

Why local port equality matters: same local port before/after is strong proof of connection reuse. A stable connection count alone can hide close-and-reopen churn.
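
Expressed as code, the reuse criterion is just a set comparison keyed on (local port, peer IP); the type and function names below are illustrative.

package main

import "fmt"

// connKey identifies one pooled connection from omni's point of view:
// the local (ephemeral) port plus the downstream pod IP it is bound to.
type connKey struct {
    localPort int
    peerIP    string
}

// reused reports whether every connection seen before the window is still
// present after it: same local port, same peer, hence the same underlying
// TCP connection. A count-only comparison would miss close-and-reopen churn.
func reused(before, after map[connKey]bool) bool {
    for k := range before {
        if !after[k] {
            return false
        }
    }
    return true
}

func main() {
    before := map[connKey]bool{{38412, "10.1.2.7"}: true}
    after := map[connKey]bool{{38412, "10.1.2.7"}: true}
    fmt.Println(reused(before, after)) // true
}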

Check 2 — Distribution across replicas

Method:

  • Count unique trace_id per downstream pod from logs
  • Use kubectl logs --since=${SAMPLE_SECONDS}s (windowed read)

Why windowed logs:

  • Before/after cumulative subtraction breaks under log rotation
  • --since avoids negative-count artifacts
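
Assuming the downstream pods log a trace_id=<hex> field, the per-pod count reduces to a small filter over the windowed log read; the regexp here is an assumption about the log format.

// Reads log lines on stdin and prints the number of distinct trace_id
// values seen, i.e. that pod's share of requests in the sampling window.
package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`trace_id=([0-9a-f]+)`) // assumed log format
    seen := map[string]bool{}
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        if m := re.FindStringSubmatch(sc.Text()); m != nil {
            seen[m[1]] = true
        }
    }
    fmt.Println(len(seen))
}

Piped from kubectl logs --since=60s <pod>, the printed number is that pod's unique-request count for the window.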

Probe script learning (probe-rpc-connectivity.sh)

The probe validates TCP reachability from inside the source pod.

For each service pod:

  1. read *_RPC / *_GRPC env vars
  2. parse host:port
  3. run nc -zw3 host port

nc -z = zero-I/O connect test (open TCP then close), -w3 = 3s timeout.

  • exit 0 → OK (TCP handshake succeeded)
  • non-zero → FAIL (refused/timeout)
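
For reference, the same zero-I/O connect test expressed in Go rather than nc; the service host and port below are placeholders.

// probeTCP mirrors `nc -zw3 host port`: open a TCP connection with a
// 3-second timeout, then close it without sending any data.
package main

import (
    "fmt"
    "net"
    "os"
    "time"
)

func probeTCP(host, port string) error {
    conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, port), 3*time.Second)
    if err != nil {
        return err // refused or timed out
    }
    return conn.Close()
}

func main() {
    // Placeholder target; the real script reads host:port from *_RPC / *_GRPC env vars.
    if err := probeTCP("voltnet-iam", "50051"); err != nil {
        fmt.Fprintln(os.Stderr, "FAIL:", err)
        os.Exit(1)
    }
    fmt.Println("OK")
}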

Warning

This proves network-path reachability only. It does not prove gRPC app health, TLS correctness, or protocol-level compatibility.

Notable findings

  • A wrong omni image can silently revert behavior to pick_first
  • Correct rollout is required before judging connection coverage or request distribution
  • With ENABLE_RATE_LIMITING=true under high RPS, requests may be rejected before downstream gRPC calls; disable rate limiting for LB verification runs

Takeaways

  1. Headless + dns:/// + round_robin + pooled ClientConn gives practical client-side LB in gRPC
  2. Reuse should be proven with local-port stability, not count-only snapshots
  3. Distribution should be measured with time-windowed logs (--since) to avoid rotation artifacts
  4. TCP probes are useful preflight checks, but they are not gRPC health checks

See also