## Goal
Verify that omni uses real client-side round-robin (not pick_first) after moving downstream services to headless Kubernetes services.
## Setup

Five downstream services were switched to headless (`clusterIP: None`):

- `voltnet-iam`
- `voltnet-charge-points`
- `voltnet-charging-sessions`
- `voltnet-pricing`
- `voltnet-broccoli`
On omni, the pooled path in `pkg/grpc/registry.go` now uses:

```go
DialGRPCWithOptions(ctx, addr, EnableClientSideLoadBalancing())
```

`EnableClientSideLoadBalancing()` enables:

- the `dns:///` resolver (not one-shot OS resolution)
- the `round_robin` policy (one sub-connection per resolved pod IP)
- keepalive (`10s` ping, `5s` timeout)

The registry caches one `ClientConn` per downstream address, so these sub-connections are long-lived and reused.
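The headless setup can be spot-checked before any load testing: a headless Service resolves to one A record per ready pod, which is exactly what the `dns:///` resolver hands to `round_robin`. A minimal sketch, assuming the deployment is named `omni` and `getent` is available in the image:

```shell
# Each headless Service should resolve to one line per ready pod IP;
# a regular ClusterIP Service would print a single virtual IP instead.
# "deploy/omni" and getent availability are assumptions.
for svc in voltnet-iam voltnet-charge-points voltnet-charging-sessions \
           voltnet-pricing voltnet-broccoli; do
  echo "== ${svc} =="
  kubectl exec deploy/omni -- getent hosts "${svc}" || true  # tolerate dry runs
done
```

Seeing multiple lines per service here confirms the DNS side of the setup independently of gRPC.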
## Verification approach

Two checks were run during a sampling window (default 60 s) while k6 generated concurrent traffic through omni.

```mermaid
flowchart LR
    A["k6 load to omni endpoint"] --> B["verify-grpc-lb.sh"]
    B --> C["Check 1: /proc/net/tcp coverage + local-port reuse"]
    B --> D["Check 2: pod log trace_id distribution"]
```
## Check 1 — Connection reuse and coverage

Method:

- Read the omni pod's `/proc/net/tcp` (the container has no `ss`)
  - `/proc/net/tcp` is the kernel socket-table snapshot for that network namespace
  - each row includes local address/port, remote address/port, and TCP state
- Filter for `ESTABLISHED` (state `01`)
  - `01` means the TCP handshake is complete and traffic can flow
  - this excludes `SYN_SENT`, `TIME_WAIT`, and other transient states that would add noise
- Convert the hex little-endian remote IP to dotted decimal
- For each headless target service:
  - list running pod IPs
  - snapshot peer pod IPs + local ports before the window
  - snapshot again after the window
  - compare
**Note:** The addresses in `/proc/net/tcp` are hex and little-endian (e.g. `0100007F` = `127.0.0.1`), so decoding is required before matching against pod IPs.
Success criteria:

- Coverage: all pod IPs appear as peers
- Reuse: the same local ports before and after (the same underlying TCP connections)
**Tip — why local-port equality matters:** The same local port appearing before and after the window is strong evidence of connection reuse. A stable connection *count* alone can hide close-and-reopen churn.
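The parse-and-decode steps of Check 1 can be sketched in a few lines of shell. The helper reverses the little-endian byte order before the result is matched against pod IPs; column positions follow the standard `/proc/net/tcp` layout (this is a sketch, not the actual verify-grpc-lb.sh):

```shell
# Decode ESTABLISHED peers from /proc/net/tcp (hex, little-endian).
hex_le_to_ip() {                        # e.g. 0100007F -> 127.0.0.1
  local h=$1
  printf '%d.%d.%d.%d' "0x${h:6:2}" "0x${h:4:2}" "0x${h:2:2}" "0x${h:0:2}"
}

# Column 3 is rem_address (ip:port), column 4 is st; 01 = ESTABLISHED.
awk '$4 == "01" {print $3}' /proc/net/tcp |
while IFS=: read -r hexip hexport; do
  printf '%s:%d\n' "$(hex_le_to_ip "$hexip")" "0x${hexport}"
done | sort -u
```

Running this before and after the sampling window and diffing the output gives both the coverage and local-port-reuse signals.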
## Check 2 — Distribution across replicas

Method:

- Count unique `trace_id` values per downstream pod from logs
- Use `kubectl logs --since=${SAMPLE_SECONDS}s` (windowed read)

Why windowed logs:

- Before/after cumulative subtraction breaks under log rotation
- `--since` avoids negative-count artifacts
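The windowed count can be sketched as below; the `app=voltnet-pricing` label selector and the `trace_id=<hex>` log format are assumptions about the real log shape, not taken from the script:

```shell
# Count unique trace_ids per pod within the sampling window.
SAMPLE_SECONDS="${SAMPLE_SECONDS:-60}"
for pod in $(kubectl get pods -l app=voltnet-pricing -o name); do
  n=$(kubectl logs "${pod}" --since="${SAMPLE_SECONDS}s" |
        grep -o 'trace_id=[0-9a-f]*' | sort -u | wc -l)
  echo "${pod} ${n}"
done
```

With real round-robin, the per-pod counts should be roughly equal; a single pod absorbing nearly all traffic points back to `pick_first`.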
## Probe script learning (probe-rpc-connectivity.sh)

The probe validates TCP reachability from inside the source pod.

For each service pod:

- read the `*_RPC`/`*_GRPC` env vars
- parse `host:port`
- run `nc -zw3 host port`
  - `nc -z` is a zero-I/O connect test (open a TCP connection, then close it); `-w3` sets a 3 s timeout
- exit `0` → `OK` (TCP handshake succeeded); non-zero → `FAIL` (refused/timeout)
**Warning:** This proves network-path reachability only. It does not prove gRPC application health, TLS correctness, or protocol-level compatibility.
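The probe's core can be sketched as follows; only the `nc -zw3` semantics come from the script, while the env-var name, port, and `probe` helper are illustrative:

```shell
# Zero-I/O TCP connect probe: nc -z opens and immediately closes the
# connection, -w3 bounds the attempt to 3 seconds.
probe() {
  if nc -z -w3 "$1" "$2" 2>/dev/null; then
    echo "OK   $1:$2"                   # TCP handshake succeeded
  else
    echo "FAIL $1:$2"                   # refused, unreachable, or timed out
  fi
}

PRICING_GRPC="voltnet-pricing:50051"    # assumed value of a *_GRPC env var
probe "${PRICING_GRPC%%:*}" "${PRICING_GRPC##*:}"
```

The `%%:*`/`##*:` parameter expansions split `host:port` without spawning extra processes.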
## Notable findings

- A wrong omni image can silently revert behavior to `pick_first`; a correct rollout is required before judging connection coverage or request distribution
- With `ENABLE_RATE_LIMITING=true` under high RPS, requests may be rejected before downstream gRPC calls are made; disable rate limiting for LB verification runs
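A preflight for both pitfalls might look like this; the deployment name and the use of a plain environment variable for the rate-limit toggle are assumptions:

```shell
# Confirm which image is actually rolled out before trusting any
# LB measurements, then disable rate limiting for the verification run.
# "deploy/omni" and the env-var mechanism are assumptions.
kubectl get deploy omni \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}' || true
kubectl set env deploy/omni ENABLE_RATE_LIMITING=false || true
kubectl rollout status deploy/omni || true
```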
## Takeaways

- Headless Services + `dns:///` + `round_robin` + a pooled `ClientConn` give practical client-side LB in gRPC
- Reuse should be proven with local-port stability, not count-only snapshots
- Distribution should be measured with time-windowed logs (`--since`) to avoid rotation artifacts
- TCP probes are useful preflight checks, but they are not gRPC health checks
## See also

- grpc-load-balancing — architecture and options in Kubernetes
- pod-management — `kubectl logs`/`kubectl exec` basics used by the verification script
- parallel-polling-per-entity — concurrency and parallelism pattern in RabbitMQ workers