UPDATE ... RETURNING * changes the transaction round-trip in a way that interacts badly with TCP stalls. A plain UPDATE commits and closes in one round-trip with no result to send back. With RETURNING *, PostgreSQL must assemble the result row and stream it to the client before the connection resolves — creating a window where a silently dead TCP connection can hold the session open for tens of minutes.
How the stall propagates
- PostgreSQL executes the
UPDATE, writes theRETURNINGresult to its kernel TCP send buffer, and commits — releasing all row locks at this point. - If the TCP connection is dead, the data never reaches the client. PostgreSQL has no way to detect this.
- Because there is unacknowledged data in the send buffer, the kernel enters TCP retransmission mode (not keepalive — keepalive only fires on idle connections with no data in flight). The kernel retransmits with exponential backoff, governed by
tcp_retries2(default: 15). With AWS VPC RTTs this sums to tens of minutes before the kernel gives up. - During that window the session stays visible in Performance Insights even though it holds no row locks and is doing no work.
sequenceDiagram participant G as gocpi participant PG as PostgreSQL G->>PG: UPDATE entity_party SET ... RETURNING * PG->>PG: Execute UPDATE, commit, release row locks PG-->>G: Write RETURNING result to TCP send buffer Note over G,PG: TCP connection silently dies Note over PG: Kernel retransmits with exponential backoff<br/>(tcp_retries2=15) — tens of minutes before giving up Note over G: Goroutine hangs waiting<br/>for response that never arrives
Lock:transactionid in Performance Insights
Every open PostgreSQL session holds an internal lock on its own transaction ID (Lock:transactionid) for bookkeeping. A session stuck waiting for TCP resolution keeps this lock alive — it surfaces in Performance Insights as Lock:transactionid with 0 rows/sec and near-zero calls/sec.
Lock:transactionid means different things depending on granted in pg_locks:
granted | Meaning |
|---|---|
true | The transaction’s normal self-lock — every transaction has this. Near-zero activity alongside it → zombie session. |
false | Waiting for a row-level lock held by another transaction. PostgreSQL’s mechanism: to wait for Txn A to finish, Txn B attempts a ShareLock on Txn A’s transactionid, which blocks until Txn A commits or rolls back. |
The TCP stall incident was granted = true — a zombie session holding its own self-lock, not blocking any other writer.
Warning
This is not a row lock.
Lock:transactionidin Performance Insights does not mean another writer is blocked. The incident was confirmed not a row lock: a second writer updated the same row 11 minutes into the stall and completed in 100ms.
Tip
How to distinguish TCP stall from real lock contention — check whether other writers on the same row are also blocked. If they complete normally, the session is a zombie, not a blocker.
Why removing RETURNING fixes it
Without RETURNING, PostgreSQL has nothing to send back. The transaction commits and fully resolves in a single round-trip. A dead TCP connection at that point leaves nothing open on the PostgreSQL side.
Context timeout fixes the goroutine, not the session
Adding a Go context timeout makes the goroutine give up after N seconds, but PostgreSQL still holds the session open until TCP keepalive fires. The zombie session in Performance Insights persists either way. The primary fix must be on the DB side — removing RETURNING eliminates the result payload that creates the stall window, making the context timeout largely moot from PostgreSQL’s perspective.
Note
Two independent problems — goroutine blocking (Go side) and zombie session (PostgreSQL side). A context timeout fixes only the former; removing
RETURNINGfixes both.
Why not idle_in_transaction_session_timeout on RDS
This is a global parameter — it would terminate any long-running transaction across all services sharing the instance, including legitimate slow queries. Prefer scoped fixes (remove RETURNING, add a per-query context timeout) over a global setting on a shared RDS instance.
See also
- gorm-returning-clause — the GORM-level fix: removing
clause.Returning{}and adding a context timeout - PostgreSQL docs: pg_locks
- PostgreSQL mailing list: What is locktype=transactionid?
- tcp(7) - Linux manual page —
tcp_retries2and TCP retransmission behaviour