Bug Description
The initial symptom of this bug was that pools were not at full capacity, yet many clients were getting kicked back with PoolTimeout errors. We also noticed total_auth_attempts was very high (20k/s).
Checking netstat, we saw that a large number (5k+) of connections were being created per second from a single node and destroyed in quick succession. Most of the connections were in FIN_WAIT (i.e. the application had already closed the socket and the operating system was in the process of cleaning it up).
Further investigation and some eyeballing of the code led us to acquire() in inner.rs.
There is a single big timeout wrapped around the whole sequence: acquiring the semaphore, taking an idle connection, and creating a new connection.
What was happening in our scenario was that there were thousands of clients waiting on the semaphore. If the acquire timeout happened to land right around the time a waiter finally got its permit, the client would take an idle connection or start a new one, then hit the acquire timeout anyway, cancelling the task and dropping the connection it had just created or taken from the idle pool.
Under heavy pressure, and assuming there is some fairness on the semaphore, once this starts happening it keeps happening: the waiters closest to their acquire timeout are exactly the ones at the front of the queue, so each permit is handed to a client that is about to give up.
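To make the shape of the problem concrete, here is a minimal sketch. It is not the actual sqlx code, just tokio primitives with placeholder names (Conn, take_idle_or_connect), showing a single timeout wrapping both the semaphore wait and the connection setup:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::Semaphore;
use tokio::time::timeout;

struct Conn;

// Stand-in for the expensive part: dialing Postgres and authenticating,
// or pulling a connection out of the idle queue.
async fn take_idle_or_connect() -> Conn {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Conn
}

// One timeout wraps BOTH the semaphore wait and the connection setup. A
// waiter that spends almost the whole budget queued on the semaphore can
// get its permit, start (or finish) establishing a connection, and then be
// cancelled by that same timeout, dropping the connection it just obtained.
async fn acquire(semaphore: Arc<Semaphore>, acquire_timeout: Duration) -> Option<Conn> {
    timeout(acquire_timeout, async move {
        let _permit = semaphore.acquire().await.ok()?;
        Some(take_idle_or_connect().await)
    })
    .await
    .ok()
    .flatten()
}
```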
A quick fix we made was to separate the permit acquisition from the connection acquisition, so each phase gets its own timeout. In the worst case this can mean waiting up to 3x the acquire timeout, but that is preferable to hammering our Postgres instances.
https://github.com/pinecone-io/sqlx-fork/blob/repro/sqlx-core/src/pool/inner.rs#L247
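Sketched in the same simplified terms as the snippet above (again with placeholder names, and collapsing the idle-take and connect steps into one), the workaround looks roughly like this:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::Semaphore;
use tokio::time::timeout;

struct Conn;

async fn take_idle_or_connect() -> Conn {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Conn
}

// Each phase gets its own timeout. A waiter that was queued on the semaphore
// for almost the whole budget still gets a fresh budget for the connection
// setup, so it no longer tears down a connection it only just obtained.
// (The real change times the semaphore, idle-take, and connect phases
// separately, which is where the "up to 3x" worst case comes from.)
async fn acquire_split(semaphore: Arc<Semaphore>, acquire_timeout: Duration) -> Option<Conn> {
    // Phase 1: wait for a permit, bounded by its own timeout.
    let _permit = timeout(acquire_timeout, semaphore.acquire())
        .await
        .ok()?   // timed out waiting for the permit
        .ok()?;  // semaphore closed

    // Phase 2: take an idle connection or open a new one, with a separate timeout.
    timeout(acquire_timeout, take_idle_or_connect()).await.ok()
}
```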
Minimal Reproduction
There is code to reproduce the condition here: https://github.com/pinecone-io/sqlx-fork/blob/repro/tests/repro.rs
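If the linked test isn't handy, something along these lines should produce the same behaviour. This is not the linked repro; the connection string, pool size, task count, and timings are placeholder assumptions, and it expects a local Postgres to point tcpdump at:

```rust
// Assumed Cargo.toml dependencies:
//   tokio = { version = "1", features = ["full"] }
//   sqlx  = { version = "0.7", features = ["postgres", "runtime-tokio", "tls-rustls"] }
use std::time::Duration;

use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(10)
        // A tight acquire timeout makes the race much easier to hit.
        .acquire_timeout(Duration::from_millis(250))
        .connect("postgres://postgres:password@localhost:5432/postgres")
        .await?;

    // Far more tasks than connections, so most waiters sit on the semaphore
    // until they are close to their acquire deadline.
    for _ in 0..1_000 {
        let pool = pool.clone();
        tokio::spawn(async move {
            loop {
                match pool.acquire().await {
                    Ok(mut conn) => {
                        // Hold the connection briefly to keep the pool saturated.
                        let _ = sqlx::query("SELECT pg_sleep(0.05)")
                            .execute(&mut *conn)
                            .await;
                    }
                    Err(err) => eprintln!("acquire failed: {err}"),
                }
            }
        });
    }

    // Watch the pool counters here (or tcpdump/netstat in another terminal).
    loop {
        println!("pool size: {} (idle: {})", pool.size(), pool.num_idle());
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```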
Run the test for a few minutes and you'll eventually start to see the pool size flap around the maximum. Counting TCP connection initiations with tcpdump will show there are a lot of them. You won't see too many FIN_WAITs in netstat if Postgres is local, as the same kernel owns both ends of the connection.

tcpdump -i lo 'tcp[tcpflags] & (tcp-syn) != 0 and dst port 5432'
Info

SQLx version: 0.7.3
SQLx features enabled: uuid, chrono, postgres, sqlite, tokio-rustls
Database server and version: Postgres (Aurora engine version 14.8)
Operating system: Linux 6.5.0-1014-aws
rustc --version: rustc 1.75.0-nightly