Bug Description
The initial symptom of this bug was that pools were not at full capacity, yet many clients were getting kicked back with PoolTimeout errors. We also noticed total_auth_attempts was very high (20k/s).
Checking netstat, we saw that a large number (5k+) of connections were being created per second from a single node and destroyed in quick succession. Most of the connections were in FIN_WAIT (i.e. the application had already closed the socket and the operating system was in the process of cleaning it up).
Further investigation and some eyeballing of the code led us to acquire() in inner.rs.
There is a single big timeout wrapped around the whole sequence: acquiring the semaphore, taking an idle connection, and creating a new connection.
What was happening in our scenario was that there were thousands of clients waiting on the semaphore. If the acquire timeout happened to land right around the time a waiter finally got its permit, the client would take an idle connection or start a new one, then hit the acquire timeout anyway, cancelling the task and dropping the connection it had just created or taken from the idle pool.
Under heavy pressure, and assuming there is some fairness on the semaphore, once this starts happening it keeps happening: the waiters closest to their acquire timeout are exactly the ones at the front of the queue, so each permit is handed to a client that is about to give up.
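To make the shape of the problem concrete, here is a minimal sketch. It is not the actual sqlx code, just tokio primitives with placeholder names (Conn, take_idle_or_connect), showing a single timeout wrapping both the semaphore wait and the connection setup:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::Semaphore;
use tokio::time::timeout;

struct Conn;

// Stand-in for the expensive part: dialing Postgres and authenticating,
// or pulling a connection out of the idle queue.
async fn take_idle_or_connect() -> Conn {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Conn
}

// One timeout wraps BOTH the semaphore wait and the connection setup. A
// waiter that spends almost the whole budget queued on the semaphore can
// get its permit, start (or finish) establishing a connection, and then be
// cancelled by that same timeout, dropping the connection it just obtained.
async fn acquire(semaphore: Arc<Semaphore>, acquire_timeout: Duration) -> Option<Conn> {
    timeout(acquire_timeout, async move {
        let _permit = semaphore.acquire().await.ok()?;
        Some(take_idle_or_connect().await)
    })
    .await
    .ok()
    .flatten()
}
```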
A quick fix we made was to separate the permit acquisition from the connection acquisition, so each phase gets its own timeout. In the worst case this can mean waiting up to 3x the acquire timeout, but that is preferable to hammering our Postgres instances.
https://github.com/pinecone-io/sqlx-fork/blob/repro/sqlx-core/src/pool/inner.rs#L247
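Sketched in the same simplified terms as the snippet above (again with placeholder names, and collapsing the idle-take and connect steps into one), the workaround looks roughly like this:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::Semaphore;
use tokio::time::timeout;

struct Conn;

async fn take_idle_or_connect() -> Conn {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Conn
}

// Each phase gets its own timeout. A waiter that was queued on the semaphore
// for almost the whole budget still gets a fresh budget for the connection
// setup, so it no longer tears down a connection it only just obtained.
// (The real change times the semaphore, idle-take, and connect phases
// separately, which is where the "up to 3x" worst case comes from.)
async fn acquire_split(semaphore: Arc<Semaphore>, acquire_timeout: Duration) -> Option<Conn> {
    // Phase 1: wait for a permit, bounded by its own timeout.
    let _permit = timeout(acquire_timeout, semaphore.acquire())
        .await
        .ok()?   // timed out waiting for the permit
        .ok()?;  // semaphore closed

    // Phase 2: take an idle connection or open a new one, with a separate timeout.
    timeout(acquire_timeout, take_idle_or_connect()).await.ok()
}
```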
Minimal Reproduction
There is code to reproduce the condition here: https://github.com/pinecone-io/sqlx-fork/blob/repro/tests/repro.rs
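If the linked test isn't handy, something along these lines should produce the same behaviour. This is not the linked repro; the connection string, pool size, task count, and timings are placeholder assumptions, and it expects a local Postgres to point tcpdump at:

```rust
// Assumed Cargo.toml dependencies:
//   tokio = { version = "1", features = ["full"] }
//   sqlx  = { version = "0.7", features = ["postgres", "runtime-tokio", "tls-rustls"] }
use std::time::Duration;

use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(10)
        // A tight acquire timeout makes the race much easier to hit.
        .acquire_timeout(Duration::from_millis(250))
        .connect("postgres://postgres:password@localhost:5432/postgres")
        .await?;

    // Far more tasks than connections, so most waiters sit on the semaphore
    // until they are close to their acquire deadline.
    for _ in 0..1_000 {
        let pool = pool.clone();
        tokio::spawn(async move {
            loop {
                match pool.acquire().await {
                    Ok(mut conn) => {
                        // Hold the connection briefly to keep the pool saturated.
                        let _ = sqlx::query("SELECT pg_sleep(0.05)")
                            .execute(&mut *conn)
                            .await;
                    }
                    Err(err) => eprintln!("acquire failed: {err}"),
                }
            }
        });
    }

    // Watch the pool counters here (or tcpdump/netstat in another terminal).
    loop {
        println!("pool size: {} (idle: {})", pool.size(), pool.num_idle());
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```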
Run the test for a few minutes and you'll eventually start to see the pool size flap around the maximum. Counting TCP connection initiations with tcpdump will show there are a lot of them. You won't see too many FIN_WAITs in netstat if Postgres is local, as the same kernel owns both ends of the connection.

tcpdump -i lo 'tcp[tcpflags] & (tcp-syn) != 0 and dst port 5432'
Info

SQLx version: 0.7.3
SQLx features enabled: uuid, chrono, postgres, sqlite, tokio-rustls
Database server and version: Postgres (Aurora engine version 14.8)
Operating system: Linux 6.5.0-1014-aws
rustc --version: rustc 1.75.0-nightly