Pool creates many connections when under high contention #3132

Open
ivankelly opened this issue Mar 18, 2024 · 1 comment · May be fixed by #3582
ivankelly commented Mar 18, 2024

Bug Description

The initial symptom of this bug was that pools were not at full capacity, yet many clients were being kicked back with PoolTimeout errors.
We also noticed that total_auth_attempts was very high (20k/s).
Checking netstat, it appeared that a large number of connections (5k+) were being created per second from a single node and destroyed in quick succession. Most of the connections were in FIN_WAIT (i.e. the application had already closed the socket and the operating system was in the process of cleaning it up).

Further investigation and eyeballing of the code led us to acquire() in inner.rs.

There's a single overall timeout around acquiring the semaphore, taking an idle connection, and creating a new connection.
What was happening in our scenario was that there were thousands of clients waiting on the semaphore. If a client's wait on the semaphore consumed almost the whole acquire timeout, it would still get the permit and either take an idle connection or start a new one; the acquire would then time out, cancelling the task and dropping the connection it had just created or taken from the idle pool.
Under heavy pressure, assuming the semaphore has some fairness, once this starts happening it keeps happening, because the waiters closest to their acquire timeout are exactly the ones at the front of the queue.
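
To make the shape of the problem concrete, here is a minimal sketch (not the actual sqlx code; `Connection` and `connect()` are hypothetical stand-ins for the real pool internals in inner.rs):

```rust
use std::time::Duration;
use tokio::sync::Semaphore;

// Hypothetical stand-ins for the real pool internals.
struct Connection;

async fn connect() -> Option<Connection> {
    // Pretend this opens a TCP connection and authenticates against Postgres.
    tokio::time::sleep(Duration::from_millis(50)).await;
    Some(Connection)
}

// One timeout wraps both the semaphore wait and the connect. A task that
// spends almost the whole budget queued on the semaphore can still win a
// permit, open a fresh connection, and then be cancelled by the same
// timeout, dropping the connection it just created.
async fn acquire_single_timeout(
    semaphore: &Semaphore,
    acquire_timeout: Duration,
) -> Option<Connection> {
    tokio::time::timeout(acquire_timeout, async {
        let _permit = semaphore.acquire().await.ok()?; // may eat nearly the whole budget
        connect().await                                // cancelled here => connection dropped
    })
    .await
    .ok()
    .flatten()
}
```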

A quick fix we made was to separate the permit acquisition from the connection acquisition. This may mean waiting up to 3x the acquire timeout, but that's preferable to hammering our Postgres instances.
https://github.com/pinecone-io/sqlx-fork/blob/repro/sqlx-core/src/pool/inner.rs#L247
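
To illustrate the idea only (this is not the exact patch at that link, and it reuses the hypothetical `Semaphore`/`connect`/`Connection` stubs from the sketch above), the separation looks roughly like:

```rust
// Each phase gets its own budget, so a waiter that burned its time on the
// semaphore fails *before* it creates a connection it can no longer use.
// The real acquire path also has a "take an idle connection" phase in
// between, which is where the "3x the acquire timeout" worst case comes from.
async fn acquire_split_timeouts(
    semaphore: &Semaphore,
    acquire_timeout: Duration,
) -> Option<Connection> {
    // Phase 1: wait for a permit, bounded on its own.
    let _permit = tokio::time::timeout(acquire_timeout, semaphore.acquire())
        .await
        .ok()?  // timed out waiting for the permit
        .ok()?; // semaphore closed

    // Phase 2: only now spend time opening a connection, under a fresh budget.
    tokio::time::timeout(acquire_timeout, connect())
        .await
        .ok()
        .flatten()
}
```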

Minimal Reproduction

There is code to repro the condition here: https://github.com/pinecone-io/sqlx-fork/blob/repro/tests/repro.rs

Run the test for a few minutes and you'll eventually see the pool size flap around the maximum. Counting TCP connection initiations with tcpdump will show there are a lot of them. You won't see many FIN_WAITs in netstat if Postgres is local, as the same kernel owns both ends of the connection.

tcpdump -i lo 'tcp[tcpflags] & (tcp-syn) != 0 and dst port 5432'
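
For reference, a harness in the same spirit as the linked repro looks roughly like the sketch below (the connection string, task count, timings, and query here are illustrative, not copied from the repro):

```rust
use std::time::Duration;
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Small pool, short acquire timeout, so most waiters sit on the
    // semaphore close to their deadline.
    let pool = PgPoolOptions::new()
        .max_connections(10)
        .acquire_timeout(Duration::from_millis(500))
        .connect("postgres://postgres:password@localhost:5432/postgres")
        .await?;

    // Far more tasks than pool slots, all hammering acquire() in a loop.
    let mut tasks = Vec::new();
    for _ in 0..1_000 {
        let pool = pool.clone();
        tasks.push(tokio::spawn(async move {
            loop {
                // PoolTimeout errors are expected; the interesting signal is
                // the SYN rate on port 5432 and the pool size flapping.
                if let Ok(mut conn) = pool.acquire().await {
                    let _ = sqlx::query("SELECT pg_sleep(0.1)")
                        .execute(&mut *conn)
                        .await;
                }
            }
        }));
    }

    for task in tasks {
        let _ = task.await;
    }
    Ok(())
}
```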

Info

  • SQLx version: 0.7.3
  • SQLx features enabled: uuid, chrono, postgres, sqlite, tokio-rustls
  • Database server and version: Postgres aurora engine version 14.8
  • Operating system: Linux 6.5.0-1014-aws
  • rustc --version: rustc 1.75.0-nightly
ivankelly added the bug label on Mar 18, 2024
abonander (Collaborator) commented:
See also: #2848 #2854
