Unreferenced state group cleanup job in v1.126.0rc2 caused explosion in number of state group state rows #18217
Comments
I guess it's not obvious from your graph: how did you get your disk space back? Did you roll back to an earlier database backup, or did it actually self-recover afterwards?
Out of interest, how are you measuring this? As far as I understand it, this mechanism should not insert any rows into that table; quite the opposite (it deletes them). Hearing that the table grows is, therefore, perplexing.
It didn't recover; I deleted other stuff (and afterwards deleted HQ to try to free up more space). The part where it kept going down after recovering was Synapse continuing to add rows to `state_groups_state`. It flattened out when I cleared `state_groups_pending_deletion` and upgraded back to 1.126.
Very manually, by running …
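Presumably a row count against `state_groups_state`; a minimal sketch of such a check, assuming direct Postgres access to the Synapse database (the DSN is a placeholder, and this is illustrative only, not the query actually used):

```python
import psycopg2

# Connect to the Synapse database (placeholder DSN).
conn = psycopg2.connect("dbname=synapse user=synapse")
with conn, conn.cursor() as cur:
    # Exact count; this can be slow on a very large table.
    cur.execute("SELECT count(*) FROM state_groups_state")
    print(cur.fetchone()[0])
```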
I have to contradict myself: see `synapse/storage/databases/state/store.py`, lines 816 to 824 at commit `350e84a`:
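For context, the behaviour at that permalink is de-deltaing on deletion, roughly of this shape (a simplified sketch; `get_groups_deltaing_off` and `resolve_full_state` are hypothetical stand-ins, not Synapse's actual API, though the two table schemas are real):

```python
def dedelta_before_delete(txn, victim_group: int) -> None:
    """Sketch: before `victim_group` is deleted, any state group stored
    as a delta on top of it must be rewritten as full (non-delta) state."""
    for group in get_groups_deltaing_off(txn, victim_group):
        # Walk the delta chain to materialise the group's complete state.
        full_state = resolve_full_state(txn, group)

        # Detach the group from its parent and drop its delta rows.
        txn.execute(
            "DELETE FROM state_group_edges WHERE state_group = %s", (group,)
        )
        txn.execute(
            "DELETE FROM state_groups_state WHERE state_group = %s", (group,)
        )

        # Re-insert the state as full rows. For a large room this can be
        # tens of thousands of rows per group, which would explain the
        # growth of state_groups_state observed in this issue.
        txn.executemany(
            "INSERT INTO state_groups_state"
            " (state_group, room_id, type, state_key, event_id)"
            " VALUES (%s, %s, %s, %s, %s)",
            [
                (group, room_id, etype, state_key, event_id)
                for (room_id, etype, state_key, event_id) in full_state
            ],
        )
```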
This isn't meant to get triggered, because the state groups should be unreferenced. But do you see any …
Yes, loads of those. I thought that log line just meant it was cleaning up state groups.
In case it's relevant, the …
Thank you for that; I think this points to that behaviour as the probable cause. Based on what I can see: …
#18219 contains code that crafts a situation illustrating this. This is probably going to wait until Monday for someone who knows the intent behind this mechanism to weigh in / confirm I'm not making stuff up.
Out of interest: have you previously purged history, or had retention policies that would purge history?
I don't think so; I've only deleted entire rooms and used the state compressor for disk space management.
…s introduced in v1.126.0rc1), due to a suspected issue that causes increased disk usage. (#18222)

Revert "Add background job to clear unreferenced state groups (#18154)"

This mechanism is suspected of inserting large numbers of rows into `state_groups_state`, thus unreasonably increasing disk usage. See: #18217

This reverts commit 5121f92 (#18154).

Signed-off-by: Olivier 'reivilibre <[email protected]>
No longer a release blocker, as this was rolled back in rc3.
I believe the retention policies also have a similar issue: at random times I get runaway rooms (it's been like this for a while), so I have to hard-delete rooms and rejoin them.
Doing a …

This problem should not happen again with the new approach to deleting unreferenced state groups merged in #18254. It avoids deleting anything which would lead to de-deltaing.
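Presumably that means only deleting groups that nothing deltas off and that no event references, along the lines of this sketch (the table names are real Synapse schema; the function itself is illustrative, not the actual #18254 implementation):

```python
def is_safe_to_delete(txn, state_group: int) -> bool:
    """Sketch: a state group is only deletable if no other group is
    stored as a delta on top of it (so no de-deltaing is triggered)
    and no event still points at it."""
    txn.execute(
        "SELECT 1 FROM state_group_edges WHERE prev_state_group = %s LIMIT 1",
        (state_group,),
    )
    if txn.fetchone() is not None:
        # Deleting this group would force its children to be de-deltaed.
        return False
    txn.execute(
        "SELECT 1 FROM event_to_state_groups WHERE state_group = %s LIMIT 1",
        (state_group,),
    )
    return txn.fetchone() is None
```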
Description
Not quite sure how/why, but something about the unreferenced state group cleanup (which was enabled in #18154) seems to have caused the number of state groups, especially in HQ, to explode and use the entire disk.
The `_delete_state_groups_loop` background job was using 40% CPU and 100% database the entire time, according to Prometheus metrics. The number of rows in `state_groups_state` was growing by tens or hundreds of thousands per minute. The logs for `_delete_state_groups_loop-0-` didn't seem to mention HQ state groups specifically; it was cleaning up other groups the whole time.

I first downgraded to 1.125.0, but it didn't have any effect. Then I cleared `state_groups_pending_deletion` and upgraded back to 1.126.0rc2, which made the explosion stop.
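(For reference, clearing that table presumably came down to something like the following sketch; `state_groups_pending_deletion` is the table named above, the DSN is a placeholder, and the deletion loop should not be running at the time.)

```python
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Drop every scheduled deletion so the background loop finds no work.
    cur.execute("DELETE FROM state_groups_pending_deletion")
```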
Steps to reproduce
Homeserver
maunium.net
Synapse Version
1.126.0rc2
Installation Method
Docker (maunium/synapse)
Database
Postgres
Workers
Multiple workers
Platform
Docker on Ubuntu 22.04