Commit 1f88790

authored Mar 12, 2024
Prevent locking up while processing batched_auth_events (#16968)
This PR aims to fix #16895, caused by a regression in #7 and not fixed by #16903. PR #16903 only fixes a starvation issue, where the CPU isn't released. There is a second issue, where the execution is blocked. This theory is supported by the flame graphs provided in #16895 and by the fact that I see the CPU usage dropping far below the limit.

Since the changes in #7, the method `check_state_independent_auth_rules` is called with the additional parameter `batched_auth_events`:

https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/handlers/federation_event.py#L1741-L1743

It makes the execution enter this if clause, introduced with #15195:

https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/event_auth.py#L178-L189

There are two issues in the above code snippet.

First, there is the blocking issue. I'm not entirely sure if this is a deadlock, starvation, or something different. In the beginning, I thought the copy operation was responsible. It wasn't. Then I investigated the nested `store.get_events` inside the function `update`. This was also not causing the blocking issue. Only when I replaced the set difference operation (`-`) with a list comprehension was the blocking resolved. Creating and comparing sets with a very large number of events seems to be problematic.

This is how the flame graph looks now while persisting outliers. As you can see, the execution no longer locks up in the above function.

![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec)

Second, the copying here doesn't serve any purpose, because only a shallow copy is created. This means the same objects from the original dict are referenced, which defeats the intention of protecting these objects from mutation. The review of the original PR matrix-org/synapse#15195 had an extensive discussion about this matter.

Various approaches to copying the auth_events were attempted:

1) Implementing a deepcopy caused issues due to builtins.EventInternalMetadata not being pickleable.
2) Creating a dict with new objects akin to a deepcopy.
3) Creating a dict with new objects containing only necessary attributes.

In conclusion, there is no easy way to create an actual copy of the objects. Opting for a deepcopy can significantly strain memory and CPU resources, making it an inefficient choice. I don't see why the copy is necessary in the first place. Therefore I'm proposing to remove it altogether.

After these changes, I was able to successfully join these rooms without the main worker locking up:

- #synapse:matrix.org
- #element-android:matrix.org
- #element-web:matrix.org
- #ecips:matrix.org
- #ipfs-chatter:ipfs.io
- #python:matrix.org
- #matrix:matrix.org
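To illustrate the shallow-copy point from the commit message, here is a minimal sketch. `FakeEvent` is a hypothetical stand-in invented for this example, not Synapse's `EventBase`; the sketch only shows that `dict(...)` duplicates the mapping while the values remain shared.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an event object (not Synapse's EventBase)."""

    event_id: str
    content: Dict[str, str] = field(default_factory=dict)


batched = {"$a": FakeEvent("$a", {"membership": "join"})}

# dict(...) copies only the mapping; the values are the very same objects.
shallow = dict(batched)
assert shallow is not batched
assert shallow["$a"] is batched["$a"]

# Mutating an event through the copy is visible through the original.
shallow["$a"].content["membership"] = "leave"
assert batched["$a"].content["membership"] == "leave"

# Only changes to the mapping itself (adding or removing keys) are isolated.
shallow["$b"] = FakeEvent("$b")
assert "$b" not in batched
```

So the copy guards against keys being added to or removed from the original dict, but not against mutation of the event objects themselves.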
1 parent 48f59d3 commit 1f88790

2 files changed, +35 -9 lines changed

‎changelog.d/16968.bugfix

+1 line changed
@@ -0,0 +1 @@
+Prevent locking up when checking auth rules that are independent of room state for batched auth events. Contributed by @ggogel.

‎synapse/event_auth.py

+34 -9 lines changed
@@ -23,7 +23,20 @@
 import collections.abc
 import logging
 import typing
-from typing import Any, Dict, Iterable, List, Mapping, Optional, Set, Tuple, Union
+from typing import (
+    Any,
+    ChainMap,
+    Dict,
+    Iterable,
+    List,
+    Mapping,
+    MutableMapping,
+    Optional,
+    Set,
+    Tuple,
+    Union,
+    cast,
+)
 
 from canonicaljson import encode_canonical_json
 from signedjson.key import decode_verify_key_bytes
@@ -175,23 +188,35 @@ async def check_state_independent_auth_rules(
         return
 
     # 2. Reject if event has auth_events that: ...
+    auth_events: ChainMap[str, EventBase] = ChainMap()
     if batched_auth_events:
-        # Copy the batched auth events to avoid mutating them.
-        auth_events = dict(batched_auth_events)
-        needed_auth_event_ids = set(event.auth_event_ids()) - batched_auth_events.keys()
+        # batched_auth_events can become very large. To avoid repeatedly copying it, which
+        # would significantly impact performance, we use a ChainMap.
+        # batched_auth_events must be cast to MutableMapping because .new_child() requires
+        # this type. This casting is safe as the mapping is never mutated.
+        auth_events = auth_events.new_child(
+            cast(MutableMapping[str, "EventBase"], batched_auth_events)
+        )
+        needed_auth_event_ids = [
+            event_id
+            for event_id in event.auth_event_ids()
+            if event_id not in batched_auth_events
+        ]
         if needed_auth_event_ids:
-            auth_events.update(
+            auth_events = auth_events.new_child(
                 await store.get_events(
                     needed_auth_event_ids,
                     redact_behaviour=EventRedactBehaviour.as_is,
                     allow_rejected=True,
                 )
             )
     else:
-        auth_events = await store.get_events(
-            event.auth_event_ids(),
-            redact_behaviour=EventRedactBehaviour.as_is,
-            allow_rejected=True,
+        auth_events = auth_events.new_child(
+            await store.get_events(
+                event.auth_event_ids(),
+                redact_behaviour=EventRedactBehaviour.as_is,
+                allow_rejected=True,
+            )
         )
 
     room_id = event.room_id
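The layering used in the hunk above can be sketched in isolation. This is a simplified illustration, not the Synapse code: event IDs map to plain strings, and the `fetched` dict is hypothetical data standing in for the result of `store.get_events`.

```python
from collections import ChainMap

# Hypothetical stand-ins: event IDs mapped to plain strings instead of events.
batched_auth_events = {"$create": "create event", "$power": "power levels"}
wanted_ids = ["$create", "$member", "$power"]

# Order-preserving filter; it selects the IDs that are missing from the batch,
# as the set difference it replaced did.
needed_ids = [eid for eid in wanted_ids if eid not in batched_auth_events]

# Pretend these were fetched from the store (hypothetical data).
fetched = {eid: "fetched " + eid for eid in needed_ids}

auth_events = ChainMap()
auth_events = auth_events.new_child(batched_auth_events)  # layered, not copied
auth_events = auth_events.new_child(fetched)

# Lookups search the newest layer first, then fall back to older layers.
assert auth_events["$member"] == "fetched $member"
assert auth_events["$create"] == "create event"

# The batch is referenced as a layer; it is the original dict object.
assert auth_events.maps[1] is batched_auth_events
```

Each `new_child` call returns a new `ChainMap` with the given mapping placed in front of the existing layers, so no layer is copied. Writes through a `ChainMap` go to the front layer only, which is consistent with the diff's comment that the batched mapping is never mutated.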
