-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doc: Add documentation about the path of an event through relay #289
Merged
Merged
Changes from 5 commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
79b5427
doc: Add documentation about the path of an event through relay
untitaker 3e221bf
doc: Add reference to config names
untitaker 4bdb978
ref: Add summary sections
untitaker c9943c0
fix: Add index.md to architecture section
untitaker 74e1d11
ref: Reorder eventprocessor steps
untitaker 646edf2
ref: Add some graphs
untitaker 432e173
ref: Add command for hotfixing docs
untitaker 9025265
fix: Do not go into modes rabbit hole
untitaker File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
# Architecture | ||
|
||
WIP | ||
This subsection contains internal docs that are useful for development and operation of Relay |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
# Path of an Event through Relay | ||
|
||
## Processing enabled vs not? | ||
|
||
Relay can run as part of a Sentry installation, such as within `sentry.io`'s | ||
infrastructure, or next to the application as a forwarding proxy. A lot of | ||
steps described here are skipped or run in a limited form when Relay is *not* | ||
running in processing mode: | ||
|
||
* Event normalization does different (less) things. | ||
|
||
* In `static` mode, project config is read from disk instead of fetched from | ||
an HTTP endpoint. In `proxy` mode, project config is just filled out with | ||
defaults. | ||
|
||
* Events are forwarded to an HTTP endpoint instead of being written to Kafka. | ||
|
||
* Rate limits are not calculated using Redis, instead Relay just honors 429s | ||
from previously mentioned endpoint. | ||
|
||
* Filters are not applied at all. | ||
|
||
## Inside the endpoint | ||
|
||
When an SDK hits `/api/X/store` on Relay, the code in | ||
`server/src/endpoints/store.rs` is called before returning a HTTP response. | ||
|
||
That code looks into an in-memory cache to answer basic information about a project such as: | ||
|
||
* Does it exist? Is it suspended/disabled? | ||
|
||
* Is it rate limited right now? If so, which key is rate limited? | ||
|
||
* Which DSNs are valid for this project? | ||
|
||
Some of the data for this cache is coming from the [projectconfigs | ||
endpoint](https://github.com/getsentry/sentry/blob/c868def30e013177383f8ca5909090c8bdbd8f6f/src/sentry/api/endpoints/relay_projectconfigs.py). | ||
It is refreshed every couple of minutes, depending on configuration (`project_expiry`). | ||
|
||
If the cache is fresh, we may return a `429` for rate limits or a `4xx` for | ||
invalid auth information. | ||
|
||
That cache might be empty or stale. If that is the case, Relay does not | ||
actually attempt to populate it at this stage. **It just returns a `200` even | ||
though the event might be dropped later.** This implies: | ||
|
||
* The first store request that runs into a rate limit doesn't actually result | ||
in a `429`, but a subsequent request will (because by that time the project | ||
cache will have been updated). | ||
|
||
* A store request for a non-existent project may result in a `200`, but | ||
subsequent ones will not. | ||
|
||
* A store request with wrong auth information may result in a `200`, but | ||
subsequent ones will not. | ||
|
||
* Filters are also not applied at this stage, so **a filtered event will | ||
always result in a `200`**. This matches the Python behavior since [a while | ||
now](https://github.com/getsentry/sentry/pull/14561). | ||
|
||
These examples assume that a project receives one event at a time. In practice | ||
one may observe that a highly concurrent burst of store requests for a single | ||
project results in `200 OK`s only. However, a multi-second flood of incoming | ||
events should quickly result in eventually consistent and correct status codes. | ||
|
||
The response is completed at this point. All expensive work (such as talking to | ||
external services) is deferred into a background task. Except for responding to | ||
the HTTP request, there's no I/O done in the endpoint in any form. We didn't | ||
even hit Redis to calculate rate limits. | ||
|
||
!!! Summary | ||
The HTTP response returned is just a best-effort guess at what the actual | ||
outcome of the event is going to be. We only return a `4xx` code if we know that | ||
the response will fail (based on cached information), if we don't we return a | ||
200 and continue to process the event asynchronously. This asynchronous | ||
processing used to happen synchronously in the Python implementation of | ||
`StoreView`. | ||
|
||
The effect of this is that the server will respond much faster that before but | ||
we might return 200 for events that will ultimately not be accepted. | ||
|
||
Generally Relay will return a 200 in many more situations than the old | ||
`StoreView`. | ||
|
||
## The background task | ||
|
||
The HTTP response is out by now. The rest of what used to happen synchronously in the | ||
Python `StoreView` is done asynchronously, but still in the same process. | ||
|
||
So, now to the real work: | ||
|
||
1. **Project config is fetched.** If the project cache is stale or missing, we | ||
fetch it. We may wait a couple milliseconds (`batch_interval`) here to be | ||
able to batch multiple project config fetches into the same HTTP request to | ||
not overload Sentry too much. | ||
|
||
At this stage Relay may drop the event because it realized that the DSN was | ||
invalid or the project didn't even exist. The next incoming event will get a | ||
proper 4xx status code. | ||
|
||
1. **The event is parsed.** In the endpoint we only did decompression, a basic | ||
JSON syntax check, and extraction of the event ID to be able to return it as | ||
part of the response. | ||
|
||
Now we create an `Event` struct, which conceptually is the equivalent to | ||
parsing it into a Python dictionary: We allocate more memory. | ||
|
||
1. **The event is normalized.** Event normalization is probably the most | ||
CPU-intensive task running in Relay. It discards invalid data, moves data | ||
from deprecated fields to newer fields and generally just does schema | ||
validation. | ||
|
||
1. **Filters ("inbound filters") are applied.** Event may be discarded because of IP | ||
addresses, patterns on the error message or known web crawlers. | ||
|
||
1. **Exact rate limits ("quotas") are applied.** `is_rate_limited.lua` is | ||
executed on Redis. The input parameters for `is_rate_limited.lua` ("quota | ||
objects") are part of the project config. See [this pull | ||
request](https://github.com/getsentry/sentry/pull/14558) for an explanation | ||
of what quota objects are. | ||
|
||
The event may be discarded here. If so, we write the rate limit info | ||
(reason and expiration timestamp) into the in-memory project cache so that | ||
the next store request returns a 429 in the endpoint and doesn't hit Redis | ||
at all. | ||
|
||
This contraption has the advantage that suspended or permanently | ||
rate-limited projects are very cheap to handle, and do not involve external | ||
services (ignoring the polling of the project config every couple of | ||
minutes). | ||
|
||
1. **The event is datascrubbed.** We have a PII config (new format) and a | ||
datascrubbing config (old format, converted to new format on the fly) as | ||
part of the project config fetched from Sentry. | ||
|
||
1. **Event is written to Kafka.** | ||
|
||
**Note:** If we discard an event at any point, an outcome is written to Kafka | ||
if Relay is configured to do so. | ||
|
||
!!! Summary | ||
For events that returned a `200` we spawn an in-process background task | ||
that does the rest of what the old `StoreView` did. | ||
|
||
This task updates in-memory state for rate limits and disabled | ||
projects/keys. | ||
|
||
## The outcomes consumer | ||
|
||
Outcomes are small messages in Kafka that contain an event ID and information | ||
about whether that event was rejected, and if so, why. | ||
|
||
The outcomes consumer is mostly responsible for updating (user-visible) | ||
counters in Sentry (buffers/counters and tsdb, which are two separate systems). | ||
|
||
## The ingest consumer | ||
|
||
The ingest consumer reads accepted events from Kafka, and also updates some | ||
stats. Some of *those* stats are billing-relevant. | ||
|
||
Its main purpose is to do what `insert_data_to_database` in Python store did: | ||
Call `preprocess_event`, after which comes sourcemap processing, native | ||
symbolication, grouping, snuba and all that other stuff that is of no concern | ||
to Relay. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we perhaps link to the modes description here?
Also, is "processing" also a mode, or more like, a feature? If the former, we should probably add it to the modes list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to keep this out of this document. I want to remove some of these modes once the original mode (fetching from sentry) is GA.