Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(profiles): Emit "accepted" outcomes for profiles filtered by sampling #2054

Merged
merged 15 commits into from
Apr 24, 2023

Conversation

jjbayer
Copy link
Member

@jjbayer jjbayer commented Apr 20, 2023

To facilitate a billing model that is consistent between transactions and profiles, #2051 introduced a new data category for profiles, such that

processed profiles = indexed profiles + sampled profiles  # "sampled" as in dropped by dynamic sampling

just like it is already the case for transactions:

processed transactions = indexed transactions + sampled transactions

In other words, "processed profiles" should count all valid, non rate limited profiles seen by Relays, even if they were dropped by dynamic sampling.

Difference between transactions and profiles

For transactions, we extract metrics before dynamic sampling, and those metrics are what we rate limit and eventually log as "accepted" for the "processed" transaction data category (DataCategory.Transaction).

For profiles, we do not extract metrics (yet), so the outcomes for the "processed" profile data category have to be calculated in a different way.

How this PR achieves this goal

  1. In processing Relays, if an envelope still contains profiles after dynamic sampling, log an Accepted outcome for the "processed" category. By restricting this to processing Relays, we can be sure that every profile is only counted once.
  2. Also in processing Relays, shortly before reporting outcomes to kafka,
    2.1. translate Filtered outcomes with reason Sampled: to an Accepted outcome. This counts all profiles dropped by dynamic sampling, regardless of where the dynamic sampling took place (external relay, pop relay, processing relay).
    2.2. Also send the original Filtered outcome, but with data category DataCategory.TransactionIndexed.

By adding up the counts of these two disjoint sets, we should correctly count all profiles regardless of whether they were sampled or not.

Alternative proposal (rejected for now)

In order to actually line up behavior of transactions and profiles, we could start extracting a simple counter metric for profiles before dynamic sampling, and let that metric represent DataCategory.Profile -- this would mean that rate limits are applied to the metric, and the accepted outcome would be emitted from the billing_metrics_consumer, just like we do for the DataCategory.Transactions. See this internal doc for more.

Why the currrent approach was chosen

I personally believe that the alternative proposal described above would be more correct and easier to maintain, but:

  1. By containing all new logic to processing relays, we will correctly count "processed" profiles even if they were dropped by dynamic sampling outdated external relays.
  2. In a wider sense, by containing all new logic to processing relays, we can iterate faster (deploy hotfixes, etc.) without having to worry about the behavior of external relays (which we cannot update).
  3. This change is easy to revert if needed. The alternative solution would be distributed across nodes, from external relays to sentry.

Copy link
Contributor

@phacops phacops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

.items()
.filter(|item| item.ty() == &ItemType::Profile)
.map(|item| item.quantity())
.sum();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the SDKs will only send 1 profile per envelope.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but Relay does not enforce that AFAIK, so its safer to handle the case where there's multiple profiles.

@jjbayer jjbayer marked this pull request as ready for review April 21, 2023 09:04
@jjbayer jjbayer requested a review from a team April 21, 2023 09:04
@phacops
Copy link
Contributor

phacops commented Apr 21, 2023

It'd be good to deploy getsentry/sentry#47703 before this PR is deployed to make sure we don't double count indexed profiles as processed.

Copy link
Contributor

@olksdr olksdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Let's test it in prod 😄

///
/// In the future, we might actually extract metrics from profiles before dynamic sampling,
/// like we do for transactions. At that point, this code should be removed, and we should
/// enforce rate limits and emit outcomes based on the collect profile metric, as we do for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence ends here kind of leaving you hanging at the end.

Suggested change
/// enforce rate limits and emit outcomes based on the collect profile metric, as we do for
/// enforce rate limits and emit outcomes based on the collect profile metric, as we do for transactions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jjbayer
Copy link
Member Author

jjbayer commented Apr 24, 2023

lgtm. Let's test it in prod 😄

@olksdr I do hope the integration test covers the most important use case, let me know if you think we need more.

Improvement to #2054:

Before a profile reaches the `process_profile` stage, we count it as
`Profile:Accepted`. So if `process_profile` _fails_ with an `Invalid`
outcome, it is incorrect to log it in the same data category, because it
would double count within that category. Log it as `ProfileIndexed`
instead.
@jjbayer
Copy link
Member Author

jjbayer commented Apr 24, 2023

It'd be good to deploy getsentry/sentry#47703 before this PR is deployed to make sure we don't double count indexed profiles as processed.

@phacops yes, let's coordinate when to do the deploy!

jjbayer added 2 commits April 24, 2023 13:16
Improvement to #2054:

Since #2054 already introduced a translation from `Profile` to
`ProfileIndexed` for sampling outcomes, this PR does not change anything
functionally. It is merely an attempt to be consistent with #2056 and
#2060, such that we consider the `metrics_extracted` flag any time we
generate a profiling outcome.
…2060)

Improvement to #2054:

Same as #2056, but for rate
limited outcomes: _Before_ metrics extraction, rate limited profiles
should count as `Profile` ("processed" profile), because they have not
been counted towards accepted profiles yet. _after_ metrics extraction,
they should count as `ProfileIndexed`.
@jjbayer jjbayer merged commit 4801ec2 into master Apr 24, 2023
@jjbayer jjbayer deleted the feat/indexed-profile-outcomes branch April 24, 2023 11:58
jjbayer added a commit that referenced this pull request Apr 26, 2023
…2071)

In #2056, #2060 and #2061, I wrongly assumed that dropped profiles
should always be counted as `ProfileIndexed` after metrics extraction
and dynamic sampling, in order to not double-count towards `Profile`.
This is wrong, because as the PR description of #2054 clearly states,
profiles that are not dropped by dynamic sampling are only counted as
accepted in processing relays, _after_ dynamic sampling and rate
limiting:

> In processing Relays, if an envelope still contains profiles after
dynamic sampling, log an Accepted outcome for the "processed" category.
By restricting this to processing Relays, we can be sure that every
profile is only counted once.

Instead of checking the `metrics_extracted` flag, introduce a new flag
that explicitly states whether a profile was already counted towards
`DataCategory::Profile` or not, and evaluate that flag instead of
`metrics_extracted`.
jjbayer added a commit that referenced this pull request May 31, 2023
jjbayer added a commit that referenced this pull request Jun 5, 2023
Instead of counting processed profiles in two different places (see
#2054), add a `has_profile` tag
to the transaction counter metric, and define

```
processed_profiles := count(processed_transactions) WHERE has_profile = true
```

## Changes

This PR contains the following changes:

* Revert PRs around counting profiles.
* Add `has_profile(s)` tag on `transaction.duration` metric.
#2165
* Update metrics billing consumer to emit `accepted` outcome for
profiles. getsentry/sentry#50047


## Caveats

With this PR, rate limits will be applied consistently if they are
issued for `DataCategory.Transaction`. However, if a rate limit (e.g.
spike protection) is issued _only_ for `DataCategory.Profile`, it will
not be enforced.

## Rollout Order

1. Merge and deploy getsentry/sentry#50047. The
billing metrics consumer will now listen for profiles, but not receive
any.
3. Deploy processing Relays. Profile counts will go down, because
PoP-Relays are not sending the `has_profile` tag yet.
4. Deploy PoP Relays. Profile counts should go back to normal.

## In case of rollback

1. First revert the sentry change
(getsentry/sentry#50047) and deploy it to stop
counting profiles.
2. Then revert the Relay change (this PR) and deploy to processing
Relays and PoPs.

ref: #2158

---------

Co-authored-by: Iker Barriocanal <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants