
Commit 4801ec2

feat(profiles): Emit "accepted" outcomes for profiles filtered by sampling (#2054)
To facilitate a billing model that is consistent between transactions and profiles, #2051 introduced a new data category for profiles, such that

```
processed profiles = indexed profiles + sampled profiles  # "sampled" as in dropped by dynamic sampling
```

just as is already the case for transactions:

```
processed transactions = indexed transactions + sampled transactions
```

In other words, "processed profiles" should count _all_ valid, non-rate-limited profiles seen by Relay, even if they were dropped by dynamic sampling.

## Difference between transactions and profiles

For transactions, we extract _metrics_ before dynamic sampling, and those metrics are what we rate limit and eventually log as "accepted" for the "processed" transaction data category (`DataCategory.Transaction`). For profiles, we do not extract metrics (yet), so the outcomes for the "processed" profile data category have to be calculated in a different way.

## How this PR achieves this goal

1. In processing Relays, if an envelope still contains profiles _after_ dynamic sampling, log an `Accepted` outcome for the "processed" category. By restricting this to processing Relays, we can be sure that every profile is only counted once.
2. Also in processing Relays, shortly before reporting outcomes to Kafka:
   1. Translate `Filtered` outcomes with reason `Sampled:` to an `Accepted` outcome. This counts all profiles dropped by dynamic sampling, regardless of where the dynamic sampling took place (external Relay, PoP Relay, processing Relay).
   2. Also send the original `Filtered` outcome, but with data category `DataCategory.ProfileIndexed`.

By adding up the counts of these two disjoint sets, we should correctly count all profiles, regardless of whether they were sampled or not (see the sketch below for a worked example).

## Alternative proposal (rejected for now)

In order to _actually_ line up the behavior of transactions and profiles, we could start extracting a simple counter metric for profiles before dynamic sampling, and let that metric represent `DataCategory.Profile` -- this would mean that rate limits are applied to the metric, and the accepted outcome would be emitted from the `billing_metrics_consumer`, just like we do for `DataCategory.Transaction`. See [this internal doc](https://www.notion.so/sentry/Implementation-Concerns-412caf18c2f04f579bb551b98c9dad8c) for more.

## Why the current approach was chosen

I personally believe that the alternative proposal described above would be more correct and easier to maintain, but:

1. By containing all new logic to processing Relays, we will correctly count "processed" profiles even if they were dropped by dynamic sampling in outdated external Relays.
2. In a wider sense, by containing all new logic to processing Relays, we can iterate faster (deploy hotfixes, etc.) without having to worry about the behavior of external Relays (which we cannot update).
3. This change is easy to revert if needed. The alternative solution would be distributed across nodes, from external Relays to Sentry.
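As a back-of-the-envelope illustration of the counting above (the numbers are made up; this is only a sketch, not code from this change):

```rust
fn main() {
    // Hypothetical traffic: processing Relays see 100 valid, non-rate-limited profiles.
    let indexed_profiles: u32 = 25; // kept by dynamic sampling and stored
    let sampled_profiles: u32 = 75; // dropped by dynamic sampling

    // Every profile receives exactly one "accepted" outcome for the "processed"
    // category (`DataCategory.Profile`), so the two disjoint sets add up to the total:
    let processed_profiles = indexed_profiles + sampled_profiles;
    assert_eq!(processed_profiles, 100);
}
```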
1 parent 3fc247b commit 4801ec2

File tree

7 files changed, +501 −13 lines changed


CHANGELOG.md (+1)

```diff
@@ -14,6 +14,7 @@
 - Lower default max compressed replay recording segment size to 10 MiB. ([#2031](https://github.com/getsentry/relay/pull/2031))
 - Increase chunking limit to 15MB for replay recordings. ([#2032](https://github.com/getsentry/relay/pull/2032))
 - Add a data category for indexed profiles. ([#2051](https://github.com/getsentry/relay/pull/2051))
+- Differentiate between `Profile` and `ProfileIndexed` outcomes. ([#2054](https://github.com/getsentry/relay/pull/2054))
 
 ## 23.4.0
```

relay-server/src/actors/outcome.rs (+82 −9)

```diff
@@ -148,12 +148,16 @@ impl FromMessage<Self> for TrackOutcome {
 /// Defines the possible outcomes from processing an event.
 #[derive(Clone, Debug, PartialEq, Eq, Hash)]
 pub enum Outcome {
-    // /// The event has been accepted and handled completely.
-    // ///
-    // /// This is never emitted by Relay as the event may be discarded by the processing pipeline
-    // /// after Relay. Only the `save_event` task in Sentry finally accepts an event.
-    // #[allow(dead_code)]
-    // Accepted,
+    /// The event has been accepted and handled completely.
+    ///
+    /// For events and most other types, this is never emitted by Relay as the event
+    /// may be discarded by the processing pipeline after Relay.
+    /// Only the `save_event` task in Sentry finally accepts an event.
+    ///
+    /// The only data type for which this outcome is emitted by Relay is [`DataCategory::Profile`].
+    /// (See [`crate::actors::processor::EnvelopeProcessor`].)
+    #[cfg(feature = "processing")]
+    Accepted,
     /// The event has been filtered due to a configured filter.
     Filtered(FilterStatKey),
@@ -178,6 +182,8 @@ impl Outcome {
     /// Returns the raw numeric value of this outcome for the JSON and Kafka schema.
     fn to_outcome_id(&self) -> OutcomeId {
         match self {
+            #[cfg(feature = "processing")]
+            Outcome::Accepted => OutcomeId::ACCEPTED,
             Outcome::Filtered(_) | Outcome::FilteredSampling(_) => OutcomeId::FILTERED,
             Outcome::RateLimited(_) => OutcomeId::RATE_LIMITED,
             Outcome::Invalid(_) => OutcomeId::INVALID,
@@ -189,6 +195,8 @@ impl Outcome {
     /// Returns the `reason` code field of this outcome.
     fn to_reason(&self) -> Option<Cow<str>> {
         match self {
+            #[cfg(feature = "processing")]
+            Outcome::Accepted => None,
             Outcome::Invalid(discard_reason) => Some(Cow::Borrowed(discard_reason.name())),
             Outcome::Filtered(filter_key) => Some(Cow::Borrowed(filter_key.name())),
             Outcome::FilteredSampling(rule_ids) => Some(Cow::Owned(format!("Sampled:{rule_ids}"))),
@@ -221,6 +229,8 @@ impl Outcome {
 impl fmt::Display for Outcome {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         match self {
+            #[cfg(feature = "processing")]
+            Outcome::Accepted => write!(f, "accepted"),
             Outcome::Filtered(key) => write!(f, "filtered by {key}"),
             Outcome::FilteredSampling(rule_ids) => write!(f, "sampling rule {rule_ids}"),
             Outcome::RateLimited(None) => write!(f, "rate limited"),
@@ -787,14 +797,16 @@ impl OutcomeBroker {
     }
 
     #[cfg(feature = "processing")]
-    fn send_kafka_message(
+    fn send_kafka_message_inner(
         &self,
         producer: &KafkaOutcomesProducer,
         organization_id: u64,
         message: TrackRawOutcome,
     ) -> Result<(), OutcomeError> {
         relay_log::trace!("Tracking kafka outcome: {:?}", message);
 
+        send_outcome_metric(&message, "kafka");
+
         let payload = serde_json::to_string(&message).map_err(OutcomeError::SerializationError)?;
 
         // At the moment, we support outcomes with optional EventId.
@@ -824,11 +836,23 @@ impl OutcomeBroker {
         }
     }
 
+    #[cfg(feature = "processing")]
+    fn send_kafka_message(
+        &self,
+        producer: &KafkaOutcomesProducer,
+        organization_id: u64,
+        message: TrackRawOutcome,
+    ) -> Result<(), OutcomeError> {
+        for message in transform_outcome(message) {
+            self.send_kafka_message_inner(producer, organization_id, message)?;
+        }
+        Ok(())
+    }
+
     fn handle_track_outcome(&self, message: TrackOutcome, config: &Config) {
         match self {
             #[cfg(feature = "processing")]
             Self::Kafka(kafka_producer) => {
-                send_outcome_metric(&message, "kafka");
                 let organization_id = message.scoping.organization_id;
                 let raw_message = TrackRawOutcome::from_outcome(message, config);
                 if let Err(error) =
@@ -853,7 +877,6 @@ impl OutcomeBroker {
         match self {
             #[cfg(feature = "processing")]
             Self::Kafka(kafka_producer) => {
-                send_outcome_metric(&message, "kafka");
                 let sharding_id = message.org_id.unwrap_or_else(|| message.project_id.value());
                 if let Err(error) = self.send_kafka_message(kafka_producer, sharding_id, message) {
                     relay_log::error!("failed to produce outcome: {}", LogError(&error));
@@ -869,6 +892,56 @@ impl OutcomeBroker {
     }
 }
 
+/// Returns `true` if the outcome represents profiles dropped by dynamic sampling.
+#[cfg(feature = "processing")]
+fn is_sampled_profile(outcome: &TrackRawOutcome) -> bool {
+    // Older external Relays will still emit a `Profile` outcome.
+    // Newer Relays will emit a `ProfileIndexed` outcome.
+    (outcome.category == Some(DataCategory::Profile as u8)
+        || outcome.category == Some(DataCategory::ProfileIndexed as u8))
+        && outcome.outcome == OutcomeId::FILTERED
+        && outcome
+            .reason
+            .as_deref()
+            .map_or(false, |reason| reason.starts_with("Sampled:"))
+}
+
+/// Transforms an outcome into one or more derived outcome messages before sending it to Kafka.
+#[cfg(feature = "processing")]
+fn transform_outcome(mut outcome: TrackRawOutcome) -> impl Iterator<Item = TrackRawOutcome> {
+    let mut extra = None;
+    if is_sampled_profile(&outcome) {
+        // Profiles that were dropped by dynamic sampling still count as "processed",
+        // so we emit the FILTERED outcome only for the "indexed" category instead.
+        outcome.category = Some(DataCategory::ProfileIndexed as u8);
+
+        // "processed" profiles are an abstract data category that does not represent actual
+        // data going through our pipeline. Instead, the number of accepted "processed"
+        // profiles is counted as
+        //
+        //     processed_profiles = indexed_profiles + sampled_profiles
+        //
+        // The "processed" outcome for indexed_profiles is generated in processing
+        // (see `EnvelopeProcessor::count_processed_profiles()`), but for profiles dropped by
+        // dynamic sampling, all we have is the FILTERED outcome, which we transform into an
+        // ACCEPTED outcome here.
+        //
+        // The reason for doing this transformation in the Kafka producer is that it should
+        // apply to both `TrackOutcome` and `TrackRawOutcome`, and it should only happen _once_.
+        //
+        // In the future, we might actually extract metrics from profiles before dynamic
+        // sampling, like we do for transactions. At that point, this code should be removed,
+        // and we should enforce rate limits and emit outcomes based on the collected profile
+        // metric, as we do for transactions.
+        extra = Some(TrackRawOutcome {
+            outcome: OutcomeId::ACCEPTED,
+            reason: None,
+            category: Some(DataCategory::Profile as u8),
+            ..outcome.clone()
+        });
+    }
+    Some(outcome).into_iter().chain(extra)
+}
+
 #[derive(Debug)]
 enum ProducerInner {
     #[cfg(feature = "processing")]
```
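For intuition, here is a minimal, self-contained sketch of the fan-out performed by `transform_outcome` above. The types are simplified stand-ins, not the real `TrackRawOutcome` or `DataCategory`:

```rust
#[derive(Debug)]
struct RawOutcome {
    outcome: &'static str,  // "filtered", "accepted", ...
    reason: Option<String>, // e.g. Some("Sampled:123")
    category: &'static str, // "profile", "profile_indexed", ...
}

/// Mirrors the logic above: a profile outcome filtered by dynamic sampling fans out into
/// (1) the original FILTERED outcome, re-categorized as "indexed", and (2) an additional
/// ACCEPTED outcome for the abstract "processed" profile category.
fn transform(mut outcome: RawOutcome) -> Vec<RawOutcome> {
    let is_sampled_profile = matches!(outcome.category, "profile" | "profile_indexed")
        && outcome.outcome == "filtered"
        && outcome.reason.as_deref().map_or(false, |r| r.starts_with("Sampled:"));

    if is_sampled_profile {
        outcome.category = "profile_indexed";
        let accepted = RawOutcome {
            outcome: "accepted",
            reason: None,
            category: "profile",
        };
        vec![outcome, accepted]
    } else {
        vec![outcome]
    }
}

fn main() {
    let dropped_by_sampling = RawOutcome {
        outcome: "filtered",
        reason: Some("Sampled:123".to_owned()),
        category: "profile",
    };
    // Prints one "filtered"/"profile_indexed" message and one "accepted"/"profile" message.
    for msg in transform(dropped_by_sampling) {
        println!("{msg:?}");
    }
}
```

Any outcome that is not a sampled profile passes through unchanged, so non-profile categories and genuine inbound-filter outcomes are unaffected.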

relay-server/src/actors/processor.rs (+56)

```diff
@@ -1070,6 +1070,54 @@ impl EnvelopeProcessorService {
         })
     }
 
+    /// Counts the number of profiles in the envelope and emits an accepted outcome for them.
+    ///
+    /// "processed" profiles are an abstract data category that does not represent actual data
+    /// going through our pipeline. Instead, the number of accepted "processed" profiles is
+    /// counted as
+    ///
+    /// ```text
+    /// processed_profiles = indexed_profiles + sampled_profiles
+    /// ```
+    ///
+    /// The "processed" outcome for sampled profiles is generated by the Kafka producer
+    /// (see `transform_outcome` in [`crate::actors::outcome`]), but for "indexed" profiles, we
+    /// count the corresponding number of processed profiles here.
+    ///
+    /// NOTE: Instead of emitting a [processed](`DataCategory::Profile`) outcome here, we could
+    /// also do it in Sentry, in the same place where the [indexed](`DataCategory::ProfileIndexed`)
+    /// outcome is logged. We do it here to be consistent with profiles that are dropped by
+    /// dynamic sampling, which also count as "processed" even though they did not pass through
+    /// the `process_profiles` step yet.
+    ///
+    /// In the future, we might actually extract metrics from profiles before dynamic sampling,
+    /// like we do for transactions. At that point, this code should be removed, and we should
+    /// enforce rate limits and emit outcomes based on the collected profile metric, as we do
+    /// for transactions.
+    #[cfg(feature = "processing")]
+    fn count_processed_profiles(&self, state: &mut ProcessEnvelopeState) {
+        let profile_count: usize = state
+            .managed_envelope
+            .envelope()
+            .items()
+            .filter(|item| item.ty() == &ItemType::Profile)
+            .map(|item| item.quantity())
+            .sum();
+
+        if profile_count == 0 {
+            return;
+        }
+
+        self.outcome_aggregator.send(TrackOutcome {
+            timestamp: state.managed_envelope.received_at(),
+            scoping: state.managed_envelope.scoping(),
+            outcome: Outcome::Accepted,
+            event_id: None,
+            remote_addr: None,
+            category: DataCategory::Profile,
+            quantity: profile_count as u32, // truncates to `u32::MAX`
+        })
+    }
+
     /// Process profiles and set the profile ID in the profile context on the transaction if successful
     #[cfg(feature = "processing")]
     fn process_profiles(&self, state: &mut ProcessEnvelopeState) {
@@ -2245,6 +2293,9 @@
         self.process_replays(state)?;
         self.filter_profiles(state);
 
+        // After filtering, we need to update the envelope summary:
+        state.managed_envelope.update();
+
         if state.creates_event() {
             // Some envelopes only create events in processing relays; for example, unreal events.
             // This makes it possible to get in this code block while not really having an event in
@@ -2276,6 +2327,11 @@
 
         if_processing!({
             self.enforce_quotas(state)?;
+            // Any profile that reaches this point counts as "processed", regardless of whether
+            // it survives the actual `process_profiles` step. This is to be consistent with
+            // profiles that are dropped by dynamic sampling, which also count as "processed"
+            // even though they did not pass through the `process_profiles` step yet.
+            self.count_processed_profiles(state);
             // We need the event parsed in order to set the profile context on it
             self.process_profiles(state);
             self.process_check_ins(state);
```

relay-server/src/envelope.rs (+5 −1)

```diff
@@ -550,7 +550,11 @@ impl Item {
             ItemType::Metrics | ItemType::MetricBuckets => None,
             ItemType::FormData => None,
             ItemType::UserReport => None,
-            ItemType::Profile => Some(DataCategory::Profile),
+            ItemType::Profile => Some(if indexed {
+                DataCategory::ProfileIndexed
+            } else {
+                DataCategory::Profile
+            }),
             ItemType::ReplayEvent | ItemType::ReplayRecording => Some(DataCategory::Replay),
             ItemType::ClientReport => None,
             ItemType::CheckIn => Some(DataCategory::Monitor),
```

relay-server/src/utils/managed_envelope.rs (+5 −1)

```diff
@@ -312,7 +312,11 @@ impl ManagedEnvelope {
         if self.context.summary.profile_quantity > 0 {
             self.track_outcome(
                 outcome,
-                DataCategory::Profile,
+                if self.context.summary.event_metrics_extracted {
+                    DataCategory::ProfileIndexed
+                } else {
+                    DataCategory::Profile
+                },
                 self.context.summary.profile_quantity,
             );
         }
```

relay-server/src/utils/rate_limits.rs (+11 −2)

```diff
@@ -510,8 +510,13 @@ where
 
             // It makes no sense to store profiles without transactions, so if the event
             // is rate limited, rate limit profiles as well.
+            let profile_category = if summary.event_metrics_extracted {
+                DataCategory::ProfileIndexed
+            } else {
+                DataCategory::Profile
+            };
             enforcement.profiles =
-                CategoryLimit::new(DataCategory::Profile, summary.profile_quantity, longest);
+                CategoryLimit::new(profile_category, summary.profile_quantity, longest);
 
             rate_limits.merge(event_limits);
         }
@@ -548,7 +553,11 @@ where
             let item_scoping = scoping.item(DataCategory::Profile);
             let profile_limits = (self.check)(item_scoping, summary.profile_quantity)?;
             enforcement.profiles = CategoryLimit::new(
-                DataCategory::Profile,
+                if summary.event_metrics_extracted {
+                    DataCategory::ProfileIndexed
+                } else {
+                    DataCategory::Profile
+                },
                 summary.profile_quantity,
                 profile_limits.longest(),
             );
```
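The same category selection appears three times above (envelope items, envelope outcome tracking, and rate limiting). As a standalone sketch of that rule, with a stand-in enum rather than the real `DataCategory`:

```rust
// Stand-in for Relay's `DataCategory`, reduced to the two profile variants.
#[derive(Debug, PartialEq)]
enum ProfileCategory {
    Profile,        // "processed" profiles: every valid, non-rate-limited profile
    ProfileIndexed, // "indexed" profiles: only those that are actually stored
}

// Mirrors the change above: once metrics have been extracted for the event,
// profile outcomes are tracked under the indexed category; otherwise the item
// still represents the "processed" category.
fn profile_category(event_metrics_extracted: bool) -> ProfileCategory {
    if event_metrics_extracted {
        ProfileCategory::ProfileIndexed
    } else {
        ProfileCategory::Profile
    }
}

fn main() {
    assert_eq!(profile_category(true), ProfileCategory::ProfileIndexed);
    assert_eq!(profile_category(false), ProfileCategory::Profile);
}
```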
