diff --git a/docs/investigation_into_drivers_of_heterogeneity.md b/docs/investigation_into_drivers_of_heterogeneity.md
new file mode 100644
index 00000000..5aebc5b8
--- /dev/null
+++ b/docs/investigation_into_drivers_of_heterogeneity.md
@@ -0,0 +1,109 @@
+---
+editor_options:
+  markdown:
+    wrap: 72
+---
+
+## Wastewater forecasting evaluation: Investigation into drivers of heterogeneity in the impact of wastewater on forecast performance
+
+This document is intended to outline concretely the proposed analyses
+requested as a follow-up to the previously agreed-upon [evaluation
+plan](https://github.com/cdcgov/wastewater-informed-covid-forecasting/tree/prod/docs).
+This evaluation focused on two major components: (1) a retrospective
+comparison of forecast performance with and without wastewater using
+vintaged datasets of COVID-19 hospital admissions and wastewater at the
+jurisdiction and wastewater treatment plant level, across the 2023-24
+epidemic season, and (2) a comparison of the performance of the
+wastewater-informed model to other models submitting to the COVID-19
+Forecast Hub, both in real time (when submitted from February to March
+of 2024) and retrospectively (using the real-time COVID Hub submissions
+and our retrospectively produced submissions), to be presented in the
+manuscript: "Bayesian generative modeling for heterogeneous wastewater
+data applied to COVID-19 forecasting". In the interest of maintaining
+scientific integrity, we specified the plan for this evaluation prior to
+performing the analysis.
+
+The goal of the proposed analyses here is to investigate, using the
+empirical wastewater data, hospital admissions data, model performance
+metrics, and the model parameters, the potential drivers of the
+heterogeneity in the relative performance of wastewater. This analysis
+is intended for hypothesis generation, rather than hypothesis testing or
+assigning any causal relationships, as we believe a full-fledged
+analysis of the impact of different characteristics of the wastewater
+and hospital admissions data on forecast performance is out of scope for
+this paper and should be performed as a separate independent analysis.
+
+We are writing this analysis plan with the intention of coming to a
+consensus on the scope of the required additional analysis and the form
+of the presentation of the results, prior to running the analysis, again
+with the intention of promoting scientific integrity and holding
+ourselves accountable to presenting the results in an unbiased manner.
+
+All of the planned analyses will focus on the retrospective model
+performance with and without wastewater data.
+
+We plan to address the following questions via the proposed analyses:
+
+1.  How strongly are recent trends in wastewater signals correlated
+    with one another, and are they consistent with the eventual trend
+    in hospital admissions data? How does this impact performance?
+
+    -   We will quantify correlation in recent wastewater trends using
+        only data from the 2 weeks prior to the forecast date. We will
+        pool data from all sites in those two weeks, calculate a
+        correlation coefficient, and estimate the instantaneous
+        exponential growth rate in the observed data. We will compare
+        this to an estimate of the exponential growth rate in hospital
+        admissions, using the evaluation data from 1 week before to 2
+        weeks after the forecast date, to characterize the trend of
+        hospital admissions into the forecast period. Next, we will bin
+        forecasts into the following categories:
+
+        -   high correlation in wastewater signal, trend in same direction
+
+        -   high correlation in wastewater signal, trend in opposite
+            direction
+
+        -   low correlation in wastewater signal, trend in same direction
+
+        -   low correlation in wastewater signal, trend in opposite
+            direction
+
+    We will then display and summarize the distributions of forecast
+    performance (average and relative CRPS over the full 38-day horizon
+    for an individual forecast date and location) in each of the bins.
+
+2.  How did variability in the wastewater data impact forecast
+    performance?
+    -   We will quantify the variability in wastewater data in two
+        ways: (1) by computing the coefficient of variation across the
+        time series for each wastewater treatment plant (returning a
+        distribution of CVs), and (2) by looking at the posterior
+        estimate of the mean observation error across sites.
+
+    -   We will summarize the "average variability in the wastewater
+        signal" for each location and forecast date by taking the mean
+        of each of the empirical and model-based distributions, and we
+        will plot the mean variability from both methods against
+        forecast performance (average and relative CRPS over the full
+        38-day horizon for an individual forecast date and location).
+3.  How much does latency impact forecast performance?
+    -   We will bin forecasts by whether more than 5 sites have
+        wastewater concentration data within the last 20-15 days, 14-11
+        days, 10-8 days, and 7-0 days before the forecast date.
+
+    -   We will present violin plots showing the distribution of
+        forecast performance (average and relative CRPS over the full
+        38-day horizon for an individual forecast date and location).
+4.  How do relative and absolute forecast performance compare, e.g.,
+    does wastewater improve forecasts when the performance would have
+    otherwise been poor, or vice versa?
+    -   Scatterplot of the relative CRPS of the wastewater-informed
+        model versus the absolute CRPS of the hospital-admissions-only
+        model
+
+    -   Scatterplot of the absolute CRPS of the wastewater-informed model
+        versus the absolute CRPS of the hospital-admissions-only model
+
+We plan to add the proposed plots and analysis to the supplement of the
+manuscript.
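
As a discussion aid for questions 1 and 2 of the plan above, here is a minimal Python sketch of how the trend comparison and the coefficient of variation might be computed. It is an illustration under stated assumptions, not the implementation used in the repository. The plan does not say which estimators will be used, so the sketch assumes that the growth rate is the least-squares slope of log values against time, that the correlation coefficient is the Pearson correlation between day and log concentration in the pooled site data, and that all values are strictly positive. The function names and the `corr_threshold` cutoff of 0.5 are hypothetical.

```python
import math
from statistics import mean, stdev


def growth_rate(days, values):
    """Least-squares slope of log(values) against days.

    Assumption: a log-linear fit stands in for the "instantaneous
    exponential growth rate"; values must be strictly positive.
    """
    logs = [math.log(v) for v in values]
    d_bar, l_bar = mean(days), mean(logs)
    sxy = sum((d - d_bar) * (l - l_bar) for d, l in zip(days, logs))
    sxx = sum((d - d_bar) ** 2 for d in days)
    return sxy / sxx


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def classify_forecast(ww_days, ww_conc, hosp_days, hosp_admits,
                      corr_threshold=0.5):
    """Assign one forecast to one of the four bins from question 1.

    ww_days / ww_conc: pooled observations from all sites in the 2 weeks
    before the forecast date (days relative to the forecast date).
    hosp_days / hosp_admits: evaluation data from 1 week before to
    2 weeks after the forecast date.
    corr_threshold is a hypothetical cutoff between "high" and "low".
    """
    corr = pearson(ww_days, [math.log(c) for c in ww_conc])
    ww_rate = growth_rate(ww_days, ww_conc)
    hosp_rate = growth_rate(hosp_days, hosp_admits)
    strength = "high" if abs(corr) >= corr_threshold else "low"
    direction = "same" if ww_rate * hosp_rate > 0 else "opposite"
    return f"{strength} correlation, trend in {direction} direction"


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (question 2)."""
    return stdev(values) / mean(values)
```

Calling `coefficient_of_variation` once per wastewater treatment plant yields the distribution of CVs described in question 2, and its mean gives the empirical "average variability" for a location and forecast date. The choice of correlation threshold would need to be agreed on before running the analysis, in keeping with the pre-specification goal of this plan.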