StatisticsV2: initial statistics framework redesign (#14699)
* StatisticsV2: initial definition and validation method implementation

* Implement mean, median and standard deviation extraction for StatsV2

* Move stats_v2 to `physical-expr` package

* Introduce `ExprStatisticGraph` and `ExprStatisticGraphNode`

* Split the StatisticsV2 and statistics graph locations, prepare the infrastructure for stats top-down propagation and final bottom-up calculation

* Calculate variance instead of std_dev

* Create a skeleton for statistics bottom-up evaluation

* Introduce high-level test for 'evaluate_statistics()'

* Refactor result distribution computation during the statistics evaluation phase; add compute_range function

* Always produce Unknown distribution in non-mentioned combination cases, todos for the future

* Introduce Bernoulli distribution to be used as result of comparisons and inequations distribution combinations

* Implement initial statistics propagation of Uniform and Unknown distributions with known ranges

* Implement evaluate_statistics for logical not and unary negation operator

* Fix and add tests; make fmt happy

* Add integration test, implement conversion into Bernoulli distribution for Eq and NotEq

* Finish test, small cleanup

* minor improvements

* Update stats.rs

* Addressing review comments

* Implement median computation for Gaussian-Gaussian pair

* Update stats_v2.rs

* minor improvements

* Addressing second review comments, part 1

* Return true in other cases

* Finish addressing review requests, part 2

* final clean-up

* bug fix

* final clean-up

* apply reverse logic in stats framework as well

* Update cp_solver.rs

* revert data.parquet

* Apply suggestions from code review

* Update datafusion/physical-expr-common/src/stats_v2.rs

* Update datafusion/physical-expr-common/src/stats_v2.rs

* Apply suggestions from code review

Fix links

* Fix compilation issue

* Fix mean/median formula for exponential distribution

* casting + exp dir + remove opt's + is_valid refactor

* Update stats_v2_graph.rs

* remove inner mod

* last todo: bernoulli propagation

* Apply suggestions from code review

* Apply suggestions from code review

* prop_stats in binary

* Update binary.rs

* rename intervals

* block explicit construction

* test updates

* Update binary.rs

* revert renaming

* impl range methods as well

* Apply suggestions from code review

* Apply suggestions from code review

* Update datafusion/physical-expr-common/src/stats_v2.rs

* Update stats_v2.rs

* fmt

* fix bernoulli or eval

* fmt

* Review

* Review Part 2

* not propagate

* clean-up

* Review Part 3

* Review Part 4

* Review Part 5

* Review Part 6

* Review Part 7

* Review Part 8

* Review Part 9

* Review Part 10

* Review Part 11

* Review Part 12

* Review Part 13

* Review Part 14

* Review Part 15 | Fix equality comparisons between uniform distributions

* Review Part 16 | Remove unnecessary temporary file

* Review Part 17 | Leave TODOs for real-valued summary statistics

* Review Part 18

* Review Part 19 | Fix variance calculations

* Review Part 20 | Fix range calculations

* Review Part 21

* Review Part 22

* Review Part 23

* Review Part 24 | Add default implementations for evaluate_statistics and propagate_statistics

* Review Part 25 | Improve docs, refactor statistics graph code

* Review Part 26

* Review Part 27

* Review Part 28 | Remove get_zero/get_one, simplify propagation in statistics graph

* Review Part 29

* Review Part 30 | Move statistics-combining functions to core module, polish tests

* Review Part 31

* Review Part 32 | Module reorganization

* Review Part 33

* Add tests for bernoulli and gaussians combination

* Incorporate community feedback

* Fix merge issue

---------

Co-authored-by: Sasha Syrotenko <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
4 people authored Feb 24, 2025
1 parent c58a812 commit 0fbd20c
Showing 14 changed files with 3,059 additions and 171 deletions.
2 changes: 1 addition & 1 deletion datafusion/common/src/spans.rs
@@ -140,7 +140,7 @@ impl Span {
/// the column a that comes from SELECT 1 AS a UNION ALL SELECT 2 AS a you'll
/// need two spans.
#[derive(Debug, Clone)]
-// Store teh first [`Span`] on the stack because that is by far the most common
+// Store the first [`Span`] on the stack because that is by far the most common
// case. More will spill onto the heap.
pub struct Spans(pub Vec<Span>);
