Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KYUUBI #6017] Support observe hint #6017

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wForget
Copy link
Member

@wForget wForget commented Jan 25, 2024

🔍 Description

Issue References 🔗

This pull request fixes #

Describe Your Solution 🔧

Provide OBSERVE Hint to create an observer to collect aggregated metrics.

The OBSERVE Hint Syntax:

/*+ OBSERVE(name, exprs) */

Usage like:

SELECT /*+ OBSERVE('observer1', sum(c1), count(1)) */ * from t1

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests

org.apache.spark.sql.observe.ResolveObserveHintsSuite


Checklist 📝

Be nice. Be informative.

@wForget wForget changed the title Support observe hint [KYUUBI #6017] Support observe hint Jan 25, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (47a1091) 61.16% compared to head (22bd178) 61.14%.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #6017      +/-   ##
============================================
- Coverage     61.16%   61.14%   -0.03%     
  Complexity       23       23              
============================================
  Files           623      623              
  Lines         37060    37060              
  Branches       5024     5024              
============================================
- Hits          22669    22661       -8     
- Misses        11956    11961       +5     
- Partials       2435     2438       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wForget wForget requested a review from ulysses-you January 25, 2024 07:54
@ulysses-you
Copy link
Contributor

how can end users get the metrics after adding hint ? Is there a pure sql way to show the metrics ?

@wForget
Copy link
Member Author

wForget commented Jan 25, 2024

how can end users get the metrics after adding hint ? Is there a pure sql way to show the metrics ?

I would have preferred to display it in the SQL Tab of the Spark UI, but that would require changes to spark code. like:

image

For Kyuubi, I want to record metrics in Spark Event first, and then display them in Kyuubi Statement Tab. What do you think?

@ulysses-you
Copy link
Contributor

How about introducing a new syntax instead of hint, something like CALL OBSERVE('observer1', sum(c1), count(1)) (select * from t1), it will return the result of observation

@wForget
Copy link
Member Author

wForget commented Jan 25, 2024

How about introducing a new syntax instead of hint, something like CALL OBSERVE('observer1', sum(c1), count(1)) (select * from t1), it will return the result of observation

This seems to only observe the final result, I want to insert observers at the all stages of SQL like in test case.

original sql:

SELECT *
  FROM
 (SELECT * from t1) tt1
  join
 (SELECT c1, c1 * 2 as c3 from t2) tt2
  on tt1.c1 = tt2.c1

sql with observers:

SELECT /*+ OBSERVE('observer3', sum(tt2.c3), count(1)) */ *
  FROM
 (SELECT /*+ OBSERVE('observer1', sum(c1), count(1)) */ * from t1) tt1
  join
 (SELECT /*+ OBSERVE('observer2', sum(c1), count(1)) */ c1, c1 * 2 as c3 from t2) tt2
  on tt1.c1 = tt2.c1

@ulysses-you
Copy link
Contributor

The behavior is decided by yourself. The new syntax also supports observe each operator as long as you inject CollectMetrics for each operator.

@wForget
Copy link
Member Author

wForget commented Jan 26, 2024

The behavior is decided by yourself. The new syntax also supports observe each operator as long as you inject CollectMetrics for each operator.

How do we push down the specified aggregation expression to the previous stage?

CALL OBSERVE('observer1', sum(c1), count(1)) (SELECT c1 from (select a, max(b) as c1 from t group by a))

Like this SQL, how do we apply sum(c1) to select a, max(b) as c1 project?

@ulysses-you
Copy link
Contributor

Maybe we can make the syntax support nested statement. e.g.,

OBSERVE(a, sum(c1)) (
  select c1, count(*) from (
    OBSERVE(b, count(1)) (
      select * from t1 join t2
    )
  )
)

The difference with hint is that, we can change it's output.

@wForget
Copy link
Member Author

wForget commented Jan 26, 2024

The difference with hint is that, we can change it's output.

I might prefer to split it into two tasks:

  1. add a hint to inject observer
  2. provide getObservedMetrics method to obtain metrics

like:

Call getObservedMetrics(
SELECT /*+ OBSERVE('observer1', sum(c1), count(1)) */ * from t1
)

@ulysses-you
Copy link
Contributor

I'm fine with it. It should be a kind of pure sql for dataset.observe.

@wForget
Copy link
Member Author

wForget commented Jan 26, 2024

I'm fine with it. It should be a kind of pure sql for dataset.observe.

Yes, dataset.observe also seems to just inject a observer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants