Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Add Delta Connect Server Library #3136

Merged
merged 61 commits into from
Jun 3, 2024

Conversation

longvu-db
Copy link
Contributor

@longvu-db longvu-db commented May 22, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR adds a skeleton for Delta Connect server library, and we add support for Scan to Delta's planner plugin.

How was this patch tested?

Added some basic tests for SparkConnectPlanner using the DeltaRelationPlugin and DeltaCommandPlugin plugins.

Does this PR introduce any user-facing changes?

No.

@longvu-db longvu-db force-pushed the delta-connect-server-lib branch from 2a878fe to ed9382b Compare May 23, 2024 15:42
@felipepessoto
Copy link
Contributor

@longvu-db do you have more details on what is Delta Connect?

@longvu-db longvu-db force-pushed the delta-connect-server-lib branch from 706624d to f988777 Compare May 26, 2024 09:50
@longvu-db
Copy link
Contributor Author

longvu-db commented May 27, 2024

Hey @felipepessoto!

Delta Connect adds Spark Connect support to Scala and Python APIs of Delta Lake for Apache Spark. Spark Connect is a new project released in Apache Spark 4.0 that adds a decoupled client-server infrastructure which allows remote connectivity from Spark from everywhere. Delta Connect makes the DeltaTable interfaces compatible with the new Spark Connect protocol.

There are some issues regarding Delta Connect:

  1. [Feature Request] Support for Spark Connect #1570
  2. [BUG][Spark] DeltaTable.forPath doesn't work with Spark Connect #1967

Copy link
Contributor

@andreaschat-db andreaschat-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments nits mostly.

libraryDependencies ++= Seq(
"com.google.protobuf" % "protobuf-java" % protoVersion % "protobuf",

"org.apache.spark" %% "spark-hive" % sparkVersion.value % "provided",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the spark version here needs to be the spark master. Does sparkVersion.value return the spark master version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreaschat-db

If we specify the sparkVersion to be master (build/sbt -DsparkVersion=master) then it will only compile files with prefix src/main/scala-spark-master (https://github.com/delta-io/delta/blob/master/build.sbt#L163).

I'm not sure if there is a way to throw an error if we compile Spark Connect by mistake with an older Spark Version, I guess most softwares if we compile with the wrong jdk or python if we compile with the wrong dependencies then it will not tell us explicit but just simply fail with an error log.

Copy link
Contributor Author

@longvu-db longvu-db May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreaschat-db

Thanks to @scottsand-db help on build.sbt we managed to make the project only compiles, tests and publishes only against Spark Master now, and it does nothing for Spark Latest (other than saying that it is skipping for Delta Connect projects).

I ran the build.sbt -DsparkVersion={latest, master} {connectServer, connectCommon}/{compile, test, publish} and build.sbt {compile/test/publish} manually to verify, but I'm not sure if there is a way to automatically test it for Github Workflows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you.

@longvu-db longvu-db requested a review from andreaschat-db May 29, 2024 15:32
@longvu-db longvu-db changed the title [Spark] Add Delta Connect server library [Spark] Add Delta Connect Server library May 29, 2024
Copy link
Contributor

@andreaschat-db andreaschat-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

libraryDependencies ++= Seq(
"com.google.protobuf" % "protobuf-java" % protoVersion % "protobuf",

"org.apache.spark" %% "spark-hive" % sparkVersion.value % "provided",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you.

@longvu-db longvu-db changed the title [Spark] Add Delta Connect Server library [Spark] Add Delta Connect Server Library May 30, 2024
@@ -48,7 +48,6 @@ def get_args():

def run_sbt_tests(root_dir, test_group, coverage, scala_version=None):
print("##### Running SBT tests #####")
is_running_spark_tests = test_group is None or test_group == "spark"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkorukanti Seems like is_running_spark_tests is actually not used anywhere? So I'm removing it.

Copy link
Collaborator

@tomvanbussel tomvanbussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM

@vkorukanti vkorukanti merged commit eccb75e into delta-io:master Jun 3, 2024
10 checks passed
vkorukanti pushed a commit that referenced this pull request Jun 4, 2024
)

## Description
[Delta Connect Server Lib
PR](#3136) but merge into
`branch-4.0-preview1` instead of `master`.

## How was this patch tested?
Added UTs.
tdas pushed a commit that referenced this pull request Jun 5, 2024
…3200)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  3. Be sure to keep the PR description updated to reflect all changes.
  4. Please write your PR title to summarize what this PR proposes.
5. If possible, provide a concise example to reproduce the issue for a
faster review.
6. If applicable, include the corresponding issue number in the PR title
and link it in the body.
-->

#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull
request title
For example: [Spark] Title of my pull request
-->

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
[Delta Connect Scala
Client](#3136) but merge into
`branch-4.0-preview1` instead of `master`.
<!--
- Describe what this PR changes.
- Describe why we need the change.
 
If this PR resolves an issue be sure to include "Resolves #XXX" to
correctly link and close the issue upon merge.
-->

## How was this patch tested?
Added UTs.
<!--
If tests were added, say they were added here. Please make sure to test
the changes thoroughly including negative and positive cases if
possible.
If the changes were tested in any way other than unit tests, please
clarify how you tested step by step (ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future).
If the changes were not tested, please explain why.
-->

## Does this PR introduce _any_ user-facing changes?
No.
<!--
If yes, please clarify the previous behavior and the change this PR
proposes - provide the console output, description and/or an example to
show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change
compared to the released Delta Lake versions or within the unreleased
branches such as master.
If no, write 'No'.
-->

---------

Signed-off-by: Shawn Chang <[email protected]>
Co-authored-by: Scott Sandre <[email protected]>
Co-authored-by: Tom van Bussel <[email protected]>
Co-authored-by: Jiaheng Tang <[email protected]>
Co-authored-by: Sumeet Varma <[email protected]>
Co-authored-by: Venki Korukanti <[email protected]>
Co-authored-by: Avril Aysha <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Hao Jiang <[email protected]>
allisonport-db added a commit that referenced this pull request Jun 7, 2024
…3201)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  3. Be sure to keep the PR description updated to reflect all changes.
  4. Please write your PR title to summarize what this PR proposes.
5. If possible, provide a concise example to reproduce the issue for a
faster review.
6. If applicable, include the corresponding issue number in the PR title
and link it in the body.
-->

#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull
request title
For example: [Spark] Title of my pull request
-->

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

<!--
- Describe what this PR changes.
- Describe why we need the change.
 
If this PR resolves an issue be sure to include "Resolves #XXX" to
correctly link and close the issue upon merge.
-->
[Delta Connect Python Client
PR](#3136) but merge into
branch-4.0-preview1 instead of master.

## How was this patch tested?
Added UTs.
<!--
If tests were added, say they were added here. Please make sure to test
the changes thoroughly including negative and positive cases if
possible.
If the changes were tested in any way other than unit tests, please
clarify how you tested step by step (ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future).
If the changes were not tested, please explain why.
-->

## Does this PR introduce _any_ user-facing changes?
No.
<!--
If yes, please clarify the previous behavior and the change this PR
proposes - provide the console output, description and/or an example to
show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change
compared to the released Delta Lake versions or within the unreleased
branches such as master.
If no, write 'No'.
-->

---------

Signed-off-by: Tai Le Manh <[email protected]>
Signed-off-by: Shawn Chang <[email protected]>
Co-authored-by: Scott Sandre <[email protected]>
Co-authored-by: Tai Le Manh <[email protected]>
Co-authored-by: Johan Lasperas <[email protected]>
Co-authored-by: Allison Portis <[email protected]>
Co-authored-by: Tom van Bussel <[email protected]>
Co-authored-by: Jiaheng Tang <[email protected]>
Co-authored-by: Sumeet Varma <[email protected]>
Co-authored-by: Venki Korukanti <[email protected]>
Co-authored-by: Avril Aysha <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Hao Jiang <[email protected]>
allisonport-db added a commit to allisonport-db/delta that referenced this pull request Jul 31, 2024
…elta-io#3201)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  3. Be sure to keep the PR description updated to reflect all changes.
  4. Please write your PR title to summarize what this PR proposes.
5. If possible, provide a concise example to reproduce the issue for a
faster review.
6. If applicable, include the corresponding issue number in the PR title
and link it in the body.
-->

<!--
Please add the component selected below to the beginning of the pull
request title
For example: [Spark] Title of my pull request
-->

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

<!--
- Describe what this PR changes.
- Describe why we need the change.

If this PR resolves an issue be sure to include "Resolves #XXX" to
correctly link and close the issue upon merge.
-->
[Delta Connect Python Client
PR](delta-io#3136) but merge into
branch-4.0-preview1 instead of master.

Added UTs.
<!--
If tests were added, say they were added here. Please make sure to test
the changes thoroughly including negative and positive cases if
possible.
If the changes were tested in any way other than unit tests, please
clarify how you tested step by step (ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future).
If the changes were not tested, please explain why.
-->

No.
<!--
If yes, please clarify the previous behavior and the change this PR
proposes - provide the console output, description and/or an example to
show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change
compared to the released Delta Lake versions or within the unreleased
branches such as master.
If no, write 'No'.
-->

---------

Signed-off-by: Tai Le Manh <[email protected]>
Signed-off-by: Shawn Chang <[email protected]>
Co-authored-by: Scott Sandre <[email protected]>
Co-authored-by: Tai Le Manh <[email protected]>
Co-authored-by: Johan Lasperas <[email protected]>
Co-authored-by: Allison Portis <[email protected]>
Co-authored-by: Tom van Bussel <[email protected]>
Co-authored-by: Jiaheng Tang <[email protected]>
Co-authored-by: Sumeet Varma <[email protected]>
Co-authored-by: Venki Korukanti <[email protected]>
Co-authored-by: Avril Aysha <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Hao Jiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants