Spark35_glue5_upgrade #194

Open: wants to merge 25 commits into base: main

@vidhyamanisankar (Contributor) commented Apr 9, 2025

SPP-12455
Update the imputation engine and ratio_calculator

  • Split the big chained transaction into smaller transactions.
  • Truncate the logical plan with localCheckpoint(eager=True) when calling ratio_calculator and imputation_helper in the loop.
  • Use a broadcast join on the smaller df.
  • Add a secondary ordering by reference in mean_of_ratios. This does not affect the calculated forward/backward links, since it only applies to contributors with equal growth ratios, but it keeps the selection of rows for trimming deterministic.
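
The checkpointing and secondary-ordering bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `run_periods`, `step`, `trim_by_ratio`, and the `(reference, ratio)` tuple layout are hypothetical names chosen for the example. The only PySpark call used is `DataFrame.localCheckpoint(eager=True)`; the trimming part is plain Python so the tie-breaking logic is easy to see.

```python
def run_periods(df, periods, step):
    """Apply `step` once per period, truncating the logical plan each time.

    `df` is assumed to be a pyspark.sql.DataFrame. `step` is a hypothetical
    stand-in for a call into ratio_calculator / imputation_helper.
    """
    for period in periods:
        df = step(df, period)
        # eager=True materialises the result immediately, so the next
        # iteration starts from a short plan instead of the full history.
        df = df.localCheckpoint(eager=True)
    return df


def trim_by_ratio(rows, lower_trim, upper_trim):
    """Drop the lowest/highest growth ratios, breaking ties by reference.

    `rows` is a list of (reference, ratio) tuples (hypothetical layout).
    Sorting on (ratio, reference) means rows with equal ratios always
    come out in the same order, whatever order they arrived in.
    """
    ordered = sorted(rows, key=lambda row: (row[1], row[0]))
    return ordered[lower_trim:len(ordered) - upper_trim]
```

Because tied rows carry the same ratio, which of them is trimmed does not change the mean of the remaining ratios; the secondary key only fixes which references survive. Note also that local checkpoints are stored on the executors rather than in reliable storage, so they trade some fault tolerance for a shorter plan.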

Synopsis

Upgrade Glue jobs to AWS Glue 5, Spark 3.5.2, and Python 3.11.

Checklist

  • Documentation created/updated
  • Tests created/updated

Description

Add a more detailed description of the pr if necessary (can reference release
notes if included).

@vidhyamanisankar requested a review from a team as a code owner April 9, 2025 14:13
@vidhyamanisankar marked this pull request as draft April 9, 2025 14:13
@vidhyamanisankar marked this pull request as ready for review April 10, 2025 09:14

@BenLatham left a comment

Looks good to me, however I don't fully understand broadcasts, so it would be helpful to discuss those to ensure I've not missed anything.
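
For context on the broadcast joins raised in this review, here is a minimal sketch of what such a join looks like in PySpark. It is an illustration only, not the code under review: `join_with_small_table` and its parameter names are hypothetical, and it assumes both arguments are `pyspark.sql.DataFrame` objects. It uses `DataFrame.hint("broadcast")` and `DataFrame.join`.

```python
def join_with_small_table(large_df, small_df, key):
    """Left-join `small_df` onto `large_df`, hinting that `small_df` be broadcast.

    Both arguments are assumed to be pyspark.sql.DataFrame objects.
    """
    # The hint asks Spark to copy small_df to every executor, so each
    # partition of large_df can be joined locally without shuffling large_df.
    # pyspark.sql.functions.broadcast(small_df) expresses the same hint.
    return large_df.join(small_df.hint("broadcast"), on=key, how="left")
```

A broadcast join generally only pays off when the smaller side fits comfortably in memory on each executor. Spark may also choose one on its own for tables below `spark.sql.autoBroadcastJoinThreshold`, so the explicit hint mainly matters when the optimiser cannot estimate the size reliably.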

@@ -8,20 +8,59 @@ permissions:
  contents: read

jobs:
  python37-test:

Is it worth maintaining backwards compatibility testing for python 3.7, or could we drop 3.7 support?

Contributor Author

I agree. Spark 3.5.x officially supports Python 3.8 and later. Support for Python 3.7 was deprecated in Spark 3.4.0.

Contributor Author

However, pyproject.toml allows pyspark versions in the range '>=3.1.1 <3.5.3'; for the Python versions supported by Spark 3.1.1, see https://archive.apache.org/dist/spark/docs/3.1.1/. So we need to support at least Python 3.7.

@mwirikia

comment from ID

This should be kept as a draft for now IMO. We can’t merge into main until MDQ have confirmed they have resource to test and we can deploy and test as part of a release.

@vidhyamanisankar marked this pull request as draft April 17, 2025 10:53
@vidhyamanisankar marked this pull request as ready for review April 17, 2025 10:53
3 participants