Spark35_glue5_upgrade #194

vidhyamanisankar · 2025-04-09T14:13:23Z

SPP-12455
Update the imputation engine and ratio_calculator

Split the big chained transaction into smaller transactions.
Truncate the logical plan using localCheckpoint(eager=True) when calling ratio_calculator and imputation_helper in the loop.
Use a broadcast join on the smaller df.
Add a secondary ordering by reference in mean_of_ratios. This does not affect the calculated forward/backward links, since it only applies to contributors with equal growth ratios, but it keeps the selection of rows for trimming deterministic.
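
To illustrate the last point, here is a minimal plain-Python sketch of trimming with a secondary ordering. It is not the mean_of_ratios implementation: the function `trim_ratios`, the row layout, and the trim counts are assumptions made up for this example. It only shows why breaking ties on reference makes the trimmed set reproducible without changing the resulting mean.

```python
def trim_ratios(rows, lower_trim, upper_trim):
    """Drop the lowest `lower_trim` and highest `upper_trim` rows by growth ratio.

    Hypothetical helper: rows are dicts with "reference" and "ratio" keys.
    """
    # Sort by ratio first, then by reference, so rows with equal ratios
    # always land in the same positions whatever the input order was.
    ordered = sorted(rows, key=lambda r: (r["ratio"], r["reference"]))
    upper_bound = len(ordered) - upper_trim
    return ordered[lower_trim:upper_bound]


# Made-up contributors: B, C and D share the same growth ratio.
rows = [
    {"reference": "D", "ratio": 1.2},
    {"reference": "A", "ratio": 1.0},
    {"reference": "C", "ratio": 1.2},
    {"reference": "E", "ratio": 2.0},
    {"reference": "B", "ratio": 1.2},
]

kept = trim_ratios(rows, lower_trim=2, upper_trim=1)
kept_reversed = trim_ratios(list(reversed(rows)), lower_trim=2, upper_trim=1)
mean_ratio = sum(r["ratio"] for r in kept) / len(kept)
```

Sorting on ratio alone would leave the order of B, C and D to the input order (or, in Spark, to the partitioning), so different runs could trim different rows even though the mean of the kept ratios is the same either way.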

Synopsis

Upgrade the Glue jobs to AWS Glue 5, Spark 3.5.2, and Python 3.11.

Checklist

Documentation created/updated
Tests created/updated

Description

Add a more detailed description of the pr if necessary (can reference release
notes if included).

BenLatham

Looks good to me, however I don't fully understand broadcasts, so it would be helpful to discuss those to ensure I've not missed anything.
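
As background for that discussion, here is a plain-Python sketch of the idea behind a broadcast hash join. It is not the code in this PR and not how Spark is implemented internally: `broadcast_hash_join`, the column names, and the data are hypothetical. The point is that the small side is turned into a lookup table once and copied to every partition of the large side, so the large side is joined in place rather than shuffled by key.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large side against the whole small side."""
    # Build the lookup table once from the small side. In Spark, this is
    # the table that would be shipped ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:
        # Each partition is joined locally; no rows move between partitions.
        joined.append(
            [
                {**row, **match}
                for row in partition
                for match in lookup.get(row[key], [])
            ]
        )
    return joined


# Made-up data: two partitions of responses and a small table of links.
responses = [
    [{"reference": "A", "cell": 1}, {"reference": "B", "cell": 2}],
    [{"reference": "C", "cell": 1}, {"reference": "D", "cell": 3}],
]
links = [{"cell": 1, "link": 1.1}, {"cell": 2, "link": 0.9}]

joined = broadcast_hash_join(responses, links, "cell")
```

In PySpark the equivalent hint is `pyspark.sql.functions.broadcast`, wrapped around the smaller DataFrame in the join. The trade-off is memory: the broadcast side must fit comfortably on every executor, which is why it is applied to the smaller df.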

BenLatham · 2025-04-16T15:31:20Z

.github/workflows/ci-checks.yaml

@@ -8,20 +8,59 @@ permissions:
  contents: read

 jobs:
+  python37-test:


Is it worth maintaining backwards compatibility testing for python 3.7, or could we drop 3.7 support?

vidhyamanisankar · Apr 25, 2025

I agree. Spark 3.5.x officially supports Python 3.8 and later. Support for Python 3.7 was deprecated in Spark 3.4.0.

vidhyamanisankar · Apr 25, 2025

However, pyproject.toml allows pyspark versions in the range '>=3.1.1 <3.5.3', and the Python versions supported by Spark 3.1.1 are listed at https://archive.apache.org/dist/spark/docs/3.1.1/, so we need to support at least Python 3.7.

mwirikia · 2025-04-17T06:11:20Z

comment from ID

This should be kept as a draft for now, IMO. We can’t merge into main until MDQ have confirmed they have the resource to test and we can deploy and test as part of a release.

split the chained transactions into small, truncate the logical plan,…

79c8979

… use broadcast join on smaller df

vidhyamanisankar requested a review from a team as a code owner April 9, 2025 14:13

vidhyamanisankar marked this pull request as draft April 9, 2025 14:13

vidhyamanisankar added 11 commits April 9, 2025 16:01

remove localCheckpoint

bf1568a

add repartition to improve join run duration

964486e

add repartition on construction

ec833a3

debug

099ecfa

Debug

e184e2b

refactor longing running contruction join

d21d98f

contruction split & repartition

55d9c7d

construct join as broadcast

4eede02

remove repartitions

6758b82

Pass imputation marker

d7cf1e1

refactor cons

0a2f760

vidhyamanisankar marked this pull request as ready for review April 10, 2025 09:14

deterministic sort for trimming

6f837e2

BenLatham reviewed Apr 16, 2025

vidhyamanisankar marked this pull request as draft April 17, 2025 10:53

vidhyamanisankar marked this pull request as ready for review April 17, 2025 10:53

vidhyamanisankar added 10 commits April 25, 2025 09:32

Add comments for broadcast which helps to fix the frozen/very slow jo…

80e525e

…in conditions

fix black & flake8

7628a55

Stop supporting python 3.7 , 3.8

4779026

debug the decimal precision difference

cdf4abf

debug decimal precision

1e600f9

debug

e3549c4

debug

375e309

debug:add decimal cast

0eed081

Debug: cast decimalType

e48a037

debug: decimal to 6 precision

6a4bf18

vidhyamanisankar added 2 commits June 11, 2025 16:57

remove debug print and df shoe printschema

ca8c0e3

remove explict decimal cast for 6 decimal place

ff3c241
