Better handling of spark configs #378

Open
bcajes opened this issue Jun 1, 2021 · 5 comments

Comments

@bcajes
Contributor

bcajes commented Jun 1, 2021

Glow requires some Spark configuration tuning when applied to large datasets. It would be nice if the Glow context automatically overrode these configs with recommended default values. It looks like there are already some configuration overrides:

```python
sess.conf.set("spark.sql.parquet.columnarReaderBatchSize", "16")
```

The user may also want to be warned via stdout when a config setting is changed during Glow initialization.
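
For illustration, a minimal sketch of what such a warning could look like, assuming an active `spark` session and the `glow.register` entry point; the list of watched keys below is just an example, not an existing Glow feature:

```python
import glow

# Snapshot the configs of interest before registering Glow, then warn on stdout
# if any of them were changed behind the scenes.
watched = [
    "spark.sql.parquet.columnarReaderBatchSize",
    "spark.sql.execution.arrow.maxRecordsPerBatch",
]
before = {k: spark.conf.get(k, None) for k in watched}

sess = glow.register(spark)

for k in watched:
    after = sess.conf.get(k, None)
    if after != before[k]:
        print(f"WARNING: Glow changed {k}: {before[k]!r} -> {after!r}")
```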

@williambrandler
Contributor

Hey @bcajes, yes, this config is changed without warning. It would be useful to explicitly emit a warning or to call this out in the docs.

Are there other configs you typically override when using glow?

@williambrandler
Contributor

Another config we usually add for the regression step (which uses pandas UDFs and Arrow) is:

"spark.sql.execution.arrow.maxRecordsPerBatch": 100

@bboutkov
Contributor

bboutkov commented Aug 4, 2021

+1, I do think having some pre-specified conf settings can make sense, but I want to reemphasize @williambrandler's point that we need to yell to stdout that such things are happening behind the scenes.

Along these lines, @henrydavidge: in #326, changes were introduced that seem to default to a new Spark session during Glow registration. Can you please provide some further info on why you chose a new session as the default behavior rather than carrying through the current session's settings, and on the nature of the issues you encountered? We ran into trouble with this recently where the shuffle-partitions conf was not respected unless it was set explicitly on the new session, which felt unintuitive. Is there any reason to avoid defaulting to new_session = false?
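
For reference, a minimal sketch of the surprise described above (assuming the `glow.register` entry point):

```python
import glow

spark.conf.set("spark.sql.shuffle.partitions", "1024")

sess = glow.register(spark)  # returns a new session by default
# If the new session does not inherit the runtime conf, this can come back as
# the Spark default ("200") rather than "1024", and has to be set again:
sess.conf.get("spark.sql.shuffle.partitions")
sess.conf.set("spark.sql.shuffle.partitions", "1024")
```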

@dmoore247
Contributor

Another related issue: settings applied via spark.conf.set("...", "...") are not picked up when sess = glow.create(spark) is called (forgive me for not knowing the API). The case is:

```python
spark.conf.set("spark.sql.maxPartitions", 2000)
sess = glow.create(spark)
sess.conf.get("spark.sql.maxPartitions")
```
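
One possible workaround sketch, assuming the entry point is `glow.register` and that its `new_session` flag (mentioned above) reuses the current session so its runtime conf carries through:

```python
import glow

spark.conf.set("spark.sql.maxPartitions", 2000)  # config key taken from the report above

# Register Glow on the existing session instead of letting it create a new one.
sess = glow.register(spark, new_session=False)
sess.conf.get("spark.sql.maxPartitions")  # reflects the value set above
```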

@bcajes
Contributor Author

bcajes commented Aug 5, 2021

spark.sql.files.maxPartitionBytes is another setting I've needed to tune, to around 32 MB or less.
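
For example (a sketch; Spark's default is 128 MB and the best value depends on the dataset):

```python
# Shrink the max bytes packed into a single file-scan partition to 32 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
```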
