Better handling of spark configs #378

Open
bcajes opened this issue Jun 1, 2021 · 5 comments

Comments

@bcajes
Contributor

bcajes commented Jun 1, 2021

Glow requires some Spark configuration tuning when applied to large datasets. It would be nice if the Glow context automatically overrode these configs with recommended default values. It looks like there are already some configuration overrides:

```python
sess.conf.set("spark.sql.parquet.columnarReaderBatchSize", "16")
```

The user may also want to be warned via stdout when a config setting is changed during Glow initialization.
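
For illustration, a minimal sketch of what such a warning could look like, assuming an active `spark` session and the `glow.register` entry point; the list of watched keys below is just an example, not an existing Glow feature:

```python
import glow

# Snapshot the configs of interest before registering Glow, then warn on stdout
# if any of them were changed behind the scenes.
watched = [
    "spark.sql.parquet.columnarReaderBatchSize",
    "spark.sql.execution.arrow.maxRecordsPerBatch",
]
before = {k: spark.conf.get(k, None) for k in watched}

sess = glow.register(spark)

for k in watched:
    after = sess.conf.get(k, None)
    if after != before[k]:
        print(f"WARNING: Glow changed {k}: {before[k]!r} -> {after!r}")
```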

@williambrandler
Contributor

Hey @bcajes, yes, this config is changed without warning. It would be useful to explicitly emit a warning or to call this out in the docs.

Are there other configs you typically override when using glow?

@williambrandler
Contributor

Another config we usually add for the regression step (which uses pandas UDFs and Arrow) is:

"spark.sql.execution.arrow.maxRecordsPerBatch": 100

@bboutkov
Contributor

bboutkov commented Aug 4, 2021

+1, I do think having some pre-specified conf settings can make sense, but I want to reemphasize @williambrandler's point that we need to yell to stdout that such things are happening behind the scenes.

Along these lines, @henrydavidge: in #326, changes were introduced that seem to default to a new Spark session during Glow registration. Can you please provide some further info on why you chose a new session as the default behavior rather than carrying through the current session's settings, and on the nature of the issues you encountered? We ran into trouble with this recently where the shuffle-partitions conf was not respected unless it was set explicitly on the new session, which felt unintuitive. Is there any reason to avoid defaulting to new_session = false?
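
For reference, a minimal sketch of the surprise described above (assuming the `glow.register` entry point):

```python
import glow

spark.conf.set("spark.sql.shuffle.partitions", "1024")

sess = glow.register(spark)  # returns a new session by default
# If the new session does not inherit the runtime conf, this can come back as
# the Spark default ("200") rather than "1024", and has to be set again:
sess.conf.get("spark.sql.shuffle.partitions")
sess.conf.set("spark.sql.shuffle.partitions", "1024")
```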

@dmoore247
Contributor

Another related issue: settings applied via spark.conf.set("...", "...") are not picked up when sess = glow.create(spark) is called (forgive me for not knowing the API). The case is:

```python
spark.conf.set("spark.sql.maxPartitions", 2000)
sess = glow.create(spark)
sess.conf.get("spark.sql.maxPartitions")
```
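
One possible workaround sketch, assuming the entry point is `glow.register` and that its `new_session` flag (mentioned above) reuses the current session so its runtime conf carries through:

```python
import glow

spark.conf.set("spark.sql.maxPartitions", 2000)  # config key taken from the report above

# Register Glow on the existing session instead of letting it create a new one.
sess = glow.register(spark, new_session=False)
sess.conf.get("spark.sql.maxPartitions")  # reflects the value set above
```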

@bcajes
Contributor Author

bcajes commented Aug 5, 2021

spark.sql.files.maxPartitionBytes is another setting I've needed to tune, to around 32 MB or less.
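
For example (a sketch; Spark's default is 128 MB and the best value depends on the dataset):

```python
# Shrink the max bytes packed into a single file-scan partition to 32 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
```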
