[BUG] S3SingleDriverLogStore.listFrom takes a long time for large _delta_logs #1191
Comments
Hi @jkylling, thanks for this issue. We will look into this and get back to you.
Hi @jkylling, this approach LGTM. Want to make an official PR?
Hi @scottsand-db. Thanks for looking into this. I'll open a PR.
## Description

The current implementation of `S3SingleDriverLogStore.listFrom` lists the entire content of the parent directory and filters the result. This can take a long time if the parent directory contains a lot of files. In practice, this can happen for `_delta_log` folders with a lot of commits. We change the implementation to use the [startAfter](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/model/ListObjectsV2Request.html#startAfter--) parameter so that the S3 list response only contains keys lexicographically greater than or equal to the resolved path. This usually reduces the number of S3 list requests from `size of _delta_log / 1000` to 1. This resolves #1191.

I've tested the patch briefly with the sample test described in #1191. The [previous iteration of this patch](jkylling@ec998ee) has been tested a bit more. Correctness has not been tested thoroughly.

## Does this PR introduce _any_ user-facing changes?

No

Closes #1210

Signed-off-by: Scott Sandre <[email protected]>
GitOrigin-RevId: 2a0d1279655672cbdffd5604b7d7d781556888b9
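For illustration, here is a minimal sketch of how the `startAfter` parameter cited above is used with the AWS SDK for Java v2. This is not the actual Delta Lake patch (which works inside the Hadoop S3A/LogStore code rather than calling the SDK directly), and the bucket, prefix, and key names are placeholders:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;
import software.amazon.awssdk.services.s3.model.S3Object;

public class StartAfterListingSketch {
    public static void main(String[] args) {
        // Placeholder bucket/prefix; in the Delta scenario the prefix is the
        // table's _delta_log directory and startAfter is derived from the
        // resolved path passed to listFrom (e.g. the latest checkpoint file).
        String bucket = "my-bucket";
        String prefix = "my-table/_delta_log/";
        String startAfter = prefix + "00000000000000001230.checkpoint.parquet";

        try (S3Client s3 = S3Client.create()) {
            ListObjectsV2Request request = ListObjectsV2Request.builder()
                    .bucket(bucket)
                    .prefix(prefix)
                    // Only keys that sort after startAfter are returned, so a single
                    // response (up to 1000 keys) usually covers all commits written
                    // since the latest checkpoint.
                    .startAfter(startAfter)
                    .build();

            ListObjectsV2Response response = s3.listObjectsV2(request);
            for (S3Object object : response.contents()) {
                System.out.println(object.key());
            }
        }
    }
}
```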
This seems to be enforcing s3a; does it break EMRFS or other Hadoop FileSystem implementations?
@shenavaa This only works for the S3A file system, yup. If you are using EMRFS, then you shouldn't enable this feature.
@scottsand-db No option to disable and it's already in the 2.3.0 milestone!
Closed by c156c98
@shenavaa the feature is disabled by default. You have to explicitly enable it using
Bug
Describe the problem
Reading from a Delta table with a large Delta log takes a long time using `S3SingleDriverLogStore`. When calculating a snapshot of a Delta table, `S3SingleDriverLogStore` will list all files within the `_delta_log` key prefix. This can be expensive for Delta logs with many commits, which frequently happens when a streaming job is writing to a table.

To keep reads and writes fast, the Delta Log Protocol creates checkpoints of the delta log, typically every 10th commit. To calculate a snapshot of a Delta table we look at the latest checkpoint, together with the commits which have happened since the latest checkpoint. This is done using the `listFrom(path: Path)` method of the `LogStore` interface. However, the existing implementation of `listFrom` in `S3SingleDriverLogStore` will list all keys within the `_delta_log` prefix and filter the result to only include lexicographically greater keys; see `delta/storage/src/main/java/io/delta/storage/S3SingleDriverLogStore.java`, line 228 in fab016e.

It is possible to avoid listing all keys under a given key prefix by using the `startAfter` parameter of the S3 ListObjectsV2 API. We have applied a patch with this fix in jkylling@ec998ee. For the sample tables we tested the patch with, this brought the read time down from 20-30 seconds to around 5 seconds, with some occasional reads at 15 seconds. The reads at 15 seconds seemed to be related to the processing of new checkpoints, as we were streaming to the tables at the same time as we were reading.
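For reference, here is a rough sketch of what such a naive `listFrom` amounts to when built on the generic Hadoop `FileSystem` API (this is not the actual Delta code, just an illustration of why the cost grows with the total number of files under `_delta_log`):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NaiveListFromSketch {
    /**
     * Lists every entry of the parent directory and filters client-side.
     * Each call pays for the full directory listing, i.e. roughly
     * (number of files in _delta_log) / 1000 S3 LIST requests behind the scenes.
     */
    public static Iterator<FileStatus> listFrom(FileSystem fs, Path resolvedPath) throws IOException {
        Path parent = resolvedPath.getParent();
        FileStatus[] all = fs.listStatus(parent);
        return Arrays.stream(all)
                // Keep only keys lexicographically >= the requested path...
                .filter(status -> status.getPath().getName()
                        .compareTo(resolvedPath.getName()) >= 0)
                // ...and return them in sorted order, as LogStore.listFrom requires.
                .sorted(Comparator.comparing((FileStatus s) -> s.getPath().getName()))
                .iterator();
    }
}
```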
I have not tested this for the other LogStore implementations, but by looking at the code they seem to be affected by the same issue.
Steps to reproduce
The below tests can be used for the above steps. See also this commit.
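The tests referenced above are in the linked commit; as a purely illustrative sketch of the same idea on a local file system (paths, counts, and names below are made up, not the author's test), one can fill `_delta_log` with junk files and time a read:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.spark.sql.SparkSession;

public class LargeDeltaLogRepro {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                // Standard settings required to use Delta Lake with Spark.
                .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate();

        String table = "/tmp/large_delta_log_table"; // placeholder path

        // Write a small Delta table.
        spark.range(1000).write().format("delta").mode("overwrite").save(table);

        // Fill _delta_log with junk files to simulate a log with many entries.
        for (int i = 0; i < 100_000; i++) {
            Files.write(Paths.get(table, "_delta_log",
                    String.format("junk_%06d", i)), new byte[0]);
        }

        // Time a read; with a naive listFrom this grows with the directory size.
        long start = System.nanoTime();
        long count = spark.read().format("delta").load(table).count();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("count=" + count + " took " + elapsedMs + " ms");

        spark.stop();
    }
}
```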
A sample run of the "time selects" tests gives
That is, reading identical tables takes 10 seconds longer when the delta log contains more files. The same happens if the delta log contains commits instead of junk files.
After applying the patch to `S3SingleDriverLogStore` we get
Observed results
Reading from a Delta table whose _delta_log contains many files takes a really long time.
Expected results
Reading from a Delta table with a _delta_log containing many files should take about the same amount of time as reading from one with few files. That is, the time to compute a snapshot should not be proportional to the number of files in the _delta_log.
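For concreteness (the numbers here are illustrative, not from the report): with 100,000 objects under `_delta_log` and at most 1,000 keys per ListObjectsV2 response, a naive listing costs about 100 LIST requests for every snapshot computation, while a listing that starts after the latest checkpoint typically needs a single request.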
Further details
The issue can be partially mitigated by setting the `delta.stalenessLimit` option to a large value. However, I believe the write path would still be affected by this issue, as it seems to force an update of the snapshot before every write.

The same issue can be observed on a local file system by filling the `_delta_log` directory with junk (or several commits). This issue might affect all LogStores with naive implementations of `listFrom`.
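As a hedged sketch of that partial mitigation (the exact configuration key and value format are assumptions and should be checked against the Delta release in use; recent releases expose it as `spark.databricks.delta.stalenessLimit`):

```java
import org.apache.spark.sql.SparkSession;

public class StalenessLimitMitigation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
        // Serve reads from a snapshot up to 1 hour old instead of forcing a fresh
        // _delta_log listing on every query. Key and value format are assumptions.
        spark.conf().set("spark.databricks.delta.stalenessLimit", "1h");
    }
}
```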
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
I would be happy to contribute to get this issue fixed. The suggested patch might need some work, as it digs quite deep into the AWS Hadoop API.