Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Fix CDC Commit Timestamp value under different Timezones #3347

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

longvu-db
Copy link
Contributor

@longvu-db longvu-db commented Jul 9, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

The CDC's _commit_timestamp is incorrect when we try to read/display it under a different Spark Session's timezone spark.sql.session.timeZone (e.g. America/Chicago, Asia/Ho_Chi_Minh, ...).

In this PR, we address this issue by taking into account timezone to capture the precise point in time when we convert CDCDataSpec's Java Timestamp field to Spark's Timestamp for the _commit_timestamp column for all CDC's file indexes (CDCAddFileIndex, TahoeRemoveFileIndex, TahoeChangeFileIndex).

This is needed in order for CDF to work properly under a different timezone than UTC.

How was this patch tested?

Added UT, some minor UTs fix to take into account timezone.

Does this PR introduce any user-facing changes?

Yes.

  • CDC's _commit_timestamp should now be correct when we try to read/display it under a different Spark Session's timezone spark.sql.session.timeZone (e.g. America/Chicago, Asia/Ho_Chi_Minh, ...).
  • This is a user-facing change compared to the released Delta Lake versions and within the unreleased branches such as master.

modifyDeltaTimestamp(deltaLog, 1, currentTime)
modifyDeltaTimestamp(deltaLog, 2, currentTime + 100000)

val readDf = sql(s"SELECT * FROM table_changes('$tbl', 0, now())")
Copy link
Contributor Author

@longvu-db longvu-db Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now() and current_date() aren't affected by SQLConf.SESSION_LOCAL_TIMEZONE, while we are using the Java's new Date().getTime to generate time under the JVM timezone by the testing environment, in this case the JVM timezone in Delta testing is Pacific Timezone I believe (they set it not UTC on purpose to catch any timezone test failures).

Thus, I use the withDefaultTimeZone to modify the JVM timezone to UTC in this test so that we can test the interaction with timestamp expressions.

@@ -250,13 +251,11 @@ abstract class DeltaCDCSuiteBase
// modify timestamps
// version 0
modifyDeltaTimestamp(deltaLog, 0, 0)
val tsAfterV0 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
.format(new Date(1))
val tsAfterV0 = dateFormat.format(new Date(1))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, we change so that we take into account timezone, these test failed because they didn't take into account timezones, so I'm modifying them to take into account timezone.

Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@scottsand-db scottsand-db merged commit 0d87908 into delta-io:master Jul 9, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants