Native Grok Reader Implementation #25205

Open. Wants to merge 6 commits into base: master

bangtim (Contributor) commented Mar 3, 2025

Description

Native reader implementation for Grok format.

This PR implements a GrokDeserializer and ports over the entire Java Grok library. Athena depends on release 0.1.4 of that library, with some minor bug fixes and changes to support the date data type.

The Java Grok library can be found here: https://github.com/thekrakken/java-grok/tree/grok-0.1.4

  • The library includes an API for parsing logs, as well as some basic unit tests
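
For readers unfamiliar with Grok, the sketch below illustrates the general idea behind parsing a log line with a Grok expression: each `%{NAME:field}` reference is expanded into a named regex group, and the captures become field values. It is a minimal, self-contained illustration built on `java.util.regex` only. The class `GrokSketch`, its methods, and the four-entry dictionary are hypothetical and are not the java-grok API or the code in this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not the java-grok library and not this PR's code.
final class GrokSketch
{
    // Hypothetical mini dictionary; a real Grok pattern file is far larger and more precise.
    private static final Map<String, String> DICTIONARY = Map.of(
            "IP", "\\d{1,3}(?:\\.\\d{1,3}){3}",
            "WORD", "\\w+",
            "NUMBER", "\\d+(?:\\.\\d+)?",
            "URIPATH", "/\\S*");

    // Matches %{NAME} or %{NAME:field}. Field names are limited to letters and digits
    // because Java named groups do not allow underscores.
    private static final Pattern REFERENCE =
            Pattern.compile("%\\{(\\w+)(?::([a-zA-Z][a-zA-Z0-9]*))?\\}");

    // Expands a Grok expression into a plain regex, recording field names in order.
    static String expand(String grokExpression, List<String> fields)
    {
        Matcher reference = REFERENCE.matcher(grokExpression);
        StringBuilder regex = new StringBuilder();
        while (reference.find()) {
            String definition = DICTIONARY.get(reference.group(1));
            if (definition == null) {
                throw new IllegalArgumentException("Unknown pattern: " + reference.group(1));
            }
            String field = reference.group(2);
            String group = (field == null)
                    ? "(?:" + definition + ")"
                    : "(?<" + field + ">" + definition + ")";
            if (field != null) {
                fields.add(field);
            }
            reference.appendReplacement(regex, Matcher.quoteReplacement(group));
        }
        reference.appendTail(regex);
        return regex.toString();
    }

    // Returns field -> captured text, or an empty map when the line does not match.
    static Map<String, String> match(String grokExpression, String line)
    {
        List<String> fields = new ArrayList<>();
        Matcher matcher = Pattern.compile(expand(grokExpression, fields)).matcher(line);
        Map<String, String> captures = new LinkedHashMap<>();
        if (matcher.find()) {
            for (String field : fields) {
                captures.put(field, matcher.group(field));
            }
        }
        return captures;
    }

    public static void main(String[] args)
    {
        String expression = "%{IP:client} %{WORD:method} %{URIPATH:request} %{NUMBER:bytes}";
        System.out.println(match(expression, "55.3.244.1 GET /index.html 15824"));
    }
}
```

The sketch treats everything outside a `%{...}` reference as literal regex text and does not handle nested pattern definitions or type conversion, both of which a real Grok implementation needs.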

Questions/concerns:

  • One thing to pay attention to is the LICENSE
  • The license header in the ported files differs from the one used in other files, so the build fails (with the same header as other files, the build succeeds locally). How should we make sure the header properly credits the authors/contributors of the open source Grok library? cc: @martint
  • What should the getHiveSerDeClassNames value be?

The implementation of the reader (everything aside from the Java Grok library) is in the following files:

  • trino-hive-formats module:
    • GrokDeserializer + GrokDeserializerFactory --> our implementation of the Deserializer
      • Very similar to regex
    • TestGrokFormat --> some additional unit tests + tests against examples found in the Athena docs (reading lines, following the format of other native reader tests)
    • pom.xml
  • trino-hive module:
    • HiveModule
    • HiveClassNames
    • HiveMetadata
    • HiveStorageFormat
    • HiveTableProperties
    • GrokFileWriterFactory
    • GrokPageSourceFactory
    • BaseHiveConnectorTest
    • HiveTestUtils
    • TestGrokTable
    • TestHivePageSink
    • pom.xml

Additional context and related issues

Athena supports the GrokSerde, and this is a bug-for-bug implementation of what Athena currently has.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add native Grok file format reader. ({issue}`25205`)

@cla-bot cla-bot bot added the cla-signed label Mar 3, 2025
@github-actions github-actions bot added the hive Hive connector label Mar 3, 2025
@bangtim bangtim force-pushed the native-grok-reader branch from 242112c to 3917639 Compare March 4, 2025 16:51
martint (Member) commented Mar 4, 2025

We need to preserve the copyright notice in those files, but it doesn't need to be laid out verbatim. See how we do it in other places, such as:

// Copyright (C) 2007 The Guava Authors

@bangtim bangtim force-pushed the native-grok-reader branch 3 times, most recently from a36e4bd to 2fdc153 Compare March 4, 2025 22:30
@bangtim bangtim force-pushed the native-grok-reader branch 9 times, most recently from d30ef32 to ca31331 Compare March 10, 2025 15:11
@bangtim bangtim force-pushed the native-grok-reader branch from ca31331 to 8c86b48 Compare March 10, 2025 16:08
@bangtim bangtim requested a review from findinpath March 11, 2025 15:00
Labels
cla-signed hive Hive connector