
fix: multi threaded query execution #1220

Merged
merged 8 commits on Mar 6, 2025

Conversation

nikhilsinhaparseable
Contributor

@nikhilsinhaparseable nikhilsinhaparseable commented Mar 3, 2025

add multi-threaded execution to the query execute method
allows DataFusion to use multiple threads to perform parallel execution of plans
improves query performance

Summary by CodeRabbit

  • Refactor

    • Improved query processing to run in asynchronous mode for enhanced responsiveness.
    • Updated error handling to capture and report issues more effectively.
    • Refined system configuration to optimize concurrent data processing and performance.
    • Introduced new session configuration options for better execution strategies.
  • Chores

    • Changed default value for row group size to 262144 to optimize data processing.


coderabbitai bot commented Mar 3, 2025

Walkthrough

The pull request refactors asynchronous query execution across multiple modules to prevent blocking the async runtime. In both src/handlers/airplane.rs and src/handlers/http/query.rs, direct awaiting of query execution is replaced with tokio::task::spawn_blocking, ensuring that heavy operations run in a dedicated blocking thread and errors are handled robustly. Additionally, src/query/mod.rs updates the session configuration and repartitioning logic for optimized data processing while switching to a multi-threaded async environment.
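The offloading pattern the walkthrough describes can be modeled with a std-only sketch (the real code uses `tokio::task::spawn_blocking` inside an async handler; the function and stream names below are illustrative stand-ins, not the project's actual types):

```rust
use std::thread;

// Stand-in for a CPU-heavy DataFusion plan execution that would
// otherwise block the async runtime.
fn execute_query(stream_name: String) -> Result<usize, String> {
    Ok(stream_name.len())
}

fn main() {
    let stream_name = String::from("demo_stream");
    // Clone the stream name so the worker can own it, mirroring do_get.
    let name_clone = stream_name.clone();
    // Heavy work runs on a dedicated thread; the caller joins the
    // result and distinguishes execution errors from join failures.
    let handle = thread::spawn(move || execute_query(name_clone));
    match handle.join() {
        Ok(Ok(records)) => println!("records: {records}"),
        Ok(Err(e)) => eprintln!("execution error: {e}"),
        Err(_) => eprintln!("worker thread panicked"),
    }
}
```

The three-arm `match` is the same shape as the handlers' error handling: the outer `Result` reports whether the worker completed, the inner one whether the query itself succeeded.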

Changes

File(s): src/handlers/airplane.rs, src/handlers/http/query.rs
Switched from direct query awaiting to using tokio::task::spawn_blocking to offload blocking work. In airplane.rs, the do_get method now clones the stream name for use in the blocking task and applies a match statement for detailed error handling. In http/query.rs, a new async function execute_query encapsulates this logic, clearly distinguishing between execution errors and thread-joining failures.

File(s): src/query/mod.rs
Updated SessionConfig by replacing with_round_robin_repartition(true) with new options, including with_batch_size(1000000), and removed enable_page_index. Modified the execute method to include #[tokio::main(flavor = "multi_thread")] for multi-threaded execution support.

File(s): src/cli.rs
Changed the default value of row_group_size in the Options struct from 1048576 to 262144.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant H as Handler
    participant BT as Blocking Task
    participant Q as Query Executor

    C->>H: Send query request
    H->>H: Clone stream name
    H->>BT: Spawn blocking task for query execution
    BT->>Q: Execute query
    Q-->>BT: Return query result (Ok/Err)
    BT-->>H: Provide result
    H->>C: Respond with records or error

Suggested labels

for next release

Poem

I'm a bunny with a code-loving heart,
Hopping through async tasks with a joyful start,
Spawned threads now keep the runtime light,
Error handling sharp as my whiskers at night,
I celebrate each commit with a hop and a cheer,
CodeRabbit incites my glee—oh, what a year! 🐰


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0417ec and 5a5dc0c.

📒 Files selected for processing (2)
  • src/cli.rs (1 hunks)
  • src/query/mod.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/query/mod.rs
  • src/cli.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/handlers/http/query.rs (1)

133-139: Great implementation of asynchronous query execution using spawn_blocking

The change to use tokio::task::spawn_blocking is an excellent improvement as it offloads potentially CPU-intensive query execution to a dedicated thread pool, preventing blocking of the async runtime. The error handling is also well-implemented with proper matching against different error scenarios.

For even better readability and separation of concerns, consider extracting this pattern into a helper function since it's used in multiple handlers:

-let table_name_clone = table_name.clone();
-let (records, fields) =
-    match tokio::task::spawn_blocking(move || query.execute(table_name_clone)).await {
-        Ok(Ok((records, fields))) => (records, fields),
-        Ok(Err(e)) => return Err(QueryError::Execute(e)),
-        Err(e) => return Err(QueryError::Anyhow(anyhow::Error::msg(e.to_string()))),
-    };
+let (records, fields) = execute_query_in_blocking_thread(query, table_name.clone()).await?;

// Then define this helper function elsewhere:
// async fn execute_query_in_blocking_thread(
//     query: LogicalQuery,
//     stream_name: String,
// ) -> Result<(Vec<RecordBatch>, Vec<String>), QueryError> {
//     match tokio::task::spawn_blocking(move || query.execute(stream_name)).await {
//         Ok(Ok(result)) => Ok(result),
//         Ok(Err(e)) => Err(QueryError::Execute(e)),
//         Err(e) => Err(QueryError::Anyhow(anyhow::Error::msg(e.to_string()))),
//     }
// }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 887a63f and 17c6edd.

📒 Files selected for processing (4)
  • src/handlers/airplane.rs (1 hunks)
  • src/handlers/http/query.rs (1 hunks)
  • src/parseable/streams.rs (1 hunks)
  • src/query/mod.rs (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
🔇 Additional comments (3)
src/handlers/airplane.rs (1)

219-225:

❓ Verification inconclusive

Effective implementation of non-blocking query execution

Offloading the query execution to a blocking thread pool using tokio::task::spawn_blocking is a great architectural improvement. This prevents the query execution from blocking the async Tokio runtime, improving overall system throughput and responsiveness.

Let's verify the error handling improvements by checking if there are different error scenarios properly handled in both handlers:


🏁 Script executed:

#!/bin/bash
# Check if both handlers handle errors consistently

# First, check the query handler error handling
echo "Checking HTTP query handler error handling..."
rg -A 5 "tokio::task::spawn_blocking.*query.execute" src/handlers/http/query.rs

# Then, check the airplane handler error handling
echo "Checking airplane handler error handling..."
rg -A 5 "tokio::task::spawn_blocking.*query.execute" src/handlers/airplane.rs

# Compare the error types being returned
echo "Comparing error types..."
rg "return Err\(" src/handlers/http/query.rs | grep -A 2 "tokio::task::spawn_blocking" 
rg "return Err\(" src/handlers/airplane.rs | grep -A 2 "tokio::task::spawn_blocking"

Length of output: 1264


Non-blocking query execution is effectively implemented, but please double-check error type differences

The offloading of query.execute to a blocking thread pool is well done. Our verification shows that in the airplane handler the errors are wrapped using Status::internal, whereas the HTTP query handler wraps errors in custom QueryError variants. This discrepancy appears intentional given the different protocol contexts (gRPC vs. HTTP), but please confirm these differences were by design.

  • Location: src/handlers/airplane.rs (Lines 219-225) – error handling via Status::internal
  • Observation: HTTP handler (src/handlers/http/query.rs) uses QueryError::Execute and QueryError::Anyhow
  • Action: Verify that these distinct error types align with the intended behavior for each endpoint
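The contrast the review draws between the two handlers can be sketched with hypothetical stand-in types (the project's actual QueryError variants and gRPC Status type are not reproduced here):

```rust
// Hypothetical stand-ins for the two error-wrapping styles compared
// above: the gRPC (airplane.rs) handler folds every failure into one
// internal status string, while the HTTP handler keeps distinct variants.
#[derive(Debug, PartialEq)]
enum QueryError {
    Execute(String), // the query itself failed
    Join(String),    // the blocking task failed to join
}

fn to_grpc_status(e: &QueryError) -> String {
    // airplane.rs style: all variants collapse to one internal status.
    match e {
        QueryError::Execute(m) | QueryError::Join(m) => format!("internal: {m}"),
    }
}

fn main() {
    let exec_err = QueryError::Execute(String::from("plan failed"));
    let join_err = QueryError::Join(String::from("task panicked"));
    println!("{}", to_grpc_status(&exec_err));
    println!("{}", to_grpc_status(&join_err));
}
```

Keeping distinct variants on the HTTP side lets callers map execution failures and join failures to different status codes, which the gRPC side deliberately gives up.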
src/query/mod.rs (2)

94-95:

✅ Verification successful

Optimization: Added batch size configuration for improved performance

Setting a specific batch size (8192) is a good optimization. This controls how many rows are processed in each batch during query execution, allowing for better memory usage and potentially improved performance.

Let's analyze how this batch size value compares to defaults and recommendations:


🌐 Web query:

What is the default batch size in DataFusion and what are the recommended values?

💡 Result:

The default batch size in Apache DataFusion is 8192 rows per batch, as shown in configuration settings and CLI outputs[1][2][3]. This value balances memory efficiency and metadata overhead, particularly for buffer-in-memory operations where small batches would create excessive metadata[2][3].

Recommended Batch Size Values

Use Case | Recommended Size | Rationale
General workloads | 8192 (default) | Balances memory usage and processing efficiency[2][7]
Memory-constrained environments | 1024-4096 | Reduces per-batch memory footprint[1][6]
Large aggregation/join operations | 16384 | Shows ~10% performance improvement in some queries[7]
TopK/sorting optimizations | Match LIMIT clause | Partial sorts can use smaller batches (e.g., 10-100 rows)[5]

Key considerations when adjusting:

  1. Larger batches (16K-65K rows) improve throughput for complex queries but increase memory pressure[7][10]
  2. Smaller batches (1K-4K) reduce memory usage but may impact performance due to increased batch count[5][6]
  3. Use SET datafusion.execution.batch_size TO <value> for session-level adjustments[1]
  4. Configure via environment variable DATAFUSION_EXECUTION_BATCH_SIZE for CLI persistence[1][3]

The optimal size depends on query patterns and available memory resources, with Cloud DataFusion deployments typically recommending nodes with ≥8GB RAM[4]. Performance testing with representative workloads is advised when tuning this parameter.
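A back-of-envelope calculation makes the memory trade-off concrete (the 64 bytes-per-row figure is an assumption for illustration, not measured from this workload):

```rust
// Rough per-batch memory as a function of batch size, comparing
// DataFusion's 8192-row default with the 1_000_000-row setting
// discussed in this PR. bytes_per_row is an assumed average.
fn batch_bytes(rows: usize, bytes_per_row: usize) -> usize {
    rows * bytes_per_row
}

fn main() {
    let default_batch = batch_bytes(8192, 64); // 524_288 B, ~0.5 MiB
    let large_batch = batch_bytes(1_000_000, 64); // 64_000_000 B, ~61 MiB
    println!("default: {default_batch} B, large: {large_batch} B");
}
```

At 64 bytes per row, the larger batch size holds roughly 120x more data in flight per batch per partition, which is why tuning it should be paired with memory profiling.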



Clarification: Explicit Batch Size Configuration Confirmed

The explicit setting of the batch size to 8192 in src/query/mod.rs (lines 94-95) aligns with Apache DataFusion defaults and recommendations for general workloads. Although this value is the default, specifying it enhances clarity and eases future tuning if workload requirements change. Approved.


139-139:

❓ Verification inconclusive

Excellent use of multi-threaded tokio runtime for query execution

Adding the #[tokio::main(flavor = "multi_thread")] attribute to the execute method enables multi-threaded query execution, allowing DataFusion to parallelize operations across multiple CPU cores. This is crucial for improving query performance, especially for complex queries on large datasets.

Let's check if the multi_thread flavor is appropriate and if there are any potential limitations:


🌐 Web query:

What are the pros and cons of using #[tokio::main(flavor = "multi_thread")] on a method instead of the main function?

💡 Result:

Tokio's multi_thread runtime flavor balances concurrency and resource utilization through work-stealing across a thread pool, but its effectiveness depends on proper usage. Here's a breakdown of its trade-offs when applied at different scopes:

Runtime Configuration Fundamentals

  • #[tokio::main(flavor = "multi_thread")] initializes a global runtime with worker threads (defaulting to CPU core count).
  • Tasks are Send + 'static by default, enabling thread-safe execution across cores[1][7].

Pros of Multi-Threaded Runtime

  1. Work-Stealing Efficiency

    • Distributes tasks dynamically across threads, preventing idle cores and reducing tail latency under load[3][9].
    • Example: A web server handles 500 concurrent requests faster than a single-threaded runtime, even if each task sleeps briefly[1].
  2. Blocking Task Isolation

    • Use tokio::task::spawn_blocking to offload CPU-heavy work to a dedicated thread pool, avoiding starvation of async tasks[2][4].
  3. Resource Scalability

    • Adjust worker threads via worker_threads=N to balance throughput and memory usage (e.g., restrict to 1 thread for low-resource environments)[3].

Cons of Multi-Threaded Runtime

  1. Send Bound Complexity

    • Futures and data shared across threads must implement Send, complicating code with non-Send types (e.g., raw pointers, thread-local data)[4][7].
  2. Deadlock Risks

    • Blocking synchronous code (e.g., std::thread::sleep) in async tasks can stall the entire runtime if worker threads are saturated[2].
    • Mitigation: Use yield_now() or offload to spawn_blocking.
  3. Overhead Trade-Offs

    • Multithreading adds memory and scheduling overhead. Single-threaded runtimes (flavor = "current_thread") often match performance for light workloads[3][4].

Method vs. Main Function Usage

  • Main Function: Standard approach. Initializes the runtime once, ensuring all async code uses the same executor[3][7].
  • Method-Level: Not recommended. Tokio prohibits nested runtimes, and method-specific initialization risks:
    • Runtime conflicts or unintended thread pools.
    • Fragmented resource management (e.g., duplicated thread pools)[2][4].

When to Prefer Single-Threaded

  • Low-concurrency services (e.g., CLI tools).
  • !Send futures or shared references across awaits (simpler with LocalSet)[4][7].
  • Resource-constrained environments (embedded systems)[3].

In practice, benchmark both runtimes with worker_threads=1 versus current_thread to quantify trade-offs for your workload[3]. Use multi_thread as the default for servers and adjust based on profiling.



Action Required: Verify Runtime Context for Tokio Attribute Usage

The use of #[tokio::main(flavor = "multi_thread")] in this method leverages Tokio’s multi-threaded runtime for efficient, parallel query execution, which is beneficial for large-scale, complex queries. However, please verify that this method is not being invoked within an already active Tokio runtime. Using the attribute on a non-main function can risk nested runtime conflicts (e.g., creating multiple thread pools or potential deadlocks) if not properly isolated.

  • Location: src/query/mod.rs, Line 139
  • Recommendation: Ensure that this method is the sole entry point for starting a runtime in its context, or refactor to initialize the runtime in the main function if nested execution might occur.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 3, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/handlers/http/query.rs (2)

134-134: Consider resource limits or timeouts for parallel query execution.

Although calling execute_query asynchronously improves throughput, there is no guardrail on concurrency (like a task limiter). If many queries arrive simultaneously, the thread pool might become saturated, impacting the system’s responsiveness.


153-162: Implementation of asynchronous, multi-threaded query execution looks good.

  1. The usage of tokio::task::spawn_blocking is appropriate for CPU-bound tasks, preventing blocking of the async runtime.
  2. The error handling cleanly distinguishes between internal query execution errors and the blocking task’s errors.
  3. Consider adding optional logs or telemetry within spawn_blocking to track query performance or detect long-running tasks.
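The missing guardrail the first nitpick describes can be sketched with a minimal counting semaphore (std-only; in the async code itself `tokio::sync::Semaphore` would be the natural fit — the permit count and workload below are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Minimal counting semaphore: caps how many queries run at once so a
// burst of requests cannot saturate the blocking thread pool.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // block until a permit frees up
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    let sem = Arc::new(Semaphore::new(2)); // at most 2 queries in flight
    let mut handles = Vec::new();
    for i in 0..5 {
        let sem = Arc::clone(&sem);
        handles.push(thread::spawn(move || {
            sem.acquire();
            let result = i * i; // stand-in for query.execute(...)
            sem.release();
            result
        }));
    }
    let total: i32 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total: {total}");
}
```

Acquiring a permit before spawning the blocking work bounds concurrency; excess requests wait rather than piling onto the thread pool.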
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17c6edd and b4c3d88.

📒 Files selected for processing (1)
  • src/handlers/http/query.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (1)
src/handlers/http/query.rs (1)

22-22: No issues noted with the newly added import.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 3, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/handlers/http/query.rs (1)

34-34: Importing std::thread is acceptable, but consider async best practices.
It’s okay to import std::thread for parallel work, though using it in an async context can block the runtime. If you need full non-blocking behavior, tokio::task::spawn_blocking is often recommended.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4c3d88 and 1ac2dc4.

📒 Files selected for processing (1)
  • src/handlers/http/query.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/handlers/http/query.rs (2)

22-22: Import looks fine.
No issues spotted; importing RecordBatch here is appropriate to handle query results.


135-135: Validate concurrency approach.
The invocation of the new execute_query function is clear, but ensure that parallel calls to this endpoint won’t block the async runtime if all threads are occupied, especially under high load.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 6, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 6, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 6, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/query/mod.rs (1)

103-103: Document this configuration option

Please add a comment explaining why binary_as_string = true is necessary. What specific issues does this solve, and how does it affect query results?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e31e655 and 51d3ac2.

📒 Files selected for processing (1)
  • src/query/mod.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (3)
src/query/mod.rs (3)

94-94: Add documentation for this configuration value

Please add a comment explaining why a batch size of 1000000 was chosen and its impact on query performance. This will help maintainers understand the rationale behind this specific value.


107-107: Document how this optimization affects query execution

This configuration option seems important for partitioning optimization. Please add a comment explaining its purpose and expected impact on query performance.


131-131: Good implementation of multi-threaded execution

The #[tokio::main(flavor = "multi_thread")] annotation aligns well with the PR objective to improve query performance through multi-threading. This allows DataFusion to utilize multiple threads for parallel execution of query plans.

Consider adding a comment explaining how this annotation improves performance and any potential resource considerations (e.g., thread pool size, memory usage).

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 6, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/query/mod.rs (1)

132-132: Excellent improvement with multi-threaded execution!

Adding #[tokio::main(flavor = "multi_thread")] enables DataFusion to leverage multiple CPU cores for query execution, which should significantly improve performance for complex queries or large datasets.

Consider adding a brief comment explaining the performance benefits this change provides and any considerations for resource usage.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51d3ac2 and f0417ec.

📒 Files selected for processing (2)
  • src/cli.rs (1 hunks)
  • src/query/mod.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cli.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (3)
src/query/mod.rs (3)

96-96: Consider adding an explanation for the batch size selection.

The batch size of 1,000,000 is a significant configuration parameter that affects query performance and memory usage. Adding a comment explaining why this specific value was chosen would improve code maintainability.


104-104: Good addition of binary_as_string configuration.

Setting binary_as_string = true ensures that binary data is treated as strings, which is appropriate for log data that may contain binary content that needs to be displayed as text.


91-92: Great documentation reference!

Providing a link to the DataFusion documentation helps developers understand the available configuration options and their implications.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 6, 2025
@nitisht nitisht merged commit 7596825 into parseablehq:main Mar 6, 2025
14 checks passed