fix: multi threaded query execution #1220
Conversation
Walkthrough: The pull request refactors asynchronous query execution across multiple modules to prevent blocking the async runtime. In both the HTTP query handler (`src/handlers/http/query.rs`) and the Flight handler (`src/handlers/airplane.rs`), query execution is offloaded to a blocking task.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant C as Client
    participant H as Handler
    participant BT as Blocking Task
    participant Q as Query Executor
    C->>H: Send query request
    H->>H: Clone stream name
    H->>BT: Spawn blocking task for query execution
    BT->>Q: Execute query
    Q-->>BT: Return query result (Ok/Err)
    BT-->>H: Provide result
    H->>C: Respond with records or error
```
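The flow in the diagram can be sketched with plain standard-library threads. This is an illustrative stand-in for the `tokio::task::spawn_blocking` pattern the PR uses; `execute_query` and `run_query_off_thread` are hypothetical names, not functions from the codebase.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the PR's query execution step.
fn execute_query(stream_name: &str) -> Result<Vec<String>, String> {
    // CPU-intensive plan execution would happen here.
    Ok(vec![format!("records for {stream_name}")])
}

// Offload the blocking work to a dedicated thread and wait for the
// result, mirroring what tokio::task::spawn_blocking does for the
// async handler without tying up a runtime worker.
fn run_query_off_thread(stream_name: String) -> Result<Vec<String>, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(execute_query(&stream_name));
    });
    rx.recv().map_err(|e| e.to_string())?
}

fn main() {
    match run_query_off_thread("demo_stream".to_string()) {
        Ok(records) => println!("{} record batch(es)", records.len()),
        Err(e) => eprintln!("query failed: {e}"),
    }
}
```

In the real handler the receiving side is an `.await` on the spawned task rather than a channel `recv`, but the division of labor is the same.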
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/handlers/http/query.rs (1)
`133-139`: **Great implementation of asynchronous query execution using `spawn_blocking`**

The change to use `tokio::task::spawn_blocking` is an excellent improvement, as it offloads potentially CPU-intensive query execution to a dedicated thread pool, preventing blocking of the async runtime. The error handling is also well implemented, with proper matching against the different error scenarios. For even better readability and separation of concerns, consider extracting this pattern into a helper function, since it is used in multiple handlers:
```diff
-let table_name_clone = table_name.clone();
-let (records, fields) =
-    match tokio::task::spawn_blocking(move || query.execute(table_name_clone)).await {
-        Ok(Ok((records, fields))) => (records, fields),
-        Ok(Err(e)) => return Err(QueryError::Execute(e)),
-        Err(e) => return Err(QueryError::Anyhow(anyhow::Error::msg(e.to_string()))),
-    };
+let (records, fields) = execute_query_in_blocking_thread(query, table_name.clone()).await?;
```

Then define this helper function elsewhere:

```rust
async fn execute_query_in_blocking_thread(
    query: LogicalQuery,
    stream_name: String,
) -> Result<(Vec<RecordBatch>, Vec<String>), QueryError> {
    match tokio::task::spawn_blocking(move || query.execute(stream_name)).await {
        Ok(Ok(result)) => Ok(result),
        Ok(Err(e)) => Err(QueryError::Execute(e)),
        Err(e) => Err(QueryError::Anyhow(anyhow::Error::msg(e.to_string()))),
    }
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/handlers/airplane.rs (1 hunks)
- src/handlers/http/query.rs (1 hunks)
- src/parseable/streams.rs (1 hunks)
- src/query/mod.rs (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
🔇 Additional comments (3)
src/handlers/airplane.rs (1)
`219-225`: ❓ Verification inconclusive

**Effective implementation of non-blocking query execution**

Offloading the query execution to a blocking thread pool using `tokio::task::spawn_blocking` is a great architectural improvement. This prevents the query execution from blocking the async Tokio runtime, improving overall system throughput and responsiveness.

Let's verify the error handling improvements by checking whether the different error scenarios are properly handled in both handlers:
🏁 Script executed:
```shell
#!/bin/bash
# Check if both handlers handle errors consistently

# First, check the query handler error handling
echo "Checking HTTP query handler error handling..."
rg -A 5 "tokio::task::spawn_blocking.*query.execute" src/handlers/http/query.rs

# Then, check the airplane handler error handling
echo "Checking airplane handler error handling..."
rg -A 5 "tokio::task::spawn_blocking.*query.execute" src/handlers/airplane.rs

# Compare the error types being returned
echo "Comparing error types..."
rg "return Err\(" src/handlers/http/query.rs | grep -A 2 "tokio::task::spawn_blocking"
rg "return Err\(" src/handlers/airplane.rs | grep -A 2 "tokio::task::spawn_blocking"
```

Length of output: 1264
Non-blocking query execution is effectively implemented, but please double-check error type differences
The offloading of `query.execute` to a blocking thread pool is well done. Our verification shows that the airplane handler wraps errors using `Status::internal`, whereas the HTTP query handler wraps errors in custom `QueryError` variants. This discrepancy appears intentional given the different protocol contexts (gRPC vs. HTTP), but please confirm these differences are by design.
- Location: `src/handlers/airplane.rs` (lines 219-225), error handling via `Status::internal`
- Observation: the HTTP handler (`src/handlers/http/query.rs`) uses `QueryError::Execute` and `QueryError::Anyhow`
- Action: verify that these distinct error types align with the intended behavior for each endpoint
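To make the two wrapping strategies concrete, here is a minimal sketch. `Status` below is a hand-rolled stand-in for tonic's `Status` (the real gRPC type), and the `QueryError` variant names follow the review's description rather than the actual codebase.

```rust
// Stand-in for the HTTP handler's error type (variant names follow the review).
#[derive(Debug)]
enum QueryError {
    Execute(String),
    Anyhow(String),
}

// Hand-rolled stand-in for tonic::Status as used by the gRPC handler.
#[derive(Debug, PartialEq)]
struct Status {
    code: &'static str,
    message: String,
}

impl Status {
    fn internal(message: impl Into<String>) -> Self {
        Status { code: "INTERNAL", message: message.into() }
    }
}

// The same underlying failure maps to a protocol-appropriate wrapper.
fn to_http_error(e: String) -> QueryError {
    QueryError::Execute(e)
}

fn to_grpc_error(e: String) -> Status {
    Status::internal(e)
}

fn main() {
    let cause = "table not found".to_string();
    println!("HTTP: {:?}", to_http_error(cause.clone()));
    println!("gRPC: {:?}", to_grpc_error(cause));
}
```

Keeping the conversion in one named function per protocol makes the intentional divergence easy to audit.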
src/query/mod.rs (2)
`94-95`: ✅ Verification successful

**Optimization: added batch size configuration for improved performance**

Setting a specific batch size (8192) is a good optimization. It controls how many rows are processed in each batch during query execution, allowing for better memory usage and potentially improved performance.

Let's analyze how this batch size value compares to defaults and recommendations:
🌐 Web query:
What is the default batch size in DataFusion and what are the recommended values?
💡 Result:
The default batch size in Apache DataFusion is 8192 rows per batch, as shown in configuration settings and CLI outputs[1][2][3]. This value balances memory efficiency and metadata overhead, particularly for buffer-in-memory operations where small batches would create excessive metadata[2][3].
Recommended Batch Size Values
| Use case | Recommended size | Rationale |
|---|---|---|
| General workloads | 8192 (default) | Balances memory usage and processing efficiency[2][7] |
| Memory-constrained environments | 1024-4096 | Reduces per-batch memory footprint[1][6] |
| Large aggregation/join operations | 16384 | Shows ~10% performance improvement in some queries[7] |
| TopK/sorting optimizations | Match LIMIT clause | Partial sorts can use smaller batches (e.g., 10-100 rows)[5] |

Key considerations when adjusting:
- Larger batches (16K-65K rows) improve throughput for complex queries but increase memory pressure[7][10]
- Smaller batches (1K-4K) reduce memory usage but may impact performance due to increased batch count[5][6]
- Use `SET datafusion.execution.batch_size TO <value>` for session-level adjustments[1]
- Configure via the environment variable `DATAFUSION_EXECUTION_BATCH_SIZE` for CLI persistence[1][3]

The optimal size depends on query patterns and available memory resources, with Cloud Data Fusion deployments typically recommending nodes with ≥8GB RAM[4]. Performance testing with representative workloads is advised when tuning this parameter.
Citations:
- 1: https://datafusion.apache.org/user-guide/cli/usage.html
- 2: https://datafusion.apache.org/user-guide/configs.html
- 3: https://datafusion.apache.org/_sources/user-guide/configs.md.txt
- 4: https://cloud.google.com/data-fusion/docs/concepts/cluster-sizing
- 5: https://xebia.com/blog/optimizing-topk-queries-in-datafusion/
- 6: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html
- 7: Tweak optimal default batch size apache/datafusion#6916
- 8: Improve RepartitionExec for better query performance apache/datafusion#7001
- 9: https://cloud.google.com/data-fusion/docs/how-to/configure-cloud-storage-batch-source
- 10: https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/
Clarification: Explicit Batch Size Configuration Confirmed
The explicit setting of the batch size to 8192 in `src/query/mod.rs` (lines 94-95) aligns with Apache DataFusion's defaults and recommendations for general workloads. Although this value is the default, specifying it explicitly enhances clarity and eases future tuning if workload requirements change. Approved.
`139-139`: ❓ Verification inconclusive

**Excellent use of multi-threaded tokio runtime for query execution**

Adding the `#[tokio::main(flavor = "multi_thread")]` attribute to the `execute` method enables multi-threaded query execution, allowing DataFusion to parallelize operations across multiple CPU cores. This is crucial for improving query performance, especially for complex queries on large datasets.

Let's check whether the `multi_thread` flavor is appropriate and whether there are any potential limitations:
🌐 Web query:
What are the pros and cons of using #[tokio::main(flavor = "multi_thread")] on a method instead of the main function?
💡 Result:
Tokio's `multi_thread` runtime flavor balances concurrency and resource utilization through work-stealing across a thread pool, but its effectiveness depends on proper usage. Here's a breakdown of its trade-offs when applied at different scopes:

Runtime Configuration Fundamentals
- `#[tokio::main(flavor = "multi_thread")]` initializes a global runtime with worker threads (defaulting to the CPU core count).
- Tasks are `Send` + `'static` by default, enabling thread-safe execution across cores[1][7].
Pros of Multi-Threaded Runtime

Work-Stealing Efficiency
- Distributes tasks dynamically across threads, preventing idle cores and reducing tail latency under load[3][9].
- Example: A web server handles 500 concurrent requests faster than a single-threaded runtime, even if each task sleeps briefly[1].

Blocking Task Isolation
- Use `tokio::task::spawn_blocking` to offload CPU-heavy work to a dedicated thread pool, avoiding starvation of async tasks[2][4].

Resource Scalability
- Adjust worker threads via `worker_threads=N` to balance throughput and memory usage (e.g., restrict to 1 thread for low-resource environments)[3].
Cons of Multi-Threaded Runtime
Send
Bound Complexity
- Futures and data shared across threads must implement
Send
, complicating code with non-Send
types (e.g., raw pointers, thread-local data)[4][7].Deadlock Risks
- Blocking synchronous code (e.g.,
std::thread::sleep
) in async tasks can stall the entire runtime if worker threads are saturated[2].- Mitigation: Use
yield_now()
or offload tospawn_blocking
.Overhead Trade-Offs
- Multithreading adds memory and scheduling overhead. Single-threaded runtimes (
flavor = "current_thread"
) often match performance for light workloads[3][4].
Method vs. Main Function Usage
- Main function: the standard approach. Initializes the runtime once, ensuring all async code uses the same executor[3][7].
- Method-level: not recommended. Tokio prohibits nested runtimes, and method-specific initialization risks:
  - runtime conflicts or unintended thread pools;
  - fragmented resource management (e.g., duplicated thread pools)[2][4].
When to Prefer Single-Threaded
- Low-concurrency services (e.g., CLI tools).
- `!Send` futures or shared references across awaits (simpler with `LocalSet`)[4][7].
- Resource-constrained environments (embedded systems)[3].

In practice, benchmark both runtimes with `worker_threads=1` versus `current_thread` to quantify trade-offs for your workload[3]. Use `multi_thread` as the default for servers and adjust based on profiling.

Citations:
- 1: https://users.rust-lang.org/t/dont-fully-understand-tokio-multithreaded-runtime-benefits/113393
- 2: https://users.rust-lang.org/t/tokio-using-core-threads-for-cpu-heavy-computation/83443
- 3: https://users.rust-lang.org/t/choosing-rt-multi-thread-versus-rt-option-in-tokio/106125
- 4: https://news.ycombinator.com/item?id=37466284
- 5: https://bitbashing.io/async-rust.html
- 6: https://www.howtocodeit.com/articles/master-hexagonal-architecture-rust
- 7: https://swatinem.de/blog/futures-n-tasks/
- 8: https://livebook.manning.com/book/code-like-a-pro-in-rust/chapter-11/v-8/
- 9: https://news.ycombinator.com/item?id=36782884
- 10: https://livebook.manning.com/book/code-like-a-pro-in-rust/chapter-8/
Action Required: Verify Runtime Context for Tokio Attribute Usage
The use of `#[tokio::main(flavor = "multi_thread")]` in this method leverages Tokio's multi-threaded runtime for efficient, parallel query execution, which is beneficial for large-scale, complex queries. However, please verify that this method is not invoked within an already active Tokio runtime. Using the attribute on a non-main function risks nested-runtime conflicts (e.g., creating multiple thread pools or potential deadlocks) if not properly isolated.

- Location: `src/query/mod.rs`, line 139
- Recommendation: ensure that this method is the sole entry point for starting a runtime in its context, or refactor to initialize the runtime in the main function if nested execution might occur.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/handlers/http/query.rs (2)
`134-134`: **Consider resource limits or timeouts for parallel query execution.**

Although calling `execute_query` asynchronously improves throughput, there is no guardrail on concurrency (such as a task limiter). If many queries arrive simultaneously, the thread pool might become saturated, impacting the system's responsiveness.
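One possible guardrail bounds in-flight queries with a pool of permits. This sketch uses only the standard library for illustration; in the real async handler, `tokio::sync::Semaphore` would more likely play this role.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

// Run `total` simulated queries while allowing at most `max_parallel`
// to execute at once. A bounded channel of unit "permits" acts as a
// simple semaphore.
fn run_queries(total: usize, max_parallel: usize) -> usize {
    let (permit_tx, permit_rx) = sync_channel(max_parallel);
    for _ in 0..max_parallel {
        permit_tx.send(()).unwrap();
    }
    let permit_rx = Arc::new(Mutex::new(permit_rx));
    let completed = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = (0..total)
        .map(|_| {
            let rx = Arc::clone(&permit_rx);
            let tx = permit_tx.clone();
            let done = Arc::clone(&completed);
            thread::spawn(move || {
                // Block until a permit is free, bounding concurrency.
                rx.lock().unwrap().recv().unwrap();
                // ...query execution would happen here...
                *done.lock().unwrap() += 1;
                tx.send(()).unwrap(); // return the permit
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    let n = *completed.lock().unwrap();
    n
}

fn main() {
    println!("completed {}", run_queries(5, 2));
}
```

A timeout per query (e.g., racing the task against a timer) would complement the limiter by preventing a single slow query from holding a permit indefinitely.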
`153-162`: **Implementation of asynchronous, multi-threaded query execution looks good.**

- The usage of `tokio::task::spawn_blocking` is appropriate for CPU-bound tasks, preventing blocking of the async runtime.
- The error handling cleanly distinguishes between internal query execution errors and the blocking task's errors.
- Consider adding optional logs or telemetry within `spawn_blocking` to track query performance or detect long-running tasks.
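A lightweight way to add that telemetry is to wrap the query closure in a timing helper. This is a sketch: `timed_execute` is a hypothetical helper, and the real code would log via the project's tracing facade rather than `eprintln!`.

```rust
use std::time::Instant;

// Hypothetical helper: wrap a closure to record wall-clock duration,
// so slow queries can be spotted in the logs.
fn timed_execute<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    let elapsed = start.elapsed();
    // In the handler this would go through the tracing/log facade.
    eprintln!("query `{label}` finished in {elapsed:?}");
    out
}

fn main() {
    // Inside spawn_blocking the call would look roughly like:
    // timed_execute(&table_name, move || query.execute(table_name_clone))
    let result = timed_execute("demo", || (1..=4).sum::<i32>());
    println!("result = {result}");
}
```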
📒 Files selected for processing (1)
- src/handlers/http/query.rs (3 hunks)
🔇 Additional comments (1)
src/handlers/http/query.rs (1)
`22-22`: No issues noted with the newly added import.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/handlers/http/query.rs (1)
`34-34`: **Importing `std::thread` is acceptable, but consider async best practices.**

It's okay to import `std::thread` for parallel work, though using it in an async context can block the runtime. If you need fully non-blocking behavior, `tokio::task::spawn_blocking` is often recommended.
📒 Files selected for processing (1)
- src/handlers/http/query.rs (4 hunks)
🔇 Additional comments (2)
src/handlers/http/query.rs (2)
`22-22`: **Import looks fine.**

No issues spotted; importing `RecordBatch` here is appropriate for handling query results.
`135-135`: **Validate concurrency approach.**

The invocation of the new `execute_query` function is clear, but ensure that parallel calls to this endpoint won't block the async runtime if all threads are occupied, especially under high load.
Force-pushed 1ac2dc4 to b4c3d88
add multi threading execution to query execute method: allows DataFusion to use multiple threads to perform parallel execution of plans, improving query performance
Force-pushed 0493eb2 to 475d1a7
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/query/mod.rs (1)
`103-103`: **Document this configuration option**

Please add a comment explaining why `binary_as_string = true` is necessary. What specific issue does it solve, and how does it affect query results?
📒 Files selected for processing (1)
- src/query/mod.rs (2 hunks)
🔇 Additional comments (3)
src/query/mod.rs (3)
`94-94`: **Add documentation for this configuration value**

Please add a comment explaining why a batch size of 1000000 was chosen and its impact on query performance. This will help maintainers understand the rationale behind this specific value.
`107-107`: **Document how this optimization affects query execution**

This configuration option seems important for partitioning optimization. Please add a comment explaining its purpose and expected impact on query performance.
`131-131`: **Good implementation of multi-threaded execution**

The `#[tokio::main(flavor = "multi_thread")]` annotation aligns well with the PR objective to improve query performance through multi-threading. It allows DataFusion to utilize multiple threads for parallel execution of query plans. Consider adding a comment explaining how this annotation improves performance and any potential resource considerations (e.g., thread pool size, memory usage).
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/query/mod.rs (1)
`132-132`: **Excellent improvement with multi-threaded execution!**

Adding `#[tokio::main(flavor = "multi_thread")]` enables DataFusion to leverage multiple CPU cores for query execution, which should significantly improve performance for complex queries or large datasets. Consider adding a brief comment explaining the performance benefits this change provides and any considerations for resource usage.
📒 Files selected for processing (2)
- src/cli.rs (1 hunks)
- src/query/mod.rs (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
- src/cli.rs
🔇 Additional comments (3)
src/query/mod.rs (3)
`96-96`: **Consider adding an explanation for the batch size selection.**

The batch size of 1,000,000 is a significant configuration parameter that affects query performance and memory usage. Adding a comment explaining why this specific value was chosen would improve code maintainability.
`104-104`: **Good addition of `binary_as_string` configuration.**

Setting `binary_as_string = true` ensures that binary data is treated as strings, which is appropriate for log data that may contain binary content needing to be displayed as text.
`91-92`: **Great documentation reference!**

Providing a link to the DataFusion documentation helps developers understand the available configuration options and their implications.
Force-pushed f0417ec to 5a5dc0c
add multi threading execution to query execute method
allows datafusion to use multiple threads to perform parallel execution of plans
improves query performance
Summary by CodeRabbit
Refactor
Chores