Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match datastore semantics more closely: btree_scan => index_scan_range #2203

Merged
merged 1 commit into from
Feb 4, 2025

Conversation

Centril
Copy link
Contributor

@Centril Centril commented Feb 4, 2025

Description of Changes

In preparation for adding a index algo/kind that we'll call direct (backed by Vec<Vec<RowPointer>>, ...),
this PR moves all the module bindings APIs (Rust + C#) to use datastore_[delete_by_]index_scan_range_bsatn
and deprecates datastore_[delete_by_]btree_scan_bsatn.

This more closely reflects what the datastore actually does today.
Nowhere (mostly because there's only a single index algo) does the datastore actually check that a btree index is used.
What the ABI actually is about is fast range queries.
The new direct index can also support that, so the more general name makes sense,
while it is also beneficial for efficiency, as we can avoid a check.

After this change, there are no users of the deprecated ABIs, so we could elect to remove them just before 1.0,
but this PR leaves them be.

Some datastore types and methods are also renamed to reflect the new reality.

API and ABI breaking changes

None, for now.

Expected complexity level and risk

2, mostly simple renaming and code motion.

Testing

Existing tests should cover this.

Copy link
Collaborator

@joshua-spacetime joshua-spacetime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the ABI actually is about is fast range queries.
The new direct index can also support that,

Will there be logic for choosing which index to use within the current syscall?

@Centril
Copy link
Contributor Author

Centril commented Feb 4, 2025

Will there be logic for choosing which index to use within the current syscall?

The existing syscall datastore_btree_scan_bsatn will become a deprecated name for datastore_index_scan_range_bsatn,
which won't have any logic beyond generically dispatching index_id to the known index with that id, as is currently done for the different variations of TypedIndexes that we have.
The MutTxId::index_scan_range implementation should not change at all in upcoming PRs.

@Centril Centril added this pull request to the merge queue Feb 4, 2025
Merged via the queue into master with commit 2fc0361 Feb 4, 2025
14 checks passed
@cloutiertyler
Copy link
Contributor

cloutiertyler commented Feb 4, 2025

What the ABI actually is about is fast range queries.

I would dispute this point, the ABI is actually about using an index to perform a range query. The host must validate that the particular type of index specified by index_id can support the index operation being made.

I think this is an important point. Not all indexes will be able to support all index operations.

Also NIT, let's not call them queries, they're datastore operations.

while it is also beneficial for efficiency, as we can avoid a check.

I don't think you can because if we add a hash index we need to validate that that you did not pass in the index_id of a hash index to this function.

@cloutiertyler
Copy link
Contributor

cloutiertyler commented Feb 4, 2025

TBH I'm wavering about this renaming scheme since those points were muddled in the short period between our conversation and merging this PR. I don't think we should revert it necessarily, but I think it is very very important that we're clear that these are physical datastore operations, and the query engine should always be able to choose the specific physical operation it wants to do.

@joshua-spacetime
Copy link
Collaborator

I think it is very very important that we're clear that these are physical datastore operations, and the query engine should always be able to choose the specific physical operation it wants to do.

This was my main concern, and I approved based on the confirmation from @Centril that these are indeed remaining purely physical operations.

@Centril
Copy link
Contributor Author

Centril commented Feb 4, 2025

I would dispute this point, the ABI is actually about using an index to perform a range query. The host must validate that the particular type of index specified by index_id can support the index operation being made.

I'm not sure what the distinction is; I think we're saying the same thing, that this ABI affords range scans via an index, specifically for those indices where we think it is reasonable to expose range scan functionality. For example, you can do a range scan via a hash index, but it's a bad idea to expose such a facility, so we won't.

I think this is an important point. Not all indexes will be able to support all index operations.

We're aligned. :) A datastore_index_scan_point_bsatn, which I think we should add for efficiency, would be supported by hash, btree, direct, whereas _range_ would only be supported by ``btree, direct`.

I don't think you can because if we add a hash index we need to validate that that you did not pass in the index_id of a hash index to this function.

What I meant was that we're avoiding a check until such time as we add a hash index, which might not be necessary, based on the the surprising perf outcomes.

[...], but I think it is very very important that we're clear that these are physical datastore operations, and the query engine should always be able to choose the specific physical operation it wants to do.

Agreed. The query engine will retain full control over what index/operation it wants to use with this PR merged.

@Centril Centril deleted the centril/datastore_index_scan_range branch February 4, 2025 09:53
@gefjon
Copy link
Contributor

gefjon commented Feb 4, 2025

Two cents here: the host function is going to have to dispatch on the index type no matter what. If we have separate btree_scan vs dense_seq_vector_scan methods, then each of them checks equality on the index type, and returns an error otherwise. If we have a unified index_scan_range, then it does a switch/case on the index type, and the cases for unordered index types return errors. There's no version of this, given our system table schemas, where you can have a host function btree_scan which immediately does a low-level physical operation without first checking the index type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants