GH-3044: HashJoin iterator construction no longer eagerly builds hash probe table. #3047
base: main
Conversation
@@ -51,7 +51,7 @@ public static void consume(RowSet rowSet)
      * This operation consumes the RowSet.
      */
     public static long count(RowSet rowSet)
-    { return rowSet.rewindable().size(); }
+    { long c[] = {0}; rowSet.forEach(b -> ++c[0]); return c[0]; }
Intent: Avoid memory copy.
Or maybe rowSet.stream().count() would be cleaner and still avoid the memory copy (though it has conversion-to-Stream overhead)?
Good point! Changed to the cleaner stream version.
Maybe it does not matter with this destructive operation, but the cost of stream() is not trivial. forEach does not use stream.
Intent: Avoid memory copy.
RowSetMem rewind should not be a memory copy; it shares the rows array. A rewindable RowSet (e.g. RowSetMem) goes back to the beginning, whereas forEach continues the iteration.
Behavior so far (IIRC):
- For non-RowSetMem instances, counting would create an intermediate in-memory copy.
- The count for a RowSetMem instance would be the total number of bindings in the in-memory row set, rather than the number of remaining bindings.
I added the method RowSetStream.forEachRemaining; the stack trace for RowSetOps.count using rowSet.stream().count() then shows a delegation to the forEachRemaining method (with some stream creation overhead that should be negligible for larger counts).
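For reference, a minimal sketch of the two counting variants discussed above. The wrapper class and the import path are assumptions; the method bodies mirror the diff and the thread.

import org.apache.jena.sparql.exec.RowSet;   // import path assumed

public final class RowSetCountSketch {
    // Count the remaining bindings without materializing an intermediate
    // in-memory copy. The single-element array is a mutable counter that
    // the lambda can update.
    public static long countViaForEach(RowSet rowSet) {
        long[] c = {0};
        rowSet.forEach(b -> c[0]++);
        return c[0];
    }

    // Alternative discussed in the thread: delegate to the Stream API.
    // Cleaner, at the cost of some stream-creation overhead.
    public static long countViaStream(RowSet rowSet) {
        return rowSet.stream().count();
    }
}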
@@ -68,6 +70,17 @@ private static Iterator<BindingNodeId> access(NodeTupleTable nodeTupleTable, Bin
         iterMatches = x.iterator();
     }

+    // Add cancel check.
As a separate case where timeouts get ignored (besides the hash probe table construction), a cancel check is needed here for the following query on TDB2:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x { ?x rdfs:label ?y . ?y rdfs:label ?z }
time curl http://localhost:3030/ds --data-urlencode 'query=PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?x { ?x rdfs:label ?y . ?y rdfs:label ?z }' --data-urlencode 'timeout=1'
This is against a preloaded TDB2 DBpedia dataset (~30GB).
Set to draft because I am still investigating this query on in-memory datasets.
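A hedged sketch of the kind of cancel check being discussed here: an iterator wrapper that polls a cancellation flag on every step, so a timeout or abort takes effect even during a long scan. The wrapper class, flag, and exception below are illustrative; the actual change places the check inside the TDB2 access/matching code.

import java.util.Iterator;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: every hasNext()/next() call checks a shared cancel flag.
final class CancelCheckingIterator<T> implements Iterator<T> {
    private final Iterator<T> base;
    private final AtomicBoolean cancelSignal;

    CancelCheckingIterator(Iterator<T> base, AtomicBoolean cancelSignal) {
        this.base = base;
        this.cancelSignal = cancelSignal;
    }

    @Override public boolean hasNext() {
        checkCancelled();
        return base.hasNext();
    }

    @Override public T next() {
        checkCancelled();
        return base.next();
    }

    private void checkCancelled() {
        if (cancelSignal != null && cancelSignal.get())
            throw new RuntimeException("Query cancelled");  // placeholder exception
    }
}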
So for the "empty join" case I had to add cancel signal checks to the …
To summarize: see the attached screen recording CancelStageMatchTuple.webm.
GitHub issue resolved #3044
Pull request description: Timeouts in QueryExecDataset were postponed until the (so far eager) hash probe table construction completed, because of locking. This PR proposes to initialize the hash probe table lazily so that the iterator construction becomes cheap and the lock is released quickly.
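A minimal sketch of the lazy-initialization idea described above. The class and method names are illustrative, not the actual Jena hash join iterator code.

import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch: build the hash probe table on first use instead of in the
// constructor, so constructing the iterator (which happens while
// QueryExecDataset holds a lock) stays cheap and the lock is released quickly.
abstract class LazyHashJoinIteratorSketch<B> implements Iterator<B> {

    private Map<Object, List<B>> probeTable;   // null until first needed

    private Map<Object, List<B>> probeTable() {
        if (probeTable == null)
            probeTable = buildProbeTable();    // potentially expensive
        return probeTable;
    }

    @Override
    public boolean hasNext() {
        // The expensive build now happens during iteration, where
        // timeout/cancellation handling can take effect.
        return hasNextBinding(probeTable());
    }

    protected abstract Map<Object, List<B>> buildProbeTable();
    protected abstract boolean hasNextBinding(Map<Object, List<B>> probeTable);
}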
I added a test case that fails without the proposed fix.
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.