
timeout parameter is not always respected by fuseki server #3044

Open
mpagni12 opened this issue Mar 4, 2025 · 11 comments · May be fixed by #3047

mpagni12 commented Mar 4, 2025

Version

5.3.0

What happened?

Using the Python requests client, I am sequentially running a list of diverse SPARQL queries against a Fuseki endpoint at localhost:3030, using POST requests and a timeout parameter set to 10 s. For the first few lengthy queries in the list, the timeout works as expected, i.e. the request terminates after approximately 10 s with a 503 status code. However, at some point the timeout of subsequent queries is ignored, i.e. the queries execute completely and return normally.

As I suspect that this problem might be linked to the Fuseki HTTP server, I have not investigated the content of the queries that are not timing out. Should I?

Thanks in advance,

Marco

Relevant output and stacktrace

Are you interested in making a pull request?

None

mpagni12 added the bug label Mar 4, 2025
afs (Member) commented Mar 4, 2025

@mpagni12 -- there isn't enough here to recreate the situation from the description.

Could you provide the Fuseki server log file for the relevant first requests (timeouts) and the later ones (non-timeouts)? If that can be "verbose" (server run with "-v"), that would be best.

What is the storage? TDB2? Any inference?

I have not investigated the content of the queries that are not timing out. Should I?

It does seem we need to know what the queries are.

Note: if the server has just started, the first requests are slower. Java is still optimizing code, and the file cache becomes more effective over time.

Aklakan (Contributor) commented Mar 4, 2025

I quite recently fixed several cases where timeouts were not working properly. Without your query load it's hard to tell whether this is related to something I did not fix - or perhaps to something that I accidentally broke.

Are update requests involved?
Do the timeouts work properly for your query load with Jena 5.0.0?

mpagni12 (Author) commented Mar 5, 2025

Here is the log: fuseki.log.gz

Storage is TDB with no inference.

mpagni12 (Author) commented Mar 5, 2025

I quite recently fixed several cases where timeouts were not working properly. Without your query load its hard to tell whether this is related to something I did not fix - or perhaps to something that I accidentally broke.

Are update requests involved? Do the timeouts work properly for your query load with jena 5.0.0?

I am using 5.3.0, should I downgrade my version for testing?

Aklakan (Contributor) commented Mar 5, 2025

I am using 5.3.0, should I downgrade my version for testing?

That would be very helpful in order to determine whether the issue existed for a while or was introduced recently.

mpagni12 (Author) commented Mar 5, 2025

I rebuilt the database and reran the same test using 5.0.0: all lengthy queries time out after 10 s :-)

This strongly suggests that the bug was introduced later.

afs (Member) commented Mar 5, 2025

I looked for timeout= and matched queries to their response code: e.g. [34] 503 Service Unavailable (10.879 s) or [26] 200 OK (7.066 s)

I didn't find any 200's over 10 seconds.
I didn't find any 503's under 10 seconds.

I did find one outlier: [20] 503 Service Unavailable (377.566 s). It is like the query from #3021 without the suggested workaround, but with extra work added. The OPTIONALs look like they have a partial cross-product in them.

The timeout is tested every so often - it isn't an interrupt - so the query is stuck in a busy CPU loop (probably fixable; the caveat is that polling the timeout too often is a bit expensive).
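The pattern is roughly the one sketched below (an illustrative sketch only, not Jena's actual implementation): the deadline is polled at checkpoints during result production, so a stretch of work that never reaches a checkpoint delays cancellation.

class PolledTimeoutIterator {
    // Illustrative only - shows why a polled (non-interrupting) timeout can be delayed.
    private final java.util.Iterator<Object> inner;
    private final long deadline;     // absolute time in milliseconds
    private long count = 0;

    PolledTimeoutIterator(java.util.Iterator<Object> inner, long timeoutMillis) {
        this.inner = inner;
        this.deadline = System.currentTimeMillis() + timeoutMillis;
    }

    Object next() {
        // The clock is checked only every so often, because polling on every
        // element would be comparatively expensive.
        if (++count % 1000 == 0 && System.currentTimeMillis() > deadline)
            throw new RuntimeException("query cancelled (timeout)");
        // If this call itself spins in a tight loop for a long time before
        // returning, the checkpoint above is never reached and the timeout is late.
        return inner.next();
    }
}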

Aklakan (Contributor) commented Mar 5, 2025

So I can reproduce timeouts being ignored on property paths, such as this one on a DBpedia dataset (lots of labels):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x { ?x rdfs:label/rdfs:label ?z }

time curl http://localhost:3030/ds --data-urlencode 'query=PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?x { ?x rdfs:label/rdfs:label ?z }' --data-urlencode 'timeout=1'

I have not analyzed property paths further.

I can also reproduce a similar (perhaps the same) problem with this query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x {
  ?x rdfs:label ?o
    { ?x rdfs:label ?z }
  UNION
    { BIND("hi" AS ?s) }
}

The problem is synchronization in conjunction with an eagerly executing iterator:

The timeout fires and tries to abort, but doesn't get the lock:

class TimeoutCallback implements Runnable {
    @Override
    public void run() {
        synchronized (lockTimeout) {

The lock is acquired here:

private void startQueryIterator() {
    synchronized (lockTimeout) {
        if (cancelSignal.get()) {
            // Fail before starting the iterator if cancelled already
            throw new QueryCancelledException();
        }
        startQueryIteratorActual();

And while the lock is held, QueryExecDataset.startQueryIteratorActual executes a hash join eagerly in the line getPlan().iterator(), so the abort signal due to the timeout does not get through:

if ( !isTimeoutSet(timeout1) && !isTimeoutSet(timeout2) ) {
    // Case -1,-1
    queryIterator = getPlan().iterator();
    return;
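
Put differently, the interaction is roughly the one below (a toy sketch of the pattern, not the actual QueryExecDataset code): the timeout callback and the iterator start synchronize on the same lock, and because the eager hash join runs inside the locked section, the abort cannot be delivered until that work finishes.

class EagerStartVsTimeout {
    private final Object lockTimeout = new Object();
    private volatile boolean cancelled = false;

    // Runs on the timeout timer thread.
    void timeoutCallback() {
        synchronized (lockTimeout) {       // blocks while startQueryIterator() holds the lock
            cancelled = true;              // ... so the abort arrives only after the eager work
        }
    }

    // Runs on the request thread.
    void startQueryIterator() {
        synchronized (lockTimeout) {
            if (cancelled)
                throw new RuntimeException("cancelled before start");
            buildHashTableEagerly();       // long-running work inside the locked section
        }
    }

    private void buildHashTableEagerly() { /* reads the whole build side of the join */ }
}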

afs (Member) commented Mar 5, 2025

A sequence path should be expanded to triple patterns.

{ ?x rdfs:label / rdfs:label ?z }

{
    ?x rdfs:label ??P1 .
    ??P1 rdfs:label ?z .
}

Looking at the log, the timeout isn't getting forgotten - it is delayed (presumably by being busy in some tight loop).

The execution plan for [20] does not have a hash join in it. It uses conditional and sequence operators, but I think the OPTIONALs can fan out massively and should be hash left joins.

Aklakan (Contributor) commented Mar 5, 2025

Removing the lock around QueryExecDataset.startQueryIterator() makes the query with UNION { BIND() } work (the eager hash join then gets cancelled because the signal is no longer blocked).

For the sequence path query, query execution hangs trying to produce a result, with the stack trace below. It does not appear to be affected by the locking issue: the context has the cancel signal correctly set to true, but the signal is not considered.

(stack trace screenshot not reproduced here)

So these appear to be two separate issues.

Aklakan (Contributor) commented Mar 5, 2025

The problem I see with eager iterators in the snippet below is that if getPlan().iterator() fails, e.g. due to a concurrent abort, then there is no iterator that can be closed -> likely a resource leak, or at least warnings from QueryIteratorCheck about non-closed iterators in the execCxt. (Or would closing the plan prevent that? I don't yet know what resources that would close.)

if ( !isTimeoutSet(timeout1) && !isTimeoutSet(timeout2) ) {
    // Case -1,-1
    queryIterator = getPlan().iterator();
    return;

IMO the clean solution would be to change AbstractIterHashJoin to be lazy, i.e. move buildHashTable to the hasNext method. Then the lock would only be held briefly for iterator construction, and the timeout could kick in normally.
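
Something along these lines, as a sketch of the suggested direction only (not the actual AbstractIterHashJoin code): construction stays cheap, and the expensive buildHashTable step is deferred to the first hasNext() call, which runs outside the startup lock and can therefore be aborted by the timeout.

abstract class LazyHashJoinIterator {
    private boolean initialized = false;

    // Constructor work stays cheap, so holding the startup lock during
    // iterator construction is harmless.
    final boolean hasNext() {
        if (!initialized) {
            buildHashTable();              // now runs outside the startup lock,
            initialized = true;            // so the cancel signal can interrupt it
        }
        return hasNextAfterBuild();
    }

    protected abstract void buildHashTable();
    protected abstract boolean hasNextAfterBuild();
}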

Aklakan added a commit to Aklakan/jena that referenced this issue Mar 5, 2025
Aklakan added several commits to Aklakan/jena that referenced this issue Mar 6, 2025