null_pointer_exception - Cannot invoke "String.equals(Object)" because the return value of "org.opensearch.wlm.QueryGroupTask.getQueryGroupId()" is null #17518

Open
etgraylog opened this issue Mar 5, 2025 · 9 comments · May be fixed by #17576
Labels: bug, Search

Comments

@etgraylog

etgraylog commented Mar 5, 2025

Describe the bug

This is a continuation of the problem reported in issue #16874.

In version 2.19.1 the Warning message can be generated by _search/scroll API queries:

[2025-03-05T06:47:19,875][INFO ][o.o.n.Node               ] [10.0.1.242] version[2.19.1], pid[83053], build[tar/2e4741fb45d1b150aaeeadf66d41445b23ff5982/2025-02-27T01:16:47.726162386Z], OS[Linux/6.8.0-1021-aws/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.6/21.0.6+7-LTS]
...
[2025-03-05T06:48:42,518][WARN ][o.o.w.QueryGroupTask     ] [10.0.1.242] QueryGroup _id can't be null, It should be set before accessing it. This is abnormal behaviour

Whatever causes the warning messages also appears to affect the Nodes Stats API.

BEFORE the Warning messages are triggered:

user@es-master-data-node-614:/usr/share/opensearch/logs# grep -c 'This is abnormal behaviour' os-cluster-1.log
0
user@es-master-data-node-614:/usr/share/opensearch/logs#
user@es-master-data-node-614:/usr/share/opensearch/logs# curl -s -X GET "http://10.0.1.242:9200/_nodes/stats?pretty" -u admin:******** -k | tail -n 20
            "rejection_count" : { }
          }
        }
      },
      "caches" : {
        "request_cache" : {
          "size_in_bytes" : 0,
          "evictions" : 0,
          "hit_count" : 0,
          "miss_count" : 0,
          "item_count" : 0,
          "store_name" : "noop_store"
        }
      },
      "remote_store" : {
        "last_successful_fetch_of_pinned_timestamps" : -1
      }
    }
  }
}

AFTER the Warning messages are triggered:

user@es-master-data-node-614:/usr/share/opensearch/logs# grep -c 'This is abnormal behaviour' os-cluster-1.log
537
user@es-master-data-node-614:/usr/share/opensearch/logs# curl -s -X GET "http://10.0.1.242:9200/_nodes/stats?pretty" -u admin:******** -k
{
  "_nodes" : {
    "total" : 1,
    "successful" : 0,
    "failed" : 1,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [Yo6mfyfRQvyNyMkb3iLuMg]",
        "node_id" : "Yo6mfyfRQvyNyMkb3iLuMg",
        "caused_by" : {
          "type" : "null_pointer_exception",
          "reason" : "Cannot invoke \"String.equals(Object)\" because the return value of \"org.opensearch.wlm.QueryGroupTask.getQueryGroupId()\" is null"
        }
      }
    ]
  },
  "cluster_name" : "os-cluster-1",
  "nodes" : { }
}
user@es-master-data-node-614:/usr/share/opensearch/logs#

The Nodes Stats API then continues to generate NPEs with the same reason in response to HTTP GETs for _nodes/stats until the OpenSearch node is restarted.
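
For scripted checks, the failure can be read directly from the response body rather than by eye. The following is a minimal Python sketch, assuming the JSON layout shown in the AFTER output above; the helper name `node_failures` is hypothetical and not part of any OpenSearch client.

```python
import json


def node_failures(stats_body: str) -> list:
    """Return (node_id, cause_type) pairs for failed nodes in a _nodes/stats response.

    Assumes the layout shown in this issue: failures are listed under
    "_nodes" -> "failures", each with an optional "caused_by" object.
    """
    header = json.loads(stats_body).get("_nodes", {})
    failures = []
    for failure in header.get("failures", []):
        cause = failure.get("caused_by", {})
        # Prefer the underlying cause (e.g. null_pointer_exception) when present,
        # otherwise fall back to the outer failure type.
        cause_type = cause.get("type", failure.get("type", "unknown"))
        failures.append((failure.get("node_id", "unknown"), cause_type))
    return failures
```

An empty list means every node answered; a non-empty list carries the node ID and the exception type, which is enough to alert on `null_pointer_exception` specifically.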

Related component

Search

To Reproduce

The steps to reproduce are essentially the same as documented in #16874. This issue adds a step showing that whatever triggers the reported warning message also appears to affect the Nodes Stats API:

  1. Query the Nodes Stats API to confirm it is accessible and responding as expected.
  2. Confirm there are zero instances of the warning message in the OpenSearch node log file(s).
  3. Execute repeated scroll search queries until the warning message appears in the OpenSearch node log file(s).
  4. Query the Nodes Stats API again and observe the NPE.
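
The four steps above can be sketched as a small Python harness. This is only an illustration under stated assumptions: the three callables passed to `reproduce` are hypothetical placeholders for an HTTP GET on `_nodes/stats`, a warning count over the node log, and one batch of scroll queries; `count_warnings` mirrors the `grep -c` command used earlier.

```python
from typing import Callable

WARNING_MARKER = "This is abnormal behaviour"


def count_warnings(log_path: str) -> int:
    """Count log lines containing the marker (equivalent to `grep -c`)."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return sum(1 for line in log if WARNING_MARKER in line)


def reproduce(stats_ok: Callable[[], bool],
              warning_count: Callable[[], int],
              run_scroll_batch: Callable[[], None],
              max_batches: int = 100) -> dict:
    """Walk through steps 1-4 and report what was observed.

    All three callables are placeholders supplied by the caller; nothing
    here talks to OpenSearch directly.
    """
    result = {
        "stats_ok_before": stats_ok(),       # step 1
        "warnings_before": warning_count(),  # step 2
        "batches_run": 0,
    }
    # Step 3: keep scrolling until a new warning shows up or we give up.
    while result["batches_run"] < max_batches:
        run_scroll_batch()
        result["batches_run"] += 1
        if warning_count() > result["warnings_before"]:
            break
    result["warnings_after"] = warning_count()
    result["stats_ok_after"] = stats_ok()    # step 4
    return result
```

A run that reproduces the bug would report `stats_ok_before` as true, `warnings_after` greater than `warnings_before`, and `stats_ok_after` as false.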

Expected behavior

The expected behavior has 2 parts:

  • No WARN QueryGroupTask message is logged.
  • No NPE generated in response to an HTTP GET for _nodes/stats.

Additional Details

Plugins
Please list all plugins currently enabled.

  • opensearch-alerting
  • opensearch-anomaly-detection
  • opensearch-asynchronous-search
  • opensearch-cross-cluster-replication
  • opensearch-custom-codecs
  • opensearch-flow-framework
  • opensearch-geospatial
  • opensearch-index-management
  • opensearch-job-scheduler
  • opensearch-knn
  • opensearch-ltr
  • opensearch-ml
  • opensearch-neural-search
  • opensearch-notifications
  • opensearch-notifications-core
  • opensearch-observability
  • opensearch-performance-analyzer
  • opensearch-reports-scheduler
  • opensearch-security
  • opensearch-security-analytics
  • opensearch-skills
  • opensearch-sql
  • opensearch-system-templates
  • query-insights
  • repository-s3

Host/Environment (please complete the following information):

  • OS: Debian
  • Version: 12 (Bookworm)


@etgraylog etgraylog added bug Something isn't working untriaged labels Mar 5, 2025
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Mar 5, 2025
@etgraylog etgraylog changed the title null_pointer_exception - Cannot invoke \"String.equals(Object)\" because the return value of \"org.opensearch.wlm.QueryGroupTask.getQueryGroupId()\" is null null_pointer_exception - Cannot invoke "String.equals(Object)" because the return value of "org.opensearch.wlm.QueryGroupTask.getQueryGroupId()" is null Mar 5, 2025
@sandeshkr419
Contributor

@ansjcy @deshsidd Could you please take a look at this?

@deshsidd
Contributor

deshsidd commented Mar 5, 2025

cc @kaushalmahi12 This looks more relevant to WLM (workload management), judging by the class in the error: org.opensearch.wlm.QueryGroupTask.

@kaushalmahi12
Contributor

kaushalmahi12 commented Mar 5, 2025

❯ curl -s "localhost:9200/_nodes/stats?pretty"                          
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "runTask",
  "nodes" : {
    ....
      "caches" : {
        "request_cache" : {
          "size_in_bytes" : 0,
          "evictions" : 0,
          "hit_count" : 0,
          "miss_count" : 0,
          "item_count" : 0,
          "store_name" : "noop_store"
        }
      },
      "remote_store" : {
        "last_successful_fetch_of_pinned_timestamps" : -1
      }
    }
  }
}

@etgraylog I followed the repro steps and the issue didn't occur. Can you share the stack trace from the logs?

@kaushalmahi12
Contributor

Steps I followed

  1. Checked out the main branch
  2. Started OpenSearch from the local code using ./gradlew run
  3. Loaded some sample data into OpenSearch
  4. Created the scroll_id
  5. Ran oha -z 1m "http://localhost:9200/_search/scroll/${scroll_id}?scroll=30s"
  6. Ran curl -s "localhost:9200/_nodes/stats?pretty"
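
Steps 4 and 5 can be sketched in Python as follows. This is a sketch under assumptions, not the exact commands used: the index name passed to `open_scroll`, the `match_all` query, and the page size are illustrative, and the code assumes an unsecured local node like the one started by `./gradlew run` (no authentication or TLS).

```python
import json
import urllib.parse
import urllib.request


def scroll_url(base_url: str, scroll_id: str, keep_alive: str = "30s") -> str:
    """Build the GET URL used in step 5 for a given scroll_id."""
    quoted_id = urllib.parse.quote(scroll_id, safe="")
    return f"{base_url}/_search/scroll/{quoted_id}?scroll={keep_alive}"


def open_scroll(base_url: str, index: str, keep_alive: str = "30s", size: int = 100) -> str:
    """Open a scroll context (step 4) and return its scroll_id.

    The index name, query, and size are assumptions for illustration.
    """
    body = json.dumps({"size": size, "query": {"match_all": {}}}).encode()
    request = urllib.request.Request(
        f"{base_url}/{index}/_search?scroll={keep_alive}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["_scroll_id"]
```

The string returned by `scroll_url("http://localhost:9200", open_scroll("http://localhost:9200", "my-index"))` can then be handed to a load generator such as `oha`; `my-index` is a placeholder.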

@etgraylog
Author

etgraylog commented Mar 6, 2025

A stacktrace is not generated in the log of the OpenSearch node when this occurs @kaushalmahi12.

To reproduce this, I use a single-shard index (zero replicas) consisting of 40,000,878 documents (35.2GiB), which a sliced scroll search targets. Reproducing it may require more parallelism, for example oha -p 10 or more.
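
For readers unfamiliar with sliced scrolls, each slice is opened with its own search request whose body carries a `slice` object with an `id` and a `max`. The Python sketch below builds those bodies; the slice count, page size, and `match_all` query are assumptions, since the exact query used here is not stated.

```python
def sliced_scroll_bodies(num_slices: int, size: int = 1000) -> list:
    """Build one search body per slice for a sliced scroll.

    Each body would be POSTed to <index>/_search?scroll=<keep_alive>.
    The query and size are illustrative assumptions.
    """
    if num_slices < 2:
        # The slice API expects "max" to be greater than 1.
        raise ValueError("num_slices must be at least 2")
    return [
        {
            "size": size,
            "slice": {"id": slice_id, "max": num_slices},
            "query": {"match_all": {}},
        }
        for slice_id in range(num_slices)
    ]
```

Each slice returns its own scroll ID, so ten slices give ten independent scroll contexts that can be advanced in parallel.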

Here is a Gist with two files. The first is a snippet of the OpenSearch node's log file covering a period when the _nodes/stats API was returning NPEs in response to cURL HTTP GETs. The second contains the output of those HTTP GETs to the _nodes/stats API, plus other output related to the health of the OpenSearch node and its shards.

Note that the date command was appended to the cURL commands to indicate when they were executed in relation to the contents of the OpenSearch node's log-file.
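
To line the two files up programmatically, the log timestamps can be parsed and compared with the time of each cURL call. This is a minimal Python sketch, assuming the log line format shown in this issue and that both clocks use the same time zone; the function names are hypothetical.

```python
import re
from datetime import datetime

WARNING_MARKER = "This is abnormal behaviour"
# Matches the leading "[2025-03-05T06:48:42,518]" of an OpenSearch log line.
LOG_TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")


def log_time(line: str):
    """Return the timestamp of a log line, or None if it has none."""
    match = LOG_TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")


def warnings_up_to(lines, moment: datetime) -> int:
    """Count warning lines logged at or before the given moment."""
    count = 0
    for line in lines:
        stamp = log_time(line)
        if stamp is not None and stamp <= moment and WARNING_MARKER in line:
            count += 1
    return count
```

Calling `warnings_up_to` with the time of the first failing `_nodes/stats` request shows how many warnings had already been logged when the NPE first appeared.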

@kaushalmahi12
Contributor

kaushalmahi12 commented Mar 10, 2025

@etgraylog Let me try to reproduce this further. I still don't see how the condition behind an unrelated log statement could make _nodes/stats fail, because _nodes/stats doesn't execute the QueryGroupTask code path.

@kaushalmahi12
Contributor

kaushalmahi12 commented Mar 11, 2025

@etgraylog I found the source of the warning log, but I am still not able to reproduce the _nodes/stats issue. The warning appears because the 2.x PR doesn't contain the scroll API changes, so we never set the queryGroupId on the Task, which fails at this point.

In short, none of the scroll API tasks have this queryGroupId set, hence the log message, but I strongly believe that this should not cause node stats to fail.

@kaushalmahi12
Contributor

kaushalmahi12 commented Mar 11, 2025

Found the repro steps for _nodes/stats as well. They are as follows:

  • Spin up OpenSearch 2.19.1, using either Docker or the Linux tarball depending on the operating system
  • Ingest some test data
  • Run the scroll APIs
  • Concurrently fire the _nodes/stats request.
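
The key difference from the earlier attempt is concurrency: `_nodes/stats` is requested while scroll requests are still in flight. A minimal Python sketch of that pattern follows; `scroll_once` and `stats_once` are hypothetical callables the caller supplies (one scroll request and one `_nodes/stats` request), and the worker and round counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrently(scroll_once: Callable[[], object],
                     stats_once: Callable[[], object],
                     scroll_workers: int = 10,
                     rounds: int = 50) -> list:
    """Poll _nodes/stats while several workers keep issuing scroll requests.

    Returns the values produced by stats_once, in call order.
    """
    def scroll_loop() -> None:
        for _ in range(rounds):
            scroll_once()

    def stats_loop() -> list:
        return [stats_once() for _ in range(rounds)]

    # One extra thread so the stats poller runs alongside the scroll workers.
    with ThreadPoolExecutor(max_workers=scroll_workers + 1) as pool:
        scroll_futures = [pool.submit(scroll_loop) for _ in range(scroll_workers)]
        stats_future = pool.submit(stats_loop)
        for future in scroll_futures:
            future.result()  # re-raise any error from a scroll worker
        return stats_future.result()
```

Inspecting the returned list for responses that report failed nodes indicates whether the NPE was triggered during the run.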

It has been addressed in the mentioned PR. We will also make this change in 2.x so that it doesn't affect the other 2.x branches.

@etgraylog
Author

Thank you @kaushalmahi12, this is great news!
