Synthetics - Bad request: unable to decode checkin request #8577

Open
@sirbudd

Description

Version: 8.18.2
Operating System: Ubuntu 24.04.2 LTS
Discuss Forum URL:
Steps to Reproduce:

We have 2 environments, Test and Production, with almost the same setup.
In each environment we have:

  • 3 elastic agents that act as fleet servers
  • 3 elastic agent complete containers that are used for synthetics, both journeys and lightweight tests

In Test we have:

  • a total of 1366 monitors
    • 187 synthetics
    • 1159 lightweight

In Production we have:

  • a total of 1595 monitors
    • 185 synthetics
    • 1410 lightweight

The issue is that the elastic agent complete containers seem to lose their connection to the fleet server containers after running as healthy for a couple of minutes:

elastic-agent status
┌─ fleet
│  └─ status: (FAILED) status code: 400, fleet-server returned an error: BadRequest, message: Bad request: unable to decode checkin request
└─ elastic-agent
   └─ status: (HEALTHY) Running
cat Synthetics.stderr.359 | grep fleet
{"log.level":"warn","@timestamp":"2025-06-18T08:50:50.154Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":196},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"status code: 400, fleet-server returned an error: BadRequest, message: Bad request: unable to decode checkin request"},"request_duration_ns":313344452,"failed_checkins":1,"retry_after_ns":61474208694,"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-06-18T08:51:52.285Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":196},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"status code: 400, fleet-server returned an error: BadRequest, message: Bad request: unable to decode checkin request"},"request_duration_ns":323635649,"failed_checkins":2,"retry_after_ns":193698292377,"ecs.version":"1.6.0"}
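
To make the checkin warnings above easier to compare across agents, here is a minimal sketch that parses the JSON log lines and pulls out the failure count and retry delay. It relies only on the field names visible in the log lines above (`@timestamp`, `failed_checkins`, `retry_after_ns`, `error.message`); the function name `summarize_checkin_failures` is hypothetical and not part of elastic-agent.

```python
import json


def summarize_checkin_failures(lines):
    """Collect fleet checkin warnings from elastic-agent JSON log lines.

    Hypothetical helper: it only assumes the fields seen in the log
    excerpt above and skips anything that is not a checkin warning.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Not a JSON log line (for example, plain stderr output).
            continue
        if not isinstance(entry, dict) or "failed_checkins" not in entry:
            continue
        error = entry.get("error")
        rows.append({
            "timestamp": entry.get("@timestamp"),
            "failed_checkins": entry["failed_checkins"],
            # The log reports the backoff in nanoseconds; convert to seconds.
            "retry_after_s": round(entry.get("retry_after_ns", 0) / 1e9, 1),
            "error": error.get("message") if isinstance(error, dict) else None,
        })
    return rows
```

For example, passing the open file `Synthetics.stderr.359` to the function gives one row per failed checkin. For the two lines above, the retry delays come out to roughly 61.5 s and 193.7 s, which shows the backoff growing between attempts.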

This tends to happen only in Production. As a result, we are not able to push new monitors to those elastic agent complete agents, because they can't communicate with the fleet server.

The workaround is to run elastic-agent restart (inside the elastic agent complete container), but after ~5 minutes the elastic agent complete container goes into this unhealthy state again.

Please let me know what extra logs are needed to help debug this issue.
