Description
Version: 8.18.2
Operating System: Ubuntu 24.04.2 LTS
Discuss Forum URL:
Steps to Reproduce:
We have 2 environments, Test & Production, with almost the same setup.
In each environment we have:
- 3 elastic agents that act as the fleet server
- 3 elastic agent complete that are used for synthetics - both journeys and lightweight tests
In Test we have:
- a total of 1366 monitors
- 187 synthetics
- 1159 lightweight
In Production we have:
- a total of 1595 monitors
- 185 synthetics
- 1410 lightweight
The issue we are facing is that the elastic agent complete containers appear to lose their connection to the fleet server containers after running as healthy for a couple of minutes:
elastic-agent status
┌─ fleet
│ └─ status: (FAILED) status code: 400, fleet-server returned an error: BadRequest, message: Bad request: unable to decode checkin request
└─ elastic-agent
  └─ status: (HEALTHY) Running
cat Synthetics.stderr.359 | grep fleet
{"log.level":"warn","@timestamp":"2025-06-18T08:50:50.154Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":196},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"status code: 400, fleet-server returned an error: BadRequest, message: Bad request: unable to decode checkin request"},"request_duration_ns":313344452,"failed_checkins":1,"retry_after_ns":61474208694,"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-06-18T08:51:52.285Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":196},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"status code: 400, fleet-server returned an error: BadRequest, message: Bad request: unable to decode checkin request"},"request_duration_ns":323635649,"failed_checkins":2,"retry_after_ns":193698292377,"ecs.version":"1.6.0"}
This tends to happen only in Production. As a result, we are not able to push new monitors to the affected elastic agent complete agents, since they can't communicate with the fleet server.
The workaround is to run elastic-agent restart (inside the elastic-agent complete container), but after ~5 minutes the elastic-agent complete goes into this unhealthy state again.
Please let me know what extra logs are needed to help debug this issue.