
Enable/Disable builder proposals via HTTP reloads whole validator definitions #3795

Closed
ricardolyn opened this issue Dec 13, 2022 · 10 comments
Labels
optimization Something to make Lighthouse run more efficiently. val-client Relates to the validator client binary

Comments

@ricardolyn

ricardolyn commented Dec 13, 2022

Description

We have a setup of LH running the BN and VC separately. The VC loads thousands of validator keys from the validator_definitions file. We use web3signer externally for signing purposes.

We are working on activating builder proposals for some of the validators (not all, so we are removing the --builder-proposals argument). We are using the API as documented at [this link](https://lighthouse-book.sigmaprime.io/builders.html#enabledisable-builder-proposals-via-http).

When experimenting with this API by calling the PATCH for just 1 validator, we noticed that it loads ALL validator keys again from the definitions file, creating high CPU usage.

The request is something like PATCH lighthouse/validators/{pubkey} with a body of { "builder_proposals": true }. When making the request, the LH VC logs show the following:

lighthouse-validator Dec 13 10:30:26.246 INFO Enabled validator                       voting_pubkey: <pubkey1>, signing_method: remote_signer
lighthouse-validator Dec 13 10:30:26.246 INFO Enabled validator                       voting_pubkey: <pubkey2>, signing_method: remote_signer
lighthouse-validator Dec 13 10:30:26.246 INFO Enabled validator                       voting_pubkey: <pubkey3>, signing_method: remote_signer

So it loads all pubkeys again. In the example we ran, it loaded all 500 keys even though we only patched 1. If we make 500 requests to update the state of all keys, each request reloads 500 keys, so the total work grows quadratically with the number of validators (500 × 500 initializations).

In the tests we did, CPU usage and storage I/O were noticeably higher compared to not making those requests.
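
For reference, this is roughly how we issue the request per validator. Below is a minimal sketch using the reqwest (with the json feature), tokio, and serde_json crates; the port 5062 and the bearer-token auth are assumptions based on the default VC HTTP API settings, not necessarily our exact deployment:

```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Placeholders: substitute the VC API token and the validator's pubkey.
    let api_token = "<vc-api-token>";
    let pubkey = "<pubkey1>";

    // PATCH lighthouse/validators/{pubkey} with { "builder_proposals": true }.
    let resp = Client::new()
        .patch(format!(
            "http://localhost:5062/lighthouse/validators/{pubkey}"
        ))
        .bearer_auth(api_token)
        .json(&json!({ "builder_proposals": true }))
        .send()
        .await?;

    println!("status: {}", resp.status());
    Ok(())
}
```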

Version

version: Lighthouse/v3.3.0-bf533c8

Present Behaviour

When patching the builder_proposals value for a validator key, the LH VC service reloads all validators again.

Expected Behaviour

It should update only that validator and not reload the data for all configured validators.

@jmcruz1983

CC: @paulhauner

@michaelsproul
Member

I think this is an issue affecting web3signer validators specifically. The PATCH method triggers a call to update_validators which is meant to ensure that the validators in-memory are in sync with the validators on disk. For local validators there's a check which skips the expensive parts of key reloading if the validator already exists in memory, here:

```rust
if self.validators.contains_key(&pubkey_bytes) {
    continue;
}
```

There's no similar check in the web3signer case, so we end up re-initializing every validator for every request here:

```rust
match InitializedValidator::from_definition(
    def.clone(),
    &mut key_cache,
    &mut key_stores,
    &mut self.web3_signer_client_map,
)
.await
```

I think a simple fix might be just to add a similar duplicate check to the web3signer branch.
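
A minimal sketch of what that could look like, reusing the names from the snippets above (the exact placement inside update_validators and the surrounding match arms are my assumptions, not the final implementation):

```rust
// Hypothetical sketch: mirror the local-keystore duplicate check before the
// expensive web3signer re-initialization.
if self.validators.contains_key(&pubkey_bytes) {
    // Already initialized in memory; nothing to do for this definition.
    continue;
}

match InitializedValidator::from_definition(
    def.clone(),
    &mut key_cache,
    &mut key_stores,
    &mut self.web3_signer_client_map,
)
.await
{
    // ... existing success/error handling ...
}
```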

@michaelsproul michaelsproul added val-client Relates to the validator client binary optimization Something to make Lighthouse run more efficiently. labels Dec 13, 2022
@ricardolyn
Author

@michaelsproul thanks

```rust
if self.validators.contains_key(&pubkey_bytes) {
    continue;
}
```

Which test validates that `if` check above? I wanted to see whether something similar could be added as a test for the web3signer flow.

@ricardolyn
Author

I tried that change and tested it manually, but it didn't work; it still loaded all the validators again.

@realbigsean
Member

@ricardolyn I tried Michael's suggestion in this PR #3801

It does look like it's working for me

@ricardolyn
Author

Let me run a test on my side, as I didn't change the code in the same way you did.

bors bot pushed a commit that referenced this issue Jan 9, 2023
bors bot pushed a commit that referenced this issue Jan 9, 2023
@michaelsproul
Member

Closed by #3801 (v3.4.0) 🎉 Please comment if you find it isn't fixed

@ricardolyn
Author

ricardolyn commented Feb 8, 2023

@michaelsproul The team has been testing the v3.4.0 version of Lighthouse and encountered an issue with the PATCH lighthouse/validators/{pubkey} request, which is being performed every 2 minutes. Despite the fact that Lighthouse is not being killed every 2 minutes, it is still being restarted around every 2 hours by Kubernetes due to CPU throttling when updating 6 validators concurrently.

This could be a major problem in production, where there could be more than a thousand public keys in the same service, potentially leading to even more frequent restarts.

Any ideas on what may be causing this excessive CPU usage during the PATCH request?

@michaelsproul
Member

> The team has been testing the v3.4.0 version of Lighthouse and encountered an issue with the PATCH lighthouse/validators/{pubkey} request, which is being performed every 2 minutes.

What's the PATCH body that's being sent? Is it toggling { enabled: true/false }?

> Despite the fact that Lighthouse is not being killed every 2 minutes, it is still being restarted around every 2 hours by Kubernetes due to CPU throttling when updating 6 validators concurrently.

How high does the CPU spike and for how long? As an aside, I'm always confused by the Kubernetes policy to kill services that use too much CPU, as that just seems to slow everything down, causing more work to restart everything and then just re-creating the bug as soon as the process has restarted. Nonetheless, it would be good to fix the underlying issue. When you say 6 validators, do you mean there are 6 validators in the VC total, or just that there are 6 requests made at the same time and the number of validators is another number? If so, how many validators are registered to the VC total?

@ricardolyn
Author

ricardolyn commented Feb 8, 2023

> What's the PATCH body that's being sent? Is it toggling { enabled: true/false }?

The body of the PATCH request is { builder_proposals: <true|false> }, as documented by LH.

> How high does the CPU spike and for how long?

Let me correct my sentence in the previous message: the service is being killed by Kubernetes not due to CPU throttling but because the liveness probe fails.

This is the probe config we use: Liveness: http-get http://:metrics/metrics delay=0s timeout=5s period=10s #success=1 #failure=1. We use the metrics endpoint to check for liveness (as we don't have the HTTP API endpoint enabled by default). We believe that the high CPU spike occurs during the PATCH, so the validator client can't serve the requests to the metrics endpoint, leading to the restart by k8s.

Example of the liveness probe failure:
Warning Unhealthy 97s (x3 over 5m37s) Liveness probe failed: Get "http://<X.X.X.X>:8008/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

We could also use the /lighthouse/health endpoint for this, but it requires the Authorization header, which seems strange for a health check.

FYI: this chart shows the relation between container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total (a dashboard provided by Kubernetes). In the following image, you can clearly see the before and after of enabling those requests:
[Screenshot 2023-02-08 at 12:54:47: CPU throttling before and after enabling the requests]

> Nonetheless, it would be good to fix the underlying issue. When you say 6 validators, do you mean there are 6 validators in the VC total, or just that there are 6 requests made at the same time and the number of validators is another number?

There are 500 keys loaded through the definitions file, but only 6 are active. 6 parallel PATCH requests are made every 2 minutes to update builder_proposals.

Example log: INFO Some validators active slot: 4946100, epoch: 154565, total_validators: 500, active_validators: 7, current_epoch_proposers: 0, service: notifier
