
Enable/Disable builder proposals via HTTP reloads whole validator definitions #3795

Closed
ricardolyn opened this issue Dec 13, 2022 · 10 comments
Labels
optimization Something to make Lighthouse run more efficiently. val-client Relates to the validator client binary

Comments

@ricardolyn

ricardolyn commented Dec 13, 2022

Description

We have a setup of LH running the BN and VC separately. The VC loads thousands of validator keys from the validator_definitions file. We use web3signer externally for signing purposes.

We are working on activating builder proposals for some of the validators (not all, so we are removing the --builder-proposals argument). We are using the API as documented at [this link](https://lighthouse-book.sigmaprime.io/builders.html#enabledisable-builder-proposals-via-http).

When experimenting with this API by calling the PATCH for just 1 validator, we noticed that it loads ALL validator keys again from the definitions file, creating high CPU usage.

The request is something like PATCH lighthouse/validators/{pubkey} with a body of { "builder_proposals": true }. When making the request, the LH VC logs show the following:

lighthouse-validator Dec 13 10:30:26.246 INFO Enabled validator                       voting_pubkey: <pubkey1>, signing_method: remote_signer
lighthouse-validator Dec 13 10:30:26.246 INFO Enabled validator                       voting_pubkey: <pubkey2>, signing_method: remote_signer
lighthouse-validator Dec 13 10:30:26.246 INFO Enabled validator                       voting_pubkey: <pubkey3>, signing_method: remote_signer

So it loads all pubkeys again. In the example we ran, it loaded all 500 keys even though we only patched 1. If we make 500 requests to update the state of all keys, each request reloads 500 keys, so the total work grows quadratically with the number of validators (500 × 500 initializations).

In the tests we did, CPU usage and storage I/O were noticeably higher compared to not making those requests.
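
For reference, this is roughly how we issue the request per validator. Below is a minimal sketch using the reqwest (with the json feature), tokio, and serde_json crates; the port 5062 and the bearer-token auth are assumptions based on the default VC HTTP API settings, not necessarily our exact deployment:

```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Placeholders: substitute the VC API token and the validator's pubkey.
    let api_token = "<vc-api-token>";
    let pubkey = "<pubkey1>";

    // PATCH lighthouse/validators/{pubkey} with { "builder_proposals": true }.
    let resp = Client::new()
        .patch(format!(
            "http://localhost:5062/lighthouse/validators/{pubkey}"
        ))
        .bearer_auth(api_token)
        .json(&json!({ "builder_proposals": true }))
        .send()
        .await?;

    println!("status: {}", resp.status());
    Ok(())
}
```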

Version

version: Lighthouse/v3.3.0-bf533c8

Present Behaviour

When patching the builder_proposals value for a validator key, the LH VC service reloads all validators again.

Expected Behaviour

It should update only that validator and not reload the data for all configured validators.

@jmcruz1983

CC: @paulhauner

@michaelsproul
Member

I think this is an issue affecting web3signer validators specifically. The PATCH method triggers a call to update_validators which is meant to ensure that the validators in-memory are in sync with the validators on disk. For local validators there's a check which skips the expensive parts of key reloading if the validator already exists in memory, here:

```rust
if self.validators.contains_key(&pubkey_bytes) {
    continue;
}
```

There's no similar check in the web3signer case, so we end up re-initializing every validator for every request here:

```rust
match InitializedValidator::from_definition(
    def.clone(),
    &mut key_cache,
    &mut key_stores,
    &mut self.web3_signer_client_map,
)
.await
```

I think a simple fix might be just to add a similar duplicate check to the web3signer branch.
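
A minimal sketch of what that could look like, reusing the names from the snippets above (the exact placement inside update_validators and the surrounding match arms are my assumptions, not the final implementation):

```rust
// Hypothetical sketch: mirror the local-keystore duplicate check before the
// expensive web3signer re-initialization.
if self.validators.contains_key(&pubkey_bytes) {
    // Already initialized in memory; nothing to do for this definition.
    continue;
}

match InitializedValidator::from_definition(
    def.clone(),
    &mut key_cache,
    &mut key_stores,
    &mut self.web3_signer_client_map,
)
.await
{
    // ... existing success/error handling ...
}
```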

@michaelsproul michaelsproul added val-client Relates to the validator client binary optimization Something to make Lighthouse run more efficiently. labels Dec 13, 2022
@ricardolyn
Author

@michaelsproul thanks

```rust
if self.validators.contains_key(&pubkey_bytes) {
    continue;
}
```

Which test validates that `if` check above? I wanted to see whether something similar could be added as a test for the web3signer flow.

@ricardolyn
Author

I tried that change and tested it manually, but it didn't work; it still loaded all the validators again.

@realbigsean
Member

@ricardolyn I tried Michael's suggestion in this PR #3801

It does look like it's working for me

@ricardolyn
Author

Let me run a test on my side, as I didn't change the code in the same way you did.

bors bot pushed a commit that referenced this issue Jan 9, 2023
bors bot pushed a commit that referenced this issue Jan 9, 2023
@michaelsproul
Member

Closed by #3801 (v3.4.0) 🎉 Please comment if you find it isn't fixed

@ricardolyn
Author

ricardolyn commented Feb 8, 2023

@michaelsproul The team has been testing the v3.4.0 version of Lighthouse and encountered an issue with the PATCH lighthouse/validators/{pubkey} request, which is being performed every 2 minutes. Despite the fact that Lighthouse is not being killed every 2 minutes, it is still being restarted around every 2 hours by Kubernetes due to CPU throttling when updating 6 validators concurrently.

This could be a major problem in production, where there could be more than a thousand public keys in the same service, potentially leading to even more frequent restarts.

Any ideas on what may be causing this excessive CPU usage during the PATCH request?

@michaelsproul
Member

> The team has been testing the v3.4.0 version of Lighthouse and encountered an issue with the PATCH lighthouse/validators/{pubkey} request, which is being performed every 2 minutes.

What's the PATCH body that's being sent? Is it toggling { enabled: true/false }?

> Despite the fact that Lighthouse is not being killed every 2 minutes, it is still being restarted around every 2 hours by Kubernetes due to CPU throttling when updating 6 validators concurrently.

How high does the CPU spike and for how long? As an aside, I'm always confused by the Kubernetes policy to kill services that use too much CPU, as that just seems to slow everything down, causing more work to restart everything and then just re-creating the bug as soon as the process has restarted. Nonetheless, it would be good to fix the underlying issue. When you say 6 validators, do you mean there are 6 validators in the VC total, or just that there are 6 requests made at the same time and the number of validators is another number? If so, how many validators are registered to the VC total?

@ricardolyn
Author

ricardolyn commented Feb 8, 2023

> What's the PATCH body that's being sent? Is it toggling { enabled: true/false }?

The body of the PATCH request is { builder_proposals: <true|false> }, as documented by LH.

> How high does the CPU spike and for how long?

Let me correct my sentence in the previous message: the service is being killed by Kubernetes not due to CPU throttling but because the liveness probe fails.

This is the probe config we use: Liveness: http-get http://:metrics/metrics delay=0s timeout=5s period=10s #success=1 #failure=1. We use the metrics endpoint to check for liveness (as we don't have the HTTP API endpoint enabled by default). We believe that the high CPU spike occurs during the PATCH, so the validator client can't serve the requests to the metrics endpoint, leading to the restart by k8s.

Example of the liveness probe failure:
Warning Unhealthy 97s (x3 over 5m37s) Liveness probe failed: Get "http://<X.X.X.X>:8008/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

We could also use the /lighthouse/health endpoint for this, but it requires the Authorization header, which seems strange for a health check.

FYI: this chart shows the relation between container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total (a dashboard provided by Kubernetes). In the following image, you can clearly see the before and after of enabling those requests:
[Screenshot 2023-02-08 at 12:54:47: CPU throttling before and after enabling the requests]

> Nonetheless, it would be good to fix the underlying issue. When you say 6 validators, do you mean there are 6 validators in the VC total, or just that there are 6 requests made at the same time and the number of validators is another number?

There are 500 keys loaded through the definitions file, but only 6 are active. 6 parallel PATCH requests are made every 2 minutes to update builder_proposals.

Example log: INFO Some validators active slot: 4946100, epoch: 154565, total_validators: 500, active_validators: 7, current_epoch_proposers: 0, service: notifier
