Autoscaler scales down to 1 worker. #930

Open
me-her opened this issue Jan 21, 2025 · 2 comments

Comments

@me-her

me-her commented Jan 21, 2025

Autoscaler scales down to 1 worker despite configuring the minimum to be 2 workers:

Testing Scenario:

# Ran a big computation
import distributed
import dask.array as da

client = await distributed.Client("<hosted-url>:8786", asynchronous=True, direct_to_workers=True)

array = da.random.random(size=(40960, 4096, 4096), chunks="256M").astype("float32")
mean = await client.compute(array.mean())

1st run: minimum 8 workers, maximum 16 workers.

2nd run: minimum 2 workers, maximum 12 workers.

Scaling up works as expected: it reaches 16 workers in the 1st run and 12 workers in the 2nd run.
While scaling down, the operator logs shown below indicate that it scales down to 2 (the configured minimum), but then I see only one worker remaining. This happened on both runs.
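
For reference, here is roughly how I check the remaining worker count after the computation finishes (a minimal sketch assuming the same async client as above; scheduler_info() is the distributed.Client call that lists the workers the scheduler currently knows about):

# Sketch: count the workers the scheduler currently reports
# (uses the scheduler info held by the client).
info = client.scheduler_info()
print(len(info["workers"]), "workers connected:", sorted(info["workers"]))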

Operator Logs

[screenshot]

Scheduler Logs

[screenshot]

Anything else we need to know?:

I see the workers scale down to 1 even when the configured minimum was 8. An earlier version of the dask-operator had a problem of staying at the maximum number of workers: once it scaled up to the max, it never scaled back down. With this version the scale-up works as expected, but the scale-down goes all the way down to 1.

Environment:

  • Dask version: 2024.12.1
  • Distributed: 2024.12.1
  • Dask-Operator: 2025.1.0
  • Python version: 3.10
  • Operating System: Linux
  • Install method (conda, pip, source): pip
  • Running this on GKE
@jacobtomlinson
Member

Would you be able to create a complete example, including creating the cluster and running some workloads, that reproduces the problem?

Ideally it should be the smallest amount of code you can write that reproduces the issue, so that I can copy/paste it to see the problem for myself.

@me-her
Author

me-her commented Jan 22, 2025

We create the cluster using the dask-kubernetes operator; it's a Helm chart deployment in GKE. I also configure the min and max workers in autoscaler.yml.

Once my scheduler and workers are up, I connect to the cluster and run the computation like below.

import distributed
import dask.array as da

client = await distributed.Client("<hosted-url>:8786", asynchronous=True, direct_to_workers=True)

array = da.random.random(size=(40960, 4096, 4096), chunks="256M").astype("float32")
mean = await client.compute(array.mean())

Since this is a large enough computation, the scaling triggers.
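
If it helps, here is roughly the same workflow as a single copy/paste script, using the dask_kubernetes Python API instead of the Helm chart (a sketch rather than my exact deployment; the cluster name and image are placeholders, and I am assuming KubeCluster.adapt() sets the same min/max bounds that autoscaler.yml does):

# Sketch of an equivalent programmatic setup (placeholder name/image).
from dask_kubernetes.operator import KubeCluster
import dask.array as da

cluster = KubeCluster(name="repro", image="ghcr.io/dask/dask:2024.12.1")
cluster.adapt(minimum=2, maximum=12)  # same bounds as the 2nd run
client = cluster.get_client()

array = da.random.random(size=(40960, 4096, 4096), chunks="256M").astype("float32")
mean = client.compute(array.mean()).result()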
