[BUG] DaskXGBClassifier raises ValueError when scored with sklearn.metrics.get_scorer('roc_auc') #11284

ScottMGustafson · 2025-02-25T19:49:57Z

Description

When using sklearn.metrics.get_scorer("roc_auc") on a DaskXGBClassifier, scikit-learn attempts to call predict_proba on what it believes is a regressor, triggering the error:

ValueError: DaskXGBClassifier should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.

Manually creating a scorer with make_scorer(roc_auc_score) works fine, as do other built-in scorers that only need predict (e.g. "f1"). The bug does not occur for the non-Dask XGBClassifier.
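One way to check which branch scikit-learn takes is to ask its public helpers how they classify the fitted object. The following is a minimal sketch that relies only on `sklearn.base.is_classifier` and `sklearn.base.is_regressor`; `describe_estimator` is a hypothetical helper name, and the two scikit-learn estimators are stand-ins for the XGBoost objects.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import LinearRegression, LogisticRegression


def describe_estimator(estimator):
    """Report how scikit-learn classifies an estimator and what it exposes.

    Hypothetical diagnostic helper, not part of scikit-learn or XGBoost.
    """
    return {
        "class": type(estimator).__name__,
        "is_classifier": is_classifier(estimator),
        "is_regressor": is_regressor(estimator),
        "has_predict_proba": hasattr(estimator, "predict_proba"),
    }


if __name__ == "__main__":
    # Stand-in estimators; replace with the fitted DaskXGBClassifier
    # and XGBClassifier to compare them.
    print(describe_estimator(LogisticRegression()))
    print(describe_estimator(LinearRegression()))
```

Calling `describe_estimator(obj)` on both the fitted DaskXGBClassifier and the XGBClassifier should show where the two differ. If `is_classifier` comes back `False` for the Dask estimator, that would be consistent with the error message, though I have not confirmed this is the only cause.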

Examples

DaskXGBClassifier:

import dask.dataframe as dd
import pandas as pd
import xgboost as xgb
from distributed import Client, LocalCluster
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer, make_scorer, roc_auc_score

c = Client(LocalCluster())
X, y = make_classification()
X = dd.from_array(X, columns=[f"var{i}" for i in range(X.shape[1])])
y = dd.from_array(y)

obj = xgb.dask.DaskXGBClassifier().fit(X, y)

auc_score = make_scorer(roc_auc_score)(obj, X, y)  # works fine
f1 = get_scorer("f1")(obj, X, y)  # works fine
print(get_scorer("roc_auc")(obj, X, y))  # raises ValueError

Raises the following exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 17
     15 auc_score = make_scorer(roc_auc_score)(obj, X, y)  # works fine
     16 f1 = get_scorer("f1")(obj, X, y)  # works fine
---> 17 print(get_scorer("roc_auc")(obj, X, y))  # raises ValueError

File /usr/local/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:288, in _BaseScorer.__call__(self, estimator, X, y_true, sample_weight, **kwargs)
    285 if sample_weight is not None:
    286     _kwargs["sample_weight"] = sample_weight
--> 288 return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)

File /usr/local/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:380, in _Scorer._score(self, method_caller, estimator, X, y_true, **kwargs)
    378 pos_label = None if is_regressor(estimator) else self._get_pos_label()
    379 response_method = _check_response_method(estimator, self._response_method)
--> 380 y_pred = method_caller(
    381     estimator,
    382     _get_response_method_name(response_method),
    383     X,
    384     pos_label=pos_label,
    385 )
    387 scoring_kwargs = {**self._kwargs, **kwargs}
    388 return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)

File /usr/local/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:90, in _cached_call(cache, estimator, response_method, *args, **kwargs)
     87 if cache is not None and response_method in cache:
     88     return cache[response_method]
---> 90 result, _ = _get_response_values(
     91     estimator, *args, response_method=response_method, **kwargs
     92 )
     94 if cache is not None:
     95     cache[response_method] = result

File /usr/local/lib/python3.12/site-packages/sklearn/utils/_response.py:235, in _get_response_values(estimator, X, response_method, pos_label, return_response_method_used)
    233 else:  # estimator is a regressor
    234     if response_method != "predict":
--> 235         raise ValueError(
    236             f"{estimator.__class__.__name__} should either be a classifier to be "
    237             f"used with response_method={response_method} or the response_method "
    238             "should be 'predict'. Got a regressor with response_method="
    239             f"{response_method} instead."
    240         )
    241     prediction_method = estimator.predict
    242     y_pred, pos_label = prediction_method(X), None

ValueError: DaskXGBClassifier should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.

The non-Dask XGBClassifier is not affected:

import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer, make_scorer, roc_auc_score

X, y = make_classification()
X = pd.DataFrame(X, columns=[f"var{i}" for i in range(X.shape[1])])
y = pd.Series(y)

obj = xgb.XGBClassifier().fit(X, y)

auc_score = make_scorer(roc_auc_score)(obj, X, y)  # works fine
f1 = get_scorer("f1")(obj, X, y)  # works fine
other_auc_score = get_scorer("roc_auc")(obj, X, y)  # works fine
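Note that `make_scorer(roc_auc_score)` scores the hard labels returned by `predict`, so its value is generally not the same AUC that the built-in `"roc_auc"` scorer computes from continuous scores. As a possible stopgap, a scorer can also be any callable with the signature `(estimator, X, y)`. Here is a minimal sketch, assuming a binary problem; `roc_auc_from_proba` is a hypothetical name, the example runs on a plain scikit-learn classifier, and I have not verified it against Dask collections.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def roc_auc_from_proba(estimator, X, y):
    """Scorer-style callable: AUC from the positive-class probability.

    Hypothetical workaround for binary classification only. It calls
    predict_proba directly, bypassing scikit-learn's estimator-type check.
    """
    # np.asarray materializes array-likes in memory; for lazy Dask
    # collections this would pull the data to the client.
    proba = np.asarray(estimator.predict_proba(X))
    y_true = np.asarray(y)
    # Column 1 holds the probability of the positive class.
    return roc_auc_score(y_true, proba[:, 1])


if __name__ == "__main__":
    X, y = make_classification(random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(roc_auc_from_proba(clf, X, y))
```

Because everything is converted with `np.asarray`, this only suits data that fits in memory on the client.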

Environment

dask==2024.8.0
xgboost==2.1.4
scikit-learn==1.6.1

system: python:3.12-slim-bullseye docker container on Mac M3

trivialfis · 2025-02-25T20:52:17Z

I'm using dask 2024.11.2, and it now fails with an internal error instead. I'll look into it once I set up a separate environment with an older dask version.

  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1055, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/sklearn/utils/_array_api.py", line 832, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/dask/array/core.py", line 1746, in __array__
    x = self.compute()
        ^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/dask/base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/distributed/client.py", line 2427, in _gather
    raise exception.with_traceback(traceback)
distributed.client.FutureCancelledError: ('getitem-3437663e5548237bd31527e3c352eca6', 0) cancelled for reason: unknown.

Same error with 2024.12.
