[BUG] DaskXGBClassifier raises ValueError when scored with sklearn.metrics.get_scorer('roc_auc') #11284

Open
ScottMGustafson opened this issue Feb 25, 2025 · 1 comment

Comments

@ScottMGustafson
Contributor

Description

When using sklearn.metrics.get_scorer("roc_auc") on a DaskXGBClassifier, scikit-learn attempts to call predict_proba on what it believes is a regressor, triggering the error:

ValueError: DaskXGBClassifier should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.

Manually creating a scorer with make_scorer(roc_auc_score) works fine, as do other built-in scorers that only need predict (e.g. "f1"). The bug does not occur for the non-Dask XGBClassifier.
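A possible interim workaround, sketched under assumptions rather than offered as a fix: scikit-learn accepts any callable with the signature (estimator, X, y) as a scorer, so a probability-based AUC can be computed by calling predict_proba directly instead of going through get_scorer("roc_auc"). This matters because make_scorer(roc_auc_score) with default arguments scores the output of predict, so it measures AUC on hard labels rather than on probabilities. The names proba_auc_scorer and _to_numpy are hypothetical; the sketch assumes a binary target and assumes that Dask collections expose .compute().

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def _to_numpy(values):
    """Materialize lazy collections (anything exposing .compute()) as NumPy arrays."""
    if hasattr(values, "compute"):
        values = values.compute()
    return np.asarray(values)


def proba_auc_scorer(estimator, X, y):
    """Score with ROC AUC on positive-class probabilities (binary targets only).

    Calls predict_proba directly, so scikit-learn's estimator-type check
    is never consulted.
    """
    proba = _to_numpy(estimator.predict_proba(X))
    # Column 1 is assumed to hold the positive-class probability.
    return roc_auc_score(_to_numpy(y), proba[:, 1])
```

With the reproduction in the Examples section, this would be called as proba_auc_scorer(obj, X, y). Because it pulls predictions and labels into local memory, it is only suitable when they fit on the client.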

Examples

DaskXGBClassifier:

import dask.dataframe as dd
import pandas as pd
import xgboost as xgb
from distributed import Client, LocalCluster
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer, make_scorer, roc_auc_score

c = Client(LocalCluster())
X, y = make_classification()
X = dd.from_array(X, columns=[f"var{i}" for i in range(X.shape[1])])
y = dd.from_array(y)

obj = xgb.dask.DaskXGBClassifier().fit(X, y)

auc_score = make_scorer(roc_auc_score)(obj, X, y)  # works fine
f1 = get_scorer("f1")(obj, X, y)  # works fine
print(get_scorer("roc_auc")(obj, X, y))  # raises ValueError

Raises the following exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 17
     15 auc_score = make_scorer(roc_auc_score)(obj, X, y)  # works fine
     16 f1 = get_scorer("f1")(obj, X, y)  # works fine
---> 17 print(get_scorer("roc_auc")(obj, X, y))  # raises ValueError

File /usr/local/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:288, in _BaseScorer.__call__(self, estimator, X, y_true, sample_weight, **kwargs)
    285 if sample_weight is not None:
    286     _kwargs["sample_weight"] = sample_weight
--> 288 return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)

File /usr/local/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:380, in _Scorer._score(self, method_caller, estimator, X, y_true, **kwargs)
    378 pos_label = None if is_regressor(estimator) else self._get_pos_label()
    379 response_method = _check_response_method(estimator, self._response_method)
--> 380 y_pred = method_caller(
    381     estimator,
    382     _get_response_method_name(response_method),
    383     X,
    384     pos_label=pos_label,
    385 )
    387 scoring_kwargs = {**self._kwargs, **kwargs}
    388 return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)

File /usr/local/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:90, in _cached_call(cache, estimator, response_method, *args, **kwargs)
     87 if cache is not None and response_method in cache:
     88     return cache[response_method]
---> 90 result, _ = _get_response_values(
     91     estimator, *args, response_method=response_method, **kwargs
     92 )
     94 if cache is not None:
     95     cache[response_method] = result

File /usr/local/lib/python3.12/site-packages/sklearn/utils/_response.py:235, in _get_response_values(estimator, X, response_method, pos_label, return_response_method_used)
    233 else:  # estimator is a regressor
    234     if response_method != "predict":
--> 235         raise ValueError(
    236             f"{estimator.__class__.__name__} should either be a classifier to be "
    237             f"used with response_method={response_method} or the response_method "
    238             "should be 'predict'. Got a regressor with response_method="
    239             f"{response_method} instead."
    240         )
    241     prediction_method = estimator.predict
    242     y_pred, pos_label = prediction_method(X), None

ValueError: DaskXGBClassifier should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
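The traceback ends in the fall-through `else` branch of _get_response_values, which appears to be reached whenever scikit-learn does not recognize the estimator as a classifier, even if it defines predict_proba. A minimal diagnostic sketch for checking how scikit-learn categorizes an estimator follows; is_classifier and is_regressor are public helpers in sklearn.base, while the function name estimator_kind is hypothetical.

```python
from sklearn.base import is_classifier, is_regressor


def estimator_kind(estimator):
    """Summarize how scikit-learn categorizes an estimator."""
    return {
        "class": type(estimator).__name__,
        "is_classifier": is_classifier(estimator),
        "is_regressor": is_regressor(estimator),
        "has_predict_proba": hasattr(estimator, "predict_proba"),
    }
```

Comparing estimator_kind(xgb.dask.DaskXGBClassifier()) with estimator_kind(xgb.XGBClassifier()) would show whether the two classes are categorized differently. If the Dask variant reports is_classifier as False, that would be consistent with the error above, though it would not by itself identify the root cause.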

The vanilla XGBClassifier, by contrast, is not affected:

import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer, make_scorer, roc_auc_score

X, y = make_classification()
X = pd.DataFrame(X, columns=[f"var{i}" for i in range(X.shape[1])])
y = pd.Series(y)

obj = xgb.XGBClassifier().fit(X, y)

auc_score = make_scorer(roc_auc_score)(obj, X, y)  # works fine
f1 = get_scorer("f1")(obj, X, y)  # works fine
other_auc_score = get_scorer("roc_auc")(obj, X, y)  # works fine

Environment

dask==2024.8.0
xgboost==2.1.4
scikit-learn==1.6.1

system: python:3.12-slim-bullseye docker container on Mac M3

@trivialfis
Member

trivialfis commented Feb 25, 2025

I'm using dask 2024.11.2, and with that version the script just runs into an internal error instead. I will look into it once I set up a different environment with an older dask version.

  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1055, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/sklearn/utils/_array_api.py", line 832, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/dask/array/core.py", line 1746, in __array__
    x = self.compute()
        ^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/dask/base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev/lib/python3.12/site-packages/distributed/client.py", line 2427, in _gather
    raise exception.with_traceback(traceback)
distributed.client.FutureCancelledError: ('getitem-3437663e5548237bd31527e3c352eca6', 0) cancelled for reason: unknown.

I get the same error with dask 2024.12.
