Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
Code Sample
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])

print(df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean)))  # Works fine
print(df.agg(x=('A', max), y=('B', 'min'), z=('C', lambda x: np.mean(x))))  # Does not work
Problem description
Passing a lambda or user-defined function as a named aggregation causes pd.DataFrame.agg to raise a KeyError:
Traceback (most recent call last):
  File "/home/tobii.intra/jzn/.PyCharm2018.3/config/scratches/scratch_200.py", line 22, in <module>
    print(df.agg(x=('A', max), y=('B', 'min'), z=('C', lambda x: np.mean(x))))
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 7379, in aggregate
    result_in_dict = relabel_result(result, func, columns, order)
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/aggregation.py", line 342, in relabel_result
    s = s[col_idx_order]
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2912, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"
When I set a breakpoint in the user-defined function, I noticed that the input is not a Series containing all values of the column, but a single scalar value from that column. This seems to be isolated to DataFrame.agg, since my workaround was to add a trivial group-by and use GroupBy.agg, which works fine.
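For reference, here is a minimal sketch of that workaround. It assumes a constant grouping key so that every row lands in a single group; the `key` array and the variable names are illustrative and not taken from my original script. I also pass 'max' as a string rather than the builtin, which should be equivalent here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [np.nan, np.nan, np.nan]],
    columns=["A", "B", "C"],
)

# Hypothetical constant key: all rows share the label 0, so there is one group.
key = np.zeros(len(df), dtype=int)

# Same named aggregations as in the failing call, routed through GroupBy.agg.
result = df.groupby(key).agg(
    x=("A", "max"),
    y=("B", "min"),
    z=("C", lambda s: np.mean(s)),  # the lambda receives the column as a Series
)

print(result)
```

The result is a one-row DataFrame indexed by the group label with columns x, y and z, rather than the layout DataFrame.agg returns, so some reshaping may be needed if the original output shape matters.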
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-143-generic
Version : #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.18.2
pytz : 2020.4
dateutil : 2.8.1
pip : 21.1.2
setuptools : 54.2.0
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None