Description
Hello everyone,
We are facing a problem when calling dd.get_dummies (or DummyEncoder) after using Categorizer to infer the categories.
The problem seems to arise when two columns contain the same categorical values but their categories are inferred in a different order. In that case, we get a ValueError: The columns in the computed data do not match the columns in the provided metadata. Order of columns does not match
The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order) we will get the ValueError.
We would expect get_dummies to work in both cases.
Thanks for the great work.
Milton
import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype
pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)
# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())
# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer()
ddf = cat.fit_transform(ddf)
print(ddf.compute())
# this will show that categories are inferred as
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())
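For reference, here is a minimal pandas-only sketch of the category-order effect described above. It does not use Dask or Categorizer: the category order for c2 is set by hand to ['c', 'a'] to mimic the inferred order (an assumption for illustration), and the variable names before, aligned and after are hypothetical. It shows that the dummy columns follow the category order, and that sorting the categories makes both columns line up. Whether applying the same reordering to the Dask DataFrame avoids the ValueError has not been verified here.

```python
import pandas as pd

# Assumption: c2's categories are declared by hand as ["c", "a"] to mimic
# the order that Categorizer reportedly infers for that column.
pdf = pd.DataFrame(
    {
        "c1": pd.Categorical(["a", "c"], categories=["a", "c"]),
        "c2": pd.Categorical(["c", "a"], categories=["c", "a"]),
    }
)

# Dummy columns are emitted in category order, so c2 comes out as c2_c, c2_a.
before = list(pd.get_dummies(pdf).columns)

# Possible workaround sketch: give every column a sorted category order.
aligned = pdf.copy()
for col in aligned.columns:
    sorted_cats = sorted(aligned[col].cat.categories)
    aligned[col] = aligned[col].cat.reorder_categories(sorted_cats)

after = list(pd.get_dummies(aligned).columns)

print(before)
print(after)
```

Reordering only changes the category order, not the values stored in each column.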
Environment:
- Dask version: 2022.4.0
- Python version: 3.9
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): conda