Categorizer should sort categories

Hello everyone,

We are facing a problem when calling `dd.get_dummies` (or `DummyEncoder`) after using `Categorizer` to infer the categories.

The problem seems to arise when two columns have the same categorical values that appear in a different order. In that case, we get `ValueError: The columns in the computed data do not match the columns in the provided metadata`, followed by `Order of columns does not match`.

The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order) we will get the ValueError.

We would expect get_dummies to work in both cases.

Thanks for the great work.

Milton

```python
import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)


# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())


# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)

cat = Categorizer()
ddf = cat.fit_transform(ddf)

print(ddf.compute())
# this will show that categories are inferred as 
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())

```
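A possible workaround, sketched here under assumptions, is to build the category dtypes ourselves with sorted categories and pass them to `Categorizer(categories=...)`, as in the first half of the example above. The helper name `sorted_category_dtypes` is hypothetical (it is not part of dask or dask-ml), and it assumes the distinct values can be collected from an in-memory pandas frame.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def sorted_category_dtypes(frame, columns=None):
    """Hypothetical helper: map each column to an unordered
    CategoricalDtype whose categories are sorted."""
    if columns is None:
        # Assumption: every column should be treated as categorical.
        columns = list(frame.columns)
    return {
        col: CategoricalDtype(
            categories=sorted(frame[col].dropna().unique()),
            ordered=False,
        )
        for col in columns
    }


pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)

# c1 and c2 now list their categories in the same (sorted) order,
# regardless of the order in which the values appear in the data.
dtypes = sorted_category_dtypes(pdf)
for name, dtype in dtypes.items():
    print(name, list(dtype.categories))
```

The resulting dictionary has the same shape as the `categories` argument used in the working case, so it could be passed as `Categorizer(categories=dtypes)`. For data that does not fit in memory, the distinct values would have to be computed with dask first; that step is not shown here.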

**Environment**:

- Dask version: 2022.4.0
- Python version: 3.9
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): conda

Categorizer should sort categories #916