Categorizer should sort categories #916

Open
dask/dask
#8898
@miltava

Description

Hello everyone,

We are running into a problem when calling dd.get_dummies (or DummyEncoder) on data whose categories were inferred by Categorizer.

The problem seems to arise when two columns contain the same categorical values, but the values appear in a different order in each column. In that case, we get the following error: "ValueError: The columns in the computed data do not match the columns in the provided metadata. Order of columns does not match".

The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order) we will get the ValueError.

We would expect get_dummies to work in both cases.

Thanks for the great work.

Milton

import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)


# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())


# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)

cat = Categorizer()
ddf = cat.fit_transform(ddf)

print(ddf.compute())
# this will show that categories are inferred as 
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())
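
As a possible workaround until this is addressed, the sketch below sorts the categories before they are used. It is pandas-only and does not reproduce the dask error itself. It only shows that dtypes with order-of-appearance categories (our assumption about what Categorizer infers here) yield dummy columns in a different order than sorted ones. The helper name sort_categories is hypothetical and not part of dask-ml. The resulting dictionary should be usable as Categorizer(categories=...), as in the working case above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame({"c1": ["a", "c"], "c2": ["c", "a"]})

# Assumption: mimic category inference by order of appearance,
# which gives ['a', 'c'] for c1 and ['c', 'a'] for c2.
appearance = {
    col: CategoricalDtype(categories=list(pd.unique(pdf[col])), ordered=False)
    for col in pdf.columns
}


def sort_categories(dtypes):
    """Hypothetical helper: rebuild each dtype with its categories sorted."""
    return {
        col: CategoricalDtype(categories=sorted(dtype.categories), ordered=dtype.ordered)
        for col, dtype in dtypes.items()
    }


sorted_dtypes = sort_categories(appearance)

# Dummy columns follow the category order of each column's dtype.
unsorted_columns = list(pd.get_dummies(pdf.astype(appearance)).columns)
sorted_columns = list(pd.get_dummies(pdf.astype(sorted_dtypes)).columns)

print(unsorted_columns)
print(sorted_columns)
```

With sorted categories, the dummy columns for c1 and c2 come out in the same relative order regardless of how the values appear in the data.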

Environment:

  • Dask version: 2022.4.0
  • Python version: 3.9
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): conda
