

@waridrox waridrox commented Sep 15, 2025

Closes #2029
CC: @thomass-dev

While debugging, I found the following in data_accessor.py:

if X is None:
    raise ValueError(err_msg.format(f"X_{dataset}", data_source))
elif not sbd.is_dataframe(X):
    X = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(X.shape[1])])
if with_y:
    if y is None:
        raise ValueError(err_msg.format(f"y_{dataset}", data_source))
    if isinstance(y, pd.Series) and y.name is not None:
        y = y.to_frame()
    elif not sbd.is_dataframe(y):
        if y.ndim == 1:
            columns = ["Target"]
        else:
            columns = [f"Target {i}" for i in range(y.shape[1])]
        y = pd.DataFrame(y, columns=columns)
return X, y

This code only converts NumPy arrays to DataFrames with string column names. It does not handle inputs that are already DataFrames but have integer column names.

Those integer names are then passed down to the skrub functions, which expect strings. The final error, TypeError: cannot use a string pattern on a bytes-like object, occurs because suggested_name is an integer from the RangeIndex columns.

def _get_new_name(suggested_name, forbidden_names):
    """Get a new name for a column."""
    # .......    
    tags = re.findall(tag_pattern, suggested_name)  # <==== 
    # .......

tag_pattern is a string regex pattern, and re.findall(tag_pattern, suggested_name) expects the pattern and the searched value to be either both strings or both bytes-like objects.
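The type mismatch can be reproduced with the standard library alone. The sketch below is only illustrative: `TAG_PATTERN` is a made-up stand-in (the real skrub pattern is not shown in this thread), and `find_tags` / `try_find_tags` are hypothetical helper names.

```python
import re

# Hypothetical stand-in for skrub's tag_pattern; the real pattern is not shown here.
TAG_PATTERN = r"__skrub_[0-9a-f]+__"


def find_tags(suggested_name):
    """Mimic the failing call: a str pattern applied to whatever name is passed in."""
    return re.findall(TAG_PATTERN, suggested_name)


def try_find_tags(suggested_name):
    """Return the matches, or the TypeError message if the name has the wrong type."""
    try:
        return find_tags(suggested_name)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_find_tags("col__skrub_ab12__"))  # str name: works
print(try_find_tags(b"0"))                 # bytes-like name: str pattern is rejected
print(try_find_tags(0))                    # plain int name: also a TypeError
print(try_find_tags(str(0)))               # converting the name to str avoids the error
```

The exact message for a non-string name depends on its type and on the Python version, which is why the sketch only relies on `TypeError` being raised.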

I worked around this by making sure that an input that is already a DataFrame also ends up with string column names, to avoid issues with skrub later when the data is passed down from data_accessor.py.

Alternatively, should I simply raise an exception if the computation fails in subsequent steps instead?
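As a rough illustration of that idea (not necessarily the exact change in this PR), the normalization could look like the sketch below. `ensure_string_columns` is a hypothetical helper name, and the example assumes pandas is installed.

```python
import pandas as pd


def ensure_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with string column labels, copying only when a rename is needed."""
    # Hypothetical helper for illustration; not taken from the skore code base.
    if all(isinstance(name, str) for name in df.columns):
        return df  # already string-named, nothing to do
    renamed = df.copy()
    renamed.columns = [str(name) for name in df.columns]
    return renamed


# DataFrame with default integer (RangeIndex) column names.
X = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
# DataFrame mixing integer and string column names.
mixed = pd.DataFrame({0: [1], "age": [2]})

X_fixed = ensure_string_columns(X)
mixed_fixed = ensure_string_columns(mixed)
print(list(X_fixed.columns), list(mixed_fixed.columns))
```

One caveat with this approach: a frame that contains both `0` and `"0"` as column names would end up with duplicate labels, so a real implementation may want to detect that case.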


thomass-dev commented Sep 15, 2025

Thanks for your investigation and your contribution. Please add a dedicated test.



def test_analyze_df_with_integer_column_names_multiclass():
    """Ensure analyze works when X/y are DataFrames with integer column names."""
Contributor Author

This test verifies that when X/y are DataFrames with integer RangeIndex columns on a multiclass dataset, report.data.analyze() runs and the resulting columns are strings.

    assert df.shape[1] == X_train.shape[1] + 1


def test_analyze_df_with_mixed_column_name_types():
Contributor Author

This test verifies that mixed column name types in X (ints and strings) and an integer column name in y are normalized, so that analyze(data_source="all") succeeds and the output columns are strings.

Successfully merging this pull request may close these issues.

fix(skore): Data analyze fails when called with multi class dataset