fix(skore): convert DataFrame column names to string type #2034

waridrox · 2025-09-15T07:25:56Z

Closes #2029
CC: @thomass-dev

While debugging, I traced the problem to this code in data_accessor.py:

skore/skore/src/skore/_sklearn/_estimator/data_accessor.py

Lines 49 to 67 in eb0d6e9

    
    if X is None:
        raise ValueError(err_msg.format(f"X_{dataset}", data_source))
    elif not sbd.is_dataframe(X):
        X = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(X.shape[1])])

    if with_y:
        if y is None:
            raise ValueError(err_msg.format(f"y_{dataset}", data_source))

        if isinstance(y, pd.Series) and y.name is not None:
            y = y.to_frame()
        elif not sbd.is_dataframe(y):
            if y.ndim == 1:
                columns = ["Target"]
            else:
                columns = [f"Target {i}" for i in range(y.shape[1])]
            y = pd.DataFrame(y, columns=columns)

    return X, y

This method only converts NumPy arrays to DataFrames with string column names. It does not handle inputs that are already DataFrames but have integer column names.

Such a DataFrame is then passed down to the skrub functions, which expect string column names. The final error, TypeError: cannot use a string pattern on a bytes-like object, occurs because suggested_name is an integer taken from the RangeIndex columns.

def _get_new_name(suggested_name, forbidden_names):
    """Get a new name for a column."""
    # .......
    tags = re.findall(tag_pattern, suggested_name)  # <==== fails when suggested_name is an integer
    # .......

tag_pattern is a string regex pattern, and re.findall(tag_pattern, suggested_name) expects the pattern and the searched value to be both strings or both bytes-like objects.
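To make the failure mode concrete, here is a minimal standalone sketch that uses only the standard library. The pattern and the `find_tags` helper are hypothetical stand-ins, not skrub's actual code, and the exact TypeError message depends on the type of the non-string value.

```python
import re

# Hypothetical pattern for illustration only; not skrub's real `tag_pattern`.
tag_pattern = r"__skrub_[0-9a-f]+__"


def find_tags(suggested_name):
    """Mimic the failing call: search a column name for tags."""
    return re.findall(tag_pattern, suggested_name)


# An integer column label is rejected by `re.findall` with a TypeError.
try:
    find_tags(0)
    int_name_failed = False
except TypeError:
    int_name_failed = True

# Converting the label to `str` first makes the same call succeed.
tags_from_str = find_tags(str(0))
```

With the label converted to a string, the call returns an empty list here because the name contains no tag.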

I worked around this by making sure that an input which is already a DataFrame also ends up with string column names, so that nothing non-string is passed from data_accessor.py down to skrub.

Alternatively, should I simply raise an exception when the computation fails in a later step instead?
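As a rough sketch of that normalization (not necessarily the exact change in this PR), the column labels of an existing DataFrame can be converted with pandas alone. The helper name `with_string_column_names` is hypothetical.

```python
import pandas as pd


def with_string_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame whose column labels are all `str`."""
    # `rename` applies the callable to every column label and returns a new frame.
    return frame.rename(columns=str)


# A DataFrame built without explicit names gets an integer RangeIndex.
X = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
# Mixed labels (ints and strings) are normalized the same way.
X_mixed = pd.DataFrame({0: [1, 2], "b": [3, 4]})

X_str = with_string_column_names(X)
X_mixed_str = with_string_column_names(X_mixed)

print(list(X_str.columns))        # ['0', '1']
print(list(X_mixed_str.columns))  # ['0', 'b']
```

Returning a new frame rather than assigning to `frame.columns` leaves the caller's DataFrame untouched, which seems the safer default for an accessor that only reads the data.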

thomass-dev · 2025-09-15T07:32:03Z

Thanks for your investigation and your contribution. Please add a dedicated test.

waridrox · 2025-09-16T06:14:15Z

skore/tests/unit/reports/estimator/data/test_accessor.py

+
+
+def test_analyze_df_with_integer_column_names_multiclass():
+    """Ensure analyze works when X/y are DataFrames with integer column names."""


This test verifies that when X/y are DataFrames with integer RangeIndex columns on a multiclass dataset, report.data.analyze() runs and the resulting columns are strings.

waridrox · 2025-09-16T06:15:46Z

skore/tests/unit/reports/estimator/data/test_accessor.py

+    assert df.shape[1] == X_train.shape[1] + 1
+
+
+def test_analyze_df_with_mixed_column_name_types():


This test verifies that mixed column name types in X (ints and strings) and an integer column name in y are normalized, so that analyze(data_source="all") succeeds and the output columns are strings.

fix: convert DataFrame column names to string type

eca2659

github-actions bot assigned waridrox Sep 15, 2025

test: Added tests to assert different column types in DataFrame

c2c8de7

waridrox commented Sep 16, 2025

thomass-dev mentioned this pull request Sep 22, 2025

fix(skore): Data analyze fails when called with multi class dataset #2029

Open

auguste-probabl mentioned this pull request Sep 26, 2025

fix(data_accessor): Correct behaviour when data are DataFrames without column names #2052

Draft
