MondrianCP can't handle Pandas dataframe #526

lennartvandeguchte · 2024-10-30T11:04:53Z

Describe the bug

When using the new MondrianCP class, I'm unable to fit my estimator with a Pandas DataFrame, whereas this works fine with the standard MapieRegressor. My sklearn pipeline contains column transformers that select columns by their pandas names, so I can't convert my data to a NumPy array first: sklearn then raises an error when fitting the estimator.

To Reproduce
The code below reproduces the problem.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from mapie.regression import MapieRegressor
from mapie.mondrian import MondrianCP
from lightgbm import LGBMRegressor
import pandas as pd
from sklearn.model_selection import train_test_split

# Create some dummy data
data = pd.DataFrame(np.random.rand(100, 5), columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
data['categorical_feature'] = np.random.choice(['A', 'B', 'C'], size=100)
y = pd.Series(np.random.rand(100))

# Create bins for the partition
data['BIN'] = pd.cut(y, bins=3, labels=[1, 2, 3])

# Split the data into a train and calibration set
data_train, data_calib, y_train, y_calib = train_test_split(data, y, test_size=0.2, random_state=42)

model = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_child_samples=10,
    num_leaves=31,
    random_state=42
)

ct = ColumnTransformer([
    ("site", OneHotEncoder(), ['categorical_feature']),
    ("features", RobustScaler(), ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']),
    ])
estimators = [('transformers', ct), ('model', model)]
pre_pipe = Pipeline(estimators)
pipe = TransformedTargetRegressor(regressor=pre_pipe, transformer=RobustScaler())
pipe.fit(data_train, y_train)

strategy = "mondrian"
if strategy == "mondrian":
    # Calibrate MondrianCP directly on the DataFrame (the failing case)
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib, y_calib, partition=data_calib['BIN'])
elif strategy == "mondrian_numpy":
    # Convert to NumPy first; the name-based ColumnTransformer then errors
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib.to_numpy(), y_calib, partition=data_calib['BIN'])
else:
    # Plain MapieRegressor accepts the DataFrame
    mapie_regressor = MapieRegressor(estimator=pipe, cv='prefit')
    mapie_regressor = mapie_regressor.fit(data_calib, y_calib)

By changing the strategy to mondrian_numpy you can also reproduce the sklearn error I receive.

Expected behavior
MondrianCP should accept a Pandas DataFrame as input data.

lennartvandeguchte · 2024-10-30T15:32:17Z

I managed to resolve the sklearn issue when using the 'mondrian_numpy' strategy in the example above by using indices in the ColumnTransformer instead of column names:

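# Feature lists matching the example above
numeric_features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']
categorical_features = ['categorical_feature']

# Positional indices of those columns, usable on a NumPy array
numerical_indices = [data.columns.get_loc(col) for col in numeric_features]
categorical_indices = [data.columns.get_loc(col) for col in categorical_features]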

ct = ColumnTransformer([
    ("site", OneHotEncoder(), categorical_indices),
    ("features", RobustScaler(), numerical_indices),
    ])

I don't know if the package maintainers still want the MondrianCP class to handle Pandas dataframes? Otherwise this issue can be closed.
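
For anyone who wants to check the index-based workaround in isolation, here is a minimal sketch that leaves out MAPIE and LightGBM. The toy frame, the make_ct helper, and the sparse_threshold=0 setting (used only to force dense output so the arrays can be compared) are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical toy data with the same kind of columns as the report
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((20, 2)), columns=["feature1", "feature2"])
toy["categorical_feature"] = rng.choice(["A", "B", "C"], size=20)

numeric_features = ["feature1", "feature2"]
categorical_features = ["categorical_feature"]

# Resolve column names to positions once, while the names are still available
numerical_indices = [toy.columns.get_loc(c) for c in numeric_features]
categorical_indices = [toy.columns.get_loc(c) for c in categorical_features]


def make_ct():
    """Build a fresh ColumnTransformer that selects columns by position."""
    return ColumnTransformer(
        [
            ("site", OneHotEncoder(), categorical_indices),
            ("features", RobustScaler(), numerical_indices),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# Positional selection works on the DataFrame and on its NumPy conversion
out_from_frame = make_ct().fit_transform(toy)
out_from_array = make_ct().fit_transform(toy.to_numpy())
same_output = np.allclose(out_from_frame, out_from_array)
print(out_from_frame.shape, out_from_array.shape, same_output)
```

The trade-off is that the indices are tied to the column order at the time they were computed, so the same order has to be kept for every array passed to the pipeline later.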

Valentin-Laurent · 2024-10-31T09:35:42Z

Hi @lennartvandeguchte, thank you for reporting this. Good to know you found a workaround.

We need further internal discussion to decide what to do about this. We'll let you know.

Best,

Valentin-Laurent · 2024-10-31T10:45:25Z

Following our discussion: support for Pandas DataFrames is something we'd like to have, but it is not a quick win. In a prefit setting it is easy to address, but in a split or cross setting we call .fit on the provided estimator (which can be a pipeline), so we need to avoid casting X, y to NDArray; otherwise we lose pd.DataFrame functionality that the pipeline may require.
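
To illustrate what is lost in the cast, here is a small sketch using only pandas and scikit-learn, not MAPIE internals. The toy data and the can_fit helper are hypothetical; the point is that a ColumnTransformer keyed by column names can be fitted on a DataFrame, while the same transformer rejects the array obtained by converting that DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical toy data
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((20, 2)), columns=["feature1", "feature2"])
X["categorical_feature"] = rng.choice(["A", "B", "C"], size=20)


def make_ct():
    """Build a ColumnTransformer that selects columns by name."""
    return ColumnTransformer([
        ("site", OneHotEncoder(), ["categorical_feature"]),
        ("features", RobustScaler(), ["feature1", "feature2"]),
    ])


def can_fit(data):
    """Return True if the name-based transformer can be fitted on `data`."""
    try:
        make_ct().fit(data)
        return True
    except ValueError:
        # scikit-learn rejects string column specifiers on non-DataFrame input
        return False


dataframe_ok = can_fit(X)           # column names are available
ndarray_ok = can_fit(X.to_numpy())  # column names are gone after the cast
print(dataframe_ok, ndarray_ok)
```

This is why any code path that calls .fit on a user-supplied pipeline would need to pass the original DataFrame through untouched.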

We're adding this to our backlog.

Valentin-Laurent added the Backlog (this is in the MAPIE team development backlog, yet to be prioritised) and Enhancement (new feature or request) labels, and removed the Needs decision (the MAPIE team is deciding what to do next) and Bug labels on Oct 31, 2024.
