Run sliceline on large datasets #28
praveenjune17
started this conversation in
General
Replies: 1 comment
-
Hello @praveenjune17, thank you for your support. Before using SystemDS, did you check your data types? A common caveat is a numerical column typed as a string, which ends up as a categorical variable with very high cardinality. Sliceline struggles with such columns: the first step of the algorithm is to one-hot encode categorical variables, so in that case it would create hundreds or thousands of new columns, leading to an out-of-memory error. That being said, the current version of the Sliceline package does not use SystemDS as in the original paper, and we do not plan to add it. You can:
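To illustrate the caveat above, here is a minimal sketch (the DataFrame and column names are hypothetical, not from this thread) showing how a numeric column stored as strings inflates cardinality, and one way to convert such columns back to numeric dtypes before fitting Sliceline:

```python
import pandas as pd

# Hypothetical dataset: a numeric column accidentally stored as strings.
df = pd.DataFrame({
    "latency_ms": [str(v) for v in range(10_000)],  # numbers typed as str
    "region": ["eu", "us"] * 5_000,                 # a genuine categorical
})

# A string-typed numeric column behaves like a categorical variable with one
# category per distinct value, so one-hot encoding it would create ~10,000
# new columns.
print(df["latency_ms"].nunique())  # 10000 distinct "categories"

# Fix: coerce object columns that are fully numeric-looking back to numbers.
for col in df.select_dtypes(include="object"):
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().all():
        df[col] = converted

print(df.dtypes)  # latency_ms is now numeric; region stays categorical
```

Checking `nunique()` per object column before fitting is a cheap way to spot this kind of cardinality explosion.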
-
Kudos to the Sliceline team for making a cool MLI tool!
In my use case, I'm using the Python version of Sliceline. My training dataset has 75M records and 100+ columns, so I'm not able to fit it into Sliceline. I took a random sample of 37K records, but even then I can't explore the search space beyond level 2, given the huge number of columns. I see there is a SystemDS implementation of this package, but since I'm new to it, I can't figure out how to use Sliceline with SystemDS. Could you create a notebook explaining how to run Sliceline using SystemDS?