Run sliceline on large datasets #28
praveenjune17
started this conversation in
General
Replies: 1 comment
-
Hello @praveenjune17, thank you for your support. Before using SystemDS, did you check your data types? A common caveat is a numerical column typed as a string, which ends up as a categorical variable with very high cardinality. Sliceline struggles with such columns: the first step of the algorithm is to one-hot encode categorical variables, so in that case it would create hundreds or thousands of new columns, leading to an out-of-memory error. That being said, the current version of the Sliceline package does not use SystemDS as in the original paper, and we do not plan to add it. You can:
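To illustrate the caveat above, here is a minimal sketch (the DataFrame and column names are hypothetical, not from this thread) showing how a numeric column stored as strings inflates cardinality, and one way to convert such columns back to numeric dtypes before fitting Sliceline:

```python
import pandas as pd

# Hypothetical dataset: a numeric column accidentally stored as strings.
df = pd.DataFrame({
    "latency_ms": [str(v) for v in range(10_000)],  # numbers typed as str
    "region": ["eu", "us"] * 5_000,                 # a genuine categorical
})

# A string-typed numeric column behaves like a categorical variable with one
# category per distinct value, so one-hot encoding it would create ~10,000
# new columns.
print(df["latency_ms"].nunique())  # 10000 distinct "categories"

# Fix: coerce object columns that are fully numeric-looking back to numbers.
for col in df.select_dtypes(include="object"):
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().all():
        df[col] = converted

print(df.dtypes)  # latency_ms is now numeric; region stays categorical
```

Checking `nunique()` per object column before fitting is a cheap way to spot this kind of cardinality explosion.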
-
Kudos to the Sliceline team for making a cool MLI tool!
In my use case, I'm using the Python version of Sliceline. My training dataset has 75M records and 100+ columns, so I'm not able to fit it into Sliceline. I took a random sample of 37K records, but even then I can't explore the search space beyond level 2, given the huge number of columns. I see there is a SystemDS implementation of this package, but since I'm new to it, I can't figure out how to use Sliceline with SystemDS. Could you create a notebook explaining how to run Sliceline using SystemDS?