Performance question #4

Open
Heringer-Epson opened this issue Apr 24, 2020 · 0 comments

Hi,

Thank you for making your repo available; it is extremely helpful.
Would you have any insight into the following problem I encountered?

I have a base dataframe with 50,000 rows and 8 columns, which I use to create roughly 1,500 features per multipliable feature in my feature family. The multipliers are nested: there are 4 categorical multipliers and one time multiplier. Running the code with a single multipliable feature works and takes about 2 min, but adding more multipliable features to the same feature family makes the run time grow far faster than linearly (it looks roughly exponential):

1 feature - 2 min
3 features - 17 min
4 features - 2 h 20 min
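
To put numbers on the growth, here is a rough sketch that converts the timings above into a per-feature cost and compares each run with a linear extrapolation from the single-feature run. The variable names are purely illustrative and are not part of the library.

```python
# Run times reported above, in minutes, keyed by number of multipliable features.
timings_min = {1: 2, 3: 17, 4: 2 * 60 + 20}

# Minutes per multipliable feature; this would stay near 2 if scaling were linear.
per_feature = {n: t / n for n, t in timings_min.items()}

# Slowdown relative to a linear extrapolation from the single-feature run.
slowdown = {n: t / (n * timings_min[1]) for n, t in timings_min.items()}

for n in sorted(timings_min):
    print(
        f"{n} feature(s): {timings_min[n]} min total, "
        f"{per_feature[n]:.1f} min per feature, "
        f"{slowdown[n]:.1f}x the linear estimate"
    )
```

Going from 3 to 4 features multiplies the total run time by roughly 8, which is what makes me suspect that something other than the amount of data is driving the cost.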

Is this expected behaviour for data of this size? If not, do you have any suggestions on how to improve it?

I also noticed that the function "append_features" always runs on a single executor in my cluster, even though dynamic allocation is enabled.

Thank you,
Epson
