Performance question #4

Heringer-Epson · 2020-04-24T16:20:22Z

Hi,

Thank you for making your repo available; it is extremely helpful.
Would you have any insights on the following problem I encountered?

I have a base dataframe with 50,000 rows and 8 columns, which I use to create roughly 1,500 features per multipliable feature in my feature family. The multipliers are nested: there are 4 categorical multipliers and one time multiplier. Running the code with a single multipliable feature works and takes about 2min, but adding more multipliable features to the same feature family makes the run time grow much faster than linearly (it looks exponential):

1 feature - 2min
3 features - 17min
4 features - 2h20min
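
To put numbers on that growth, here is a minimal sketch in plain Python that uses only the three timings listed above. The variable and function names are made up for illustration, and nothing here touches the repo's code or Spark; it only shows the cost per feature and the slowdown between consecutive measurements.

```python
# Run times reported above, in minutes, keyed by the number of
# multipliable features in the feature family (2h20min = 140 min).
timings_min = {1: 2, 3: 17, 4: 140}


def minutes_per_feature(timings):
    """Average run time per multipliable feature for each measurement."""
    return {n: t / n for n, t in timings.items()}


def growth_between(timings):
    """Ratio of run times between consecutive measurements.

    Returns a list of ((n_before, n_after), ratio) tuples.
    """
    points = sorted(timings.items())
    return [
        ((n0, n1), t1 / t0)
        for (n0, t0), (n1, t1) in zip(points, points[1:])
    ]


per_feature = minutes_per_feature(timings_min)
ratios = growth_between(timings_min)

for n, minutes in per_feature.items():
    print(f"{n} feature(s): {minutes:.1f} min per feature")

for (n0, n1), ratio in ratios:
    print(f"{n0} -> {n1} features: run time x{ratio:.1f}")
```

If the cost were linear, the minutes per feature would stay roughly constant; here it rises at every step, and adding the single fourth feature multiplies the total run time by about 8.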

Is this expected behaviour for data of this size? If not, do you have any suggestions on how to improve it?

I also noticed that the function "append_features" is always executed on a single executor in my cluster, even though dynamic allocation is enabled.

Thank you,
Epson
