Replies: 3 comments
-
In fact, for one particular join of a df with 3 columns x 250 million rows against a df with 2 columns x 250 million rows, it has been running for 2 hours, mostly single-threaded, and still hasn't finished. The first df was filtered from a df with 1 billion rows; I wonder whether that makes a difference. The second df isn't filtered.
-
Aside: I've converted this to a discussion thread, since it is not really an "issue" or "bug" in the typical sense. On topic: I mentioned some ideas in #2238 (comment). This is a rather busy week for us; I will try to give a better answer next week!
-
Thank you @JovanVeljanoski, it would be nice to get your thoughts on how to improve performance. Two hours seems quite long for such an operation, and for most of that time Vaex is using only a single thread. Would pre-sorting the join column in both dataframes help? Any other thoughts?
-
When performing a join on two large dataframes (each with only 2 or 3 columns, but tens to hundreds of millions of rows, and with `allow_duplication=True`), how do I improve Vaex performance? The join sometimes takes well over an hour, and for much of that time Vaex is running single-threaded. Is there a way to speed it up, such as pre-sorting the join column in each dataframe, or something else?
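For readers unfamiliar with the option: `allow_duplication=True` matters when the join column on the right side contains repeated values, so a single left row can match several right rows. Below is a minimal pure-Python sketch of that behaviour for a left join. It is only an illustration under stated assumptions, not Vaex's implementation, and the function name `left_join_with_duplicates` and the sample data are hypothetical.

```python
from collections import defaultdict


def left_join_with_duplicates(left_rows, right_rows):
    """Left-join two lists of (key, value) pairs, keeping every match.

    Hypothetical helper for illustration only; not part of Vaex.
    """
    # Build phase: index the right side by key. A key may map to several values.
    index = defaultdict(list)
    for key, value in right_rows:
        index[key].append(value)

    # Probe phase: emit one output row per match,
    # or a single row padded with None when the key has no match.
    joined = []
    for key, value in left_rows:
        matches = index.get(key)
        if matches:
            joined.extend((key, value, match) for match in matches)
        else:
            joined.append((key, value, None))
    return joined


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (1, "y"), (2, "z")]  # key 1 appears twice on the right

result = left_join_with_duplicates(left, right)
print(len(left), "left rows ->", len(result), "joined rows")
```

The point of the sketch is that the output can have more rows than the left input, so the cost of such a join depends on how often keys repeat as well as on the input sizes.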