How to increase join performance #2240

hermidalc · 2022-10-19T12:25:01Z

When performing a join on two large dataframes (each with only 2 or 3 columns but tens to hundreds of millions of rows, with allow_duplication=True), how do I improve Vaex performance? The join sometimes takes well over an hour, and for much of that time Vaex is running single-threaded. Is there a way to speed it up, such as presorting the join column in each dataframe, or something else?

hermidalc (Author) · 2022-10-19T13:33:41Z

In fact, one particular join, of a df with 3 cols x 250 million rows against a df with 2 cols x 250 million rows, is taking forever: it has been running mostly single-threaded for 2 hours. The first df was filtered from a df with 1 billion rows; I'm wondering if that makes a difference. The second df isn't filtered.
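
One quick check before a join like this is how many rows it will produce, since repeated keys on the right side multiply the output. The sketch below is not from this thread and does not use Vaex; it assumes a left join in which allow_duplication=True emits one output row per matching right row and keeps unmatched left rows once. The function name `estimate_left_join_rows` and the sample keys are hypothetical.

```python
from collections import Counter


def estimate_left_join_rows(left_keys, right_keys):
    """Estimate the row count of a left join that duplicates rows on repeated right keys."""
    # Count how often each key occurs on the right side.
    right_counts = Counter(right_keys)
    # Each left row yields one output row per matching right row,
    # or a single row (with missing values) when nothing matches.
    return sum(max(1, right_counts[key]) for key in left_keys)


# Tiny hypothetical key columns standing in for the real join columns.
left_keys = ["a", "a", "b", "c"]
right_keys = ["a", "a", "a", "b"]
estimated_rows = estimate_left_join_rows(left_keys, right_keys)
print(estimated_rows)
```

If the estimate is far larger than either input, the run time may be dominated by the size of the result rather than by the key lookup itself.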

JovanVeljanoski (Maintainer) · 2022-10-20T00:00:13Z

Aside: I've converted this to a discussion thread since it is not really an "issue" or "bug" in the typical sense.

On topic: some ideas I mentioned in #2238 (comment)

This is a rather busy week for us; I will try to give a better answer next week!

hermidalc (Author) · 2022-10-20T11:48:48Z

Thank you @JovanVeljanoski, it would be nice to get your thoughts on how to improve the run time. 2 hours seems quite long for such an operation, and for most of that time Vaex is using only a single thread.

Would pre-sorting the join column in both dataframes help? Any other thoughts?
