Faster indexing #52

Open
raulcf opened this issue Aug 23, 2016 · 4 comments
@raulcf
Contributor

raulcf commented Aug 23, 2016

  • Check how to improve Elasticsearch's performance.
  • Build a pre-indexer that filters out data that has already been indexed for a given column. Basically, this requires a count-min sketch per column, so that we can decide not to send certain data to the store if it has already been indexed (note that even though re-sending the data won't change the index, the store still has to process it).
  • Throw more strategies here... (that don't involve building our own stuff, for now)
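
A minimal sketch of the per-column pre-indexer idea above, in Python, might look like the following. The names `CountMinSketch`, `PreIndexer`, and `filter_new`, as well as the `width` and `depth` defaults, are hypothetical and not taken from the project. One assumption worth stating loudly: a count-min sketch never undercounts but can overcount, so a value that was never seen may occasionally be reported as seen and skipped. Whether that loss is acceptable depends on the sketch size and the use case.

```python
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Fixed-size frequency sketch: estimates never undercount, but may overcount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, value):
        # One hash per row, obtained by salting BLAKE2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                value.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, value):
        for row, slot in self._slots(value):
            self.rows[row][slot] += 1

    def estimate(self, value):
        # The minimum over rows is the least-inflated counter for this value.
        return min(self.rows[row][slot] for row, slot in self._slots(value))


class PreIndexer:
    """Hypothetical pre-indexer: one sketch per column, drops values already seen."""

    def __init__(self, width=2048, depth=4):
        self.sketches = defaultdict(lambda: CountMinSketch(width, depth))

    def filter_new(self, column, values):
        sketch = self.sketches[column]
        fresh = []
        for value in values:
            # Estimate 0 means the value was definitely never added to this column.
            if sketch.estimate(value) == 0:
                fresh.append(value)
            sketch.add(value)
        return fresh
```

Only the list returned by `filter_new` would be sent to the store; everything else is assumed to be indexed already. Since only membership is needed here, not counts, a Bloom filter would be a plausible alternative with the same false-positive caveat.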
@jmftrindade
Member

@raulcf
Contributor Author

raulcf commented Aug 24, 2016

Thanks!
Bulk requests should help.

In general, however, we need a more aggressive strategy here, since indexing lags behind profiling by an order of magnitude. Ultimately, it is about reducing the amount of data we index. We are loading database columns, where repetition is common, so filtering out data we have already seen (per column) should help a lot too.

@raulcf
Contributor Author

raulcf commented Aug 24, 2016

I just implemented bulk requests. They helped a lot, actually. Indexing is now only 3x slower than profiling (although I haven't optimized profiling yet). In any case, this is great news: the gap is closing.
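
For readers unfamiliar with bulk requests, here is a rough Python sketch of how documents can be batched into request bodies for Elasticsearch's `_bulk` endpoint. It is not the code used in this project. The function name `bulk_bodies`, the index name, and the batch size are hypothetical. The body format (one action line followed by one source line per document, newline-delimited JSON, with a trailing newline) follows the Bulk API. Older Elasticsearch versions also expect a `_type` in the action line, which this sketch omits.

```python
import json


def bulk_bodies(index, docs, batch_size=500):
    """Yield newline-delimited JSON bodies for POST /_bulk, one per batch."""
    lines = []
    for count, doc in enumerate(docs, start=1):
        # Action line, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        if count % batch_size == 0:
            # The Bulk API requires the body to end with a newline.
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"
```

Each yielded body would be sent as a single HTTP request with `Content-Type: application/x-ndjson`, so the per-request overhead is paid once per batch rather than once per document.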

@jmftrindade
Member

Great, glad to hear that guide helped!

@raulcf raulcf added this to the v0.5 milestone Oct 16, 2017

2 participants