Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexdata: read posting list iff all ng exist #619

Merged
merged 4 commits into from
Jul 19, 2023

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Jul 17, 2023

The purpose of this PR is to reduce disk IO in case we skip a shard because of missing ngrams.

To achieve this, we first check whether ALL ngrams exist in the shard before loading the posting lists to determine their frequency. This means we have to loop twice over the ngrams for the benefit of not loading any posting list in case the shard would have been skipped anyways.

Test plan:

  • This is a refactor, so relying on CI

@stefanhengl stefanhengl marked this pull request as ready for review July 17, 2023 10:48
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems pretty tricky. I wonder if we could design this slightly differently. These are the concerns:

  1. btree code caring about caseSensitive. I understand why we need to do this, but it kinda feels wrong. Would this code be better if we only did this sort of optimization for case sensitive search?
  2. allocations. I'm unsure on the impact of the GC since this does a bunch more allocations. It's not so much size I am worried about, but number of objects. In particular [][]int will have len() objects.
  3. Is this worth it? I suppose if it halves IO for attribution search it is.

I guess the other solution was changing how we laid out data on disk, but that requires a reindex while this doesn't which makes it appealing. What are your feelings about this after working on it?

I'm mulling on this a bit. And spending allocations is likely the right tradeoff. If you rule out a shard (the common case) that saved IO is good. If we don't rule it out, then more than likely we will be doing many more allocations while actually searching it.

Then there is two bits of future work that could likely help here:

  1. If this is good, change the shape of our data on disk so ngram is stored with posting list location
  2. The stuff we do for IR introducing opportunity for reusing buffers/etc

indexdata.go Outdated
frequencies = append(frequencies, freq)
frequencies := make([]uint32, 0, len(ngramOffs))
for _, ngramIndex := range ngramIndexes {
frequencies = append(frequencies, d.ngramIndexFrequency(ngramIndex, query.FileName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might as well make ngramIndexFrequency take in the ngramIndexes slice and avoid the loop here.

@stefanhengl
Copy link
Member Author

Would this code be better if we only did this sort of optimization for case sensitive search?

Is attribution search case sensitive?

@keegancsmith
Copy link
Member

Would this code be better if we only did this sort of optimization for case sensitive search?

Is attribution search case sensitive?

yes. right now it is (mainly for speed)

@stefanhengl
Copy link
Member Author

Would this code be better if we only did this sort of optimization for case sensitive search?

Is attribution search case sensitive?

yes. right now it is (mainly for speed)

Alright. Then let's focus on case sensitive first. Once we have other optimisations in place we can revisit.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?

indexdata.go Outdated
Comment on lines 440 to 443
if query.CaseSensitive {
freq = d.ngramFrequency(o.ngram, query.FileName)
ngramLookups++
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this code path?

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?

@stefanhengl
Copy link
Member Author

LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?

I think so. Let's finish validating your changes first. I will wait with merging until then.

@keegancsmith
Copy link
Member

@stefanhengl I merged all my PRs which horribly conflicted with this. So I went and resolved it for ya, please take a look at the changes to makes sure I didn't mess it up.

@keegancsmith
Copy link
Member

I am gonna merge this now so we can lay the ground work to get this out into production tomorrow.

@keegancsmith keegancsmith merged commit b7e5070 into main Jul 19, 2023
7 checks passed
@keegancsmith keegancsmith deleted the sh/stop-early-if-missing-ng branch July 19, 2023 14:01
keegancsmith added a commit that referenced this pull request Jul 24, 2023
This reverts commit b7e5070. Initial
data from production shows we didn't improve performance so we are
reverting since the complicates without improving perf.

Test Plan: CI
keegancsmith added a commit that referenced this pull request Jul 24, 2023
This reverts commit b7e5070. Initial
data from production shows we didn't improve performance so we are
reverting since the complicates without improving perf.

Test Plan: CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants