-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
indexdata: read posting list iff all ng exist #619
Conversation
45435ab
to
e7c7e70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems pretty tricky. I wonder if we could design this slightly differently. These are the concerns:
- btree code caring about caseSensitive. I understand why we need to do this, but it kinda feels wrong. Would this code be better if we only did this sort of optimization for case sensitive search?
- allocations. I'm unsure on the impact of the GC since this does a bunch more allocations. It's not so much size I am worried about, but number of objects. In particular [][]int will have len() objects.
- Is this worth it? I suppose if it halves IO for attribution search it is.
I guess the other solution was changing how we laid out data on disk, but that requires a reindex while this doesn't which makes it appealing. What are your feelings about this after working on it?
I'm mulling on this a bit. And spending allocations is likely the right tradeoff. If you rule out a shard (the common case) that saved IO is good. If we don't rule it out, then more than likely we will be doing many more allocations while actually searching it.
Then there is two bits of future work that could likely help here:
- If this is good, change the shape of our data on disk so ngram is stored with posting list location
- The stuff we do for IR introducing opportunity for reusing buffers/etc
indexdata.go
Outdated
frequencies = append(frequencies, freq) | ||
frequencies := make([]uint32, 0, len(ngramOffs)) | ||
for _, ngramIndex := range ngramIndexes { | ||
frequencies = append(frequencies, d.ngramIndexFrequency(ngramIndex, query.FileName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might as well make ngramIndexFrequency take in the ngramIndexes slice and avoid the loop here.
Is attribution search case sensitive? |
yes. right now it is (mainly for speed) |
Alright. Then let's focus on case sensitive first. Once we have other optimisations in place we can revisit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?
indexdata.go
Outdated
if query.CaseSensitive { | ||
freq = d.ngramFrequency(o.ngram, query.FileName) | ||
ngramLookups++ | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?
I think so. Let's finish validating your changes first. I will wait with merging until then. |
@stefanhengl I merged all my PRs which horribly conflicted with this. So I went and resolved it for ya, please take a look at the changes to makes sure I didn't mess it up. |
I am gonna merge this now so we can lay the ground work to get this out into production tomorrow. |
This reverts commit b7e5070. Initial data from production shows we didn't improve performance so we are reverting since the complicates without improving perf. Test Plan: CI
The purpose of this PR is to reduce disk IO in case we skip a shard because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before loading the posting lists to determine their frequency. This means we have to loop twice over the ngrams for the benefit of not loading any posting list in case the shard would have been skipped anyways.
Test plan: