indexdata: read posting list iff all ng exist #619

stefanhengl · 2023-07-17T09:38:58Z

The purpose of this PR is to reduce disk IO in case we skip a shard because of missing ngrams.

To achieve this, we first check whether ALL ngrams exist in the shard before loading the posting lists to determine their frequency. This means we have to loop twice over the ngrams for the benefit of not loading any posting list in case the shard would have been skipped anyways.
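A minimal sketch of the two-pass idea described above, not the actual zoekt code. The `shard` type, its `index` map, and `readPostingLen` are hypothetical stand-ins: the map plays the role of the in-memory ngram lookup, and `readPostingLen` plays the role of the disk read we want to avoid.

```go
package main

import "fmt"

// ngram is a hypothetical stand-in for the real ngram type.
type ngram uint64

// shard is a toy model: index is the cheap in-memory lookup,
// postings stands in for posting lists that live on disk.
type shard struct {
	index    map[ngram]int
	postings [][]uint32
	reads    int // counts simulated disk reads
}

// readPostingLen simulates the expensive step of loading a posting list.
func (s *shard) readPostingLen(slot int) uint32 {
	s.reads++
	return uint32(len(s.postings[slot]))
}

// frequencies returns false without any simulated disk read
// if at least one ngram is missing from the shard.
func (s *shard) frequencies(ngrams []ngram) ([]uint32, bool) {
	// Pass 1: existence check only.
	slots := make([]int, 0, len(ngrams))
	for _, ng := range ngrams {
		slot, ok := s.index[ng]
		if !ok {
			return nil, false // shard can be skipped
		}
		slots = append(slots, slot)
	}
	// Pass 2: all ngrams exist, so now pay for the reads.
	freqs := make([]uint32, 0, len(slots))
	for _, slot := range slots {
		freqs = append(freqs, s.readPostingLen(slot))
	}
	return freqs, true
}

func main() {
	s := &shard{
		index:    map[ngram]int{1: 0, 2: 1},
		postings: [][]uint32{{3, 9}, {4}},
	}
	_, ok := s.frequencies([]ngram{1, 7}) // ngram 7 is missing
	fmt.Println(ok, s.reads)
	freqs, ok := s.frequencies([]ngram{1, 2})
	fmt.Println(freqs, ok, s.reads)
}
```

The cost is the second loop and the extra `slots` slice; the benefit is that a skipped shard triggers no posting list reads at all.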

Test plan:

This is a refactor, so relying on CI

keegancsmith

This seems pretty tricky. I wonder if we could design this slightly differently. These are my concerns:

- btree code caring about caseSensitive. I understand why we need to do this, but it kinda feels wrong. Would this code be better if we only did this sort of optimization for case sensitive search?
- Allocations. I'm unsure about the impact on the GC, since this does a bunch more allocations. It's not so much the size I am worried about, but the number of objects. In particular, [][]int will have len() objects.
- Is this worth it? I suppose if it halves IO for attribution search it is.
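To illustrate the object-count concern: a `[][]int` holds one backing array per inner slice, so the GC tracks roughly `len()` objects. One generic way to reduce that, shown here only as a sketch and not as something this PR does, is to pack all lists into a single slice plus an offsets slice. The `flatLists` type and `flatten` helper are hypothetical names.

```go
package main

import "fmt"

// flatLists stores many int lists in two allocations:
// data holds all elements back to back, offsets marks the boundaries.
type flatLists struct {
	data    []int
	offsets []int // len(offsets) == number of lists + 1
}

// flatten copies a [][]int into the packed representation.
func flatten(lists [][]int) flatLists {
	total := 0
	for _, l := range lists {
		total += len(l)
	}
	f := flatLists{
		data:    make([]int, 0, total),
		offsets: make([]int, 1, len(lists)+1), // offsets[0] == 0
	}
	for _, l := range lists {
		f.data = append(f.data, l...)
		f.offsets = append(f.offsets, len(f.data))
	}
	return f
}

// list returns the i-th list as a sub-slice of data, without allocating.
func (f flatLists) list(i int) []int {
	return f.data[f.offsets[i]:f.offsets[i+1]]
}

func main() {
	f := flatten([][]int{{1, 2}, {}, {3}})
	for i := 0; i < 3; i++ {
		fmt.Println(i, f.list(i))
	}
}
```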

I guess the other solution was changing how we laid out data on disk, but that requires a reindex while this doesn't which makes it appealing. What are your feelings about this after working on it?

I'm mulling on this a bit. And spending allocations is likely the right tradeoff. If you rule out a shard (the common case) that saved IO is good. If we don't rule it out, then more than likely we will be doing many more allocations while actually searching it.

Then there are two bits of future work that could likely help here:

- If this is good, change the shape of our data on disk so the ngram is stored with its posting list location.
- The stuff we do for IR introducing opportunities for reusing buffers, etc.

keegancsmith · 2023-07-17T11:45:36Z

indexdata.go

-		frequencies = append(frequencies, freq)
+	frequencies := make([]uint32, 0, len(ngramOffs))
+	for _, ngramIndex := range ngramIndexes {
+		frequencies = append(frequencies, d.ngramIndexFrequency(ngramIndex, query.FileName))


you might as well make ngramIndexFrequency take in the ngramIndexes slice and avoid the loop here.
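A sketch of what that suggestion could look like, under assumptions: `indexData` here is a toy type, `ngramIndexFrequencies` is a hypothetical name for the slice-taking variant, and the `fileName` flag only mirrors the `query.FileName` argument in the diff above. The real method bodies are not shown in this thread.

```go
package main

import "fmt"

// indexData is a toy stand-in; the two slices model posting list
// sizes for content ngrams and file name ngrams respectively.
type indexData struct {
	contentFreq []uint32
	nameFreq    []uint32
}

// ngramIndexFrequencies (hypothetical) takes the whole slice of ngram
// indexes so the caller no longer needs its own loop.
func (d *indexData) ngramIndexFrequencies(ngramIndexes []int, fileName bool) []uint32 {
	src := d.contentFreq
	if fileName {
		src = d.nameFreq
	}
	frequencies := make([]uint32, 0, len(ngramIndexes))
	for _, i := range ngramIndexes {
		frequencies = append(frequencies, src[i])
	}
	return frequencies
}

func main() {
	d := &indexData{
		contentFreq: []uint32{10, 20, 30},
		nameFreq:    []uint32{1, 2, 3},
	}
	fmt.Println(d.ngramIndexFrequencies([]int{2, 0}, false))
	fmt.Println(d.ngramIndexFrequencies([]int{1}, true))
}
```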

stefanhengl · 2023-07-17T13:00:45Z

> Would this code be better if we only did this sort of optimization for case sensitive search?

Is attribution search case sensitive?

keegancsmith · 2023-07-17T13:14:47Z

> > Would this code be better if we only did this sort of optimization for case sensitive search?
>
> Is attribution search case sensitive?

Yes, right now it is (mainly for speed).

stefanhengl · 2023-07-17T13:29:49Z

> > > Would this code be better if we only did this sort of optimization for case sensitive search?
> >
> > Is attribution search case sensitive?
>
> yes. right now it is (mainly for speed)

Alright. Then let's focus on case sensitive first. Once we have other optimisations in place we can revisit.

keegancsmith

LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?

keegancsmith · 2023-07-17T14:54:32Z

indexdata.go

+			if query.CaseSensitive {
+				freq = d.ngramFrequency(o.ngram, query.FileName)
+				ngramLookups++
+			} else {


Can we remove this code path?


stefanhengl · 2023-07-17T15:10:23Z

> LGTM. How do you think we can validate if this makes a difference? experiments on production and then checking the raw search time? IE the same thing I am doing with the sorting change?

I think so. Let's finish validating your changes first. I will wait with merging until then.


keegancsmith · 2023-07-19T10:28:48Z

@stefanhengl I merged all my PRs, which horribly conflicted with this. So I went and resolved it for ya; please take a look at the changes to make sure I didn't mess it up.

keegancsmith · 2023-07-19T13:59:56Z

I am gonna merge this now so we can lay the ground work to get this out into production tomorrow.

This reverts commit b7e5070. Initial data from production shows we didn't improve performance, so we are reverting since this complicates the code without improving perf. Test Plan: CI

indexdata: read posting list iff all ng exist

e7c7e70

stefanhengl force-pushed the sh/stop-early-if-missing-ng branch from 45435ab to e7c7e70 Compare July 17, 2023 10:35

remove ngramFrequency

0d12632

stefanhengl requested a review from keegancsmith July 17, 2023 10:48

stefanhengl marked this pull request as ready for review July 17, 2023 10:48

keegancsmith reviewed Jul 17, 2023


focus on cases sensitive searches

ac10de9

keegancsmith approved these changes Jul 17, 2023


Merge remote-tracking branch 'origin/main' into sh/stop-early-if-missing-ng

0783d1e

keegancsmith merged commit b7e5070 into main Jul 19, 2023
7 checks passed

keegancsmith deleted the sh/stop-early-if-missing-ng branch July 19, 2023 14:01
