scoring: use repo freshness as tiebreaker #832

stefanhengl · 2024-09-23T10:59:05Z

We ignore priority and instead use the latest commit date as repo rank.
This has a big impact on scoring for Sourcegraph because it means we switch from
star count to repo freshness as tiebreaker.

As a minor tweak, we also separate query based scores from tiebreakers.
To achieve this we reserve the last 7 digits of a score for tiebreakers:

5 digits (maxUint16) for repo rank
2 digits ([0,10]) for file order.

Example:

Before:
score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35

After:
score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35

jtibshirani

Nice! I like how it's now clearly split into a tiebreaker. I will review more carefully tomorrow. A couple things on my mind:

The "months since 1970" trick makes sense, but is a bit surprising. As time moves on, scores always get higher. If a repo that hasn't been updated for 1 month, it feels like it should have the same "freshness" score, no matter when it's calculated (today vs. a few months from now)... but the score will change. However, it does have the nice property that the scores for a search never change, even as you rerun it over time.
Does this have any effect on our Zoekt "golden queries" evals? It'd be great to add some test queries to see more details on how it works! Specifically queries that need good multi-repo ranking.

stefanhengl · 2024-09-24T07:49:12Z

As time moves on, scores always get higher. If a repo that hasn't been updated for 1 month

I am not worried about that because it's a tiebreaker and the score is not combined with another score. We really just care about a stable order for last commits.

keegancsmith · 2024-09-25T11:17:48Z

api.go

+		// We use the number of months since 1970 as a simple measure of repo freshness.
+		// It is monotonically increasing and stable across re-indexes and restarts.
+		r.Rank = monthsSince1970(repo.LatestCommitDate)


as an atlernative, why not use the unix timestamp which is also monotonically increasing?

This is basically the unix timestamp, but in buckets of 1 month to fit into the uint16 we have. I think monthly increments make sense and the unit16 keeps the number of digits we have to reserve in the score to 5.

jtibshirani

I like the approach! Just left some questions.

As we discussed offline, I'm not sure this will kick in super often. But it feels like a good step forward, and we can then test on dot com and S2 to get more ideas.

jtibshirani · 2024-09-27T01:00:51Z

score.go

@@ -99,12 +102,24 @@ func (d *indexData) scoreFile(fileMatch *FileMatch, doc uint32, mt matchTree, kn
 		}
 	}

+	// Add tiebreakers


I think these docs would fit better in contentprovider.go, where we already have some explanations around the various score weights.

jtibshirani · 2024-09-27T01:03:54Z

score.go

 	md := d.repoMetaData[d.repos[doc]]
+	// digits 3-7 are the repo rank, this gives us increments of 1 per month.
+	addScore("repo-rank", scoreRepoRankFactor*float64(md.Rank))
+	// digits 1-2 and the decimals are the doc order.


I found this tricky, it'd be helpful to expand these explanations a bit more 🤔

Here's what I found unclear:

doc-order lies in the range [0, 1), so multiplying it by 10 gives [0, 10). So wouldn't it just be "digit 1 and the decimals" ?

It's not obvious what range repo-rank lies in. Especially since it can represent both "months since 1970" and "number of stars (with normalization)"

doc-order lies in the range [0, 1), so multiplying it by 10 gives [0, 10). So wouldn't it just be "digit 1 and the decimals" ?

It is [0, 1] not [0,1). Matches in the first document of a shard will get the full score.

It's not obvious what range repo-rank lies in. Especially since it can represent both "months since 1970" and "number of stars (with normalization)"

I see, I will update the docs.

It is [0, 1] not [0,1). Matches in the first document of a shard will get the full score.

Ah, I missed the fact we do "1 - doc_id/num_docs". Thanks for clarifying the docs.

This is motivated by #832 We use archive index in our e2e tests. In order to test our latest improvements to ranking, archive index needs to set the latest commit date. Test plan: - new unit test - I checked that the tar files downloaded from github have the correct mod time.

We ignore priority and instead use the latest commit date as repo rank. This has a big impact for Sourcegraph because it means we switch from star count to repo freshness as tiebreaker. As a minor tweak, we also separate query based scores from tiebreakers. To achieve this we reserve the last 7 digits of a score for tiebreakers: - 5 digits (maxUint16) for repo rank - 2 digits ([0,10]) for file order (2 digits). Example: Before: score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35 After: score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35

stefanhengl · 2024-09-27T11:23:27Z

I had to rebase to pull in the latest changes of the e2e ranking test suite.

stefanhengl · 2024-09-27T11:27:08Z

internal/e2e/testdata/rank_stats.txt

 queries: 16
-recall@1: 8 (50%)
+recall@1: 9 (56%)
 recall@5: 11 (69%)
-mrr: 0.600787
+mrr: 0.632037


Tiny improvement, because "sourcegraph/conc" is now ranked higher than a ~2 months older "golang/go"

cla-bot bot added the cla-signed label Sep 23, 2024

stefanhengl requested a review from a team September 23, 2024 10:59

jtibshirani reviewed Sep 24, 2024

View reviewed changes

keegancsmith reviewed Sep 25, 2024

View reviewed changes

stefanhengl mentioned this pull request Sep 26, 2024

archive: use ModTime as proxy for LatestCommitDate #836

Merged

jtibshirani reviewed Sep 27, 2024

View reviewed changes

stefanhengl added 5 commits September 27, 2024 13:04

set zoekt.latest_commit_date

35941a2

revert changes to bytes_buffer.txt

885f2c1

update docs

54b3e4a

update e2e test

bbc7e7f

stefanhengl force-pushed the sh/repo-freshness branch from fd2f4c8 to bbc7e7f Compare September 27, 2024 11:22

stefanhengl commented Sep 27, 2024

View reviewed changes

jtibshirani approved these changes Sep 27, 2024

View reviewed changes

stefanhengl merged commit d15aa28 into main Oct 1, 2024
9 checks passed

stefanhengl deleted the sh/repo-freshness branch October 1, 2024 06:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scoring: use repo freshness as tiebreaker #832

scoring: use repo freshness as tiebreaker #832

stefanhengl commented Sep 23, 2024 •

edited

Loading

jtibshirani left a comment •

edited

Loading

stefanhengl commented Sep 24, 2024 •

edited

Loading

keegancsmith Sep 25, 2024

stefanhengl Sep 26, 2024

jtibshirani left a comment

jtibshirani Sep 27, 2024

jtibshirani Sep 27, 2024

stefanhengl Sep 27, 2024

stefanhengl Sep 27, 2024

jtibshirani Sep 27, 2024

stefanhengl commented Sep 27, 2024

stefanhengl Sep 27, 2024

scoring: use repo freshness as tiebreaker #832

scoring: use repo freshness as tiebreaker #832

Conversation

stefanhengl commented Sep 23, 2024 • edited Loading

jtibshirani left a comment • edited Loading

Choose a reason for hiding this comment

stefanhengl commented Sep 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stefanhengl commented Sep 27, 2024

Choose a reason for hiding this comment

stefanhengl commented Sep 23, 2024 •

edited

Loading

jtibshirani left a comment •

edited

Loading

stefanhengl commented Sep 24, 2024 •

edited

Loading