Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scoring: use repo freshness as tiebreaker #832

Merged
merged 5 commits into from
Oct 1, 2024
Merged

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Sep 23, 2024

We ignore priority and instead use the latest commit date as repo rank.
This has a big impact on scoring for Sourcegraph because it means we switch from
star count to repo freshness as tiebreaker.

As a minor tweak, we also separate query based scores from tiebreakers.
To achieve this we reserve the last 7 digits of a score for tiebreakers:

  • 5 digits (maxUint16) for repo rank
  • 2 digits ([0,10]) for file order.

Example:

Before:
score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35

After:
score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35

@cla-bot cla-bot bot added the cla-signed label Sep 23, 2024
@stefanhengl stefanhengl requested a review from a team September 23, 2024 10:59
Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like how it's now clearly split into a tiebreaker. I will review more carefully tomorrow. A couple things on my mind:

  • The "months since 1970" trick makes sense, but is a bit surprising. As time moves on, scores always get higher. If a repo that hasn't been updated for 1 month, it feels like it should have the same "freshness" score, no matter when it's calculated (today vs. a few months from now)... but the score will change. However, it does have the nice property that the scores for a search never change, even as you rerun it over time.
  • Does this have any effect on our Zoekt "golden queries" evals? It'd be great to add some test queries to see more details on how it works! Specifically queries that need good multi-repo ranking.

@stefanhengl
Copy link
Member Author

stefanhengl commented Sep 24, 2024

As time moves on, scores always get higher. If a repo that hasn't been updated for 1 month

I am not worried about that because it's a tiebreaker and the score is not combined with another score. We really just care about a stable order for last commits.

Comment on lines +644 to +646
// We use the number of months since 1970 as a simple measure of repo freshness.
// It is monotonically increasing and stable across re-indexes and restarts.
r.Rank = monthsSince1970(repo.LatestCommitDate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as an atlernative, why not use the unix timestamp which is also monotonically increasing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically the unix timestamp, but in buckets of 1 month to fit into the uint16 we have. I think monthly increments make sense and the unit16 keeps the number of digits we have to reserve in the score to 5.

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach! Just left some questions.

As we discussed offline, I'm not sure this will kick in super often. But it feels like a good step forward, and we can then test on dot com and S2 to get more ideas.

@@ -99,12 +102,24 @@ func (d *indexData) scoreFile(fileMatch *FileMatch, doc uint32, mt matchTree, kn
}
}

// Add tiebreakers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these docs would fit better in contentprovider.go, where we already have some explanations around the various score weights.

score.go Outdated
md := d.repoMetaData[d.repos[doc]]
// digits 3-7 are the repo rank, this gives us increments of 1 per month.
addScore("repo-rank", scoreRepoRankFactor*float64(md.Rank))
// digits 1-2 and the decimals are the doc order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this tricky, it'd be helpful to expand these explanations a bit more 🤔

Here's what I found unclear:

  • doc-order lies in the range [0, 1), so multiplying it by 10 gives [0, 10). So wouldn't it just be "digit 1 and the decimals" ?
  • It's not obvious what range repo-rank lies in. Especially since it can represent both "months since 1970" and "number of stars (with normalization)"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-order lies in the range [0, 1), so multiplying it by 10 gives [0, 10). So wouldn't it just be "digit 1 and the decimals" ?

It is [0, 1] not [0,1). Matches in the first document of a shard will get the full score.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious what range repo-rank lies in. Especially since it can represent both "months since 1970" and "number of stars (with normalization)"

I see, I will update the docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is [0, 1] not [0,1). Matches in the first document of a shard will get the full score.

Ah, I missed the fact we do "1 - doc_id/num_docs". Thanks for clarifying the docs.

stefanhengl added a commit that referenced this pull request Sep 27, 2024
This is motivated by #832 

We use archive index in our e2e tests. In order to test our latest improvements to ranking, archive index needs to set the latest commit date.

Test plan:
- new unit test
- I checked that the tar files downloaded from github have the correct mod time.
We ignore priority and instead use the latest commit date as repo rank.
This has a big impact for Sourcegraph because it means we switch from
star count to repo freshness as tiebreaker.

As a minor tweak, we also separate query based scores from tiebreakers.
To achieve this we reserve the last 7 digits of a score for tiebreakers:
- 5 digits (maxUint16) for repo rank
- 2 digits ([0,10]) for file order (2 digits).

Example:

Before:
score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35

After:
score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
@stefanhengl
Copy link
Member Author

I had to rebase to pull in the latest changes of the e2e ranking test suite.

Comment on lines 1 to +4
queries: 16
recall@1: 8 (50%)
recall@1: 9 (56%)
recall@5: 11 (69%)
mrr: 0.600787
mrr: 0.632037
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny improvement, because "sourcegraph/conc" is now ranked higher than a ~2 months older "golang/go"

@stefanhengl stefanhengl merged commit d15aa28 into main Oct 1, 2024
9 checks passed
@stefanhengl stefanhengl deleted the sh/repo-freshness branch October 1, 2024 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants