
What are the scores in the search results and how are they calculated? #10

iancappelletti opened this issue Oct 27, 2023 · 2 comments


@iancappelletti

Can you explain the score that appears next to each result on any given search results page? The highest value I've seen so far was a 10.1, and one time the values even came back as negative!

Thanks, fantastic tool BTW.

@tfmorris

I was curious myself, so I checked the source code:

{+Math.log(item.score).toFixed(2)}

It's the natural log (log base e or ln) of the resultScore as returned by the Google kgsearch API, which is, in turn, defined simply as "An indicator of how well the entity matched the request constraints" of type "number" where higher is better.

A negative natural log means that the original resultScore was less than 1.

You can't really infer anything from it other than "bigger is better."
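For anyone who wants to check this themselves, here's a minimal sketch that reproduces the displayed value from the API response. The API key is a placeholder; the field names (itemListElement, result, resultScore) follow the public kgsearch response format.

```js
// Minimal sketch: query the Knowledge Graph Search API and print the raw
// resultScore next to the value shown in the UI (natural log, two decimals).
// API_KEY is a placeholder.
const API_KEY = "YOUR_API_KEY";

async function showScores(query) {
  const url =
    "https://kgsearch.googleapis.com/v1/entities:search" +
    `?query=${encodeURIComponent(query)}&key=${API_KEY}&limit=5`;
  const response = await fetch(url);
  const data = await response.json();

  for (const item of data.itemListElement) {
    // resultScore can be below 1, which is why the logged value can be negative.
    const displayed = +Math.log(item.resultScore).toFixed(2);
    console.log(item.result.name, item.resultScore, displayed);
  }
}

showScores("Michael Jackson");
```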

@AngryLoki
Owner

Hi @iancappelletti and @tfmorris, sorry for the slow response.

As you correctly noticed, the score is not fully calculated on the client side. I also don't know the exact algorithm on the GKG side, but I came across a few patents related to this topic.

From https://patentimages.storage.googleapis.com/7e/d3/17/fb60b3a8f7c0a2/US9672251.pdf
The system determines a frequency score for each pattern (step 602). The frequency score for a given pattern is a function of the total number of extractions produced by applying the pattern, e.g., as described above with reference to FIG. 5. For example, in some implementations, the frequency score is equal to the total number of extractions. In some other implementations, the frequency score is a logarithm, e.g., a base ten or base e logarithm, of the total number of extractions. In some implementations, the system uses the total number of distinct extractions produced by applying the pattern, i.e., by only counting two extractions that are the same as a single extraction produced by applying the pattern.

There is also https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10331706, which describes a "human evaluator", trust coefficients and so on, so I expect that some values are multiplied by hidden coefficients.

Regarding search results, think of the search space as a concatenation of all titles + descriptions + identifiers, indexed with a full-text-search algorithm.

The overall score looks like a mix of a score derived from the number of extractions and a similarity score against the description (a BM25-like algorithm).
For example, if I search for b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d, this substring is contained neither in titles nor in descriptions, so it receives no points from the similarity function. Meanwhile, if you search for Michael Jackson, you will see Prabhu Deva somewhere near the top, which is the result of es conocido como el "Michael Jackson de la India" ("he is known as the 'Michael Jackson of India'") in the description of the relevant es.wikipedia article (it was boosted by the similarity coefficient).

The overall score returned from the Knowledge Graph API looks like an additive score, which sometimes contains huge numbers, so I apply a log function to make it easier to read.
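To make that guess concrete, here is a toy sketch (explicitly not the real GKG algorithm) of what such a mix could look like: a log-of-extraction-count popularity term plus a simplified BM25-style similarity term over the concatenated text fields, with the displayed value being the natural log of the additive total. Every constant, count, and weight below is invented for illustration.

```js
// Toy sketch only, not the real GKG algorithm.

// Simplified single-term BM25 contribution: idf * saturated term frequency.
function bm25Term(tf, docLen, avgDocLen, docFreq, totalDocs, k1 = 1.2, b = 0.75) {
  const idf = Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
  return (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgDocLen)));
}

function toyResultScore(entity, queryTerms, corpus) {
  // Popularity term: log of the extraction count, as in the frequency-score patent.
  const frequencyScore = Math.log(entity.extractions);

  // Similarity term: BM25-like score over titles + descriptions + identifiers.
  const text = `${entity.titles} ${entity.descriptions} ${entity.identifiers}`;
  const words = text.toLowerCase().match(/\p{L}+|\d+/gu) ?? [];
  let similarity = 0;
  for (const term of queryTerms) {
    const tf = words.filter((w) => w === term).length;
    if (tf > 0) {
      similarity += bm25Term(
        tf,
        words.length,
        corpus.avgDocLen,
        corpus.docFreq[term] ?? 1,
        corpus.totalDocs
      );
    }
  }

  // The real mixing weights are unknown; 1.0 each is an arbitrary placeholder.
  return 1.0 * frequencyScore + 1.0 * similarity;
}

const score = toyResultScore(
  {
    titles: "Prabhu Deva",
    descriptions: 'es conocido como el "Michael Jackson de la India"',
    identifiers: "Q123456", // placeholder identifier
    extractions: 50000, // made-up count
  },
  ["michael", "jackson"],
  { avgDocLen: 30, docFreq: { michael: 1200, jackson: 900 }, totalDocs: 1000000 }
);

// The UI shows the natural log of the additive score, rounded to two decimals.
console.log(+Math.log(score).toFixed(2));
```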
