
What are the scores in the search results and how are they calculated? #10

iancappelletti opened this issue Oct 27, 2023 · 2 comments


@iancappelletti

Can you explain the score that appears next to each result on any given search results page? The highest value I've seen so far was a 10.1, and one time the values even came back as negative!

Thanks, fantastic tool BTW.

@tfmorris

I was curious myself, so I checked the source code:

{+Math.log(item.score).toFixed(2)}

It's the natural log (log base e or ln) of the resultScore as returned by the Google kgsearch API, which is, in turn, defined simply as "An indicator of how well the entity matched the request constraints" of type "number" where higher is better.

A negative natural log means that the original resultScore was less than 1.

You can't really infer anything from it other than "bigger is better."
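For anyone who wants to check this themselves, here's a minimal sketch that reproduces the displayed value from the API response. The API key is a placeholder; the field names (itemListElement, result, resultScore) follow the public kgsearch response format.

```js
// Minimal sketch: query the Knowledge Graph Search API and print the raw
// resultScore next to the value shown in the UI (natural log, two decimals).
// API_KEY is a placeholder.
const API_KEY = "YOUR_API_KEY";

async function showScores(query) {
  const url =
    "https://kgsearch.googleapis.com/v1/entities:search" +
    `?query=${encodeURIComponent(query)}&key=${API_KEY}&limit=5`;
  const response = await fetch(url);
  const data = await response.json();

  for (const item of data.itemListElement) {
    // resultScore can be below 1, which is why the logged value can be negative.
    const displayed = +Math.log(item.resultScore).toFixed(2);
    console.log(item.result.name, item.resultScore, displayed);
  }
}

showScores("Michael Jackson");
```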

@AngryLoki
Owner

Hi @iancappelletti and @tfmorris, sorry for the slow response.

As you correctly noticed, the score is not fully calculated on the client side. I also don't know the exact algorithm on the GKG side, but I came across a few patents related to this topic.

From https://patentimages.storage.googleapis.com/7e/d3/17/fb60b3a8f7c0a2/US9672251.pdf
The system determines a frequency score for each pattern (step 602). The frequency score for a given pattern is a function of the total number of extractions produced by applying the pattern, e.g., as described above with reference to FIG. 5. For example, in some implementations, the frequency score is equal to the total number of extractions. In some other implementations, the frequency score is a logarithm, e.g., a base ten or base e logarithm, of the total number of extractions. In some implementations, the system uses the total number of distinct extractions produced by applying the pattern, i.e., by only counting two extractions that are the same as a single extraction produced by applying the pattern.

There is also https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10331706, which describes a "human evaluator", trust coefficients and so on, so I expect that some values are multiplied by hidden coefficients.

Regarding search results, think of the search space as a concatenation of all titles + descriptions + identifiers, indexed with a full-text-search algorithm.

The overall score looks like a mix of a score derived from the number of extractions and a similarity score against the description (a BM25-like algorithm).
For example, if I search for b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d, this substring is contained neither in titles nor in descriptions, so it receives no points from the similarity function. Meanwhile, if you search for Michael Jackson, you will see Prabhu Deva somewhere near the top, which is the result of es conocido como el "Michael Jackson de la India" ("he is known as the 'Michael Jackson of India'") in the description of the relevant es.wikipedia article (it was boosted by the similarity coefficient).

The overall score returned from the Knowledge Graph API looks like an additive score, which sometimes contains huge numbers, so I apply a log function to make it easier to read.
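To make that guess concrete, here is a toy sketch (explicitly not the real GKG algorithm) of what such a mix could look like: a log-of-extraction-count popularity term plus a simplified BM25-style similarity term over the concatenated text fields, with the displayed value being the natural log of the additive total. Every constant, count, and weight below is invented for illustration.

```js
// Toy sketch only, not the real GKG algorithm.

// Simplified single-term BM25 contribution: idf * saturated term frequency.
function bm25Term(tf, docLen, avgDocLen, docFreq, totalDocs, k1 = 1.2, b = 0.75) {
  const idf = Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
  return (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgDocLen)));
}

function toyResultScore(entity, queryTerms, corpus) {
  // Popularity term: log of the extraction count, as in the frequency-score patent.
  const frequencyScore = Math.log(entity.extractions);

  // Similarity term: BM25-like score over titles + descriptions + identifiers.
  const text = `${entity.titles} ${entity.descriptions} ${entity.identifiers}`;
  const words = text.toLowerCase().match(/\p{L}+|\d+/gu) ?? [];
  let similarity = 0;
  for (const term of queryTerms) {
    const tf = words.filter((w) => w === term).length;
    if (tf > 0) {
      similarity += bm25Term(
        tf,
        words.length,
        corpus.avgDocLen,
        corpus.docFreq[term] ?? 1,
        corpus.totalDocs
      );
    }
  }

  // The real mixing weights are unknown; 1.0 each is an arbitrary placeholder.
  return 1.0 * frequencyScore + 1.0 * similarity;
}

const score = toyResultScore(
  {
    titles: "Prabhu Deva",
    descriptions: 'es conocido como el "Michael Jackson de la India"',
    identifiers: "Q123456", // placeholder identifier
    extractions: 50000, // made-up count
  },
  ["michael", "jackson"],
  { avgDocLen: 30, docFreq: { michael: 1200, jackson: 900 }, totalDocs: 1000000 }
);

// The UI shows the natural log of the additive score, rounded to two decimals.
console.log(+Math.log(score).toFixed(2));
```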
