Retrieve the top-𝑘 documents for a given query by maximal inner product search (MIPS) over vectors that have both a dense and a sparse portion. This problem is solved by breaking the search into two smaller MIPS problems:
- Retrieve the top-𝑘' documents from a sparse retrieval system defined over the sparse portion of the vectors
- Retrieve the top-𝑘' documents from a dense retrieval system defined over the dense portion of the vectors
The two candidate sets are then merged, and the top-𝑘 documents are retrieved from the combined (much smaller) set. As 𝑘' grows toward the size of the collection, the final top-𝑘 becomes exact, with the drawback that retrieval becomes much slower.
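
The two-stage procedure above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes brute-force NumPy scoring stands in for the real sparse and dense retrieval systems, and that the merged candidates are re-scored by the full inner product (dense score plus sparse score). The names `top_k` and `hybrid_top_k` are hypothetical.

```python
import numpy as np


def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    k = min(k, scores.shape[0])
    # argpartition finds the k best in linear time; only those k are then sorted.
    idx = np.argpartition(-scores, k - 1)[:k]
    return idx[np.argsort(-scores[idx], kind="stable")]


def hybrid_top_k(q_dense, q_sparse, D_dense, D_sparse, k, k_prime):
    """Approximate top-k by full inner product using two top-k' candidate sets.

    D_dense and D_sparse hold one document per row. Brute-force scoring is an
    assumption made for illustration; a real system would query two indexes.
    """
    dense_scores = D_dense @ q_dense
    sparse_scores = D_sparse @ q_sparse

    # Stage 1: top-k' candidates from each retrieval system.
    candidates = np.union1d(
        top_k(dense_scores, k_prime), top_k(sparse_scores, k_prime)
    )

    # Stage 2: re-score only the merged candidates. The inner product of the
    # concatenated vectors is the sum of the dense and sparse inner products.
    merged = dense_scores[candidates] + sparse_scores[candidates]
    return candidates[top_k(merged, k)]
```

Setting `k_prime` to the number of documents makes the candidate set cover the whole collection, so the result matches exact search. Smaller values trade accuracy for speed.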
The datasets we use are nfcorpus and scifact. The pipeline consists of the following steps:
- Download the chosen dataset using BEIR
- Pre-process the query and document text
- Compute the sparse embeddings using the Elasticsearch implementation of BM25 or our own implementation
- Compute the dense embeddings using Sentence-BERT
- Obtain the ground-truth scores and document ranking at 𝑘 for each query
- Obtain the merged result set from the dense and sparse retrievals at 𝑘'
- Compare the ground-truth results at 𝑘 with the merged results at 𝑘
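
The final comparison step can be sketched as below. The text does not name an evaluation metric, so this sketch assumes overlap between the two top-𝑘 lists (recall of the exact top-𝑘 within the merged top-𝑘). The function names and the dictionary layout (query id mapped to a ranked list of document ids) are hypothetical.

```python
def recall_at_k(ground_truth_ids, retrieved_ids, k):
    """Fraction of the exact top-k documents that also appear in the merged top-k."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    hits = truth.intersection(retrieved_ids[:k])
    return len(hits) / len(truth)


def mean_recall_at_k(ground_truth, merged, k):
    """Average recall@k over all queries.

    Both arguments map a query id to a ranked list of document ids
    (an assumed layout, chosen for illustration).
    """
    if not ground_truth:
        return 0.0
    total = sum(
        recall_at_k(ranked, merged.get(qid, []), k)
        for qid, ranked in ground_truth.items()
    )
    return total / len(ground_truth)
```

A mean recall of 1.0 would indicate that the merged top-𝑘 matches the exact top-𝑘 for every query, which is the expected behavior once 𝑘' is large enough.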