The goal is to build a model that, given a user query, produces a summary of the most relevant information from the large data corpus fed to it.
sus.json contains a dictionary of lists of information (paragraphs) about various topics.
- Using Asymmetric Semantic Search (where the query size and the data_corpus size are different) to find the similarity between the given data and the query.
Semantic Search - The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embedding from your corpus is found.
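As a toy illustration of that idea, the sketch below ranks corpus entries by dot product against a query vector. The three-dimensional vectors and the helper names `dot` and `search` are made up for this example; in the real project the embeddings come from the sentence-embedding model, not from hand-written numbers.

```python
# Toy semantic search: score every corpus entry against the query
# with a dot product and return the best matches.
# NOTE: the vectors below are hand-made placeholders, not real embeddings.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, corpus_vecs, top_k=1):
    """Return (corpus_index, score) pairs for the top_k highest scores."""
    scored = [(dot(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


corpus = [
    "Cats are small domestic animals.",
    "Python is a programming language.",
    "The Nile is a river in Africa.",
]
# One placeholder vector per corpus entry, all in the same vector space.
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 0.8],
]
# Placeholder embedding for a short query such as "Which language should I learn?"
query_vec = [0.1, 0.9, 0.0]

hits = search(query_vec, corpus_vecs, top_k=2)
for idx, score in hits:
    print(f"{score:.2f}  {corpus[idx]}")
```

The short query and the longer corpus paragraphs live in the same space, which is what makes the asymmetric setup work: only the ranking by score matters, not the length of either text.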
- Model Used - msmarco-distilbert-base-dot-prod-v3, which uses dot product to find the similarity.
- Encodings and storing them - FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.
- Summarizer - Used the Hugging Face Pipeline for summarization with its default model (sshleifer/distilbart-cnn-12-6). However, a dedicated summarizer could be implemented to improve efficiency and reduce running time.
- Finally, saving the output to a ".txt" file.
A sample of the produced summary is also given.
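The steps listed above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names, the `top_k` value, the choice of a flat inner-product FAISS index, and the step of joining the retrieved paragraphs before summarizing are all assumptions, as is the exact layout of sus.json (taken here to be `{topic: [paragraph, ...]}`).

```python
import json


def flatten_corpus(data):
    """Turn {topic: [paragraph, ...]} into one flat list of paragraphs.

    Assumes sus.json has this layout; adjust if the file differs.
    """
    return [para for paragraphs in data.values() for para in paragraphs]


def build_index(embeddings):
    """Store embeddings in an exact inner-product (dot-product) FAISS index."""
    import faiss  # heavy dependency, imported only when the index is built

    # IndexFlatIP is an assumption: it matches a dot-product model,
    # but the original project may use a different index type.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index


def answer_query(corpus_path, query, out_path, top_k=5):
    """Retrieve the top_k paragraphs for `query`, summarize them, save to a file."""
    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    with open(corpus_path, encoding="utf-8") as f:
        paragraphs = flatten_corpus(json.load(f))

    # Embed the corpus and the query into the same vector space.
    encoder = SentenceTransformer("msmarco-distilbert-base-dot-prod-v3")
    corpus_emb = encoder.encode(paragraphs, convert_to_numpy=True).astype("float32")
    query_emb = encoder.encode([query], convert_to_numpy=True).astype("float32")

    # Dot-product search for the closest paragraphs.
    index = build_index(corpus_emb)
    _, ids = index.search(query_emb, min(top_k, len(paragraphs)))
    context = " ".join(paragraphs[i] for i in ids[0])

    # Summarize the retrieved text; truncation guards against over-long input.
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    summary = summarizer(context, truncation=True)[0]["summary_text"]

    with open(out_path, "w", encoding="utf-8") as f:
        f.write(summary)
    return summary
```

Calling `answer_query("sus.json", some_query, "summary.txt")` would download both models on first use, so the first run is slow. Building the index once and reusing it across queries would avoid re-encoding the corpus each time.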
Article for reference :