docsearch

docsearch is a Python library for finding similar documents in a large corpus, based on my recent project. It compares documents using either a 2-layer Earth Mover's Distance (my research) or the Jensen-Shannon distance, computed over the documents' latent topic distributions and word embeddings.

I will add more detail on the methodology once my paper is published.
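
For readers unfamiliar with the Jensen-Shannon distance, here is a minimal sketch of how it can be computed between two topic distributions. It is an illustration only and not the library's implementation: the function name `jensen_shannon_distance` and the example vectors are hypothetical, and base-2 logarithms are an assumed choice so that the result lies between 0 and 1.

```python
import numpy as np

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise so each vector sums to 1.
    p = p / p.sum()
    q = q / q.sum()
    # Mixture distribution halfway between p and q.
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute 0 by convention.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(divergence, 0.0)))

# Hypothetical topic distributions for three documents over four topics.
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.65, 0.25, 0.05, 0.05]
doc_c = [0.05, 0.05, 0.20, 0.70]

print(jensen_shannon_distance(doc_a, doc_b))  # small: similar topic mix
print(jensen_shannon_distance(doc_a, doc_c))  # larger: different topic mix
```

A distance of 0 means the two distributions are identical, and 1 means they share no topics at all.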

Installation

pip install docsearch

PyPI project: https://pypi.org/project/docsearch/

Classes

DocSearch() :

(i) __init__() takes 5 optional arguments:

"""
:param n_topics: number of topics (default 100)
:param wv_size: word embedding dimension (default 100)
:param stop_words: stop words list (default list)
:param min_word_freq: minimum word frequency (default 15000)
:param sim_metric: allowed values: ['jenson-shannon', 'emd']
"""

(ii) fit() takes a single argument: the list of documents.

(iii) get_most_similar_documents() takes 2 arguments: query_document and k, the number of similar documents to return.
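
To make the role of k concrete, here is a small sketch of a top-k lookup over distances that are assumed to be already computed. It is not the code inside get_most_similar_documents(): the helper `top_k_most_similar` and the distance values are hypothetical.

```python
import numpy as np

def top_k_most_similar(distances, k):
    """Return the indices of the k smallest distances, closest first."""
    distances = np.asarray(distances, dtype=float)
    # Never ask for more results than there are documents.
    k = min(k, distances.size)
    # A stable sort keeps the original order among tied distances.
    return np.argsort(distances, kind="stable")[:k].tolist()

# Hypothetical distances from one query to five corpus documents.
query_distances = [0.42, 0.10, 0.77, 0.10, 0.35]
print(top_k_most_similar(query_distances, k=3))  # [1, 3, 4]
```

Smaller distances mean more similar documents, so the first index returned is the closest match.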

Usage

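```python
import pandas as pd
# Assumed import path, based on the package name; adjust if DocSearch lives elsewhere.
from docsearch import DocSearch

docsearch = DocSearch()

path = "path/to/dataset.csv"
df = pd.read_csv(path)

# Fit the model on the column that holds the document text.
docsearch.fit(df['text'])

# Query with one document taken from the corpus.
print(docsearch.get_most_similar_documents([str(df.at[100, 'text'])]))
```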