docsearch

docsearch is a Python library for finding similar documents in a large corpus, based on my recent project. It compares documents using either a 2-layer Earth Mover's Distance (from my research) or the Jensen-Shannon distance over latent topic distributions of documents and word embeddings.

I will describe the methodology in more detail once my paper is published.
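
For readers unfamiliar with the Jensen-Shannon distance, here is a minimal sketch of how it can be computed between two topic distributions. It is only an illustration of the metric, not the code docsearch runs internally: the function names `kl_divergence` and `jensen_shannon_distance` and the toy three-topic distributions are assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i is 0 contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution: the element-wise average of p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical topic distributions for three documents (each sums to 1).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
doc_c = [0.6, 0.3, 0.1]

print(jensen_shannon_distance(doc_a, doc_b))  # far apart: larger distance
print(jensen_shannon_distance(doc_a, doc_c))  # similar mix: smaller distance
```

With base-2 logarithms the distance is 0 for identical distributions and 1 for distributions that share no topics, which makes scores easy to compare across queries.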

Installation

pip install docsearch

PyPI project: https://pypi.org/project/docsearch/

Classes

DocSearch():

(i) __init__() takes 5 optional arguments:

:param n_topics: number of topics (default 100)
:param wv_size: word embedding dimension (default 100)
:param stop_words: stop words list (default list)
:param min_word_freq: minimum word frequency (default 15000)
:param sim_metric: allowed values: ['jenson-shannon', 'emd']

(ii) fit() takes a single argument: the list of documents.

(iii) get_most_similar_documents() takes 2 arguments: query_document and the number of similar documents to return (k).
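
To make the role of k concrete, here is a minimal sketch of top-k retrieval over a handful of topic distributions. It is an assumption-laden illustration, not the implementation behind get_most_similar_documents(): the helper names `total_variation` and `most_similar`, the toy corpus, and the use of total variation distance as a simple stand-in metric are all hypothetical.

```python
def total_variation(p, q):
    """Stand-in distance: half the L1 difference between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def most_similar(query, corpus, k, distance=total_variation):
    """Return (index, distance) pairs for the k corpus entries closest to query."""
    scored = [(i, distance(query, doc)) for i, doc in enumerate(corpus)]
    # Smaller distance means more similar, so sort ascending and keep the first k.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical topic distributions for a four-document corpus.
corpus = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
]
query = [0.68, 0.22, 0.10]

top = most_similar(query, corpus, k=2)
for index, dist in top:
    print(index, round(dist, 3))
```

Passing the distance as a parameter mirrors the idea of choosing a metric through sim_metric: the ranking logic stays the same while the notion of closeness changes.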

Usage

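```python
import pandas as pd

# Assumed import path; the original snippet does not show where DocSearch comes from.
from docsearch import DocSearch

docsearch = DocSearch()

path = "path/to/dataset.csv"
df = pd.read_csv(path)

# Fit the model on the column holding the document texts.
docsearch.fit(df['text'])

# Query with the document stored at row label 100.
print(docsearch.get_most_similar_documents([str(df.at[100, 'text'])]))
```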
