Merge branch 'master' of https://github.com/ankushbhatia2/docsearch
Showing 1 changed file with 33 additions and 1 deletion.
@@ -1,2 +1,34 @@
# docsearch
Searching similar documents in a large corpus of documents.

It is a Python library for searching for similar documents in a large corpus, based on my recent project.
It uses a 2-layer Earth Mover's Distance (my research) or a Jensen-Shannon distance over the latent topic distributions of documents and word embeddings.

I'll add further details of the methodology once my paper is published.

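As a rough illustration of the Jensen-Shannon option mentioned above, here is a minimal sketch of the distance between two topic distributions. It is not taken from docsearch: the function names are hypothetical, and using the square root of the base-2 divergence is an assumption about the convention, which the README does not specify.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (assumed convention)."""
    # Mixture distribution: the midpoint of p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical topic distributions for two documents (3 topics each).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
print(jensen_shannon_distance(doc_a, doc_b))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no overlap), which makes distances comparable across document pairs.
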
## Classes
DocSearch() :

(i) __init__() takes 5 optional arguments.
"""
:param n_topics: number of topics (default 100)
:param wv_size: word embedding dimension (default 100)
:param stop_words: stop words list (default list)
:param min_word_freq: minimum word frequency (default 15000)
:param sim_metric: allowed values :['jenson-shannon', 'emd']
"""

(ii) fit() takes a single argument, the list of documents.

(iii) get_most_similar_documents() takes 2 arguments, _viz._ query_document and the number of similar documents to return (k).
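
To illustrate what returning the k most similar documents involves, the sketch below ranks a small corpus by distance to a query and keeps the k closest entries. It is a toy stand-in, not the library's implementation: `most_similar`, `l1_distance`, and the default `k=5` are hypothetical, and the L1 distance merely takes the place of whichever metric `sim_metric` selects.

```python
import heapq


def l1_distance(p, q):
    """Sum of absolute differences; a stand-in for the real metric."""
    return sum(abs(a - b) for a, b in zip(p, q))


def most_similar(query, corpus, distance, k=5):
    """Return (index, distance) pairs for the k corpus entries closest to query."""
    scored = ((distance(query, doc), i) for i, doc in enumerate(corpus))
    # nsmallest keeps only the k best candidates, ordered by distance.
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


# Hypothetical topic distributions for three documents and one query.
corpus = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
query = [0.9, 0.05, 0.05]
print(most_similar(query, corpus, l1_distance, k=2))
```

Passing the distance as a callable mirrors the idea of a switchable similarity metric without assuming anything about how docsearch dispatches on it.
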
## Usage
```python
from docsearch import DocSearch
import pandas as pd

docsearch = DocSearch()
path = "path/to/dataset.csv"
df = pd.read_csv(path)
# Fit the model on the documents in the 'text' column.
docsearch.fit(df['text'])
# Query with one document from the dataset and print the closest matches.
print(docsearch.get_most_similar_documents([str(df.at[100, 'text'])]))
```