
Fast computation of document similarities using optimal transportation distances in Tensorflow


pierrestock/document-distances


How to compute document similarities?

This repo is a GPU-ready Tensorflow implementation of document similarity computation based on optimal transportation distances. The full PDF report (in the report folder) describes the project in much more detail and should answer most questions.

  1. Setup

  2. Playing around

  3. Going further

Setup

If you wish to use this code, you will have to install the following package:

pip install gensim 

The pretrained word2vec model is available here (1.5 GB). You will need it for the toy examples and for computing custom cost matrices.

You can take a look at the IPython notebook toy_examples.ipynb, which contains a very brief description of the algorithm, shows some fun properties of the word2vec metric, and studies the influence of some parameters on a toy example. The toy data files are already in the data folder.

Playing around

The algorithm is implemented in Python 3.5 using Tensorflow 0.11 in the file compute_distance.py. To run experiments.py, which contains all the experiments made during this project, you will have to either:

  • Use the precomputed cost matrix C_most_common_1000_2 and the associated keys keys_most_common_1000_2
  • Compute it yourself with the cost_matrix function defined in compute_cost_matrix.py, choosing the number of words to include and the order of the norm in the embedding space
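To make the second option concrete, here is a minimal NumPy sketch of what such a cost matrix looks like: the pairwise distance, under a norm of a chosen order, between the embedding vectors of the selected words. This is not the code in compute_cost_matrix.py; the function name pairwise_cost_matrix and the toy two-dimensional vectors are assumptions made for illustration. In practice the rows would be word2vec vectors of the most common words.

```python
import numpy as np

def pairwise_cost_matrix(embeddings, order=2):
    """Hypothetical helper: C[i, j] = ||embeddings[i] - embeddings[j]||_order."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # Broadcast to shape (n, n, d), then take the vector norm over the last axis.
    # Note: this materializes n * n * d floats, so a large vocabulary
    # would need to be processed in chunks.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    return np.linalg.norm(diffs, ord=order, axis=-1)

# Toy stand-ins for word vectors (real ones would come from word2vec).
keys = ["king", "queen", "apple"]
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])

C_l2 = pairwise_cost_matrix(vectors, order=2)  # Euclidean costs
C_l1 = pairwise_cost_matrix(vectors, order=1)  # Manhattan costs
```

The result is symmetric with a zero diagonal, and row i corresponds to keys[i], which is why the precomputed matrix ships together with its list of keys.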

The experiments may take time to run (especially on a CPU). I ran them on an NVIDIA K80 using AWS.
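For a feel of what an optimal transportation distance between two documents involves before opening compute_distance.py, the sketch below may help. It assumes entropy-regularized optimal transport solved with Sinkhorn iterations, a common GPU-friendly choice; this README does not state which solver the repository uses, so treat it as an illustration only. The function name sinkhorn_distance and the parameters reg and n_iters are hypothetical, and NumPy is used instead of Tensorflow to keep the example short.

```python
import numpy as np

def sinkhorn_distance(a, b, C, reg=0.1, n_iters=200):
    """Approximate transport cost between histograms a and b under cost matrix C.

    Assumption: entropy-regularized optimal transport via Sinkhorn scaling,
    not necessarily the method used in compute_distance.py.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    C = np.asarray(C, dtype=np.float64)
    K = np.exp(-C / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)         # rescale to match column marginals
        u = a / (K @ v)           # rescale to match row marginals
    P = u[:, None] * K * v[None, :]   # transport plan
    return float(np.sum(P * C)), P

# Two documents as normalized word histograms over a 2-word vocabulary.
doc_a = np.array([0.5, 0.5])
doc_b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])

cost, plan = sinkhorn_distance(doc_a, doc_b, C)
```

Here the two documents have identical histograms, so almost all mass stays in place and the cost is close to zero. A smaller reg gives a closer approximation of the unregularized distance but can underflow in K for large costs.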

Going further

If you are interested in this topic, you can read the full PDF report (in the report folder), which details some theoretical aspects, the methodology, and the experimental results on large datasets. A bibliography is included if you want to dig deeper into the theoretical side.
