Learned Metric Index (LMI) is a machine learning based data structure for fast look-up of approximate nearest neighbors in complex data.

Coda-Research-Group/LearnedMetricIndex

Introduction

Learned Metric Index (LMI) is an index for approximate nearest neighbor search on complex data using machine learning and probability-based navigation.
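
The idea of probability-based navigation can be illustrated with a minimal, NumPy-only sketch. This is not the LMI API: the function names `build_index`, `bucket_probabilities`, and `search` are hypothetical, the k-means partitioning is an assumption made for illustration, and a softmax over centroid distances stands in for the trained ML model that LMI would use to predict bucket probabilities.

```python
import numpy as np


def _sq_dists(data, centroids):
    """Squared Euclidean distances between every row of data and every centroid."""
    return ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)


def build_index(data, n_buckets, n_iter=10, seed=0):
    """Partition data into buckets with a few k-means iterations (illustrative only)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(data), size=n_buckets, replace=False)
    centroids = data[start].copy()
    for _ in range(n_iter):
        labels = _sq_dists(data, centroids).argmin(axis=1)
        for b in range(n_buckets):
            members = data[labels == b]
            if len(members):
                centroids[b] = members.mean(axis=0)
    # Final assignment, so that buckets agree with the returned centroids.
    labels = _sq_dists(data, centroids).argmin(axis=1)
    buckets = [np.flatnonzero(labels == b) for b in range(n_buckets)]
    return centroids, buckets


def bucket_probabilities(query, centroids):
    """Stand-in for the learned model: softmax over negative squared distances."""
    logits = -((centroids - query) ** 2).sum(axis=1)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def search(query, data, centroids, buckets, k=10, n_visit=4):
    """Visit the n_visit most probable buckets and rank their members exactly."""
    probs = bucket_probabilities(query, centroids)
    visited = np.argsort(-probs)[:n_visit]
    candidates = np.concatenate([buckets[b] for b in visited])
    dists = ((data[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]
```

In this sketch, `n_visit` plays the role of a stop condition: visiting fewer buckets makes the search faster but may miss true neighbors, while visiting all buckets reduces to an exact brute-force scan.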

Getting started

For examples of how to index and search a dataset, see the 01_Introduction.ipynb notebook.

Installation

Using virtualenv

# 1) Clone the repo with submodules 
git clone --recursive git@github.com:LearnedMetricIndex/LearnedMetricIndex.git
# 2) Create and activate a new virtual environment
python -m venv lmi-env
source lmi-env/bin/activate
# 3) Install the dependencies
pip install -r requirements-cpu.txt # alternatively requirements-gpu.txt
pip install --editable .

Using docker

Requirements:

  • Docker
  • At least 1.5 GB of disk space for the CPU version and up to 5.5 GB for the GPU version

# 1) Clone the repo with submodules 
git clone --recursive git@github.com:LearnedMetricIndex/LearnedMetricIndex.git
# 2) Build the docker image (CPU version)
docker build -t lmi -f Dockerfile --build-arg version=cpu .
# alternatively: docker build -t lmi -f Dockerfile --build-arg version=gpu .
# 3) Run the docker image
docker run -p 8888:8888 -it lmi bash

Running

# Run JupyterLab, copy the printed URL into your browser, and open 01_Introduction.ipynb
jupyter-lab --ip 0.0.0.0 --no-browser

# Run the search on the 100k data subset, evaluate the results, and plot them.
# Expected run time: ~5-10 minutes
python3 search/search.py && python eval/eval.py && python eval/plot.py res.csv

Performance

LMI consisting of a single ML model

  • Recall: 91.421%
  • Search runtime (for 10k queries): ~220s
  • Build time: 20828s
  • Dataset: LAION1B, 10M subset
  • Hardware used:
    • CPU Intel Xeon Gold 6130
    • 42 GB RAM
    • 1 CPU core
  • Hyperparameters:
    • 120 leaf nodes
    • 200 epochs
    • 1 hidden layer with 512 neurons
    • 0.01 learning rate
    • 4 leaf nodes stop condition
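
To put the reported figures in perspective, the short calculation below derives per-query latency, throughput, and build time in hours. It uses only the numbers listed above; the variable names are illustrative, and the search runtime is the approximate value of 220 s.

```python
# Figures reported above for the 10M subset (search runtime is approximate).
n_queries = 10_000
search_runtime_s = 220
build_time_s = 20_828

ms_per_query = search_runtime_s / n_queries * 1000   # average latency per query
queries_per_second = n_queries / search_runtime_s    # single-core throughput
build_time_h = build_time_s / 3600                   # build time in hours

print(f"{ms_per_query:.0f} ms/query, "
      f"{queries_per_second:.1f} queries/s, "
      f"{build_time_h:.1f} h build")
# prints "22 ms/query, 45.5 queries/s, 5.8 h build"
```

The build time of roughly 5.8 hours accounts for most of the ~6 h total runtime quoted for the 10M subset.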

Hardware requirements

10M:

  • 42 GB RAM
  • 1 CPU core
  • ~6h of runtime (varies depending on the hardware)

Publications

"LMI Proposition" (2021):

M. Antol, J. Ol'ha, T. Slanináková, V. Dohnal: Learned Metric Index—Proposition of learned indexing for unstructured data. Information Systems, 2021 - Elsevier (2021)

"Data-driven LMI" (2021):

T. Slanináková, M. Antol, J. Ol'ha, V. Kaňa, V. Dohnal: Data-driven Learned Metric Index: an Unsupervised Approach. SISAP 2021 - Similarity Search and Applications pp 81-94 (2021)

"LMI in Proteins" (2022):

J. Ol'ha, T. Slanináková, M. Gendiar, M. Antol, V. Dohnal: Learned Indexing in Proteins: Substituting Complex Distance Calculations with Embedding and Clustering Techniques. SISAP 2022 - Similarity Search and Applications pp 274-282 (2022). See also the extended version: Learned Indexing in Proteins: Extended Work on Substituting Complex Distance Calculations with Embedding and Clustering Techniques.

"Reproducible LMI" (2023):

T. Slanináková, M. Antol, J. Ol'ha, V. Kaňa, V. Dohnal, S. Ladra, M. A. Martinez-Prieto: Reproducible experiments with Learned Metric Index Framework. Information Systems, Volume 118, September 2023, 102255 (2023)

"LMI in a large (214M) protein database" (2024):

D. Procházka, T. Slanináková, J. Oľha, A. Rošinec, K. Grešová, M. Jánošová, J. Čillík, J. Porubská, R. Svobodová, V. Dohnal, M. Antol: AlphaFind: discover structure similarity across the proteome in AlphaFold DB. Nucleic Acids Research (2024)

Team

🔎Complex data analysis research group