Learned Metric Index (LMI) is an index for approximate nearest neighbor search on complex data using machine learning and probability-based navigation.
For examples of how to index and search a dataset, see the 01_Introduction.ipynb notebook.
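To give a rough feel for what "probability-based navigation" means, here is a minimal, self-contained sketch. It is not the LMI API: the function names are made up for illustration, and the trained ML model is replaced by a simple stand-in (a softmax over negative distances to bucket centroids). The idea it illustrates is only the general one: a model assigns each bucket a probability of containing the query's neighbors, the most probable buckets are visited, and the objects inside them are ranked exactly.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_buckets(data, centroids):
    """Assign every object to the bucket of its closest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for obj_id, vec in enumerate(data):
        closest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        buckets[closest].append(obj_id)
    return buckets

def bucket_probabilities(query, centroids):
    """Stand-in for the learned model: estimates P(bucket | query)."""
    return softmax([-math.dist(query, c) for c in centroids])

def search(query, data, centroids, buckets, k=3, n_buckets=2):
    """Visit the n_buckets most probable buckets, then rank their objects exactly."""
    probs = bucket_probabilities(query, centroids)
    order = sorted(range(len(centroids)), key=lambda b: probs[b], reverse=True)
    candidates = [obj for b in order[:n_buckets] for obj in buckets[b]]
    candidates.sort(key=lambda obj: math.dist(query, data[obj]))
    return candidates[:k]

random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
centroids = random.sample(data, 8)   # toy "index": 8 buckets
buckets = build_buckets(data, centroids)
result = search((0.5, 0.5), data, centroids, buckets)
print(result)                        # IDs of the approximate nearest neighbors
```

Visiting more buckets (`n_buckets`) trades search time for recall. With all buckets visited, the sketch degenerates into an exact sequential scan.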
# 1) Clone the repo with submodules
git clone --recursive git@github.com:LearnedMetricIndex/LearnedMetricIndex.git
# 2) Create and activate a new virtual environment
python -m venv lmi-env
source lmi-env/bin/activate
# 3) Install the dependencies
pip install -r requirements-cpu.txt # alternatively requirements-gpu.txt
pip install --editable .
Requirements:
- Docker
- At least 1.5 GB of disk space for the CPU version and up to 5.5 GB for the GPU version
# 1) Clone the repo with submodules
git clone --recursive git@github.com:LearnedMetricIndex/LearnedMetricIndex.git
# 2) Build the docker image (CPU version)
docker build -t lmi -f Dockerfile --build-arg version=cpu .
# alternatively: docker build -t lmi -f Dockerfile --build-arg version=gpu .
# 3) Run the docker image
docker run -p 8888:8888 -it lmi bash
# Run JupyterLab, copy the printed URL into your browser, and open 01_Introduction.ipynb
jupyter-lab --ip 0.0.0.0 --no-browser
# Run the search on 100k data subset, evaluate the results and plot them.
# Expected time to run = ~5-10 mins
python3 search/search.py && python eval/eval.py && python eval/plot.py res.csv
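The evaluation step reports recall. As a point of reference, the sketch below shows the usual definition of recall@k for nearest neighbor search: the fraction of the true k nearest neighbors that the index returned, averaged over all queries. This is an illustration only. The function name and the toy data are made up, and `eval/eval.py` may compute its metric differently.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Average fraction of the true k nearest neighbors found per query."""
    if len(retrieved) != len(ground_truth):
        raise ValueError("need one result list per ground-truth list")
    total = 0.0
    for found, truth in zip(retrieved, ground_truth):
        total += len(set(found[:k]) & set(truth[:k])) / k
    return total / len(ground_truth)

# Two toy queries with k = 3 (object IDs are made up).
ground_truth = [[1, 2, 3], [4, 5, 6]]
retrieved = [[1, 2, 9], [4, 5, 6]]
print(f"recall@3 = {recall_at_k(retrieved, ground_truth, 3):.3f}")  # recall@3 = 0.833
```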
LMI consisting of a single ML model:
- Recall: 91.421%
- Search runtime (for 10k queries): ~220s
- Build time: 20828s
- Dataset: LAION1B, 10M subset
- Hardware used:
  - CPU Intel Xeon Gold 6130
  - 42 GB RAM
  - 1 CPU core
- Hyperparameters:
  - 120 leaf nodes
  - 200 epochs
  - 1 hidden layer with 512 neurons
  - 0.01 learning rate
  - 4 leaf nodes stop condition
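To put the reported numbers in perspective, the short calculation below converts them into per-query latency, throughput, and a human-readable build time. The inputs are the figures listed above. Since the search runtime is approximate (~220s), the derived values are approximate too.

```python
SEARCH_SECONDS = 220      # reported search runtime for the whole query batch
NUM_QUERIES = 10_000      # reported number of queries
BUILD_SECONDS = 20_828    # reported build time

latency_ms = SEARCH_SECONDS / NUM_QUERIES * 1000
queries_per_second = NUM_QUERIES / SEARCH_SECONDS

hours, rest = divmod(BUILD_SECONDS, 3600)
minutes, seconds = divmod(rest, 60)

print(f"{latency_ms:.0f} ms per query, {queries_per_second:.1f} queries/s")
print(f"build time: {hours} h {minutes} min {seconds} s")
```

This works out to roughly 22 ms per query (about 45 queries per second) on a single CPU core, and a build time of 5 h 47 min 8 s.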
10M:
- 42 GB RAM
- 1 CPU core
- ~6h of runtime (varies depending on the hardware)
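Before starting a multi-hour run, it can help to check the machine against these requirements. The sketch below is a rough, hypothetical preflight check that is not part of the repository. It assumes a Unix-like system where `os.sysconf` can report physical memory, and it treats 1 GB as 10^9 bytes.

```python
import os

REQUIRED_RAM_GB = 42   # from the requirements listed above
REQUIRED_CORES = 1

def total_ram_gb():
    """Physical memory in GB, or None where sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1e9

def meets_requirements(ram_gb, cores):
    """True when RAM and CPU cores are sufficient for the 10M subset."""
    return (ram_gb is not None
            and ram_gb >= REQUIRED_RAM_GB
            and cores >= REQUIRED_CORES)

ram = total_ram_gb()
cores = os.cpu_count() or 1
print("RAM (GB):", "unknown" if ram is None else round(ram, 1))
print("CPU cores:", cores)
print("Enough for the 10M subset:", meets_requirements(ram, cores))
```

The check looks at installed memory, not free memory, so it is only a lower bar: other processes on the machine still compete for the same RAM.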
"LMI Proposition" (2021):
M. Antol, J. Ol'ha, T. Slanináková, V. Dohnal: Learned Metric Index—Proposition of learned indexing for unstructured data. Information Systems, 2021 - Elsevier (2021)
"Data-driven LMI" (2021):
T. Slanináková, M. Antol, J. Ol'ha, V. Kaňa, V. Dohnal: Data-driven Learned Metric Index: an Unsupervised Approach. SISAP 2021 - Similarity Search and Applications pp 81-94 (2021)
"LMI in Proteins" (2022):
J. Ol'ha, T. Slanináková, M. Gendiar, M. Antol, V. Dohnal: Learned Indexing in Proteins: Extended Work on Substituting Complex Distance Calculations with Embedding and Clustering Techniques, and Learned Indexing in Proteins: Substituting Complex Distance Calculations with Embedding and Clustering Techniques. SISAP 2022 - Similarity Search and Applications pp 274-282 (2022)
"Reproducible LMI" (2023):
T. Slanináková, M. Antol, J. Ol'ha, V. Kaňa, V. Dohnal, S. Ladra, M. A. Martinez-Prieto: Reproducible experiments with Learned Metric Index Framework. Information Systems, Volume 118, September 2023, 102255 (2023)
"LMI in a large (214M) protein database" (2024):
D. Procházka, T. Slanináková, J. Oľha, A. Rošinec, K. Grešová, M. Jánošová, J. Čillík, J. Porubská, R. Svobodová, V. Dohnal, M. Antol: AlphaFind: discover structure similarity across the proteome in AlphaFold DB. Nucleic Acids Research (2024)
🔎Complex data analysis research group
- Terézia Slanináková, Masaryk University
- David Procházka, Masaryk University
- Jaroslav Oľha, Masaryk University
- Matej Antol, Masaryk University
- Vlastislav Dohnal, Masaryk University