You can clone this repo and create your own search engine in a few steps!
You can use this small app to search through a list of popular startups.
- The neural search will read the description and look for similar startups.
- The keyword search will look up your exact term in the description.

To run the app locally, you need:
- Python (v3.11)
- Docker
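
1. Create and activate a virtual environment: `python -m venv .venv`, then `source .venv/bin/activate`
2. Install the dependencies with Poetry: `pip install poetry`, then `poetry install`
3. Download the startup dataset: `wget https://storage.googleapis.com/generall-shared-data/startups_demo.json -P data/`
4. Start the services with Docker Compose: `docker-compose -f docker-compose-local.yaml up`
5. Upload the startup data to Qdrant: `python -m qdrant_demo.init_collection_startups`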
6. Go to http://localhost:8000/
You can add a larger dataset of companies provided by Crunchbase.
For this, you will need to register at https://www.crunchbase.com/ and get an API key.
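1. Download the dataset, replacing `<CRUNCHBASE-API-KEY>` with your key: `wget 'https://api.crunchbase.com/odm/v4/odm.tar.gz?user_key=<CRUNCHBASE-API-KEY>' -O odm.tar.gz`
2. Extract the archive: `tar -xvf odm.tar.gz`
3. Move the organizations file into the data directory: `mv odm/organizations.csv ./data`
4. Upload the Crunchbase data to Qdrant: `python -m qdrant_demo.init_collection_crunchbase`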
Software Stack | |
---|---|
Qdrant | Vector database and a search engine with full-text and semantic capabilities. |
all-MiniLM-L6-v2 | The embedding model that turns startup data into vectors. |
FastEmbed | Qdrant's package that simplifies this vectorization process. |
Frontend in TypeScript | Basic visuals that you see in the deployed application. |
Application Components | |
---|---|
init_collection_startups.py | Uploads document embeddings to a Qdrant collection. |
neural_searcher.py | Defines the semantic search process via vector search and an optional payload filter. |
text_searcher.py | Defines the keyword search process across startup metadata (payload). |
service.py | Sets up the entire FastAPI application. |
config.py | Defines the directories for code, root, data, and static files. |
init_collection_startups.py reads a JSON file containing startup data, restructures the data into a unified schema, and recreates a collection in Qdrant with the specified vector and quantization configurations.
This example turns on Scalar Quantization so that the stored vectors take up less memory.
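
To give a feel for why scalar quantization saves memory, here is a minimal sketch in plain Python. It is not Qdrant's implementation: it only maps each float component to a single byte using the vector's min/max range, which is the general idea behind int8 scalar quantization. The function names `quantize` and `dequantize` and the sample vector are hypothetical.

```python
from array import array


def quantize(vector):
    """Map each float to an integer code in 0..255 using the min/max range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from the one-byte codes."""
    return [lo + c * scale for c in codes]


vector = [0.12, -0.53, 0.98, 0.0, -1.0, 0.47]  # made-up embedding values
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)

float32_bytes = len(array("f", vector).tobytes())  # 4 bytes per component
int8_bytes = len(codes)                            # 1 byte per component
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(float32_bytes, int8_bytes, max_error)
```

The quantized form needs a quarter of the space of float32 values, at the cost of a small rounding error per component (at most half of `scale`).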
A payload index is then created on the specified text field to support text search. Finally, the script uploads the documents and their metadata to the Qdrant collection.
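
The restructuring step described above can be sketched as follows. This is an illustration only: it assumes the input is one JSON object per line, and the field names (`name`, `description`, `city`, `link`) and the helpers `to_unified_schema` and `read_startups` are assumptions, not taken from the repository.

```python
import json


def to_unified_schema(raw):
    """Turn one raw startup record into a (document, metadata) pair.

    The field names used here are assumptions for illustration.
    """
    document = (raw.get("description") or "").strip()
    metadata = {
        "name": raw.get("name", ""),
        "city": raw.get("city", ""),
        "url": raw.get("link", ""),
        "document": document,  # text field used later for keyword search
    }
    return document, metadata


def read_startups(lines):
    """Parse JSON Lines input, skipping blank lines and records without text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        document, metadata = to_unified_schema(json.loads(line))
        if document:
            yield document, metadata


# Made-up sample records, not real dataset entries.
sample = [
    '{"name": "Acme", "description": " Rockets for coyotes. ", '
    '"city": "Phoenix", "link": "https://example.com"}',
    "",
    '{"name": "NoText", "description": null}',
]
records = list(read_startups(sample))
print(records)
```

Because `read_startups` accepts any iterable of lines, an open file handle can be passed in place of the sample list.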
The NeuralSearcher class enables semantic searches. The search method takes a text query and an optional filter, performs a semantic search in the specified collection, and returns the top five results’ metadata.
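
The flow of such a search method (restrict the candidates with an optional filter, rank them by similarity to the query vector, return the metadata of the top five) can be sketched without Qdrant. The toy version below uses brute-force cosine similarity over hand-written vectors as a stand-in; the real class delegates embedding and search to the model and to Qdrant. `toy_search` and the sample points are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_search(query_vector, points, payload_filter=None, limit=5):
    """Return the payloads of the `limit` points most similar to the query.

    `payload_filter` is an optional dict of exact-match conditions.
    """
    conditions = (payload_filter or {}).items()
    candidates = [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in conditions)
    ]
    ranked = sorted(
        candidates,
        key=lambda p: cosine(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["payload"] for p in ranked[:limit]]


# Made-up two-dimensional "embeddings" for illustration.
points = [
    {"vector": [0.9, 0.1], "payload": {"name": "A", "city": "Berlin"}},
    {"vector": [0.1, 0.9], "payload": {"name": "B", "city": "Berlin"}},
    {"vector": [0.8, 0.3], "payload": {"name": "C", "city": "London"}},
]
print(toy_search([1.0, 0.0], points))
print(toy_search([1.0, 0.0], points, payload_filter={"city": "Berlin"}))
```

The filter is applied before ranking, so the limit of five always counts matching results only.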
The TextSearcher class defines text searches. The search method queries the specified text field for matches and returns the top results, while the highlight method wraps matching query terms in HTML tags for emphasis.
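
A highlight step of this kind could look like the sketch below. It assumes case-insensitive matching on whitespace-separated query terms and a `<b>` tag; the repository may use a different tag or matching rule.

```python
import re


def highlight(text, query, tag="b"):
    """Wrap each case-insensitive occurrence of a query term in an HTML tag."""
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return text
    # Try longer terms first so a longer term wins over its own prefix.
    terms.sort(key=len, reverse=True)
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)


print(highlight("Vector search for startup search engines", "search Startup"))
# prints: Vector <b>search</b> for <b>startup</b> <b>search</b> engines
```

The original casing of the matched text is preserved, since the replacement reuses the matched substring rather than the query term.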
service.py initializes both searchers. Its GET endpoint /api/search accepts a text query and a flag that selects either the neural or the text search method.
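
As a rough illustration of calling this endpoint from Python, the sketch below builds the request URL and decodes the response. The query parameter names `q` and `neural`, the base URL, and a JSON response body are assumptions; check service.py for the actual names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(query, neural=True, base_url="http://localhost:8000"):
    """Build the search URL; parameter names `q` and `neural` are assumed."""
    params = urlencode({"q": query, "neural": str(neural).lower()})
    return f"{base_url}/api/search?{params}"


def search(query, neural=True):
    """Call the running demo and decode its response as JSON."""
    with urlopen(build_search_url(query, neural)) as response:
        return json.load(response)


print(build_search_url("cloud security", neural=False))
# prints: http://localhost:8000/api/search?q=cloud+security&neural=false
```

Calling `search("cloud security")` only works while the application is running locally.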
config.py retrieves environment variables for the Qdrant URL, API key, collection name, and embeddings model. It also sets the name of the field used for text data to “document”.
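
A configuration module along these lines might look like the following sketch. The environment variable names and the fallback values are assumptions for illustration; only the “document” field name comes from the description above.

```python
import os


def load_settings(env=os.environ):
    """Collect connection settings from environment variables, with fallbacks.

    Variable names and defaults are hypothetical.
    """
    return {
        "qdrant_url": env.get("QDRANT_URL", "http://localhost:6333"),
        "qdrant_api_key": env.get("QDRANT_API_KEY"),  # None when unset
        "collection_name": env.get("COLLECTION_NAME", "startups"),
        "embeddings_model": env.get("EMBEDDINGS_MODEL", "all-MiniLM-L6-v2"),
        "text_field_name": "document",
    }


# Pass a plain dict instead of os.environ to try it without touching the shell.
settings = load_settings({"COLLECTION_NAME": "demo"})
print(settings["collection_name"], settings["text_field_name"])
```

Reading the values through one function keeps the defaults in a single place and makes the settings easy to override in tests.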