This service provides endpoints for handling and vectorizing S3 objects, and for consuming and populating an OpenSearch index. Its layout is inspired by hexagonal architecture principles.
- service: Contains the logic for accessing third-party services.
- usecase: Contains the business logic layer.
- controller: Contains the Flask API endpoint handlers.
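The layering above can be illustrated with a minimal, framework-free sketch. It is not the project's actual code: `ObjectStore`, `InMemoryObjectStore`, `VectorizeUseCase`, and `vectorize_handler` are hypothetical names, and the "vectorization" is a placeholder that only counts characters. The point is the direction of the dependencies: the controller calls the use case, and the use case depends on a port that a service adapter implements.

```python
from dataclasses import dataclass
from typing import Protocol


class ObjectStore(Protocol):
    """Port: what the use case needs from a storage service."""

    def read_text(self, bucket: str, key: str) -> str: ...


class InMemoryObjectStore:
    """service layer stand-in (hypothetical); a real adapter would wrap boto3."""

    def __init__(self, objects: dict[tuple[str, str], str]) -> None:
        self._objects = objects

    def read_text(self, bucket: str, key: str) -> str:
        return self._objects[(bucket, key)]


@dataclass
class VectorizeUseCase:
    """usecase layer: business logic that knows nothing about Flask or AWS."""

    store: ObjectStore

    def execute(self, bucket: str, key: str) -> dict:
        text = self.store.read_text(bucket, key)
        # Placeholder for the real vectorization step.
        return {"key": key, "characters": len(text)}


def vectorize_handler(usecase: VectorizeUseCase, payload: dict) -> tuple[dict, int]:
    """controller layer: turn a request payload into a use-case call."""
    try:
        bucket, key = payload["bucket"], payload["key"]
    except KeyError as missing:
        return {"error": f"missing field: {missing.args[0]}"}, 400
    return usecase.execute(bucket, key), 200
```

In a Flask application, a route function would typically parse the request JSON and delegate to something like `vectorize_handler`, which keeps the business logic testable without a running server.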
- Python
- Flask
- boto3
- Llama-Index
- Clone the repository
git clone git@github.com:wizeline/clone-vector-search.git
- Create a Python virtual environment (recommended):
python3 -m venv env
source env/bin/activate
- Install Dependencies:
pip install -r requirements.txt
- Set Environment Variables (if applicable) in .env and .flaskenv files:
- Create the OpenSearch index. The application will create the needed mapping.
- To run this service locally, you'll need LocalStack to mock some AWS services.
- Once you have LocalStack installed and running, create a clone-ingestion-messages bucket:
aws --endpoint-url=http://localhost:4566 s3 mb s3://clone-ingestion-messages
- Add the required test files by running:
aws --endpoint-url=http://localhost:4566 s3 cp /path/to/your/file/filename.json s3://clone-ingestion-messages/key/to/file.json
- Start the Flask Server:
flask run
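The LocalStack bucket steps above can also be scripted with boto3 instead of the AWS CLI. The following is a sketch under assumptions, not part of the project: the helper names `make_localstack_s3_client` and `seed_bucket` are hypothetical, the `us-east-1` region is assumed, and the dummy `test` credentials are assumed to be accepted by your LocalStack setup.

```python
from pathlib import Path

LOCALSTACK_ENDPOINT = "http://localhost:4566"  # same endpoint as the CLI commands
BUCKET = "clone-ingestion-messages"


def make_localstack_s3_client():
    """Build an S3 client that talks to LocalStack instead of AWS."""
    # Imported here so seed_bucket can also be exercised with a stub client.
    import boto3

    return boto3.client(
        "s3",
        endpoint_url=LOCALSTACK_ENDPOINT,
        region_name="us-east-1",        # assumed region
        aws_access_key_id="test",       # dummy credentials (assumption)
        aws_secret_access_key="test",
    )


def seed_bucket(client, files: dict[str, Path], bucket: str = BUCKET) -> list[str]:
    """Create the bucket, then upload each local file under its S3 key."""
    client.create_bucket(Bucket=bucket)
    uploaded = []
    for key, path in files.items():
        client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded
```

With LocalStack running, `seed_bucket(make_localstack_s3_client(), {"key/to/file.json": Path("/path/to/your/file/filename.json")})` would mirror the two CLI commands.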
An OpenSearch index is required to run this service. You can create the index with the following mapping:
// PUT /clone-vector-index
{
"aliases": {},
"mappings": {
"properties": {
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"embedding": {
"type": "knn_vector",
"dimension": 384
},
"metadata": {
"properties": {
"_node_content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"_node_type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"doc_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"document_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"file_uuid": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"processed_user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"raw_text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"ref_doc_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"source_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"twin_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"user_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
},
"settings": {
"index": {
"replication": {
"type": "DOCUMENT"
},
"number_of_shards": "1",
"number_of_replicas": "1"
}
}
}
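Because `content` and every `metadata` field in the mapping above share the same text-plus-keyword shape, the request body can also be generated rather than maintained by hand. The sketch below uses only the standard library; `METADATA_FIELDS`, `text_with_keyword`, and `build_index_body` are hypothetical helper names, and sending the body to OpenSearch is deliberately left out.

```python
import json

# Metadata fields taken from the mapping above.
METADATA_FIELDS = [
    "_node_content", "_node_type", "doc_id", "document_id", "file_uuid",
    "processed_user", "raw_text", "ref_doc_id", "source_name", "twin_id",
    "user_name",
]


def text_with_keyword() -> dict:
    """A text field with a keyword sub-field, as used throughout the mapping."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }


def build_index_body(dimension: int = 384) -> dict:
    """Return the index body; dimension must match the embedding model's output size."""
    return {
        "aliases": {},
        "mappings": {
            "properties": {
                "content": text_with_keyword(),
                "embedding": {"type": "knn_vector", "dimension": dimension},
                "metadata": {
                    "properties": {name: text_with_keyword() for name in METADATA_FIELDS}
                },
            }
        },
        "settings": {
            "index": {
                "replication": {"type": "DOCUMENT"},
                "number_of_shards": "1",
                "number_of_replicas": "1",
            }
        },
    }
```

`json.dumps(build_index_body(), indent=2)` produces a body that can be sent with `PUT /clone-vector-index` using whichever HTTP or OpenSearch client you prefer.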
docker compose up --build
Ensure you adhere to the following conventions when working with code in the Clone Vector Search project:
- Relate every commit to a ticket: If a commit message does not reference a ticket, the branch name must contain the related ticket.
- Work on one feature for each PR: Do not crowd unrelated features in one PR.
- Every line of code in your commits must be production-ready: Do not create incomplete, work-in-progress commits.
- Ensure the branching strategy is simple:
- Create a feature branch and then merge it with the main branch.
- Do not create extra branches besides the feature or fix branches to merge with the main branch.
- Remove any feature or fix branches after you merge the changes.