Content Ingestion Service

Overview

👷 This project is a work in progress and a personal playground.

The vision: a service for searching the contents of your documents (EPUB, PDF, or any text file).

There are two main business logic flows:

extracting the content from the user's documents and saving it in a searchable form
enabling the user to search their documents quickly given a query (for example, using full-text search and/or semantic search)
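
To make the two flows concrete, here is a minimal sketch using a toy in-memory inverted index. It is not how the project stores or searches content (Meilisearch and Qdrant play that role); `ToyIndex`, `ingest`, `search` and `tokenize` are hypothetical names chosen for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy in-memory index (hypothetical); stands in for the real search backends.
#[derive(Default)]
pub struct ToyIndex {
    // token -> ids of the documents containing that token
    postings: HashMap<String, BTreeSet<u32>>,
}

/// Lowercase the text and split it on anything that is not alphanumeric.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

impl ToyIndex {
    /// Flow 1: save already-extracted text in a searchable form.
    pub fn ingest(&mut self, doc_id: u32, extracted_text: &str) {
        for token in tokenize(extracted_text) {
            self.postings.entry(token).or_default().insert(doc_id);
        }
    }

    /// Flow 2: return the ids of documents that contain every query token.
    /// An empty query matches nothing.
    pub fn search(&self, query: &str) -> Vec<u32> {
        let mut result: Option<BTreeSet<u32>> = None;
        for token in tokenize(query) {
            let ids = self.postings.get(&token).cloned().unwrap_or_default();
            result = Some(match result {
                None => ids,
                Some(acc) => acc.intersection(&ids).copied().collect(),
            });
        }
        result.unwrap_or_default().into_iter().collect()
    }
}
```

After ingesting two documents, `index.search("pdf rust")` returns only the ids of documents containing both words, in ascending order.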

That said, the project is not currently usable.

Tech

Current infra:

RabbitMQ for the message queue
PostgreSQL for the relational database
MinIO for the S3-compatible object storage
Meilisearch for the full-text search
Qdrant for the vector database

Roadmap

What has been done:

REST gateway service to handle user requests: rest_gateway
services to extract contents: content_ingestion_worker (name needs to change)
service to handle full-text search: fulltext_search_service
service to handle semantic search: embedding_worker (name needs to change)
communication between services through a message broker (RabbitMQ), using messages that represent either queued jobs or RPC requests
authentication based on JWT tokens
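
The two kinds of broker messages mentioned above (queued jobs and RPC requests) could be modeled roughly as follows. This is a sketch, not the project's actual message types: `BrokerMessage` and its fields are hypothetical, although `reply_to` and `correlation_id` mirror the AMQP message properties commonly used for RPC over RabbitMQ.

```rust
/// Hypothetical envelope for what travels through the broker.
#[derive(Debug, Clone, PartialEq)]
pub enum BrokerMessage {
    /// Fire-and-forget work item, e.g. "extract this document".
    Job { routing_key: String, payload: Vec<u8> },
    /// Request whose sender waits for an answer on the `reply_to` queue.
    RpcRequest {
        routing_key: String,
        payload: Vec<u8>,
        reply_to: String,
        correlation_id: String,
    },
}

impl BrokerMessage {
    /// Only RPC requests expect a response from the consumer.
    pub fn expects_reply(&self) -> bool {
        matches!(self, BrokerMessage::RpcRequest { .. })
    }

    /// Where to send the response, and which id to echo back with it.
    pub fn reply_target(&self) -> Option<(&str, &str)> {
        match self {
            BrokerMessage::RpcRequest { reply_to, correlation_id, .. } => {
                Some((reply_to.as_str(), correlation_id.as_str()))
            }
            BrokerMessage::Job { .. } => None,
        }
    }
}
```

A consumer can branch on `expects_reply()` to decide whether to publish a response after handling the message.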

The current work:

Replace RabbitMQ with Kafka (for queued jobs) and gRPC (for RPC requests)
Implement a more Hexagonal/Clean architecture in Rust
Add a diagram explaining the new backend architecture
Rework the semantic search service
Improve content extraction: handle text encodings better and support reading PDFs with OCR (not only from the text encoded in the PDF)
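
One way to picture how a Hexagonal architecture helps with the planned broker swap: the domain code depends on a "port" trait, and each broker gets its own "adapter". The sketch below is an assumption about the general shape, not the project's code; `JobQueue`, `InMemoryQueue` and `request_extraction` are hypothetical names, and the in-memory adapter stands in for a RabbitMQ or Kafka one.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

/// Port: what the domain needs from a job queue, with no broker details.
pub trait JobQueue {
    fn publish(&self, job: String) -> Result<(), String>;
}

/// Adapter used for illustration; a RabbitMQ or Kafka adapter
/// would implement the same trait.
#[derive(Default)]
pub struct InMemoryQueue {
    jobs: RefCell<VecDeque<String>>,
}

impl InMemoryQueue {
    /// Take the oldest queued job, if any.
    pub fn pop(&self) -> Option<String> {
        self.jobs.borrow_mut().pop_front()
    }
}

impl JobQueue for InMemoryQueue {
    fn publish(&self, job: String) -> Result<(), String> {
        self.jobs.borrow_mut().push_back(job);
        Ok(())
    }
}

/// Domain logic: depends only on the port, never on a concrete broker.
pub fn request_extraction<Q: JobQueue>(queue: &Q, document_id: u32) -> Result<(), String> {
    queue.publish(format!("extract:{document_id}"))
}
```

With this shape, replacing the broker means writing a new adapter; `request_extraction` and its tests stay unchanged.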

Configuration

There are several environments depending on where/how you want to deploy the services and workers:

develop: not containerized, locally on your machine
local: containerized, locally on your machine
production: containerized, in production
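
As a rough sketch, the three environments could be represented in Rust as an enum parsed from a string (for example, the value of an environment variable). The `Environment` type and its methods are hypothetical; the project's actual configuration loading may differ.

```rust
use std::str::FromStr;

/// Hypothetical representation of the three deployment environments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Environment {
    Develop,
    Local,
    Production,
}

impl Environment {
    /// `develop` runs directly on the machine; the other two run in containers.
    pub fn is_containerized(self) -> bool {
        !matches!(self, Environment::Develop)
    }

    pub fn as_str(self) -> &'static str {
        match self {
            Environment::Develop => "develop",
            Environment::Local => "local",
            Environment::Production => "production",
        }
    }
}

impl FromStr for Environment {
    type Err = String;

    /// Accepts the environment name in any letter case, ignoring surrounding spaces.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_lowercase().as_str() {
            "develop" => Ok(Environment::Develop),
            "local" => Ok(Environment::Local),
            "production" => Ok(Environment::Production),
            other => Err(format!(
                "unknown environment `{other}`; expected develop, local or production"
            )),
        }
    }
}
```

For example, `"local".parse::<Environment>()` yields `Ok(Environment::Local)`, while an unknown name produces an error listing the accepted values.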

Tests

Integration tests

Triggering integration tests with logs

To run a test with logs enabled (sqlx logs are noisy, so they are limited to errors here to reduce noise):

RUST_LOG="sqlx=error,info" TEST_LOG=enabled cargo test <a_test> | bunyan
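
A test harness can use a variable such as `TEST_LOG` to decide whether log output is shown or discarded. The sketch below assumes that any value of `TEST_LOG` enables logs; this is an assumption, not a description of the project's code, and `test_logs_enabled` and `test_log_writer` are hypothetical helpers.

```rust
use std::io::{self, Write};

/// Decide whether test logs should be shown, given the value of `TEST_LOG`.
/// Assumption: the variable only needs to be set; its value is not inspected.
pub fn test_logs_enabled(test_log: Option<&str>) -> bool {
    test_log.is_some()
}

/// Pick where log lines go: stdout when enabled, otherwise a sink that drops them.
pub fn test_log_writer(test_log: Option<&str>) -> Box<dyn Write> {
    if test_logs_enabled(test_log) {
        Box::new(io::stdout())
    } else {
        Box::new(io::sink())
    }
}
```

In a test setup, the value would typically come from `std::env::var("TEST_LOG").ok()`, passed along with `.as_deref()`.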

Databases for integration tests

For each test, a new database is created to enforce isolation. Each database is named: test_<%Y-%m-%d_%H-%M-%S>_<randomly generated UUID>
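
The naming scheme can be sketched as below. The helpers are hypothetical and take the timestamp and UUID as already-formatted strings (in practice they would come from date/time and UUID libraries). The sketch assumes the UUID is in its 36-character hyphenated form, which gives a 61-byte name, within PostgreSQL's default 63-byte identifier limit. Because the name contains hyphens, it has to be double-quoted in SQL.

```rust
/// PostgreSQL truncates identifiers longer than 63 bytes in a default build.
const MAX_IDENTIFIER_BYTES: usize = 63;

/// Build `test_<timestamp>_<uuid>` from already-formatted parts (hypothetical helper).
pub fn test_database_name(timestamp: &str, uuid: &str) -> String {
    let name = format!("test_{timestamp}_{uuid}");
    debug_assert!(name.len() <= MAX_IDENTIFIER_BYTES, "name would be truncated");
    name
}

/// Quote the name as an SQL identifier: wrap it in double quotes
/// and double any embedded double quote.
pub fn create_database_statement(name: &str) -> String {
    format!("CREATE DATABASE \"{}\";", name.replace('"', "\"\""))
}
```

For instance, with the illustrative values `2024-01-31_13-45-10` and `123e4567-e89b-12d3-a456-426614174000`, the name is `test_2024-01-31_13-45-10_123e4567-e89b-12d3-a456-426614174000`.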

Learning Resources

I learned a lot about building REST backend systems in Rust from Luca Palmieri's book, Zero To Production In Rust.

alexandremgo/content_ingestion_service
