The crates service uses the database dump provided by crates.io and coerces its data model into CHAI's. It's containerized using Docker for easy deployment and consistency. It's also written in Python as a first draft, and uses a lot of the core tools.
To run just the crates service, use the following commands:

```bash
docker compose build crates
docker compose run crates
```
The crates loader goes through the following steps when executed:
- Initialization: The loader starts by initializing the configuration and database connection.
- Fetching: If the `FETCH` flag is set to true, the loader downloads the latest crates data from the configured source (see the sketch after this list).
- Transformation: The downloaded data is transformed into a format compatible with the CHAI database schema.
- Loading: The transformed data is loaded into the database. This includes:
  - Packages
  - Users
  - User Packages
  - URLs
  - Package URLs
  - Versions
  - Dependencies
- Cleanup: After successful loading, temporary files are cleaned up if the `NO_CACHE` flag is set.
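
The Fetching and Cleanup steps are gated by the `FETCH` and `NO_CACHE` flags described further down. The real gating lives in the core fetcher and config code; the snippet below is only a rough sketch of the idea, and the names it uses (`config.fetch`, `config.no_cache`, `fetcher.download`, `fetcher.remove_temp_files`) are assumptions rather than the actual API:

```python
# Rough sketch of the flag gating -- names are illustrative, not the real API.
def maybe_fetch(fetcher, config) -> None:
    if config.fetch:  # FETCH=true: download the latest crates dump
        fetcher.download()
    # otherwise, reuse whatever dump is already on disk


def maybe_cleanup(fetcher, config) -> None:
    if config.no_cache:  # NO_CACHE=true: remove temporary files
        fetcher.remove_temp_files()
```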
The main execution logic is in the `run_pipeline` function in `main.py`:
```python
def run_pipeline(db: DB, config: Config) -> None:
    fetcher = fetch(config)
    transformer = CratesTransformer(config.url_types, config.user_types)
    load(db, transformer, config)
    fetcher.cleanup(config)

    coda = (
        "validate by running "
        + '`psql "postgresql://postgres:s3cr3t@localhost:5435/chai" '
        + '-c "SELECT * FROM load_history;"`'
    )
    logger.log(coda)
```
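
The `load` call above drives the Loading step listed earlier. Its actual implementation lives in the core loader; the following is only a minimal sketch of the shape of that step, assuming a transformer that exposes per-table generators and a `db` client with matching insert helpers (all names here are illustrative):

```python
# Illustrative sketch only -- method names are assumptions, not the real API.
def load(db, transformer, config) -> None:
    db.insert_packages(transformer.packages())
    db.insert_users(transformer.users())
    db.insert_user_packages(transformer.user_packages())
    db.insert_urls(transformer.urls())
    db.insert_package_urls(transformer.package_urls())
    db.insert_versions(transformer.versions())
    db.insert_dependencies(transformer.dependencies())
```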
The crates loader supports several configuration flags:
- `DEBUG`: Enables debug logging when set to true.
- `TEST`: Runs the loader in test mode when set to true, skipping certain data insertions.
- `FETCH`: Determines whether to fetch new data from the source when set to true.
- `FREQUENCY`: Sets how often (in hours) the pipeline should run.
- `NO_CACHE`: When set to true, deletes temporary files after processing.
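
These flags arrive as environment variables and are parsed by the shared config code in the core tools. As a rough illustration of how boolean flags like these are typically read (the helper below is a sketch, not the actual Config implementation):

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    # Treat "true"/"1"/"yes" (case-insensitive) as enabled.
    return os.getenv(name, default).lower() in ("true", "1", "yes")


DEBUG = env_flag("DEBUG")
TEST = env_flag("TEST")
FETCH = env_flag("FETCH", default="true")
NO_CACHE = env_flag("NO_CACHE")
FREQUENCY = int(os.getenv("FREQUENCY", "24"))  # hours between runs
```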
These flags can be set in the `docker-compose.yml` file:
```yaml
crates:
  build:
    context: .
    dockerfile: ./package_managers/crates/Dockerfile
  environment:
    - CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@db:5432/chai
    - PYTHONPATH=/
    - DEBUG=${DEBUG:-false}
    - TEST=${TEST:-false}
    - FETCH=${FETCH:-true}
    - FREQUENCY=${FREQUENCY:-24}
    - NO_CACHE=${NO_CACHE:-false}
```
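
Because each entry uses the `${VAR:-default}` form, a flag can also be overridden from the shell when launching the service, for example: `FETCH=false NO_CACHE=true docker compose run crates`.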
- We're reopening the same files multiple times, which is not efficient (see the sketch after this list):
  - `versions.csv` contains all the `published_by` ids
  - `crates.csv` contains all the urls
- The cache logic in the database client is very complicated and needs a better explanation, though it does work.
- Licenses are non-standardized.
- Warnings about missing users occur because `gh_login` in the source data is non-unique.
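
As a possible direction for the file-reopening point above: read each CSV once and collect every column that is needed in a single pass. The snippet below is only an illustration of that idea, not the current loader code, and the example column names are assumptions.

```python
import csv


def single_pass(path: str, columns: list[str]) -> dict[str, list[str]]:
    # Read the file once and collect every requested column in one pass,
    # instead of reopening the same CSV for each piece of data.
    collected: dict[str, list[str]] = {col: [] for col in columns}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for col in columns:
                collected[col].append(row[col])
    return collected


# e.g. grab ids and published_by from versions.csv in one read
# (column names here are hypothetical):
# data = single_pass("versions.csv", ["id", "published_by"])
```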