CHAI is an attempt at an open-source data pipeline for package managers. The goal is to have a pipeline that can use the data from any package manager and provide a normalized data source for a myriad of use cases.
Use Docker
- Install Docker
- Clone the chai repository (https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
- Using a terminal, navigate to the cloned repository directory
- Run docker compose build to create the latest Docker images
- Then, run docker compose up to launch
Note
This will run CHAI for all package managers. As an example, crates by itself will take over an hour and consume more than 5GB of storage.
Currently, we support:
- crates
- Homebrew
- Debian
- pkgx
You can run a single package manager by running
docker compose up <package_manager>
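For example, to index only crates (assuming the compose service is named after the package manager):
docker compose up crates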
We are planning on supporting NPM, PyPI, and rubygems next.
Specify these as environment variables, e.g. FOO=bar docker compose up:

- ENABLE_SCHEDULER: When true, the pipeline runs on a schedule set by FREQUENCY.
- FREQUENCY: Sets how often (in hours) the pipeline should run.
- TEST: Useful for test runs; enables the test code insertions.
- FETCH: Determines whether to fetch new data or use whatever was saved locally.
- NO_CACHE: When true, deletes temporary files after processing.
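For example, a scheduled daily run that always fetches fresh data (the values here are illustrative):
ENABLE_SCHEDULER=true FREQUENCY=24 FETCH=true docker compose up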
Note
The flag NO_CACHE does not mean that files will not get downloaded to your local storage (specifically, the ./data directory). It only means that we'll delete these temporary files from ./data once we're done processing them. If FETCH is false, the pipeline looks for source data in the cache, so a run with NO_CACHE enabled followed by a run with FETCH set to false will fail.
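To make that interaction concrete, a sketch using the flags above:
# first run: fetches fresh data, then deletes the temporary files from ./data
NO_CACHE=true docker compose up
# second run: fails, because FETCH=false needs the cache that was just deleted
FETCH=false docker compose up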
These arguments are all configurable in the docker-compose.yml file.
CHAI is composed of the following services:

- db: PostgreSQL database for the reduced package data
- alembic: handles migrations
- package_managers: fetches and writes data for each package manager
- api: a simple REST API for reading from the db
- ranker: deduplicates and ranks the packages
Stuff happens. Start over:
rm -rf ./data: removes all the data the fetcher has written.
Our goal is to build a data schema that looks like this:

[data schema diagram]

You can read more about the specific data models in the db's README.
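You can also poke at the schema directly once the pipeline has loaded data; \dt lists the tables (default compose credentials assumed):
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "\dt"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM packages LIMIT 5;"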
Our specific application extracts the dependency graph to understand which pieces of the open-source graph are critical. We also built a simple example that displays SBOM metadata for your repository.
There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
Tip
Help us add the above to the examples folder.
- The database URL is postgresql://postgres:s3cr3t@localhost:5435/chai, and is used as CHAI_DATABASE_URL in the environment. psql $CHAI_DATABASE_URL will connect you to the database.
- If you're orchestrating via Docker, swap localhost for host.docker.internal.
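For example, to connect from the host:
export CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
psql $CHAI_DATABASE_URL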
We use uv to manage dependencies (and sometimes execution). All dependencies are listed in pyproject.toml, under the dependency-groups header. Each group helps us classify the service we're adding a dependency for. For example, if we're adding a new dependency for all the indexers:
uv add --group indexer requests
# use the --all-groups flag to sync your venv for all dependencies
uv sync --all-groups
uv pip compile --group indexer -o core/requirements.txt
The last step writes the updated dependencies to a requirements.txt file, which the Docker containers executing the individual services need in order to build correctly. Each indexer shares the same set of dependencies; that requirements file is generated by uv and maintained at core/requirements.txt.
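As a sketch, the same flow applies to any other group; the api group, the fastapi package, and the output path below are illustrative, not taken from the repo:
# hypothetical: add a dependency to a different group, then regenerate its requirements file
uv add --group api fastapi
uv pip compile --group api -o api/requirements.txt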
Important
DO NOT UPDATE ANY requirements.txt FILES DIRECTLY
uv provides a way to generate them automatically, based on pyproject.toml.
Have an idea for a better way to do this? We're open to input.
export CHAI_DATABASE_URL=postgresql://<user>:<pw>@host.docker.internal:<port>/chai
export PGPASSWORD=<pw>
docker compose up alembic
These are tasks that can be run using xcfile.dev. If you use pkgx, typing dev loads the environment. Alternatively, run them manually.
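With the xc task runner installed, a task runs by its name; build (referenced below via Requires: build) is one such task:
xc build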
rm -rf db/data data .venv
docker compose build
Requires: build
docker compose up -d
Inputs: PACKAGE_MANAGER
Env: PYTHONPATH=.
Env: FETCH=false
Env: TEST=true
Env: DEBUG=true
pkgx uv run package_managers/$PACKAGE_MANAGER/main_v2.py
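Run manually, with crates as an illustrative choice of package manager, the environment above applies directly:
FETCH=false TEST=true DEBUG=true PYTHONPATH=. pkgx uv run package_managers/crates/main_v2.py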
Requires: build
Env: TEST=true
Env: DEBUG=true
docker compose up
docker compose down
docker compose logs
Runs migrations and starts up the database
docker compose build --no-cache db alembic
docker compose up alembic -d
Requires: stop
rm -rf db/data
Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic upgrade head
Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic downgrade -$STEP
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
Refreshes table knowledge from the db.
docker compose restart api
docker compose down --remove-orphans
Inputs: SERVICE
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@host.docker.internal:5435/chai
docker compose up $SERVICE --build
Inputs: FOLDER
Env: FOLDER=.
pkgx [email protected] ty check $FOLDER