CHAI is an attempt at an open-source data pipeline for package managers. The goal is to have a pipeline that can use the data from any package manager and provide a normalized data source for a myriad of use cases.
Use Docker
- Install Docker
- Clone the chai repository (https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
- Using a terminal, navigate to the cloned repository directory
- Run docker compose build to create the latest Docker images
- Then, run docker compose up to launch
Note
This will run CHAI for all package managers. As an example, crates by itself will take over an hour and consume more than 5GB of storage.
Currently, we support:
- crates
- Homebrew
- Debian
- pkgx
You can run a single package manager by running
docker compose up <package_manager>
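For example, to index only crates (assuming the compose service is named after the package manager):
docker compose up crates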
We are planning on supporting NPM, PyPI, and rubygems next.
Specify these as environment variables, e.g. FOO=bar docker compose up:

- ENABLE_SCHEDULER: When true, the pipeline runs on a schedule set by FREQUENCY.
- FREQUENCY: Sets how often (in hours) the pipeline should run.
- TEST: Useful for test runs; enables the test code insertions.
- FETCH: Determines whether to fetch new data or use whatever was saved locally.
- NO_CACHE: When true, deletes temporary files after processing.
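For example, a scheduled daily run that always fetches fresh data (the values here are illustrative):
ENABLE_SCHEDULER=true FREQUENCY=24 FETCH=true docker compose up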
Note
The flag NO_CACHE does not mean that files will not get downloaded to your local storage (specifically, the ./data directory). It only means that we'll delete these temporary files from ./data once we're done processing them. If FETCH is false, the pipeline looks for source data in the cache, so a run with NO_CACHE enabled followed by a run with FETCH set to false will fail.
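To make that interaction concrete, a sketch using the flags above:
# first run: fetches fresh data, then deletes the temporary files from ./data
NO_CACHE=true docker compose up
# second run: fails, because FETCH=false needs the cache that was just deleted
FETCH=false docker compose up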
These arguments are all configurable in the docker-compose.yml file.
CHAI is composed of the following services:

- db: PostgreSQL database for the reduced package data
- alembic: handles migrations
- package_managers: fetches and writes data for each package manager
- api: a simple REST API for reading from the db
- ranker: deduplicates and ranks the packages
Stuff happens. Start over:
rm -rf ./data: removes all the data the fetcher has written.
Our goal is to build a data schema that looks like this:

[data schema diagram]

You can read more about the specific data models in the db's README.
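You can also poke at the schema directly once the pipeline has loaded data; \dt lists the tables (default compose credentials assumed):
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "\dt"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM packages LIMIT 5;"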
Our specific application extracts the dependency graph to understand which pieces of the open-source graph are critical. We also built a simple example that displays SBOM metadata for your repository.
There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
Tip
Help us add the above to the examples folder.
- The database URL is postgresql://postgres:s3cr3t@localhost:5435/chai, and is used as CHAI_DATABASE_URL in the environment. psql $CHAI_DATABASE_URL will connect you to the database.
- If you're orchestrating via Docker, swap localhost for host.docker.internal.
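For example, to connect from the host:
export CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
psql $CHAI_DATABASE_URL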
We use uv to manage dependencies (and sometimes execution). All dependencies are listed in pyproject.toml, under the dependency-groups header. Each group helps us classify the service we're adding a dependency for. For example, if we're adding a new dependency for all the indexers:
uv add --group indexer requests
# use the --all-groups flag to sync your venv for all dependencies
uv sync --all-groups
uv pip compile --group indexer -o core/requirements.txt
The last step writes the updated dependencies to a requirements.txt file, which the Docker containers executing the individual services need in order to build correctly. Each indexer shares the same set of dependencies; that requirements file is generated by uv and maintained at core/requirements.txt.
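As a sketch, the same flow applies to any other group; the api group, the fastapi package, and the output path below are illustrative, not taken from the repo:
# hypothetical: add a dependency to a different group, then regenerate its requirements file
uv add --group api fastapi
uv pip compile --group api -o api/requirements.txt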
Important
DO NOT UPDATE ANY requirements.txt FILES DIRECTLY
uv provides a way to generate them automatically, based on pyproject.toml.
Have an idea for a better way to do this? We're open to input.
export CHAI_DATABASE_URL=postgresql://<user>:<pw>@host.docker.internal:<port>/chai
export PGPASSWORD=<pw>
docker compose up alembic
These are tasks that can be run using xcfile.dev. If you use pkgx, typing dev loads the environment. Alternatively, run them manually.
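With the xc task runner installed, a task runs by its name; build (referenced below via Requires: build) is one such task:
xc build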
rm -rf db/data data .venv
docker compose build
Requires: build
docker compose up -d
Inputs: PACKAGE_MANAGER
Env: PYTHONPATH=.
Env: FETCH=false
Env: TEST=true
Env: DEBUG=true
pkgx uv run package_managers/$PACKAGE_MANAGER/main_v2.py
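Run manually, with crates as an illustrative choice of package manager, the environment above applies directly:
FETCH=false TEST=true DEBUG=true PYTHONPATH=. pkgx uv run package_managers/crates/main_v2.py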
Requires: build
Env: TEST=true
Env: DEBUG=true
docker compose up
docker compose down
docker compose logs
Runs migrations and starts up the database
docker compose build --no-cache db alembic
docker compose up alembic -d
Requires: stop
rm -rf db/data
Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic upgrade head
Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic downgrade -$STEP
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
Refreshes table knowledge from the db.
docker compose restart api
docker compose down --remove-orphans
Inputs: SERVICE
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@host.docker.internal:5435/chai
docker compose up $SERVICE --build
Inputs: FOLDER
Env: FOLDER=.
pkgx [email protected] ty check $FOLDER