Code to clean and perform exploratory data analysis on traffic crash datasets provided by the City of Chicago:
- Orchestration and raw data fetching / loading: Prefect
- Raw data transformations: dbt
- Database: PostgreSQL
- Install PostgreSQL.
- Optionally create a dedicated Postgres user and database for this project. (I created a db called `traffic_crashes`, then created a user `crash` with a separate password and granted the user privileges on the database and all tables in it.) For example:
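  A minimal sketch of that setup in `psql`, assuming the names above (the password is a placeholder):

  ```sql
  -- Run as a Postgres superuser; names and password are examples.
  CREATE DATABASE traffic_crashes;
  CREATE USER crash WITH PASSWORD 'change-me';
  GRANT ALL PRIVILEGES ON DATABASE traffic_crashes TO crash;
  -- Then, connected to traffic_crashes, grant access to its tables:
  GRANT ALL ON ALL TABLES IN SCHEMA public TO crash;
  ```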
- Clone this repo.
- Create a Python virtual environment and install requirements:

  ```
  $ cd chicago_traffic_crashes
  $ python -m venv venv
  $ source venv/bin/activate
  $ python -m pip install -r requirements.txt
  ```
- Rename `.env-example` to `.env` and update values.
- Rename `dbt_models/profiles-example.yml` to `profiles.yml` and update values (a sketch of the format follows this list).
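A dbt `profiles.yml` for Postgres generally looks like the sketch below. The top-level profile name here is an assumption and must match the `profile` key in the project's `dbt_project.yml`; the connection values should mirror the database and user created above.

```yaml
# Sketch of dbt_models/profiles.yml; profile name and values are placeholders.
chicago_traffic_crashes:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: crash
      password: change-me
      dbname: traffic_crashes
      schema: public
      threads: 4
```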
To run the code manually:

```
$ cd chicago_traffic_crashes/orchestration
$ python3 main.py
```
This will immediately kick off a run of the full pipeline (a code sketch follows the list):
- Fetch crashes, people, and vehicles CSVs from the Chicago Open Data Portal website
- Load the data from the CSVs into local `[resource]_raw` PostgreSQL tables.
- Run dbt models to transform the raw data and load it into `[resource]_stg` tables.
- Run dbt models to create analytics views using the `[resource]_stg` tables.
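For orientation, here is a minimal sketch of what such a flow might look like with Prefect's `flow`/`task` API. It is hypothetical: the actual `main.py` structure, names, and connection details may differ, and only the crashes dataset is shown.

```python
# Hypothetical sketch of the pipeline flow; not the repo's actual main.py.
import subprocess

import psycopg2
import requests
from prefect import flow, task

# Socrata CSV export endpoint; 85ca-t3if is the portal's public ID for the
# crashes dataset at the time of writing (people/vehicles have their own IDs).
CRASHES_CSV = (
    "https://data.cityofchicago.org/api/views/85ca-t3if/rows.csv"
    "?accessType=DOWNLOAD"
)


@task(retries=3, retry_delay_seconds=60)
def fetch_csv(url: str, dest: str) -> str:
    # Download the CSV export to a local file.
    resp = requests.get(url, timeout=600)
    resp.raise_for_status()
    with open(dest, "wb") as f:
        f.write(resp.content)
    return dest


@task
def load_raw(csv_path: str, table: str) -> None:
    # Bulk-load the CSV into the [resource]_raw table with COPY.
    with psycopg2.connect("dbname=traffic_crashes user=crash") as conn:
        with conn.cursor() as cur, open(csv_path) as f:
            cur.copy_expert(
                f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER)", f
            )


@task
def run_dbt() -> None:
    # Build the _stg models and analytics views.
    subprocess.run(["dbt", "run"], cwd="../dbt_models", check=True)


@flow
def crash_pipeline() -> None:
    path = fetch_csv(CRASHES_CSV, "crashes.csv")
    load_raw(path, "crashes_raw")
    run_dbt()


if __name__ == "__main__":
    crash_pipeline()
```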
To run the pipeline on a schedule, start the Prefect server with `prefect server start`. The default schedule is once a day at midnight Central Time, but it can be adjusted in `deployment.py`.
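For reference, a scheduled deployment might look something like the sketch below, assuming Prefect 2.x's `Deployment.build_from_flow` API and the hypothetical `crash_pipeline` flow from the sketch above (the actual `deployment.py` may differ):

```python
# Hypothetical sketch of deployment.py, assuming Prefect 2.x; the repo's
# actual file and flow/deployment names may differ.
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule

from main import crash_pipeline  # hypothetical flow name from main.py

deployment = Deployment.build_from_flow(
    flow=crash_pipeline,
    name="daily-crash-pipeline",
    # Midnight every day, US Central Time.
    schedule=CronSchedule(cron="0 0 * * *", timezone="America/Chicago"),
)

if __name__ == "__main__":
    deployment.apply()
```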