This is my final project for the DataTalks MLOps Zoomcamp 2024.
The project builds infrastructure for serving an XGBoost model that predicts the price of flight tickets on a number of Indian airlines. The problem itself is a toy example, chosen to keep the modeling simple and keep the focus on the surrounding infrastructure.
The infrastructure is divided into three parts, one for the MLFlow tracking server, one for data monitoring and one for hosting the model as a Sagemaker Endpoint.
Please note that this project creates resources on AWS which may incur some minor cost for the user. The maximum cost for a single day during development was $13. If you do create the resources, please remember to destroy them once you are finished testing.
- An AWS Account
- A Prefect Cloud account (create one for free here)
- Add `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to the GitHub repo if you wish to test the GitHub workflows.
  - How to create an access key and secret
  - Repo -> Settings -> Secrets and variables -> Actions -> New repository secret
The project was developed and tested on Ubuntu 23.10 and 24.04.
The preprocessing and training are monitored/logged using `prefect`. All runs can be reviewed using the Prefect Cloud UI.
All infrastructure is managed by Terraform. If anything fails during the `apply` stage, the issue may resolve itself by running `terraform apply` (or the `make` command) one more time. If the error persists, then something else is the issue.
There are 4 variables which will need to be changed from their current settings. I have highlighted these in bold in the readme below, please read it carefully! The parameters concern:
- The Terraform state bucket name
- `model_id` in Model Registering
- `alarm_subscribers` in Serving
- `mlflow_run_id` in Monitoring
In order to build this repository the following tools are required, in addition to those specified in `pyproject.toml` and Prerequisites. Follow the links for install instructions.
Even if you have slightly different versions installed the code will probably still work, but the versions listed below are those used for developing and building the project locally, and I recommend using the same ones:
Go to the following files and change `mpierrau` to some other unique identifier (S3 bucket names must be globally unique across all AWS accounts):
```hcl
terraform {
  ...
  backend "s3" {
    bucket = "tf-state-flight-price-prediction-mpierrau"
    ...
  }
}
```
Navigate to the root folder (`flight-price-prediction/`), install the project package dependencies, and download the data:

```sh
make setup
make get_data
```
- Authenticate against AWS using `aws configure sso` and then `aws sso login`.
- Run the make command below. It will:
- Create an ECR repo
- Build and upload the Docker container for the MLFlow server app to the ECR repo
- Build the rest of the required infra for the server (IAM roles, ECS service and task, network settings, RDS postgres DB, S3 bucket)
```sh
make build_mlflow_infra
```
This can take up to 15-20 minutes.
We also need some infra for training, tracking and monitoring (an S3 bucket and an ECR repo):

```sh
make build_data_infra
```
Performs preprocessing of the data and creates a train/test split. The features are created on-the-fly in the `sklearn` pipeline (see the sketch below), so there is no need to create a dataset in advance with all features.

```sh
make preprocess_data
```
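For illustration, here is a minimal sketch of the on-the-fly idea. The column names (`airline`, `duration_hours`, `duration_mins`) and the derived feature are hypothetical examples, not the project's actual pipeline:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from xgboost import XGBRegressor

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical on-the-fly feature: total flight duration in minutes.
    df = df.copy()
    df["duration_minutes"] = df["duration_hours"] * 60 + df["duration_mins"]
    return df

pipeline = Pipeline(
    [
        # Feature engineering happens inside the pipeline, at fit/predict time,
        # so the stored dataset only needs the raw columns.
        ("features", FunctionTransformer(add_derived_features)),
        ("encode", ColumnTransformer(
            [("airline", OneHotEncoder(handle_unknown="ignore"), ["airline"])],
            remainder="passthrough",
        )),
        ("model", XGBRegressor()),
    ]
)
# pipeline.fit(X_train, y_train) and pipeline.predict(X_test) work on raw data.
```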
First we do some local hyperparameter tuning on the data. The default is 30 runs, which takes a couple of minutes, depending on your machine. With the given seeds we get a model with a lowest loss of ~2900 rupees. We only log the metadata of these models - no artifacts.

```sh
make train_model_hyperpar_search
```
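As a rough illustration of a metadata-only tuning loop, here is a sketch assuming `hyperopt`; the parameter names, ranges, and placeholder data are examples, not the project's exact code:

```python
import mlflow
import numpy as np
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from hyperopt.pyll import scope
from sklearn.metrics import mean_squared_error

# Placeholder data so the sketch runs; in the project this comes from preprocessing.
rng = np.random.default_rng(42)
X_train, y_train = rng.random((200, 5)), rng.random(200)
X_val, y_val = rng.random((50, 5)), rng.random(50)

def objective(params):
    with mlflow.start_run():
        mlflow.log_params(params)                 # metadata only,
        model = xgb.XGBRegressor(**params)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        mlflow.log_metric("rmse", rmse)           # no model artifacts
    return {"loss": rmse, "status": STATUS_OK}

search_space = {
    "max_depth": scope.int(hp.quniform("max_depth", 3, 12, 1)),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=30, trials=Trials())
```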
Then we locally train and register the 3 best models. The models that are saved are actually pipelines which perform feature engineering and feature selection before the inference step. These all get their artifacts uploaded to S3 via MLFlow. We also upload the training script and feature engineering code for traceability.

```sh
make register_model
```
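Under the hood, registering a pipeline with MLFlow boils down to calls like the following sketch; the tracking URI, file names, and model name are hypothetical placeholders, not the project's exact script:

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

mlflow.set_tracking_uri("http://<your-mlflow-dns>:5000")  # assumption: your server

# Stand-in for the real fitted pipeline from the training step.
pipeline = Pipeline([("model", LinearRegression())]).fit([[0.0], [1.0]], [0, 1])

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(pipeline, artifact_path="model")
    mlflow.log_artifact("train_model.py")          # hypothetical file names,
    mlflow.log_artifact("feature_engineering.py")  # logged for traceability

# The artifacts end up in S3 under '{experiment_id}/{run_id}'.
mlflow.register_model(f"runs:/{run.info.run_id}/model", name="flight-price-model")
```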
Once the models are registered, the model id of the model with the lowest loss will be printed in the terminal. Take the Experiment ID and Run ID and replace the current value of `model_id` in `stg.tfvars` (and `prod`) with the new values as `'{experiment_id}/{run_id}'`.
If you want to head to the MLFlow UI to find another model ID, follow the instructions below.
Run:

```sh
make get_mlflow_info
```
Go to the returned DNS address in a browser and enter the username and password (these were automatically generated during the build process and are stored in AWS SSM).
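If you prefer fetching the credentials programmatically, here is a `boto3` sketch; the SSM parameter names are assumptions, so check the Terraform code or the SSM console for the actual ones:

```python
import boto3

ssm = boto3.client("ssm")

# Parameter names below are hypothetical placeholders.
username = ssm.get_parameter(Name="/mlflow/username")["Parameter"]["Value"]
password = ssm.get_parameter(
    Name="/mlflow/password", WithDecryption=True  # decrypt the SecureString value
)["Parameter"]["Value"]
print(username, password)
```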
Serve the model via a Sagemaker Endpoint and build related Cloudwatch alarms and a subscribable SNS topic. If you wish to add your email to the subscription, append it to the list `alarm_subscribers` in `stg.tfvars` (and `prod`). Then run:

```sh
make build_sagemaker_infra
```
You will receive a confirmation email in which you need to confirm the subscription. Please note that Terraform cannot keep track of whether subscriptions have been confirmed, which may cause issues when destroying this resource if the subscription has not been confirmed. See the documentation for more information.
This can take up to 10 minutes.
Builds an AWS Lambda function which creates an EvidentlyAI report once daily and uploads it to an S3 bucket. The link to the S3 bucket is output as `report_bucket` once this command has successfully completed.
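For orientation, the Lambda's job roughly amounts to the following sketch (written against the Evidently `Report` API of the 0.4.x generation; file, bucket, and key names are assumptions):

```python
import boto3
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Assumed inputs: reference data from training time and recent production data.
reference = pd.read_parquet("reference.parquet")
current = pd.read_parquet("current.parquet")

# Build and render a data drift report.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("/tmp/report.html")

# Upload the rendered report; bucket and key are placeholders.
boto3.client("s3").upload_file("/tmp/report.html", "<report-bucket>", "reports/latest.html")
```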
Here, again, you are required to update `mlflow_run_id` in `infrastructure/monitoring/vars/stg.tfvars` (and `prod`) to the new `{exp_id}/{run_id}` from the training step, before running:

```sh
make build_monitoring_infra
```
This can take up to 10 minutes.
Once it is built you can test the inference using:

```sh
make test_endpoint
```
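If you want to invoke the endpoint yourself instead of via `make`, a `boto3` sketch follows; the endpoint name and payload schema are assumptions (MLflow 2.x scoring servers accept a `dataframe_records` JSON payload):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical sample; match the raw feature columns the pipeline expects.
payload = {"dataframe_records": [
    {"airline": "IndiGo", "duration_hours": 2, "duration_mins": 30}
]}

response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",  # assumption: see the Terraform output
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode())
```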
If you want to test the model locally you can do so, but first you need to update `MLFLOW_MODEL_URI` in `app/src/.envtemplate` to match the bucket name holding the MLFlow artifacts and the experiment and run ID. Then rename the file from `.envtemplate` to `.env` and run:

```sh
make launch_local_app
# Run in a new terminal
make predict_local
```
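What `make predict_local` does is essentially a POST against the local app; here is a sketch, where the port, route, and sample columns are assumptions (check `app/src` for the actual values):

```python
import requests

# Hypothetical sample record; the real columns are those used by the pipeline.
sample = {"airline": "IndiGo", "duration_hours": 2, "duration_mins": 30}

resp = requests.post("http://localhost:8080/predict", json=sample)
print(resp.json())
```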
To destroy the resources run (set `ENV` to what you are using in the `tfvar` files):

```sh
make ENV=stg destroy_all
```
This rule first empties all relevant buckets and ECR repositories and then destroys all created Terraform resources. This can take up to 15 minutes.
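The bucket-emptying step matters because Terraform cannot destroy non-empty (or versioned) buckets. A `boto3` sketch of that step, with a placeholder bucket name:

```python
import boto3

bucket = boto3.resource("s3").Bucket("<bucket-to-empty>")  # assumption: your bucket
bucket.object_versions.delete()  # remove all versions and delete markers
bucket.objects.all().delete()    # remove any remaining unversioned objects
```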
One remaining bug:
- Something is wrong in step `check-endpoint-exist` of the cd-deploy workflow, but I cannot figure out what the issue is right now. However, the resources are still deployed, so it doesn't hinder the application for now, although it will need to be fixed before we can do updates to the endpoint on the fly.
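For reference, one way such an existence check can be written with `boto3` is sketched below; this is illustrative only and not necessarily how the workflow step is implemented:

```python
import boto3
from botocore.exceptions import ClientError

def endpoint_exists(name: str) -> bool:
    try:
        boto3.client("sagemaker").describe_endpoint(EndpointName=name)
        return True
    except ClientError as err:
        # Sagemaker raises ValidationException for unknown endpoint names.
        if err.response["Error"]["Code"] == "ValidationException":
            return False
        raise

print(endpoint_exists("<your-endpoint-name>"))  # endpoint name is a placeholder
```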
Some improvements that I have yet to complete:
- Store EvidentlyAI metrics in AWS RDS and connect to AWS Managed Grafana
- Add MLFlow run id as SSM parameter for easy access
- Add new infrastructure directory for "general" infrastructure that is used in multiple infrastructure subdirectories
- Improve integration tests using localstack
- Store predictions and input features in a new RDS instance
  - Easily added to the Sagemaker Endpoint using DataCapture
- Make better use of prefect for triggering flow runs - not just for "monitoring" and logging
- Add data management/versioning tool (DVC or similar)