This is my final project for the DataTalks MLOps Zoomcamp 2024.
The project builds infrastructure for serving an XGBoost model that predicts the price of flight tickets on a number of Indian airlines. The problem itself is a toy example, chosen to keep the modeling simple and keep the focus on the surrounding infrastructure.
The infrastructure is divided into three parts, one for the MLFlow tracking server, one for data monitoring and one for hosting the model as a Sagemaker Endpoint.
Please note that this project creates resources on AWS which may incur some minor cost for the user. The maximum cost for a single day during development was $13. If you do create the resources, please remember to destroy them once you are finished testing.
- An AWS Account
- A Prefect Cloud account (create one for free here)
- Add `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to the GitHub repo if you wish to test the GitHub workflows.
  - How to create an access key and secret
  - Repo -> Settings -> Secrets and variables -> Actions -> New repository secret
The project was developed and tested on Ubuntu 23.10 and 24.04.
The preprocessing and training are monitored/logged using `prefect`. All runs can be reviewed using the Prefect Cloud UI.
All infrastructure is managed by Terraform. If anything fails during the `apply` stage, the issue may resolve itself by running `terraform apply` (or the `make` command) one more time. If the error persists, then something else is the issue.
There are 4 variables which will need to be changed from their current settings. I have highlighted these in bold in the readme below, please read it carefully! The parameters concern:
- The Terraform state bucket name
- `model_id` in Model Registering
- `alarm_subscribers` in Serving
- `mlflow_run_id` in Monitoring
In order to build this repository the following tools are required, in addition to those specified in `pyproject.toml` and Prerequisites. Follow the links for install instructions.
Even if you have slightly different versions installed the code will probably still work, but the versions listed below are those used for developing and building the project locally, and I recommend using the same ones:
Go to the following files and change `mpierrau` to some other unique identifier (S3 bucket names must be globally unique across all AWS accounts):
```hcl
terraform {
  ...
  backend "s3" {
    bucket = "tf-state-flight-price-prediction-mpierrau"
    ...
  }
}
```
Navigate to the root folder (`flight-price-prediction/`), install the project package dependencies, and download the data:

```sh
make setup
make get_data
```
- Authenticate against AWS using `aws configure sso` and then `aws sso login`.
- Run the make command below. It will:
- Create an ECR repo
- Build and upload the Docker container for the MLFlow server app to the ECR repo
- Build the rest of the required infra for the server (IAM roles, ECS service and task, network settings, RDS postgres DB, S3 bucket)
```sh
make build_mlflow_infra
```
This can take up to 15-20 minutes.
We also need some infra for training, tracking and monitoring (an S3 bucket and an ECR repo):

```sh
make build_data_infra
```
Performs preprocessing of the data and creates a train/test split. The features are created on-the-fly in the `sklearn` pipeline (see the sketch below), so there is no need to create a dataset in advance with all features.

```sh
make preprocess_data
```
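For illustration, here is a minimal sketch of the on-the-fly idea. The column names (`airline`, `duration_hours`, `duration_mins`) and the derived feature are hypothetical examples, not the project's actual pipeline:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from xgboost import XGBRegressor

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical on-the-fly feature: total flight duration in minutes.
    df = df.copy()
    df["duration_minutes"] = df["duration_hours"] * 60 + df["duration_mins"]
    return df

pipeline = Pipeline(
    [
        # Feature engineering happens inside the pipeline, at fit/predict time,
        # so the stored dataset only needs the raw columns.
        ("features", FunctionTransformer(add_derived_features)),
        ("encode", ColumnTransformer(
            [("airline", OneHotEncoder(handle_unknown="ignore"), ["airline"])],
            remainder="passthrough",
        )),
        ("model", XGBRegressor()),
    ]
)
# pipeline.fit(X_train, y_train) and pipeline.predict(X_test) work on raw data.
```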
First we do some local hyperparameter tuning on the data. The default is 30 runs, which takes a couple of minutes, depending on your machine. With the given seeds we get a model with a lowest loss of ~2900 rupees. We only log the metadata of these models - no artifacts.

```sh
make train_model_hyperpar_search
```
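As a rough illustration of a metadata-only tuning loop, here is a sketch assuming `hyperopt`; the parameter names, ranges, and placeholder data are examples, not the project's exact code:

```python
import mlflow
import numpy as np
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from hyperopt.pyll import scope
from sklearn.metrics import mean_squared_error

# Placeholder data so the sketch runs; in the project this comes from preprocessing.
rng = np.random.default_rng(42)
X_train, y_train = rng.random((200, 5)), rng.random(200)
X_val, y_val = rng.random((50, 5)), rng.random(50)

def objective(params):
    with mlflow.start_run():
        mlflow.log_params(params)                 # metadata only,
        model = xgb.XGBRegressor(**params)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        mlflow.log_metric("rmse", rmse)           # no model artifacts
    return {"loss": rmse, "status": STATUS_OK}

search_space = {
    "max_depth": scope.int(hp.quniform("max_depth", 3, 12, 1)),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=30, trials=Trials())
```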
Then we locally train and register the 3 best models. The models that are saved are actually pipelines which perform feature engineering and feature selection before the inference step. These all get their artifacts uploaded to S3 via MLFlow. We also upload the training script and feature engineering code for traceability.

```sh
make register_model
```
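Under the hood, registering a pipeline with MLFlow boils down to calls like the following sketch; the tracking URI, file names, and model name are hypothetical placeholders, not the project's exact script:

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

mlflow.set_tracking_uri("http://<your-mlflow-dns>:5000")  # assumption: your server

# Stand-in for the real fitted pipeline from the training step.
pipeline = Pipeline([("model", LinearRegression())]).fit([[0.0], [1.0]], [0, 1])

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(pipeline, artifact_path="model")
    mlflow.log_artifact("train_model.py")          # hypothetical file names,
    mlflow.log_artifact("feature_engineering.py")  # logged for traceability

# The artifacts end up in S3 under '{experiment_id}/{run_id}'.
mlflow.register_model(f"runs:/{run.info.run_id}/model", name="flight-price-model")
```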
Once the models are registered, the model id of the model with the lowest loss will be printed in the terminal. Take the Experiment ID and Run ID and replace the current value of `model_id` in `stg.tfvars` (and `prod`) with the new values as `'{experiment_id}/{run_id}'`.
If you want to head to the MLFlow UI to find another model ID, follow the instructions below.
Run:

```sh
make get_mlflow_info
```
Go to the returned DNS address in a browser and enter the username and password (these were automatically generated during the build process and are stored in AWS SSM).
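If you prefer fetching the credentials programmatically, here is a `boto3` sketch; the SSM parameter names are assumptions, so check the Terraform code or the SSM console for the actual ones:

```python
import boto3

ssm = boto3.client("ssm")

# Parameter names below are hypothetical placeholders.
username = ssm.get_parameter(Name="/mlflow/username")["Parameter"]["Value"]
password = ssm.get_parameter(
    Name="/mlflow/password", WithDecryption=True  # decrypt the SecureString value
)["Parameter"]["Value"]
print(username, password)
```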
Serve the model via a Sagemaker Endpoint and build related Cloudwatch alarms and a subscribable SNS topic. If you wish to add your email to the subscription, append it to the list `alarm_subscribers` in `stg.tfvars` (and `prod`). Then run:

```sh
make build_sagemaker_infra
```
You will receive a confirmation email in which you need to confirm the subscription. Please note that Terraform cannot keep track of whether subscriptions have been confirmed, which may cause issues when destroying this resource if the subscription has not been confirmed. See the documentation for more information.
This can take up to 10 minutes.
Builds an AWS Lambda function which creates an EvidentlyAI report once daily and uploads it to an S3 bucket. The link to the S3 bucket is output as `report_bucket` once this command has successfully completed.
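For orientation, the Lambda's job roughly amounts to the following sketch (written against the Evidently `Report` API of the 0.4.x generation; file, bucket, and key names are assumptions):

```python
import boto3
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Assumed inputs: reference data from training time and recent production data.
reference = pd.read_parquet("reference.parquet")
current = pd.read_parquet("current.parquet")

# Build and render a data drift report.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("/tmp/report.html")

# Upload the rendered report; bucket and key are placeholders.
boto3.client("s3").upload_file("/tmp/report.html", "<report-bucket>", "reports/latest.html")
```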
Here, again, you are required to update `mlflow_run_id` in `infrastructure/monitoring/vars/stg.tfvars` (and `prod`) to the new `{exp_id}/{run_id}` from the training step, before running:

```sh
make build_monitoring_infra
```
This can take up to 10 minutes.
Once it is built you can test the inference using:

```sh
make test_endpoint
```
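If you want to invoke the endpoint yourself instead of via `make`, a `boto3` sketch follows; the endpoint name and payload schema are assumptions (MLflow 2.x scoring servers accept a `dataframe_records` JSON payload):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical sample; match the raw feature columns the pipeline expects.
payload = {"dataframe_records": [
    {"airline": "IndiGo", "duration_hours": 2, "duration_mins": 30}
]}

response = runtime.invoke_endpoint(
    EndpointName="<your-endpoint-name>",  # assumption: see the Terraform output
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode())
```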
If you want to test the model locally you can do so, but first you need to update `MLFLOW_MODEL_URI` in `app/src/.envtemplate` to match the bucket name holding the MLFlow artifacts and the experiment and run ID. Then rename the file from `.envtemplate` to `.env` and run:

```sh
make launch_local_app
# Run in a new terminal
make predict_local
```
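What `make predict_local` does is essentially a POST against the local app; here is a sketch, where the port, route, and sample columns are assumptions (check `app/src` for the actual values):

```python
import requests

# Hypothetical sample record; the real columns are those used by the pipeline.
sample = {"airline": "IndiGo", "duration_hours": 2, "duration_mins": 30}

resp = requests.post("http://localhost:8080/predict", json=sample)
print(resp.json())
```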
To destroy the resources run (set `ENV` to what you are using in the `tfvar` files):

```sh
make ENV=stg destroy_all
```
This rule first empties all relevant buckets and ECR repositories and then destroys all created Terraform resources. This can take up to 15 minutes.
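The bucket-emptying step matters because Terraform cannot destroy non-empty (or versioned) buckets. A `boto3` sketch of that step, with a placeholder bucket name:

```python
import boto3

bucket = boto3.resource("s3").Bucket("<bucket-to-empty>")  # assumption: your bucket
bucket.object_versions.delete()  # remove all versions and delete markers
bucket.objects.all().delete()    # remove any remaining unversioned objects
```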
One remaining bug:
- Something is wrong in step `check-endpoint-exist` of the cd-deploy workflow, but I cannot figure out what the issue is right now. However, the resources are still deployed, so it doesn't hinder the application for now, although it will need to be fixed before we can do updates to the endpoint on the fly.
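For reference, one way such an existence check can be written with `boto3` is sketched below; this is illustrative only and not necessarily how the workflow step is implemented:

```python
import boto3
from botocore.exceptions import ClientError

def endpoint_exists(name: str) -> bool:
    try:
        boto3.client("sagemaker").describe_endpoint(EndpointName=name)
        return True
    except ClientError as err:
        # Sagemaker raises ValidationException for unknown endpoint names.
        if err.response["Error"]["Code"] == "ValidationException":
            return False
        raise

print(endpoint_exists("<your-endpoint-name>"))  # endpoint name is a placeholder
```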
Some improvements that I have yet to complete:
- Store EvidentlyAI metrics in AWS RDS and connect to AWS Managed Grafana
- Add MLFlow run id as SSM parameter for easy access
- Add new infrastructure directory for "general" infrastructure that is used in multiple infrastructure subdirectories
- Improve integration tests using localstack
- Store predictions and input features in a new RDS instance
  - Easily added to the Sagemaker Endpoint using DataCapture
- Make better use of prefect for triggering flow runs - not just for "monitoring" and logging
- Add data management/versioning tool (DVC or similar)