Predict the Air Quality Index (AQI) from real‑world pollutant data (CO, NO₂, SO₂, O₃, PM2.5, PM10) across multiple global cities using Python & ML.
- 📊 Exploratory Data Analysis – Visualize pollutant distributions & AQI trends.
- 🧹 Data Cleaning – Null checks, duplicates, drop unused columns.
- 🧮 Feature Engineering – One‑hot encode cities; scale numeric features.
- 🤖 Machine Learning Models – Linear Regression baseline vs Random Forest ensemble.
- 📈 Model Evaluation – R², RMSE, MAE comparison.
- 🔮 Custom Prediction – Plug in new pollutant readings and estimate AQI.
Poor air quality affects respiratory health, productivity, and urban planning. An ML model that estimates AQI from pollutant levels helps:
- Citizens track exposure risk.
- City agencies forecast alerts.
- Students learn regression modeling on environmental data.
Rows: 52,560 hourly records
Columns: City, CO, NO2, SO2, O3, PM2.5, PM10, AQI, plus Date (dropped before modeling)
Cities Covered: Brasilia, Cairo, Dubai, London, New York, Sydney
Use: Educational / learning project dataset (bundled locally in repo).
If you later host the dataset separately (e.g., Kaggle), update the link here.
- Python (pandas, numpy)
- Visualization: matplotlib, seaborn
- Modeling: scikit-learn (LinearRegression, RandomForestRegressor, MinMaxScaler, metrics)
- Environment: Jupyter Notebook
- Load CSV → pandas.read_csv()
- Inspect shape, dtypes, nulls
- Drop Date (not modeled)
- Encode City → one-hot columns
- Split train/test
- Scale features → MinMaxScaler
- Train models:
  - Linear Regression (baseline)
  - Random Forest Regressor (ensemble)
- Evaluate → R², RMSE, MAE
- Predict on new samples
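The workflow above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic (the real rows come from Air_Quality_dataset.csv), the AQI formula is made up so the example runs on its own, and the split ratio, seeds, and `n_estimators` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the CSV (assumption: same column names as the dataset).
rng = np.random.default_rng(0)
n = 600
cities = ["Brasilia", "Cairo", "Dubai", "London", "New York", "Sydney"]
df = pd.DataFrame({
    "City": rng.choice(cities, size=n),
    "CO": rng.uniform(0.1, 2.0, n),
    "NO2": rng.uniform(5, 80, n),
    "SO2": rng.uniform(1, 40, n),
    "O3": rng.uniform(10, 120, n),
    "PM2.5": rng.uniform(5, 150, n),
    "PM10": rng.uniform(10, 250, n),
})
# Made-up target so the sketch is runnable; the real AQI column comes from the CSV.
df["AQI"] = (0.6 * df["PM2.5"] + 0.2 * df["PM10"] + 0.1 * df["NO2"]
             + rng.normal(0, 5, n))

# One-hot encode City, then separate features from the target.
X = pd.get_dummies(df.drop(columns="AQI"), columns=["City"], dtype=int)
y = df["AQI"]

# Split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the baseline and the ensemble, then compare R² on the held-out rows.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test_scaled))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Because the synthetic target is linear in the pollutants, the scores printed here say nothing about the real dataset; the point is only the order of the steps.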
| Model | R² | RMSE | MAE | Notes |
|---|---|---|---|---|
| Linear Regression | 0.83 | 10.21 | 7.38 | Baseline |
| Random Forest | 0.86 | 9.37 | 6.33 | ✅ Best Model |
(Metrics from notebook run; will vary by random seed.)
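For readers new to these metrics, the sketch below computes R², RMSE, and MAE by hand in plain Python. The function name `regression_metrics` and the sample AQI values are hypothetical and chosen only for illustration; they are not results from the notebook.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, in AQI units.
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the mean squared error; penalizes large misses more.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R²: 1 minus (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "rmse": rmse, "mae": mae}


# Hypothetical actual vs. predicted AQI values.
actual = [50.0, 80.0, 120.0, 160.0]
predicted = [55.0, 75.0, 130.0, 150.0]
m = regression_metrics(actual, predicted)
print(f"R2={m['r2']:.3f}  RMSE={m['rmse']:.2f}  MAE={m['mae']:.2f}")
```

In practice, scikit-learn's `r2_score`, `mean_squared_error`, and `mean_absolute_error` from `sklearn.metrics` give the same quantities; RMSE is the square root of the mean squared error.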
First, create a public repository named aqi-prediction on GitHub under your account (prachi757). Then run the steps below.
git clone https://github.com/prachi757/aqi-prediction.git
cd aqi-prediction
If you forked this repo instead: replace the URL with your fork (shown on GitHub after you click Fork).
macOS / Linux
python -m venv .venv
source .venv/bin/activate
Windows PowerShell
python -m venv .venv
.\.venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
jupyter notebook Major_Project.ipynb
Run cells top→bottom.
After running the notebook and training the Random Forest model:
# Example new pollutant reading (scaled automatically below)
# Order: CO, NO2, SO2, O3, PM2_5, PM10, Brasilia, Cairo, Dubai, London, New_York, Sydney
new_sample = [[0.7, 45.0, 12.0, 32.0, 58.0, 105, 0, 1, 0, 0, 0, 0]]
# IMPORTANT: Use the *same* scaler fitted on training data
new_sample_scaled = scaler.transform(new_sample)
pred = AQI_Regressor.predict(new_sample_scaled)
print(f"Predicted AQI: {pred[0]:.2f}")
aqi-prediction/
│
├── Major_Project.ipynb # Notebook: EDA + Modeling
├── Air_Quality_dataset.csv # Dataset (hourly pollutant readings)
├── requirements.txt # Environment + install instructions
└── README.md # You are here!
- Add Gradient Boosting / XGBoost
- Include time‑series features from Date (hour, month, season)
- Hyperparameter tuning (GridSearchCV)
- Streamlit mini‑app for live AQI prediction
- Feature importance + SHAP explainability
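As a starting point for the time‑series item above, here is a minimal sketch of deriving hour, month, and season from a timestamp using only the standard library. The `Date` string format, the `time_features` helper, and the meteorological month-to-season mapping are assumptions; check them against the actual column before use. Seasons are flipped for the two Southern Hemisphere cities in the dataset (Brasilia and Sydney).

```python
from datetime import datetime

# Assumption: meteorological seasons by month for the Northern Hemisphere.
NORTHERN_SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
# Southern Hemisphere seasons are offset by half a year.
SOUTHERN_FLIP = {
    "winter": "summer", "summer": "winter",
    "spring": "autumn", "autumn": "spring",
}
SOUTHERN_CITIES = {"Brasilia", "Sydney"}


def time_features(date_str, city, fmt="%Y-%m-%d %H:%M:%S"):
    """Derive hour, month, and season from one timestamp string.

    `fmt` is an assumed format; adjust it to match the real Date column.
    """
    ts = datetime.strptime(date_str, fmt)
    season = NORTHERN_SEASONS[ts.month]
    if city in SOUTHERN_CITIES:
        season = SOUTHERN_FLIP[season]
    return {"hour": ts.hour, "month": ts.month, "season": season}


print(time_features("2024-01-15 08:00:00", "London"))
print(time_features("2024-01-15 08:00:00", "Sydney"))
```

The `season` value is categorical, so it would need the same one-hot treatment as `City` before being fed to the models.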
Prachi Garg
GitHub: prachi757
LinkedIn: Prachi Garg
Email: [email protected]
Educational & portfolio use. Feel free to fork, learn, and extend—please credit the original author.
If it helped you, star the repo and share! 🙌