This Python script (loan_checker.py) implements an end-to-end machine learning pipeline to predict the likelihood of loan default based on borrower data. It reads data from CSV-like files, performs cleaning, visualization, class balancing, feature engineering, trains a Decision Tree classifier, evaluates the model, predicts outcomes for new loan requests, and presents the results in an interactive command-line interface.
The pipeline demonstrates common steps in a data science workflow, including data preprocessing, exploratory visualization, model building, and basic deployment via a CLI.
- Data Loading: Reads borrower data from the specified CSV-like files (credit_risk_train.csv, loan_requests.csv).
- Data Cleaning: Removes records with missing values or unrealistic age entries (>= 90). Reports counts of removed records and missing values per column.
- Data Visualization: Generates plots using matplotlib to explore:
  - Age distribution of defaulters vs. non-defaulters (Histograms).
  - Home ownership status among defaulters vs. non-defaulters (Pie Chart).
- Class Balancing: Addresses class imbalance in the training data by performing simple undersampling of the majority class (non-defaulters).
- Feature Engineering & Selection: Selects specific features (loan_amnt, person_income, cb_person_cred_hist_length) and scales numerical features using StandardScaler.
- Model Training: Trains a DecisionTreeClassifier using scikit-learn on the prepared training data.
- Model Evaluation: Assesses the trained model's performance on a held-out test set using:
  - Accuracy Score.
  - Classification Report (Precision, Recall, F1-score).
  - Confusion Matrix.
- Prediction: Uses the trained model to predict default status for new loan requests from loan_requests.csv.
- Interactive Display: Presents the borrower details and predictions using a custom Carousel class, allowing the user to navigate back and forth through the records via the command line.
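The cleaning and balancing steps in the list above can be sketched in plain Python. This is only an illustration, not the script's actual code: the column name person_age, the 0/1 encoding of loan_status (1 meaning default), and the helper names clean_rows and undersample are all assumptions made for the example.

```python
import random


def clean_rows(rows, max_age=90):
    """Drop rows with any missing value or an age at or above max_age."""
    kept = []
    for row in rows:
        # Treat None and blank strings as missing values.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            continue
        # person_age is an assumed column name for the borrower's age.
        if int(row["person_age"]) >= max_age:
            continue
        kept.append(row)
    return kept


def undersample(rows, label="loan_status", seed=0):
    """Randomly shrink the larger class to the size of the smaller one."""
    positives = [r for r in rows if int(r[label]) == 1]
    negatives = [r for r in rows if int(r[label]) == 0]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return minority + rng.sample(majority, len(minority))


# Toy records standing in for rows read from credit_risk_train.csv.
rows = [
    {"person_age": "25", "person_income": "40000", "loan_status": "0"},
    {"person_age": "31", "person_income": "52000", "loan_status": "0"},
    {"person_age": "44", "person_income": "", "loan_status": "0"},        # missing income
    {"person_age": "123", "person_income": "61000", "loan_status": "0"},  # unrealistic age
    {"person_age": "29", "person_income": "38000", "loan_status": "1"},
    {"person_age": "52", "person_income": "75000", "loan_status": "0"},
]

cleaned = clean_rows(rows)
balanced = undersample(cleaned)
print(len(rows) - len(cleaned), "records removed")
print(len(balanced), "records after balancing")
```

Undersampling discards majority-class rows, so it trades training data for balance; on a small dataset that loss can matter.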
This script requires the following files to be present in the same directory:
- credit_risk_train.csv: Contains the historical training data with borrower information and known loan outcomes (loan_status). Expected to be comma-separated with a header row.
- loan_requests.csv: Contains new loan applicant data for prediction. Expected to be comma-separated with a header row similar to the training data (excluding loan_status).
- carousel.py: Contains the definition of the Carousel class used in the interactive display.
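The expected file layout can be checked with a small loader. The sketch below uses the standard csv module and in-memory text in place of the real files; the helper name load_rows and the sample values are hypothetical, and the script's own createDataFrame() may work differently.

```python
import csv
import io


def load_rows(handle, required_columns):
    """Read comma-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


FEATURES = ["loan_amnt", "person_income", "cb_person_cred_hist_length"]

# In-memory stand-ins for credit_risk_train.csv and loan_requests.csv.
train_text = (
    "loan_amnt,person_income,cb_person_cred_hist_length,loan_status\n"
    "5000,42000,3,0\n"
    "12000,30000,7,1\n"
)
request_text = (
    "loan_amnt,person_income,cb_person_cred_hist_length\n"
    "8000,55000,4\n"
)

train_rows = load_rows(io.StringIO(train_text), FEATURES + ["loan_status"])
request_rows = load_rows(io.StringIO(request_text), FEATURES)
print(len(train_rows), "training rows,", len(request_rows), "request rows")
```

To read a real file, pass an open file handle instead, for example open("credit_risk_train.csv", newline="").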
The script also depends on:
- Python 3.x
- matplotlib
- scikit-learn
- The custom carousel.py file.
To run the script:
- Place Files: Ensure loan_checker.py, credit_risk_train.csv, loan_requests.csv, and carousel.py are in the same directory.
- Install Libraries: Open your terminal or command prompt and run:
    pip install matplotlib scikit-learn
    # or
    pip3 install matplotlib scikit-learn
- Navigate: In the terminal or command prompt, navigate to the directory containing all the required files.
- Run the script:
    python loan_checker.py
    # or
    python3 loan_checker.py
- Observe Output:
  - The script will first print logs related to data cleaning, balancing, and model evaluation metrics.
  - Plots generated during the visualization step will be displayed sequentially. You may need to close each plot window to proceed.
  - Predictions for borrowers in loan_requests.csv will be printed.
  - You will be prompted to press Enter to start the interactive carousel display.
- Interact with Carousel:
  - Use 1 to move to the next borrower, 2 to move to the previous borrower, and 0 to exit the carousel interface.
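The 1 / 2 / 0 navigation can be illustrated with a minimal stand-in for the carousel. The real Carousel class lives in carousel.py and is not shown here, so the class SimpleCarousel, its method names, the helper handle_choice, and the wrap-around behavior at either end are all assumptions made for this sketch.

```python
class SimpleCarousel:
    """Hypothetical stand-in for the Carousel class in carousel.py."""

    def __init__(self):
        self._items = []
        self._index = 0

    def add(self, item):
        self._items.append(item)

    def current(self):
        return self._items[self._index]

    def next(self):
        # Wrap from the last record back to the first (assumed behavior).
        self._index = (self._index + 1) % len(self._items)
        return self.current()

    def previous(self):
        # Wrap from the first record back to the last (assumed behavior).
        self._index = (self._index - 1) % len(self._items)
        return self.current()


def handle_choice(carousel, choice):
    """Map a menu choice to a carousel action; None signals exit."""
    if choice == "1":
        return carousel.next()
    if choice == "2":
        return carousel.previous()
    if choice == "0":
        return None
    return carousel.current()  # unrecognized input: stay on the same record


carousel = SimpleCarousel()
for name in ["Borrower A", "Borrower B", "Borrower C"]:
    carousel.add(name)

# Scripted choices stand in for input() so the sketch runs unattended.
visited = [handle_choice(carousel, c) for c in ["1", "1", "1", "2", "0"]]
print(visited)
```

In an interactive loop, the scripted list would be replaced by repeated input() calls that stop once handle_choice returns None.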
The script is organized into functions responsible for different pipeline stages:
- createDataFrame(): Loads data.
- dataCleaning(): Cleans the training data.
- dataVisualisation(): Generates plots.
- classBalancing(): Undersamples the majority class.
- featureSelection(): Selects and scales features.
- modelTraining(): Trains the Decision Tree.
- modelEvaluation(): Evaluates the model.
- borrowerPrediction(): Predicts on new data and populates the carousel.
- displayBorrower(): Formats and prints the current borrower's info.
- clear(): Clears the console screen.
- interface(): Handles user interaction with the carousel.
- main(): Orchestrates the execution of the entire pipeline.
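As a rough illustration of what the featureSelection(), modelTraining(), and modelEvaluation() stages involve, the sketch below scales three features, fits a DecisionTreeClassifier, and prints the three evaluation outputs. It runs on synthetic data with a made-up labeling rule, and it uses train_test_split to create the held-out set; both are assumptions, since the script's own splitting method and function bodies are not shown here.

```python
import random

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced training data (not real borrower data).
rng = random.Random(42)
X, y = [], []
for _ in range(200):
    loan_amnt = rng.uniform(1_000, 35_000)
    person_income = rng.uniform(15_000, 120_000)
    cred_hist_length = rng.randint(2, 30)
    X.append([loan_amnt, person_income, cred_hist_length])
    # Toy rule (assumption): loans large relative to income count as defaults.
    y.append(1 if loan_amnt / person_income > 0.3 else 0)

# Hold out a quarter of the rows for evaluation (assumed split strategy).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

The same fitted scaler would also need to be applied to the rows from loan_requests.csv before calling model.predict on them, so that new data is on the same scale as the training data.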
MIT License
Andrew Obwocha