This project applies advanced Natural Language Processing (NLP) and machine learning techniques to classify news articles as either fake (Class 0) or real (Class 1). It incorporates traditional ML (Logistic Regression), ensemble learning (XGBoost), neural networks (LSTM), and transformer-based deep learning (RoBERTa-base) models. Recall (Fake) is the success criterion because the project prioritizes correctly identifying fake news articles from their text.
Built with scikit-learn, XGBoost, TensorFlow/Keras, PyTorch, Hugging Face Transformers, Streamlit, FastAPI, and Docker.
| Task | Model Used | Deployment File |
|---|---|---|
| Fake News Classification | Finetuned RoBERTa (RUS) | roberta_rus |
| Interface | FastAPI + Streamlit | main.py and app.py |
The dataset consists of 22,465 news articles with labels indicating whether each article is fake (Class 0) or real (Class 1).
Preprocessing steps included:
- Tokenization and lemmatization using spaCy.
- Stopword removal and punctuation stripping.
- Domain extraction.
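A minimal sketch of how these steps could be implemented with spaCy is shown below; the pipeline name, helper functions, and example strings are illustrative assumptions, not code from this repository:

```python
from urllib.parse import urlparse
import spacy

# Small English pipeline; lemmas and stopword flags come with it.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text: str) -> str:
    """Tokenize, lemmatize, and drop stopwords and punctuation."""
    doc = nlp(text)
    return " ".join(tok.lemma_.lower() for tok in doc
                    if not (tok.is_stop or tok.is_punct or tok.is_space))

def extract_domain(url: str) -> str:
    """Pull the source domain out of an article URL."""
    return urlparse(url).netloc.removeprefix("www.")

print(preprocess("Scientists discovered a new vaccine!"))        # exact lemmas depend on the model
print(extract_domain("https://www.example.com/politics/story"))  # -> example.com
```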
The dataset is mildly imbalanced, with more real articles (11,699) than fake articles (10,766).

The rapid spread of fake news online erodes trust, misinforms the public, and can have serious political, social, and health consequences.
Manual fact-checking is slow and infeasible at scale. The goal is to build an automated, accurate, and interpretable detection system to identify fake news in real time.
*Figure: Number of Sentences Distribution.*

A baseline Logistic Regression model was built on the imbalanced data and with two balancing strategies:
- SMOTE oversampling of the minority (fake) class
- Random Undersampling (RUS) of the majority (real) class

Best configuration per model (a minimal sketch of the RUS-balanced baseline follows this list):
- Logistic Regression: RUS-balanced data, Recall (Fake) = 0.8214
- XGBoost classifier: RUS-balanced data, Recall (Fake) = 0.8427
- LSTM neural network: imbalanced data, Recall (Fake) = 0.7130
- Finetuned RoBERTa-base transformer: RUS-balanced data, Recall (Fake) = 0.8712
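A minimal sketch of the RUS-balanced Logistic Regression baseline, assuming TF-IDF features and a CSV with `text` and `label` columns (the file name, vectorizer settings, and variable names are all illustrative assumptions):

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("news.csv")  # assumed columns: 'text', 'label' (0 = fake, 1 = real)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vectorizer = TfidfVectorizer(max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Randomly drop majority-class (real) samples until both classes are equal.
rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X_train_vec, y_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_bal, y_bal)

# Recall on class 0 (fake) is the project's success criterion.
print(classification_report(y_test, clf.predict(X_test_vec), digits=4))
```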
| MODEL | Recall (Fake) | Recall (Real) | Precision (Fake) | Precision (Real) | AUC | Accuracy |
|---|---|---|---|---|---|---|
| LR-RUS | 0.8214 | 0.9014 | 0.8846 | 0.8458 | 0.930 | 0.8631 |
| XGBoost-RUS | 0.8427 | 0.9054 | 0.8913 | 0.8622 | 0.939 | 0.8754 |
| LSTM-Imbalanced | 0.7130 | 0.9425 | 0.9194 | 0.7811 | 0.867 | 0.8325 |
| RoBERTa-RUS | 0.8712 | 0.8920 | 0.8813 | 0.8827 | 0.951 | 0.8820 |
The finetuned RoBERTa model trained on the RUS-balanced dataset is the best-performing alternative because:
- It achieves the highest Recall (Fake) (0.8712): the most true positives for fake (class 0) predictions (2,814).
- It achieves the highest Precision (Real) (0.8827): the greatest certainty (88.27%) that a class 1 prediction is actually a real article.
- This is corroborated by the confusion matrix, which reports the highest number of correctly identified fake articles (2,814).
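For context, fine-tuning RoBERTa-base on the RUS-balanced split could be set up with the Hugging Face `Trainer` roughly as follows; the hyperparameters, DataFrame names, and the `roberta_rus` output path are assumptions, not the project's exact training script:

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import recall_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# train_df / eval_df: pandas DataFrames with 'text' and 'label' columns;
# the training split is assumed to be RUS-balanced already.
train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
eval_ds = Dataset.from_pandas(eval_df).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # pos_label=0 reports recall on the fake class, the success criterion.
    return {"recall_fake": recall_score(labels, preds, pos_label=0)}

args = TrainingArguments(output_dir="roberta_rus",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds,
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())  # reports recall_fake on the held-out split
trainer.save_model("roberta_rus")
```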
- LIME was used to explain the finetuned RoBERTa-base Transformer model's predictions at the feature level.
- Visualizations highlight the top 10 features supporting the prediction (green = predicted class, red = alternative class).
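A sketch of how LIME can be attached to the finetuned model through a probability function; the local model path and the example article are assumptions:

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta_rus")  # assumed local model dir
model = AutoModelForSequenceClassification.from_pretrained("roberta_rus")
model.eval()

def predict_proba(texts):
    """Return class probabilities [P(fake), P(real)] for a list of texts."""
    enc = tokenizer(list(texts), truncation=True, padding=True,
                    max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["fake", "real"])
article_text = "Scientists confirm miracle cure suppressed by doctors!"  # example input
exp = explainer.explain_instance(article_text, predict_proba, num_features=10)
for token, weight in exp.as_list():
    print(f"{token}: {weight:+.3f}")  # sign shows which class the token pushes toward
```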
The finetuned RoBERTa-base transformer with the RUS-balanced dataset is deployed using FastAPI, Streamlit, and Docker.
- Back-end: The FastAPI app (main.py) is responsible for model inference.
- Front-end: The Streamlit app (app.py) provides the interactive web interface.
- Docker: Containerizes both the front-end and the back-end.
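The repository's `main.py` is not reproduced here; the following is a minimal sketch of what such a FastAPI inference service could look like (the endpoint name, request schema, and model path are assumptions):

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

app = FastAPI(title="Fake News Detector")
tokenizer = AutoTokenizer.from_pretrained("roberta_rus")  # assumed local model dir
model = AutoModelForSequenceClassification.from_pretrained("roberta_rus")
model.eval()

class Article(BaseModel):
    text: str

@app.post("/predict")
def predict(article: Article):
    enc = tokenizer(article.text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    label = int(probs.argmax())
    return {"label": "real" if label == 1 else "fake",
            "confidence": round(float(probs[label]), 4)}
```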
- The finetuned RoBERTa with RUS balancing achieved the highest Recall (Fake) (0.8712) and demonstrated strong capability in detecting fake news articles.
- Its contextual language understanding from pretraining and finetuning using RUS-balanced data improved recall on the minority class without sacrificing generalization.
- Collaborate with Media Outlets: Work alongside news dissemination organizations and social media platforms to incorporate the finetuned RoBERTa model for real-time inference.
- Educate the Public: Raise awareness about fake news detection tools and encourage the public to leverage them for more informed decision-making.
- Expand into Multiple Languages: Train the model on multilingual datasets to improve robustness in catching fake news written and published across multiple languages and dialects.
- Deploy the containerized finetuned RoBERTa model via AWS SageMaker for low-latency inference.
- Develop browser extensions and social media plugins using the model’s API for real-time credibility scoring.
- Incorporate automation frameworks to flag low-confidence predictions for human review.
- Clone the repository
  ```bash
  git clone https://github.com/mwakad/fake-news-detector.git
  ```
- Install dependencies
  ```bash
  cd deployment/backend
  pip install -r requirements.txt
  ```
- Run the FastAPI backend
  ```bash
  uvicorn main:app --reload
  ```
- Run the Streamlit frontend
  ```bash
  streamlit run streamlit_app.py
  ```
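With the backend running, the service can also be exercised directly; a hypothetical request against the `/predict` endpoint sketched above:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",  # default uvicorn address; adjust as needed
    json={"text": "Breaking: celebrity endorses miracle weight-loss pill!"},
)
print(resp.json())  # e.g. {"label": "fake", "confidence": 0.97}
```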


