This project applies advanced Natural Language Processing (NLP) and machine learning techniques to classify news articles as either fake (Class 0) or real (Class 1). It incorporates traditional ML (Logistic Regression), ensemble learning (XGBoost), neural networks (LSTM), and transformer-based deep learning (RoBERTa-base) models. Recall (Fake) is the success criterion because the project prioritizes correctly identifying fake news articles from their text.
Built with scikit-learn, XGBoost, TensorFlow/Keras, PyTorch, Hugging Face Transformers, Streamlit, FastAPI, and Docker.
| Task | Model Used | Deployment File |
|---|---|---|
| Fake News Classification | Finetuned RoBERTa (RUS) | roberta_rus |
| Interface | FastAPI + Streamlit | main.py and app.py |
The dataset consists of 22,465 news articles with labels indicating whether each article is fake (Class 0) or real (Class 1).
Preprocessing steps included:
- Tokenization and lemmatization using spaCy.
- Stopword removal and punctuation stripping.
- Domain extraction.
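A minimal sketch of how these steps could be implemented with spaCy is shown below; the pipeline name, helper functions, and example strings are illustrative assumptions, not code from this repository:

```python
from urllib.parse import urlparse
import spacy

# Small English pipeline; lemmas and stopword flags come with it.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text: str) -> str:
    """Tokenize, lemmatize, and drop stopwords and punctuation."""
    doc = nlp(text)
    return " ".join(tok.lemma_.lower() for tok in doc
                    if not (tok.is_stop or tok.is_punct or tok.is_space))

def extract_domain(url: str) -> str:
    """Pull the source domain out of an article URL."""
    return urlparse(url).netloc.removeprefix("www.")

print(preprocess("Scientists discovered a new vaccine!"))        # exact lemmas depend on the model
print(extract_domain("https://www.example.com/politics/story"))  # -> example.com
```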
The dataset is mildly imbalanced, with more real articles (11,699) than fake articles (10,766).

The rapid spread of fake news online erodes trust, misinforms the public, and can have serious political, social, and health consequences.
Manual fact-checking is slow and infeasible at scale. The goal is to build an automated, accurate, and interpretable detection system to identify fake news in real time.
*Figure: Number of Sentences Distribution.*

A baseline Logistic Regression model was built on the imbalanced data and with two balancing strategies:
- SMOTE oversampling of the minority (fake) class
- Random Undersampling (RUS) of the majority (real) class

Best configuration per model (a minimal sketch of the RUS-balanced baseline follows this list):
- Logistic Regression: RUS-balanced data, Recall (Fake) = 0.8214
- XGBoost classifier: RUS-balanced data, Recall (Fake) = 0.8427
- LSTM neural network: imbalanced data, Recall (Fake) = 0.7130
- Finetuned RoBERTa-base transformer: RUS-balanced data, Recall (Fake) = 0.8712
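A minimal sketch of the RUS-balanced Logistic Regression baseline, assuming TF-IDF features and a CSV with `text` and `label` columns (the file name, vectorizer settings, and variable names are all illustrative assumptions):

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("news.csv")  # assumed columns: 'text', 'label' (0 = fake, 1 = real)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vectorizer = TfidfVectorizer(max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Randomly drop majority-class (real) samples until both classes are equal.
rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X_train_vec, y_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_bal, y_bal)

# Recall on class 0 (fake) is the project's success criterion.
print(classification_report(y_test, clf.predict(X_test_vec), digits=4))
```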
| MODEL | Recall (Fake) | Recall (Real) | Precision (Fake) | Precision (Real) | AUC | Accuracy |
|---|---|---|---|---|---|---|
| LR-RUS | 0.8214 | 0.9014 | 0.8846 | 0.8458 | 0.930 | 0.8631 |
| XGBoost-RUS | 0.8427 | 0.9054 | 0.8913 | 0.8622 | 0.939 | 0.8754 |
| LSTM-Imbalanced | 0.7130 | 0.9425 | 0.9194 | 0.7811 | 0.867 | 0.8325 |
| RoBERTa-RUS | 0.8712 | 0.8920 | 0.8813 | 0.8827 | 0.951 | 0.8820 |
The finetuned RoBERTa model trained on the RUS-balanced dataset is the best-performing alternative because:
- It achieves the highest Recall (Fake) (0.8712): the most true positives for fake (class 0) predictions (2,814).
- It achieves the highest Precision (Real) (0.8827): the greatest certainty (88.27%) that a class 1 prediction is actually a real article.
- This is corroborated by the confusion matrix, which reports the highest number of correctly identified fake articles (2,814).
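For context, fine-tuning RoBERTa-base on the RUS-balanced split could be set up with the Hugging Face `Trainer` roughly as follows; the hyperparameters, DataFrame names, and the `roberta_rus` output path are assumptions, not the project's exact training script:

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import recall_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# train_df / eval_df: pandas DataFrames with 'text' and 'label' columns;
# the training split is assumed to be RUS-balanced already.
train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
eval_ds = Dataset.from_pandas(eval_df).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # pos_label=0 reports recall on the fake class, the success criterion.
    return {"recall_fake": recall_score(labels, preds, pos_label=0)}

args = TrainingArguments(output_dir="roberta_rus",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds,
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())  # reports recall_fake on the held-out split
trainer.save_model("roberta_rus")
```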
- LIME was used to explain the finetuned RoBERTa-base Transformer model's predictions at the feature level.
- Visualizations highlight the top 10 features supporting the prediction (green = predicted class, red = alternative class).
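A sketch of how LIME can be attached to the finetuned model through a probability function; the local model path and the example article are assumptions:

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta_rus")  # assumed local model dir
model = AutoModelForSequenceClassification.from_pretrained("roberta_rus")
model.eval()

def predict_proba(texts):
    """Return class probabilities [P(fake), P(real)] for a list of texts."""
    enc = tokenizer(list(texts), truncation=True, padding=True,
                    max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["fake", "real"])
article_text = "Scientists confirm miracle cure suppressed by doctors!"  # example input
exp = explainer.explain_instance(article_text, predict_proba, num_features=10)
for token, weight in exp.as_list():
    print(f"{token}: {weight:+.3f}")  # sign shows which class the token pushes toward
```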
The finetuned RoBERTa-base transformer with the RUS-balanced dataset is deployed using FastAPI, Streamlit, and Docker.
- Back-end: The FastAPI app (main.py) is responsible for model inference.
- Front-end: The Streamlit app (app.py) provides the interactive web interface.
- Docker: Containerizes both the front-end and the back-end.
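The repository's `main.py` is not reproduced here; the following is a minimal sketch of what such a FastAPI inference service could look like (the endpoint name, request schema, and model path are assumptions):

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

app = FastAPI(title="Fake News Detector")
tokenizer = AutoTokenizer.from_pretrained("roberta_rus")  # assumed local model dir
model = AutoModelForSequenceClassification.from_pretrained("roberta_rus")
model.eval()

class Article(BaseModel):
    text: str

@app.post("/predict")
def predict(article: Article):
    enc = tokenizer(article.text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    label = int(probs.argmax())
    return {"label": "real" if label == 1 else "fake",
            "confidence": round(float(probs[label]), 4)}
```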
- The finetuned RoBERTa with RUS balancing achieved the highest Recall (Fake) (0.8712) and demonstrated strong capability in detecting fake news articles.
- Its contextual language understanding from pretraining and finetuning using RUS-balanced data improved recall on the minority class without sacrificing generalization.
- Collaborate with Media Outlets: Work alongside news dissemination organizations and social media platforms to incorporate the finetuned RoBERTa model for real-time inference.
- Educate the Public: Raise awareness about fake news detection tools and encourage the public to leverage them for more informed decision-making.
- Expand into Multiple Languages: Train the model on multilingual datasets to improve robustness in catching fake news written and published across multiple languages and dialects.
- Deploy the containerized finetuned RoBERTa model via AWS SageMaker for low-latency inference.
- Develop browser extensions and social media plugins using the model’s API for real-time credibility scoring.
- Incorporate automation frameworks to flag low-confidence predictions for human review.
- Clone the repository
  ```bash
  git clone https://github.com/mwakad/fake-news-detector.git
  ```
- Install dependencies
  ```bash
  cd deployment/backend
  pip install -r requirements.txt
  ```
- Run the FastAPI backend
  ```bash
  uvicorn main:app --reload
  ```
- Run the Streamlit frontend
  ```bash
  streamlit run streamlit_app.py
  ```
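With the backend running, the service can also be exercised directly; a hypothetical request against the `/predict` endpoint sketched above:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",  # default uvicorn address; adjust as needed
    json={"text": "Breaking: celebrity endorses miracle weight-loss pill!"},
)
print(resp.json())  # e.g. {"label": "fake", "confidence": 0.97}
```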


