A binary text-classification project that detects whether a news article is real or fake using Natural Language Processing (NLP) techniques and Logistic Regression.
Given a dataset of real and fake news articles, build a model that can classify new/unseen news as either real or fake based on its textual content.
- Source: WELFake_Dataset.csv
- Download Link: WELFake Dataset on Kaggle
- Columns Used: title, text, label
- Target Label: 1 = Real news, 0 = Fake news
- Final Input Feature: title and text combined into a single text field before preprocessing
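As a rough illustration of the combined input field, the sketch below drops rows with a missing title or text and joins the two columns. The column name `content`, the helper `build_content`, and the three sample rows are hypothetical; the notebook itself would read `WELFake_Dataset.csv` with `pd.read_csv` and may combine the columns differently.

```python
import pandas as pd


def build_content(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows missing title or text, then merge both into one field."""
    df = df.dropna(subset=["title", "text"]).copy()
    # "content" is a hypothetical name for the combined text column.
    df["content"] = df["title"] + " " + df["text"]
    return df[["content", "label"]]


# Tiny in-memory stand-in for pd.read_csv("WELFake_Dataset.csv").
sample = pd.DataFrame(
    {
        "title": ["Budget passed", None, "Aliens endorse candidate"],
        "text": ["Parliament approved the budget.", "Orphan body text.", "Sources say so."],
        "label": [1, 1, 0],
    }
)
combined = build_content(sample)
print(combined)
```

The row with the missing title is removed before the merge, so no `NaN` values end up inside the combined text.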
- Import Dependencies
- Load Dataset (WELFake_Dataset.csv)
- Preprocess Data
  - Remove nulls
  - Combine text columns
  - Clean text with regex
  - Remove stopwords & apply stemming
- Feature Engineering
  - TF-IDF Vectorization using unigrams & bigrams
- Train-Test Split
  - Stratified 80/20 split
- Model Training
  - Compared Logistic Regression, Naive Bayes, and Random Forest
  - Selected Logistic Regression for best balance of speed and performance
- Model Evaluation
  - Accuracy, Precision, Recall, F1-score
  - Confusion Matrix (visualized with seaborn)
- Save Model & Vectorizer using pickle
- Custom Prediction
  - In-notebook prediction function
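The core of the workflow above can be sketched end to end on a toy corpus. This is a minimal sketch, not the notebook's code: the ten sentences are made up, `STOPWORDS` is a small stand-in for NLTK's English stopword list (which needs a one-time `nltk.download("stopwords")`), and `clean_text` is a hypothetical helper. The sketch also fits the vectorizer on the training split only, which differs from the step order listed above but avoids leaking test-set vocabulary into the features.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in stopword set (assumption); the project uses NLTK's stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and", "that"}
stemmer = PorterStemmer()


def clean_text(text: str) -> str:
    """Keep letters only, lowercase, drop stopwords, and stem each token."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = [stemmer.stem(tok) for tok in letters_only.split() if tok not in STOPWORDS]
    return " ".join(tokens)


# Toy corpus, made up for illustration: 1 = real, 0 = fake.
real = [
    "Parliament approved the annual budget on Tuesday",
    "The central bank held interest rates steady",
    "City council opens a new public library",
    "Researchers publish results of a vaccine trial",
    "The ministry reports a drop in unemployment",
]
fake = [
    "Shocking secret cure that doctors are hiding",
    "Aliens endorse candidate in leaked video",
    "You will not believe this miracle weight trick",
    "Celebrity clone spotted in secret bunker",
    "Moon landing staged again claims anonymous insider",
]
texts = [clean_text(t) for t in real + fake]
labels = [1] * len(real) + [0] * len(fake)

# Stratified 80/20 split keeps the real/fake ratio equal in both parts.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Unigrams and bigrams; fitted on the training split only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true labels, columns are predicted labels, ordered [fake, real].
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
print(cm)
```

With only ten sentences the predictions themselves mean little; the point is the shape of the pipeline, where the same `clean_text` and fitted `vectorizer` must be reused for any text the model later sees.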
| Metric | Training Set | Test Set |
|---|---|---|
| Accuracy | 95.86% | 94.64% |
| F1-score | 0.96 | 0.95 |
✔️ The gap of about 1.2 percentage points between training and test accuracy indicates strong generalization, with balanced performance on both classes.
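A train-versus-test comparison like the one in the table can be produced with a small helper. The sketch below is an assumption about how such numbers might be computed, not the notebook's code: `train_test_report` is a hypothetical function, the data comes from scikit-learn's synthetic `make_classification` rather than WELFake, and the F1 score uses scikit-learn's default binary averaging with label 1 as the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def train_test_report(model, X_train, y_train, X_test, y_test):
    """Return accuracy and F1 for both splits of an already fitted model."""
    report = {}
    for name, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(X)
        report[name] = {
            "accuracy": accuracy_score(y, pred),
            "f1": f1_score(y, pred),  # binary F1, positive class = 1
        }
    return report


# Synthetic stand-in data; the real project uses TF-IDF features instead.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

report = train_test_report(model, X_train, y_train, X_test, y_test)
gap = report["train"]["accuracy"] - report["test"]["accuracy"]
print(report, f"accuracy gap: {gap:.4f}")
```

A large positive gap would suggest overfitting, while a small one, as in the table, suggests the model behaves similarly on unseen data.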
Enter: "Breaking: President gives major update on national policy."
Output: "Prediction for custom news input: Real"

git clone https://github.com/Toshaksha/fake_news_prediction.git
cd fake_news_prediction
pip install -r requirements.txt

fake-news-detection/
│
├── fake_news_prediction.ipynb # Jupyter notebook (model training)
├── requirements.txt # Python dependencies
├── models/
│ ├── logistic_regression_model.pkl # Saved ML model
│ └── tfidf_vectorizer.pkl # Saved TF-IDF vectorizer
├── images/
│ └── confusion_matrix.jpg # Confusion matrix heatmap
└── README.md # Project documentation
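The two files under `models/` could be written and read back roughly as follows. This is a sketch under assumptions: `save_artifacts`, `load_artifacts`, and `predict_news` are hypothetical helpers, the four training sentences are made up, and a temporary directory stands in for the repository's `models/` folder. The real prediction function presumably also applies the notebook's text cleaning before vectorizing, which is omitted here.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

MODEL_FILE = "logistic_regression_model.pkl"
VECTORIZER_FILE = "tfidf_vectorizer.pkl"


def save_artifacts(model, vectorizer, model_dir: Path) -> None:
    """Pickle the fitted model and vectorizer into model_dir."""
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / MODEL_FILE, "wb") as f:
        pickle.dump(model, f)
    with open(model_dir / VECTORIZER_FILE, "wb") as f:
        pickle.dump(vectorizer, f)


def load_artifacts(model_dir: Path):
    """Load the pickled model and vectorizer from model_dir."""
    with open(model_dir / MODEL_FILE, "rb") as f:
        model = pickle.load(f)
    with open(model_dir / VECTORIZER_FILE, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def predict_news(text: str, model, vectorizer) -> str:
    """Map the model's 1/0 output to the README's Real/Fake labels."""
    label = model.predict(vectorizer.transform([text]))[0]
    return "Real" if label == 1 else "Fake"


# Toy training data, made up for illustration: 1 = real, 0 = fake.
texts = [
    "budget approved by parliament",
    "central bank holds rates",
    "aliens endorse candidate",
    "miracle cure doctors hide",
]
labels = [1, 1, 0, 0]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models"
    save_artifacts(model, vectorizer, model_dir)
    saved_files = sorted(p.name for p in model_dir.iterdir())
    loaded_model, loaded_vectorizer = load_artifacts(model_dir)

result = predict_news("parliament approved the budget", loaded_model, loaded_vectorizer)
print(f"Prediction for custom news input: {result}")
```

Saving the vectorizer alongside the model matters because the classifier only understands the feature columns produced by that exact fitted vocabulary. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.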
- Python 3.x
- NLTK – for stopwords and stemming
- Scikit-learn – ML models and metrics
- Pandas, NumPy – data handling
- Seaborn, Matplotlib – visualization
- tqdm – progress bar for processing
Toshaksha – GitHub Profile
⭐ If you found this project helpful, please give it a star on GitHub!
