
Commit 4bda01e

Email/spam (#200)
* email done
* updated
* Update index.md
* added visualization
1 parent 91b60b7 commit 4bda01e

2 files changed: +116 −132 lines changed

Lines changed: 115 additions & 131 deletions
@@ -1,204 +1,188 @@
+# 🌟 Email Spam Detection
 
-# Email Spam Detection
+<div align="center">
+    <img src="https://github.com/user-attachments/assets/c90bf132-68a6-4155-b191-d2da7e35d0ca" />
+</div>
 
-### AIM
-To develop a machine learning-based system that classifies email content as spam or ham (not spam).
+## 🎯 AIM
+To classify emails as spam or ham using machine learning models, ensuring better email filtering and security.
 
-### DATASET LINK
-[https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification)
+## 📊 DATASET LINK
+[Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification)
 
+## 📚 KAGGLE NOTEBOOK
+[Notebook Link](https://www.kaggle.com/code/thatarguy/email-spam-classifier?kernelSessionId=224262023)
 
-### NOTEBOOK LINK
-[https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection)
+??? Abstract "Kaggle Notebook"
 
+    <iframe src="https://www.kaggle.com/embed/thatarguy/email-spam-classifier?kernelSessionId=224262023" height="800" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="email-spam-classifier"></iframe>
 
-### LIBRARIES NEEDED
+## ⚙️ TECH STACK
 
-??? quote "LIBRARIES USED"
+| **Category**             | **Technologies**                                 |
+|--------------------------|--------------------------------------------------|
+| **Languages**            | Python                                           |
+| **Libraries/Frameworks** | Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
+| **Databases**            | NOT USED                                         |
+| **Tools**                | Kaggle, Jupyter Notebook                         |
+| **Deployment**           | NOT USED                                         |
 
-    - pandas
-    - numpy
-    - scikit-learn
-    - matplotlib
-    - seaborn
-
----
+---
 
-### DESCRIPTION
+## 📝 DESCRIPTION
 !!! info "What is the requirement of the project?"
-    - A robust system to detect spam emails is essential to combat increasing spam content.
-    - It improves user experience by automatically filtering unwanted messages.
-
-??? info "Why is it necessary?"
-    - Spam emails consume resources, time, and may pose security risks like phishing.
-    - Helps organizations and individuals streamline their email communication.
+    - To efficiently classify emails as spam or ham.
+    - To improve email security by filtering out spam messages.
 
 ??? info "How is it beneficial and used?"
-    - Provides a quick and automated solution for spam classification.
-    - Used in email services, IT systems, and anti-spam software to filter messages.
+    - Helps in reducing unwanted spam emails in user inboxes.
+    - Enhances productivity by filtering out irrelevant emails.
+    - Can be integrated into email service providers for automatic filtering.
 
 ??? info "How did you start approaching this project? (Initial thoughts and planning)"
-    - Analyzed the dataset and prepared features.
-    - Implemented various machine learning models for comparison.
+    - Collected and preprocessed the dataset.
+    - Explored various machine learning models.
+    - Evaluated models based on performance metrics.
+    - Visualized results for better understanding.
 
 ??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
-    - Documentation from [scikit-learn](https://scikit-learn.org)
-    - Blog: Introduction to Spam Classification with ML
+    - Scikit-learn documentation.
+    - Various Kaggle notebooks related to spam detection.
 
 ---
 
-### EXPLANATION
+## 🔍 PROJECT EXPLANATION
+
+### 🧩 DATASET OVERVIEW & FEATURE DETAILS
+
+??? example "📂 spam.csv"
 
-#### DETAILS OF THE DIFFERENT FEATURES
-The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham.
+    - The dataset contains the following features:
 
-| Feature              | Description                                      |
-|----------------------|--------------------------------------------------|
-| `word_freq_x`        | Frequency of specific words in the email body    |
-| `capital_run_length` | Length of consecutive capital letters            |
-| `char_freq`          | Frequency of special characters like `;` and `$` |
-| `is_spam`            | Target variable (1 = Spam, 0 = Ham)              |
+    | Feature Name | Description     | Datatype |
+    |--------------|-----------------|:--------:|
+    | Category     | Spam or Ham     | object   |
+    | Text         | Email text      | object   |
+    | Length       | Length of email | int64    |
+
+??? example "🛠 Developed Features from spam.csv"
+
+    | Feature Name | Description       | Reason                  | Datatype |
+    |--------------|-------------------|-------------------------|:--------:|
+    | Length       | Email text length | Helps in spam detection | int64    |
 
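The `Length` column above is a derived feature rather than part of the raw CSV. A minimal sketch of how it might be produced with pandas is shown below; the `spam.csv` filename and the `Category`/`Text` column names come from the tables above, while the 0/1 label encoding and variable names are illustrative assumptions, not the notebook's exact code.

```python
import pandas as pd

# Load the raw dataset (filename taken from the section above; adjust the path as needed).
df = pd.read_csv("spam.csv")

# Keep the two raw columns described above: Category (spam/ham label) and Text (email body).
df = df[["Category", "Text"]]

# Developed feature: length of each email body in characters.
df["Length"] = df["Text"].str.len()

# Encode the target for modelling (assumed convention: ham = 0, spam = 1).
df["Label"] = df["Category"].map({"ham": 0, "spam": 1})

print(df.head())
```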

 ---
 
-#### WHAT I HAVE DONE
+### 🛤 PROJECT WORKFLOW
 
-=== "Step 1"
+!!! success "Project workflow"
+
+    ``` mermaid
+    graph LR
+        A[Start] --> B[Load Dataset]
+        B --> C[Preprocess Data]
+        C --> D[Vectorize Text]
+        D --> E[Train Models]
+        E --> F[Evaluate Models]
+        F --> G[Visualize Results]
+    ```
 
-    Initial data exploration and understanding:
-    - Loaded the dataset using pandas.
-    - Explored dataset features and target variable distribution.
+=== "Step 1"
+    - Load the dataset and clean unnecessary columns.
 
 === "Step 2"
-
-    Data cleaning and preprocessing:
-    - Checked for missing values.
-    - Standardized features using scaling techniques.
+    - Preprocess text and convert categorical labels.
 
 === "Step 3"
-
-    Feature engineering and selection:
-    - Extracted relevant features for spam classification.
-    - Used correlation matrix to select significant features.
+    - Convert text into numerical features using CountVectorizer.
 
 === "Step 4"
-
-    Model training and evaluation:
-    - Trained models: KNN, Naive Bayes, SVM, and Random Forest.
-    - Evaluated models using accuracy, precision, and recall.
+    - Train machine learning models.
 
 === "Step 5"
-
-    Model optimization and fine-tuning:
-    - Tuned hyperparameters using GridSearchCV.
+    - Evaluate models using accuracy, precision, recall, and F1 score.
 
 === "Step 6"
-
-    Validation and testing:
-    - Tested models on unseen data to check performance.
+    - Visualize performance using confusion matrices and heatmaps.
 
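Steps 2–3 of the workflow (preprocessing and CountVectorizer-based vectorization) can be sketched roughly as follows, reusing the `df` frame with `Text` and `Label` columns from the previous snippet; the 80/20 split, `random_state`, and stop-word setting are illustrative choices rather than the notebook's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Split the raw text and encoded labels into train and test sets.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["Text"], df["Label"], test_size=0.2, random_state=42, stratify=df["Label"]
)

# Bag-of-words representation: fit the vocabulary on the training set only.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
```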

 ---
 
-#### PROJECT TRADE-OFFS AND SOLUTIONS
-
-=== "Trade Off 1"
-    - **Accuracy vs. Training Time**:
-        - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes.
-
-=== "Trade Off 2"
-    - **Complexity vs. Interpretability**:
-        - Simpler models like Naive Bayes were more interpretable but slightly less accurate.
+### 🖥 CODE EXPLANATION
 
----
+=== "Section 1"
+    - Data loading and preprocessing.
 
-### SCREENSHOTS
-<!-- Attach the screenshots and images -->
+=== "Section 2"
+    - Text vectorization using CountVectorizer.
 
-!!! success "Project flowchart"
-
-    ``` mermaid
-    graph LR
-        A[Start] --> B[Load Dataset];
-        B --> C[Preprocessing];
-        C --> D[Train Models];
-        D --> E{Compare Performance};
-        E -->|Best Model| F[Deploy];
-        E -->|Retry| C;
-    ```
+=== "Section 3"
+    - Training models (MLP Classifier, MultinomialNB, BernoulliNB).
 
-??? tip "Confusion Matrix"
+=== "Section 4"
+    - Evaluating models using various metrics.
 
-    === "SVM"
-        ![Confusion Matrix - SVM](https://github.com/user-attachments/assets/5abda820-040a-4ea8-b389-cd114d329c62)
+=== "Section 5"
+    - Visualizing confusion matrices and metric comparisons.
 
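A condensed sketch of Sections 3–4: training the three classifiers named above on the vectorized split and collecting the four evaluation metrics. The hyperparameters shown are defaults or illustrative guesses, not necessarily the notebook's settings.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# The three classifiers named in the section above.
models = {
    "MLP Classifier": MLPClassifier(max_iter=300, random_state=42),
    "Multinomial NB": MultinomialNB(),
    "Bernoulli NB": BernoulliNB(),
}

# Fit each model on the vectorized training set and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }

print(results)
```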

-    === "Naive Bayes"
-        ![Confusion Matrix - Naive Bayes](https://github.com/user-attachments/assets/bdae9210-9b9b-45c7-9371-36c0a66a9184)
+---
 
-    === "Decision Tree"
-        ![Confusion Matrix - Decision Tree](https://github.com/user-attachments/assets/8e92fc53-4aff-4973-b0a1-b65a7fc4a79e)
+### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS
 
-    === "AdaBoost"
-        ![Confusion Matrix - AdaBoost](https://github.com/user-attachments/assets/043692e3-f733-419c-9fb2-834f2e199506)
+=== "Trade Off 1"
+    - Balancing accuracy and computational efficiency.
+    - Used Naive Bayes for speed and MLP for improved accuracy.
 
-    === "Random Forest"
-        ![Confusion Matrix - Random Forest](https://github.com/user-attachments/assets/5c689f57-9ec5-4e49-9ef5-3537825ac772)
+=== "Trade Off 2"
+    - Handling false positives vs. false negatives.
+    - Tuned models to improve precision for spam detection.
 
 ---
 
-### MODELS USED AND THEIR EVALUATION METRICS
+## 🎮 SCREENSHOTS
 
-| Model         | Accuracy | Precision | Recall |
-|---------------|----------|-----------|--------|
-| KNN           | 90%      | 89%       | 88%    |
-| Naive Bayes   | 92%      | 91%       | 90%    |
-| SVM           | 94%      | 93%       | 91%    |
-| Random Forest | 95%      | 94%       | 93%    |
-| AdaBoost      | 97%      | 97%       | 100%   |
+!!! tip "Visualizations and EDA of different features"
 
----
-
-#### MODELS COMPARISON GRAPHS
+    === "Confusion Matrix comparison"
+        ![img](https://github.com/user-attachments/assets/94a3b2d8-c7e5-41a5-bba7-8ba4cb1435a7)
 
-!!! tip "Models Comparison Graphs"
 
-    === "Accuracy Comparison"
-        ![Model accuracy comparison](https://github.com/user-attachments/assets/1e17844d-e953-4eb0-a24d-b3dbc727db93)
+??? example "Model performance graphs"
 
----
+    === "Metric comparison"
+        ![img](https://github.com/user-attachments/assets/c2be4340-89c9-4aee-9a27-8c40bf2c0066)
 
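Confusion-matrix heatmaps like the comparison screenshot above can be reproduced along these lines, reusing the fitted `models` dictionary and test split from the earlier sketches; the side-by-side layout and colour map are illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# One heatmap per fitted model, side by side.
fig, axes = plt.subplots(1, len(models), figsize=(5 * len(models), 4))
for ax, (name, model) in zip(axes, models.items()):
    cm = confusion_matrix(y_test, model.predict(X_test))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
                xticklabels=["Ham", "Spam"], yticklabels=["Ham", "Spam"])
    ax.set_title(name)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
plt.tight_layout()
plt.show()
```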

-### CONCLUSION
 
-#### WHAT YOU HAVE LEARNED
-
-!!! tip "Insights gained from the data"
-    - Feature importance significantly impacts spam detection.
-    - Simple models like Naive Bayes can achieve competitive performance.
+---
 
-??? tip "Improvements in understanding machine learning concepts"
-    - Gained hands-on experience with classification models and model evaluation techniques.
+## 📉 MODELS USED AND THEIR EVALUATION METRICS
 
-??? tip "Challenges faced and how they were overcome"
-    - Balancing between accuracy and training time was challenging, solved using model tuning.
+| Model          | Accuracy | Precision | Recall | F1 Score |
+|----------------|----------|-----------|--------|----------|
+| MLP Classifier | 95%      | 0.94      | 0.90   | 0.92     |
+| Multinomial NB | 93%      | 0.91      | 0.88   | 0.89     |
+| Bernoulli NB   | 92%      | 0.89      | 0.85   | 0.87     |
 
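A metrics table like the one above, and the metric-comparison chart shown in the screenshots, can be assembled directly from the `results` dictionary of the evaluation sketch; exact numbers will depend on the split and hyperparameters used.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Turn the per-model metric dictionaries into a comparison table (models as rows).
metrics_df = pd.DataFrame(results).T.round(2)
print(metrics_df)

# Grouped bar chart comparing accuracy, precision, recall and F1 across models.
metrics_df.plot(kind="bar", figsize=(8, 4), rot=0)
plt.ylabel("Score")
plt.title("Model metric comparison")
plt.tight_layout()
plt.show()
```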

 ---
 
-#### USE CASES OF THIS MODEL
-
-=== "Application 1"
+## ✅ CONCLUSION
 
-    **Email Service Providers**
-    - Automated filtering of spam emails for improved user experience.
+### 🔑 KEY LEARNINGS
 
-=== "Application 2"
+!!! tip "Insights gained from the data"
+    - Text length plays a role in spam detection.
+    - Certain words appear more frequently in spam emails.
 
-    **Enterprise Email Security**
-    - Used in enterprise software to detect phishing and spam emails.
+??? tip "Improvements in understanding machine learning concepts"
+    - Gained insights into text vectorization techniques.
+    - Understood trade-offs between different classification models.
 
 ---
 
-### FEATURES PLANNED BUT NOT IMPLEMENTED
+### 🌍 USE CASES
 
-=== "Feature 1"
+=== "Email Filtering Systems"
+    - Can be integrated into email services like Gmail and Outlook.
 
-    - Integration of deep learning models (LSTM) for improved accuracy.
+=== "SMS Spam Detection"
+    - Used in mobile networks to block spam messages.
 
docs/natural-language-processing/index.md

Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@
     <!-- Email Spam Detection -->
     <figure style="padding: 1rem; background: rgba(39, 39, 43, 0.5); border-radius: 10px; border: 1px solid rgba(76, 76, 82, 0.4); box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); transition: transform 0.2s ease-in-out; text-align: center; max-width: 320px; margin: auto;">
         <a href="email-spam-detection" style="color: white; text-decoration: none; display: block;">
-            <img src="https://img.freepik.com/free-photo/spam-mail-concept-with-envelopes_23-2149133736.jpg" alt="Email Spam Detection" style="width: 100%; height: 150px; object-fit: cover; border-radius: 8px; transition: transform 0.2s;" />
+            <img src="https://github.com/user-attachments/assets/c90bf132-68a6-4155-b191-d2da7e35d0ca" alt="Email Spam Detection" style="width: 100%; height: 150px; object-fit: cover; border-radius: 8px; transition: transform 0.2s;" />
             <div style="padding: 0.8rem;">
                 <h3 style="margin: 0; font-size: 18px;">Email Spam Detection</h3>
                 <p style="font-size: 14px; opacity: 0.8;">ML-Based Email Spam Classification</p>
