# 🌟 Email Spam Detection

<div align="center">
  <img src="https://github.com/user-attachments/assets/c90bf132-68a6-4155-b191-d2da7e35d0ca" />
</div>

## 🎯 AIM

To classify emails as spam or ham using machine learning models, ensuring better email filtering and security.

## 📊 DATASET LINK

[Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification)

## 📚 KAGGLE NOTEBOOK

[Notebook Link](https://www.kaggle.com/code/thatarguy/email-spam-classifier?kernelSessionId=224262023)

??? abstract "Kaggle Notebook"

    <iframe src="https://www.kaggle.com/embed/thatarguy/email-spam-classifier?kernelSessionId=224262023" height="800" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="email-spam-classifier"></iframe>

## ⚙️ TECH STACK

| **Category**             | **Technologies**                                 |
|--------------------------|--------------------------------------------------|
| **Languages**            | Python                                           |
| **Libraries/Frameworks** | Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
| **Databases**            | Not used                                         |
| **Tools**                | Kaggle, Jupyter Notebook                         |
| **Deployment**           | Not used                                         |
---

## 📝 DESCRIPTION

!!! info "What is the requirement of the project?"

    - To efficiently classify emails as spam or ham.
    - To improve email security by filtering out spam messages.

??? info "How is it beneficial and used?"

    - Helps reduce unwanted spam emails in user inboxes.
    - Enhances productivity by filtering out irrelevant emails.
    - Can be integrated into email service providers for automatic filtering.

??? info "How did you start approaching this project? (Initial thoughts and planning)"

    - Collected and preprocessed the dataset.
    - Explored various machine learning models.
    - Evaluated models based on performance metrics.
    - Visualized results for better understanding.

??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."

    - Scikit-learn documentation.
    - Various Kaggle notebooks related to spam detection.

---
## 🔍 PROJECT EXPLANATION

### 🧩 DATASET OVERVIEW & FEATURE DETAILS

??? example "📂 spam.csv"

    - The dataset contains the following features:

    | Feature Name | Description     | Datatype |
    |--------------|-----------------|:--------:|
    | Category     | Spam or Ham     |  object  |
    | Text         | Email text      |  object  |
    | Length       | Length of email |  int64   |

??? example "🛠 Developed Features from spam.csv"

    | Feature Name | Description       | Reason                  | Datatype |
    |--------------|-------------------|-------------------------|:--------:|
    | Length       | Email text length | Helps in spam detection |  int64   |
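The developed `Length` feature can be derived in one line with pandas. The two-row sample below is made up purely for illustration; the real `spam.csv` carries the `Category` and `Text` columns described in the table above.

```python
import pandas as pd

# Tiny made-up sample standing in for spam.csv (Category/Text as described above).
df = pd.DataFrame({
    "Category": ["ham", "spam"],
    "Text": ["Are we still on for lunch?", "WIN a FREE prize now!!!"],
})

# Developed feature: character length of each email text.
df["Length"] = df["Text"].str.len()
print(df.dtypes)  # Category/Text are object, Length is int64
```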
---

### 🛤 PROJECT WORKFLOW

!!! success "Project workflow"

    ``` mermaid
    graph LR
      A[Start] --> B[Load Dataset]
      B --> C[Preprocess Data]
      C --> D[Vectorize Text]
      D --> E[Train Models]
      E --> F[Evaluate Models]
      F --> G[Visualize Results]
    ```
=== "Step 1"

    - Load the dataset and clean unnecessary columns.

=== "Step 2"

    - Preprocess text and convert categorical labels.

=== "Step 3"

    - Convert text into numerical features using CountVectorizer.

=== "Step 4"

    - Train machine learning models.

=== "Step 5"

    - Evaluate models using accuracy, precision, recall, and F1 score.

=== "Step 6"

    - Visualize performance using confusion matrices and heatmaps.
---

### 🖥 CODE EXPLANATION

=== "Section 1"

    - Data loading and preprocessing.

=== "Section 2"

    - Text vectorization using CountVectorizer.

=== "Section 3"

    - Training models (MLP Classifier, MultinomialNB, BernoulliNB).

=== "Section 4"

    - Evaluating models using various metrics.

=== "Section 5"

    - Visualizing confusion matrices and metric comparisons.

---
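A minimal sketch of Section 3, fitting the three named classifiers on a toy vectorized corpus. The hyperparameters shown are defaults or assumptions; the notebook's settings may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier

# Toy corpus for illustration only.
texts = ["free prize winner", "claim free cash", "meeting at ten", "lunch tomorrow?"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)

# The three classifiers named in Section 3.
models = {
    "MLP Classifier": MLPClassifier(max_iter=500, random_state=0),
    "Multinomial NB": MultinomialNB(),
    "Bernoulli NB": BernoulliNB(),
}
for name, clf in models.items():
    clf.fit(X, labels)
    print(name, "train accuracy:", clf.score(X, labels))
```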
### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS

=== "Trade Off 1"

    - Balancing accuracy and computational efficiency.
    - Used Naive Bayes for speed and MLP for improved accuracy.

=== "Trade Off 2"

    - Handling false positives vs. false negatives.
    - Tuned models to improve precision for spam detection.
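Trade Off 2 can be illustrated with a decision-threshold sweep: raising the spam threshold trades recall for precision. The probabilities below are hypothetical, standing in for a trained classifier's `predict_proba` output; this is a sketch, not the notebook's actual tuning code.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical spam probabilities for eight emails (1 = spam, 0 = ham).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.35, 0.9, 0.2, 0.6, 0.55])

# A higher threshold flags fewer emails, boosting precision at recall's expense.
for threshold in (0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          "precision:", precision_score(y_true, y_pred),
          "recall:", recall_score(y_true, y_pred))
```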
---
## 🎮 SCREENSHOTS

!!! tip "Visualizations and EDA of different features"

    === "Confusion Matrix comparison"

        
??? example "Model performance graphs"

    === "Metric comparison"

        
---

## 📉 MODELS USED AND THEIR EVALUATION METRICS

| Model          | Accuracy | Precision | Recall | F1 Score |
|----------------|----------|-----------|--------|----------|
| MLP Classifier | 95%      | 0.94      | 0.90   | 0.92     |
| Multinomial NB | 93%      | 0.91      | 0.88   | 0.89     |
| Bernoulli NB   | 92%      | 0.89      | 0.85   | 0.87     |
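Each metric in the table can be computed with scikit-learn. The labels and predictions below are hypothetical, chosen only to show the calls; they do not reproduce the table's numbers.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels and predictions (1 = spam, 0 = ham) for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1 score: ", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: true ham/spam, cols: predicted
```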
---
## ✅ CONCLUSION

### 🔑 KEY LEARNINGS

!!! tip "Insights gained from the data"

    - Text length plays a role in spam detection.
    - Certain words appear more frequently in spam emails.

??? tip "Improvements in understanding machine learning concepts"

    - Gained insights into text vectorization techniques.
    - Understood trade-offs between different classification models.

---

### 🌍 USE CASES

=== "Email Filtering Systems"

    - Can be integrated into email services like Gmail and Outlook.

=== "SMS Spam Detection"

    - Used in mobile networks to block spam messages.