In this blog post, we are going to classify emails as spam or ham (not spam). I have used an SVM machine learning model for this.
All the source code and the dataset are available in my GitHub repository. Links are at the bottom of this post.
So let’s understand the dataset first.
The dataset has two columns:
- Label — Ham or Spam
- Email Text — Actual Email
Our model will learn patterns in the email text and predict whether a given email is spam or genuine (ham).
Algorithm used — SVM
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best separates the two classes.
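To make the hyperplane idea concrete, here is a minimal sketch on made-up 2-D points (these are not the email data). It uses a linear kernel, which is an assumption chosen only because it exposes the separating line directly through `coef_` and `intercept_`. The names `points`, `labels`, and `toy_svm` are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D points: two features per item, two well-separated classes.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel makes the separating hyperplane (here, a line) explicit.
toy_svm = SVC(kernel="linear")
toy_svm.fit(points, labels)

# The hyperplane is the set of x where w . x + b = 0.
w = toy_svm.coef_[0]
b = toy_svm.intercept_[0]
print("w =", w, "b =", b)

# A new point is classified by which side of the hyperplane it falls on.
new_points = np.array([[1.2, 1.5], [6.8, 6.2]])
predictions = toy_svm.predict(new_points)
print(predictions)
```

With the email data, each coordinate is a word count rather than a hand-picked number, so n is the size of the vocabulary.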
Import Important Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn import svm
Load our Dataset
data = pd.read_csv('spam.csv')
Checking the information of the dataset
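The exact call used for this step is not shown here. A common way to inspect a DataFrame is `info()` together with `value_counts()`, sketched below on a few made-up rows. The `sample` frame and its contents are hypothetical stand-ins for `data`.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from spam.csv.
sample = pd.DataFrame({
    "Label": ["ham", "spam", "ham"],
    "EmailText": [
        "See you at lunch?",
        "WIN a FREE prize now!!!",
        "Meeting moved to 3pm",
    ],
})

# Column names, dtypes, and non-null counts.
sample.info()

# How many ham vs. spam rows there are (useful for spotting class imbalance).
label_counts = sample["Label"].value_counts()
print(label_counts)
```

On the real dataset, the same two calls would be made on `data` instead of `sample`.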
Splitting our data into X and y.
X = data['EmailText'].values
y = data['Label'].values
Splitting our data into training and testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
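As a quick illustration of what `test_size=0.2` does, the sketch below splits ten made-up messages and checks the resulting sizes. The `toy_` names and the data are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Ten made-up messages standing in for the real X and y.
toy_X = [f"message {i}" for i in range(10)]
toy_y = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

# 20% of the rows are held out for testing; random_state makes the shuffle repeatable.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

print(len(toy_X_train), len(toy_X_test))  # 8 2
```

If spam is rare in the dataset, passing `stratify=y` to `train_test_split` is one option for keeping the spam/ham ratio similar in both parts. Whether it helps here is not something this post measures.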
Converting text into integer word counts using CountVectorizer()
# Learn the vocabulary from the training emails only, then reuse it to encode the test emails
cv = CountVectorizer()
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)
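To see what `CountVectorizer` actually produces, here is a small sketch on made-up sentences. The texts and the `toy_cv` name are hypothetical. Note that a word absent from the training texts ("dinner" below) is simply ignored when the test text is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["free prize now", "lunch now", "free free lunch"]
test_texts = ["free dinner now"]

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(train_texts)  # learns the vocabulary
test_counts = toy_cv.transform(test_texts)        # reuses that vocabulary

# Vocabulary terms ordered by their column index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
# ['free', 'lunch', 'now', 'prize']

print(train_counts.toarray())
# [[1 0 1 1]
#  [0 1 1 0]
#  [2 1 0 0]]

print(test_counts.toarray())
# [[1 0 1 0]]
```

Each row is one email and each column is one vocabulary word, which is the n-dimensional space the SVM works in.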
Applying SVM algorithm
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
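The fitting and scoring steps are not shown above. One plausible way to do them is sketched below on a tiny made-up dataset, using `fit`, `predict`, and `accuracy_score`. All `toy_` names and texts are hypothetical, and the accuracy on such a small sample says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Made-up training and test emails (not from spam.csv).
toy_train_texts = [
    "win a free prize now", "free cash offer claim now",
    "urgent free prize waiting", "are we still on for lunch",
    "meeting moved to tomorrow", "see you at the office",
]
toy_train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
toy_test_texts = ["claim your free prize", "lunch at the office tomorrow"]
toy_test_labels = ["spam", "ham"]

# Same encoding step as in the post: fit on training text, transform test text.
toy_cv = CountVectorizer()
train_matrix = toy_cv.fit_transform(toy_train_texts)
test_matrix = toy_cv.transform(toy_test_texts)

# Train the classifier, then measure accuracy on the held-out emails.
toy_classifier = SVC(kernel="rbf", random_state=0)
toy_classifier.fit(train_matrix, toy_train_labels)
toy_pred = toy_classifier.predict(test_matrix)
toy_accuracy = accuracy_score(toy_test_labels, toy_pred)
print("Accuracy:", toy_accuracy)
```

On the real data, the equivalent calls would be `classifier.fit(X_train, y_train)` followed by `classifier.score(X_test, y_test)`, which returns the same accuracy metric.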
Here we get an accuracy of around 97.66%, which is a strong result. Feel free to clone my repository, experiment further with this dataset, and share in the comments the accuracy you get with other classification models.