Most modern spam detection systems rely on machine learning, which has proven superior to hand-crafted filtering rules at many classification tasks given sufficient training data.
This tutorial shows you how to build a spam detector using supervised learning. More specifically, you will use Python to train a logistic regression model that classifies emails as spam or non-spam.
Prerequisites
You will work with NumPy, SciPy, scikit-learn and Matplotlib:
import numpy as np
import scipy.io
import sklearn.metrics
import matplotlib.pyplot as plt
Download the spam dataset consisting of 4,601 emails (the well-known Spambase collection, provided here as spamData.mat and pre-split roughly 2:1 into a training and a test set). Each email is labeled as either 0 (non-spam) or 1 (spam) and comes with 57 features, namely frequency percentages of 48 words and 6 characters, plus 3 statistics on runs of consecutive capitalized letters:
features = np.array(
[
"word_freq_make",
"word_freq_address",
"word_freq_all",
"word_freq_3d",
"word_freq_our",
"word_freq_over",
"word_freq_remove",
"word_freq_internet",
"word_freq_order",
"word_freq_mail",
"word_freq_receive",
"word_freq_will",
"word_freq_people",
"word_freq_report",
"word_freq_addresses",
"word_freq_free",
"word_freq_business",
"word_freq_email",
"word_freq_you",
"word_freq_credit",
"word_freq_your",
"word_freq_font",
"word_freq_000",
"word_freq_money",
"word_freq_hp",
"word_freq_hpl",
"word_freq_george",
"word_freq_650",
"word_freq_lab",
"word_freq_labs",
"word_freq_telnet",
"word_freq_857",
"word_freq_data",
"word_freq_415",
"word_freq_85",
"word_freq_technology",
"word_freq_1999",
"word_freq_parts",
"word_freq_pm",
"word_freq_direct",
"word_freq_cs",
"word_freq_meeting",
"word_freq_original",
"word_freq_project",
"word_freq_re",
"word_freq_edu",
"word_freq_table",
"word_freq_conference",
"char_freq_;",
"char_freq_(",
"char_freq_[",
"char_freq_!",
"char_freq_$",
"char_freq_#",
"capital_run_length_average",
"capital_run_length_longest",
"capital_run_length_total",
]
)
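As a quick consistency check, this array should contain exactly 57 names, one per feature column:
assert features.size == 57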
Load the data
First, load the data into appropriate train/test variables:
data = scipy.io.loadmat("spamData.mat")
X = data["Xtrain"]
N = X.shape[0]  # number of training emails
D = X.shape[1]  # number of features (57)
Xtest = data["Xtest"]
Ntest = Xtest.shape[0]  # number of test emails
y = data["ytrain"].squeeze().astype(int)  # labels: 0 = non-spam, 1 = spam
ytest = data["ytest"].squeeze().astype(int)
Next, normalize the scale of each feature by computing z-scores. Use the training set's means and standard deviations for the test set as well, so that both sets are transformed consistently:
Xz = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Xtestz = (Xtest - np.mean(X, axis=0)) / np.std(X, axis=0)
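As a quick sanity check, each z-scored training column should now have mean 0 and standard deviation 1:
print(np.allclose(Xz.mean(axis=0), 0))  # True
print(np.allclose(Xz.std(axis=0), 1))  # True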
Define the logistic regression model
Define helper functions and the log-likelihood:
def logsumexp(x):
    # Numerically stable log(sum(exp(x))) along axis 0.
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

def logsigma(x):
    # Log of the sigmoid: log sigma(x) = -log(1 + exp(-x)) = -logsumexp([0, -x]).
    if not isinstance(x, np.ndarray):
        return -logsumexp(np.array([0, -x]))
    else:
        return -logsumexp(np.vstack((np.zeros(x.shape[0]), -x)))

def l(y, X, w):
    # Log-likelihood: sum_i y_i * log sigma(x_i.w) + (1 - y_i) * log sigma(-x_i.w).
    return np.sum(y * logsigma(X.dot(w)) + (1 - y) * logsigma(-X.dot(w)))
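To see why the log-sum-exp trick matters, compare logsigma against a naive np.log of the sigmoid on a few extreme inputs (a quick illustration; expect NumPy overflow warnings from the naive version):
x = np.array([-1000.0, -10.0, 0.0, 10.0])
naive = np.log(1 / (1 + np.exp(-x)))  # overflows at x = -1000 and returns -inf
stable = logsigma(x)  # stays finite: logsigma(-1000) = -1000
print(naive)
print(stable)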
Define the gradient of the log-likelihood, which works out to X^T (y - sigma(Xw)):
def sigma(x):
    # Sigmoid function. This form is stable for large positive x, where the
    # algebraically equivalent np.exp(x) / (1 + np.exp(x)) overflows to nan.
    return 1 / (1 + np.exp(-x))

def dl(y, X, w):
    # Gradient of the log-likelihood with respect to w.
    return (y - sigma(X.dot(w))).dot(X)
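Before optimizing, it is worth sanity-checking the analytic gradient against central finite differences. The snippet below is a quick sketch (scipy.optimize.check_grad offers similar functionality); w_check and h are arbitrary test values:
rng = np.random.RandomState(1)
w_check = rng.normal(size=D)
h = 1e-5
num_grad = np.array(
    [(l(y, Xz, w_check + h * e) - l(y, Xz, w_check - h * e)) / (2 * h) for e in np.eye(D)]
)
print(np.max(np.abs(num_grad - dl(y, Xz, w_check))))  # should be small, around 1e-4 or less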
Determine the parameters using maximum likelihood estimation (MLE) and gradient descent (GD)
Here is a Python framework for implementing GD. It adapts the step size after every epoch: if the objective got worse, the step size is halved; otherwise, it is increased slightly (a heuristic sometimes called the bold driver):
def optimize(obj_up, theta0, nepochs=50, eps0=0.01, verbose=True):
    f, update = obj_up  # objective function and one-step update rule
    theta = theta0
    values = np.zeros(nepochs + 1)  # objective value after each epoch
    eps = np.zeros(nepochs + 1)  # step size used in each epoch
    values[0] = f(theta0)
    eps[0] = eps0
    for epoch in range(nepochs):
        if verbose:
            print(
                "Epoch {:3d}: f={:10.3f}, eps={:10.9f}".format(
                    epoch, values[epoch], eps[epoch]
                )
            )
        theta = update(theta, eps[epoch])
        values[epoch + 1] = f(theta)
        # Bold-driver heuristic: halve the step size if the objective got
        # worse, otherwise increase it slightly.
        if values[epoch] < values[epoch + 1]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05
    if verbose:
        print("Result after {} epochs: f={}".format(nepochs, values[-1]))
    return theta, values, eps
def gd(y, X):
    def objective(w):
        return -l(y, X, w)  # minimize the negative log-likelihood
    def update(w, eps):
        return w + eps * dl(y, X, w)  # one gradient step of size eps
    return (objective, update)
You can now run GD to obtain optimized weights:
np.random.seed(0)
w0 = np.random.normal(size=D)  # random starting point for the weights
wz_gd, vz_gd, ez_gd = optimize(gd(y, Xz), w0, nepochs=100)
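Plot the objective values returned by optimize to verify that GD converged; the curve should decrease and flatten out:
plt.plot(vz_gd)
plt.xlabel("Epoch")
plt.ylabel("Negative log-likelihood")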
Predict
Finally, you can define a predictor that outputs spam confidence values, and a classifier that thresholds them:
def predict(Xtest, w):
    # Spam confidence: the predicted probability that each email is spam.
    return sigma(Xtest.dot(w))

def classify(Xtest, w, threshold=0.5):
    # 0.5 is a natural default threshold; the precision-recall curve below
    # suggests a better value.
    return (predict(Xtest, w) > threshold).astype(int)
yhat = predict(Xtestz, wz_gd)
ypred = classify(Xtestz, wz_gd)
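With the default threshold of 0.5, you can already evaluate the classifier on the test set, for example via accuracy and the confusion matrix:
print(sklearn.metrics.accuracy_score(ytest, ypred))
print(sklearn.metrics.confusion_matrix(ytest, ypred))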
Plot the precision-recall curve to find a better threshold value:
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(ytest, yhat)
plt.plot(recall, precision)
# Annotate the curve with threshold values at ten evenly spaced points.
for x in np.linspace(0, 1, 10, endpoint=False):
    index = int(x * (precision.size - 1))
    plt.text(recall[index], precision[index], "{:3.2f}".format(thresholds[index]))
plt.xlabel("Recall")
plt.ylabel("Precision")
# A threshold of 0.44 looks good.
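Reclassify the test set with the tuned threshold and compare against the default:
ypred_tuned = classify(Xtestz, wz_gd, threshold=0.44)
print(sklearn.metrics.accuracy_score(ytest, ypred_tuned))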
Have a look at the largest weights:
features[wz_gd > 2]
Unsurprisingly, you will find that char_freq_$ and capital_run_length_longest have an outsized impact: spam emails frequently contain dollar signs and long runs of capital letters.
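For a fuller picture, you can also list the ten weights with the largest magnitudes together with their feature names (a small sketch using the arrays defined above):
order = np.argsort(-np.abs(wz_gd))
for name, weight in zip(features[order[:10]], wz_gd[order[:10]]):
    print("{:30s} {:+.3f}".format(name, weight))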
Conclusion
In this tutorial, you learned how to build an email spam detector using machine learning and Python. If you want to practice more, try finding another dataset and building a binary classification model using the framework introduced here.


