In this lab, you will build a machine learning system that classifies text
messages as either spam or legitimate (called ham). Along the way you will
learn how to clean data, design features, evaluate a classifier honestly, and let
a machine learn feature weights automatically.
This lab follows the historical arc of how spam detection actually evolved: from
hand-written rules, to hand-designed features with machine-learned weights.
Part 1: Exploring the Data
💻
Open spam.py and run it to load the dataset:
$ uv run python spam.py
You should see the first few messages and a count of spam vs. ham:
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usher, then we'd ...
ham     4825
spam     747
Name: label, dtype: int64
Data cleaning
Before training any model, it is important to understand what the data actually
contains. Open spam.py and look at the explore_data function.
💻
Uncomment each line in explore_data one at a time and run
the script to answer these questions in spam_analysis.md:
How many duplicate messages are in the dataset? What should we do with them?
What fraction of messages are spam? Why does this matter for evaluation?
What are some unusual characters or formatting patterns you notice?
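The first two questions come down to simple counting. The sketch below shows one way to do that counting with the standard library only. The miniature `rows` list and the helper names are made up for illustration; they are not the contents of explore_data, which works on the full dataset.

```python
from collections import Counter

# Hypothetical miniature dataset of (label, message) pairs.
rows = [
    ("ham", "Ok lar... Joking wif u oni..."),
    ("spam", "Free entry in 2 a wkly comp"),
    ("ham", "Ok lar... Joking wif u oni..."),  # exact duplicate of the first row
    ("ham", "U dun say so early hor..."),
    ("spam", "Free entry in 2 a wkly comp"),   # exact duplicate of the second row
]


def count_duplicates(rows):
    """Number of rows that exactly repeat an earlier row."""
    return len(rows) - len(set(rows))


def spam_fraction(rows):
    """Fraction of rows labeled 'spam'."""
    labels = Counter(label for label, _ in rows)
    return labels["spam"] / len(rows)


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    return list(dict.fromkeys(rows))


print(count_duplicates(rows))      # 2
print(spam_fraction(rows))         # 0.4
print(len(drop_duplicates(rows)))  # 3
```

Whether duplicates should be dropped is a judgment call for your analysis. One reason to consider it is that a duplicate can end up in both the training and the test set, which makes test results look better than they are.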
Part 2: Evaluating a Classifier
Before we build anything sophisticated, we need to understand how to measure
whether a classifier is actually good.
The problem with accuracy
Consider a simple strategy: classify every message as ham. On our dataset,
that strategy is correct about 87% of the time, because about 87% of messages are ham (4825 of 5572).
But this "classifier" would miss every single spam message.
Accuracy (the fraction of predictions that are correct) is a poor metric
when classes are imbalanced.
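The arithmetic behind the "always ham" strategy can be checked directly from the class counts printed in Part 1. The variable names here are only for illustration.

```python
# Class counts as printed by spam.py in Part 1.
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# "Always ham" gets every ham message right and every spam message wrong.
accuracy = ham_count / total

# It never flags anything, so it catches none of the spam.
spam_caught = 0
fraction_of_spam_caught = spam_caught / spam_count

print(f"accuracy = {accuracy:.3f}")                        # accuracy = 0.866
print(f"spam caught = {fraction_of_spam_caught:.3f}")      # spam caught = 0.000
```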
Precision, recall, and F1
We need two numbers, not one: precision and recall. The F1 score then combines them into a single summary:
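Term       Question answered                                                  Formula
Precision  Of the messages we flagged as spam, what fraction actually were?   TP / (TP + FP)
Recall     Of all the actual spam messages, what fraction did we catch?       TP / (TP + FN)
F1 score   Harmonic mean of precision and recall                              2 × P × R / (P + R)
Here TP (true positives) counts spam messages correctly flagged as spam, FP (false positives) counts ham messages wrongly flagged as spam, and FN (false negatives) counts spam messages that were missed. P and R stand for precision and recall.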
A high-precision spam filter almost never marks a legitimate message as spam. A
high-recall spam filter catches almost all spam. These goals are in tension:
being more aggressive catches more spam but also flags more ham.
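The three formulas translate almost directly into code. The following is a minimal sketch, not the lab's evaluate function; the function name and the example counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 spam caught, 10 ham wrongly flagged, 20 spam missed.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.800 0.667 0.727
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high.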
Train/test split
We need to evaluate our classifier on data it has not seen during training.
Otherwise, we would only know how well it memorized the training data, not
whether it generalizes.
👁
Open spam.py and find the split_data function. Notice
how it uses train_test_split to hold out 20% of the data for evaluation.
The random_state parameter ensures everyone gets the same split.
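To see why a fixed seed gives everyone the same split, here is a small standard-library stand-in for the idea. It is not scikit-learn's train_test_split and not the lab's split_data; the function name `split_rows`, the seed value 42, and the toy rows are assumptions for illustration.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then hold out a test slice."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed, same order every run
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Hypothetical toy data: 8 ham and 2 spam messages.
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", f"offer {i}") for i in range(2)]

train, test = split_rows(rows)
print(len(train), len(test))  # 8 2

# Calling it again with the same seed reproduces the split exactly.
print(split_rows(rows) == (train, test))  # True
```

Note that every row lands in exactly one of the two sets, so the held-out messages play no part in training.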
Write a hand-coded classifier
💻
Implement the classify_by_rules function in spam.py.
Your classifier should look at a message and return "spam" or "ham". Start
with a few simple rules, then use evaluate to measure its performance.
Example output from evaluate:
=== Hand-written rules ===
Accuracy:  0.872
Precision: 0.631
Recall:    0.415
F1:        0.501
Confusion matrix:
             Predicted ham  Predicted spam
Actual ham             955              10
Actual spam             93              57
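If you are unsure where to start, the sketch below shows the general shape a rule-based classifier can take. The keyword list and both rules are guesses made up for illustration, and the function is deliberately named `classify_by_rules_sketch` to keep it distinct from your own classify_by_rules.

```python
import re

# Hypothetical keyword list; choose your own after looking at the data.
SPAM_WORDS = {"free", "win", "prize", "claim", "urgent"}


def classify_by_rules_sketch(message):
    """Return "spam" or "ham" using two hand-written rules."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    hits = len(tokens & SPAM_WORDS)
    if hits >= 2:
        # Rule 1: two or more distinct spam keywords.
        return "spam"
    if hits == 1 and "!" in message:
        # Rule 2: one spam keyword plus an exclamation mark.
        return "spam"
    return "ham"


print(classify_by_rules_sketch("Free entry in 2 a wkly comp to win FA Cup"))  # spam
print(classify_by_rules_sketch("Are you free tonight?"))                      # ham
```

Rules like these are easy to read, but every threshold is a guess. That limitation is what motivates the feature-based approach in Part 3.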
Part 3: Feature Engineering
Instead of making final decisions directly (spam or ham), we can extract
numerical features from each message—measurable properties that might
correlate with the class label.
Features let us separate two concerns: what to measure (feature engineering)
and how to combine the measurements (learning).
Design your features
💻
Open spam.py and implement the extract_features
function. It should take a message string and return a dictionary of feature
names to numerical values. Start with the features already sketched out, then
add at least three of your own:
def extract_features(message):
    text = message.lower()
    return {
        "contains_free": int("free" in text),
        "num_exclamations": message.count("!"),
        "length": len(message),
        # Add your features here
    }
Think about what makes spam messages distinctive. Some ideas:
Mentions of money, prizes, or urgent action
Number of uppercase words
Presence of phone numbers or URLs
Ratio of non-alphabetic characters
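Some of these ideas can be sketched as follows. The feature names, the regular expressions, and the "five or more digits" stand-in for a phone number are all assumptions for illustration, and they will miss many real-world cases.

```python
import re

# Rough, hypothetical patterns; real URLs and phone numbers vary widely.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
LONG_DIGIT_RUN = re.compile(r"\d{5,}")


def extra_features(message):
    """Return a few additional numerical features for one message."""
    words = message.split()
    # Words of two or more characters whose letters are all uppercase.
    upper_words = [w for w in words if len(w) > 1 and w.isupper()]
    # Characters that are neither letters nor whitespace.
    non_alpha = sum(1 for ch in message if not ch.isalpha() and not ch.isspace())
    return {
        "num_upper_words": len(upper_words),
        "has_url": int(bool(URL_PATTERN.search(message))),
        "has_long_digit_run": int(bool(LONG_DIGIT_RUN.search(message))),
        "non_alpha_ratio": non_alpha / len(message) if message else 0.0,
    }


features = extra_features("WIN cash NOW! Call 0800123456 or visit www.example.com")
print(features["num_upper_words"], features["has_url"], features["has_long_digit_run"])
# 2 1 1
```

A dictionary like this could be merged into the one returned by extract_features, for example with `{**base_features, **extra_features(message)}`.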
Why features beat raw text
Each feature reduces a complex message to a single number. This transformation
throws away a lot of information—but in a principled way. The features you
choose express your hypothesis about what is predictive of spam.
Evaluate your features
💻
Run your feature-based classifier:
$ uv run python spam.py --features
Try adding and removing features and observe the effect on precision and recall.
Part 4: Learning Feature Weights
So far, you have chosen which features to use. Now let the computer learn how
much weight to give each one.
Logistic regression is a classification algorithm that learns a weight for
each feature. A positive weight means the feature pushes predictions toward
spam; a negative weight pushes toward ham. The magnitude of the weight reflects
how strongly the feature influences the prediction.
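To make the role of the weights concrete, here is a minimal sketch of how a trained logistic regression model turns features into a spam probability: it adds up weight times value for each feature, adds a bias term, and passes the total through the logistic (sigmoid) function. The weights and bias below are invented for illustration; they are not values learned from the lab's data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def spam_probability(features, weights, bias):
    """Weighted sum of features plus bias, squashed through the sigmoid."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(score)


# Hypothetical weights: positive pushes toward spam, negative toward ham.
weights = {"contains_free": 1.5, "num_exclamations": 0.8, "length": -0.01}
bias = -2.0

spammy = {"contains_free": 1, "num_exclamations": 3, "length": 60}
plain = {"contains_free": 0, "num_exclamations": 0, "length": 60}

print(spam_probability(spammy, weights, bias) > 0.5)  # True
print(spam_probability(plain, weights, bias) > 0.5)   # False
```

A message is typically classified as spam when this probability exceeds 0.5, which is the same as the weighted score being positive.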
Train a logistic regression model
💻
Run logistic regression on your features:
$ uv run python spam.py --logreg
The output shows the learned weights alongside your feature names.
The C parameter controls regularization: how much the model is penalized
for having large weights. C works as an inverse strength, so a smaller C means
a stronger penalty. With very small C, weights are driven toward zero and the
model underfits, ignoring even informative features. With very large C, the
penalty almost disappears and the model fits the training data as tightly as
possible, which risks overfitting.
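The effect of C on the weights can be seen in a tiny from-scratch model. This sketch trains logistic regression by plain gradient descent on the mean log loss plus an L2 penalty of ‖w‖² / (2·C·n). It assumes the lab's --C flag behaves like an inverse regularization strength, as described above. The toy data, learning rate, and step count are made up, and this is not the solver spam.py uses.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logreg(X, y, C, lr=0.05, steps=2000):
    """Gradient descent on mean log loss + ||w||^2 / (2 * C * n)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Penalty gradient is w / (C * n): small C pulls weights hard toward 0.
        # The bias term is left unpenalized.
        w = [wj - lr * (gj + wj / (C * n)) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: [contains_free, num_exclamations], label 1 = spam.
X = [[1, 3], [1, 2], [0, 4], [1, 1], [0, 0], [0, 1], [0, 0], [1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


w_small_c, _ = fit_logreg(X, y, C=0.01)
w_large_c, _ = fit_logreg(X, y, C=100)
print(norm(w_small_c) < norm(w_large_c))  # True
```

With C = 0.01 the weights stay close to zero, while with C = 100 they are free to grow and fit the toy data closely.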
💻
Try several values of C and record the results:
$ uv run python spam.py --logreg --C 0.01
$ uv run python spam.py --logreg --C 0.1
$ uv run python spam.py --logreg --C 10
$ uv run python spam.py --logreg --C 100
Look at both training and test performance. When training accuracy is much
higher than test accuracy, the model is overfitting—it has memorized the
training examples rather than learning a general rule.