Classification: Features

Lab setup

First, make sure you have completed the initial setup.

Open Terminal. Run the update command to make sure you have the latest code.
```
$ mwc update
```

Move to this lab's directory.

$ cd ~/Desktop/making_with_code/shuyuan/labs/classification_features

Move to your MWC directory.
```
$ cd ~/Desktop/making_with_code
```

Get a copy of this lab's materials.

$ git clone https://git.makingwithcode.org/mwc/classification_features.git

How this lab meets the learning objectives

This lab gives students hands-on experience with the core workflow of supervised classification: designing features, training a model, and evaluating it rigorously. The spam detection context is immediately relatable and keeps motivation high.

A4.2.1 — Data cleaning (Parts 1 and 2): Students examine a real dataset, discover its quirks (class imbalance, encoding issues, duplicate messages), and make deliberate decisions about how to handle them. Part 2 formalizes why this matters: a model trained on dirty data performs worse, and on imbalanced data, accuracy alone is misleading.

A4.2.2 — Feature selection (Part 3): Students design features by hand, observe which ones matter, then let the model rank them by learned weight. The contrast between human-chosen and machine-ranked features is the core lesson. Teaching notes throughout Part 3 connect the activity to filter, wrapper, and embedded methods.

A4.3.3 — Hyperparameter tuning (Parts 2 and 4): Precision, recall, and F1 are introduced in Part 2 as evaluation tools; students use them throughout. Part 4 asks students to tune the regularization strength of logistic regression and interpret the effect on the confusion matrix. Overfitting is visible when a model performs much better on training data than on held-out data.

Pacing

Suggested pacing (8 class periods):

Periods 1–2: Parts 1 and 2 (load data, clean, evaluate hand-written rules)
Periods 3–5: Part 3 (feature engineering, feature selection)
Periods 6–8: Part 4 (logistic regression, hyperparameter tuning)

The lab works well in groups of 2–3. Checkpoints are natural moments to check in with groups before they move on.

In this lab, you will build a machine learning system that classifies text messages as either spam or legitimate (called ham). Along the way you will learn how to clean data, design features, evaluate a classifier honestly, and let a machine learn feature weights automatically.

This lab follows the historical arc of how spam detection actually evolved: from hand-written rules, to hand-designed features with machine-learned weights.

Part 1: Exploring the Data

💻 Open spam.py and run it to load the dataset:

$ uv run python spam.py

You should see the first few messages and a count of spam vs. ham:

   label                                            message
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usher, then we'd ...

ham     4825
spam     747
Name: label, dtype: int64

Data cleaning

Before training any model, it is important to understand what the data actually contains. Open spam.py and look at the explore_data function.

💻 Uncomment each line in explore_data one at a time and run the script to answer these questions in spam_analysis.md:

How many duplicate messages are in the dataset? What should we do with them?
What fraction of messages are spam? Why does this matter for evaluation?
What are some unusual characters or formatting patterns you notice?
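
Each of these questions comes down to a few lines of counting. The sketch below is not the explore_data function from spam.py (its contents are not shown here). It uses only the standard library and a tiny made-up sample, so the names rows, num_duplicates, spam_fraction, and non_ascii are assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample standing in for the real dataset: (label, message) pairs.
rows = [
    ("ham", "Ok lar... Joking wif u oni..."),
    ("spam", "WINNER!! Claim your \u00a3900 prize now"),
    ("ham", "Ok lar... Joking wif u oni..."),  # exact duplicate of the first row
    ("ham", "I\u2019m on my way"),
    ("spam", "FREE entry! Text WIN to 80086"),
]

# 1. Duplicates: count every repeat of a message beyond its first occurrence.
message_counts = Counter(message for _, message in rows)
num_duplicates = sum(count - 1 for count in message_counts.values())

# 2. Class balance: fraction of rows labeled spam.
label_counts = Counter(label for label, _ in rows)
spam_fraction = label_counts["spam"] / len(rows)

# 3. Unusual characters: tally anything outside plain ASCII.
non_ascii = Counter(ch for _, message in rows for ch in message if ord(ch) > 127)

print("duplicates:", num_duplicates)
print("spam fraction:", spam_fraction)
print("non-ASCII characters:", sorted(non_ascii))
```

On the real dataset the same three counts give you concrete numbers to cite in spam_analysis.md.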

Part 2: Evaluating a Classifier

Before we build anything sophisticated, we need to understand how to measure whether a classifier is actually good.

The problem with accuracy

Consider a simple strategy: classify every message as ham. On our dataset, that strategy is correct about 87% of the time, because 4,825 of the 5,572 messages are ham. But this "classifier" would miss every single spam message.

Accuracy (the fraction of predictions that are correct) is a poor metric when classes are imbalanced.
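
You can check the arithmetic yourself. This short sketch uses the class counts printed in Part 1. The variable names are only for illustration.

```python
# Class counts from the dataset summary shown in Part 1.
ham, spam = 4825, 747
total = ham + spam

# "Always ham" is right on every ham message and wrong on every spam message.
baseline_accuracy = ham / total

# It never flags anything, so it catches none of the spam.
baseline_recall = 0 / spam

print(f"accuracy = {baseline_accuracy:.3f}, spam recall = {baseline_recall:.3f}")
# prints: accuracy = 0.866, spam recall = 0.000
```

Any classifier you build should beat this baseline on accuracy while also catching some spam.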

Precision, recall, and F1

We need two numbers, not one, plus a single score that combines them:

Term	Question answered	Formula
Precision	Of the messages we flagged as spam, what fraction actually were?	TP / (TP + FP)
Recall	Of all the actual spam messages, what fraction did we catch?	TP / (TP + FN)
F1 score	Harmonic mean of precision and recall	2 × P × R / (P + R)
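
The formulas in the table translate directly into code. This is a minimal sketch: the counts are made up, and returning 0.0 when a denominator is zero is an assumed convention, not something the lab specifies.

```python
def precision(tp, fp):
    """Of the messages flagged as spam, the fraction that really were spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of all the actual spam messages, the fraction that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 80 spam caught, 20 ham wrongly flagged, 40 spam missed.
tp, fp, fn = 80, 20, 40
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f} f1={f1_score(p, r):.3f}")
```

Because F1 is a harmonic mean, it stays low if either precision or recall is low.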

A high-precision spam filter almost never marks legitimate messages as spam. A high-recall spam filter catches almost all spam. These goals are in tension: a more aggressive filter catches more spam but also flags more ham.
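
One way to picture "more aggressive" is a filter that gives each message a spam score and flags everything above a threshold. Lowering the threshold makes it more aggressive. The scores and labels below are invented for illustration, and precision_recall is a hypothetical helper.

```python
# Hypothetical (spam_score, true_label) pairs; a higher score means more spam-like.
scored = [
    (0.95, "spam"), (0.80, "spam"), (0.65, "ham"), (0.60, "spam"),
    (0.40, "ham"), (0.35, "spam"), (0.20, "ham"), (0.05, "ham"),
]


def precision_recall(threshold):
    """Flag every message whose score is at least `threshold`."""
    tp = sum(1 for s, y in scored if s >= threshold and y == "spam")
    fp = sum(1 for s, y in scored if s >= threshold and y == "ham")
    fn = sum(1 for s, y in scored if s < threshold and y == "spam")
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn)
    return p, r


# Strict, moderate, and aggressive thresholds.
for threshold in (0.9, 0.5, 0.1):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

As the threshold drops, recall rises and precision falls.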

Train/test split

We need to evaluate our classifier on data it has not seen during training. Otherwise, we would only know how well it memorized the training data, not whether it generalizes.

👁 Open spam.py and find the split_data function. Notice how it uses train_test_split to hold out 20% of the data for evaluation. The random_state parameter ensures everyone gets the same split.
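
To see why a fixed random_state gives everyone the same split, here is a standard-library stand-in. It is not scikit-learn's implementation of train_test_split, and split_data_sketch is a hypothetical name. It only illustrates the idea of a seeded shuffle followed by a hold-out.

```python
import math
import random


def split_data_sketch(rows, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the first chunk as the test set."""
    rng = random.Random(seed)   # same seed -> same shuffle on every run
    shuffled = list(rows)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = [f"message {i}" for i in range(10)]
train, test = split_data_sketch(rows)
train_again, test_again = split_data_sketch(rows)

print(len(train), len(test))   # 8 training rows, 2 test rows
print(test == test_again)      # True: the split is reproducible
```

Changing the seed changes which messages land in the test set, which is why everyone needs to use the same value to compare results.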

Write a hand-coded classifier

💻 Implement the classify_by_rules function in spam.py. Your classifier should look at a message and return "spam" or "ham". Start with a few simple rules, then use evaluate to measure its performance.
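
If you are unsure where to start, a rule-based classifier can be as small as the sketch below. The word list and the exclamation-mark cutoff are arbitrary starting points chosen for illustration, not a recommended rule set.

```python
# Illustrative trigger words; choose your own after exploring the data.
SPAM_WORDS = ("free", "winner", "prize", "claim")


def classify_by_rules(message):
    """Return "spam" or "ham" using a couple of hand-written rules."""
    text = message.lower()
    # Rule 1: any trigger word appears anywhere in the message.
    if any(word in text for word in SPAM_WORDS):
        return "spam"
    # Rule 2: three or more exclamation marks.
    if message.count("!") >= 3:
        return "spam"
    return "ham"


print(classify_by_rules("FREE entry in 2 a wkly comp to win FA Cup"))  # spam
print(classify_by_rules("Ok lar... Joking wif u oni..."))              # ham
print(classify_by_rules("Are you free tonight?"))                      # spam
```

The third message is a false positive: a friendly text gets flagged because it contains "free". Mistakes like this are what precision measures.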

Example output from evaluate:

=== Hand-written rules ===
Accuracy:   0.908
Precision:  0.851
Recall:     0.380
F1:         0.525

Confusion matrix:
              Predicted ham  Predicted spam
Actual ham          955               10
Actual spam          93               57

Part 3: Feature Engineering

Instead of making final decisions directly (spam or ham), we can extract numerical features from each message—measurable properties that might correlate with the class label.

Features let us separate two concerns: what to measure (feature engineering) and how to combine the measurements (learning).

Design your features

💻 Open spam.py and implement the extract_features function. It should take a message string and return a dictionary of feature names to numerical values. Start with the features already sketched out, then add at least three of your own:

def extract_features(message):
    text = message.lower()
    return {
        "contains_free": int("free" in text),
        "num_exclamations": message.count("!"),
        "length": len(message),
        # Add your features here
    }

Think about what makes spam messages distinctive. Some ideas:

Mentions of money, prizes, or urgent action
Number of uppercase words
Presence of phone numbers or URLs
Ratio of non-alphabetic characters
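
Each idea in the list above can become a number. The sketch below shows one possible translation. The feature names, the regular expressions, and the helper extra_features are all assumptions for illustration, and the patterns are deliberately crude.

```python
import re

# Rough patterns: a URL starts with http(s):// or www., and a "long number"
# is any run of five or more digits. Both will miss cases and over-match.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
LONG_NUMBER_PATTERN = re.compile(r"\b\d{5,}\b")
PRIZE_WORDS = ("prize", "cash", "win")


def extra_features(message):
    """Return a dictionary of feature names to numerical values."""
    text = message.lower()
    words = message.split()
    non_alpha = sum(1 for ch in message if not ch.isalpha())
    return {
        "mentions_prize": int(any(word in text for word in PRIZE_WORDS)),
        "num_uppercase_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "has_url": int(bool(URL_PATTERN.search(message))),
        "has_long_number": int(bool(LONG_NUMBER_PATTERN.search(message))),
        "non_alpha_ratio": non_alpha / len(message) if message else 0.0,
    }


features = extra_features("WIN cash NOW! Call 08001234567 or visit www.example.com")
for name, value in features.items():
    print(f"{name:22} {value}")
```

Substring checks such as "win" also match "window", so look at the messages a feature fires on before you trust it.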

Why features beat raw text

Each feature reduces a complex message to a single number. This transformation throws away a lot of information—but in a principled way. The features you choose express your hypothesis about what is predictive of spam.

Evaluate your features

💻 Run your feature-based classifier:

$ uv run python spam.py --features

Try adding and removing features and observe the effect on precision and recall.

Part 4: Learning Feature Weights

So far, you have chosen which features to use. Now let the computer learn how much weight to give each one.

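Logistic regression is a classification algorithm that learns a weight for each feature. A positive weight means the feature pushes predictions toward spam; a negative weight pushes toward ham. The magnitude of the weight reflects how strongly the feature influences the prediction. Magnitudes are only directly comparable between features measured on similar scales, so a small weight on a large-valued feature such as length can still matter.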
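
To make the role of the weights concrete, here is how a trained model could turn features into a prediction: multiply each feature by its weight, add a bias, and squash the total into a probability with the sigmoid function. The weights and bias below are invented for illustration. They are not output from spam.py.

```python
import math

# Hypothetical learned parameters.
weights = {"contains_free": 3.2, "num_exclamations": 1.9, "length": 0.01}
bias = -4.0


def spam_probability(features):
    """Weighted sum of the features, passed through the sigmoid function."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))


spammy = {"contains_free": 1, "num_exclamations": 2, "length": 40}
plain = {"contains_free": 0, "num_exclamations": 0, "length": 30}

print(f"spammy message: {spam_probability(spammy):.3f}")
print(f"plain message:  {spam_probability(plain):.3f}")
```

A probability above 0.5 is classified as spam. Training is the process of finding the weights and bias that make these probabilities match the labels in the training data.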

Train a logistic regression model

💻 Run logistic regression on your features:

$ uv run python spam.py --logreg

The output shows the learned weights alongside your feature names:

=== Logistic Regression (C=1.0) ===
Accuracy:  0.981
Precision: 0.962
Recall:    0.920
F1:        0.941

Feature weights (most influential first):
  contains_free        +3.24
  contains_call        +2.11
  num_exclamations     +1.87
  has_url              +1.53
  num_capitals         -0.03
  length               +0.01
  ...

Hyperparameter tuning

The C parameter controls regularization, the penalty the model pays for having large weights. C is the inverse of regularization strength, so a smaller C means a stronger penalty. With very small C, weights are driven toward zero and the model underfits, ignoring even informative features. With very large C, the penalty almost vanishes and the model fits the training data as tightly as possible, which risks overfitting.
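
The effect of C on the weights can be seen on a toy problem. The sketch below is a hand-rolled gradient descent, not the solver that spam.py uses. It makes several assumptions: an L2 penalty, an unpenalized bias, one binary feature, and eight made-up examples.

```python
import math

# Toy data: (feature, label) pairs with label 1 = spam. Hypothetical, and
# deliberately noisy so the weight stays finite even with almost no penalty.
DATA = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0), (0, 1)]


def fit_weight(C, steps=20000, learning_rate=0.05):
    """Minimize mean log-loss + w**2 / (2 * C * n) by gradient descent.

    This is the L2-penalized objective 0.5 * w**2 + C * (sum of log-losses)
    divided by C * n, so it has the same minimizer. Returns the learned weight.
    """
    w, b = 0.0, 0.0
    n = len(DATA)
    for _ in range(steps):
        grad_w = w / (C * n)   # gradient of the penalty term
        grad_b = 0.0           # the bias is not penalized
        for x, y in DATA:
            p = 1 / (1 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w


weights_by_C = {C: fit_weight(C) for C in (0.01, 1, 100)}
for C, w in weights_by_C.items():
    print(f"C={C:<6} weight={w:.3f}")
```

The learned weight grows as C grows: at C=0.01 the penalty holds it near zero, and at C=100 it approaches the value that best fits the training data.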

💻 Try several values of C and record the results:

$ uv run python spam.py --logreg --C 0.01
$ uv run python spam.py --logreg --C 0.1
$ uv run python spam.py --logreg --C 10
$ uv run python spam.py --logreg --C 100

Look at both training and test performance. When training accuracy is much higher than test accuracy, the model is overfitting—it has memorized the training examples rather than learning a general rule.

✅ CHECKPOINT 4

In spam_analysis.md (Section 4):

Weights. Which three features received the highest positive weights? The highest negative weights? Does this match your expectations from Part 3?
Tuning. Fill in the table in spam_analysis.md with results for each value of C. At what value of C do you first see signs of overfitting?
Trade-offs. For a spam filter on a school email system, would you optimize for precision or recall? Justify your choice. How would you change C to achieve it?
Comparison. Compare your best logistic regression results with your best hand-written classifier. What did the machine learn that you did not?

Push your work when you are done:

$ mwc submit

Discussion prompts

Connecting to daily experience:

What spam or phishing messages have you received? What patterns made them recognizable?
Modern spam filters use far more features than the handful you designed in this lab. Gmail reportedly uses hundreds of features. What features might they use that you did not?

Connecting to bias and fairness:

Spam filters are trained on historical data. If certain communities write email in a particular style (e.g., Spanglish, AAVE, informal shorthand), could a spam filter be systematically unfair to them? How would you detect this?
Who decides what counts as spam? Can a legitimate message be unfairly flagged? What are the consequences?

On feature engineering vs. deep learning:

In this lab, you designed features by hand. In the next lab, we will use neural networks that learn their own features automatically. What is the advantage of hand-designed features? What is their limitation?

Lab reflection

Strengths:

The spam task has immediate intuitive appeal; students usually bring real experience with spam messages to the lab.
Precision and recall are introduced in a context where the trade-off is genuinely meaningful (missing spam vs. flagging legitimate mail), which makes the concepts stick better than abstract definitions.
The progression from hand-written rules → hand features → learned weights mirrors the historical development of spam filtering and gives each step a clear motivation.

Weaknesses and open questions:

The dataset is from 2011 and reflects SMS spam from that era; modern spam looks quite different. Consider whether this context is a feature (historical artifact) or a bug (less relatable to students who use different channels).
Logistic regression is introduced as a black box. Students who want to understand the math will need supplementary material. The gradient descent intuition from the next lab may help retroactively.
The --features flag and --logreg flag in spam.py keep the lab modular, but students who run ahead may encounter confusing output if they jump to Part 4 before implementing extract_features.
Open question: Should we include cross-validation (k-fold CV) in the evaluation section? It is more robust than a single train/test split, and cross_val_score is a one-liner. Currently excluded because it adds complexity without changing the core lesson—but worth revisiting.