Classification: Features

Lab setup

First, make sure you have completed the initial setup.

If you are part of a course

  1. Open Terminal. Run the update command to make sure you have the latest code.
    $ mwc update
  2. Move to this lab's directory.
    $ cd ~/Desktop/making_with_code/shuyuan/labs/classification_features
    

If you are working on your own

  1. Move to your MWC directory.
    $ cd ~/Desktop/making_with_code
    
  2. Get a copy of this lab's materials.
    $ git clone https://git.makingwithcode.org/mwc/classification_features.git

In this lab, you will build a machine learning system that classifies text messages as either spam or legitimate (called ham). Along the way you will learn how to clean data, design features, evaluate a classifier honestly, and let a machine learn feature weights automatically.

This lab follows the historical arc of how spam detection actually evolved: from hand-written rules, to hand-designed features with machine-learned weights.


Part 1: Exploring the Data

💻 Open spam.py and run it to load the dataset:

$ uv run python spam.py

You should see the first few messages and a count of spam vs. ham:

   label                                            message
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usher, then we'd ...

ham     4825
spam     747
Name: label, dtype: int64

Data cleaning

Before training any model, it is important to understand what the data actually contains. Open spam.py and look at the explore_data function.

💻 Uncomment each line in explore_data one at a time and run the script to answer these questions in spam_analysis.md:


Part 2: Evaluating a Classifier

Before we build anything sophisticated, we need to understand how to measure whether a classifier is actually good.

The problem with accuracy

Consider a simple strategy: classify every message as ham. On our dataset, that strategy is correct about 87% of the time, because 4825 of the 5572 messages (86.6%) are ham. But this "classifier" would miss every single spam message.

Accuracy (the fraction of predictions that are correct) is a poor metric when classes are imbalanced.
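
To make the numbers concrete, here is a small sketch that scores the always-ham strategy using the label counts printed in Part 1. The function name always_ham_scores is made up for illustration and is not part of spam.py.

```python
def always_ham_scores(num_ham, num_spam):
    """Score a 'classifier' that labels every message as ham."""
    total = num_ham + num_spam
    accuracy = num_ham / total    # every ham message is labeled correctly
    spam_caught = 0 / num_spam    # no spam message is ever flagged
    return accuracy, spam_caught


# Label counts from the Part 1 output
accuracy, spam_caught = always_ham_scores(num_ham=4825, num_spam=747)
print(f"Accuracy: {accuracy:.3f}")
print(f"Fraction of spam caught: {spam_caught:.3f}")
```

The accuracy looks respectable even though the strategy never catches any spam, which is why accuracy alone can mislead on imbalanced data.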

Precision, recall, and F1

We need two numbers, not one. A third, the F1 score, combines them into a single value:

Term       Question answered                                                  Formula
Precision  Of the messages we flagged as spam, what fraction actually were?   TP / (TP + FP)
Recall     Of all the actual spam messages, what fraction did we catch?       TP / (TP + FN)
F1 score   Harmonic mean of precision and recall                              2 × P × R / (P + R)

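Here TP (true positives) counts spam we flagged, FP (false positives) counts ham we flagged by mistake, and FN (false negatives) counts spam we missed. A high-precision spam filter almost never marks a legitimate message as spam. A high-recall spam filter catches almost all spam. These goals are in tension: being more aggressive catches more spam but also flags more ham.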
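
The three formulas can be sketched in a few lines of Python. The function name precision_recall_f1 and the counts below are invented for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: spam correctly flagged, fp: ham wrongly flagged, fn: spam missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: an aggressive filter and a cautious one
aggressive = precision_recall_f1(tp=90, fp=60, fn=10)
cautious = precision_recall_f1(tp=50, fp=5, fn=50)

for name, (p, r, f1) in [("aggressive", aggressive), ("cautious", cautious)]:
    print(f"{name:>10}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With these counts the aggressive filter has the higher recall and the cautious filter has the higher precision, which is the tension described above.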

Train/test split

We need to evaluate our classifier on data it has not seen during training. Otherwise, we would only know how well it memorized the training data, not whether it generalizes.

👁 Open spam.py and find the split_data function. Notice how it uses train_test_split to hold out 20% of the data for evaluation. Fixing the random_state parameter to the same value ensures everyone gets the same split.
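
If you are curious what a seeded split does, the following standard-library sketch shows the idea. It is not how train_test_split is implemented, and simple_split is a hypothetical helper; the seed argument plays the role that random_state plays in split_data.

```python
import random


def simple_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and hold out test_fraction of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same shuffle
    num_test = round(len(shuffled) * test_fraction)
    return shuffled[num_test:], shuffled[:num_test]


messages = [f"message {i}" for i in range(10)]
train, test = simple_split(messages)
train_again, test_again = simple_split(messages)

print("train:", train)
print("test: ", test)
print("same split both times:", test == test_again)
```

No message appears in both lists, so any score measured on the test list reflects messages the classifier never saw.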

Write a hand-coded classifier

💻 Implement the classify_by_rules function in spam.py. Your classifier should look at a message and return "spam" or "ham". Start with a few simple rules, then use evaluate to measure its performance.
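
As a starting point, a rule-based classifier might look like the sketch below. The keywords and the exclamation-mark threshold are guesses chosen for illustration, not the lab's intended answer; replace them with rules based on what you saw in the data.

```python
def classify_by_rules(message):
    """Return "spam" or "ham" using a few hand-written rules."""
    text = message.lower()
    # Rule 1: spam often advertises prizes or free offers
    if "free" in text or "prize" in text or "winner" in text:
        return "spam"
    # Rule 2: several exclamation marks suggest a sales pitch
    if message.count("!") >= 3:
        return "spam"
    # Default to ham, the more common class
    return "ham"


print(classify_by_rules("Free entry in 2 a wkly comp to win FA Cup"))
print(classify_by_rules("Ok lar... Joking wif u oni..."))
```

Rules are checked in order and the first match wins, so the order in which you write them matters.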

Example output from evaluate:

=== Hand-written rules ===
Accuracy:   0.908
Precision:  0.851
Recall:     0.380
F1:         0.525

Confusion matrix:
              Predicted ham  Predicted spam
Actual ham          955               10
Actual spam          93               57
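
The four cells of a confusion matrix come from comparing each prediction with the true label. The sketch below shows one way to tally them, treating spam as the positive class; confusion_counts and the tiny label lists are hypothetical and are not the lab's evaluate function.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Tally TP, FP, FN, TN for paired lists of actual and predicted labels."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1   # spam correctly flagged
        elif guess == positive:
            fp += 1   # ham wrongly flagged
        elif truth == positive:
            fn += 1   # spam that slipped through
        else:
            tn += 1   # ham correctly left alone
    return tp, fp, fn, tn


actual = ["ham", "spam", "spam", "ham", "ham", "spam"]
predicted = ["ham", "spam", "ham", "spam", "ham", "spam"]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

In the layout shown above, the rows are actual labels and the columns are predictions, so TN and FP sit in the "Actual ham" row and FN and TP sit in the "Actual spam" row.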

Part 3: Feature Engineering

Instead of making final decisions directly (spam or ham), we can extract numerical features from each message—measurable properties that might correlate with the class label.

Features let us separate two concerns: what to measure (feature engineering) and how to combine the measurements (learning).

Design your features

💻 Open spam.py and implement the extract_features function. It should take a message string and return a dictionary that maps feature names to numerical values. Start with the features already sketched out, then add at least three of your own:

def extract_features(message):
    text = message.lower()
    return {
        "contains_free": int("free" in text),
        "num_exclamations": message.count("!"),
        "length": len(message),
        # Add your features here
    }

Think about what makes spam messages distinctive. Some ideas:

Why features beat raw text

Each feature reduces a complex message to a single number. This transformation throws away a lot of information—but in a principled way. The features you choose express your hypothesis about what is predictive of spam.
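
To see the reduction in action, the sketch below extends the starter function with three extra features and applies it to two invented messages. The names has_url, num_capitals, and contains_call match the sample output in Part 4, but the way they are computed here is an assumption, not the lab's definition.

```python
import re


def extract_features(message):
    text = message.lower()
    return {
        "contains_free": int("free" in text),
        "num_exclamations": message.count("!"),
        "length": len(message),
        # Illustrative extras; these definitions are assumptions
        "has_url": int(bool(re.search(r"https?://|www\.", text))),
        "num_capitals": sum(ch.isupper() for ch in message),
        "contains_call": int("call" in text),  # also matches "recall", "called"
    }


spam_example = "FREE prize!!! Call now or visit www.example.com"
ham_example = "Ok, see you at lunch"

for msg in (spam_example, ham_example):
    print(extract_features(msg))
```

Each message becomes a short row of numbers. Anything the features do not measure, such as word order, is invisible to whatever model consumes them.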

Evaluate your features

💻 Run your feature-based classifier:

$ uv run python spam.py --features

Try adding and removing features and observe the effect on precision and recall.


Part 4: Learning Feature Weights

So far, you have chosen which features to use. Now let the computer learn how much weight to give each one.

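Logistic regression is a classification algorithm that learns a weight for each feature. A positive weight means the feature pushes predictions toward spam; a negative weight pushes toward ham. The magnitude of the weight reflects how strongly the feature influences the prediction, although magnitudes are only directly comparable between features measured on similar scales. A feature with large values, such as length, can matter even when its weight is small.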
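
The way a trained model turns weights into a prediction can be sketched as a weighted sum followed by the logistic (sigmoid) function. The weights and bias below are invented for illustration rather than learned from the dataset, and predict_spam_probability is a hypothetical helper.

```python
import math


def predict_spam_probability(features, weights, bias=0.0):
    """Combine features with weights, then squash the score into (0, 1)."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-score))


# Hypothetical weights; a real model learns these from training data
weights = {"contains_free": 3.0, "num_exclamations": 1.5, "length": 0.01}
bias = -4.0

spammy = {"contains_free": 1, "num_exclamations": 2, "length": 100}
plain = {"contains_free": 0, "num_exclamations": 0, "length": 30}

p_spammy = predict_spam_probability(spammy, weights, bias)
p_plain = predict_spam_probability(plain, weights, bias)
print(f"spammy message: {p_spammy:.3f}")
print(f"plain message:  {p_plain:.3f}")
```

A probability above 0.5 is usually read as a spam prediction. Training is the process of finding the weights and bias that make these probabilities agree with the labels in the training data.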

Train a logistic regression model

💻 Run logistic regression on your features:

$ uv run python spam.py --logreg

The output shows the learned weights alongside your feature names:

=== Logistic Regression (C=1.0) ===
Accuracy:  0.981
Precision: 0.962
Recall:    0.920
F1:        0.941

Feature weights (most influential first):
  contains_free        +3.24
  contains_call        +2.11
  num_exclamations     +1.87
  has_url              +1.53
  num_capitals         -0.03
  length               +0.01
  ...

Hyperparameter tuning

The C parameter controls regularization: how much the model is penalized for having large weights. C is the inverse of the regularization strength, so a smaller C means a stronger penalty. With very small C, weights are driven toward zero, which risks underfitting (the model ignores even informative features). With very large C, the penalty nearly vanishes and the model fits the training data as tightly as possible, which risks overfitting.
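
The effect of C on the learned weights can be seen in a toy experiment. The sketch below fits a single-feature logistic regression by plain gradient descent, assuming an L2 penalty in which C acts as the inverse of the regularization strength. It is a teaching sketch with invented data and a hypothetical fit_one_feature helper, not the solver that spam.py uses.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fit_one_feature(xs, ys, C, steps=2000):
    """Fit weight w and bias b by gradient descent.

    Minimizes  w**2 / (2 * C)  +  sum of log-losses, so a smaller C
    means a heavier penalty on the weight. The bias is not penalized.
    """
    w, b = 0.0, 0.0
    # Conservative step size that keeps gradient descent stable for any C
    lr = 1 / (1 / C + sum(x * x for x in xs) + len(xs))
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = w / C + sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data: number of exclamation marks per message, 1 = spam, 0 = ham
xs = [0, 0, 1, 0, 3, 4, 2, 5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

learned = {C: fit_one_feature(xs, ys, C) for C in (0.01, 1, 100)}
for C, (w, b) in learned.items():
    print(f"C={C:<6} weight={w:+.3f} bias={b:+.3f}")
```

With the smallest C the weight stays close to zero even though the feature separates the toy data perfectly. With the largest C the weight grows much larger because almost nothing holds it back.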

💻 Try several values of C and record the results:

$ uv run python spam.py --logreg --C 0.01
$ uv run python spam.py --logreg --C 0.1
$ uv run python spam.py --logreg --C 10
$ uv run python spam.py --logreg --C 100

Look at both training and test performance. When training accuracy is much higher than test accuracy, the model is overfitting—it has memorized the training examples rather than learning a general rule.