In the previous lab you built a spam classifier. You extracted features, learned
their weights, and got strong results. Now we face a harder problem: classifying
handwritten digits.
The question is not just can the previous approach work here, but why it
might not—and what to do about it.
Part 1: The MNIST Problem
MNIST is the classic benchmark dataset of handwritten digits. It contains
70,000 grayscale images, each 28×28 pixels, labeled 0–9.
💻
Load and visualize some examples:
$ uv run python mnist.py --explore
This prints a few digits as ASCII art and shows the label distribution.
In the spam lab, you designed features like "contains_free" and "number of
exclamation marks." You knew what patterns to look for because you had read
hundreds of spam messages.
What features would you design for handwritten digits?
Part 2: Classic Classification Algorithms
Before neural networks, two algorithms dominated classification: K-Nearest
Neighbours and decision trees. Both are interpretable—you can explain exactly
why they made a given prediction.
K-Nearest Neighbours
KNN classifies a new example by finding its K nearest neighbors in the
training set and taking a majority vote of their labels. "Nearest" is measured
by Euclidean distance over the feature vector—which, for images, means treating
each pixel as a feature.
For MNIST, each image is 784 pixels. KNN compares a new image to every training
image (60,000 of them), finds the K closest, and votes.
💻
Train KNN on MNIST:
$ uv run python mnist.py --knn
Hyperparameter: K
💻
Try several values of K:
$ uv run python mnist.py --knn --k 1$ uv run python mnist.py --knn --k 5$ uv run python mnist.py --knn --k 15
Decision trees
A decision tree classifies examples by asking a sequence of yes/no questions
about features. For images, each question is of the form "Is pixel (row, col)
brighter than threshold T?" The tree chooses questions greedily to maximize
information gain.
💻
Train a decision tree:
$ uv run python mnist.py --tree$ uv run python mnist.py --tree --depth 5$ uv run python mnist.py --tree --depth 20
Visualizing the tree (small depth only):
$ uv run python mnist.py --tree --depth 5 --show-tree
This prints the first few levels of the decision tree, showing which pixels it
chose to split on and at what thresholds.
Part 3: Artificial Neural Networks
Neither KNN nor decision trees can match human-level accuracy on MNIST
(~98–99%). For that, we need a different architecture: the artificial neural
network.
Structure of a single neuron
A single perceptron takes several inputs, multiplies each by a weight, adds
them up (plus a bias term), and passes the result through an activation
function that produces the output.
The activation function introduces non-linearity. Without it, any number of
layers could be collapsed into a single linear transformation. Common choices:
ReLU: max(0, x) — outputs 0 for negative inputs, x for positive
Sigmoid: maps any value to (0, 1) — useful for binary outputs
Softmax: maps a vector to probabilities that sum to 1 — useful for
multi-class outputs
💻
Sketch (on paper or in your notebook):
A single perceptron with 5 inputs, weights, a bias, a ReLU activation,
and a single output.
Label the inputs, weights, bias, activation function, and output.
Multi-layer perceptron (MLP)
A multi-layer perceptron stacks multiple layers of neurons. Each layer's
outputs become the next layer's inputs. A typical architecture:
For MNIST, the input layer has 784 neurons (one per pixel). The output layer has
10 neurons (one per digit 0–9). The digit with the highest output value is the
prediction.
The network learns by backpropagation: for each training example, compute
the prediction error, then propagate that error backward through the layers,
adjusting each weight slightly to reduce the error. Repeat for thousands of
examples.
$ uv run python mnist.py --mlp --hidden 64 64$ uv run python mnist.py --mlp --hidden 256 128 64
Part 4: Convolutional Neural Networks
The MLP ignores the spatial structure of images. Pixels that are neighbors on
the digit image are not treated as neighbors in the input vector. A
convolutional neural network (CNN) is designed to exploit spatial structure.
Convolution: detecting local patterns
A convolutional layer applies a small filter (called a kernel) across
the image. The kernel slides over every position and computes a dot product at
each location, producing an activation map that shows where that pattern
appears.
For example, a horizontal-edge-detecting kernel might produce high values
wherever there is a horizontal line in the image, and low values elsewhere.
Input image (28×28) ↓Convolutional layer (32 filters, 3×3 kernel) ↓ produces 32 activation maps, each 26×26Pooling layer (2×2 max pooling) ↓ each map shrinks to 13×13Convolutional layer (64 filters, 3×3 kernel) ↓ produces 64 maps, each 11×11Pooling layer (2×2 max pooling) ↓ each map shrinks to 5×5Flatten ↓ 5×5×64 = 1600 valuesFully connected layer (128 neurons) ↓Output layer (10 neurons, softmax)
💻
Sketch (on paper) the CNN architecture above. Label:
The input layer (size)
Each convolutional layer (number of filters, kernel size)
Each pooling layer
The fully connected layer
The output layer
Pooling
After each convolutional layer, a pooling layer reduces the size of the
activation maps. Max pooling takes the maximum value in each small region
(typically 2×2). This makes the network more robust to small translations—if
the digit shifts by a pixel, the pooled output barely changes.
Stride controls how far the filter moves at each step. Stride 1 means it
moves one pixel at a time; stride 2 skips every other position, halving the
output size. (Stride 2 can replace max pooling in some architectures.)
Train a CNN
💻
Train the CNN:
$ uv run python mnist.py --cnn
Training takes longer than the MLP (typically 5–10 minutes on a laptop).
You can reduce epochs to get results faster, at some cost to accuracy: