Classification: Networks

Lab setup

First, make sure you have completed the initial setup.

If you are part of a course

  1. Open Terminal. Run the update command to make sure you have the latest code.
    $ mwc update
  2. Move to this lab's directory.
    $ cd ~/Desktop/making_with_code/shuyuan/labs/classification_neural
    

If you are working on your own

  1. Move to your MWC directory.
    $ cd ~/Desktop/making_with_code
    
  2. Get a copy of this lab's materials.
    git clone https://git.makingwithcode.org/mwc/classification_neural.git

In the previous lab you built a spam classifier. You extracted features, learned their weights, and got strong results. Now we face a harder problem: classifying handwritten digits.

The question is not just can the previous approach work here, but why it might not—and what to do about it.


Part 1: The MNIST Problem

MNIST is the classic benchmark dataset of handwritten digits. It contains 70,000 grayscale images, each 28×28 pixels, labeled 0–9.

💻 Load and visualize some examples:

$ uv run python mnist.py --explore

This prints a few digits as ASCII art and shows the label distribution.

Label: 5
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . # # # # # # # # # # . . . . . . . . . . .
. . . . . . # # # # # # # # # # # # . . . . . . . . . .
. . . . . . # # # # . . . . . . . . . . . . . . . . . .
...

Why hand-designed features struggle here

In the spam lab, you designed features like "contains_free" and "number of exclamation marks." You knew what patterns to look for because you had read hundreds of spam messages.

What features would you design for handwritten digits?


Part 2: Classic Classification Algorithms

Before neural networks, two algorithms dominated classification: K-Nearest Neighbours and decision trees. Both are interpretable—you can explain exactly why they made a given prediction.

K-Nearest Neighbours

KNN classifies a new example by finding its K nearest neighbors in the training set and taking a majority vote of their labels. "Nearest" is measured by Euclidean distance over the feature vector—which, for images, means treating each pixel as a feature.

For MNIST, each image is 784 pixels. KNN compares a new image to every training image (60,000 of them), finds the K closest, and votes.

💻 Train KNN on MNIST:

$ uv run python mnist.py --knn

Hyperparameter: K

💻 Try several values of K:

$ uv run python mnist.py --knn --k 1
$ uv run python mnist.py --knn --k 5
$ uv run python mnist.py --knn --k 15

Decision trees

A decision tree classifies examples by asking a sequence of yes/no questions about features. For images, each question is of the form "Is pixel (row, col) brighter than threshold T?" The tree chooses questions greedily to maximize information gain.

💻 Train a decision tree:

$ uv run python mnist.py --tree
$ uv run python mnist.py --tree --depth 5
$ uv run python mnist.py --tree --depth 20

Visualizing the tree (small depth only):

$ uv run python mnist.py --tree --depth 5 --show-tree

This prints the first few levels of the decision tree, showing which pixels it chose to split on and at what thresholds.


Part 3: Artificial Neural Networks

Neither KNN nor decision trees can match human-level accuracy on MNIST (~98–99%). For that, we need a different architecture: the artificial neural network.

Structure of a single neuron

A single perceptron takes several inputs, multiplies each by a weight, adds them up (plus a bias term), and passes the result through an activation function that produces the output.

Inputs:    x₁  x₂  x₃ ... xₙ
Weights:   w₁  w₂  w₃ ... wₙ
              ↓ ↓ ↓       ↓
           [sum + bias] → activation → output

The activation function introduces non-linearity. Without it, any number of layers could be collapsed into a single linear transformation. Common choices:

💻 Sketch (on paper or in your notebook):

  1. A single perceptron with 5 inputs, weights, a bias, a ReLU activation, and a single output.
  2. Label the inputs, weights, bias, activation function, and output.

Multi-layer perceptron (MLP)

A multi-layer perceptron stacks multiple layers of neurons. Each layer's outputs become the next layer's inputs. A typical architecture:

Input layer   Hidden layer 1   Hidden layer 2   Output layer
  784 neurons → 128 neurons  →  64 neurons    →  10 neurons

For MNIST, the input layer has 784 neurons (one per pixel). The output layer has 10 neurons (one per digit 0–9). The digit with the highest output value is the prediction.

The network learns by backpropagation: for each training example, compute the prediction error, then propagate that error backward through the layers, adjusting each weight slightly to reduce the error. Repeat for thousands of examples.

💻 Train a MLP on MNIST:

$ uv run python mnist.py --mlp

You should see accuracy climb with each epoch:

Epoch 1/10  loss=0.521  val_accuracy=0.891
Epoch 2/10  loss=0.241  val_accuracy=0.930
Epoch 3/10  loss=0.183  val_accuracy=0.947
...
Epoch 10/10 loss=0.084  val_accuracy=0.975

💻 Try different hidden layer sizes:

$ uv run python mnist.py --mlp --hidden 64 64
$ uv run python mnist.py --mlp --hidden 256 128 64

Part 4: Convolutional Neural Networks

The MLP ignores the spatial structure of images. Pixels that are neighbors on the digit image are not treated as neighbors in the input vector. A convolutional neural network (CNN) is designed to exploit spatial structure.

Convolution: detecting local patterns

A convolutional layer applies a small filter (called a kernel) across the image. The kernel slides over every position and computes a dot product at each location, producing an activation map that shows where that pattern appears.

For example, a horizontal-edge-detecting kernel might produce high values wherever there is a horizontal line in the image, and low values elsewhere.

Input image (28×28)

Convolutional layer (32 filters, 3×3 kernel)
       ↓  produces 32 activation maps, each 26×26
Pooling layer (2×2 max pooling)
       ↓  each map shrinks to 13×13
Convolutional layer (64 filters, 3×3 kernel)
       ↓  produces 64 maps, each 11×11
Pooling layer (2×2 max pooling)
       ↓  each map shrinks to 5×5
Flatten
       ↓  5×5×64 = 1600 values
Fully connected layer (128 neurons)

Output layer (10 neurons, softmax)

💻 Sketch (on paper) the CNN architecture above. Label:

Pooling

After each convolutional layer, a pooling layer reduces the size of the activation maps. Max pooling takes the maximum value in each small region (typically 2×2). This makes the network more robust to small translations—if the digit shifts by a pixel, the pooled output barely changes.

Stride controls how far the filter moves at each step. Stride 1 means it moves one pixel at a time; stride 2 skips every other position, halving the output size. (Stride 2 can replace max pooling in some architectures.)

Train a CNN

💻 Train the CNN:

$ uv run python mnist.py --cnn

Training takes longer than the MLP (typically 5–10 minutes on a laptop). You can reduce epochs to get results faster, at some cost to accuracy:

$ uv run python mnist.py --cnn --epochs 3