Classification: Networks

Lab setup

First, make sure you have completed the initial setup.

Open Terminal. Run the update command to make sure you have the latest code.
```
$ mwc update
```

Move to this lab's directory.

$ cd ~/Desktop/making_with_code/shuyuan/labs/classification_neural

Move to your MWC directory.
```
$ cd ~/Desktop/making_with_code
```

Get a copy of this lab's materials.

$ git clone https://git.makingwithcode.org/mwc/classification_neural.git

How this lab meets the learning objectives

This lab picks up immediately from the classification_features lab. Students bring what they learned there—feature engineering, logistic regression, evaluation metrics—and discover that those tools are insufficient for image classification. That failure motivates neural networks.

A4.3.2 — Classification techniques (Parts 1 and 2): Students apply K-Nearest Neighbours and decision trees to MNIST, comparing their performance with each other and with logistic regression. The comparison is grounded in the same evaluation framework (precision, recall, F1) established in the previous lab.

A4.3.3 — Hyperparameter tuning (throughout): K in KNN, max depth in decision trees, hidden layer sizes in MLP, kernel size and stride in CNN—each is a hyperparameter with a measurable effect on performance. Students tune at least two of these and document the results.

A4.3.8 — ANNs (Part 3): Students examine the architecture of a single perceptron and an MLP before running one. The perceptron sketch and MLP diagram activities ensure students understand the structure before treating it as a black box.

A4.3.9 — CNNs (Part 4): Students trace an image through convolutional and pooling layers before training a CNN. The comparison with MLP (same data, different architecture) makes the design choice concrete and measurable.

Pacing

Suggested pacing (8 class periods):

Period 1–2: Part 1 (MNIST exploration, feature approach fails)
Period 3–4: Part 2 (KNN and decision trees)
Period 5–6: Part 3 (MLP architecture and training)
Period 7–8: Part 4 (CNN architecture, training, comparison)

Parts 3 and 4 can run training jobs while students work on analysis questions. Training times on a laptop: MLP ~2 min, CNN ~5 min. On older hardware, reduce epochs to keep things moving.

In the previous lab you built a spam classifier. You extracted features, learned their weights, and got strong results. Now we face a harder problem: classifying handwritten digits.

The question is not just whether the previous approach can work here, but why it might not, and what to do about it.

Part 1: The MNIST Problem

MNIST is the classic benchmark dataset of handwritten digits. It contains 70,000 grayscale images (60,000 for training and 10,000 for testing), each 28×28 pixels, labeled 0–9.

💻 Load and visualize some examples:

$ uv run python mnist.py --explore

This prints a few digits as ASCII art and shows the label distribution.

Label: 5
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . # # # # # # # # # # . . . . . . . . . . .
. . . . . . # # # # # # # # # # # # . . . . . . . . . .
. . . . . . # # # # . . . . . . . . . . . . . . . . . .
...
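
To see how a grid of pixel values can become a picture like the one above, here is a minimal sketch. The function name to_ascii and the brightness threshold of 128 are assumptions made for illustration; mnist.py may render digits differently.

```python
def to_ascii(pixels, threshold=128):
    """Render a 2D grid of 0-255 brightness values as '#' (ink) and '.' (background)."""
    return "\n".join(
        " ".join("#" if value >= threshold else "." for value in row)
        for row in pixels
    )

# A tiny 3x3 "image" with a bright vertical stroke down the middle.
tiny = [
    [0, 200, 0],
    [0, 255, 0],
    [0, 180, 0],
]
print(to_ascii(tiny))
```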

Why hand-designed features struggle here

In the spam lab, you designed features like "contains_free" and "number of exclamation marks." You knew what patterns to look for because you had read hundreds of spam messages.

What features would you design for handwritten digits?

Part 2: Classic Classification Algorithms

Before neural networks, two algorithms dominated classification: K-Nearest Neighbours and decision trees. Both are interpretable—you can explain exactly why they made a given prediction.

K-Nearest Neighbours

KNN classifies a new example by finding its K nearest neighbors in the training set and taking a majority vote of their labels. "Nearest" is measured by Euclidean distance over the feature vector—which, for images, means treating each pixel as a feature.

For MNIST, each image is 784 pixels. KNN compares a new image to every training image (60,000 of them), finds the K closest, and votes.
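
The idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the code in mnist.py: the function names are made up here, and the "images" have 4 pixels instead of 784.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training examples."""
    # Pair every training example's distance to the query with its label.
    scored = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    scored.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in scored[:k]]
    # most_common(1) returns the label with the most votes.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training set: four 4-pixel "images" with labels 0 or 1.
train_X = [[0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
train_y = [0, 0, 1, 1]

print(knn_predict(train_X, train_y, [0, 0, 1, 0], k=3))
```

Notice that there is no training step: all the work happens at prediction time, which is why KNN slows down as the training set grows.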

💻 Train KNN on MNIST:

$ uv run python mnist.py --knn

Hyperparameter: K

💻 Try several values of K:

$ uv run python mnist.py --knn --k 1
$ uv run python mnist.py --knn --k 5
$ uv run python mnist.py --knn --k 15

Decision trees

A decision tree classifies examples by asking a sequence of yes/no questions about features. For images, each question is of the form "Is pixel (row, col) brighter than threshold T?" The tree chooses questions greedily to maximize information gain.
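
As a rough illustration of what "information gain" measures, the sketch below scores a single yes/no question about one pixel. The helper names and the sample data are hypothetical; a real tree repeats this for many pixels and thresholds and keeps the best question.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(pixel_values, labels, threshold):
    """Entropy reduction from asking: is this pixel brighter than `threshold`?"""
    yes = [lab for v, lab in zip(pixel_values, labels) if v > threshold]
    no = [lab for v, lab in zip(pixel_values, labels) if v <= threshold]
    if not yes or not no:
        return 0.0  # the question does not separate anything
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - weighted

# Brightness of one pixel across six images, with each image's label.
pixel_values = [0, 10, 20, 200, 220, 255]
labels = [1, 1, 1, 0, 0, 0]

good_split = information_gain(pixel_values, labels, threshold=128)
poor_split = information_gain(pixel_values, labels, threshold=5)
print(good_split, poor_split)
```

The threshold of 128 separates the two classes perfectly, so it earns the full 1 bit of gain; the threshold of 5 isolates only one image and earns much less.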

💻 Train a decision tree:

$ uv run python mnist.py --tree
$ uv run python mnist.py --tree --depth 5
$ uv run python mnist.py --tree --depth 20

Visualizing the tree (small depth only):

$ uv run python mnist.py --tree --depth 5 --show-tree

This prints the first few levels of the decision tree, showing which pixels it chose to split on and at what thresholds.

Part 3: Artificial Neural Networks

Neither KNN nor decision trees can match human-level accuracy on MNIST (~98–99%). For that, we need a different architecture: the artificial neural network.

Structure of a single neuron

A single perceptron takes several inputs, multiplies each by a weight, adds them up (plus a bias term), and passes the result through an activation function that produces the output.

Inputs:    x₁  x₂  x₃ ... xₙ
Weights:   w₁  w₂  w₃ ... wₙ
              ↓ ↓ ↓       ↓
           [sum + bias] → activation → output
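
The diagram translates almost directly into code. The sketch below is a minimal illustration with made-up inputs and weights; it uses ReLU, max(0, x), as the activation function.

```python
def relu(z):
    """ReLU activation: negative values become 0, positive values pass through."""
    return max(0.0, z)

def perceptron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.5]

print(perceptron(inputs, weights, bias=0.5))   # weighted sum 0.0, plus bias 0.5
print(perceptron(inputs, weights, bias=-1.0))  # total is -1.0, so ReLU outputs 0.0
```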

The activation function introduces non-linearity. Without it, any number of layers could be collapsed into a single linear transformation. Common choices:

ReLU: max(0, x) — outputs 0 for negative inputs, x for positive
Sigmoid: maps any value to (0, 1) — useful for binary outputs
Softmax: maps a vector to probabilities that sum to 1 — useful for multi-class outputs
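
Each of these three functions fits in a few lines. The versions below are a plain-Python sketch for experimenting; libraries provide their own optimized implementations.

```python
import math

def relu(x):
    """Outputs 0 for negative inputs, x for positive inputs."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    """Turns a list of scores into probabilities that sum to 1."""
    # Subtracting the largest score first avoids overflow and does not change the result.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Softmax preserves the ordering of the scores: the largest score always gets the largest probability.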

💻 Sketch (on paper or in your notebook):

A single perceptron with 5 inputs, weights, a bias, a ReLU activation, and a single output.
Label the inputs, weights, bias, activation function, and output.

Multi-layer perceptron (MLP)

A multi-layer perceptron stacks multiple layers of neurons. Each layer's outputs become the next layer's inputs. A typical architecture:

Input layer   Hidden layer 1   Hidden layer 2   Output layer
  784 neurons → 128 neurons  →  64 neurons    →  10 neurons

For MNIST, the input layer has 784 neurons (one per pixel). The output layer has 10 neurons (one per digit 0–9). The digit with the highest output value is the prediction.
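
To make the layer sizes concrete, here is a sketch of one forward pass through a 784 → 128 → 64 → 10 network with random, untrained weights. Everything in it (the helper names, the weight range, the random stand-in image) is an assumption for illustration and is not taken from mnist.py. Because the weights are untrained, the prediction is meaningless; the point is the shapes and the number of parameters.

```python
import random

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """An untrained layer: small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [784, 128, 64, 10]
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

image = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 digit
activations = image
for i, (weights, biases) in enumerate(layers):
    is_output = i == len(layers) - 1
    # Hidden layers use ReLU; the output layer keeps its raw scores.
    activations = dense(activations, weights, biases, identity if is_output else relu)

# The predicted digit is the index of the largest output value.
prediction = max(range(10), key=lambda digit: activations[digit])
n_params = sum(a * b + b for a, b in zip(sizes, sizes[1:]))
print(len(activations), prediction, n_params)
```

Counting one weight per connection plus one bias per neuron, this architecture has 109,386 learnable parameters.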

The network learns by backpropagation: for each training example, compute the prediction error, then propagate that error backward through the layers, adjusting each weight slightly to reduce the error. Repeat for thousands of examples.
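
Full backpropagation applies the chain rule through every layer, which takes more code than fits here. The sketch below shows only the core move, "adjust each weight slightly to reduce the error," for a single linear neuron on a made-up task. It is a simplification, not the training loop in mnist.py.

```python
def predict(weights, bias, inputs):
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def train_step(weights, bias, inputs, target, learning_rate=0.1):
    """One gradient-descent update for a linear neuron with squared error."""
    error = predict(weights, bias, inputs) - target
    # For loss = error^2 / 2, the gradient for weight i is error * x_i,
    # and the gradient for the bias is error.
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

def total_error(weights, bias, examples):
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in examples)

# Toy task: the targets follow y = 2*x1 - x2.
examples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
weights, bias = [0.0, 0.0], 0.0

before = total_error(weights, bias, examples)
for epoch in range(200):
    for inputs, target in examples:
        weights, bias = train_step(weights, bias, inputs, target)
after = total_error(weights, bias, examples)

print(before, after)
print(weights, bias)
```

The learned weights should end up close to 2 and -1. In a multi-layer network the same kind of update is applied to every weight, with the error signal passed backward from the output layer.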

💻 Train an MLP on MNIST:

$ uv run python mnist.py --mlp

You should see accuracy climb with each epoch:

Epoch 1/10  loss=0.521  val_accuracy=0.891
Epoch 2/10  loss=0.241  val_accuracy=0.930
Epoch 3/10  loss=0.183  val_accuracy=0.947
...
Epoch 10/10 loss=0.084  val_accuracy=0.975

💻 Try different hidden layer sizes:

$ uv run python mnist.py --mlp --hidden 64 64
$ uv run python mnist.py --mlp --hidden 256 128 64

Part 4: Convolutional Neural Networks

The MLP ignores the spatial structure of images. Pixels that are neighbors on the digit image are not treated as neighbors in the input vector. A convolutional neural network (CNN) is designed to exploit spatial structure.

Convolution: detecting local patterns

A convolutional layer applies a small filter (called a kernel) across the image. The kernel slides over every position and computes a dot product at each location, producing an activation map that shows where that pattern appears.

For example, a horizontal-edge-detecting kernel might produce high values wherever there is a horizontal line in the image, and low values elsewhere.
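
Here is a small sketch of that sliding dot product, using a 5×5 toy image and an illustrative horizontal-edge kernel. The kernel values are chosen by hand for the example; in a trained CNN they are learned. (Like most deep learning code, this does not flip the kernel, so strictly speaking it computes a cross-correlation.)

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding), taking a dot product at each position."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Dark (0) on top, bright (1) on the bottom two rows.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
# Responds when the row below the center is brighter than the row above it.
kernel = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

activation_map = convolve2d(image, kernel)
for row in activation_map:
    print(row)
```

The activation map is 0 where the window sees only dark pixels and 3 where the window straddles the dark-to-bright boundary.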

Input image (28×28)
       ↓
Convolutional layer (32 filters, 3×3 kernel)
       ↓  produces 32 activation maps, each 26×26
Pooling layer (2×2 max pooling)
       ↓  each map shrinks to 13×13
Convolutional layer (64 filters, 3×3 kernel)
       ↓  produces 64 maps, each 11×11
Pooling layer (2×2 max pooling)
       ↓  each map shrinks to 5×5
Flatten
       ↓  5×5×64 = 1600 values
Fully connected layer (128 neurons)
       ↓
Output layer (10 neurons, softmax)
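
The sizes in the diagram can be checked with a little arithmetic. This sketch assumes stride-1 convolutions with no padding and non-overlapping 2×2 pooling, which is what the numbers above imply; the helper names are made up for the example.

```python
def conv_output(size, kernel):
    """Side length after a stride-1 convolution with no padding."""
    return size - kernel + 1

def pool_output(size, window):
    """Side length after non-overlapping pooling; a leftover edge row or column is dropped."""
    return size // window

size, channels = 28, 1
layers = [("conv", 3, 32), ("pool", 2, None), ("conv", 3, 64), ("pool", 2, None)]

trace = []
for kind, k, filters in layers:
    if kind == "conv":
        size, channels = conv_output(size, k), filters
    else:
        size = pool_output(size, k)
    trace.append((kind, size, channels))
    print(f"{kind}: {channels} maps, each {size}x{size}")

flattened = size * size * channels
print("flatten:", flattened)
```

The second pooling layer turns 11×11 into 5×5 because 11 is odd: the last row and column have no partner and are discarded.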

💻 Sketch (on paper) the CNN architecture above. Label:

The input layer (size)
Each convolutional layer (number of filters, kernel size)
Each pooling layer
The fully connected layer
The output layer

Pooling

After each convolutional layer, a pooling layer reduces the size of the activation maps. Max pooling takes the maximum value in each small region (typically 2×2). This makes the network more robust to small translations—if the digit shifts by a pixel, the pooled output barely changes.
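
A minimal sketch of 2×2 max pooling, with a toy 4×4 activation map, shows the robustness to small shifts. The function name and the data are illustrative only.

```python
def max_pool(feature_map, window=2):
    """Keep the largest value in each non-overlapping window x window block."""
    rows = len(feature_map) // window
    cols = len(feature_map[0]) // window
    return [
        [
            max(
                feature_map[r * window + i][c * window + j]
                for i in range(window)
                for j in range(window)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

feature_map = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# The same map with the strong activation (9) moved one pixel down.
shifted = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

print(max_pool(feature_map))
print(max_pool(shifted))
```

Both calls give the same pooled map because the 9 stays inside the same 2×2 block. A shift that crosses a block boundary would change the output, so the robustness is partial, not absolute.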

Stride controls how far the filter moves at each step. Stride 1 means it moves one pixel at a time; stride 2 skips every other position, roughly halving each dimension of the output. (Stride 2 can replace max pooling in some architectures.)
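
To see the effect of stride, this sketch lists the starting positions a kernel visits along one dimension, assuming no padding. The helper name is hypothetical.

```python
def kernel_positions(size, kernel, stride):
    """Starting offsets a kernel visits along one dimension (no padding)."""
    return list(range(0, size - kernel + 1, stride))

print(kernel_positions(8, 3, stride=1))
print(kernel_positions(8, 3, stride=2))

# For a 28-pixel-wide MNIST image and a 3-wide kernel:
stride1_width = len(kernel_positions(28, 3, stride=1))
stride2_width = len(kernel_positions(28, 3, stride=2))
print(stride1_width, stride2_width)
```

With stride 1 the output is 26 wide; with stride 2 it is 13 wide, the same size that a stride-1 convolution followed by 2×2 pooling produces.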

Train a CNN

💻 Train the CNN:

$ uv run python mnist.py --cnn

Training takes longer than the MLP (typically 5–10 minutes on a laptop). You can reduce epochs to get results faster, at some cost to accuracy:

$ uv run python mnist.py --cnn --epochs 3

✅ CHECKPOINT 4

In mnist_analysis.md (Section 4):

Fill in the final comparison table with all classifiers you have tried:

Classifier	Hyperparameters	Test accuracy	F1	Notes
Hand features
KNN	K=
Decision tree	depth=
MLP	hidden=
CNN	filters=

Then answer the final questions:

Architecture comparison. The MLP and the CNN both process the same 784-pixel images, but the CNN reliably outperforms the MLP. What does the CNN know about images that the MLP does not?
Model selection. If you needed to deploy a digit classifier on a device with very limited memory and compute (e.g., a microcontroller), which algorithm would you choose, and why? (Consider model size, prediction speed, and accuracy.)
Real-world applications. CNNs are used for object detection, face recognition, and medical imaging. What properties of CNNs make them well suited for these applications?

Push your work:

$ mwc submit

Discussion prompts

On model selection:

KNN, decision trees, MLP, and CNN all solve the same problem—which one you should use depends on the context. What factors would lead you to choose each?
Modern image classifiers use CNNs with dozens of layers (ResNet, VGG). What challenges might arise when training a very deep network?

On interpretability:

Decision trees are interpretable: you can explain exactly why a prediction was made. CNNs are not. When does interpretability matter more than accuracy? (Medical diagnosis? College admissions? Hiring?)
What would it mean for a digit classifier to be unfair? Are some digits harder to classify than others? Are some writing styles less well represented in training data?

Connecting to A4.3.4 (clustering):

We've done supervised classification (labeled training data). What would unsupervised classification of handwritten digits look like? Could you group similar-looking digits without knowing the labels? (K-means clustering with pixel features.) Why might this be useful?

Lab reflection

Strengths:

The failure of hand-designed features in Part 1 is an important experience: students understand why more powerful methods are needed, not just that they exist.
The perceptron sketch is required by the IB standard and is more meaningful after students have trained a model.
The CNN architecture trace (sketch activity) makes the abstract architecture diagram concrete.
The comparison table in the final checkpoint makes model selection tangible.

Weaknesses and open questions:

Training times vary significantly by hardware. CNN training on older laptops may be prohibitively slow. Consider providing pre-trained model weights so students can evaluate without re-training.
KNN prediction on 60,000 training examples is slow (~2 min). An alternative is to subsample the training set (e.g., 10,000 examples) for speed, at some cost to accuracy.
The MLP treats pixels as independent—this is explicitly presented as a limitation. Some students may ask why we don't just normalize or rotate images to align them. This is a good discussion but can derail if not managed.
A4.3.4 (clustering) is listed in the original standards for this lab but is only touched on in the discussion prompts. Consider adding a short activity applying K-means to MNIST embeddings if time allows.
Open question: Should students train on the full 60,000-image training set or a subsample? Full set gives more realistic results but slower iteration. A 10,000-image subsample cuts training time by 6× at a cost of ~2–3% accuracy.