Reinforcement Learning

Lab setup

First, make sure you have completed the initial setup.

Open Terminal. Run the update command to make sure you have the latest code.
```
$ mwc update
```

Move to this lab's directory.

$ cd ~/Desktop/making_with_code/shuyuan/labs/reinforcement_learning

Move to your MWC directory.
```
$ cd ~/Desktop/making_with_code
```

Get a copy of this lab's materials.

$ git clone https://git.makingwithcode.org/mwc/reinforcement_learning.git

How this lab meets the learning objectives

This lab gives students hands-on experience with reinforcement learning and neural networks through two complementary activities: implementing Q-learning from scratch for a tiny game, then observing and interpreting a series of training experiments on a more complex game.

A4.3.6 — Reinforcement learning fundamentals (core to the lab): Section 1 introduces agent–environment interaction, cumulative reward, and the exploration/exploitation trade-off concretely through Q-learning. Section 2 reinforces these concepts through interpretation of real training evidence.

A4.3.8 — ANNs and A4.3.9 — CNNs (authentic context): The Q-network used in Sections 2 and 3 is a deep neural network. The "Training Snake" narrative compares MLP and CNN architectures in terms of their fitness for different observation structures. Teaching notes throughout Section 2 point to where these topics arise. Direct instruction on ANN/CNN architecture should accompany or precede the lab.

A4.2.2 — Feature selection and A4.2.3 — Dimensionality reduction (authentic context): The central lesson of "Training Snake" is that what the agent sees matters as much as how it learns. The shift from full-board to explicit features, and from full-board to egocentric view, are directly about feature selection and dimensionality reduction. Teaching notes in Section 2 connect each experiment to these concepts.

A4.3.3 — Hyperparameter tuning (core to Section 3): Section 3 asks students to run their own training experiments and document the effect of hyperparameter changes. The snake case studies in Section 2 model this reasoning process.

A4.3.10 — Model selection and comparison (authentic context): The snake case studies compare architectures (MLP vs. CNN), observation designs (board only vs. board + features vs. egocentric), and hyperparameters (learning rate, epsilon decay, loss function). The conceptual questions ask students to reason about these trade-offs.

Pacing and integration

The lab is designed to complement—not substitute for—direct instruction. Definitions of key terms (Q-table, Bellman equation, exploration rate, MLP, CNN, etc.) are provided in context, but students benefit from class discussion before and after each section. The checkpoints create natural stopping points for whole-class discussion.

Suggested pacing (8 class periods):

Periods 1–3: Section 1 (play BabySnake, learn Q-table, implement Q-learning)
Periods 4–5: Section 2 (read case studies, watch checkpoints, answer questions)
Periods 6–8: Section 3 (set up forager training, experiment, document)

Training in Section 3 takes 10–30 minutes per run depending on hardware. Students can start a run, work on analysis questions while it runs, then check results.

In this lab, you will train a computer to play games—without telling it the rules. Instead, you will set up a system where the computer tries things, observes what happens, and gradually learns which actions lead to better outcomes. This is called reinforcement learning.

You have probably already seen how machine learning models learn to classify data. Reinforcement learning is different: there is no labeled dataset, no right answer to compare against. The agent learns entirely from the consequences of its own actions.

Part 1: Training BabySnake

Play the game first

💻 Run the BabySnake game:

$ uv run python -m babysnake

BabySnake is a simple game on a 4×4 grid. You control the @ character and try to collect the * food. Use the arrow keys to move. The game ends when your energy runs out.

How Q-learning works

The reasoning you just wrote down is, in some form, a policy: a rule that maps situations to actions. In reinforcement learning, the goal is to discover a good policy automatically.

One way to represent a policy is a Q-table. A Q-table stores an estimated value—called a Q-value—for every possible (state, action) pair. The Q-value for (state, action) estimates how much total future reward the agent expects if it takes that action in that state and then continues to act optimally.

For BabySnake, the state is a tuple of four integers: (agent_x, agent_y, food_x, food_y).

On a 4×4 board with 4 possible actions, the complete Q-table has at most 240 × 4 = 960 entries—small enough to print and read.
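
The count of 240 states can be checked by enumeration. The sketch below assumes the agent and the food never occupy the same cell, and it uses placeholder action names; neither detail is taken from the lab's code.

```python
from itertools import product

BOARD_SIZE = 4
# Placeholder action names for illustration; the lab's ACTIONS may differ.
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

cells = list(product(range(BOARD_SIZE), repeat=2))  # 16 (x, y) cells

# Assumption: agent and food are always on different cells.
states = [
    (ax, ay, fx, fy)
    for (ax, ay) in cells
    for (fx, fy) in cells
    if (ax, ay) != (fx, fy)
]

print(len(states))                 # 16 * 15 = 240
print(len(states) * len(ACTIONS))  # 240 * 4 = 960
```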

At the start of training, every Q-value is 0. The agent updates Q-values as it collects experience, using the Bellman equation:

Q(s, a) ← Q(s, a) + α · ( r + γ · max_a' Q(s', a') − Q(s, a) )

In words: the new Q-value is the old one plus a correction. The correction is proportional to the temporal difference (TD) error: the difference between the old estimate and a better estimate based on what actually happened.

r is the reward received (e.g., +1 for food, −0.01 per step)
γ (gamma) is the discount factor: how much to value future rewards relative to immediate ones
α (alpha) is the learning rate: how large an update to make each time
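
To get a feel for the size of one correction, here is the update applied to a single made-up transition. All numbers are illustrative and not taken from BabySnake, and `td_update` is a hypothetical helper name rather than a function from the lab.

```python
def td_update(q_old, reward, best_next_q, alpha, gamma):
    """Return the updated Q-value for one observed transition."""
    target = reward + gamma * best_next_q  # better estimate: r + γ · max_a' Q(s', a')
    td_error = target - q_old              # how far off the old estimate was
    return q_old + alpha * td_error        # move a fraction α toward the target


# Illustrative values only.
new_q = td_update(q_old=0.2, reward=1.0, best_next_q=0.4, alpha=0.5, gamma=0.9)
print(f"{new_q:.2f}")  # 0.78
```

With α = 0.5 the estimate moves halfway from 0.2 toward the target of 1.36. A smaller α would move it less per update.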

Manual Q-update exercise

Work through this update by hand before implementing it in code.

Situation: The board is 4×4. The agent is at position (2, 2). Food is at (3, 3).

. . . .
. . . .
. . @ .
. . . *

All Q-values are currently 0, except:

State	Action	Q-value
(3, 2, 3, 3)	DOWN	0.5

(This means the agent has already learned that from position (3,2) with food at (3,3), moving DOWN is promising.)

The agent takes action RIGHT. It moves from (2,2) to (3,2). It does not land on food. Reward: r = −0.01.

New state: (3, 2, 3, 3). Use α = 0.1, γ = 0.95.

Calculate the new Q-value for ((2, 2, 3, 3), RIGHT).

Exploration vs. exploitation

During training, the agent faces a dilemma at every turn: should it exploit what it already knows (choose the action with the highest Q-value), or explore by trying something new (maybe there is a better action it has not discovered yet)?

We handle this with an epsilon-greedy policy:

With probability ε (epsilon), take a random action (explore)
With probability 1 − ε, take the best known action (exploit)

At the start of training, ε is high (e.g., 1.0: always random). As training progresses, ε decays toward a small floor (e.g., 0.05). This schedule lets the agent explore widely early on and exploit what it has learned later.
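
A minimal sketch of this policy and its decay schedule follows. The function names, the action-value dictionary, and the decay factor of 0.995 per episode are assumptions for illustration; the lab's q_learning.py may organize this differently.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Pick an action from a dict mapping action -> estimated value."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))        # explore
    return max(action_values, key=action_values.get)  # exploit


def decayed_epsilon(episode, start=1.0, decay=0.995, floor=0.05):
    """Multiplicative decay per episode, never dropping below the floor."""
    return max(floor, start * decay ** episode)


rng = random.Random(0)
values = {"UP": 0.1, "DOWN": 0.5, "LEFT": 0.0, "RIGHT": 0.2}

print(epsilon_greedy(values, epsilon=0.0, rng=rng))  # always "DOWN" when epsilon is 0
for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 3))
```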

Implement Q-learning

💻 Open q_learning.py. You need to implement two functions:

choose_action(q_table, state, epsilon)

With probability epsilon, return a random action from ACTIONS. Otherwise, return the action with the highest Q-value for state. Use q_table.get((state, action), 0.0) to look up a Q-value (defaulting to 0 if the pair has not been seen before).

update_q(q_table, state, action, reward, next_state, alpha, gamma)

Apply the Bellman equation to update q_table[(state, action)] in place. The right-hand side needs the best Q-value from next_state—use a list comprehension over ACTIONS to find it.

The rest of the training loop is already written for you. Once your two functions are implemented, run:

$ uv run python q_learning.py

You should see output like:

Episode   100  reward=  -2.1  score=0  epsilon=0.605  q_entries=36
Episode   200  reward=   1.4  score=1  epsilon=0.366  q_entries=88
Episode   500  reward=   3.8  score=4  epsilon=0.082  q_entries=204
Episode  1000  reward=   5.1  score=5  epsilon=0.050  q_entries=287

The reward will be negative early (step penalties accumulate) and should improve as the agent learns. score counts how many food items were collected in the last episode.

✅ CHECKPOINT 2

Train your agent to consistently score 3 or more food items per episode. Then watch it play:

$ uv run python -c "from q_learning import watch; watch()"

(Run q_learning.py first to train; the watch() function also trains if needed.)

In your group, discuss:

At what episode did the agent start reliably finding food?
Print out q_table after training. Can you read the policy? For a given state, does the highest Q-value point toward the food?
How does the trained agent's behavior compare to the reasoning you wrote down at the beginning?

Part 2: Training Snake

BabySnake has only 240 states, so its Q-table needs at most 960 entries and fits in a few kilobytes of memory. Now consider the original Snake game on a 32×16 board: the state space is enormous (trillions of possible board configurations), the agent grows a tail that it can collide with, and the optimal strategy is far more complex.

A Q-table cannot scale to this. We need a Q-network: a neural network that approximates the Q-function. Instead of looking up a value in a table, the network takes the current observation and predicts Q-values for all actions.
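
The idea can be sketched with a tiny fully connected network written in plain Python. The layer sizes, random weights, and function names are hypothetical. The lab's Q-network is built and trained by retro-gamer, and this sketch only shows the shape of the computation: an observation goes in, one Q-value per action comes out.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in outputs] if relu else outputs


def q_values(observation, hidden_layer, output_layer):
    """Forward pass: observation vector -> one Q-value per action."""
    hidden = dense(observation, hidden_layer, relu=True)
    return dense(hidden, output_layer, relu=False)


rng = random.Random(0)
n_inputs, n_hidden, n_actions = 8, 16, 4  # illustrative sizes
hidden_layer = make_layer(n_inputs, n_hidden, rng)
output_layer = make_layer(n_hidden, n_actions, rng)

observation = [rng.random() for _ in range(n_inputs)]
q = q_values(observation, hidden_layer, output_layer)
best_action = max(range(n_actions), key=lambda a: q[a])
print(len(q), best_action)
```

The weights here are random, so the predicted Q-values are meaningless until training adjusts them.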

This is Deep Q-Learning (DQN). Training it is more subtle than Q-learning. This section walks through five training experiments that led to a working agent—what we tried, what went wrong, and how we fixed it.

Connecting to learning objectives

Each subsection below is organized around a concept that connects to the learning objectives. As students read, draw their attention to:

Feature selection (A4.2.2): Attempts 1 and 2 — why explicit features matter more than raw input size.
Dimensionality reduction (A4.2.3): Attempt 4 — egocentric view as a principled way to reduce input size while preserving relevant information.
Hyperparameter tuning (A4.3.3): Attempts 3 and 5 — learning rate, loss function, and epsilon decay as levers with measurable effects.
Exploration/exploitation (A4.3.6): Attempt 5 — epsilon decay rate and its effect on when the agent commits to a policy.
ANNs / CNNs (A4.3.8, A4.3.9): Attempt 4 — MLP vs. CNN in the context of a spatially-structured observation.
Model selection (A4.3.10): The overall narrative — systematic experimentation and documented reasoning.

The snake game has a simple reward structure: +50 for eating an apple, +1 for each step toward the apple and −1 for each step away (incentivizing approach), and −10 for dying. The agent also has energy that depletes each step and refills when it eats; running out of energy ends the game.

Attempt 1: Can the network see the apple?

Hypothesis. A CNN processes 2D spatial inputs efficiently. If we feed the agent the raw game board, the CNN should be able to detect where the apple is and learn to navigate toward it.

Setup.

Full 32×16 board (3,072 numbers)
CNN architecture (spatial = true)
No explicit direction-to-apple features
45,000 training episodes

Evidence.

[ep_0100]  avg_reward=-9.5   avg_steps=48   epsilon=0.905  avg_loss=9.2
[ep_0500]  avg_reward=-8.7   avg_steps=108  epsilon=0.606  avg_loss=43.6
[ep_2000]  avg_reward=-9.3   avg_steps=133  epsilon=0.135  avg_loss=10.3
[ep_5000]  avg_reward=-8.9   avg_steps=134  epsilon=0.050  avg_loss=9.6
...
[ep_45700] avg_reward=-9.3   avg_steps=130  epsilon=0.050  avg_loss=8.7

What happened. The agent learned to survive—avg_steps grew from 48 to ~130—but the reward stayed flat and negative through the entire run. After 6 hours and 45,000 episodes, the agent was wandering the board, avoiding walls, but never reliably finding the apple.

Why. The full board gives the agent 3,072 numbers as input. Somewhere in those numbers is information about where the apple is, but it is deeply implicit: the agent has to figure out which numbers change when the apple moves and build a spatial representation of the board from scratch. The reward signal (+50 when the snake happens to reach the apple, after potentially hundreds of random steps) is far too sparse to guide that learning.

Attempt 2: Give the agent a compass

Hypothesis. The board encoding buries the apple's location in 3,072 numbers. What if we added two features that directly encode the direction to the apple?

Setup. Added two values to the observation:

apple_dx = (apple_x − head_x) / board_width
apple_dy = (apple_y − head_y) / board_height

These are positive when the apple is to the right or below, negative when it is to the left or above, and zero when directly in line.
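
As a sketch, the two features can be computed as follows. The function name and the example coordinates are hypothetical; only the two formulas and the 32×16 board size come from the text.

```python
def apple_direction(head, apple, board_width, board_height):
    """Return (apple_dx, apple_dy), each normalized by the board size."""
    head_x, head_y = head
    apple_x, apple_y = apple
    apple_dx = (apple_x - head_x) / board_width
    apple_dy = (apple_y - head_y) / board_height
    return apple_dx, apple_dy


# Apple 8 columns to the right and 4 rows above the head (y grows downward).
print(apple_direction(head=(10, 8), apple=(18, 4), board_width=32, board_height=16))
# (0.25, -0.25)
```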

What happened. Within hundreds of episodes, the agent began making positive progress. The first checkpoint with reliably positive reward appeared around episode 400—compared to nothing after 45,000 episodes in Attempt 1.

Why. Two features replaced thousands of implicit ones. The agent no longer needed to discover the spatial structure of the board from scratch. A direct signal pointing toward the goal gave the network something it could connect to the reward.

Attempt 3: Diagnosing runaway loss

Hypothesis. Training with explicit features is working. Let's see how far it gets.

Early results (with features added, initial settings).

[ep_0300]  avg_loss=48.7    avg_reward=+8.1
[ep_0500]  avg_loss=347     avg_reward=+12.4
[ep_0700]  avg_loss=4,102   avg_reward=+6.5
[ep_1100]  avg_loss=686,000 avg_reward=-3.1

Training started promisingly, then the loss exploded and performance collapsed.

What happened. The loss grew without bound, a phenomenon called Q-value divergence. The apple gives +50 reward. With a learning rate of 0.001 and MSE (mean squared error) loss, large rewards pushed Q-values high. High Q-values created large TD errors (the difference between predicted and target Q-values). MSE loss squares those errors, so its gradient grows in proportion to the error with no upper limit. Large gradient updates pushed Q-values even higher, creating a feedback loop.

Fix. Two changes stabilized training:

Huber loss instead of MSE. Huber loss behaves like MSE for small errors but becomes linear for large ones, capping the gradient. This breaks the feedback loop.
Lower learning rate: 0.001 → 0.0001. Smaller updates give the target network time to stabilize before the online network chases a new target.
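
The difference between the two losses shows up in their gradients. The sketch below uses the standard Huber definition with a threshold delta of 1.0, which is an assumed value; the lab's training code may use a different threshold or a library implementation.

```python
def mse_grad(error):
    """Gradient of 0.5 * error**2 with respect to the error."""
    return error


def huber_grad(error, delta=1.0):
    """Gradient of the Huber loss: equal to the error near zero, clipped to ±delta beyond it."""
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta


for td_error in (0.5, 5.0, 50.0, 500.0):
    print(td_error, mse_grad(td_error), huber_grad(td_error))
```

The squared-error gradient keeps growing with the TD error, while the Huber gradient stays at delta once the error exceeds it.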

Attempt 4: Zooming in — the egocentric view

Hypothesis. The full 32×16 board is large (3,072 inputs). The snake only needs to know what is nearby. What if we cropped the observation to a window centered on the snake's head?

Setup. Instead of the full board, the agent sees a 17×17 crop centered on the snake's head—wherever the snake happens to be. Areas outside the board are filled with empty space.

Full board: 32 × 16 × 6 = 3,072 inputs
Egocentric 17×17 crop: 17 × 17 × 6 = 1,734 board inputs + 2 state = 1,736 total

Two benefits of the egocentric view.

First, smaller input: 1,736 numbers instead of 3,072. The network is simpler, trains faster, and generalizes better.

Second, position invariance: the snake's head is always at the center of its own observation. A wall to the left looks the same whether the snake is at position (3,5) or (28,12). The network does not need to relearn the same spatial relationships at every board location.
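
A minimal sketch of the cropping step, using a single-channel board stored as a list of rows. The function name, the "." fill value, and the 5×5 window are illustrative choices; the lab uses a 17×17 window over a 6-channel encoding. The window size is assumed to be odd so that the head sits in the exact center.

```python
def egocentric_crop(board, head, size, fill="."):
    """Return a size x size window of the board centered on the head.

    Cells that fall outside the board are filled with `fill`.
    """
    head_x, head_y = head
    half = size // 2
    height, width = len(board), len(board[0])
    window = []
    for y in range(head_y - half, head_y + half + 1):
        row = []
        for x in range(head_x - half, head_x + half + 1):
            inside = 0 <= x < width and 0 <= y < height
            row.append(board[y][x] if inside else fill)
        window.append(row)
    return window


board = [list(row) for row in [
    "........",
    ".@......",
    "......*.",
    "........",
]]

for row in egocentric_crop(board, head=(1, 1), size=5):
    print("".join(row))
```

Wherever the head is on the board, it appears at the middle cell of the returned window.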

With an egocentric crop, the full-board CNN is no longer needed. We used a flat MLP (spatial = false) that treats the 1,736 inputs as a single vector. The egocentric window already encodes local spatial context; additional convolutions over the full board are not necessary.

Attempt 5: Teaching the agent to explore

Hypothesis. Exploration rate (epsilon) should decay slowly enough to give the agent meaningful experience before it commits.

The problem with fast decay. With epsilon_decay = 0.995, epsilon falls from 1.0 to about 0.10 by episode 450 and reaches the 0.05 floor near episode 600. By episode 450 the agent is already acting greedily about 90% of the time, but the Q-network has barely trained. It commits to whatever policy it happened to discover early, which may be far from optimal.

epsilon after 450 episodes (decay=0.995): 0.995^450 ≈ 0.10
epsilon after 450 episodes (decay=0.9997): 0.9997^450 ≈ 0.87

Fix. With epsilon_decay = 0.9997, epsilon is still 0.55 at episode 2,000. The agent keeps exploring well into training, discovering better strategies before committing.
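
These figures can be checked directly. The sketch assumes epsilon starts at 1.0 and is multiplied by epsilon_decay once per episode until it reaches the 0.05 floor. That assumption is consistent with the logged epsilon values but is not confirmed from the training code, and the function names are hypothetical.

```python
import math


def epsilon_at(episode, decay, start=1.0, floor=0.05):
    """Epsilon after a given number of episodes of multiplicative decay."""
    return max(floor, start * decay ** episode)


def episodes_to_floor(decay, start=1.0, floor=0.05):
    """Smallest whole number of episodes after which epsilon has reached the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


for decay in (0.995, 0.9997):
    print(
        decay,
        round(epsilon_at(450, decay), 2),
        round(epsilon_at(2000, decay), 2),
        episodes_to_floor(decay),
    )
```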

The successful run

With all five improvements in place, training produced a genuinely competent snake agent. Here is the training log:

[ep_0100]  avg_reward=-5.3   avg_steps=50   epsilon=0.970  avg_loss=0.9
[ep_0400]  avg_reward=+9.7   avg_steps=61   epsilon=0.887  avg_loss=1.7
[ep_1100]  avg_reward=+34.5  avg_steps=57   epsilon=0.719  avg_loss=4.6
[ep_1800]  avg_reward=+4.4   avg_steps=98   epsilon=0.583  avg_loss=5.1
[ep_3800]  avg_reward=+51.2  avg_steps=33   epsilon=0.320  avg_loss=1.2
[ep_5400]  avg_reward=+83.5  avg_steps=43   epsilon=0.198  avg_loss=2.0
[ep_9000]  avg_reward=+246.0 avg_steps=85   epsilon=0.067  avg_loss=6.4
[ep_13000] avg_reward=+375.6 avg_steps=107  epsilon=0.050  avg_loss=5.4
[ep_17100] avg_reward=+288.3 avg_steps=86   epsilon=0.050  avg_loss=4.9

The learning curve has a characteristic shape:

Exploration (ep 0–300): reward negative, agent mostly random
First breakthroughs (ep 400–1100): agent starts finding the apple
Consolidation dip (ep 1500–2300): reward falls as the agent refines its strategy—a normal phase of reorganization
Efficiency breakthrough (ep 3700+): episodes suddenly shorten (avg_steps drops to 33); the agent has learned to reach the apple quickly
Maturation (ep 8000+): longer episodes, higher reward, complex strategy

✅ CHECKPOINT 3

Watch the trained agent at three points in its learning. Run each command and observe the agent's behavior for at least 30 seconds:

$ retro-gamer play runs/snake --checkpoint ep_1100
$ retro-gamer play runs/snake --checkpoint ep_5400
$ retro-gamer play runs/snake --checkpoint ep_17100

Press Enter or Escape to quit each one.

For each checkpoint, write two sentences describing what the agent is doing. Think about: Does it seem to know where the apple is? Does it die often? Does it have a strategy, or does it seem random?

✅ CHECKPOINT 4

Complete the questions in snake_training.md and push your answers:

$ mwc submit

Part 3: Training Forager

Now it is your turn. The forager/ directory contains a game called Forager: an agent on an 8×8 grid that collects food that respawns when eaten. The rules are simple, but the 8×8 state space (4,032 distinct (agent, food) positions) is too large for a Q-table—you need a neural network.

💻 Play Forager to understand the game:

$ uv run python -m forager

Use arrow keys to move @ to the food *. Press Enter or Escape to quit.

Setting up a training run

💻 Create your first training run:

$ retro-gamer create --game forager --output runs/forager/

This creates runs/forager/config.toml. Open it and add observe_state to the [preprocessing] section so the agent can see the direction to the food:

[preprocessing]
spatial = false
board = true
observe_state = ["food_dx", "food_dy"]

Then start training:

$ retro-gamer train runs/forager/

A progress bar will show how training is going. Training 20,000 episodes takes about 10–20 minutes. You can stop training at any time with Ctrl-C and resume it later.

Watch your agent play at any point:

$ retro-gamer play runs/forager/

Document your experiments

Open training_log.md and fill in Attempt 1 before you start training: write your hypothesis—what do you think will happen with the default settings? After training, fill in the evidence and analysis.

Then try at least one more configuration. Some things worth experimenting with:

epsilon_decay — how quickly does the agent commit to what it has learned?
learning_rate — too high causes instability; too low is slow
hidden_sizes in [model] — larger networks can represent more complex policies
board = false with observe_state = ["food_dx", "food_dy"] — what if the agent sees only the direction to food, with no board at all?

When you change hidden_sizes or any [preprocessing] option, run retro-gamer clean runs/forager/ before retraining.

✅ CHECKPOINT 5

When you have trained at least two configurations:

Complete training_log.md — hypothesis, evidence, and analysis for each attempt, and the final analysis section.
Answer the questions at the end of the log about what worked and why.
Push your work:

$ mwc submit

Be prepared to watch your best agent play for the class and explain what it learned.

Extension: Train an agent for your own game

If you have built a game using the retro-games framework, you can train a DQN agent to play it.

Add a [tool.retro-gamer] section to your game's pyproject.toml:

[tool.retro-gamer]
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"

Add state features to game.state that give the agent useful signal (e.g., direction to a target, distance to a wall).

Create and train:

$ retro-gamer create --game your_game/ --output runs/your_game/
$ retro-gamer train runs/your_game/

See the retro-gamer documentation for the full reference.

Discussion prompts

These prompts work well for whole-class discussion after students complete one or more sections of the lab. They connect the lab's technical content to broader themes in machine learning and human experience.

Connecting to real-world RL:

Where do you see reinforcement learning in your daily life? (Video game AI, recommendation systems, navigation apps, spam filters, robotic arms in manufacturing, drug discovery...)
AlphaGo and AlphaZero (the programs that mastered Go and, in AlphaZero's case, chess and shogi as well) used reinforcement learning. What is different about those games compared to Snake? Why might the training take millions of self-play games instead of 20,000 episodes?

Connecting RL to human behavior:

The exploration/exploitation trade-off appears in human decision-making. When you are trying a new restaurant versus going to a favorite—how is that like epsilon-greedy? When in life is more exploration better? When is it better to exploit?
The reward signal shapes behavior completely. If the snake's reward had been "steps survived" instead of "apples collected," what would the agent have learned? What happens when humans are rewarded for the wrong thing?
Q-values represent expected future reward. Humans use something similar when they think about long-term consequences. What are the limits of this comparison?

Ethics and implications:

RL systems can develop unexpected strategies when reward signals are imperfectly specified. What are examples of this in real AI systems? (Specification gaming, reward hacking...)
Self-driving car training involves RL in simulation. What challenges arise in transferring a policy learned in simulation to the real world?

Lab reflection (for the teacher)

Strengths:

The BabySnake section gives students genuine ownership: they implement a working learning algorithm from scratch, not just run existing code.
The Q-table is small enough to print and examine, making the abstract concept of "policy" concrete and verifiable.
The snake training narrative is based on real experimental data, which models good scientific reasoning and gives students evidence they can actually interpret.
The three-checkpoint comparison (ep_1100, ep_5400, ep_17100) reliably produces visibly different behavior, making the learning progress tangible.
The forager training section creates authentic space for independent experimentation and the kind of open-ended investigation that characterizes real ML work.

Weaknesses and open questions:

The watch() function in q_learning.py requires a full terminal environment. On some classroom setups this may not render correctly. Testing on students' actual machines before the lab is recommended.
Training time in Section 3 varies significantly across machines (10 minutes on a Mac with Apple Silicon vs. potentially much longer on older hardware). Students with slower machines may not see convergence in a single period; consider pre-staging some partial runs.
The "Training Snake" section is long and text-heavy by the standards of this curriculum. Students who do not read it carefully will struggle with the conceptual questions. Consider assigning it as reading homework before the period where students work on snake_training.md.
The snake training run is still in progress at the time of writing. The lab should be updated once training is complete to include a summary of the final policy's performance and the complete training curve.
The forager game is deliberately simple. A class competition for the highest reward agent would motivate more systematic experimentation; consider designating a class leaderboard.
Open question: Is two explicit features (food_dx, food_dy) the right amount of support for students in Section 3? Students who want a harder challenge could try training with observe_state = [] (board only)—but whether this is tractable in the allotted time is unclear.