Reinforcement Learning

Lab setup

First, make sure you have completed the initial setup.

If you are part of a course

  1. Open Terminal. Run the update command to make sure you have the latest code.
    $ mwc update
  2. Move to this lab's directory.
    $ cd ~/Desktop/making_with_code/shuyuan/labs/reinforcement_learning
    

If you are working on your own

  1. Move to your MWC directory.
    $ cd ~/Desktop/making_with_code
    
  2. Get a copy of this lab's materials.
    $ git clone https://git.makingwithcode.org/mwc/reinforcement_learning.git

In this lab, you will train a computer to play games—without telling it the rules. Instead, you will set up a system where the computer tries things, observes what happens, and gradually learns which actions lead to better outcomes. This is called reinforcement learning.

You have probably already seen how machine learning models learn to classify data. Reinforcement learning is different: there is no labeled dataset, no right answer to compare against. The agent learns entirely from the consequences of its own actions.


Part 1: Training BabySnake

Play the game first

💻 Run the BabySnake game:

$ uv run python -m babysnake

BabySnake is a simple game on a 4×4 grid. You control the @ character and try to collect the * food. Use the arrow keys to move. The game ends when your energy runs out.

How Q-learning works

The reasoning you just wrote down is, in some form, a policy: a rule that maps situations to actions. In reinforcement learning, the goal is to discover a good policy automatically.

One way to represent a policy is a Q-table. A Q-table stores an estimated value—called a Q-value—for every possible (state, action) pair. The Q-value for (state, action) estimates how much total future reward the agent expects if it takes that action in that state and then continues to act optimally.

For BabySnake, the state is a tuple of four integers: (agent_x, agent_y, food_x, food_y).

On a 4×4 board with 4 possible actions, the complete Q-table has at most 240 × 4 = 960 entries—small enough to print and read.
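The 240 comes from counting positions: the agent can occupy any of the 16 cells, and the food any of the remaining 15. A short enumeration confirms the count. This sketch assumes the food never shares a cell with the agent, which is what the 240 figure implies.

```python
from itertools import product

SIZE = 4          # BabySnake board is 4x4
NUM_ACTIONS = 4   # four movement actions

cells = list(product(range(SIZE), repeat=2))  # all (x, y) positions

# Every (agent_x, agent_y, food_x, food_y) where the food is not on the agent.
states = [
    (ax, ay, fx, fy)
    for (ax, ay) in cells
    for (fx, fy) in cells
    if (ax, ay) != (fx, fy)
]

print(len(states))                # 240
print(len(states) * NUM_ACTIONS)  # 960
```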

At the start of training, every Q-value is 0. The agent updates Q-values as it collects experience, using the Bellman equation:

Q(s, a) ← Q(s, a) + α · ( r + γ · max_a' Q(s', a') − Q(s, a) )

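In words: the new Q-value is the old one plus a correction. Here α is the learning rate (how large a step to take), γ is the discount factor (how much future reward counts), r is the reward just received, and s' is the state the agent landed in. The correction is proportional to the temporal difference (TD) error: the difference between a better estimate based on what actually happened, r + γ · max_a' Q(s', a'), and the old estimate Q(s, a).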
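To see the arithmetic in isolation, here is one update computed step by step. The numbers are hypothetical and chosen only for illustration; they are not the values from the exercise in this lab.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
alpha = 0.5        # learning rate
gamma = 0.9        # discount factor
old_q = 0.2        # current estimate Q(s, a)
reward = 1.0       # reward r received for the step
best_next_q = 0.5  # max over a' of Q(s', a')

td_target = reward + gamma * best_next_q  # better estimate of the value
td_error = td_target - old_q              # how far off the old estimate was
new_q = old_q + alpha * td_error          # move part of the way toward the target

print(round(new_q, 3))
```

With α = 0.5 the new value lands halfway between the old estimate (0.2) and the target (1.45). A smaller α would move it less.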

Manual Q-update exercise

Work through this update by hand before implementing it in code.

Situation: The board is 4×4. The agent is at position (2, 2). Food is at (3, 3).

. . . .
. . . .
. . @ .
. . . *

All Q-values are currently 0, except:

State           Action   Q-value
(3, 2, 3, 3)    DOWN     0.5

(This means the agent has already learned that from position (3,2) with food at (3,3), moving DOWN is promising.)

The agent takes action RIGHT. It moves from (2,2) to (3,2). It does not land on food. Reward: r = −0.01.

New state: (3, 2, 3, 3). Use α = 0.1 and γ = 0.95.

Calculate the new Q-value for ((2, 2, 3, 3), RIGHT).

Exploration vs. exploitation

During training, the agent faces a dilemma at every turn: should it exploit what it already knows (choose the action with the highest Q-value), or explore by trying something new (maybe there is a better action it has not discovered yet)?

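We handle this with an epsilon-greedy policy: with probability ε, the agent picks a random action (explore); otherwise, it picks the action with the highest Q-value (exploit).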

At the start of training, ε is high (e.g., 1.0: always random). As training progresses, ε decays toward a small floor (e.g., 0.05). This schedule lets the agent explore widely early on and exploit what it has learned later.

Implement Q-learning

💻 Open q_learning.py. You need to implement two functions:

choose_action(q_table, state, epsilon)

With probability epsilon, return a random action from ACTIONS. Otherwise, return the action with the highest Q-value for state. Use q_table.get((state, action), 0.0) to look up a Q-value (defaulting to 0 if the pair has not been seen before).
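A Q-table of this kind can be an ordinary Python dictionary keyed by (state, action) tuples. The sketch below shows only the lookup pattern. The action names are assumptions; the real values are whatever ACTIONS contains in q_learning.py.

```python
# Assumed action names for illustration; use ACTIONS from q_learning.py in the lab.
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

# Keys are (state, action) pairs; a state is (agent_x, agent_y, food_x, food_y).
q_table = {((3, 2, 3, 3), "DOWN"): 0.5}

state = (3, 2, 3, 3)
seen = q_table.get((state, "DOWN"), 0.0)    # stored pair: returns 0.5
unseen = q_table.get((state, "LEFT"), 0.0)  # missing pair: returns the default 0.0

print(seen, unseen)
```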

update_q(q_table, state, action, reward, next_state, alpha, gamma)

Apply the Bellman equation to update q_table[(state, action)] in place. The right-hand side needs the best Q-value from next_state—use a list comprehension over ACTIONS to find it.

The rest of the training loop is already written for you. Once your two functions are implemented, run:

$ uv run python q_learning.py

You should see output like:

Episode   100  reward=  -2.1  score=0  epsilon=0.605  q_entries=36
Episode   200  reward=   1.4  score=1  epsilon=0.366  q_entries=88
Episode   500  reward=   3.8  score=4  epsilon=0.082  q_entries=204
Episode  1000  reward=   5.1  score=5  epsilon=0.050  q_entries=287

The reward will be negative early (step penalties accumulate) and should improve as the agent learns. score counts how many food items were collected in the last episode.


Part 2: Training Snake

BabySnake's state space has 240 entries. The Q-table can hold the entire policy in less than a kilobyte of memory. Now consider the original Snake game on a 32×16 board: the state space is enormous (trillions of possible board configurations), the agent grows a tail that it can collide with, and the optimal strategy is far more complex.

A Q-table cannot scale to this. We need a Q-network: a neural network that approximates the Q-function. Instead of looking up a value in a table, the network takes the current observation and predicts Q-values for all actions.
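To make the idea concrete, here is a toy Q-network in plain Python: one hidden layer with random, untrained weights. It is an illustrative sketch, not the network used in this lab. The layer sizes and the names init_layer, linear, and q_values are made up for the example.

```python
import random

random.seed(0)

NUM_ACTIONS = 4   # one Q-value per action
OBS_SIZE = 6      # hypothetical observation length
HIDDEN = 8        # hypothetical hidden layer width


def init_layer(n_in, n_out):
    """Return random weights (n_out rows of n_in values) and zero biases."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, weights, biases):
    """Compute weights @ x + biases for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def q_values(obs, layer1, layer2):
    """Forward pass: observation in, one predicted Q-value per action out."""
    hidden = [max(0.0, h) for h in linear(obs, *layer1)]  # ReLU activation
    return linear(hidden, *layer2)


layer1 = init_layer(OBS_SIZE, HIDDEN)
layer2 = init_layer(HIDDEN, NUM_ACTIONS)

obs = [0.0, 1.0, 0.0, 0.0, 0.5, -0.5]
qs = q_values(obs, layer1, layer2)
greedy_action = max(range(NUM_ACTIONS), key=lambda a: qs[a])

print(len(qs), greedy_action)
```

The table lookup has become a function call: any observation, including one never seen before, produces four Q-values. Training adjusts the weights so that those values become accurate.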

This is Deep Q-Learning (DQN). Training it is more subtle than Q-learning. This section walks through five training experiments that led to a working agent—what we tried, what went wrong, and how we fixed it.

The snake game has a simple reward structure: +50 for eating an apple, +1 for each step toward the apple and −1 for each step away (incentivizing approach), and −10 for dying. The agent also has energy that depletes each step and refills when it eats; running out of energy ends the game.

Attempt 1: Can the network see the apple?

Hypothesis. A CNN processes 2D spatial inputs efficiently. If we feed the agent the raw game board, the CNN should be able to detect where the apple is and learn to navigate toward it.

Setup.

Evidence.

[ep_0100]  avg_reward=-9.5   avg_steps=48   epsilon=0.905  avg_loss=9.2
[ep_0500]  avg_reward=-8.7   avg_steps=108  epsilon=0.606  avg_loss=43.6
[ep_2000]  avg_reward=-9.3   avg_steps=133  epsilon=0.135  avg_loss=10.3
[ep_5000]  avg_reward=-8.9   avg_steps=134  epsilon=0.050  avg_loss=9.6
...
[ep_45700] avg_reward=-9.3   avg_steps=130  epsilon=0.050  avg_loss=8.7

What happened. The agent learned to survive—avg_steps grew from 48 to ~130—but the reward stayed flat and negative through the entire run. After 6 hours and 45,000 episodes, the agent was wandering the board, avoiding walls, but never reliably finding the apple.

Why. The full board gives the agent 3,072 numbers as input. Somewhere in those numbers is information about where the apple is, but it is deeply implicit: the agent has to figure out which numbers change when the apple moves and build a spatial representation of the board from scratch. The reward signal (+50 when the snake happens to reach the apple, after potentially hundreds of random steps) is far too sparse to guide that learning.

Attempt 2: Give the agent a compass

Hypothesis. The board encoding buries the apple's location in 3,072 numbers. What if we added two features that directly encode the direction to the apple?

Setup. Added two values to the observation:

These are positive when the apple is to the right or below, negative when it is to the left or above, and zero when directly in line.
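One way to compute features with these signs is to subtract the head position from the apple position. This is a sketch under assumptions: the function name, the coordinate convention (x grows to the right, y grows downward), and the use of raw differences rather than normalized values are not taken from the lab code.

```python
def apple_direction(head, apple):
    """Return (dx, dy) from the snake's head to the apple.

    Assumes (x, y) coordinates with x growing rightward and y growing downward.
    """
    head_x, head_y = head
    apple_x, apple_y = apple
    return apple_x - head_x, apple_y - head_y


print(apple_direction((10, 5), (14, 5)))  # apple to the right, same row
print(apple_direction((10, 5), (3, 9)))   # apple to the left and below
```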

What happened. Within hundreds of episodes, the agent began making positive progress. The first checkpoint with reliably positive reward appeared around episode 400—compared to nothing after 45,000 episodes in Attempt 1.

Why. Two features replaced thousands of implicit ones. The agent no longer needed to discover the spatial structure of the board from scratch. A direct signal pointing toward the goal gave the sparse reward something concrete to reinforce.

Attempt 3: Diagnosing runaway loss

Hypothesis. Training with explicit features is working. Let's see how far it gets.

Early results (with features added, initial settings).

[ep_0300]  avg_loss=48.7    avg_reward=+8.1
[ep_0500]  avg_loss=347     avg_reward=+12.4
[ep_0700]  avg_loss=4,102   avg_reward=+6.5
[ep_1100]  avg_loss=686,000 avg_reward=-3.1

Training started promisingly, then the loss exploded and performance collapsed.

What happened. The loss grew without bound, a phenomenon called Q-value divergence. The apple gives +50 reward. With a learning rate of 0.001 and MSE (mean squared error) loss, large rewards pushed Q-values high. High Q-values created large TD errors (the difference between predicted and target Q-values). MSE squares those errors, so the loss grows quadratically and the gradient grows in proportion to the error, with no upper limit. Large gradient updates pushed Q-values even higher, closing a feedback loop.

Fix. Two changes stabilized training:

  1. Huber loss instead of MSE. Huber loss behaves like MSE for small errors but becomes linear for large ones, capping the gradient. This breaks the feedback loop.
  2. Lower learning rate: 0.001 → 0.0001. Smaller updates give the target network time to stabilize before the online network chases a new target.
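The difference shows up in the gradients. The sketch below compares the gradient of squared error with the gradient of Huber loss for growing TD errors, using the standard Huber definition with threshold delta = 1.0. The threshold used in the lab's training code is not stated here, so treat that value as an assumption.

```python
def squared_error_grad(error):
    """Gradient of error**2 with respect to error: grows without limit."""
    return 2.0 * error


def huber_grad(error, delta=1.0):
    """Gradient of Huber loss: equals error near zero, clipped to +/-delta beyond."""
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta


for error in (0.5, 5.0, 50.0, 500.0):
    print(error, squared_error_grad(error), huber_grad(error))
```

However large the TD error gets, the Huber gradient stays at delta, so a single bad batch cannot produce an enormous update.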

Attempt 4: Zooming in — the egocentric view

Hypothesis. The full 32×16 board is large (3,072 inputs). The snake only needs to know what is nearby. What if we cropped the observation to a window centered on the snake's head?

Setup. Instead of the full board, the agent sees a 17×17 crop centered on the snake's head—wherever the snake happens to be. Areas outside the board are filled with empty space.

Two benefits of the egocentric view.

First, smaller input: 1,736 numbers instead of 3,072. The network is simpler, trains faster, and generalizes better.

Second, position invariance: the snake's head is always at the center of its own observation. A wall to the left looks the same whether the snake is at position (3,5) or (28,12). The network does not need to relearn the same spatial relationships at every board location.

With an egocentric crop, the full-board CNN is no longer needed. We used a flat MLP (spatial = false) that treats the 1,736 inputs as a single vector. The egocentric window already encodes local spatial context; additional convolutions over the full board are not necessary.
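The cropping step itself is short. The sketch below is an illustration rather than the lab's implementation: it assumes the board is a list of rows with one integer per cell and that 0 means empty, whereas the real observation encodes more information per cell.

```python
def egocentric_crop(board, head_x, head_y, size=17, fill=0):
    """Return a size x size window of board centered on (head_x, head_y).

    board is a list of rows (board[y][x]); cells outside the board get fill.
    """
    half = size // 2
    height, width = len(board), len(board[0])
    window = []
    for y in range(head_y - half, head_y + half + 1):
        row = []
        for x in range(head_x - half, head_x + half + 1):
            inside = 0 <= x < width and 0 <= y < height
            row.append(board[y][x] if inside else fill)
        window.append(row)
    return window


# A 32x16 board of empty cells (0) with the head (1) near the top-left corner.
board = [[0] * 32 for _ in range(16)]
board[2][3] = 1
view = egocentric_crop(board, head_x=3, head_y=2)

print(len(view), len(view[0]))  # window dimensions
print(view[8][8])               # center cell holds the head
```

Wherever the head is on the board, it always appears at the center of the window, which is what gives the network position invariance.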

Attempt 5: Teaching the agent to explore

Hypothesis. Exploration rate (epsilon) should decay slowly enough to give the agent meaningful experience before it commits.

The problem with fast decay. With epsilon_decay = 0.995, epsilon falls from 1.0 to about 0.10 by episode 450 and reaches the 0.05 floor near episode 600. From then on the agent acts greedily 95% of the time, but after only a few hundred episodes the Q-network has barely trained. It commits to whatever policy it happened to discover early, which may be far from optimal.

epsilon after 450 episodes (decay=0.995): 0.995^450 ≈ 0.10
epsilon after 450 episodes (decay=0.9997): 0.9997^450 ≈ 0.87

Fix. With epsilon_decay = 0.9997, epsilon is still 0.55 at episode 2,000. The agent keeps exploring well into training, discovering better strategies before committing.
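These figures follow from multiplying epsilon by the decay factor once per episode and clamping it at the floor. A quick check, assuming exactly that schedule (start at 1.0, floor at 0.05; the function names are made up for the example):

```python
import math


def epsilon_at(episode, decay, start=1.0, floor=0.05):
    """Epsilon after `episode` multiplicative decay steps, clamped at the floor."""
    return max(floor, start * decay ** episode)


def episodes_to_floor(decay, start=1.0, floor=0.05):
    """Smallest number of episodes after which epsilon has reached the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


for decay in (0.995, 0.9997):
    print(
        decay,
        round(epsilon_at(450, decay), 2),
        round(epsilon_at(2000, decay), 2),
        episodes_to_floor(decay),
    )
```

Under this assumption the fast schedule reaches the floor after roughly 600 episodes, while the slow one takes close to 10,000.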

The successful run

With all five improvements in place, training produced a genuinely competent snake agent. Here is the training log:

[ep_0100]  avg_reward=-5.3   avg_steps=50   epsilon=0.970  avg_loss=0.9
[ep_0400]  avg_reward=+9.7   avg_steps=61   epsilon=0.887  avg_loss=1.7
[ep_1100]  avg_reward=+34.5  avg_steps=57   epsilon=0.719  avg_loss=4.6
[ep_1800]  avg_reward=+4.4   avg_steps=98   epsilon=0.583  avg_loss=5.1
[ep_3800]  avg_reward=+51.2  avg_steps=33   epsilon=0.320  avg_loss=1.2
[ep_5400]  avg_reward=+83.5  avg_steps=43   epsilon=0.198  avg_loss=2.0
[ep_9000]  avg_reward=+246.0 avg_steps=85   epsilon=0.067  avg_loss=6.4
[ep_13000] avg_reward=+375.6 avg_steps=107  epsilon=0.050  avg_loss=5.4
[ep_17100] avg_reward=+288.3 avg_steps=86   epsilon=0.050  avg_loss=4.9

The learning curve has a characteristic shape:

  1. Exploration (ep 0–300): reward negative, agent mostly random
  2. First breakthroughs (ep 400–1100): agent starts finding the apple
  3. Consolidation dip (ep 1500–2300): reward falls as the agent refines its strategy—a normal phase of reorganization
  4. Efficiency breakthrough (ep 3700+): episodes suddenly shorten (avg_steps drops to 33); the agent has learned to reach the apple quickly
  5. Maturation (ep 8000+): longer episodes, higher reward, complex strategy
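If you want to examine a log like this yourself, for example to plot reward against episode, each line can be parsed into a dictionary. This sketch assumes the line format shown above; the function name parse_log_line is made up for the example.

```python
import re

LINE_PATTERN = re.compile(r"\[ep_(\d+)\]\s+(.*)")


def parse_log_line(line):
    """Turn '[ep_0100]  avg_reward=-5.3 ...' into a dict of numbers."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # not a log line (for example '...')
    record = {"episode": int(match.group(1))}
    for field in match.group(2).split():
        key, _, value = field.partition("=")
        record[key] = float(value.replace(",", ""))  # tolerate values like '4,102'
    return record


sample = "[ep_3800]  avg_reward=+51.2  avg_steps=33   epsilon=0.320  avg_loss=1.2"
print(parse_log_line(sample))
```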

Part 3: Training Forager

Now it is your turn. The forager/ directory contains a game called Forager: an agent on an 8×8 grid that collects food that respawns when eaten. The rules are simple, but the 8×8 state space (4,032 distinct (agent, food) positions) is too large for a Q-table—you need a neural network.

💻 Play Forager to understand the game:

$ uv run python -m forager

Use arrow keys to move @ to the food *. Press Enter or Escape to quit.

Setting up a training run

💻 Create your first training run:

$ retro-gamer create --game forager --output runs/forager/

This creates runs/forager/config.toml. Open it and add observe_state to the [preprocessing] section so the agent can see the direction to the food:

[preprocessing]
spatial = false
board = true
observe_state = ["food_dx", "food_dy"]

Then start training:

$ retro-gamer train runs/forager/

A progress bar will show how training is going. Training 20,000 episodes takes about 10–20 minutes. You can stop training at any time with Ctrl-C and resume it later.

Watch your agent play at any point:

$ retro-gamer play runs/forager/

Document your experiments

Open training_log.md and fill in Attempt 1 before you start training: write your hypothesis—what do you think will happen with the default settings? After training, fill in the evidence and analysis.

Then try at least one more configuration. Some things worth experimenting with:

When you change hidden_sizes or any [preprocessing] option, run retro-gamer clean runs/forager/ before retraining.


Extension: Train an agent for your own game

If you have built a game using the retro-games framework, you can train a DQN agent to play it.

  1. Add a [tool.retro-gamer] section to your game's pyproject.toml:
    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
  2. Add state features to game.state that give the agent useful signal (e.g., direction to a target, distance to a wall).
  3. Create and train:
    $ retro-gamer create --game your_game/ --output runs/your_game/
    $ retro-gamer train runs/your_game/

See the retro-gamer documentation for the full reference.