Reinforcement Learning
Lab setup
First, make sure you have completed the initial setup.
If you are part of a course
- Open Terminal. Run the update command to make sure you have the latest code.
  $ mwc update
- Move to this lab's directory.
  $ cd ~/Desktop/making_with_code/shuyuan/labs/reinforcement_learning
If you are working on your own
- Move to your MWC directory.
  $ cd ~/Desktop/making_with_code
- Get a copy of this lab's materials.
  $ git clone https://git.makingwithcode.org/mwc/reinforcement_learning.git
In this lab, you will train a computer to play games—without telling it the rules. Instead, you will set up a system where the computer tries things, observes what happens, and gradually learns which actions lead to better outcomes. This is called reinforcement learning.
You have probably already seen how machine learning models learn to classify data. Reinforcement learning is different: there is no labeled dataset, no right answer to compare against. The agent learns entirely from the consequences of its own actions.
Part 1: Training BabySnake
Play the game first
💻 Run the BabySnake game:
$ uv run python -m babysnake
BabySnake is a simple game on a 4×4 grid. You control the @ character and
try to collect the * food. Use the arrow keys to move. The game ends when
your energy runs out.
How Q-learning works
The reasoning you just wrote down is, in some form, a policy: a rule that maps situations to actions. In reinforcement learning, the goal is to discover a good policy automatically.
One way to represent a policy is a Q-table. A Q-table stores an estimated value—called a Q-value—for every possible (state, action) pair. The Q-value for (state, action) estimates how much total future reward the agent expects if it takes that action in that state and then continues to act optimally.
For BabySnake, the state is a tuple of four integers:
(agent_x, agent_y, food_x, food_y).
On a 4×4 board with 4 possible actions, the complete Q-table has at most 240 × 4 = 960 entries—small enough to print and read.
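The count of 240 states can be checked by enumeration. The sketch below assumes the food never sits on the agent's own cell, which is what gives 16 × 15 states rather than 16 × 16. The action names and variable names are illustrative and not taken from the lab code.

```python
from itertools import product

BOARD_SIZE = 4
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]  # illustrative action names

# All 16 (x, y) cells on the board.
cells = list(product(range(BOARD_SIZE), repeat=2))

# A state is (agent_x, agent_y, food_x, food_y).
# Assumption: the food is never on the agent's own cell.
states = [
    (ax, ay, fx, fy)
    for (ax, ay) in cells
    for (fx, fy) in cells
    if (ax, ay) != (fx, fy)
]

num_states = len(states)
num_entries = num_states * len(ACTIONS)
print(num_states, num_entries)
```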
At the start of training, every Q-value is 0. The agent updates Q-values as it collects experience, using the Bellman equation:
Q(s, a) ← Q(s, a) + α · ( r + γ · max_a' Q(s', a') − Q(s, a) )
In words: the new Q-value is the old one plus a correction. The correction is proportional to the temporal difference (TD) error: the difference between the old estimate and a better estimate based on what actually happened.
- r is the reward received (e.g., +1 for food, −0.01 per step)
- γ (gamma) is the discount factor: how much to value future rewards relative to immediate ones
- α (alpha) is the learning rate: how large an update to make each time
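As a rough illustration, the arithmetic of a single update can be written out in a few lines. This sketch works on plain numbers only: looking values up in a Q-table and finding the best next Q-value are left out, and the function name and the example numbers are made up for illustration.

```python
def td_update(old_q, reward, best_next_q, alpha, gamma):
    """Return the updated Q-value after one step of experience."""
    target = reward + gamma * best_next_q  # better estimate, based on what happened
    td_error = target - old_q              # how far off the old estimate was
    return old_q + alpha * td_error        # move part of the way toward the target


# Made-up numbers: the agent just ate food (reward 1.0), and the best
# Q-value available from the next state is 0.4.
new_q = td_update(old_q=0.2, reward=1.0, best_next_q=0.4, alpha=0.5, gamma=0.9)
print(round(new_q, 3))
```

With alpha = 0.5 the estimate moves halfway from 0.2 toward the target of 1.36. A smaller alpha would move it less.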
Manual Q-update exercise
Work through this update by hand before implementing it in code.
Situation: The board is 4×4. The agent is at position (2, 2). Food is at (3, 3).
. . . .
. . . .
. . @ .
. . . *
All Q-values are currently 0, except:
| State | Action | Q-value |
|---|---|---|
| (3, 2, 3, 3) | DOWN | 0.5 |
(This means the agent has already learned that from position (3,2) with food at (3,3), moving DOWN is promising.)
The agent takes action RIGHT. It moves from (2,2) to (3,2). It does not land on food. Reward: r = −0.01.
New state: (3, 2, 3, 3). Use α = 0.1 and γ = 0.95.
Calculate the new Q-value for ((2, 2, 3, 3), RIGHT).
Exploration vs. exploitation
During training, the agent faces a dilemma at every turn: should it exploit what it already knows (choose the action with the highest Q-value), or explore by trying something new (maybe there is a better action it has not discovered yet)?
We handle this with an epsilon-greedy policy:
- With probability ε (epsilon), take a random action (explore)
- With probability 1 − ε, take the best known action (exploit)
At the start of training, ε is high (e.g., 1.0: always random). As training progresses, ε decays toward a small floor (e.g., 0.05). This schedule lets the agent explore widely early on and exploit what it has learned later.
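The epsilon-greedy rule can be sketched on a plain list of value estimates. This is only an illustration of the idea, not the lab's function: the name `epsilon_greedy`, the list of values, and the optional `rng` argument are all assumptions made for this example.

```python
import random


def epsilon_greedy(values, epsilon, rng=random):
    """Pick an index into `values`: random with probability epsilon, else the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))                     # explore
    return max(range(len(values)), key=lambda i: values[i])   # exploit


values = [0.1, 0.7, 0.3, 0.0]  # made-up value estimates for four actions

# With epsilon = 0 the rule never explores and always picks the best index.
greedy_choice = epsilon_greedy(values, epsilon=0.0)

# With epsilon = 1 every choice is random, so all four indices show up.
rng = random.Random(0)
explored = {epsilon_greedy(values, epsilon=1.0, rng=rng) for _ in range(200)}
print(greedy_choice, sorted(explored))
```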
Implement Q-learning
💻
Open q_learning.py. You need to implement two functions:
choose_action(q_table, state, epsilon)
With probability epsilon, return a random action from ACTIONS.
Otherwise, return the action with the highest Q-value for state.
Use q_table.get((state, action), 0.0) to look up a Q-value (defaulting to 0
if the pair has not been seen before).
update_q(q_table, state, action, reward, next_state, alpha, gamma)
Apply the Bellman equation to update q_table[(state, action)] in place.
The right-hand side needs the best Q-value from next_state—use a list
comprehension over ACTIONS to find it.
The rest of the training loop is already written for you. Once your two functions are implemented, run:
$ uv run python q_learning.py
You should see output like:
Episode 100 reward= -2.1 score=0 epsilon=0.605 q_entries=36
Episode 200 reward= 1.4 score=1 epsilon=0.366 q_entries=88
Episode 500 reward= 3.8 score=4 epsilon=0.082 q_entries=204
Episode 1000 reward= 5.1 score=5 epsilon=0.050 q_entries=287
The reward will be negative early (step penalties accumulate) and should
improve as the agent learns. score counts how many food items were collected
in the last episode.
Part 2: Training Snake
BabySnake's state space has 240 entries. The Q-table can hold the entire policy in less than a kilobyte of memory. Now consider the original Snake game on a 32×16 board: the state space is enormous (trillions of possible board configurations), the agent grows a tail that it can collide with, and the optimal strategy is far more complex.
A Q-table cannot scale to this. We need a Q-network: a neural network that approximates the Q-function. Instead of looking up a value in a table, the network takes the current observation and predicts Q-values for all actions.
This is Deep Q-Learning (DQN). Training it is more subtle than Q-learning. This section walks through five training experiments that led to a working agent—what we tried, what went wrong, and how we fixed it.
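To make the idea of a Q-network concrete, here is a toy sketch of the forward pass only: an observation vector goes in, and one Q-value per action comes out. The layer sizes, the random weights, and the helper names are all hypothetical. The network is untrained, and this is not how the lab's training code is implemented.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layer, inputs, relu):
    """Apply one layer: weighted sums plus bias, optionally followed by ReLU."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in outputs] if relu else outputs


rng = random.Random(0)
OBS_SIZE, HIDDEN, NUM_ACTIONS = 8, 16, 4  # hypothetical sizes
hidden_layer = make_layer(OBS_SIZE, HIDDEN, rng)
output_layer = make_layer(HIDDEN, NUM_ACTIONS, rng)

observation = [rng.random() for _ in range(OBS_SIZE)]  # stand-in for a game observation
hidden = forward(hidden_layer, observation, relu=True)
q_values = forward(output_layer, hidden, relu=False)   # one estimate per action

# Acting greedily means picking the action with the largest predicted Q-value.
best_action = max(range(NUM_ACTIONS), key=lambda a: q_values[a])
print(q_values, best_action)
```

Where a Q-table stores one number per (state, action) pair, the network stores weights and computes the numbers on demand, so it can produce estimates for observations it has never seen.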
The snake game has a simple reward structure: +50 for eating an apple, +1 for each step toward the apple and −1 for each step away (incentivizing approach), and −10 for dying. The agent also has energy that depletes each step and refills when it eats; running out of energy ends the game.
Attempt 1: Can the network see the apple?
Hypothesis. A CNN processes 2D spatial inputs efficiently. If we feed the agent the raw game board, the CNN should be able to detect where the apple is and learn to navigate toward it.
Setup.
- Full 32×16 board (3,072 numbers)
- CNN architecture (spatial = true)
- No explicit direction-to-apple features
- 45,000 training episodes
Evidence.
[ep_0100] avg_reward=-9.5 avg_steps=48 epsilon=0.905 avg_loss=9.2
[ep_0500] avg_reward=-8.7 avg_steps=108 epsilon=0.606 avg_loss=43.6
[ep_2000] avg_reward=-9.3 avg_steps=133 epsilon=0.135 avg_loss=10.3
[ep_5000] avg_reward=-8.9 avg_steps=134 epsilon=0.050 avg_loss=9.6
...
[ep_45700] avg_reward=-9.3 avg_steps=130 epsilon=0.050 avg_loss=8.7
What happened. The agent learned to survive—avg_steps grew from 48 to ~130—but the reward stayed flat and negative through the entire run. After 6 hours and 45,000 episodes, the agent was wandering the board, avoiding walls, but never reliably finding the apple.
Why. The full board gives the agent 3,072 numbers as input. Somewhere in those numbers is information about where the apple is, but it is deeply implicit: the agent has to figure out which numbers change when the apple moves and build a spatial representation of the board from scratch. The reward signal (+50 when the snake happens to reach the apple, after potentially hundreds of random steps) is far too sparse to guide that learning.
Attempt 2: Give the agent a compass
Hypothesis. The board encoding buries the apple's location in 3,072 numbers. What if we added two features that directly encode the direction to the apple?
Setup. Added two values to the observation:
apple_dx = (apple_x − head_x) / board_width
apple_dy = (apple_y − head_y) / board_height
These are positive when the apple is to the right or below, negative when it is to the left or above, and zero when directly in line.
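A minimal sketch of how these two features might be computed is shown below. The function name and the (x, y) tuple arguments are assumptions for illustration; only the two formulas come from the setup above.

```python
def apple_direction(head, apple, board_width, board_height):
    """Return (apple_dx, apple_dy), the scaled offset from the head to the apple."""
    head_x, head_y = head
    apple_x, apple_y = apple
    apple_dx = (apple_x - head_x) / board_width
    apple_dy = (apple_y - head_y) / board_height
    return apple_dx, apple_dy


# Apple directly to the right of the head, on the same row.
dx, dy = apple_direction(head=(8, 4), apple=(24, 4), board_width=32, board_height=16)
print(dx, dy)

# Apple to the left of and above the head: both values are negative.
dx2, dy2 = apple_direction(head=(8, 12), apple=(0, 4), board_width=32, board_height=16)
print(dx2, dy2)
```

Dividing by the board dimensions keeps both values between −1 and 1, a range that neural network inputs generally handle well.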
What happened. Within hundreds of episodes, the agent began making positive progress. The first checkpoint with reliably positive reward appeared around episode 400—compared to nothing after 45,000 episodes in Attempt 1.
Why. Two features replaced thousands of implicit ones. The agent no longer needed to discover the spatial structure of the board from scratch. A direct signal pointing toward the goal gave the sparse reward something to reinforce.
Attempt 3: Diagnosing runaway loss
Hypothesis. Training with explicit features is working. Let's see how far it gets.
Early results (with features added, initial settings).
[ep_0300] avg_loss=48.7 avg_reward=+8.1
[ep_0500] avg_loss=347 avg_reward=+12.4
[ep_0700] avg_loss=4,102 avg_reward=+6.5
[ep_1100] avg_loss=686,000 avg_reward=-3.1
Training started promisingly, then the loss exploded and performance collapsed.
What happened. The loss grew without bound, a phenomenon called Q-value divergence. The apple gives +50 reward. With a learning rate of 0.001 and MSE (mean squared error) loss, large rewards pushed Q-values high. High Q-values created large TD errors (the difference between predicted and target Q-values). MSE loss squares those errors, so the loss grows quadratically and its gradient grows in proportion to the error, with no upper limit. Large gradient updates pushed Q-values even higher, closing a feedback loop.
Fix. Two changes stabilized training:
- Huber loss instead of MSE. Huber loss behaves like MSE for small errors but becomes linear for large ones, capping the gradient. This breaks the feedback loop.
- Lower learning rate: 0.001 → 0.0001. Smaller updates give the target network time to stabilize before the online network chases a new target.
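The difference between the two losses is easiest to see by comparing their gradients for a single TD error. The sketch below assumes the common convention of 0.5 × error² for the squared loss and a Huber threshold (delta) of 1.0. The source does not state which threshold the training code used.

```python
def mse_loss(error):
    """Squared-error loss for one TD error (0.5 * error^2 convention)."""
    return 0.5 * error ** 2


def mse_grad(error):
    """Gradient of the squared loss: grows in proportion to the error."""
    return error


def huber_loss(error, delta=1.0):
    """Quadratic for small errors, linear beyond `delta` (assumed 1.0 here)."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)


def huber_grad(error, delta=1.0):
    """Gradient of the Huber loss: never larger than `delta` in magnitude."""
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta


for td_error in (0.5, 5.0, 50.0):
    print(td_error, mse_grad(td_error), huber_grad(td_error))
```

For a TD error of 50, about the size of one apple reward, the squared loss pushes the weights fifty times harder than the Huber loss does.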
Attempt 4: Zooming in — the egocentric view
Hypothesis. The full 32×16 board is large (3,072 inputs). The snake only needs to know what is nearby. What if we cropped the observation to a window centered on the snake's head?
Setup. Instead of the full board, the agent sees a 17×17 crop centered on the snake's head—wherever the snake happens to be. Areas outside the board are filled with empty space.
- Full board: 32 × 16 × 6 = 3,072 inputs
- Egocentric 17×17 crop: 17 × 17 × 6 = 1,734 board inputs + 2 state = 1,736 total
Two benefits of the egocentric view.
First, smaller input: 1,736 numbers instead of 3,072. The network is simpler, trains faster, and generalizes better.
Second, position invariance: the snake's head is always at the center of its own observation. A wall to the left looks the same whether the snake is at position (3,5) or (28,12). The network does not need to relearn the same spatial relationships at every board location.
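The cropping step can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: the board is a plain list of rows holding one made-up integer code per cell, and cells outside the board are filled with 0 to stand for empty space.

```python
def egocentric_crop(board, head, size=17, fill=0):
    """Return a size x size window of `board` centered on `head`.

    `board` is a list of rows and `head` is (x, y). Cells outside the
    board are filled with `fill` (assumed here to mean "empty").
    """
    head_x, head_y = head
    half = size // 2
    height, width = len(board), len(board[0])
    window = []
    for y in range(head_y - half, head_y + half + 1):
        row = []
        for x in range(head_x - half, head_x + half + 1):
            inside = 0 <= x < width and 0 <= y < height
            row.append(board[y][x] if inside else fill)
        window.append(row)
    return window


def board_with_head(head):
    """Build a 32x16 board with a head (code 1) and an object to its left (code 2)."""
    grid = [[0] * 32 for _ in range(16)]
    x, y = head
    grid[y][x] = 1
    grid[y][x - 1] = 2
    return grid


# Two very different head positions produce the same local picture.
window_a = egocentric_crop(board_with_head((3, 5)), head=(3, 5))
window_b = egocentric_crop(board_with_head((28, 12)), head=(28, 12))
print(window_a[8][7], window_a[8][8], window_a == window_b)
```

The head always lands at row 8, column 8 of the window, and the object to its left always lands at column 7, wherever the snake is on the board.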
With an egocentric crop, the full-board CNN is no longer needed. We used a
flat MLP (spatial = false) that treats the 1,736 inputs as a single vector.
The egocentric window already encodes local spatial context; additional
convolutions over the full board are not necessary.
Attempt 5: Teaching the agent to explore
Hypothesis. Exploration rate (epsilon) should decay slowly enough to give the agent meaningful experience before it commits.
The problem with fast decay. With epsilon_decay = 0.995, epsilon falls
from 1.0 to about 0.10 by episode 450 and reaches the 0.05 floor near
episode 600. By episode 450 the agent is already acting greedily about 90%
of the time, but the Q-network has barely trained. It commits to whatever
policy it happened to discover early, which may be far from optimal.
epsilon after 450 episodes (decay=0.995): 0.995^450 ≈ 0.10
epsilon after 450 episodes (decay=0.9997): 0.9997^450 ≈ 0.87
Fix. With epsilon_decay = 0.9997, epsilon is still 0.55 at episode
2,000. The agent keeps exploring well into training, discovering better
strategies before committing.
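These numbers can be reproduced with a short calculation. The sketch assumes epsilon starts at 1.0, is multiplied by epsilon_decay once per episode, and is never allowed to drop below a floor of 0.05. The function names are made up for this example.

```python
import math


def epsilon_at(episode, decay, start=1.0, floor=0.05):
    """Epsilon after `episode` episodes of multiplicative decay, clipped at the floor."""
    return max(floor, start * decay ** episode)


def episodes_to_floor(decay, start=1.0, floor=0.05):
    """Number of episodes until the decayed epsilon first reaches the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


for decay in (0.995, 0.9997):
    print(
        decay,
        round(epsilon_at(450, decay), 2),
        round(epsilon_at(2000, decay), 2),
        episodes_to_floor(decay),
    )
```

The last column shows how long exploration lasts under each setting. A decay value that looks only slightly larger keeps the agent exploring many times longer.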
The successful run
With all five improvements in place, training produced a genuinely competent snake agent. Here is the training log:
[ep_0100] avg_reward=-5.3 avg_steps=50 epsilon=0.970 avg_loss=0.9
[ep_0400] avg_reward=+9.7 avg_steps=61 epsilon=0.887 avg_loss=1.7
[ep_1100] avg_reward=+34.5 avg_steps=57 epsilon=0.719 avg_loss=4.6
[ep_1800] avg_reward=+4.4 avg_steps=98 epsilon=0.583 avg_loss=5.1
[ep_3800] avg_reward=+51.2 avg_steps=33 epsilon=0.320 avg_loss=1.2
[ep_5400] avg_reward=+83.5 avg_steps=43 epsilon=0.198 avg_loss=2.0
[ep_9000] avg_reward=+246.0 avg_steps=85 epsilon=0.067 avg_loss=6.4
[ep_13000] avg_reward=+375.6 avg_steps=107 epsilon=0.050 avg_loss=5.4
[ep_17100] avg_reward=+288.3 avg_steps=86 epsilon=0.050 avg_loss=4.9
The learning curve has a characteristic shape:
- Exploration (ep 0–300): reward negative, agent mostly random
- First breakthroughs (ep 400–1100): agent starts finding the apple
- Consolidation dip (ep 1500–2300): reward falls as the agent refines its strategy—a normal phase of reorganization
- Efficiency breakthrough (ep 3700+): episodes suddenly shorten (avg_steps drops to 33); the agent has learned to reach the apple quickly
- Maturation (ep 8000+): longer episodes, higher reward, complex strategy
Part 3: Training Forager
Now it is your turn. The forager/ directory contains a game called Forager:
an agent on an 8×8 grid that collects food that respawns when eaten. The rules
are simple, but the 8×8 state space (4,032 distinct (agent, food) positions)
is too large for a Q-table—you need a neural network.
💻 Play Forager to understand the game:
$ uv run python -m forager
Use arrow keys to move @ to the food *. Press Enter or Escape to quit.
Setting up a training run
💻 Create your first training run:
$ retro-gamer create --game forager --output runs/forager/
This creates runs/forager/config.toml. Open it and add observe_state to the
[preprocessing] section so the agent can see the direction to the food:
[preprocessing]
spatial = false
board = true
observe_state = ["food_dx", "food_dy"]
Then start training:
$ retro-gamer train runs/forager/
A progress bar will show how training is going. Training 20,000 episodes takes about 10–20 minutes. You can stop training at any time with Ctrl-C and resume it later.
Watch your agent play at any point:
$ retro-gamer play runs/forager/
Document your experiments
Open training_log.md and fill in Attempt 1 before you start training:
write your hypothesis—what do you think will happen with the default settings?
After training, fill in the evidence and analysis.
Then try at least one more configuration. Some things worth experimenting with:
- epsilon_decay: how quickly does the agent commit to what it has learned?
- learning_rate: too high causes instability; too low is slow
- hidden_sizes in [model]: larger networks can represent more complex policies
- board = false with observe_state = ["food_dx", "food_dy"]: what if the agent sees only the direction to food, with no board at all?
When you change hidden_sizes or any [preprocessing] option, run
retro-gamer clean runs/forager/ before retraining.
Extension: Train an agent for your own game
If you have built a game using the retro-games framework, you can train a DQN agent to play it.
- Add a [tool.retro-gamer] section to your game's pyproject.toml:
  [tool.retro-gamer]
  actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
  reward = "score"
- Add state features to game.state that give the agent useful signal (e.g., direction to a target, distance to a wall).
- Create and train:
  $ retro-gamer create --game your_game/ --output runs/your_game/
  $ retro-gamer train runs/your_game/
See the retro-gamer documentation for the full reference.