Reinforcement learning lab

Use the /mwc-module skill.

Create a lab at /Users/chrisp/Repos/MWC/making-with-code/site/content/courses/dp/labs/reinforcement_learning, with starter code in a repo at ~/Repos/MWC/modules/reinforcement_learning.

Learning objectives

The learning objectives for this lab are:

A4.2.2 Describe the role of feature selection.
- Feature selection to identify and retain the most informative attributes of the data set
- Feature selection strategies: filter methods, wrapper methods, embedded methods
A4.2.3 Describe the importance of dimensionality reduction.
- The curse of dimensionality considerations may include overfitting, computational complexity, data sparsity, the effectiveness of distance metrics, data visualization, sample size increases, memory usage.
- Dimensionality reduction of variables, while preserving the relevant aspects of the data. Note: Statistical techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) are beyond the scope of this course.
A4.3.3 Explain the role of hyperparameter tuning when evaluating supervised learning algorithms.
- Accuracy, precision, recall and F1 score as evaluation metrics
- The role of hyperparameter tuning on model performance
- Overfitting and underfitting when training algorithms
A4.3.6 Describe how an agent learns to make decisions by interacting with its environment in reinforcement learning.
- The principle of cumulative reward and the foundational concepts of agent–environment interaction, encompassing actions, states, rewards and policies
- The exploration versus exploitation trade-off as a core concept in reinforcement learning
A4.3.8 Outline the structure and function of ANNs and how multi-layer networks are used to model complex patterns in data sets.
- An artificial neural network (ANN) to simulate interconnected nodes or “neurons” to process and learn from input data, enabling tasks such as classification, regression and pattern recognition
- Sketch of a single perceptron, highlighting its input, weights, bias, activation function and output
- Sketch of a multi-layer perceptron (MLP) encompassing the input layer, one or more hidden layers and the output layer.
A4.3.9 Describe how CNNs are designed to adaptively learn spatial hierarchies of features in images.
- Convolutional neural network (CNN) basic architecture: input layer, convolutional layers, activation functions, pooling layers, fully connected layers, output layer
- The effect of the number of layers, kernel size and stride, activation function selection, and the loss function on how CNNs process input data and classify images
A4.3.10 Explain the importance of model selection and comparison in machine learning.
- How different algorithms can yield different results depending on the data and type of problem
- The reasons for selecting specific machine learning models over others, considering factors like the nature of the problem, its complexity and desired outcomes
- The variability in algorithm performance based on the data’s characteristics

Ensure that the learning objectives are included in the lab's metadata using the provided mechanism. Also add a teaching note at the beginning of the lab explaining how the lab meets the learning objectives. This lab is designed to complement direct instruction, not substitute for it, so while definitions are provided for all key terms and concepts, we expect that students will also be receiving reinforcement of these concepts before, during, and/or after the lab. Integration of the lab into the course is up to the teacher.

Some of the objectives are core to the lab experience; for others, the lab provides a rich authentic opportunity to explore the concepts. In the latter case, add teaching notes throughout the lab explaining how and where they come up. Remember, our goal is to keep the text of the lab (aside from teaching notes) streamlined, focused on what students will do--students will not read huge amounts of text.

Lab structure

Training forager

In the repo, use retro-games to implement a game called "babysnake": a small grid on which an agent collects a food item that respawns after collection.

State: (agent_x, agent_y, food_x, food_y) — four integers, immediately human-readable
State space: on a 4×4 board, 16 × 15 = 240 distinct states (food and agent can't overlap). Small enough to print the entire Q-table as a class artifact.
Actions: 4 directions (+ no-op)
Reward: +1 on collection, small step penalty to discourage wandering
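
As a sanity check on the state count above, here is a minimal sketch that enumerates every state on a 4×4 board. The function name enumerate_states and the tuple ordering are illustrative assumptions; they are not part of retro-games.

```python
from itertools import product

BOARD_SIZE = 4  # 4x4 board, as specified above


def enumerate_states(size=BOARD_SIZE):
    """Return every (agent_x, agent_y, food_x, food_y) where agent and food occupy different cells."""
    cells = list(product(range(size), repeat=2))
    return [
        (ax, ay, fx, fy)
        for (ax, ay) in cells
        for (fx, fy) in cells
        if (ax, ay) != (fx, fy)  # food and agent can't overlap
    ]


states = enumerate_states()
print(len(states))  # 16 * 15 = 240
```

A table with one row per state and one column per action is small enough to print in full, which is what makes the Q-table usable as a class artifact.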

First, have students play the game and write down the reasoning they are using.

Introduce Q-learning with the babysnake game; students will manually calculate some Q updates, and be guided to implement Q-learning in Python, with some starter code.
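
For reference, the update students calculate by hand could look like the following minimal sketch of tabular Q-learning with epsilon-greedy action selection. The action names, the default alpha and gamma values, and the function names are all assumptions for illustration; the starter code may be organized differently.

```python
import random
from collections import defaultdict

# Hypothetical action names; the real game may use different identifiers.
ACTIONS = ["up", "down", "left", "right", "noop"]


def make_q_table():
    """Q-table mapping state -> {action: value}, with every value starting at 0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    values = q[state]
    return max(ACTIONS, key=lambda a: values[a])


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')."""
    best_next = max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

With an empty table, a single call such as q_update(q, (0, 0, 1, 0), "right", 1.0, (1, 0, 3, 3), alpha=0.5) moves that entry from 0 to 0.5, which is the kind of calculation students can verify by hand.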

This section ends with a checkpoint in which students have to train babysnake to perform well.

Also write a solution for the student exercise. I will run the solution to ensure that it produces a well-behaved babysnake, and I'll remove it from the repo before publication.

Training snake

This section introduces training of a more complex game, snake. Students will not train snake on their own. Instead, we will provide artifacts of our attempts to train the snake, drawn from this Claude Code session's conversation history, and then students will answer conceptual questions, interpreting the evidence we encountered and reasoning about the behavior of epsilon, learning rate, loss, etc., in other situations. Organize this section into a list of subsections; start each subsection with our hypothesis for what might work, explain what we did, and then show evidence of how it went.

First, copy runs/snake-ego into the student repo, saving just a few interesting checkpoints (e.g. around episode 1800, when the initial reward spiked; midway through increasing performance; and the final policy). Add a checkpoint asking students to describe the policy's behavior in each.

I saved full data from one previous run (in ~/Desktop/snaketrainer); use this as a mid-point case study. For others, draw evidence from earlier in this conversation and summarize. Walk students through interpreting the evidence, introducing concepts and terms as needed. Present this without referring to changes/refactors we made to the framework: present the progress as if the retro-gamer framework and its contract with retro-games were stable in their final form the whole time. The point is not system design, but RL concepts. It's fine to compress multiple iterations into a single synthesized iteration to make the story cleaner.

This list of subsections should definitely include:

Initial effort to train snake: Q-value divergence with runaway loss.
Adding new explicit features to provide reward signal (moving toward/away from the apple)
Trying a spatial (CNN) model.
Adjusting epsilon and learning rate.
Adding egocentric board representation.
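
For the subsection on adjusting epsilon, a schedule like the one below may help students reason about how exploration changes over training. This is a minimal sketch of an exponential decay with a floor; the function name epsilon_at and the start, end, and decay values are illustrative assumptions, not the settings used in our runs.

```python
def epsilon_at(episode, start=1.0, end=0.05, decay=0.995):
    """Exploration rate for a given episode: exponential decay from start, never below end."""
    return max(end, start * decay ** episode)


# Print a few sample points so students can see the shape of the schedule.
for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 3))
```

A slower decay keeps the agent exploring longer, which can help when early rewards are sparse, at the cost of slower convergence to a stable policy.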

Note: For this section (Training snake), use runs/snake-ego at its current state--it is still training. We will return to the lesson and update the lesson and the repo once training has completed.

End this section with a checkpoint asking students to complete a list of conceptual questions, written out in the checkpoint, and in snake_training.md in the repo. Ensure that the conceptual questions here and in the next section are aligned with the lab's learning objectives.

Training x

In this section, students will train their own game. Create a small, easily trainable game in the repo, as well as training_log.md, where students should document their efforts in the same manner as with the previous section on training snake. The game to be trained should be simple enough that students will have success training an intelligent agent, but sufficiently complex that different training regimes will produce agents with different levels of success. A class might want to have a competition to train the most successful agent.

End this section with a checkpoint asking students to complete the training log, analyze their own success with training, analyze the behavior of their final policy, and answer a few conceptual questions.

Extension: Train an agent for your own game

Invite students to train an agent for their own games.

Process notes

End the lab with a teaching note that suggests discussion prompts for connecting RL in this lab to real-life situations, both in CS and more broadly (e.g. how does human behavior reflect RL?), and that reflects on the lab in its current state: what is currently strong, and suggestions for improvement.