Embeddings
Words have geometry. "Cat" sits closer to "dog" than to "democracy"—not because anyone drew a map, but because the two words appear in similar sentences. This lab explores that geometry: first by playing with a large set of embeddings trained on billions of words, then by understanding the mechanism that produces them, then by building a small version from scratch.
Playing with embeddings
An embedding represents a word as a list of numbers, that is, a point in a high-dimensional space. The key property is geometric: words that appear in similar contexts end up with similar vectors, so the space organizes itself by meaning without ever being told what meaning is. Words that mean similar things cluster together. Antonyms often land close together too, because words like "hot" and "cold" show up in the same kinds of sentences. Relationships like "capital city of" or "past tense of" tend to show up as roughly consistent directions in the space.
Researchers have trained embeddings on enormous corpora—hundreds of billions of
words—and released the results for anyone to use. The gensim library makes them easy
to download and explore. The wp command in this project uses them.
💻 Try this in a Python shell (uv run python):
>>> import gensim.downloader
>>> model = gensim.downloader.load('glove-wiki-gigaword-100')
The first time this runs it downloads about 128 MB; after that it loads from cache in a
few seconds. The result is a KeyedVectors object: a large lookup table mapping words
to their 100-dimensional embedding vectors.
>>> model['cat']
array([ 0.23088, 0.28283, -0.6142, ...]) # 100 numbers
>>> model.similarity('cat', 'dog')
0.8219
>>> model.similarity('cat', 'democracy')
0.0412
You can also ask for the words most similar to a given word:
>>> model.most_similar('king', topn=5)
[('queen', 0.7699), ('prince', 0.6840), ('royal', 0.6523), ('kings', 0.6510), ('throne', 0.6398)]
This returns a list of (word, similarity) pairs, ranked by cosine similarity to the
query word. Explore the space a bit before moving on—try words from different domains,
and try things you would not expect to work.
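The idea that relationships appear as directions can be sketched without downloading anything. The example below uses hand-picked two-dimensional vectors, which are hypothetical and chosen only so the arithmetic is easy to follow. The helper names `cosine` and `analogy` are also made up for this illustration.

```python
import numpy as np

# Hypothetical 2-D "embeddings", hand-picked for illustration only.
# Axis 0 loosely encodes royalty, axis 1 loosely encodes gender.
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Find the word closest to vectors[b] - vectors[a] + vectors[c].

    The three query words are excluded from the candidates.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man" is to "king" as "woman" is to ...?
answer = analogy("man", "king", "woman", toy)
print(answer)
```

With the real GloVe vectors, gensim's `model.most_similar(positive=['king', 'woman'], negative=['man'])` performs comparable vector arithmetic, though the results on real data are noisier than in this toy.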
Model structure
The GloVe embeddings above were produced by a model that learned from which words appear near which, adjusting a large matrix of numbers until its vectors matched the co-occurrence statistics of the corpus. In the rest of this lab you will build a simpler relative of that model, one that learns by predicting the next word: a rewrite of TinyLM that learns embeddings from scratch instead of counting.
In the matrices lab, a one-hot vector selected a row from the count matrix W—the row corresponding to the current context window. The embeddings model uses the same mechanism, but with two differences: the matrix contains learned numbers instead of counts, and the one-hot encodes a single word rather than an entire context window.
The model has two learned matrices:
- E (vocab_size × embedding_dim): one row per word in the vocabulary. Each row is that word's embedding—a dense vector of numbers the model has learned.
- W (embedding_dim × vocab_size): maps a context embedding to one score per vocabulary word.
Predicting the next word takes two matrix multiplications.
Step 1: Look up a word's embedding. A one-hot vector selects that word's row from E. Here is a tiny example: five words, three-dimensional embeddings. The one-hot for "chuck" (index 1) selects its row:
The result—[2, -1, 1]—is chuck's embedding. When the context window has more than one word, the embeddings for all context words are averaged into a single vector before the next step.
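The lookup and the averaging can be checked in a few lines of NumPy. In this sketch only chuck's row, [2, -1, 1], comes from the example. The other rows of `E` are made-up placeholders, and the two-word context is an assumption added to show the averaging.

```python
import numpy as np

vocab = ["a", "chuck", "could", "wood", "would"]

# Hypothetical embedding matrix: only chuck's row (index 1) comes from the
# text; every other row is a made-up placeholder.
E = np.array([
    [ 1,  0,  2],   # a      (made up)
    [ 2, -1,  1],   # chuck  (from the text)
    [ 0,  1, -1],   # could  (made up)
    [ 1,  2,  0],   # wood   (made up)
    [-1,  0,  1],   # would  (made up)
])

# A one-hot vector has a single 1 at the word's index.
one_hot = np.zeros(len(vocab), dtype=int)
one_hot[vocab.index("chuck")] = 1

via_matmul = one_hot @ E               # multiply the one-hot by E
via_index = E[vocab.index("chuck")]    # or simply index the row
print(via_matmul, via_index)

# With several context words, average their rows into one vector.
context = ["chuck", "wood"]
ctx_emb = E[[vocab.index(w) for w in context]].mean(axis=0)
print(ctx_emb)
```

Multiplying by a one-hot vector and indexing the row give the same result, which is why code usually indexes directly instead of building the one-hot.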
Step 2: Turn the context embedding into scores. Multiplying the context embedding by W produces one score (called a logit) per vocabulary word:
Higher scores mean the model thinks that word is more likely to come next. To turn them into probabilities, the model applies softmax, written $ \sigma $:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\displaystyle\sum_j e^{z_j}}$$
For the logits [3, −2, −1, 3, −2] from the example above:
| word | $ z_i $ | $ e^{z_i} $ | $ \sigma(\mathbf{z})_i $ |
|---|---|---|---|
| a | 3 | 20.09 | 0.492 |
| chuck | −2 | 0.14 | 0.003 |
| could | −1 | 0.37 | 0.009 |
| wood | 3 | 20.09 | 0.492 |
| would | −2 | 0.14 | 0.003 |
| sum | | 40.81 | 1.000 |
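The table can be reproduced with a short softmax function. This is a minimal sketch, not necessarily how `softmax` is written in the project. Subtracting the largest logit before exponentiating is a common numerical-stability trick and is an assumption here, since it does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # same result, but avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

vocab = ["a", "chuck", "could", "wood", "would"]
logits = [3, -2, -1, 3, -2]
probs = softmax(logits)

for word, p in zip(vocab, probs):
    print(f"{word:6s} {p:.3f}")
```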
"a" and "wood" each get nearly half the probability; everything else is negligible.
The full forward pass
Putting these steps together, the pipeline from context words to probabilities looks like this (using our five-word, three-dimensional example):
The actual model uses 11 words and 32-dimensional embeddings, so E is $ 11 \times 32 $ and W is $ 32 \times 11 $, but the structure is identical.
This is what _forward computes in tlm/model.py. The code also adds a bias vector b to the logits, a per-word offset that the walkthrough above leaves out:
def _forward(self, context):
    indices = [self.word_to_idx[w] for w in context]  # find each word's row index in E
    ctx_emb = self.E[indices].mean(axis=0)            # look up rows; average into one vector
    logits = ctx_emb @ self.W + self.b                # project through W and shift by b
    probs = softmax(logits)                           # normalize scores to probabilities
    return ctx_emb, probs
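To experiment with the forward pass outside the project, a standalone version might look like the sketch below. The function mirrors the method above, but the random initialization, the scale of 0.1, and the zero bias are assumptions, as is the `softmax` helper. The real values in tlm/model.py may be set up differently.

```python
import numpy as np

def softmax(z):
    """Normalize logits into probabilities (max subtracted for stability)."""
    exps = np.exp(z - z.max())
    return exps / exps.sum()

def forward(context, word_to_idx, E, W, b):
    """Standalone counterpart of the _forward method shown above."""
    indices = [word_to_idx[w] for w in context]
    ctx_emb = E[indices].mean(axis=0)   # average the context rows of E
    logits = ctx_emb @ W + b            # one score per vocabulary word
    return ctx_emb, softmax(logits)

vocab = ['a', 'all', 'chuck', 'could', 'how', 'if', 'it', 'much',
         'the', 'wood', 'would']
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Assumed initialization: small random numbers, zero bias.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), 32))
W = rng.normal(scale=0.1, size=(32, len(vocab)))
b = np.zeros(len(vocab))

ctx_emb, probs = forward(["how", "much"], word_to_idx, E, W, b)
print(ctx_emb.shape, probs.shape, probs.sum())
```

Before any training, the probabilities come out close to uniform, because small random parameters produce logits near zero.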
Training
Now train the model on our familiar tongue-twister corpus:
tlm train --filepath chuck.txt
You will see output like this:
Epoch 1/5 loss=2.3512
Epoch 2/5 loss=2.2375
Epoch 3/5 loss=2.1561
Epoch 4/5 loss=2.1002
Epoch 5/5 loss=2.0608
Model saved to model.json
Something new is happening here. In the count model, training meant reading through the corpus once and filling in a matrix—done in a single pass. Here, training means making many passes through the corpus, measuring how wrong the predictions are, and gradually getting better. The number on each line is the loss: a measure of how wrong the model currently is, on average. Watch it decrease across epochs.
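One way to read these numbers: the per-example loss is the negative log of the probability given to the correct word. A model that guesses uniformly over the 11-word vocabulary would score about 2.398, so the first-epoch loss of 2.3512 is only slightly better than guessing. The probabilities in the loop below are arbitrary values chosen for illustration.

```python
import math

vocab_size = 11

# Loss of a model that assigns probability 1/11 to every word.
uniform_loss = -math.log(1 / vocab_size)
print(round(uniform_loss, 4))  # 2.3979

# Loss for a few illustrative probabilities assigned to the correct word.
for p in (0.9, 0.5, 1 / vocab_size, 0.01):
    print(f"p = {p:.3f}  loss = {-math.log(p):.3f}")
```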
💻 Generate some text:
tlm generate --model model.json
Where is the model?
In the count model, we could inspect model in an interactive shell and read the count
matrix directly—rows for contexts, columns for words, integers we could interpret at a
glance. Where does the learning live in this model?
💻 Use --interact to open a Python shell after generating:
tlm generate --model model.json --interact
Look at the two main matrices:
>>> model.E.shape
(11, 32)
>>> model.W.shape
(32, 11)
The vocabulary has 11 words; the default embedding size is 32. E is an
$ 11 \times 32 $ matrix and W is a $ 32 \times 11 $ matrix.
Compare this to the count matrix from the previous lab, which was
$ 11 \times 20 $: one column per unique context window seen in the corpus.
This model has no such limit. E has one row per word, regardless of how many
context windows appeared.
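The size comparison can be made concrete with a little arithmetic. The function names below are hypothetical, the 10,000-context figure is invented for illustration, and the bias term assumes the b vector used in _forward has one entry per vocabulary word.

```python
def count_matrix_size(vocab_size, n_contexts):
    """Entries in a count matrix with one column per context window."""
    return vocab_size * n_contexts

def embedding_model_size(vocab_size, embedding_dim, with_bias=True):
    """Entries in E and W, plus an assumed bias vector b."""
    size = vocab_size * embedding_dim      # E
    size += embedding_dim * vocab_size     # W
    if with_bias:
        size += vocab_size                 # b (assumed: one per word)
    return size

# The tongue-twister corpus: 11 words, 20 distinct context windows.
print(count_matrix_size(11, 20))        # 220
print(embedding_model_size(11, 32))     # 715

# Hypothetical: same vocabulary, but 10,000 distinct context windows.
print(count_matrix_size(11, 10_000))    # 110000
print(embedding_model_size(11, 32))     # still 715
```

On the tiny corpus the embedding model is actually the larger of the two. Its advantage is that its size depends only on the vocabulary and the embedding dimension, not on how many distinct contexts the corpus contains.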
Look at one word's row:
>>> model.vocab
['a', 'all', 'chuck', 'could', 'how', 'if', 'it', 'much', 'the', 'wood', 'would']
>>> model.E[model.word_to_idx['chuck']]
array([ 0.043, -0.112, 0.087, -0.201, 0.034, ...]) # 32 numbers
This row—32 numbers—is the model's learned representation of the word "chuck." It is called an embedding. The model started with random numbers here and adjusted them over training to make better predictions.
Words in space
Think of each word's embedding as a set of coordinates that places the word at a point in a 32-dimensional space. (We can't draw 32 dimensions, but the math works the same as it does in 2 or 3.)
Words that appear in similar contexts will tend to end up near each other in this space, because the model learns to make similar predictions for them. The model is never told anything about what words mean—the structure emerges from statistics.
You can measure how close two words are using cosine similarity: a number between -1 and 1, where 1 means the two embedding vectors point in exactly the same direction, 0 means they are perpendicular (no shared direction), and -1 means they point in opposite directions.
>>> import numpy as np
>>> chuck = model.E[model.word_to_idx['chuck']]
>>> wood = model.E[model.word_to_idx['wood']]
>>> np.dot(chuck, wood) / (np.linalg.norm(chuck) * np.linalg.norm(wood))
With a corpus of 32 words there is not much to learn from, and the similarities will not be very meaningful. The same model scales up.
💻 Train on a larger corpus and explore the resulting embeddings:
tlm train --gutenberg austen-emma.txt -t lower -t alpha --epochs 10 --output emma.json
tlm generate --model emma.json --interact
>>> import numpy as np
>>> def sim(a, b):
...     return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
>>> E = model.E
>>> wi = model.word_to_idx
Compute similarities between pairs of words that you expect to be related, and pairs you expect to be unrelated. Do the numbers match your intuition?
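Checking pairs one at a time gets tedious, so it can help to rank the whole vocabulary against one word, in the spirit of gensim's most_similar. The function below is a hypothetical helper, not part of tlm. It is shown on a made-up three-word matrix so that it runs on its own.

```python
import numpy as np

def most_similar(word, E, word_to_idx, topn=3):
    """Rank all other words by cosine similarity to `word`.

    Assumes no row of E is all zeros (that would divide by zero).
    """
    idx = word_to_idx[word]
    norms = np.linalg.norm(E, axis=1)            # length of every row
    sims = (E @ E[idx]) / (norms * norms[idx])   # cosine with every row
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    order = np.argsort(-sims)                    # most similar first
    ranked = [(idx_to_word[int(i)], float(sims[i]))
              for i in order if int(i) != idx]
    return ranked[:topn]

# Made-up stand-in for model.E and model.word_to_idx.
toy_wi = {"cat": 0, "dog": 1, "democracy": 2}
toy_E = np.array([
    [ 1.0, 0.2],
    [ 0.9, 0.3],
    [-0.2, 1.0],
])

result = most_similar("cat", toy_E, toy_wi, topn=2)
print(result)
```

In the interactive shell, a call like `most_similar('emma', model.E, model.word_to_idx)` should work the same way, assuming the word is in the vocabulary.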
How prediction works
Look at the _forward method in tlm/model.py. When the model predicts the next word, it does three things:
1. Look up and average: retrieve the embedding row E[i] for each word i in the context window, and average them into a single vector.
2. Multiply: compute context_vector @ W to produce one score, called a logit, for each word in the vocabulary.
3. Softmax: turn those scores into a probability distribution.
Steps 2 and 3 are the same structure as the matrices lab—a matrix multiplication
followed by normalization—but now W is learned rather than filled by counting,
and the context vector is a dense embedding rather than a one-hot selector.
This structure (input, matrix multiplication, nonlinear transformation, output) is the basic building block of a neural network. This model has one such layer. The large language models you use today stack dozens to hundreds of them.
How training works
Look at the _step method in tlm/model.py. For each (context, target) pair:
1. Run _forward to get a probability distribution.
2. Check the probability the model assigned to the correct next word. The loss for this step is $ -\log(p_{\text{target}}) $: it is near zero when the model is confident and correct, and grows large when the model is wrong or uncertain.
3. Compute how much each number in E and W contributed to the error. This is backpropagation: tracing the loss back through the math to figure out how to change each parameter to do better.
4. Nudge each parameter by a small amount in the direction that would reduce the loss. This is gradient descent.
This loop—forward pass, compute loss, backpropagate, update—runs for every training
example, and repeats every epoch. Over many epochs, E and W move toward values
that produce better predictions.
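The loop above can be sketched as a single function. This is an illustration under stated assumptions, not the project's _step: the learning rate of 0.1, the random initialization, and the toy sizes are all made up. The gradients are the standard ones for softmax with a negative-log loss, where the gradient with respect to the logits is the predicted probabilities minus the one-hot target.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - z.max())
    return exps / exps.sum()

def step(indices, target, E, W, b, lr=0.1):
    """One gradient-descent update on a single (context, target) pair.

    Updates E, W, and b in place; returns the loss before the update.
    """
    # Forward pass.
    ctx_emb = E[indices].mean(axis=0)
    probs = softmax(ctx_emb @ W + b)
    loss = -np.log(probs[target])

    # Backpropagation.
    d_logits = probs.copy()
    d_logits[target] -= 1.0              # probs minus one-hot target
    d_W = np.outer(ctx_emb, d_logits)    # same shape as W
    d_b = d_logits
    d_ctx = W @ d_logits                 # gradient for the averaged vector

    # Gradient descent: nudge each parameter against its gradient.
    W -= lr * d_W
    b -= lr * d_b
    # Each context word contributed 1/len(indices) of the average;
    # subtract.at handles a word that appears twice in the context.
    np.subtract.at(E, indices, lr * d_ctx / len(indices))
    return float(loss)

# Toy setup (assumed): 5 words, 3-dimensional embeddings.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(5, 3))
W = rng.normal(scale=0.1, size=(3, 5))
b = np.zeros(5)

# Repeat the same training example and watch the loss fall.
losses = [step([1, 3], 0, E, W, b) for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating one example is only a sanity check. Real training cycles through every (context, target) pair in the corpus once per epoch.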