Embeddings

Words have geometry. "Cat" sits closer to "dog" than to "democracy"—not because anyone drew a map, but because the two words appear in similar sentences. This lab explores that geometry: first by playing with a large set of embeddings trained on billions of words, then by understanding the mechanism that produces them, then by building a small version from scratch.

Playing with embeddings

An embedding represents a word as a list of numbers: a point in a high-dimensional space. The key property is geometric: words that appear in similar contexts end up with similar vectors, so the space organizes itself by meaning without ever being told what meaning is. Words that mean similar things cluster together. Antonyms often land close together too, because words like "hot" and "cold" appear in much the same sentences. Relationships like "capital city of" or "past tense of" often show up as roughly consistent directions in the space.

Researchers have trained embeddings on enormous corpora—hundreds of billions of words—and released the results for anyone to use. The gensim library makes them easy to download and explore. The wp command in this project uses them.

💻 Try this in a Python shell (uv run python):

>>> import gensim.downloader
>>> model = gensim.downloader.load('glove-wiki-gigaword-100')

The first time this runs it downloads about 128 MB; after that it loads from cache in a few seconds. The result is a KeyedVectors object: a large lookup table mapping words to their 100-dimensional embedding vectors.

>>> model['cat']
array([ 0.23088,  0.28283, -0.6142, ...])   # 100 numbers
>>> model.similarity('cat', 'dog')
0.8219
>>> model.similarity('cat', 'democracy')
0.0412

You can also ask for the words most similar to a given word:

>>> model.most_similar('king', topn=5)
[('queen', 0.7699), ('prince', 0.6840), ('royal', 0.6523), ('kings', 0.6510), ('throne', 0.6398)]

This returns a list of (word, similarity) pairs, ranked by cosine similarity to the query word. Explore the space a bit before moving on—try words from different domains, and try things you would not expect to work.
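
To make the ranking concrete, here is a minimal sketch of what a call like most_similar computes, assuming it simply ranks every other word by cosine similarity to the query. The three-dimensional vectors and the helper names `cosine` and `rank_neighbors` are made up for illustration; they are not GloVe values or gensim functions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this sketch.
toy_vectors = {
    "cat":       [0.9, 0.8, 0.1],
    "dog":       [0.8, 0.9, 0.2],
    "kitten":    [0.9, 0.7, 0.0],
    "democracy": [-0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_neighbors(word, vectors, topn=2):
    """Return the topn (word, similarity) pairs most similar to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

neighbors = rank_neighbors("cat", toy_vectors)
print(neighbors)
```

With these toy values, "kitten" and "dog" rank far above "democracy", mirroring the pattern in the real embeddings.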

Model structure

The GloVe embeddings above were produced by a model that adjusted a large matrix of numbers until it fit statistics about which words appear near each other in the corpus. In the rest of this lab you will build a simpler relative of that model: a rewrite of TinyLM that learns embeddings from scratch, by predicting the next word, instead of counting.

In the matrices lab, a one-hot vector selected a row from the count matrix W—the row corresponding to the current context window. The embeddings model uses the same mechanism, but with two differences: the matrix contains learned numbers instead of counts, and the one-hot encodes a single word rather than an entire context window.

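The model has two learned matrices: E, the embedding matrix, which has one row per vocabulary word, and W, the output matrix, which turns an embedding into one score per vocabulary word.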

Predicting the next word takes two matrix multiplications.

Step 1: Look up a word's embedding. A one-hot vector selects that word's row from E. Here is a tiny example: five words, three-dimensional embeddings. The one-hot for "chuck" (index 1) selects its row:

The result—[2, -1, 1]—is chuck's embedding. When the context window has more than one word, the embeddings for all context words are averaged into a single vector before the next step.
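
The lookup in Step 1 can be checked numerically. In this sketch only chuck's row, [2, -1, 1], comes from the text; the other four rows of E are placeholder values chosen for illustration, and the vocabulary order is assumed to be a, chuck, could, wood, would.

```python
import numpy as np

vocab = ["a", "chuck", "could", "wood", "would"]

# 5 words x 3 dimensions. Only row 1 (chuck) is from the text;
# the other rows are placeholders invented for this sketch.
E = np.array([
    [ 1.0,  0.0,  2.0],   # a      (placeholder)
    [ 2.0, -1.0,  1.0],   # chuck  (from the text)
    [ 0.0,  1.0, -1.0],   # could  (placeholder)
    [ 1.0,  1.0,  0.0],   # wood   (placeholder)
    [-1.0,  2.0,  1.0],   # would  (placeholder)
])

# One-hot vector for "chuck": all zeros except a 1 at index 1.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("chuck")] = 1.0

# Multiplying by a one-hot vector selects a single row of E.
chuck_emb = one_hot @ E
print(chuck_emb)                 # same values as E[1]

# With several context words, the rows are averaged into one vector.
indices = [vocab.index(w) for w in ["chuck", "wood"]]
ctx_emb = E[indices].mean(axis=0)
print(ctx_emb)
```

In practice, indexing E directly (E[1]) gives the same row without building the one-hot vector; the multiplication is shown only to make the mechanism visible.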

Step 2: Turn the context embedding into scores. Multiplying the context embedding by W produces one score (called a logit) per vocabulary word:

Higher scores mean the model thinks that word is more likely to come next. To turn them into probabilities, the model applies softmax, written $ \sigma $:

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\displaystyle\sum_j e^{z_j}}$$

For the logits [3, −2, −1, 3, −2] from the example above:

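| word | $ z_i $ | $ e^{z_i} $ | $ \sigma(\mathbf{z})_i $ |
|---|---|---|---|
| a | 3 | 20.09 | 0.492 |
| chuck | −2 | 0.14 | 0.003 |
| could | −1 | 0.37 | 0.009 |
| wood | 3 | 20.09 | 0.492 |
| would | −2 | 0.14 | 0.003 |
| sum | | 40.81 | 1.000 |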
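
The numbers in the table above can be reproduced in a few lines. This is a plain-Python sketch of softmax applied to the example logits; the function name `softmax` is local to the sketch and is not necessarily how tlm implements it.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    # Subtracting the max first does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

words = ["a", "chuck", "could", "wood", "would"]
logits = [3, -2, -1, 3, -2]
probs = softmax(logits)

for word, z, p in zip(words, logits, probs):
    print(f"{word:6s} z={z:2d}  e^z={math.exp(z):6.2f}  p={p:.3f}")
```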

"a" and "wood" each get nearly half the probability; everything else is negligible.

The full forward pass

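Putting these steps together, the pipeline from context words to probabilities is: look up each context word's row in E, average those rows into a single context embedding, multiply by W to get one logit per word, and apply softmax. The running example uses five words and three-dimensional embeddings.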

The actual model uses 11 words and 32-dimensional embeddings, so E is $ 11 \times 32 $ and W is $ 32 \times 11 $, but the structure is identical.

This is exactly what _forward computes in tlm/model.py:

def _forward(self, context):
    indices = [self.word_to_idx[w] for w in context]  # find each word's row index in E
    ctx_emb = self.E[indices].mean(axis=0)             # look up rows; average into one vector
    logits  = ctx_emb @ self.W + self.b               # project through W and shift by b
    probs   = softmax(logits)                          # normalize scores to probabilities
    return ctx_emb, probs
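
For experimenting outside the tlm package, the same computation can be written as a standalone function. This is a sketch, not the project's code. E, W, and b are toy values: only chuck's embedding and the logits [3, -2, -1, 3, -2] come from the example above (assuming those logits were computed from chuck's embedding alone), W was chosen here so that it reproduces those logits, and b is set to zero.

```python
import numpy as np

vocab = ["a", "chuck", "could", "wood", "would"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Toy parameters for this sketch (see the note above on which values are assumed).
E = np.array([
    [ 1.0,  0.0,  2.0],
    [ 2.0, -1.0,  1.0],   # chuck's embedding, from the text
    [ 0.0,  1.0, -1.0],
    [ 1.0,  1.0,  0.0],
    [-1.0,  2.0,  1.0],
])
W = np.array([
    [ 1.0, -1.0, 0.0,  1.0, -1.0],
    [ 0.0,  1.0, 1.0, -1.0,  0.0],
    [ 1.0,  1.0, 0.0,  0.0,  0.0],
])
b = np.zeros(len(vocab))

def softmax(z):
    """Normalize scores to probabilities (max subtracted for stability)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(context):
    indices = [word_to_idx[w] for w in context]   # row index of each context word
    ctx_emb = E[indices].mean(axis=0)             # average the selected rows
    logits = ctx_emb @ W + b                      # one score per vocabulary word
    return ctx_emb, softmax(logits)

ctx_emb, probs = forward(["chuck"])
for word, p in zip(vocab, probs):
    print(f"{word:6s} {p:.3f}")
```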

Training

Now train the model on our familiar tongue-twister corpus:

tlm train --filepath chuck.txt

You will see output like this:

Epoch 1/5  loss=2.3512
Epoch 2/5  loss=2.2375
Epoch 3/5  loss=2.1561
Epoch 4/5  loss=2.1002
Epoch 5/5  loss=2.0608
Model saved to model.json

Something new is happening here. In the count model, training meant reading through the corpus once and filling in a matrix—done in a single pass. Here, training means making many passes through the corpus, measuring how wrong the predictions are, and gradually getting better. The number on each line is the loss: a measure of how wrong the model currently is, on average. Watch it decrease across epochs.
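
One way to read these numbers, assuming the loss is the average of -log(p) for the correct next word using the natural logarithm (as the training steps later in this lab describe): a model that guesses uniformly over an 11-word vocabulary would score ln(11), and exp(-loss) converts a loss back into a typical probability assigned to the correct word. The vocabulary size of 11 is the one reported for this corpus later in the lab.

```python
import math

vocab_size = 11  # vocabulary size reported for chuck.txt in this lab

# Loss of a model that assigns probability 1/11 to every word.
uniform_loss = -math.log(1 / vocab_size)
print(f"uniform-guess loss: {uniform_loss:.4f}")

# Losses copied from the sample training output above.
epoch_losses = [2.3512, 2.2375, 2.1561, 2.1002, 2.0608]
for epoch, loss in enumerate(epoch_losses, start=1):
    # exp(-loss) is the geometric-mean probability given to the correct word.
    typical_p = math.exp(-loss)
    print(f"epoch {epoch}: loss={loss:.4f}  typical p(correct)={typical_p:.3f}")
```

Under that assumption, the first epoch is only slightly better than uniform guessing, and each later epoch raises the typical probability of the correct word a little further.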

💻 Generate some text:

tlm generate --model model.json

Where is the model?

In the count model, we could inspect model in an interactive shell and read the count matrix directly—rows for contexts, columns for words, integers we could interpret at a glance. Where does the learning live in this model?

💻 Use --interact to open a Python shell after generating:

tlm generate --model model.json --interact

Look at the two main matrices:

>>> model.E.shape
(11, 32)
>>> model.W.shape
(32, 11)

The vocabulary has 11 words; the default embedding size is 32. E is an $ 11 \times 32 $ matrix and W is a $ 32 \times 11 $ matrix. Compare this to the count matrix from the previous lab, which was $ 11 \times 20 $: one column per unique context window seen in the corpus. This model has no such limit. E has one row per word, regardless of how many context windows appeared.

Look at one word's row:

>>> model.vocab
['a', 'all', 'chuck', 'could', 'how', 'if', 'it', 'much', 'the', 'wood', 'would']
>>> model.E[model.word_to_idx['chuck']]
array([ 0.043, -0.112,  0.087, -0.201,  0.034, ...])  # 32 numbers

This row—32 numbers—is the model's learned representation of the word "chuck." It is called an embedding. The model started with random numbers here and adjusted them over training to make better predictions.

Words in space

Think of each word's embedding as a set of coordinates that places the word at a point in a 32-dimensional space. (We can't draw 32 dimensions, but the math works the same as it does in 2 or 3.)

Words that appear in similar contexts will tend to end up near each other in this space, because the model learns to make similar predictions for them. The model is never told anything about what words mean—the structure emerges from statistics.

You can measure how close two words are using cosine similarity: a number between -1 and 1. A value of 1 means the two embedding vectors point in exactly the same direction, 0 means they are perpendicular (the words look unrelated to the model), and -1 means they point in opposite directions.
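
To get a feel for the scale, cosine similarity can be tried on vectors whose relationship is obvious. The two-dimensional vectors below are invented for illustration, and the helper name `cosine` is local to this sketch; it uses the usual definition, the dot product divided by the product of the vector lengths.

```python
import numpy as np

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([3.0, 4.0])

same_direction = cosine(v, 2 * v)                   # longer, but same direction
perpendicular = cosine(v, np.array([-4.0, 3.0]))    # at a right angle to v
opposite = cosine(v, -v)                            # reversed direction

print(same_direction, perpendicular, opposite)
```

Only direction matters: doubling a vector's length leaves its cosine similarity to every other vector unchanged.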

>>> import numpy as np
>>> chuck = model.E[model.word_to_idx['chuck']]
>>> wood = model.E[model.word_to_idx['wood']]
>>> np.dot(chuck, wood) / (np.linalg.norm(chuck) * np.linalg.norm(wood))

With a corpus of 32 words there is not much to learn from, and the similarities will not be very meaningful. The same model scales up.

💻 Train on a larger corpus and explore the resulting embeddings:

tlm train --gutenberg austen-emma.txt -t lower -t alpha --epochs 10 --output emma.json
tlm generate --model emma.json --interact
>>> import numpy as np
>>> def sim(a, b):
...     return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
>>> E = model.E
>>> wi = model.word_to_idx

Compute similarities between pairs of words that you expect to be related, and pairs you expect to be unrelated. Do the numbers match your intuition?

How prediction works

Look at the _forward method in tlm/model.py. When the model predicts the next word, it does three things:

  1. Look up and average: retrieve the embedding row E[i] for each word i in the context window, and average them into a single vector.
  2. Multiply: compute context_vector @ W, then add the bias vector b, to produce one score (called a logit) for each word in the vocabulary.
  3. Softmax: turn those scores into a probability distribution.

Steps 2 and 3 are the same structure as the matrices lab—a matrix multiplication followed by normalization—but now W is learned rather than filled by counting, and the context vector is a dense embedding rather than a one-hot selector.

This structure—input, matrix multiplication, nonlinear transformation, output—is the basic building block of a neural network. This model has one such layer. The large language models you use today have hundreds.

How training works

Look at the _step method in tlm/model.py. For each (context, target) pair:

  1. Run _forward to get a probability distribution.
  2. Check the probability the model assigned to the correct next word. The loss for this step is $ -\log(p_{\text{target}}) $: it is near zero when the model is confident and correct, and grows large when the model is wrong or uncertain.
  3. Compute how much each number in E and W contributed to the error. This is backpropagation: tracing the loss back through the math to figure out how to change each parameter to do better.
  4. Nudge each parameter by a small amount in the direction that would reduce the loss. This is gradient descent.

This loop—forward pass, compute loss, backpropagate, update—runs for every training example, and repeats every epoch. Over many epochs, E and W move toward values that produce better predictions.
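
The four steps above can be sketched in NumPy as follows. This is an illustrative reimplementation under stated assumptions, not the contents of _step in tlm/model.py: the sizes (5 words, 3 dimensions), the random initialization, the learning rate of 0.1, and the function names are all choices made for this sketch. The gradients rely on a standard result: for softmax followed by -log(p_target), the derivative of the loss with respect to the logits is the probability vector minus the one-hot target.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                       # toy sizes for this sketch
E = rng.normal(scale=0.1, size=(vocab_size, dim))
W = rng.normal(scale=0.1, size=(dim, vocab_size))
b = np.zeros(vocab_size)

def forward(E, W, b, indices):
    ctx_emb = E[indices].mean(axis=0)        # average the context embeddings
    logits = ctx_emb @ W + b
    exps = np.exp(logits - logits.max())     # softmax, shifted for stability
    return ctx_emb, exps / exps.sum()

def step(E, W, b, indices, target, lr=0.1):
    """One forward pass, loss, backpropagation, and update (arrays change in place)."""
    ctx_emb, probs = forward(E, W, b, indices)      # 1. forward pass
    loss = -np.log(probs[target])                   # 2. loss for this example

    # 3. Backpropagation: gradient of the loss with respect to each parameter.
    d_logits = probs.copy()
    d_logits[target] -= 1.0                         # probs minus one-hot target
    d_W = np.outer(ctx_emb, d_logits)
    d_b = d_logits
    d_ctx = W @ d_logits                            # computed before W is updated

    # 4. Gradient descent: nudge each parameter against its gradient.
    W -= lr * d_W
    b -= lr * d_b
    # Each context row received 1/len(indices) of the averaged gradient;
    # subtract.at handles a word that appears twice in the context.
    np.subtract.at(E, indices, lr * d_ctx / len(indices))
    return loss

# Repeat the step on one made-up example: context words 1 and 3, target word 0.
losses = [step(E, W, b, [1, 3], target=0) for _ in range(50)]
print(f"first loss: {losses[0]:.4f}   last loss: {losses[-1]:.4f}")
```

Training on a single repeated example only shows that the loss falls; a real run would loop over every (context, target) pair in the corpus, once per epoch.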