Deep Dives · 13 min read · 25 May 2026

How Transformers Work (Plain English)

The transformer architecture powers nearly every major AI language model. Here's how it actually works: no PhD required, just curiosity.

The "Attention Is All You Need" paper by Vaswani et al. (2017) from Google did not just improve on RNNs — it replaced the sequential mechanism entirely. The name reflects the paper's central argument: you do not need recurrence or convolution. Attention is sufficient.

The sentence "Unhappiness is contagious" might tokenise as: ["Un", "happiness", " is", " cont", "agious"] → [1423, 8745, 374, 1487, 32891]. Each subword piece maps to an integer ID from the vocabulary. The IDs shown here are illustrative, and the exact split depends on the tokeniser. The vocabulary commonly contains 50,000 to 100,000 unique tokens, and some recent models use larger ones.
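
To make the text-to-IDs mapping concrete, here is a minimal sketch. It assumes a hypothetical five-entry vocabulary (`TOY_VOCAB`) built from the pieces and IDs in the example above, and it uses a simple greedy longest-match split. Real tokenisers usually learn their pieces with byte-pair encoding or a similar method, so treat this as an illustration of the lookup step only, not of how any production tokeniser works.

```python
# Hypothetical toy vocabulary: pieces and IDs copied from the example above.
TOY_VOCAB = {"Un": 1423, "happiness": 8745, " is": 374, " cont": 1487, "agious": 32891}

def toy_tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No piece in the toy vocabulary covers this position.
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

pieces = toy_tokenize("Unhappiness is contagious", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)  # ['Un', 'happiness', ' is', ' cont', 'agious']
print(ids)     # [1423, 8745, 374, 1487, 32891]
```

Note that the leading space is part of the piece (" is", " cont"), which is how many subword vocabularies keep word boundaries recoverable.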

Self-attention is "self" because each token attends to other tokens in the same sequence — the input attends to itself. This is different from cross-attention (used in encoder-decoder models), where one sequence attends to a different sequence.

python

import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """
    Minimal self-attention implementation in NumPy.

    X: token embeddings, shape (seq_len, d_model)
    W_Q, W_K, W_V: learned weight matrices, shape (d_model, d_k)
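    Returns: (output, weights), where output is the attended
    representations, shape (seq_len, d_k), and weights is the
    attention matrix, shape (seq_len, seq_len)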
    """
    d_k = W_K.shape[1]

    # Project embeddings into Q, K, V spaces
    Q = X @ W_Q  # (seq_len, d_k) — what each token is looking for
    K = X @ W_K  # (seq_len, d_k) — what each token contains
    V = X @ W_V  # (seq_len, d_k) — what each token contributes

    # Compute raw attention scores
    scores = Q @ K.T  # (seq_len, seq_len) — all pairwise relevances
    scores = scores / np.sqrt(d_k)  # scale so large dot products don't saturate the softmax

    # Normalise to get attention weights
    def softmax(x):
        exp = np.exp(x - x.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)

    weights = softmax(scores)  # (seq_len, seq_len) — rows sum to 1

    # Compute weighted sum of values
    output = weights @ V  # (seq_len, d_k)
    return output, weights

# Example: 4 tokens, 8-dimensional embeddings, 4-dimensional attention
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4

X = np.random.randn(seq_len, d_model)   # token embeddings
W_Q = np.random.randn(d_model, d_k)    # query projection
W_K = np.random.randn(d_model, d_k)    # key projection
W_V = np.random.randn(d_model, d_k)    # value projection

output, weights = self_attention(X, W_Q, W_K, W_V)

print("Attention weights (each row sums to 1.0):")
print(weights.round(3))
print("\nOutput shape:", output.shape)
# Each output token is now a blend of all input tokens,
# weighted by how much attention it paid to each one
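
For contrast with the self-attention function above, here is a rough sketch of cross-attention. The only structural change is where the inputs come from: queries are projected from one sequence, while keys and values are projected from another. The names `cross_attention`, `X_dec` and `X_enc` are placeholders chosen for this illustration, and the random matrices stand in for learned weights.

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """
    Illustrative cross-attention: X_dec attends to X_enc.

    X_dec: target-side embeddings, shape (tgt_len, d_model)
    X_enc: source-side embeddings, shape (src_len, d_model)
    Returns: (output, weights) with shapes (tgt_len, d_k) and (tgt_len, src_len)
    """
    d_k = W_K.shape[1]
    Q = X_dec @ W_Q  # queries come from the target sequence
    K = X_enc @ W_K  # keys come from the source sequence
    V = X_enc @ W_V  # values come from the source sequence
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# Example: 3 target tokens attending over 5 source tokens
rng = np.random.default_rng(0)
tgt_len, src_len, d_model, d_k = 3, 5, 8, 4

X_dec = rng.standard_normal((tgt_len, d_model))
X_enc = rng.standard_normal((src_len, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

output, weights = cross_attention(X_dec, X_enc, W_Q, W_K, W_V)
print("Weights shape:", weights.shape)  # (3, 5): one row per target token
print("Output shape:", output.shape)    # (3, 4)
```

The attention matrix is no longer square: it has one row per target token and one column per source token. Passing the same array as both `X_dec` and `X_enc` reduces this to the self-attention case.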

GPT-4 was reportedly trained on roughly 13 trillion tokens, though OpenAI has not confirmed the figure. Claude's training data size is not disclosed. Public estimates put the cost of training a frontier model at $50-100 million or more. This is why there are only a handful of true frontier models: the barrier to entry is enormous.
