Deep Dives · 13 min read · 25 May 2026
How Transformers Work (Plain English)
The transformer architecture underpins nearly every major large language model. Here's how it actually works. No PhD required, just curiosity.
The "Attention Is All You Need" paper by Vaswani et al. (2017) from Google did not just improve on RNNs — it replaced the sequential mechanism entirely. The name reflects the paper's central argument: you do not need recurrence or convolution. Attention is sufficient.
The sentence "Unhappiness is contagious" might tokenise as: ["Un", "happiness", " is", " cont", "agious"] → [1423, 8745, 374, 1487, 32891]. Each subword piece maps to an integer ID from the vocabulary; the IDs shown here are illustrative. A vocabulary typically contains 50,000 to 100,000 unique tokens, though some recent models use larger ones.
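The mapping from text to pieces to IDs in the example above can be sketched in a few lines of Python. This is a toy illustration only: `TOY_VOCAB` and `toy_tokenise` are hypothetical names, the vocabulary holds just the five pieces from the example, and the greedy longest-match rule is a stand-in for the learned merge rules that real subword tokenisers use.

```python
# Toy vocabulary: the pieces and IDs are the illustrative ones from the text,
# not taken from any real tokeniser.
TOY_VOCAB = {
    "Un": 1423,
    "happiness": 8745,
    " is": 374,
    " cont": 1487,
    "agious": 32891,
}


def toy_tokenise(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                pieces.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    # Look up the integer ID for each piece.
    return pieces, [vocab[p] for p in pieces]


pieces, ids = toy_tokenise("Unhappiness is contagious", TOY_VOCAB)
print(pieces)  # ['Un', 'happiness', ' is', ' cont', 'agious']
print(ids)     # [1423, 8745, 374, 1487, 32891]
```

Note that the leading space is part of the piece (" is", " cont"), which is how the original text can be reconstructed exactly by concatenating the pieces.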
Self-attention is "self" because each token attends to other tokens in the same sequence — the input attends to itself. This is different from cross-attention (used in encoder-decoder models), where one sequence attends to a different sequence.
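The distinction can be made concrete with a small NumPy sketch. It is a simplification that assumes no learned projection matrices, so the embeddings are used directly as queries and keys. The names `attend`, `source_seq`, and `target_seq` are hypothetical and the inputs are random. The only thing that changes between the two calls is where the keys come from.

```python
import numpy as np


def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability.
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def attend(query_source, key_source):
    """Attention weights without learned projections (a simplification).

    query_source: shape (n_queries, d)
    key_source:   shape (n_keys, d)
    Returns weights of shape (n_queries, n_keys); each row sums to 1.
    """
    d = query_source.shape[-1]
    scores = query_source @ key_source.T / np.sqrt(d)
    return softmax(scores)


rng = np.random.default_rng(0)
source_seq = rng.standard_normal((5, 8))  # hypothetical encoder-side sequence, 5 tokens
target_seq = rng.standard_normal((3, 8))  # hypothetical decoder-side sequence, 3 tokens

self_weights = attend(target_seq, target_seq)   # the sequence attends to itself
cross_weights = attend(target_seq, source_seq)  # one sequence attends to another

print(self_weights.shape)   # (3, 3)
print(cross_weights.shape)  # (3, 5)
```

Self-attention always yields a square weight matrix, because queries and keys come from the same tokens. Cross-attention yields one row per query token and one column per token in the other sequence.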
```python
import numpy as np


def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """
    Minimal self-attention implementation in NumPy.

    X: token embeddings, shape (seq_len, d_model)
    W_Q, W_K, W_V: learned weight matrices, shape (d_model, d_k)
    Returns: (output, weights), where output has shape (seq_len, d_k)
    and weights has shape (seq_len, seq_len)
    """
    d_k = W_K.shape[1]

    # Project embeddings into Q, K, V spaces
    Q = X @ W_Q  # (seq_len, d_k): what each token is looking for
    K = X @ W_K  # (seq_len, d_k): what each token contains
    V = X @ W_V  # (seq_len, d_k): what each token contributes

    # Compute raw attention scores
    scores = Q @ K.T  # (seq_len, seq_len): all pairwise relevances
    # Scale so large dot products do not saturate the softmax,
    # which would leave it with very small gradients
    scores = scores / np.sqrt(d_k)

    # Normalise to get attention weights
    weights = softmax(scores)  # (seq_len, seq_len): rows sum to 1

    # Compute weighted sum of values
    output = weights @ V  # (seq_len, d_k)
    return output, weights


# Example: 4 tokens, 8-dimensional embeddings, 4-dimensional attention
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4
X = np.random.randn(seq_len, d_model)  # token embeddings
W_Q = np.random.randn(d_model, d_k)    # query projection
W_K = np.random.randn(d_model, d_k)    # key projection
W_V = np.random.randn(d_model, d_k)    # value projection

output, weights = self_attention(X, W_Q, W_K, W_V)
print("Attention weights (each row sums to 1.0):")
print(weights.round(3))
print("\nOutput shape:", output.shape)
# Each output token is now a blend of all input tokens,
# weighted by how much attention it paid to each one
```

GPT-4 was reportedly trained on roughly 13 trillion tokens, a figure OpenAI has not confirmed. Claude's training data size is not disclosed. Training a frontier model is commonly estimated to cost $50-100 million in compute. This is why there are only a handful of true frontier models: the barrier to entry is enormous.
