Understanding Large Language Models
What actually happens inside a model like Claude or GPT? A plain-English explanation of how LLMs work — no maths required.
Large language models are the technology behind Claude, GPT-4, Gemini, and most of the AI tools you use today. Understanding how they actually work — even at a rough level — makes you a dramatically better AI user. It explains why they fail in certain ways, and how to work around those failures.
What an LLM is doing, at the core
At the most basic level, a language model predicts what text should come next. Given the words "The capital of France is", it predicts that "Paris" comes next with very high probability. Repeating that single step, one small piece of text at a time, is how it writes whole sentences and paragraphs. The mechanism really is that simple, but at scale it produces remarkably sophisticated behaviour.
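To make "predicts what comes next" concrete, here is a minimal Python sketch. The probabilities are invented for illustration and do not come from any real model, and the names next_token_probs and predict_next are hypothetical.

```python
# Toy illustration of next-token prediction.
# These probabilities are made up for the example, not taken from a real model.
next_token_probs = {
    "Paris": 0.92,
    "Lyon": 0.03,
    "a": 0.03,
    "located": 0.02,
}


def predict_next(probs):
    """Return the candidate with the highest probability."""
    return max(probs, key=probs.get)


prompt = "The capital of France is"
completion = predict_next(next_token_probs)
print(prompt, completion)
```

A real model scores tens of thousands of candidates rather than four, and it often samples from the distribution instead of always taking the top choice, but the basic shape is the same.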
Tokens, not words
Models don't see words. They see tokens — chunks of characters that might be a word, part of a word, or punctuation. "Unbelievable" might be split into "Un", "believ", "able". This matters because the model's context window (how much it can read at once) is measured in tokens, not words. Roughly 1 token ≈ 0.75 words in English.
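The rule of thumb above can be turned into a quick estimate: if 1 token is about 0.75 words, then the token count is about the word count divided by 0.75. The sketch below assumes English text and simple whitespace splitting. The helper estimate_tokens is hypothetical, and a real tokenizer will give somewhat different numbers.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text):
    """Estimate the token count of English text from its word count."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


sample = "Models do not see words, they see tokens."
print(estimate_tokens(sample))
```

The eight-word sample comes out at an estimated 11 tokens. Treat figures like this as a budgeting aid, not an exact count.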
Training: learning from the internet
Before you ever touch it, an LLM was trained on a vast corpus of text: books, websites, code, academic papers. During training, the model repeatedly tried to predict what came next, compared its prediction to the real text, and adjusted its internal parameters to be more accurate next time. That cycle repeats billions of times over months of compute.
This is why LLMs know about history, can write code, and pick up on context: their training data contained vast numbers of examples of all three.
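The predict, compare, adjust cycle can be sketched in a few lines of Python. This is a deliberately tiny stand-in, assuming a three-word vocabulary and a single training example. The names vocab, logits and train_step are hypothetical, and a real LLM has billions of parameters rather than three.

```python
import math

vocab = ["Paris", "London", "Berlin"]
logits = [0.0, 0.0, 0.0]  # the adjustable parameters: one score per candidate
target = "Paris"          # the word that actually came next in the training text
learning_rate = 0.5


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(scores, target_index, lr):
    """Predict, compare with the real next word, and nudge the scores."""
    probs = softmax(scores)
    # Cross-entropy gradient per score: predicted probability minus
    # 1 for the correct word, or minus 0 for every other word.
    return [
        score - lr * (p - (1.0 if i == target_index else 0.0))
        for i, (score, p) in enumerate(zip(scores, probs))
    ]


target_index = vocab.index(target)
before = softmax(logits)[target_index]
for _ in range(20):
    logits = train_step(logits, target_index, learning_rate)
after = softmax(logits)[target_index]
print(f"P({target}) before: {before:.2f}, after: {after:.2f}")
```

The model starts with no preference (each word gets one third of the probability) and each step shifts probability toward the word that really appeared. Training a real model is this same idea applied across an enormous number of examples.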
Fine-tuning and alignment
After initial training, models go through a second phase where human raters score responses for helpfulness, harmlessness, and honesty. The model learns to produce responses that score well. This is why Claude feels like it's trying to be genuinely helpful rather than just predicting likely text — the alignment training shapes its behaviour toward useful responses.
What LLMs genuinely can't do
- They don't know what's happening right now (knowledge cutoff)
- They can't access the internet unless explicitly given that tool
- They can 'hallucinate' — confidently state false information — because they optimise for likely-sounding text, not verified truth
- They don't have persistent memory between conversations (unless built in separately)
- They can't count or do arithmetic reliably — they pattern-match numbers rather than compute
Context windows
A context window is how much text the model can hold in its 'working memory' at once. Claude's context window is very large, on the order of 200,000 tokens, or roughly 150,000 words by the rule of thumb above. That is enough to paste in an entire book and ask questions about it. But everything outside the context window is invisible to the model; it doesn't have long-term memory unless you build it in.
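One way to picture "invisible outside the window" is a chat application that drops the oldest messages once a token budget is used up. The sketch below is an assumption about how such trimming could work, not a description of how Claude or any particular product handles it. The helpers estimate_tokens and fit_to_window are hypothetical, and the 20-token budget is artificially small so the effect is visible.

```python
import math


def estimate_tokens(text):
    """Rough token estimate using the 1 token = 0.75 words rule of thumb."""
    return math.ceil(len(text.split()) / 0.75)


def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are considered first.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "My name is Sam and I am planning a trip to Lisbon.",
    "What should I pack for a week in spring?",
    "Also, what was my name again?",
]
visible = fit_to_window(history, max_tokens=20)
print(visible)
```

With this budget only the last two messages fit, so the message containing the name never reaches the model. From the model's side that information simply does not exist.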
Why this matters for how you use AI
Knowing that models predict likely text explains why they hallucinate — the most likely-sounding answer isn't always the true one. It explains why giving more context improves output — you're giving the model better signal about what 'likely' should mean in your situation. And it explains why they're so good at creative and linguistic tasks but less reliable for pure fact retrieval.
