A short rundown of five overlapping topics, plus a few first-party places to keep reading. Skim it; the sources are the real payload.
Predict the next token. Do it for a few trillion of them. Most of what we call "capability" falls out.
Pretraining is the foundation under everything else. Take a transformer with tens or hundreds of billions of parameters, initialize it randomly, and train it on a giant corpus — most of the readable web, plus curated code, books, math, and multilingual text — with a single objective: predict the next token. The loss is cross-entropy on that prediction; the optimizer is some variant of AdamW; the run takes weeks on tens of thousands of accelerators. What comes out is a base model: a probability distribution over continuations of any prefix. It is not a chatbot. It does not follow instructions, refuse requests, or know it is in a conversation. It just predicts.
This matters because nearly all the knowledge and reasoning capability of a frontier model is laid down here. Post-training (instruction tuning, RLHF, RLVR) shapes how those capabilities are expressed — tone, refusals, chain-of-thought style, agentic persistence — but adds comparatively little new world knowledge. The scaling laws (Kaplan 2020, then Chinchilla 2022) describe how loss falls predictably as you add parameters, data, and compute together. Chinchilla's revision — that compute-optimal models are smaller and trained on more tokens than was assumed — still drives data work at the frontier, where modern runs ingest on the order of 10–30 trillion tokens.
Learning behavior from reward signals, not labeled answers.
In RL, an agent takes actions in an environment, receives rewards, and adjusts a policy (the function from state → action) to maximize expected reward over time. Unlike supervised learning, there are no ground-truth labels — only feedback about how good a chosen trajectory was. The hard problems are credit assignment (which action led to the reward?), exploration vs. exploitation, and reward design.
Modern LLM training has made RL central again: after pretraining, models are fine-tuned with RLHF (reward learned from human preference comparisons) or RLAIF / RLVR (reward from another model or from verifiable signals like passing tests). Reasoning models (o-series, Claude with extended thinking, etc.) are largely the product of RL on long chains of thought.
Reverse-engineering neural networks into human-readable algorithms.
Mech interp treats a trained network as an artifact to be disassembled. Rather than asking "what does this model predict?", it asks "what computation is this circuit performing?" — identifying features (directions in activation space that correspond to interpretable concepts) and circuits (subgraphs of attention heads and MLPs that implement specific behaviors like induction, indirect object identification, or refusal).
The field's main recent breakthrough is sparse autoencoders (SAEs), which decompose a model's polysemantic neurons into a much larger dictionary of sparser, more monosemantic features. This has scaled from toy models to frontier models, and is starting to let researchers steer behavior by amplifying or clamping specific features.
RL changes what the model does; interp tries to see what it became.
RL fine-tuning (RLHF, RLVR, reasoning RL) is the main lever shaping how frontier models actually behave — refusals, helpfulness, chain-of-thought style, agentic persistence. But it's a blunt instrument: we train against a reward signal and hope the resulting policy generalizes the way we wanted, rather than learning a clever proxy. Mech interp is the natural counterpart — a way to look inside the post-RL model and check whether the behaviors we trained correspond to clean internal concepts ("the model represents 'honesty'") or to messy shortcuts ("the model represents 'sound confident')").
Concretely, this shows up in three places: (1) interpreting what RLHF does to features — does it suppress capabilities or just suppress their expression? (2) detecting reward hacking and deceptive alignment by looking for features the model "knows" but doesn't surface; (3) using interp findings to design better reward signals or to monitor a model's reasoning during RL training itself.
A model is a function; a harness is the program that calls it in a loop.
An LLM by itself is a single forward pass — text in, text out. A harness wraps the model in a loop that does several things: maintains conversation history, decides when to stop or continue, parses tool-call requests out of the model's output, executes those tools, feeds the results back as new messages, and manages context (compaction, summarization, file-based memory). Claude Code, Cursor, the Claude Agent SDK, and Claude.ai with computer-use are all examples — same underlying model, different harnesses.
Tools extend what the model can do beyond generating text. The model is trained to emit structured tool-use blocks (a JSON schema describing function calls); the harness intercepts these, runs the real function (a shell command, an API call, a file read), and returns the result. The model isn't actually executing code — it's asking the harness to, and reasoning over the response. MCP (Model Context Protocol) standardizes how third-party tool servers expose tools and resources to any compatible harness, so tools become portable across clients.
The capability surface of a deployed AI system is therefore model × harness × tools. Better tools and a smarter loop can make a weaker model competitive; a weak harness wastes a strong model. RL training increasingly happens inside realistic harnesses, so the model learns the loop it'll be deployed in.