A Quick Primer

RL, mech interp, and the harnesses around the model

A short rundown of five overlapping topics, plus a few first-party places to keep reading. Skim it; the sources are the real payload.

01 Pretraining

Predict the next token. Do it for a few trillion of them. Most of what we call "capability" falls out.

Pretraining is the foundation under everything else. Take a transformer with tens or hundreds of billions of parameters, initialize it randomly, and train it on a giant corpus — most of the readable web, plus curated code, books, math, and multilingual text — with a single objective: predict the next token. The loss is cross-entropy on that prediction; the optimizer is some variant of AdamW; the run takes weeks on tens of thousands of accelerators. What comes out is a base model: a probability distribution over continuations of any prefix. It is not a chatbot. It does not follow instructions, refuse requests, or know it is in a conversation. It just predicts.

This matters because nearly all the knowledge and reasoning capability of a frontier model is laid down here. Post-training (instruction tuning, RLHF, RLVR) shapes how those capabilities are expressed — tone, refusals, chain-of-thought style, agentic persistence — but adds comparatively little new world knowledge. The scaling laws (Kaplan 2020, then Chinchilla 2022) describe how loss falls predictably as you add parameters, data, and compute together. Chinchilla's revision — that compute-optimal models are smaller and trained on more tokens than was assumed — still drives data work at the frontier, where modern runs ingest on the order of 10–30 trillion tokens.

First-party sources

Brown et al. — Language Models are Few-Shot Learners (GPT-3) The paper that made "just pretrain at scale" the dominant paradigm.
Kaplan et al. — Scaling Laws for Neural Language Models Loss as a smooth function of params, data, and compute. The original scaling-laws paper.
Hoffmann et al. — Training Compute-Optimal LLMs (Chinchilla) Corrected the params/data tradeoff; current data-collection budgets still live in its shadow.
Karpathy — Neural Networks: Zero to Hero / nanoGPT Builds a small GPT from scratch in code. The fastest way to make pretraining concrete.
Anthropic — Claude model announcements & cards Real-world detail on what pretraining + post-training looks like at one frontier lab.

02 Reinforcement Learning

Learning behavior from reward signals, not labeled answers.

In RL, an agent takes actions in an environment, receives rewards, and adjusts a policy (the function from state → action) to maximize expected reward over time. Unlike supervised learning, there are no ground-truth labels — only feedback about how good a chosen trajectory was. The hard problems are credit assignment (which action led to the reward?), exploration vs. exploitation, and reward design.

Modern LLM training has made RL central again: after pretraining, models are fine-tuned with RLHF (reward learned from human preference comparisons) or RLAIF / RLVR (reward from another model or from verifiable signals like passing tests). Reasoning models (o-series, Claude with extended thinking, etc.) are largely the product of RL on long chains of thought.

First-party sources

Sutton & Barto — Reinforcement Learning: An Introduction The textbook. Free PDF. Start with chapters 1–6.
OpenAI Spinning Up in Deep RL Hands-on intro by Josh Achiam — equations, code, and a reading list.
David Silver's RL course (DeepMind / UCL) Lecture videos + slides from the AlphaGo lead.
Schulman et al. — Proximal Policy Optimization PPO; still the default policy-gradient algorithm in LLM RL.
Ouyang et al. — InstructGPT / RLHF The paper that put RLHF in the spotlight for language models.

03 Mechanistic Interpretability

Reverse-engineering neural networks into human-readable algorithms.

Mech interp treats a trained network as an artifact to be disassembled. Rather than asking "what does this model predict?", it asks "what computation is this circuit performing?" — identifying features (directions in activation space that correspond to interpretable concepts) and circuits (subgraphs of attention heads and MLPs that implement specific behaviors like induction, indirect object identification, or refusal).

The field's main recent breakthrough is sparse autoencoders (SAEs), which decompose a model's polysemantic neurons into a much larger dictionary of sparser, more monosemantic features. This has scaled from toy models to frontier models, and is starting to let researchers steer behavior by amplifying or clamping specific features.

First-party sources

Olah et al. — Zoom In: An Introduction to Circuits The original Distill primer. Read this first.
transformer-circuits.pub (Anthropic) The ongoing series: "A Mathematical Framework for Transformer Circuits," "Toy Models of Superposition," monthly updates.
Templeton et al. — Scaling Monosemanticity SAEs applied to Claude 3 Sonnet — the "Golden Gate Bridge feature" paper.
Anthropic — On the Biology of a Large Language Model 2025 attribution-graph work tracing multi-step reasoning inside Claude.
Neel Nanda's mech interp resources Tutorials, the TransformerLens library, "200 Concrete Open Problems."

04 Where RL and mech interp meet

RL changes what the model does; interp tries to see what it became.

RL fine-tuning (RLHF, RLVR, reasoning RL) is the main lever shaping how frontier models actually behave — refusals, helpfulness, chain-of-thought style, agentic persistence. But it's a blunt instrument: we train against a reward signal and hope the resulting policy generalizes the way we wanted, rather than learning a clever proxy. Mech interp is the natural counterpart — a way to look inside the post-RL model and check whether the behaviors we trained correspond to clean internal concepts ("the model represents 'honesty'") or to messy shortcuts ("the model represents 'sound confident')").

Concretely, this shows up in three places: (1) interpreting what RLHF does to features — does it suppress capabilities or just suppress their expression? (2) detecting reward hacking and deceptive alignment by looking for features the model "knows" but doesn't surface; (3) using interp findings to design better reward signals or to monitor a model's reasoning during RL training itself.

First-party sources

Anthropic — Alignment Faking in Large Language Models A concrete case where interp helps explain what RL did and didn't change.
Hubinger et al. — Sleeper Agents Backdoor behaviors that survive RLHF; interp probes them.
Christiano et al. — Deep RL from Human Preferences The foundational RLHF paper; useful baseline before reading interp critiques.
Anthropic — Core Views on AI Safety Frames why the lab pairs RL/post-training with interpretability.

05 How harnesses and tools extend the model

A model is a function; a harness is the program that calls it in a loop.

An LLM by itself is a single forward pass — text in, text out. A harness wraps the model in a loop that does several things: maintains conversation history, decides when to stop or continue, parses tool-call requests out of the model's output, executes those tools, feeds the results back as new messages, and manages context (compaction, summarization, file-based memory). Claude Code, Cursor, the Claude Agent SDK, and Claude.ai with computer-use are all examples — same underlying model, different harnesses.

Tools extend what the model can do beyond generating text. The model is trained to emit structured tool-use blocks (a JSON schema describing function calls); the harness intercepts these, runs the real function (a shell command, an API call, a file read), and returns the result. The model isn't actually executing code — it's asking the harness to, and reasoning over the response. MCP (Model Context Protocol) standardizes how third-party tool servers expose tools and resources to any compatible harness, so tools become portable across clients.

The capability surface of a deployed AI system is therefore model × harness × tools. Better tools and a smarter loop can make a weaker model competitive; a weak harness wastes a strong model. RL training increasingly happens inside realistic harnesses, so the model learns the loop it'll be deployed in.

First-party sources

Anthropic — Building Effective Agents The canonical writeup of workflow vs. agent patterns, from the team that built Claude Code.
Anthropic — Tool Use docs How tool schemas, tool_use blocks, and the request/response loop actually work.
Model Context Protocol (MCP) Open spec for tool/resource servers; reference implementations and SDKs.
Claude Agent SDK The harness Anthropic ships — read this if you want to see one in code.
Yao et al. — ReAct: Reasoning + Acting Early academic articulation of the think-act-observe loop most harnesses still use.