What LLMs Actually Learn: A Statistical Perspective

statistics · AI · tech

The framing problem

Every time someone says a language model “knows” something, I wince a little. Not because it’s wrong exactly, but because it smuggles in a frame that leads us astray. “Knowing” implies a stable, retrievable fact. What these models have is closer to a heavily weighted prior over plausible continuations.

That distinction matters more than it might seem.


What the training objective actually is

At the most basic level, a transformer trained on next-token prediction is learning a function:

P(token_n | token_1, token_2, ..., token_{n-1})

That’s it. Everything else — the apparent reasoning, the code generation, the surprisingly good trivia — is an emergent consequence of optimising that objective over enough data.
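To make that concrete, here is a toy count-based version of the same objective in Python. It estimates the conditional distribution by counting continuations in a tiny invented corpus; a real transformer parameterises the same quantity with learned weights rather than a lookup table, but the target is identical.

from collections import Counter, defaultdict

# Toy version of the training objective: estimate
# P(token_n | token_1 ... token_{n-1}) by counting how often each
# token follows each context in a small, invented corpus.
corpus = (
    "the capital of france is paris . "
    "the capital of france is paris . "
    "the capital of nauru is yaren ."
).split()

CONTEXT = 4  # tokens of conditioning context; real models use far more
counts = defaultdict(Counter)
for i in range(CONTEXT, len(corpus)):
    counts[tuple(corpus[i - CONTEXT:i])][corpus[i]] += 1

def next_token_distribution(context):
    """Empirical P(next token | context) from raw counts."""
    c = counts[tuple(context)]
    if not c:
        return {}  # context never seen in training
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

print(next_token_distribution("capital of france is".split()))
# {'paris': 1.0} -- the "fact" is just concentrated probability mass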

“The model doesn’t know the capital of France. It has learned that the token sequence “the capital of France is” is very reliably followed by “Paris”.”

The distinction collapses in most practical cases, which is why it’s easy to miss. But it matters when the training distribution thins out — rare facts, recent events, niche domains. There the probability estimates get noisy, and the model still produces a confident-sounding completion.
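You can watch that failure in miniature. The sketch below draws n samples from an invented “true” continuation distribution and reports how confident the maximum-likelihood estimate looks; the tokens and probabilities are made up for illustration.

import random

# Simulation of a thin training distribution: the maximum-likelihood
# estimate from n samples is most confident exactly where data is scarce.
random.seed(0)
tokens = ["paris", "lyon", "marseille"]
weights = [0.6, 0.25, 0.15]  # the "true" continuation distribution

def mle_from_samples(n):
    """Estimate the next-token distribution from n observed continuations."""
    draws = random.choices(tokens, weights=weights, k=n)
    return {t: draws.count(t) / n for t in set(draws)}

for n in (1, 10, 10_000):
    est = mle_from_samples(n)
    top = max(est, key=est.get)
    print(f"n={n:>6}: top completion {top!r}, estimated p = {est[top]:.2f}")

# With n=1, the estimate puts p=1.00 on whichever token happened to be
# sampled: maximal confidence, built on the least evidence.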


Calibration and why it’s hard

A well-calibrated model would express confidence that tracks its actual accuracy: of the claims it asserts with 90% confidence, roughly 90% would be true. The trouble is that the training objective only rewards calibration at the token level (log-loss is a proper scoring rule), not in what the model says about its own certainty; verbalised hedges are just more tokens, predicted from how people write. A model that always says “I’m not sure” would be penalised even when that’s the right epistemic state.
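One standard way to quantify the gap is expected calibration error: bin the model’s stated confidences and compare average confidence to actual accuracy within each bin. A minimal sketch, on made-up numbers:

# Expected calibration error: bin stated confidences, then compare
# average confidence to actual accuracy within each bin. The inputs
# here are invented; in practice they come from an evaluation set.
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |avg confidence - accuracy| across bins."""
    total, ece = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(c, ok) for c, ok in zip(confidences, correct) if lo < c <= hi]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# A model that claims 90% confidence but is right half the time:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4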

This is why you see confident hallucinations. The model isn’t “lying” — it’s producing a high-probability completion given its training, and its training didn’t include many examples of the form “here is a confident-sounding claim that is actually false.”

RLHF and similar techniques help, but they introduce their own distortions. Models trained to seem helpful and confident tend to paper over uncertainty rather than expose it.


What this means in practice

A few things follow from taking this frame seriously:

  1. Use the model where the training distribution is dense. Common programming patterns, well-documented APIs, standard prose — these are exactly the domains where the probability estimates are reliable.

  2. Treat rare-domain outputs as drafts, not answers. Niche legal jurisdictions, obscure historical events, your proprietary codebase — the model is extrapolating, not retrieving.

  3. Contradiction is signal, not noise. If a model gives inconsistent answers across rephrases of the same question, that’s useful information about where the distribution is thin. A minimal probe is sketched after this list.
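Here is that probe as a sketch. query_model is a hypothetical stand-in for whatever API you actually call; the rephrasings are yours to write.

from collections import Counter

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for whatever model API you actually call.
    raise NotImplementedError

def agreement(rephrasings, samples_per_prompt=3):
    """Fraction of sampled answers that match the modal answer."""
    answers = [query_model(p) for p in rephrasings
               for _ in range(samples_per_prompt)]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

# Usage: low agreement across rephrasings of the same question is a
# hint that you are off the dense part of the distribution.
# agreement([
#     "What is the capital of Nauru?",
#     "Nauru's capital city is called what?",
#     "Which city is the capital of Nauru?",
# ])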


A useful mental model

Think of it as a very well-read person: someone who has read everything but remembers nothing verbatim, and who tends to sound confident regardless of actual certainty. You’d use their intuitions gladly for common problems. You’d double-check everything they say about your specific situation.

That’s roughly the right operating posture.

[Fig 1. A visualisation of token probability distributions across a sequence: probability mass across the vocabulary at each position. Most of it concentrates on a small set of plausible continuations.]


The interesting open question is whether scale alone can fix the calibration problem, or whether it requires architectural changes. My intuition is the latter — but that’s a post for another day.


// END OF TRANSMISSION