If you've spent any time reading ML papers or staring at training logs, you've probably seen this term thrown around constantly: "Loss went down from 3.2 to 2.8 nats." It sounds impressive, and everyone nods along as if they know exactly what it means, but honestly, I was just pretending to understand it for the longest time. Eventually, I got tired of that and decided to actually figure out what NATs are and why they matter, so let me break it down the way I wish someone had explained it to me.
What Even Are NATs?
"Nat" is short for natural unit of information, and you'll occasionally see it written as "nit" in older texts, which admittedly makes the naming situation a bit confusing. Essentially, it's the same concept as bits, but instead of being measured with a base-2 logarithm like bits are, it's measured using the natural logarithm (ln), so it's really just a different scale for measuring the same underlying thing.
When people in the LLM world talk about "nats," they're almost always referring to the negative log-likelihood loss, which is basically the cross-entropy loss calculated using natural logarithms instead of base-2 logarithms. Once you understand that connection, the whole concept becomes much less mysterious.
The Core Idea
During training, what's actually happening is that your LLM is constantly trying to predict the next token in a sequence. For each position in the text, the model outputs a probability distribution over its entire vocabulary, which could be something like 50,000 tokens or whatever size your vocabulary happens to be. The loss calculation is actually quite straightforward once you see it written out:
Loss = -ln(P(correct token))
where:
P(correct token) = probability the model assigned to the actual next token
ln = natural logarithm (base e ≈ 2.718)
Let's walk through some concrete examples to make this tangible:
Example 1: High Confidence (Good Prediction)
Model's prediction: P("cat") = 0.9
Actual next word: "cat"
Loss = -ln(0.9)
= -(-0.105)
= 0.105 nats
Interpretation: Low loss! Model was confident and correct.
Example 2: Low Confidence (Poor Prediction)
Model's prediction: P("elephant") = 0.01
Actual next word: "elephant"
Loss = -ln(0.01)
= -(-4.605)
= 4.605 nats
Interpretation: High loss! Model was surprised by the correct answer.
Example 3: Near Perfect
Model's prediction: P("the") = 0.99
Actual next word: "the"
Loss = -ln(0.99)
= 0.010 nats
Interpretation: Almost zero loss. Model was nearly certain.
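Each of the three examples can be verified in a few lines with Python's math module:

```python
import math

# Loss = -ln(P(correct token)), checked against the three examples above
for p in (0.9, 0.01, 0.99):
    print(f"P = {p}: loss = {-math.log(p):.3f} nats")
# P = 0.9: loss = 0.105 nats
# P = 0.01: loss = 4.605 nats
# P = 0.99: loss = 0.010 nats
```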
The fundamental principle is simple: lower nats mean better predictions, and that's really all you need to remember at a high level.
Weather Forecaster Analogy
I find it helpful to think about this through an analogy that makes the intuition clearer. Imagine you're a weather forecaster, and every single day you're being graded based on your confidence levels about whether it's going to rain or not. Each morning, you make your announcement with some degree of certainty, like "I'm 80% sure it'll rain today," and then reality unfolds and it either rains or it doesn't.
Your score, measured in nats, depends entirely on how much confidence you gave to what actually ended up happening. Let's work through a week of forecasts to see how this plays out:
Day 1 - You predict: 80% rain, Reality: It rains ✓
Score = -ln(0.80) = 0.223 nats (good!)
Day 2 - You predict: 10% rain, Reality: It rains ✓
Score = -ln(0.10) = 2.303 nats (terrible! caught off guard)
Day 3 - You predict: 99% rain, Reality: It rains ✓
Score = -ln(0.99) = 0.010 nats (nearly perfect!)
Day 4 - You predict: 60% rain, Reality: It rains ✓
Score = -ln(0.60) = 0.511 nats (decent)
Day 5 - You predict: 95% rain, Reality: It rains ✓
Score = -ln(0.95) = 0.051 nats (excellent)
Average score over 5 days:
(0.223 + 2.303 + 0.010 + 0.511 + 0.051) / 5 = 0.620 nats
Your perplexity: e^0.620 ≈ 1.86
(Meaning you're effectively choosing between ~2 outcomes, which makes sense for binary weather prediction!)
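That week of bookkeeping is easy to reproduce in a few lines of Python:

```python
import math

# confidence assigned to what actually happened, one value per day
daily_scores = [-math.log(p) for p in (0.80, 0.10, 0.99, 0.60, 0.95)]
avg = sum(daily_scores) / len(daily_scores)
perplexity = math.exp(avg)
print(f"average loss = {avg:.3f} nats, perplexity = {perplexity:.2f}")
# average loss = 0.620 nats, perplexity = 1.86
```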
Your goal as a forecaster, accumulated over thousands of days of predictions, is to get your average nats as low as possible. That's exactly what an LLM is doing during training, except instead of thousands of weather predictions, it's making billions of token predictions across massive amounts of text.
Why Natural Logarithm Though?
There's actually a pretty practical reason why we use the natural logarithm instead of other bases, and it comes down to the mathematics of neural network optimization. Since the whole training process relies on gradient descent and backpropagation, which are fundamentally calculus-based operations, using the natural logarithm makes the math work out much more elegantly.
Here's why mathematically:
For f(x) = ln(x):
f'(x) = 1/x
This is incredibly clean for gradient computation!
Compare to log₂(x):
f'(x) = 1/(x · ln(2))
The extra ln(2) constant makes every gradient calculation slightly messier.
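Both derivative claims are easy to check numerically with a central finite difference; a quick sketch:

```python
import math

def finite_diff(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.5
d_ln = finite_diff(math.log, x)     # ≈ 1/x = 2.0
d_log2 = finite_diff(math.log2, x)  # ≈ 1/(x · ln 2) ≈ 2.885
print(d_ln, d_log2)
```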
The relationship between nats and bits is also straightforward to convert:
1 nat = 1/ln(2) bits ≈ 1.44 bits
Conversion formula:
bits = nats × 1.44
nats = bits × 0.693
Example:
3.0 nats = 3.0 × 1.44 = 4.32 bits
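In code, the conversion is a one-liner each way (using the exact factor 1/ln 2 rather than the rounded 1.44):

```python
import math

LN2 = math.log(2)  # ≈ 0.693

def nats_to_bits(nats: float) -> float:
    return nats / LN2

def bits_to_nats(bits: float) -> float:
    return bits * LN2

print(nats_to_bits(3.0))  # ≈ 4.33 bits (4.32 above comes from the rounded 1.44)
```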
So they're really just measuring the same underlying concept on different scales, kind of like how Celsius and Fahrenheit both measure temperature but use different reference points. The choice to use nats is purely about computational convenience during training.
What Do Nats Mean at Scale?
To get a sense of what these numbers actually mean in practice, let's consider an LLM with a vocabulary of 50,000 tokens, which is pretty typical. Here's how the math works out across different levels of model quality:
Random Baseline (No Learning)
```python
import math

vocab_size = 50_000
random_probability = 1 / vocab_size
loss = -math.log(random_probability)
# loss ≈ 10.82 nats per token

perplexity = math.exp(loss)
# perplexity = 50,000 (choosing randomly from all words!)
```

This represents your worst meaningful baseline, the point where the model has learned absolutely nothing and is just throwing darts at a massive vocabulary board.
Well-Trained Model
```python
import math

# Typical loss for a good model
vocab_size = 50_000
average_loss = 2.5  # nats per token

effective_choices = math.exp(average_loss)
# effective_choices ≈ 12.18
# The model has narrowed down from 50,000 options to ~12!

reduction_factor = vocab_size / effective_choices
# reduction_factor ≈ 4,100x better than random
```

A well-trained model typically achieves somewhere around 2.0 to 3.5 nats per token on normal text, which means:
| Loss (nats) | Effective Choices | Reduction from Random |
|---|---|---|
| 2.0 | ~7 words | 7,100x better |
| 2.5 | ~12 words | 4,100x better |
| 3.0 | ~20 words | 2,500x better |
| 3.5 | ~33 words | 1,500x better |
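The "effective choices" column is just e^loss, so the table can be regenerated for any vocabulary size; a small sketch (the table above rounds the choice count before computing the reduction, so its reduction figures differ slightly from the exact values printed here):

```python
import math

vocab_size = 50_000
for loss in (2.0, 2.5, 3.0, 3.5):
    choices = math.exp(loss)
    reduction = vocab_size / choices
    print(f"{loss:.1f} nats -> ~{choices:.0f} effective choices, "
          f"~{reduction:,.0f}x better than random")
```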
That's a dramatic improvement and shows the model has genuinely learned the structure and patterns of language.
Near perfection, which would be approaching 0 nats, would mean the model almost always knows the next token with near-certainty, but that's actually impossible for natural language because language itself is inherently unpredictable in many contexts, and there's genuine ambiguity and creativity in how people write and speak.
The "Effective Choices" Intuition
This concept connects directly to something called perplexity, which gives you an intuitive way to understand what those NATs numbers really mean. The formula is beautifully simple:
Perplexity = e^(average loss in nats)
where:
e ≈ 2.718 (Euler's number)
average loss = mean NATs across all predictions
You can think of perplexity as answering the question: "On average, how many equally-likely words is the model choosing between when it makes a prediction?"
Let's calculate this for different loss levels:
Low Loss (Good Model)

```python
import math

average_loss = 1.0  # nats
perplexity = math.exp(average_loss)
# perplexity ≈ 2.72
# Interpretation: Like choosing between ~3 words
```

Medium Loss (Decent Model)

```python
average_loss = 2.0  # nats
perplexity = math.exp(average_loss)
# perplexity ≈ 7.39
# Interpretation: Like choosing between ~7 words
```

Higher Loss (Struggling Model)

```python
average_loss = 3.0  # nats
perplexity = math.exp(average_loss)
# perplexity ≈ 20.09
# Interpretation: Like choosing between ~20 words
```

Very High Loss (Poor Model)

```python
average_loss = 4.0  # nats
perplexity = math.exp(average_loss)
# perplexity ≈ 54.60
# Interpretation: Model is really struggling, too many options
```

Here's a quick reference table:
| Average Loss | Perplexity | What This Feels Like |
|---|---|---|
| 1.0 nats | ~2.7 | Choosing between ~3 words |
| 2.0 nats | ~7.4 | Choosing between ~7 words |
| 3.0 nats | ~20 | Choosing between ~20 words |
| 4.0 nats | ~55 | Struggling, many options |
Detective Analogy
Another way I like to think about this is to imagine the LLM as a detective reading through a story one word at a time, trying to guess what comes next before flipping the page to see the actual word. The quality of the detective's guess depends entirely on how much context they have available.
When the context is strong, like reading "The cat sat on the ___," the detective can be very confident in their predictions, narrowing it down to words like "mat," "floor," or "chair," and the loss ends up being low because the context has dramatically constrained the possibilities. But when the context is weak, like just reading "The ___," the next word could be almost anything in the entire vocabulary, so the loss is high because there's genuine uncertainty and not much to go on.
Sometimes a longer context carries a strong idiomatic signal, like "I hereby declare this meeting ___," where the phrasing points almost inevitably toward words like "adjourned" or "open," and again the loss is low because the surrounding context provides enough clues to make a confident prediction. Training is fundamentally the process of making this detective smarter over time, teaching it to better use context clues, grammatical patterns, world knowledge, and statistical regularities to reduce the average surprise it experiences across millions and millions of sentences.
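You can mimic strong versus weak context with two toy distributions; the probabilities here are invented purely for illustration:

```python
import math

def loss_for(dist, correct):
    # negative log-likelihood of the token that actually appeared
    return -math.log(dist[correct])

# strong context: "The cat sat on the ___" narrows things down a lot
strong = {"mat": 0.55, "floor": 0.20, "chair": 0.15, "rug": 0.10}

# weak context: "The ___" — probability spread thin over ~1000 plausible words
weak_prob = 1 / 1000

print(loss_for(strong, "mat"))  # ≈ 0.60 nats
print(-math.log(weak_prob))     # ≈ 6.91 nats
```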
Why This Actually Matters
When researchers publish a paper saying "we reduced loss from 3.2 to 2.8 nats," that difference of 0.4 nats might not sound like much at first, but let's actually calculate what that means:
```python
import math

# Before: baseline model
old_loss = 3.2
old_perplexity = math.exp(old_loss)
# old_perplexity ≈ 24.5 words

# After: improved model
new_loss = 2.8
new_perplexity = math.exp(new_loss)
# new_perplexity ≈ 16.4 words

# Improvement
reduction = old_perplexity - new_perplexity
# reduction ≈ 8.1 fewer "effective choices"

percentage_improvement = (reduction / old_perplexity) * 100
# percentage_improvement ≈ 33% reduction in uncertainty!
```

So that 0.4 nat improvement means the model went from effectively choosing among around 25 plausible words to choosing among only 16. That's a 33% reduction in uncertainty, which translates directly into noticeably more coherent, more accurate, and generally higher-quality text generation. You can usually see the difference clearly when you actually use the model.
Here's a real-world example of what this looks like:
Prompt: "The capital of France is"
Model at 3.2 nats (perplexity ~24):
Top predictions: Paris (35%), Lyon (8%), France (6%), located (4%), ...
[spreading probability across ~24 plausible options]
Model at 2.8 nats (perplexity ~16):
Top predictions: Paris (68%), the (5%), located (3%), ...
[concentrating probability on fewer, better choices]
The better model is much more confident in the correct answer ("Paris") and wastes less probability mass on unlikely alternatives.
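Using the illustrative probabilities above, the single-token loss on "Paris" for each model is just the negative log of the probability it assigned:

```python
import math

loss_worse = -math.log(0.35)   # model at ~3.2 nats average
loss_better = -math.log(0.68)  # model at ~2.8 nats average
print(f"{loss_worse:.2f} nats vs {loss_better:.2f} nats")
# 1.05 nats vs 0.39 nats
```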
How NATs Drive Real Training Decisions
Understanding NATs isn't just academic curiosity; it's actually central to how major labs decide whether to train a new model from scratch or continue training an existing one. The loss curve measured in nats tells researchers whether their training strategy is working, and that decision has massive implications for compute budgets and timelines.
When you're looking at something like Llama 4 or GPT-5, those major version jumps almost always involve training from scratch because the architecture itself has fundamentally changed. You can't take a dense transformer model and "continue train" it into a mixture-of-experts architecture; the underlying structure is completely different, so you're starting fresh with a new pretraining run that might consume 20 to 40 trillion tokens. The NATs loss during those runs is what tells the team whether their new architecture is actually learning better than the previous generation.
But here's where it gets interesting. Between those major releases, you see point releases like DeepSeek V3.1 or V3.2, and those are typically using continued pretraining on targeted datasets rather than starting over. The trick is avoiding catastrophic forgetting, where training on new data causes the model to lose what it previously learned, and researchers monitor this by watching whether the NATs loss on held-out evaluation sets starts climbing back up. If your loss starts increasing on the old data while you're training on new data, that's a red flag that the model is forgetting, and you might need to mix in more of the original training data to maintain stability.
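A minimal sketch of that forgetting check, with made-up loss numbers; a real pipeline would evaluate actual held-out sets at each checkpoint:

```python
# Hypothetical monitoring sketch: flag catastrophic forgetting when loss on
# the original data's held-out set climbs during continued pretraining.
baseline_old_domain_loss = 2.60  # nats, measured before continued pretraining
tolerance = 0.05                 # drift we tolerate before intervening

# invented old-domain eval losses at successive checkpoints
old_domain_losses = [2.61, 2.63, 2.68, 2.74]

flagged = [
    step
    for step, loss in enumerate(old_domain_losses, start=1)
    if loss > baseline_old_domain_loss + tolerance
]
for step in flagged:
    print(f"checkpoint {step}: old-domain loss rising -> mix in more original data")
```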
There's also been a fascinating shift in how compute budgets get allocated. Recent models like DeepSeek R1 show that you can do a solid but not maximal pretraining run to get your base NATs loss down to a reasonable level, and then invest heavily in post-training techniques like reinforcement learning to squeeze out additional capabilities. DeepSeek V3's pretraining cost about five million dollars and got the model to a competitive NATs loss, and then they spent roughly another million on RL training on top of that base to create R1, which dramatically improved reasoning abilities without requiring another massive pretraining run.
The analogy I like is renovation versus rebuilding. Training from scratch is like demolishing a house and building a completely new structure with a different floor plan and modern materials; you do this when you want something fundamentally different and you're willing to pay the full cost. Continued pretraining is more like a major renovation where the foundation stays but you update specific rooms and systems; it's much cheaper, but you're constrained by the original structure. And post-training methods like RLHF are like interior decorating: the house is built and now you're making it beautiful and functional for the people who'll actually live there.
What we're seeing in 2025 and 2026 is a layered approach where labs do a from-scratch pretraining run every 12 to 18 months when they have a genuinely new architecture or scale, but between those runs they're doing targeted continued pretraining to add specific capabilities like new languages or longer context windows, and then they're investing increasingly sophisticated effort into post-training to maximize the quality of the final model. The economics have shifted because novel training techniques have shown you can reduce pretraining compute by roughly 10x while getting similar performance if you invest more intelligently in the post-training phase.
The bottom line is that NATs loss is the fundamental metric that guides all of these decisions. It tells you whether your pretraining is converging properly, whether your continued training is causing catastrophic forgetting, and whether your post-training improvements are actually making the model better at real-world tasks. Without understanding what NATs measure and how to interpret them, you can't make informed decisions about where to invest your compute budget, and given that these training runs can cost millions of dollars, getting that decision wrong is extremely expensive.
TL;DR
At the end of the day, nats are just a way of measuring how surprised the model is, on average, by what actually appears in the text. Training is the process of systematically making the model less and less surprised as it encounters more data and learns better patterns. Lower nats mean the model is more confident and more accurate in its predictions, which ultimately means better performance on whatever task you're trying to accomplish.
That's really all there is to it. It's not nearly as scary or mysterious as it first appears once you break it down and understand what's actually being measured.
Written because I was tired of pretending I understood this when reading papers. Now I actually do.