NeMo RL — Learning Notes

What is Post-Training?

When we train an LLM from scratch (pretraining), we teach it to predict the next token from massive amounts of text. The model gets good at sounding like human text — but sounding correct is not the same as being correct or helpful.

Post-training is a second phase that starts after pretraining. It takes a model that can already generate text and teaches it to behave correctly for a specific task. NeMo RL is NVIDIA’s framework for this second phase.

The sequence is always:

Pretraining → the model learns language from a massive text corpus
Post-training → the model learns to behave correctly using rewards

You cannot do RL before pretraining because RL needs the model to generate answers first, grade them, and then update based on that grade. If the model cannot generate anything meaningful yet, there is nothing to grade.

Fine-Tuning vs Reinforcement Learning

Both fine-tuning and RL update the model weights — but the signal that drives the update is completely different.

In fine-tuning (SFT — Supervised Fine-Tuning), you show the model the correct answer directly. The model sees input-output pairs and learns to copy the correct output. The signal is: here is the right answer, learn it.

In RL, you never show the model the correct answer. Instead the model generates many answers, each answer gets a reward score, and the model learns to generate more answers like the high-reward ones and fewer like the low-reward ones. The signal is: try something, I will tell you if it was good or bad.

Why RL Instead of Fine-Tuning?

Fine-tuning requires correct answers. For many tasks you do not have correct answers — you only know if an answer was good or bad.

For math you have correct answers, so you could fine-tune. But for tasks like writing a helpful response, being creative, or following user preferences, there is no single correct answer to copy. RL fits these tasks because it only needs a reward signal, not a correct answer.

How Do You Reward Subjective Tasks?

For tasks with clear correct answers like math, the reward is simple — correct gets 1, wrong gets 0.
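A minimal sketch of such a binary reward, assuming the final answer has already been extracted from the model output as a string. The function name `binary_math_reward` and the matching rules are illustrative assumptions, not NeMo RL's API.

```python
def binary_math_reward(predicted: str, ground_truth: str) -> float:
    """Return 1.0 if the predicted answer matches the ground truth, else 0.0.

    Hypothetical helper: compares numerically when both strings parse as
    numbers, otherwise falls back to a whitespace-insensitive string match.
    """
    try:
        return 1.0 if abs(float(predicted) - float(ground_truth)) < 1e-9 else 0.0
    except ValueError:
        # At least one side is not a plain number, so compare as text.
        return 1.0 if predicted.strip() == ground_truth.strip() else 0.0
```

For example, `binary_math_reward("12", "12.0")` gives 1.0 and `binary_math_reward("14", "12")` gives 0.0.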

For subjective tasks like creativity, there are three approaches.

Human feedback: a human reads the output and gives a score, or ranks several outputs against each other. This is the basis of RLHF (Reinforcement Learning from Human Feedback), the approach used to train ChatGPT. The problem is that human grading is expensive and slow, so RLHF pipelines typically use the human ratings to train a reward model rather than having people grade every training sample.

Reward model: first collect human ratings on many examples, then train a separate model to predict what a human would rate. This gives you a cheap automatic grader that approximates human judgment at scale.

Rule-based proxy rewards: reward measurable properties like vocabulary diversity, length, or sentiment. These are imperfect but cheap and automatic.
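As an illustration of a rule-based proxy, here is a small sketch of a vocabulary diversity reward. The function name and the choice of "distinct words divided by total words" are assumptions made for this example.

```python
def vocabulary_diversity_reward(text: str) -> float:
    """Hypothetical proxy reward: fraction of distinct words in the text (0.0 to 1.0)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

A proxy like this is easy to game (a string of unrelated rare words scores 1.0), which is what "imperfect" means in practice.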

RL is easiest to get working on tasks with clear correct answers, because the reward is cheap to compute and leaves little room for disagreement. This is why NeMo RL uses math as its default example: the reward is unambiguous.

GRPO — Group Relative Policy Optimization

GRPO is the algorithm NeMo RL uses to train the model with reinforcement learning. Instead of generating one answer per question, GRPO generates many answers for the same question — 16 by default in the NeMo RL math example.

For each question, the 16 answers get graded by the environment and receive rewards. GRPO then computes the group average reward and subtracts it from each individual reward to get an advantage:

advantage = reward - group_average

Answers that did better than the group average get a positive advantage. Answers that did worse get a negative advantage. The model then learns to increase the probability of tokens from above-average answers and decrease the probability of tokens from below-average answers.
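The advantage formula can be sketched in a few lines. This follows the mean-only formula given above; some GRPO variants additionally divide by the group standard deviation, which is not shown here. The function name is illustrative.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean reward from each reward in one prompt's group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
print(advantages)  # [0.5, -0.5, 0.5, -0.5]
```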

Why Not Simply Reward Each Answer Directly?

If you reward each answer directly without comparing to a group, you run into the credit assignment problem — the model cannot tell which tokens caused the success or failure.

Consider this correct answer:

“The square root of 144 is 12”

Every token gets reward 1. But which token actually mattered? “The”? “square”? “root”? “12”? From this one answer the model has no way to know that “12” was the critical token. Everything gets equal credit, so a single graded answer says very little about which tokens to change.

GRPO eases this by generating many answers and comparing them as a group. Now patterns across answers become visible:

“The answer is 12” → reward 1
“The answer is 14” → reward 0
“The result is 12” → reward 1
“The result is 11” → reward 0

Comparing these answers shows what varies between correct and incorrect ones: it is always the final number. GRPO does not single out “12” inside one answer. Every token in an answer receives that answer's advantage. But tokens shared by correct and incorrect answers, such as “The” and “is”, are pushed up by some answers and down by others, so their updates largely cancel. “12” appears only in high-reward answers, and “14” and “11” appear only in low-reward answers, so those tokens receive a consistent signal. Credit assignment emerges from the group comparison, not from any per-token reward.
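To make the group comparison concrete, the sketch below adds up, for each word, the advantages of the answers it appears in. This is only a word-level intuition under a strong simplification: real updates depend on token probabilities and context, not on a plain sum. The function name is hypothetical.

```python
from collections import defaultdict

def net_token_push(answers: list[str], rewards: list[float]) -> dict[str, float]:
    """Sum the answer-level advantage over every occurrence of each word."""
    mean = sum(rewards) / len(rewards)
    push: dict[str, float] = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        advantage = reward - mean  # same advantage for every word in this answer
        for token in answer.split():
            push[token] += advantage
    return dict(push)

answers = ["The answer is 12", "The answer is 14",
           "The result is 12", "The result is 11"]
rewards = [1.0, 0.0, 1.0, 0.0]
push = net_token_push(answers, rewards)
```

Here `push["The"]` and `push["is"]` come out as 0.0 because the positive and negative advantages cancel, while `push["12"]` is 1.0 and `push["14"]` and `push["11"]` are each -0.5.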

There is also a second problem with using raw rewards: they ignore how good the model already is. With raw rewards, a model that gets 15 of 16 answers right is pushed just as hard on each correct answer as a model that gets only 1 of 16 right. After subtracting the group average, a rare success gets a large positive advantage and a routine one gets a small advantage, so the signal becomes relative to the current ability of the model. The flip side is that GRPO has nothing to learn from when every answer in a group gets the same reward. If all 16 answers are correct, or all 16 are wrong, every advantage is 0 and that prompt contributes no gradient. GRPO learns best in the middle zone where the model gets some answers right and some wrong. One way to keep the model in that zone is curriculum learning: start with easy problems and gradually introduce harder ones.
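A small check for the saturated case, assuming the mean-subtracted advantage described in these notes. The helper name `has_learning_signal` is made up for illustration.

```python
def has_learning_signal(rewards: list[float]) -> bool:
    """True if at least one advantage (reward minus group mean) is non-zero."""
    mean = sum(rewards) / len(rewards)
    return any(abs(r - mean) > 1e-12 for r in rewards)

all_correct = [1.0] * 16
all_wrong = [0.0] * 16
mixed = [1.0] * 4 + [0.0] * 12
```

`has_learning_signal(all_correct)` and `has_learning_signal(all_wrong)` are both False, while `has_learning_signal(mixed)` is True. A check like this could be used to log how many prompts in a step are saturated.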

How the Weights Are Updated

After rewards and advantages are computed, GRPO computes a loss and runs backpropagation through the entire model — all layers and all attention heads get updated, just like fine-tuning.

However RL gradients can be very noisy and unstable. One bad update can destroy the model. GRPO adds three constraints to keep updates safe.

Clipping: the loss stops rewarding further change once a token's probability has moved too far in one step. The ratio between new and old probability is clipped to a fixed range, so pushing past that range earns no extra credit. This discourages one aggressive update from breaking the model.

KL penalty: there is a frozen copy of the original model called the reference policy. The loss includes a penalty if the updated model drifts too far from this reference. In the NeMo RL math example the penalty weight is set to 0.01. This keeps the model from forgetting what it learned during pretraining.

Small learning rate: RL uses a much smaller learning rate than pretraining to keep updates gentle and gradual.

The intuition is that the model has learned habits during pretraining. RL is trying to change those habits. If you change too fast the model gets confused and forgets everything it knew. The three constraints make sure changes happen gradually and safely.

The Two-Model Architecture

NeMo RL uses two separate instances of the same model during training — a policy model and a generation model. They have identical architecture because weights must be copied between them, but they serve different roles.

The generation model runs inside vLLM — a fast inference engine optimized for throughput. It generates thousands of answers quickly in parallel but does not receive gradients. Its weights are fixed during each training step.

The policy model runs inside PyTorch with an optimizer. It receives gradients and updates its weights via backpropagation. It is the model that actually learns.

After each training step the updated policy weights are copied into vLLM so the next step generates answers from the improved model. Without this refit step the generation model would keep using old weights forever.
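The refit step can be pictured with a toy stand-in for the two models. Everything here (`ToyModel`, `refit`, a single weight named "w") is hypothetical and only illustrates why the copy is needed. It is not how NeMo RL or vLLM transfer weights.

```python
import copy

class ToyModel:
    """Stand-in for a model: just a dict of named weights (hypothetical)."""
    def __init__(self, weights: dict[str, float]):
        self.weights = weights

def refit(policy: ToyModel, generator: ToyModel) -> None:
    """Copy the policy's current weights into the generation model."""
    generator.weights = copy.deepcopy(policy.weights)

policy = ToyModel({"w": 1.0})
generator = ToyModel({"w": 1.0})

policy.weights["w"] -= 0.1      # a training step changes only the policy
stale = generator.weights["w"]  # the generator still holds the old value
refit(policy, generator)        # after refit both models agree again
```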

What is a Training Step

A step in RL is different from an epoch or a single example. One step is one full iteration of the RL loop:

vLLM generates 512 answers (32 prompts x 16 answers each)
The environment grades all 512 answers and returns rewards
GRPO computes advantages for each answer
The policy model computes the loss and updates its weights once
Updated weights are copied from the policy model into vLLM
The loop repeats

The vLLM weights are updated once per step — after all 512 examples have been generated, graded, and trained on.

What Data is Fixed and What is Generated

The original prompts and ground truth answers come from a fixed dataset — OpenMathInstruct-2 in the NeMo RL math example. These never change.

What changes every step is the model answers. vLLM generates fresh answers at the start of every step using the current model weights. After the step the model is updated and the next step generates slightly different answers.

This is what makes RL fundamentally different from supervised learning. In supervised learning the training data is fixed. In RL the training data evolves as the model improves — better model generates better answers which create a different training signal which makes the model even better.

Similarities and Differences Between the Two Models

Both models start from the same pretrained checkpoint and have identical architecture — same number of layers, same dimensions, same transformer structure. Weights must be copyable between them so the architecture can never diverge.

Beyond that they are quite different.

The policy model runs in PyTorch with full gradient computation. It stores weights as PyTorch tensors alongside optimizer states — momentum, variance, and other values the optimizer needs. This uses a lot of memory but is necessary for learning.

The vLLM model runs in vLLM’s own optimized runtime with no gradients and no optimizer state. Its weights are stored in a format optimized for fast parallel generation. It also maintains a KV cache, which stores the key and value vectors of previous tokens, so it does not recompute the keys and values of earlier tokens each time it generates a new one. This makes it much faster than a plain PyTorch model for generating long answers.
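To see why the KV cache matters, the sketch below counts how many per-token key/value projections are computed while generating an answer. It is a toy cost model under the assumption of one projection per token per forward pass. It ignores layers, heads, and batching, and the function name is made up.

```python
def kv_projections(prompt_len: int, num_generated: int, use_cache: bool) -> int:
    """Count per-token key/value projections over a whole generation."""
    total = 0
    for step in range(num_generated):
        seq_len = prompt_len + step      # tokens visible at this decoding step
        if use_cache and step > 0:
            total += 1                   # only the newest token is projected
        else:
            total += seq_len             # the whole sequence is (re)projected
    return total

with_cache = kv_projections(prompt_len=100, num_generated=200, use_cache=True)
without_cache = kv_projections(prompt_len=100, num_generated=200, use_cache=False)
```

With these numbers `with_cache` is 299 and `without_cache` is 39900. The gap grows quadratically with answer length.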

A useful analogy: think of the architecture as a recipe. The policy model is that recipe printed in a cookbook — carefully annotated, used for studying and improving. The vLLM model is the same recipe laminated on the kitchen counter — optimized for fast execution. Same content, different format, different purpose.

The Full Training Loop

One training step in NeMo RL follows this sequence:

Load 32 prompts from the dataset
vLLM generates 16 answers per prompt — 512 answers total
All 512 answers go to the environment which returns a reward for each
GRPO computes the group average reward per prompt and subtracts it from each answer reward to get an advantage per answer
The policy model runs a forward pass on all 512 prompt+answer sequences and computes logprobs for each answer token
The loss is computed using logprobs and advantages
Backpropagation updates the policy weights
Updated weights are copied into vLLM
Loop repeats
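The first half of this loop can be sketched with stub functions. `generate` and `grade` are placeholders for vLLM and the environment, not real APIs. The batch sizes are the ones quoted in these notes.

```python
import random

NUM_PROMPTS = 32          # prompts per step (from the notes)
ANSWERS_PER_PROMPT = 16   # generations per prompt (from the notes)

def generate(prompt: str, n: int, rng: random.Random) -> list[str]:
    """Stub for vLLM generation: returns n fake answers."""
    return [str(rng.choice([11, 12, 14])) for _ in range(n)]

def grade(answer: str, truth: str) -> float:
    """Stub environment: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer == truth else 0.0

def run_step(dataset: list[tuple[str, str]], rng: random.Random) -> list[list[float]]:
    """One step up to the advantage computation; returns advantages per prompt."""
    batch = dataset[:NUM_PROMPTS]
    all_advantages = []
    for prompt, truth in batch:
        answers = generate(prompt, ANSWERS_PER_PROMPT, rng)
        rewards = [grade(a, truth) for a in answers]
        mean = sum(rewards) / len(rewards)
        all_advantages.append([r - mean for r in rewards])
    # A real step would now compute logprobs and the loss, backpropagate,
    # and copy the updated weights into the generation engine.
    return all_advantages

dataset = [("What is the square root of 144?", "12")] * NUM_PROMPTS
advantages = run_step(dataset, random.Random(0))
total_answers = sum(len(group) for group in advantages)  # 32 x 16 = 512
```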

Two Different Forward Passes

There are two completely different forward passes in each step.

The first pass happens inside vLLM during generation. The prompt goes in and vLLM generates the answer token by token. At each step the new token attends only to the prompt and all previously generated tokens — this is autoregressive generation, strictly left to right.

The second pass happens inside the policy model during training. The full sequence — prompt plus complete generated answer — goes through attention together. Every token attends to every token before it across the full sequence. But logprobs are only computed for the answer tokens — not the prompt tokens. This is controlled by a token loss mask:

prompt tokens → token_loss_mask = 0 → logprobs ignored
answer tokens → token_loss_mask = 1 → logprobs computed

This makes sense because we want to teach the model to generate better answers — not to regenerate the prompt.
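A minimal sketch of how such a mask might be built and applied, using plain Python lists. The helper names are illustrative. Only the name `token_loss_mask` comes from the notes.

```python
def build_token_loss_mask(prompt_len: int, answer_len: int) -> list[int]:
    """0 for every prompt token, 1 for every answer token."""
    return [0] * prompt_len + [1] * answer_len

def masked_mean(values: list[float], mask: list[int]) -> float:
    """Average the values only where the mask is 1 (the answer tokens)."""
    kept = [v for v, m in zip(values, mask) if m == 1]
    return sum(kept) / len(kept) if kept else 0.0

token_loss_mask = build_token_loss_mask(prompt_len=3, answer_len=2)
logprobs = [-5.0, -4.0, -3.0, -1.0, -2.0]   # made-up values, one per token
answer_mean = masked_mean(logprobs, token_loss_mask)
```

Here `token_loss_mask` is `[0, 0, 0, 1, 1]` and `answer_mean` is -1.5, because the three prompt logprobs are ignored.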

The Loss Function

After the second forward pass we have three things:

Logprobs per token — for each answer token, the log probability the policy assigns to that token given everything before it. These are computed during the second forward pass.

Prev_logprobs per token — the log probability the policy assigned to each token just before this update. These are computed by running the policy model on the same sequences before the weight update.

Advantage per answer — one scalar per answer saying how good this answer was compared to the group average. Advantage is computed at the answer level not the token level.

From Logprobs to a Ratio

For each answer token we compute a ratio between the current and previous probability:

ratio = exp(current_logprob - prev_logprob) = current_probability / prev_probability

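If ratio > 1 the policy is now more likely to generate this token than before this update. If ratio < 1 it is now less likely. When a batch is used for only one gradient update, the current and previous logprobs are essentially identical at the moment the loss is evaluated, so the ratio is 1. The gradient is still non-zero, and the ratio only moves away from 1 if the same batch is reused for further updates.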
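The ratio formula in code, with made-up probabilities so the result is easy to check by hand. The function name is illustrative.

```python
import math

def token_ratios(current_logprobs: list[float], prev_logprobs: list[float]) -> list[float]:
    """ratio = exp(current_logprob - prev_logprob) for each answer token."""
    return [math.exp(c - p) for c, p in zip(current_logprobs, prev_logprobs)]

# Token 1 went from probability 0.5 to 0.6, token 2 from 0.25 to 0.2.
current = [math.log(0.6), math.log(0.2)]
previous = [math.log(0.5), math.log(0.25)]
ratios = token_ratios(current, previous)   # approximately [1.2, 0.8]
```

Working in log space avoids dividing two very small probabilities directly.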

Combining Ratio and Advantage

The advantage is one scalar per answer but it is broadcast to every token in that answer. Then for each token:

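loss = -(ratio x advantage)

The minus sign is there because training minimizes the loss, and we want ratio x advantage to go up. If the advantage is positive (good answer), the loss falls as the ratio rises above 1, which pushes the probability of these tokens higher than before. If the advantage is negative (bad answer), the loss falls as the ratio drops below 1, which pushes the probability of these tokens lower than before.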

Every token in a good answer gets pushed up equally. Every token in a bad answer gets pushed down equally. The final loss is the mean across all tokens across all 512 answers in the step.
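The broadcast and the token-level mean can be sketched as follows. The loss is written with a leading minus sign so that minimizing it increases ratio x advantage. Clipping and the KL term are left out here, and the function name is an assumption.

```python
def unclipped_loss(ratios_per_answer: list[list[float]], advantages: list[float]) -> float:
    """Mean over all answer tokens of -(ratio x advantage)."""
    terms = []
    for ratios, advantage in zip(ratios_per_answer, advantages):
        # The single answer-level advantage is reused for every token.
        terms.extend(-(ratio * advantage) for ratio in ratios)
    return sum(terms) / len(terms)

# Two answers: a good one with 2 tokens and a bad one with 3 tokens.
ratios_per_answer = [[1.0, 1.0], [1.0, 1.0, 1.0]]
advantages = [0.5, -0.5]
loss = unclipped_loss(ratios_per_answer, advantages)
```

Because the mean runs over tokens and not over answers, the longer answer contributes more terms. In this example that gives a loss of about 0.1 even though the two advantages sum to zero.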

Clipping

The ratio can go very large or very small in one update which would destabilize training. So we clip it to stay within a fixed range:

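clipped_ratio = clip(ratio, 1 - ε, 1 + ε)
loss = -min(ratio x advantage, clipped_ratio x advantage)

With ε = 0.2 the clipped ratio stays between 0.8 and 1.2. Clipping does not hard-limit how far a probability can move. It removes the incentive: once a token in a good answer has a ratio above 1.2, or a token in a bad answer has a ratio below 0.8, the loss stops improving and the gradient for that token is zero. Taking the min keeps the unclipped term whenever the ratio has moved in the wrong direction, so those mistakes are still corrected. This discourages one aggressive update from breaking the model.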
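A per-token sketch of the clipped loss, using the standard PPO-style form that takes the min of the clipped and unclipped terms. Whether NeMo RL's implementation matches this line for line is an assumption. The function name is illustrative.

```python
def clipped_token_loss(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """-(min of unclipped and clipped surrogate) for a single answer token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped_ratio * advantage)

# Good answer, ratio already past 1.2: the clipped term is used (no further gain).
good_far = clipped_token_loss(ratio=1.5, advantage=0.5)
# Bad answer whose tokens became MORE likely: the unclipped term is kept.
bad_wrong_way = clipped_token_loss(ratio=1.5, advantage=-0.5)
# Bad answer, ratio already below 0.8: the clipped term is used.
bad_far = clipped_token_loss(ratio=0.5, advantage=-0.5)
```

These evaluate to roughly -0.6, 0.75, and 0.4. Only the middle case still depends on the ratio, so only that case still produces a gradient.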

KL Penalty

There is also a frozen reference model — a copy of the original pretrained model that never gets updated. The loss includes a penalty if the policy drifts too far from this reference:

total_loss = -min(ratio x advantage, clipped_ratio x advantage) + 0.01 x KL(policy, reference)

The KL penalty keeps the model from forgetting everything it learned during pretraining while still allowing it to improve on the task. In the NeMo RL math example this penalty weight is set to 0.01.
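These notes do not say how the KL term is estimated. One per-token estimator that appears in the GRPO paper is r - log(r) - 1 with r = p_reference / p_policy. The sketch below uses it as an assumption, not as a statement about NeMo RL's exact code. Function names are illustrative.

```python
import math

def kl_estimate(policy_logprob: float, reference_logprob: float) -> float:
    """Per-token estimate r - log(r) - 1, where r = p_reference / p_policy.

    It is 0 when the two models agree and positive otherwise.
    """
    log_r = reference_logprob - policy_logprob
    return math.exp(log_r) - log_r - 1.0

def total_token_loss(policy_term: float, policy_logprob: float,
                     reference_logprob: float, kl_weight: float = 0.01) -> float:
    """Policy loss term plus the weighted KL penalty for one token."""
    return policy_term + kl_weight * kl_estimate(policy_logprob, reference_logprob)

no_drift = kl_estimate(-1.0, -1.0)   # policy and reference agree
drifted = kl_estimate(-1.0, -2.0)    # policy moved away from the reference
```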

Summary of One Training Step

To put everything together, here is what happens to one prompt in one step:

vLLM generates 16 answers for the prompt
Environment grades each answer — reward = 1 or 0
Group average is computed — say 0.5
Each answer gets an advantage — correct answers get +0.5, wrong get -0.5
Policy model runs forward pass on all 16 prompt+answer sequences
For each answer token, ratio = current_prob / prev_prob
For each answer token, loss = -min(ratio x advantage, clip(ratio, 0.8, 1.2) x advantage)
Mean loss across all tokens drives backpropagation
Weights update — tokens in correct answers become more likely, tokens in wrong answers become less likely
Updated weights copied to vLLM for next step
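The steps for one prompt can be put together in a single function. This is a sketch under the assumptions used throughout these notes (mean-subtracted advantages, PPO-style min clipping, token-level mean) and it leaves out the KL term. The logprob values are invented for the example.

```python
import math

def grpo_prompt_loss(rewards, current_logprobs, prev_logprobs, eps=0.2):
    """Clipped policy loss for one prompt's group of answers (no KL term).

    rewards:          one reward per answer
    current_logprobs: per answer, the current logprobs of its answer tokens
    prev_logprobs:    same shape, logprobs recorded before the update
    """
    mean_reward = sum(rewards) / len(rewards)
    terms = []
    for reward, cur, prev in zip(rewards, current_logprobs, prev_logprobs):
        advantage = reward - mean_reward            # one scalar per answer
        for c, p in zip(cur, prev):
            ratio = math.exp(c - p)
            clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
            terms.append(-min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)                  # mean over all answer tokens

rewards = [1.0] * 8 + [0.0] * 8                     # 16 answers, group average 0.5
prev = [[-1.0, -1.0, -1.0] for _ in rewards]        # 3 answer tokens each

loss_before = grpo_prompt_loss(rewards, prev, prev)  # ratio is 1 everywhere

# Pretend an update made correct answers more likely and wrong ones less likely.
moved = [[-0.9] * 3 if r == 1.0 else [-1.1] * 3 for r in rewards]
loss_after = grpo_prompt_loss(rewards, moved, prev)
```

`loss_before` is 0 because the advantages cancel when every ratio is 1, and `loss_after` is lower, which is the direction gradient descent moves in.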

NVIDIA Solutions Architect — Study Notes
