NeMo RL — Learning Notes
What is Post-Training?
When we train an LLM from scratch (pretraining), we teach it to predict the next token from massive amounts of text. The model gets good at sounding like human text — but sounding correct is not the same as being correct or helpful.
Post-training is a second phase that starts after pretraining. It takes a model that can already generate text and teaches it to behave correctly for a specific task. NeMo RL is NVIDIA’s framework for this second phase.
The sequence is always:
Pretraining → the model learns language from a massive text corpus
Post-training → the model learns to behave correctly using rewards
You cannot do RL before pretraining because RL needs the model to generate answers first, grade them, and then update based on that grade. If the model cannot generate anything meaningful yet, there is nothing to grade.
Fine-Tuning vs Reinforcement Learning
Both fine-tuning and RL update the model weights — but the signal that drives the update is completely different.
In fine-tuning (SFT — Supervised Fine-Tuning), you show the model the correct answer directly. The model sees input-output pairs and learns to copy the correct output. The signal is: here is the right answer, learn it.
In RL, you never show the model the correct answer. Instead the model generates many answers, each answer gets a reward score, and the model learns to generate more answers like the high-reward ones and fewer like the low-reward ones. The signal is: try something, I will tell you if it was good or bad.
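The difference in signal can be sketched with two toy loss functions. This is a simplified illustration, not NeMo RL code: the function names and the log-probability values are made up, and a real RL loss adds safeguards such as clipping.

```python
import math

def sft_loss(token_logprobs):
    """Supervised signal: negative mean log-probability of the known correct tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def policy_gradient_loss(token_logprobs, advantage):
    """RL signal: the same log-probabilities, weighted by how good the sampled answer was."""
    return -advantage * sum(token_logprobs) / len(token_logprobs)

# Log-probabilities the model assigns to each token of one answer (made-up numbers).
logprobs = [math.log(0.5), math.log(0.25)]

loss_supervised = sft_loss(logprobs)                 # always pushes these tokens up
loss_good = policy_gradient_loss(logprobs, 1.0)      # good sample: same direction as SFT
loss_bad = policy_gradient_loss(logprobs, -1.0)      # bad sample: pushes these tokens down
```

In SFT the tokens come from a dataset and are always reinforced. In RL the tokens come from the model itself, and the sign of the advantage decides whether they are reinforced or suppressed.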
Why RL Instead of Fine-Tuning?
Fine-tuning requires correct answers. For many tasks you do not have correct answers — you only know if an answer was good or bad.
For math you have correct answers so you could fine-tune. But for tasks like writing a helpful response, being creative, or following user preferences, there is no single correct answer. RL suits these tasks because it needs only a reward signal, not a correct answer.
How Do You Reward Subjective Tasks?
For tasks with clear correct answers like math, the reward is simple — correct gets 1, wrong gets 0.
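A binary reward of this kind might look like the sketch below. The function name `math_reward` is hypothetical, and it assumes the final answer has already been extracted as a plain number string; the actual NeMo RL math environment handles answer extraction and checking in its own way.

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the answer matches the ground truth numerically, else 0.0."""
    try:
        return 1.0 if float(model_answer.strip()) == float(ground_truth.strip()) else 0.0
    except ValueError:
        # Anything that does not parse as a number counts as wrong.
        return 0.0
```

Comparing as numbers rather than strings means "12" and "12.0" both earn the reward.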
For subjective tasks like creativity, there are three approaches.
Human feedback: a human reads the output and gives a score. This is the basis of RLHF (Reinforcement Learning from Human Feedback), the approach used to train ChatGPT. Scoring every output by hand is expensive and slow, so in practice RLHF pipelines usually train a reward model on the human ratings.
Reward model: first collect human ratings on many examples, then train a separate model to predict what a human would rate. This gives you a cheap automatic grader that approximates human judgment at scale.
Rule-based proxy rewards: reward measurable properties like vocabulary diversity, length, or sentiment. These are imperfect but cheap and automatic.
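As an illustration of a rule-based proxy, vocabulary diversity could be measured as the share of distinct words in the output. The function below is a hypothetical sketch, with whitespace splitting standing in for proper tokenization.

```python
def vocabulary_diversity(text: str) -> float:
    """Fraction of distinct words among all words (type-token ratio), in [0, 1]."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

This also shows why such proxies are imperfect: a string of unrelated random words scores a perfect 1.0 without being creative at all.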
RL works best on tasks with clear correct answers. This is why NeMo RL uses math as its default example — the reward is unambiguous.
GRPO — Group Relative Policy Optimization
GRPO is the reinforcement learning algorithm used in the NeMo RL math example. Instead of generating one answer per question, GRPO generates many answers for the same question: 16 by default in that example.
For each question, the 16 answers get graded by the environment and receive rewards. GRPO then computes the group average reward and subtracts it from each individual reward to get an advantage:
advantage = reward - group_average
Answers that did better than the group average get a positive advantage. Answers that did worse get a negative advantage. The model then learns to increase the probability of tokens from above-average answers and decrease the probability of tokens from below-average answers.
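The advantage formula can be written out in a few lines. This sketch follows the formula in these notes exactly; many GRPO implementations also divide by the group's standard deviation, which is left out here. The function name is illustrative.

```python
def group_advantages(rewards):
    """Subtract the group mean from each reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four answers to the same question: two correct, two wrong.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)  # [0.5, -0.5, 0.5, -0.5]
```

The advantages in a group always sum to zero, so every step pushes some answers up and others down rather than rewarding everything at once.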
Why Not Simply Reward Each Answer Directly?
If you reward each answer directly without comparing to a group, you run into the credit assignment problem — the model cannot tell which tokens caused the success or failure.
Consider this correct answer:
“The square root of 144 is 12”
Every token gets reward 1. But which token actually mattered? “The”? “square”? “root”? “12”? From this one answer alone, the model has no way to know that “12” was the critical token. Every token gets equal credit, so the signal says nothing about which tokens to keep.
GRPO solves this by generating many answers and comparing them as a group. Now the model can see patterns across answers:
“The answer is 12” → reward 1
“The answer is 14” → reward 0
“The result is 12” → reward 1
“The result is 11” → reward 0
Comparing these answers shows what varies between correct and incorrect ones: it is always the final number. “12” appears only in high-reward answers, while “14” and “11” appear only in low-reward answers. Within a single answer every token still receives the same advantage, but “The”, “answer”, and “is” appear in both good and bad answers, so their positive and negative updates roughly cancel. The net push lands on “12”, which is how the group comparison handles credit assignment.
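One way to see why the comparison singles out the final number is to add up, for each word, the advantages of all the answers that contain it. This is only a counting illustration under simplifying assumptions: words stand in for tokens, and the real gradient also depends on the model's probabilities. The helper name is hypothetical.

```python
from collections import defaultdict

def net_token_credit(answers, rewards):
    """Sum each answer's advantage over every word it contains."""
    mean = sum(rewards) / len(rewards)
    credit = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        for token in answer.split():
            credit[token] += reward - mean
    return dict(credit)

answers = ["The answer is 12", "The answer is 14", "The result is 12", "The result is 11"]
rewards = [1.0, 0.0, 1.0, 0.0]
credit = net_token_credit(answers, rewards)
```

Words shared by correct and incorrect answers end up with a net credit of zero, while "12" accumulates positive credit and "14" and "11" accumulate negative credit.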
There is also a second problem with rewarding one answer at a time: a raw reward says nothing about how the answer compares with what the model usually produces. By subtracting the group average, the signal becomes relative to the current ability of the model.
The flip side is that a group with identical rewards carries no signal. If the model is already very good and all answers are correct, every reward is 1 and every advantage is 0. No gradient, no learning. If the model is very bad and all answers are wrong, the same thing happens. GRPO learns best in the middle zone where the model gets some answers right and some wrong. One fix when the model saturates is curriculum learning: start with easy problems and gradually introduce harder ones so the model stays in the learning zone.
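The "learning zone" can be checked directly from a group's rewards. The helper below is a hypothetical sketch, not part of NeMo RL; a curriculum could use a check like this to decide which prompts are currently worth training on.

```python
def has_learning_signal(rewards):
    """A group teaches something only if its rewards are not all identical."""
    return max(rewards) != min(rewards)

groups = {
    "too easy": [1.0, 1.0, 1.0, 1.0],       # every advantage is 0
    "too hard": [0.0, 0.0, 0.0, 0.0],       # every advantage is 0
    "learning zone": [1.0, 0.0, 1.0, 0.0],  # mixed results, non-zero advantages
}
useful = [name for name, rewards in groups.items() if has_learning_signal(rewards)]
```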
How the Weights Are Updated
After rewards and advantages are computed, GRPO computes a loss and runs backpropagation through the entire model — all layers and all attention heads get updated, just like fine-tuning.
However, RL gradients can be very noisy and unstable. One bad update can destroy the model. GRPO training uses three safeguards to keep updates safe.
Clipping: the update is not allowed to change the probability of any token too much in one step. The ratio between new and old probability is clipped to stay within a fixed range. This prevents one aggressive update from breaking the model.
KL penalty: there is a frozen copy of the original model called the reference policy. The loss includes a penalty that grows as the updated model drifts away from this reference. In the NeMo RL math example the penalty coefficient is set to 0.01. This keeps the model from forgetting what it learned during pretraining.
Small learning rate: RL uses a much smaller learning rate than pretraining to keep updates gentle and gradual.
The intuition is that the model has learned habits during pretraining. RL is trying to change those habits. If you change too fast the model gets confused and forgets everything it knew. The three constraints make sure changes happen gradually and safely.
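Clipping and the KL penalty can be combined in a per-token loss roughly like the sketch below. It is an assumption-laden illustration rather than NeMo RL's implementation: the function name is made up, `clip_eps=0.2` is a commonly used value and not one taken from these notes, and the KL term uses one common non-negative estimator. Only the 0.01 coefficient comes from the notes.

```python
import math

def clipped_token_loss(new_logprob, old_logprob, ref_logprob, advantage,
                       clip_eps=0.2, kl_coef=0.01):
    """Loss for one token: clipped policy-gradient term plus a KL penalty."""
    # How much more (or less) likely the token is under the updated policy.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the clipped and unclipped objectives.
    objective = min(ratio * advantage, clipped_ratio * advantage)
    # Non-negative estimate of drift from the frozen reference policy.
    log_ratio = ref_logprob - new_logprob
    kl_estimate = math.exp(log_ratio) - log_ratio - 1.0
    return -objective + kl_coef * kl_estimate
```

With a positive advantage, raising a token's probability beyond the clip range earns no extra objective, so the update has no incentive to move further in one step. The KL term is zero when the policy matches the reference and grows as it drifts.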
The Two-Model Architecture
NeMo RL uses two separate instances of the same model during training — a policy model and a generation model. They have identical architecture because weights must be copied between them, but they serve different roles.
The generation model runs inside vLLM — a fast inference engine optimized for throughput. It generates thousands of answers quickly in parallel but does not receive gradients. Its weights are fixed during each training step.
The policy model runs inside PyTorch with an optimizer. It receives gradients and updates its weights via backpropagation. It is the model that actually learns.
After each training step the updated policy weights are copied into vLLM so the next step generates answers from the improved model. Without this refit step the generation model would keep using old weights forever.
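The refit step can be pictured with two toy objects. The class names and the single-number "weights" are purely illustrative and do not correspond to NeMo RL or vLLM APIs.

```python
import copy

class ToyPolicy:
    """Stands in for the trainable PyTorch model."""
    def __init__(self):
        self.weights = {"w": 0.0}

    def train_step(self):
        self.weights["w"] += 0.1  # pretend gradient update

class ToyGenerator:
    """Stands in for the inference engine; it never trains."""
    def __init__(self, weights):
        self.weights = copy.deepcopy(weights)

    def refit(self, new_weights):
        # Copy the policy's current weights into the generator.
        self.weights = copy.deepcopy(new_weights)

policy = ToyPolicy()
generator = ToyGenerator(policy.weights)

policy.train_step()
stale = generator.weights["w"]   # still the old value until refit
generator.refit(policy.weights)
fresh = generator.weights["w"]   # now matches the policy
```

Because the generator holds its own copy, training the policy does not change what the generator produces until the weights are copied across.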
What is a Training Step?
A step in RL is different from an epoch or a single example. One step is one full iteration of the RL loop:
- vLLM generates 512 answers (32 prompts x 16 answers each)
- The environment grades all 512 answers and returns rewards
- GRPO computes advantages for each answer
- The policy model computes the loss and updates its weights once
- Updated weights are copied from the policy model into vLLM
- The loop repeats
The vLLM weights are updated once per step — after all 512 examples have been generated, graded, and trained on.
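The loop above can be sketched with stub functions to make the bookkeeping concrete. Everything here is a stand-in: `generate` returns random fake answers instead of calling an inference engine, `grade` is a toy environment, and the weight update and refit are left as a comment.

```python
import random

PROMPTS_PER_STEP = 32
ANSWERS_PER_PROMPT = 16

def generate(prompt, n, rng):
    """Stub generator: returns n fake answers (stands in for the inference engine)."""
    return [rng.choice(["12", "14"]) for _ in range(n)]

def grade(answer, truth):
    """Stub environment: binary reward."""
    return 1.0 if answer == truth else 0.0

def rl_step(dataset, rng):
    """Generate, grade, and compute advantages for one step."""
    samples = []
    for prompt, truth in dataset[:PROMPTS_PER_STEP]:
        answers = generate(prompt, ANSWERS_PER_PROMPT, rng)
        rewards = [grade(a, truth) for a in answers]
        mean = sum(rewards) / len(rewards)
        samples += [(prompt, a, r - mean) for a, r in zip(answers, rewards)]
    # A real step would now run one optimizer update on `samples`
    # and copy the new weights into the generator.
    return samples

dataset = [(f"question {i}", "12") for i in range(PROMPTS_PER_STEP)]
samples = rl_step(dataset, random.Random(0))
```

Each call produces 32 × 16 = 512 training samples, and advantages are computed within each prompt's group of 16 rather than across the whole batch.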
What Data is Fixed and What is Generated
The original prompts and ground truth answers come from a fixed dataset — OpenMathInstruct-2 in the NeMo RL math example. These never change.
What changes every step is the model's answers. vLLM generates fresh answers at the start of every step using the current model weights. After the step, the model is updated and the next step generates slightly different answers.
This is what makes RL fundamentally different from supervised learning. In supervised learning the training data is fixed. In RL the training data evolves as the model improves: a better model generates better answers, which create a different training signal, which in turn makes the model even better.