Why GRPO Uses r - log r - 1 for the KL Term, Instead of Just log Ratio
In the GRPO paper, the KL-related term is written (per token) as:
\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_t\mid \cdot)}{\pi_\theta(o_t\mid \cdot)} -\log \frac{\pi_{\text{ref}}(o_t\mid \cdot)}{\pi_\theta(o_t\mid \cdot)} - 1\]At first glance this looks different from the familiar “log-ratio” form. The key point is: this expression has the same expectation as the standard KL, but is more stable and harder to exploit during training.
1. Define the Ratio
Let
\[r = \frac{\pi_{\text{ref}}(o_t\mid \cdot)}{\pi_\theta(o_t\mid \cdot)}=\exp\big(\log \pi_{\text{ref}} - \log \pi_\theta\big).\]GRPO uses the integrand $\phi(r) = r - \log r - 1.$
2. It Is Exactly the Same KL in Expectation
A crucial identity is:
\[\mathbb{E}_{o_t\sim \pi_\theta}[r]=\sum_{o_t}\pi_\theta(o_t)\frac{\pi_{\text{ref}}(o_t)}{\pi_\theta(o_t)}=\sum_{o_t}\pi_{\text{ref}}(o_t)=1.\]Therefore,
\[\mathbb{E}_{\pi_\theta}\big[r-\log r-1\big]=(\mathbb{E}_{\pi_\theta}[r]-1)-\mathbb{E}_{\pi_\theta}[\log r]=-\mathbb{E}_{\pi_\theta}[\log r].\]But since \(-\log r = \log(\pi_\theta/\pi_{\text{ref}})\), we get
\[-\mathbb{E}_{\pi_\theta}[\log r]=\mathbb{E}_{\pi_\theta} \left[\log\frac{\pi_\theta}{\pi_{\text{ref}}}\right]=D_{\mathrm{KL}}(\pi_\theta\Vert \pi_{\text{ref}}).\]✅ Conclusion:
\[\mathbb{E}_{\pi_\theta}[r-\log r-1] = D_{\mathrm{KL}}(\pi_\theta\Vert \pi_{\text{ref}}).\]So GRPO is not changing the KL objective; it is changing how the KL is estimated per sample/token.
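This identity is easy to verify numerically. The sketch below uses two toy categorical policies (the probabilities are illustrative, not from the paper) and checks that the Monte Carlo mean of $r - \log r - 1$ under samples from $\pi_\theta$ matches the exact KL:

```python
import math
import random

random.seed(0)

# Toy categorical policies over 3 tokens (hypothetical numbers for illustration).
pi_theta = [0.5, 0.3, 0.2]   # current policy pi_theta
pi_ref   = [0.4, 0.4, 0.2]   # reference policy pi_ref

# Exact KL(pi_theta || pi_ref) = sum_x pi_theta(x) * log(pi_theta(x)/pi_ref(x))
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Monte Carlo estimate of E_{x ~ pi_theta}[ r - log r - 1 ], r = pi_ref(x)/pi_theta(x)
n = 200_000
samples = random.choices(range(3), weights=pi_theta, k=n)
est = sum((pi_ref[x] / pi_theta[x]) - math.log(pi_ref[x] / pi_theta[x]) - 1
          for x in samples) / n

print(f"exact KL: {exact_kl:.4f}, estimator mean: {est:.4f}")  # the two agree closely
```

With enough samples the estimator mean converges to the exact KL, because the estimator is unbiased.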
3. Why Not Just Use the Log Ratio Directly?
Even though the expected KL is the same, the per-token sampled signal behaves very differently.
3.1 Always a Pure Penalty (Never “Accidentally a Reward”)
If the per-token log ratio were used directly, the KL penalty added to the objective would be:
\[-\beta \log\frac{\pi_\theta}{\pi_{\text{ref}}}.\]When $\pi_\theta < \pi_{\text{ref}}$ for a token, we have $\log(\pi_\theta/\pi_{\text{ref}})<0$, so this term becomes positive, i.e. it acts like a reward for further decreasing $\pi_\theta$ on that token.
In contrast,
\[r-\log r-1 \ge 0 \quad\text{(since }\log r \le r-1\text{, with equality iff }r=1\text{)}\]so $-\beta(r-\log r-1)\le 0$ is always a penalty term.
This makes it harder to exploit and reduces reward-hacking-like behavior.
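A single-token example makes the sign difference concrete. Assuming a token where the current policy puts less mass than the reference (the probabilities below are hypothetical), the plain log-ratio penalty flips sign, while the GRPO integrand stays non-negative:

```python
import math

# A token where the current policy puts less mass than the reference
# (hypothetical probabilities for illustration).
p_theta, p_ref = 0.1, 0.4
r = p_ref / p_theta                  # r = 4.0 > 1

naive = -math.log(p_theta / p_ref)   # -log(pi_theta/pi_ref) = +log 4 > 0: acts as a reward
grpo  = r - math.log(r) - 1          # always >= 0, so -beta * grpo is always a penalty

print(f"naive log-ratio term: {naive:+.3f}")   # positive: exploitable as a bonus
print(f"GRPO integrand:       {grpo:+.3f}")    # non-negative: pure penalty
```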
3.2 Lower Variance via a Control Variate
Notice that $(r-1)$ has zero mean under $\pi_\theta$:
\[\mathbb{E}_{\pi_\theta}[r-1] = 0.\]So
\[r-\log r-1 = -\log r + (r-1)\]is effectively adding a zero-mean control variate to the log-ratio estimator.
This keeps the expectation unchanged but typically reduces variance, making updates more stable in practice (especially with sampled trajectories/tokens).
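The variance reduction can also be checked empirically. The sketch below (again with illustrative toy policies) compares the plain log-ratio estimator $-\log r$ against the GRPO integrand $r - \log r - 1$ on the same samples; both should have the same mean, but the GRPO form typically has much lower variance:

```python
import math
import random

random.seed(0)

# Toy policies (hypothetical numbers for illustration).
pi_theta = [0.5, 0.3, 0.2]
pi_ref   = [0.25, 0.25, 0.5]

n = 100_000
xs = random.choices(range(3), weights=pi_theta, k=n)

k1 = []  # plain log-ratio estimator: -log r = log(pi_theta/pi_ref)
k3 = []  # GRPO estimator: r - log r - 1 = -log r + (r - 1)
for x in xs:
    r = pi_ref[x] / pi_theta[x]
    k1.append(-math.log(r))
    k3.append(r - math.log(r) - 1)

def mean(v): return sum(v) / len(v)
def var(v):
    m = mean(v)
    return sum((u - m) ** 2 for u in v) / len(v)

print(f"means:     k1={mean(k1):.4f}  k3={mean(k3):.4f}")  # both estimate the same KL
print(f"variances: k1={var(k1):.4f}  k3={var(k3):.4f}")    # k3 is typically much lower
```

Note that the variance reduction is typical rather than guaranteed; for extreme ratios the $(r-1)$ term can itself become large.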
3.3 Smooth Quadratic Behavior Near the Reference Policy
Let $x = \log\frac{\pi_\theta}{\pi_{\text{ref}}} = -\log r.$ Then $r-\log r-1 = e^{-x}+x-1.$
For small deviations ($x\approx 0$):
\[e^{-x}+x-1 \approx \frac{x^2}{2}.\]So near the reference policy, this behaves like a quadratic penalty, which is smoother than a purely linear log-ratio term and helps avoid overly sharp updates.
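The quadratic approximation is straightforward to verify numerically; for small $x$, the gap between $e^{-x}+x-1$ and $x^2/2$ shrinks like $x^3$:

```python
import math

# Compare the GRPO integrand e^{-x} + x - 1 with its quadratic
# approximation x^2/2 near x = log(pi_theta/pi_ref) = 0.
for x in (0.2, 0.1, 0.01):
    phi = math.exp(-x) + x - 1
    quad = x * x / 2
    print(f"x={x:5}: phi={phi:.6f}  x^2/2={quad:.6f}")
```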
4. Summary
- Standard form: $\log\frac{\pi_\theta}{\pi_{\text{ref}}}$ (the plain per-token log-ratio estimator)
- GRPO low-variance integrand: $r-\log r-1$, with $r = \frac{\pi_{\text{ref}}}{\pi_\theta}$
- Why GRPO uses it:
  - Always non-negative per token → pure penalty, less exploitable
  - Lower variance via a zero-mean control variate $(r-1)$
  - Smoother near the reference policy (quadratic approximation)
In short: GRPO uses a different per-sample estimator to get the same KL objective, but with better stability and robustness.