Why GRPO Uses r - log r - 1 for the KL Term, Instead of Just log Ratio

In the GRPO paper, the KL-related term is written (per token) as:

\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_t\mid \cdot)}{\pi_\theta(o_t\mid \cdot)} -\log \frac{\pi_{\text{ref}}(o_t\mid \cdot)}{\pi_\theta(o_t\mid \cdot)} - 1\]

At first glance this looks different from the familiar “log-ratio” form. The key point is that this expression has the same expectation as the standard KL under samples from $\pi_\theta$, but the per-sample estimate is more stable and harder to exploit during training.


1. Define the Ratio

Let

\[r = \frac{\pi_{\text{ref}}(o_t\mid \cdot)}{\pi_\theta(o_t\mid \cdot)}=\exp\big(\log \pi_{\text{ref}} - \log \pi_\theta\big).\]

GRPO uses the per-token estimator $\phi(r) = r - \log r - 1.$
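As a minimal sketch, the per-token term can be computed directly from log-probabilities (the function name `k3_kl_term` is my own label, not from the paper):

```python
import math

def k3_kl_term(logp_theta: float, logp_ref: float) -> float:
    """Per-token GRPO KL term: r - log(r) - 1, with r = pi_ref / pi_theta.

    Computed from log-probabilities, since log r = logp_ref - logp_theta
    is what a language-model forward pass naturally produces.
    """
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0
```

When the two policies agree on a token (`logp_theta == logp_ref`), the term is exactly zero; any disagreement in either direction makes it positive.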


2. It Is Exactly the Same KL in Expectation

A crucial identity is:

\[\mathbb{E}_{o_t\sim \pi_\theta}[r]=\sum_{o_t}\pi_\theta(o_t)\frac{\pi_{\text{ref}}(o_t)}{\pi_\theta(o_t)}=\sum_{o_t}\pi_{\text{ref}}(o_t)=1.\]

Therefore,

\[\mathbb{E}_{\pi_\theta}\big[r-\log r-1\big]=(\mathbb{E}_{\pi_\theta}[r]-1)-\mathbb{E}_{\pi_\theta}[\log r]=-\mathbb{E}_{\pi_\theta}[\log r].\]

But since \(-\log r = \log(\pi_\theta/\pi_{\text{ref}})\), we get

\[-\mathbb{E}_{\pi_\theta}[\log r]=\mathbb{E}_{\pi_\theta} \left[\log\frac{\pi_\theta}{\pi_{\text{ref}}}\right]=D_{\mathrm{KL}}(\pi_\theta\Vert \pi_{\text{ref}}).\]

Conclusion:

\[\mathbb{E}_{\pi_\theta}[r-\log r-1] = D_{\mathrm{KL}}(\pi_\theta\Vert \pi_{\text{ref}}).\]

So GRPO is not changing the KL objective; it is changing how the KL is estimated per sample/token.
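This identity is easy to check numerically. The sketch below uses two hypothetical categorical distributions over a 3-token vocabulary (illustrative numbers, not from the paper) and compares the Monte Carlo mean of $r - \log r - 1$ against the analytic KL:

```python
import math
import random

# Hypothetical distributions over a 3-token vocabulary (illustrative only).
pi_theta = [0.6, 0.3, 0.1]   # current policy
pi_ref   = [0.4, 0.4, 0.2]   # reference policy

# Analytic KL(pi_theta || pi_ref)
kl_exact = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Monte Carlo mean of r - log r - 1 under samples from pi_theta
random.seed(0)
samples = random.choices(range(3), weights=pi_theta, k=200_000)
kl_mc = sum(
    (pi_ref[o] / pi_theta[o]) - math.log(pi_ref[o] / pi_theta[o]) - 1.0
    for o in samples
) / len(samples)
```

With 200k samples, `kl_mc` agrees with `kl_exact` to two or three decimal places, confirming the estimator is unbiased for the true KL.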


3. Why Not Just Use the Log Ratio Directly?

Even though the expected KL is the same, the per-token sampled signal behaves very differently.

3.1 Always a Pure Penalty (Never “Accidentally a Reward”)

Using the per-token log ratio directly often yields terms like:

\[-\beta \log\frac{\pi_\theta}{\pi_{\text{ref}}}.\]

When $\pi_\theta < \pi_{\text{ref}}$ for a token, we have $\log(\pi_\theta/\pi_{\text{ref}})<0$, so the above becomes positive, i.e. it acts like a reward for decreasing $\pi_\theta$ on that token.

In contrast,

\[r-\log r-1 \ge 0 \quad\text{(since }\log r \le r-1\text{)}\]

so $-\beta(r-\log r-1)\le 0$ is always a penalty term.
This makes it harder to exploit and reduces reward-hacking-like behavior.
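A concrete instance of the sign difference, with made-up token probabilities:

```python
import math

# A token the current policy already under-weights relative to the reference
# (illustrative probabilities).
p_theta, p_ref = 0.1, 0.4
r = p_ref / p_theta                 # r = 4.0

naive = -math.log(r)                # log(pi_theta / pi_ref) < 0 here
k3 = r - math.log(r) - 1.0          # non-negative by log r <= r - 1

# With a -beta coefficient in the objective:
#   naive < 0  -> the term flips sign and rewards shrinking pi_theta further
#   k3 >= 0    -> the term is always a penalty
```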


3.2 Lower Variance via a Control Variate

Notice that $(r-1)$ has zero mean under $\pi_\theta$:

\[\mathbb{E}_{\pi_\theta}[r-1] = 0.\]

So

\[r-\log r-1 = -\log r + (r-1)\]

is effectively adding a zero-mean control variate to the log-ratio estimator.
This keeps the expectation unchanged but typically reduces variance, making updates more stable in practice (especially with sampled trajectories/tokens).
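The variance reduction can be checked empirically. Reusing the same illustrative 3-token distributions as above, the sketch compares the plain log-ratio estimator $-\log r$ with the control-variate version $-\log r + (r-1)$ on identical samples:

```python
import math
import random

pi_theta = [0.6, 0.3, 0.1]   # illustrative current policy
pi_ref   = [0.4, 0.4, 0.2]   # illustrative reference policy

random.seed(0)
samples = random.choices(range(3), weights=pi_theta, k=100_000)

naive_vals = []   # -log r            (plain log-ratio estimator)
k3_vals = []      # -log r + (r - 1)  (with zero-mean control variate)
for o in samples:
    r = pi_ref[o] / pi_theta[o]
    naive_vals.append(-math.log(r))
    k3_vals.append(-math.log(r) + (r - 1.0))

def mean(xs): return sum(xs) / len(xs)
def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

For these distributions the two means agree (both estimate the same KL), while the control-variate version has markedly lower variance. Note the reduction is typical near the reference policy but not guaranteed for arbitrary distribution pairs.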


3.3 Smooth Quadratic Behavior Near the Reference Policy

Let $x = \log\frac{\pi_\theta}{\pi_{\text{ref}}} = -\log r.$ Then $r-\log r-1 = e^{-x}+x-1.$

For small deviations ($x\approx 0$):

\[e^{-x}+x-1 \approx \frac{x^2}{2}.\]

So near the reference policy, this behaves like a quadratic penalty, which is smoother than a purely linear log-ratio term and helps avoid overly sharp updates.
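A quick numerical check of the quadratic approximation (the function name `k3_in_x` is mine, for illustration):

```python
import math

def k3_in_x(x: float) -> float:
    """The penalty r - log r - 1 rewritten as e^{-x} + x - 1,
    where x = log(pi_theta / pi_ref) = -log r."""
    return math.exp(-x) + x - 1.0

# Near the reference policy (x ~ 0) the penalty tracks x^2 / 2;
# the gap shrinks like O(x^3) as x -> 0.
```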


4. Summary

  • Standard form:
\[D_{\mathrm{KL}}(\pi_\theta\Vert \pi_{\text{ref}})=\mathbb{E}_{\pi_\theta}\!\left[\log\frac{\pi_\theta}{\pi_{\text{ref}}}\right]\]
  • GRPO low-variance estimator:
\[\mathbb{E}_{\pi_\theta}[r-\log r-1]\quad\text{with}\quad r=\frac{\pi_{\text{ref}}}{\pi_\theta}\]
  • Why GRPO uses it:
    1. Always non-negative per token → pure penalty, less exploitable
    2. Lower variance via a zero-mean control variate $(r-1)$
    3. Smoother near the reference policy (quadratic approximation)

In short: GRPO uses a different per-sample estimator to get the same KL objective, but with better stability and robustness.
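Putting the pieces together, a plain-Python sketch of the batched per-token penalty (function name, list-of-lists layout, and the `beta` value are all illustrative assumptions; real implementations operate on tensors):

```python
import math

def grpo_kl_penalty(logp_theta, logp_ref, beta=0.04):
    """Mean per-token KL penalty beta * (r - log r - 1) over a batch.

    logp_theta, logp_ref: per-sequence lists of per-token log-probabilities
    under the current and reference policies. beta is an illustrative
    coefficient, not a recommended value.
    """
    total, count = 0.0, 0
    for seq_theta, seq_ref in zip(logp_theta, logp_ref):
        for lt, lr in zip(seq_theta, seq_ref):
            log_r = lr - lt                       # log(pi_ref / pi_theta)
            total += beta * (math.exp(log_r) - log_r - 1.0)
            count += 1
    return total / count
```

The result is zero exactly when the two policies assign identical log-probabilities to every token, and strictly positive otherwise, matching the pure-penalty property from Section 3.1.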