推理模型训练范式：RLVR、GRPO与 Test-Time Compute Scaling

📍 本文是「LLM进阶：从会用到底层精通」专题的第 4/10 篇

📊 难度：高级 | ⏱️ 预计阅读：25 分钟

学习目标

🎯 学完本文后，你将能够：

- 理解推理模型（Reasoning Model）与传统 LLM 的本质差异，以及为什么预训练 Scaling 的边际收益正在递减

- 掌握 GRPO（Group Relative Policy Optimization）的数学推导，能解释它为什么不需要 Critic 模型

- 能设计 RLVR（Reinforcement Learning with Verifiable Rewards）场景下的奖励函数，区分数学回报、代码回报和格式回报

- 理解 Test-Time Compute Scaling 的三种范式及其与推理模型训练的关系

- 能用 TRL 的 GRPOTrainer 在 GSM8K 上跑一个简化版推理训练实验

前置唤醒

📚 在开始之前，请确认你已经理解：

- PPO（Proximal Policy Optimization）的基本框架：Policy、Value Function、Advantage 的概念（不求精通，后续会推导对比）

- 基础概率论：期望、KL 散度（KL Divergence）的定义

- 对大模型训练范式（Pre-training → SFT → RLHF）有整体认知

---

1. 为什么需要推理模型

1.1 「直觉型」LLM 的天花板

你是否有这样的体验：让 GPT-4 回答「法国的首都是哪里」，它瞬间答对；让它解一道 AIME（美国数学邀请赛）题目，它可能给出一个看似合理但完全错误的推理过程。

这不是模型不够大。这是预训练范式本身的局限。

传统 LLM 本质上是一个「直觉型」系统——它通过海量文本的自回归训练，学会了一个极其强大的条件概率分布 P(token | context)。这个分布擅长模式匹配、知识检索、风格模仿，但不擅长多步推理。

我们来看一组数据：

模型AIME 2024 得分训练方式------------------------------GPT-4o~13.4%预训练 + SFTo1-preview~56.7%推理模型训练o1~83.3%推理模型训练（增强）DeepSeek-R1~79.8%GRPO + RLVR

AIME 这类竞赛数学题的门槛极高：GPT-4o 的正确率只有 13%，而专为推理优化的 o1 一跃到了 83%。这不是模型变大了——是训练范式变了。

1.2 三种 Scaling Law

我们已经很熟悉 OpenAI 2020 年提出的 Scaling Law：模型参数 N、训练数据 D、训练计算量 C 之间存在幂律关系，更大的模型 = 更低的 Loss。

但到了 2025 年，社区逐渐形成共识：Scaling Law 有三个维度，我们只充分挖掘了第一个。

加载图表...

graph TD

subgraph Pre-training Scaling

A[更多数据 D] --> B[更大模型 N]

B --> C[更低预训练 Loss]

end

subgraph Post-training Scaling

D[SFT 数据质量] --> E[RLHF/RLVR]

E --> F[对齐与推理能力]

end

subgraph Test-Time Compute Scaling

G[Chain-of-Thought] --> H[Self-Consistency]

H --> I[Best-of-N / Beam Search]

I --> J[更高质量输出]

end

C --> F

F --> J

python

import torch
import torch.nn.functional as F

def grpo_loss(policy_model, ref_model, questions, G=16, epsilon=0.2, beta=0.04):
    """
    简化的 GRPO 损失计算（一个 batch 的版本）
    """
    total_loss = 0.0

    for q in questions:
        # Step 1: 对同一个问题采样 G 个回答
        responses = []
        log_probs_old = []
        for _ in range(G):
            with torch.no_grad():
                output = policy_model.generate(q)
                responses.append(output.tokens)
                log_probs_old.append(policy_model.log_prob(output.tokens))

        # Step 2: 计算每个回答的奖励（数学题 = 答案是否正确）
        rewards = torch.tensor([verify_answer(q, r) for r in responses])

        # Step 3: 组内标准化得到 Advantage
        mean_r = rewards.mean()
        std_r = rewards.std() + 1e-8
        advantages = (rewards - mean_r) / std_r

        # Step 4: 计算 GRPO 损失
        for i in range(G):
            new_log_probs = policy_model.log_prob(responses[i])

            # importance sampling ratio
            ratio = torch.exp(new_log_probs - log_probs_old[i])
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

            # PPO-style clipped loss, 所有 token 共享同一个 advantage
            surr1 = ratio * advantages[i]
            surr2 = clipped_ratio * advantages[i]
            policy_loss = -torch.min(surr1, surr2).mean()

            # KL 散度正则：用近似 D_KL(π_θ || π_ref)
            with torch.no_grad():
                ref_log_probs = ref_model.log_prob(responses[i])
            kl_div = (new_log_probs - ref_log_probs).mean()

            total_loss += policy_loss + beta * kl_div

    return total_loss / len(questions)

Pre-training Scaling：你已经很熟悉了。更大的模型，更多的数据。这条路还有空间，但边际收益在递减——每提升一个点，成本呈指数增长。

Post-training Scaling：用强化学习在高质量信号上继续训练。GRPO 和 RLVR 就是这个维度的核心武器。

Test-Time Compute Scaling：不改变模型权重，而是在推理时投入更多计算——让模型「多想一会儿」。Chain-of-Thought（CoT）、Self-Consistency、Best-of-N 都属于这一类。

💡 关键要点：推理模型 = 预训练基座 + RL 后训练（让模型学会「思考」）+ Test-Time Compute（让模型能「多想」）。三者缺一不可。

✨ 一句话记住：预训练给你知识，RL 后训练给你推理能力，Test-Time Compute 给你思考时间。

---

2. GRPO 算法深度推导

2.1 PPO 为什么「贵」

在理解 GRPO 之前，我们需要先理解 PPO 在 RLHF 中的角色及其瓶颈。

RLHF 的标准流程中，PPO 需要四个模型同时驻留在显存中：

Policy Model（待优化的策略，即 LLM 本身）

Reference Model（冻结的初始策略，用于 KL 惩罚）

Reward Model（打分模型，来自人类偏好标注）

Critic Model / Value Network（估计状态价值 V(s)，用于计算 Advantage）

其中 Critic Model 是一个「额外的负担」：它通常和 Policy Model 一样大，参数规模相当。这意味着 PPO 训练时，显存消耗接近 Policy Model 的 4 倍。对于 70B 级别的模型，这几乎不可行。

GRPO 的核心洞察一句话就能说清楚：我们不需要训练一个 Critic 去估计每个 Token 的「绝对价值」，只需要在同一个问题的多个回答之间做「相对比较」就够了。

2.2 GRPO 的数学框架

GRPO（Group Relative Policy Optimization）由 DeepSeek 团队在 DeepSeekMath 和后续的 DeepSeek-R1 中提出并验证。其核心流程如下：

Step 1：对每个问题采样一组回答

对于训练集中的每个问题 q，用当前 Policy π_θ 采样 G 个回答 {o₁, o₂, ..., o_G}（通常 G=4~16）。

Step 2：为每个回答计算奖励

每个回答 oᵢ 获得一个奖励 rᵢ，来自可验证的规则函数（数学题对错、代码是否通过测试等）。

Step 3：组内标准化计算 Advantage

这是 GRPO 最核心的一步。对同一问题 q 的 G 个回答，计算它们的组内均值和标准差：

\hat{A}_{i} = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}

这个做法的直觉非常漂亮：如果一个问题很简单，所有回答都对了（奖励全是 1），那么没有任何回答获得正 Advantage——模型不需要在这个问题上继续「用力」。反之，如果一个问题很难，大部分回答错了但有一个对了，那个正确的回答会获得极高的 Advantage。

🛠️ 实战经验：G 的选择是一个关键超参。G 太小（如 2）会导致组内方差估计不稳定；G 太大（如 32）会增加采样成本。DeepSeek-R1 在实践中使用 G=16，这是一个很好的起点。

Step 4：计算 GRPO 损失函数

GRPO 的损失函数由两部分组成：

\mathcal{L}_{\text{GRPO}} = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left( \rho_t(\theta) \hat{A}_i, \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_i \right) + \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})

其中：

ρ_t(θ) = π_θ(a_t | s_t) / π_old(a_t | s_t)，即新旧策略的概率比（importance sampling ratio）

clip(·, 1-ε, 1+ε) 是 PPO 式的裁剪，防止策略更新幅度过大（通常 ε=0.2）

同一个回答中所有 Token 共享同一个 Advantage Âᵢ——因为奖励是整个回答级别的

β·D_KL 是 KL 散度正则项，防止策略偏离 Reference Model 太远（通常 β=0.04）

2.3 GRPO 为什么不需要 Critic

这里有一个容易被忽略的深层洞察。

在 PPO 中，Advantage 的计算方式通常是 GAE（Generalized Advantage Estimation）：

A_t^{\text{GAE}} = \delta_t + \gamma \lambda \delta_{t+1} + \dots

其中 δ_t = r_t + γ V(s_{t+1}) - V(s_t)。GAE 依赖 Value Network V(s) 的自举（bootstrapping），这引入了估计偏差。

GRPO 的 Advantage 直接来自组内比较的奖励标准化——没有自举，没有 Value Network，没有估计偏差。这是一种直接从奖励信号中读取优劣势的方法，而不是通过一个中间模型去「猜测」优劣。

💡 关键要点：GRPO 不是 PPO 的简化版，而是一个根本性的范式转变——从「学习一个价值函数来评估动作好坏」变成「让模型从自己的多样性采样中学习」。

2.4 Python 伪代码

python

def math_reward(ground_truth_answer: str, model_output: str) -> float:
    """数学题奖励：提取最终答案并精确匹配"""
    predicted = extract_final_answer(model_output)  # 从 CoT 中提取最后一个数字/表达式
    if normalize(predicted) == normalize(ground_truth_answer):
        return 1.0
    return 0.0

import torch

import torch.nn.functional as F

def grpo_loss(policy_model, ref_model, questions, G=16, epsilon=0.2, beta=0.04):

"""

简化的 GRPO 损失计算（一个 batch 的版本）

"""

total_loss = 0.0

for q in questions:

# Step 1: 对同一个问题采样 G 个回答

responses = []

log_probs_old = []

for _ in range(G):

with torch.no_grad():

output = policy_model.generate(q)

responses.append(output.tokens)

log_probs_old.append(policy_model.log_prob(output.tokens))

# Step 2: 计算每个回答的奖励（数学题 = 答案是否正确）

rewards = torch.tensor([verify_answer(q, r) for r in responses])

# Step 3: 组内标准化得到 Advantage

mean_r = rewards.mean()

std_r = rewards.std() + 1e-8

advantages = (rewards - mean_r) / std_r

# Step 4: 计算 GRPO 损失

for i in range(G):

new_log_probs = policy_model.log_prob(responses[i])

# importance sampling ratio

ratio = torch.exp(new_log_probs - log_probs_old[i])

clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

# PPO-style clipped loss, 所有 token 共享同一个 advantage

surr1 = ratio * advantages[i]

surr2 = clipped_ratio * advantages[i]

policy_loss = -torch.min(surr1, surr2).mean()

# KL 散度正则：用近似 D_KL(π_θ || π_ref)

with torch.no_grad():

ref_log_probs = ref_model.log_prob(responses[i])

kl_div = (new_log_probs - ref_log_probs).mean()

total_loss += policy_loss + beta * kl_div

return total_loss / len(questions)

python

def code_reward(test_cases: list, generated_code: str) -> float:
    """代码奖励：通过测试用例的比例"""
    passed = 0
    for test_input, expected_output in test_cases:
        try:
            actual_output = execute_code(generated_code, test_input)
            if actual_output == expected_output:
                passed += 1
        except Exception:
            pass  # 编译错误或运行时异常 = 0 分
    return passed / len(test_cases)

💡 关键要点：注意第 47-48 行——同一个回答的所有 Token 共享同一个 Advantage。这是 GRPO 与标准 PPO（每个 Token 有独立的 Advantage）的关键区别之一。

✨ 一句话记住：GRPO = 用「组内相对比较」替代「Critic 绝对估计」，训练成本砍掉近一半。

---

3. RLVR 的奖励设计

RLVR（Reinforcement Learning with Verifiable Rewards）并非一种算法，而是一种奖励设计哲学：奖励信号必须来自可验证的规则，而非人类偏好模型。

这也是为什么 DeepSeek-R1 的冷启动阶段完全不需要 Reward Model——数学题有标准答案，代码有测试用例，这些规则函数天然就是无偏的奖励信号。

3.1 数学回报（Math Reward）

数学题是最经典的 RLVR 场景。奖励函数极其简单：

python

def format_reward(output: str) -> float:
    """格式奖励：检查思考标签结构是否正确"""
    score = 0.0
    if " thinking" in output and " response" in output:
        score += 0.5  # 两个标签都存在
    if output.index(" thinking") < output.index(" response"):
        score += 0.25  # 思考在前，回答在后
    if not bool(re.search(r" thinking.* thinking", output)):
        score += 0.25  # 没有嵌套/重复标签
    return score

def math_reward(ground_truth_answer: str, model_output: str) -> float:

"""数学题奖励：提取最终答案并精确匹配"""

predicted = extract_final_answer(model_output) # 从 CoT 中提取最后一个数字/表达式

if normalize(predicted) == normalize(ground_truth_answer):

return 1.0

return 0.0

bash

pip install transformers trl datasets accelerate vllm

实际工程中还有几个细节：

答案提取：模型输出通常包含大量推理过程（CoT）。需要用正则或启发式方法提取 \boxed{...} 或「最终答案是：xxx」中的内容。

数值容差：对于浮点数答案，用 abs(pred - gt) < 1e-6 而非严格相等。

部分正确：有些实现会给「推理路径正确但计算失误」的回答 0.5 分，但这会引入主观性，打破「可验证」原则。我们认为应该保持 0/1 的硬奖励。

3.2 代码回报（Code Reward）

代码场景的奖励更直接——运行就对了：

python

import re

def gsm8k_reward(completions, ground_truth, **kwargs):
    """GSM8K 的 RLVR 奖励：从模型输出中提取最终数字并与标准答案比对"""
    rewards = []
    for completion, answer in zip(completions, ground_truth):
        # 提取最后一个数字
        numbers = re.findall(r'\d+\.?\d*', completion)
        predicted = numbers[-1] if numbers else ""
        # 精确匹配
        reward = 1.0 if predicted == str(answer).strip() else 0.0
        rewards.append(reward)
    return rewards

def code_reward(test_cases: list, generated_code: str) -> float:

"""代码奖励：通过测试用例的比例"""

passed = 0

for test_input, expected_output in test_cases:

try:

actual_output = execute_code(generated_code, test_input)

if actual_output == expected_output:

passed += 1

except Exception:

pass # 编译错误或运行时异常 = 0 分

return passed / len(test_cases)

python

from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. 加载模型和数据集
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("openai/gsm8k", "main")["train"].select(range(2000))

def formatting_func(example):
    return [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": " thinking\n"}
    ]

# 2. GRPO 配置
training_args = GRPOConfig(
    output_dir="./grpo-gsm8k",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    logging_steps=10,
    bf16=True,
    # GRPO 特有的参数
    num_generations=8,        # G = 8：每个问题采样 8 个回答
    max_prompt_length=512,
    max_completion_length=256,
    beta=0.04,                # KL 散度系数
    temperature=0.9,          # 采样温度
)

# 3. 训练
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[gsm8k_reward],
    processing_class=tokenizer,
)

trainer.train()

代码回报的好处是天然连续——不需要 0/1 二值化，通过率本身就是合理的奖励值。

3.3 格式回报（Format Reward）

格式回报是 RLVR 中最容易被忽略但实际最巧妙的设计。DeepSeek-R1 训练中，模型需要在 <think> 和 </think> 标签之间组织推理过程：

python

def best_of_n_inference(model, prompt, N=16):
    responses = []
    for _ in range(N):
        response = model.generate(prompt, temperature=0.7)
        responses.append(response)

    # 用 Reward 函数选最佳回答
    scores = [compute_reward(prompt, r) for r in responses]
    best_idx = scores.index(max(scores))
    return responses[best_idx]

def format_reward(output: str) -> float:

"""格式奖励：检查思考标签结构是否正确"""

score = 0.0

if " thinking" in output and " response" in output:

score += 0.5 # 两个标签都存在

if output.index(" thinking") < output.index(" response"):

score += 0.25 # 思考在前，回答在后

if not bool(re.search(r" thinking.* thinking", output)):

score += 0.25 # 没有嵌套/重复标签

return score

🛠️ 实战经验：格式奖励在训练初期至关重要。模型一开始并不会自觉地使用 thinking 标签——它需要格式奖励来「学会用标签」。一旦格式稳定后（通常几百步），你甚至可以将格式奖励逐渐退火到 0，让模型专注于内容质量。

3.4 组合奖励

实际训练中，总奖励是多个子奖励的加权和：

R_{\text{total}} = w_{\text{math}} \cdot R_{\text{math}} + w_{\text{format}} \cdot R_{\text{format}}

对于代码任务：\(R_{\text{total}} = R_{\text{code}} + w_{\text{format}} \cdot R_{\text{format}}\)

✨ 一句话记住：RLVR 的「V」（Verifiable）是核心——奖励函数必须是对错分明的规则，而不是人类偏好模型的「打分」。

---

4. 代码实践：GSM8K 上的 GRPO 训练

下面我们用 TRL（Transformer Reinforcement Learning）库的 GRPOTrainer 跑一个简化版实验。我们将在 GSM8K（小学数学应用题数据集）上对比 SFT 和 GRPO 的效果。

4.1 环境准备

pip install transformers trl datasets accelerate vllm

4.2 奖励函数定义

import re

def gsm8k_reward(completions, ground_truth, **kwargs):

"""GSM8K 的 RLVR 奖励：从模型输出中提取最终数字并与标准答案比对"""

rewards = []

for completion, answer in zip(completions, ground_truth):

# 提取最后一个数字

numbers = re.findall(r'\d+\.?\d*', completion)

predicted = numbers[-1] if numbers else ""

# 精确匹配

reward = 1.0 if predicted == str(answer).strip() else 0.0

rewards.append(reward)

return rewards

4.3 GRPO 训练主脚本

from trl import GRPOTrainer, GRPOConfig

from datasets import load_dataset

from transformers import AutoModelForCausalLM, AutoTokenizer

import torch

1. 加载模型和数据集

model_name = "Qwen/Qwen2.5-1.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(

model_name, torch_dtype=torch.bfloat16, device_map="auto"

)

tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("openai/gsm8k", "main")["train"].select(range(2000))

def formatting_func(example):

return [

{"role": "user", "content": example["question"]},

{"role": "assistant", "content": " thinking\n"}

]

2. GRPO 配置

training_args = GRPOConfig(

output_dir="./grpo-gsm8k",

num_train_epochs=1,

per_device_train_batch_size=2,

gradient_accumulation_steps=4,

learning_rate=5e-6,

logging_steps=10,

bf16=True,

# GRPO 特有的参数

num_generations=8, # G = 8：每个问题采样 8 个回答

max_prompt_length=512,

max_completion_length=256,

beta=0.04, # KL 散度系数

temperature=0.9, # 采样温度

)

3. 训练

trainer = GRPOTrainer(

model=model,

args=training_args,

train_dataset=dataset,

reward_funcs=[gsm8k_reward],

processing_class=tokenizer,

)

trainer.train()

4.4 预期结果

指标SFT 基线GRPO 训练后---------------------------GSM8K 准确率~45%~55-60%响应长度（平均 token）~80~180使用 thinking 标签比例N/A>90%

注意，即便只用 2000 条数据和 1.5B 的小模型，GRPO 也能带来约 10-15 个百分点的提升。响应的平均长度翻倍——因为模型学会了在 thinking 里写推理过程。

🛠️ 实战经验：DRPO 训练时推荐使用 vLLM 作为生成后端（TRL 已集成），可将采样速度提升 5-10 倍。参数 num_generations 不要设得太大——G=8 和 G=16 的效果差异不大，但耗时几乎翻倍。

✨ 一句话记住：GRPO 不是「炼丹」——它是一套严谨的算法。你需要的只是可验证的奖励信号和足够的 GPU 显存放 Policy + Reference 两个模型。

---

常见误区

误区一：「GRPO 就是简化版的 PPO，没什么本质区别」

这是最常见也最危险的误解。GRPO 不是「砍掉了 Critic 的 PPO」——它从根本上改变了 Advantage 的估计方式。PPO 的 Advantage 来自 Value Network 的自举（有偏估计），GRPO 的 Advantage 来自组内奖励的相对比较（无偏，直接来自奖励信号）。两者的 Optimism Bias 和方差特性完全不同。

误区二：「RLVR 只能用于有标准答案的任务」

表面上看确实如此，但实际上格式奖励和规则奖励极大地拓展了 RLVR 的适用范围。任何可以写为 verifier(output) -> float 的函数都可以驱动 RLVR 训练——代码风格检查、安全合规检查、长度控制、语言一致性检查等。DeepSeek-R1 在处理创意写作时也用到了语言一致性奖励。

误区三：「推理模型一定比普通模型大很多」

DeepSeek-R1-Distill-7B 在多项推理基准上的表现超过了 GPT-4o（一个大约 200B 的 Dense 模型）。这说明推理能力在很大程度上是训练范式的结果，而非纯粹参数量的产物。蒸馏（Knowledge Distillation）可以将一个 Large Reasoning Model 的「思维模式」迁移到小模型上，效果惊人。

---

练习与思考

练习 1：Advantage 极端情况

当同一个问题的 G=4 个回答全部正确（奖励全为 1）时，GRPO 的 Advantage 是多少？这对训练有什么影响？

当 G=4 且所有奖励都为 1 时：

mean(r) = 1.0

std(r) = 0.0（加上 ε=1e-8 后接近 0）

Advantage = (1 - 1) / ε ≈ 0

所有 Advantage 接近 0，意味着这些回答对 Loss 几乎没有贡献——模型不会因「简单题全对」而被过度奖励。这正是 GRPO 的设计意图：简单题不需要进一步优化，把学习信号留给困难题。

</details>

练习 2：Reward Hacking 风险

如果格式奖励权重 w_format 设得过大（比如 10.0），而数学奖励只有 1.0。模型可能会产生什么行为？如何防范？

模型会发现「写出正确的 thinking response 格式」远比「算出正确答案」回报高。它可能退化为只输出格式正确的废话——比如「 thinking 让我思考一下... response 答案是 42」（无论题目是什么）。

防范方法：

格式奖励的权重不应超过任务奖励的 1/5；

训练后期逐步退火格式奖励权重；

监控格式正确率和任务正确率的变化趋势——如果格式率 > 99% 但任务率下降，说明格式奖励过重。

</details>

练习 3：Test-Time Compute 的工程实现

你有一个已训练好的推理模型，想在推理时用 Best-of-N 提升质量（N=16）。写出伪代码描述这个过程，并说明它的计算代价相比单次推理增加了多少。

def best_of_n_inference(model, prompt, N=16):

responses = []

for _ in range(N):

response = model.generate(prompt, temperature=0.7)

responses.append(response)

# 用 Reward 函数选最佳回答

scores = [compute_reward(prompt, r) for r in responses]

best_idx = scores.index(max(scores))

return responses[best_idx]

计算代价：N 次前向传播，延迟约为单次推理的 N 倍（如果顺序执行）。如果批量并行，延迟接近单次但吞吐量降低 N 倍。相比单次推理，Test-Time Compute 消耗约 16 倍的 FLOPs——但换来了显著的质量提升。这是一个典型的「用计算换质量」的 tradeoff。

</details>

---

延伸阅读

DeepSeek-R1 技术报告：GRPO 和 RLVR 的原始出处，必读。详细描述了冷启动数据构造、RL 训练阶段划分和蒸馏实验。

DeepSeekMath (2024)：GRPO 算法的首次公开提出，附录中有完整的数学推导。

TRL GRPOTrainer 文档：HuggingFace TRL 库的 GRPO 实现，包含大量实用配置示例。

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Google DeepMind, 2024)：系统研究了 Test-Time Compute Scaling 的最优策略，证明在某些场景下增加推理时间计算比增加模型参数更高效。

本文总结

本文从「为什么需要推理模型」出发，系统地介绍了推理模型训练的三大支柱：GRPO 算法、RLVR 奖励设计和 Test-Time Compute Scaling。核心要点回顾：

推理模型不是更大的模型，而是经过 RL 后训练、学会了「思考」的模型。

GRPO 的核心是组内相对比较，淘汰了 Critic 模型，将训练成本减少近半。

RLVR 用可验证规则替代人类偏好打分，让奖励信号无偏、可复现、可规模化。

Test-Time Compute 是推理时的事后增强——不需要重新训练，只是「让模型多想一会儿」。

✨ 一句话记住：推理模型 = 预训练基座 + GRPO × RLVR + Test-Time Compute。三个维度，三倍力量。

---

👉 下一篇预告：第 5 篇——从 RAG 到 Agent：LLM 的行动力进化

你将了解：检索增强生成（RAG）如何弥补 LLM 的知识时效性问题、Agent 框架如何让 LLM 从「对话者」变成「行动者」，以及 Tool Use、Planning、Memory 这三大 Agent 核心能力的实现原理。

推理模型训练范式：RLVR、GRPO与Test-Time Compute Scaling

推理模型训练范式：RLVR、GRPO与 Test-Time Compute Scaling

学习目标

前置唤醒

1. 为什么需要推理模型

1.1 「直觉型」LLM 的天花板

1.2 三种 Scaling Law

2. GRPO 算法深度推导

2.1 PPO 为什么「贵」

2.2 GRPO 的数学框架

2.3 GRPO 为什么不需要 Critic

2.4 Python 伪代码

3. RLVR 的奖励设计

3.1 数学回报（Math Reward）

3.2 代码回报（Code Reward）

3.3 格式回报（Format Reward）

3.4 组合奖励

4. 代码实践：GSM8K 上的 GRPO 训练

4.1 环境准备

4.2 奖励函数定义

4.3 GRPO 训练主脚本

1. 加载模型和数据集

2. GRPO 配置

3. 训练

4.4 预期结果

常见误区

误区一：「GRPO 就是简化版的 PPO，没什么本质区别」

误区二：「RLVR 只能用于有标准答案的任务」

误区三：「推理模型一定比普通模型大很多」

练习与思考

练习 1：Advantage 极端情况

练习 2：Reward Hacking 风险

练习 3：Test-Time Compute 的工程实现

延伸阅读

本文总结

相关文章

探索更多内容

评论 (0)