GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

Paper overview

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this…

Canonical key

arxiv:2604.20659

Merged sources

arXiv

Authors

Jingyi Wang, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, Chaofan Tao, Haoli Bai, Lu Hou, Lifeng Shang, Xiao-Ping Zhang

Categories

cs.LG, cs.AI

Tags

Evaluation / Methods

Topics

Benchmark / Language Model

First seen

2026-04-23 11:42:13 (UTC+08:00)

Sources and external links

arXiv PDF

Historical hits

LLM

2026-04-23

2026-04-23 11:42:13 (Asia/Shanghai)

Score 87 · title matched "reasoning"; summary matched "benchmark"; has PDF

Related recommendations

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models