[Paper Digest] 2025, Week 40 (Sep 28 - Oct 04) (Robotics / Embodied AI / LLM)
Contents
- The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
- LongLive: Real-time Interactive Long Video Generation
- MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
- Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
- DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
- EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
- SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
- Quantile Advantage Estimation for Entropy-Safe Reasoning
- LongCodeZip: Compress Long Context for Code Language Models
- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- GEM: A Gym for Agentic LLMs
- ExGRPO: Learning to Reason from Experience
- More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
- SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
- Language Models Can Learn from Verbal Feedback Without Scalar Rewards
- Variational Reasoning for Language Models
- Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
- VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
- StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- ReviewScore: Misinformed Peer Review Detection with Large Language Models
- Multiplayer Nash Preference Optimization
- StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
- TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
- Title: The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
- Authors: Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, Michał Bartoszkiewicz
- Date: 2025-09-30
- ArXiv page: https://arxiv.org/abs/2509.26507
- GitHub repo: https://github.com/takzen/vision-bdh
Abstract
The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
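The working-memory claim above (synaptic plasticity with Hebbian learning over sparse, positive activations) can be pictured with a toy fast-weight update. The sketch below is an illustration under stated assumptions, not the BDH architecture itself: the network size, the ReLU sparsification, and the outer-product update rule are all placeholders.

```python
# Toy illustration (not BDH code): a Hebbian "fast weight" matrix acting as working
# memory. Synapses between co-active neurons strengthen as inputs stream in, and
# activations are kept sparse and non-negative, echoing the description above.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 64
W_in = rng.normal(scale=0.1, size=(n_neurons, n_neurons))  # fixed "slow" weights
S = np.zeros((n_neurons, n_neurons))                       # plastic synaptic state
eta = 0.1                                                  # Hebbian learning rate

def step(x, S):
    """Process one input; return sparse positive activations and updated synapses."""
    a = np.maximum(W_in @ x + S @ x, 0.0)   # sparse, positive activations (ReLU)
    S = S + eta * np.outer(a, x)            # Hebbian rule: co-activity strengthens synapses
    return a, S

for t in range(10):                         # a short stream of token-like inputs
    x = np.maximum(rng.normal(size=n_neurons), 0.0)
    a, S = step(x, S)

print("mean synaptic strength after 10 steps:", S.mean())
```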
LongLive: Real-time Interactive Long Video Generation
- Title: LongLive: Real-time Interactive Long Video Generation
- Authors: Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22622
- Paper link: https://arxiv.org/pdf/2509.22622
- Project page: https://nvlabs.github.io/LongLive
- GitHub repo: https://github.com/NVlabs/LongLive
Abstract
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
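The "short window attention paired with a frame-level attention sink" can be pictured as a mask in which every frame attends to a few global sink frames plus a short causal window. The sketch below is only a guess at that pattern for illustration; the window size, sink size, and frame-level granularity are assumptions, not LongLive's actual implementation.

```python
# Hedged sketch of a frame-sink attention mask: causal attention restricted to a
# short local window, plus a "sink" of the first few frames visible to everyone.
import numpy as np

def frame_sink_mask(seq_len, window=8, sink=2):
    """Boolean mask[i, j] == True where frame i may attend to frame j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, :min(sink, i + 1)] = True             # always keep the sink frames
        mask[i, max(0, i - window + 1):i + 1] = True  # short causal window
    return mask

m = frame_sink_mask(seq_len=16, window=4, sink=2)
print(m.astype(int))
```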
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
- Title: MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
- Authors: Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
- Date: 2025-09-28
- ArXiv page: https://arxiv.org/abs/2509.24002
- Paper link: https://arxiv.org/pdf/2509.24002
- Project page: https://mcpmark.ai/
- GitHub repo: https://github.com/eval-sys/mcpmark
Abstract
MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
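For readers unfamiliar with the pass@1 / pass^k notation in the results above, a minimal reading is: pass@1 averages single-attempt success over tasks, while pass^k requires all k independent attempts on a task to succeed. The snippet below encodes that reading; MCPMark's official scoring script may differ in details.

```python
# Minimal sketch of the two metrics, under the usual reading described above.
def pass_at_1(results):
    """results: list of per-task lists of booleans (one entry per attempt)."""
    return sum(sum(r) / len(r) for r in results) / len(results)

def pass_pow_k(results, k=4):
    """A task counts only if all of its first k attempts succeed."""
    return sum(all(r[:k]) for r in results) / len(results)

runs = [[True, True, False, True], [False, False, False, False], [True, True, True, True]]
print(pass_at_1(runs), pass_pow_k(runs, 4))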
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
- Title: Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
- Authors: Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao
- Date: 2025-09-29
- ArXiv page: https://arxiv.org/abs/2509.25541
Abstract
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in “Who Is the Spy”-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model’s reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code has been released at https://github.com/wangqinsi1/Vision-Zero.
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
- Title: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
- Authors: Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
- Date: 2025-09-29
- ArXiv page: https://arxiv.org/abs/2509.25454
- Paper link: https://arxiv.org/pdf/2509.25454
- Project page: https://github.com/smiles724/DeepSearch
- GitHub repo: https://github.com/smiles724/DeepSearch
Abstract
Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
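As a reference point for the tree search that DeepSearch embeds in training, the sketch below shows plain UCT-style node selection; DeepSearch's global frontier selection and entropy-based guidance are additional criteria layered on top of a tree like this, and the node fields and exploration constant here are assumptions.

```python
# Illustrative UCT selection for Monte Carlo Tree Search (generic, not the paper's code).
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

def uct_score(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")                 # always try unvisited children first
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def select(node):
    """Walk down the tree, always picking the child with the best UCT score."""
    while node.children:
        node = max(node.children, key=lambda ch: uct_score(node, ch))
    return node

root = Node(visits=10, children=[Node(visits=3, value_sum=2.0), Node(visits=1, value_sum=1.0), Node()])
leaf = select(root)
print(leaf.visits, leaf.value_sum)          # the unvisited child wins the selection
```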
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
- Title: MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
- Authors: Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22186
- Paper link: https://arxiv.org/pdf/2509.22186
- Project page: https://opendatalab.github.io/MinerU/
- GitHub repo: https://github.com/opendatalab/MinerU
Abstract
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
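The coarse-to-fine data flow can be summarized as: run layout analysis on a downsampled page, then map the predicted boxes back to native resolution and recognize each crop. The sketch below shows only that plumbing with Pillow; analyze_layout and recognize are hypothetical placeholders for the two model stages.

```python
# Schematic of the two-stage strategy: downsample for layout, crop at native
# resolution for recognition. The two model calls are stand-in placeholders.
from PIL import Image

def analyze_layout(small_image):
    """Placeholder for stage 1: boxes in the downsampled image's coordinates."""
    return [(10, 10, 60, 40)]                   # hypothetical (left, top, right, bottom)

def recognize(crop):
    """Placeholder for stage 2: content recognition on a native-resolution crop."""
    return f"text in a {crop.size[0]}x{crop.size[1]} region"

page = Image.new("RGB", (2480, 3508), "white")  # a fake A4 page at roughly 300 DPI
scale = 4
small = page.resize((page.width // scale, page.height // scale))

results = []
for (l, t, r, b) in analyze_layout(small):      # boxes found on the downsampled page
    box = (l * scale, t * scale, r * scale, b * scale)  # map back to native resolution
    results.append(recognize(page.crop(box)))
print(results)
```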
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
- Title: EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
- Authors: Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can, Jin Mingyu, Mei Kai, Wan Kun, Metaxas Dimitris
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22576
- GitHub repo: https://github.com/WujiangXu/EPO
Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
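Mechanism (2), the entropy smoothing regularizer, amounts to keeping the policy's entropy inside a band around its recent history. A minimal sketch of one such penalty follows; the window length, band width, and quadratic form are assumptions, not the paper's exact objective.

```python
# Hedged sketch of an entropy-smoothing term: penalize the current policy entropy
# when it drifts outside a band around its running historical average.
from collections import deque

class EntropySmoother:
    def __init__(self, window=100, band=0.2, coef=1.0):
        self.history = deque(maxlen=window)
        self.band, self.coef = band, coef

    def penalty(self, entropy):
        if not self.history:
            self.history.append(entropy)
            return 0.0
        mean = sum(self.history) / len(self.history)
        self.history.append(entropy)
        excess = max(0.0, abs(entropy - mean) - self.band)  # only penalize outside the band
        return self.coef * excess ** 2

reg = EntropySmoother()
for h in [1.0, 1.05, 0.4, 1.8]:
    print(round(reg.penalty(h), 3))
```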
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
- Title: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
- Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
- Date: 2025-09-28
- ArXiv page: https://arxiv.org/abs/2509.24006
- Paper link: https://arxiv.org/pdf/2509.24006
- Project page: https://github.com/thu-ml/SLA
- GitHub repo: https://github.com/thu-ml/SLA
Abstract
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
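The three-way split of attention weights can be illustrated by scoring query-key blocks and bucketing them into critical / marginal / negligible sets. The NumPy toy below shows only that classification step (not the fused GPU kernel, and not the linear-attention path); the block size and thresholds are arbitrary assumptions.

```python
# Toy illustration of classifying attention into critical / marginal / negligible
# blocks by magnitude; critical blocks would get exact attention, marginal ones a
# linear-attention path, and negligible ones would be skipped.
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 256, 64, 32                            # sequence length, head dim, block size
Q, K = rng.normal(size=(N, d)), rng.normal(size=(N, d))

scores = Q @ K.T / np.sqrt(d)
blocks = scores.reshape(N // B, B, N // B, B).mean(axis=(1, 3))  # mean score per block
flat = np.sort(np.abs(blocks).ravel())[::-1]
crit_thr, marg_thr = flat[int(0.1 * flat.size)], flat[int(0.5 * flat.size)]

critical = np.abs(blocks) >= crit_thr
marginal = (np.abs(blocks) >= marg_thr) & ~critical
negligible = ~(critical | marginal)
for name, m in [("critical", critical), ("marginal", marginal), ("negligible", negligible)]:
    print(f"{name}: {m.mean():.0%} of blocks")
```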
Quantile Advantage Estimation for Entropy-Safe Reasoning
- Title: Quantile Advantage Estimation for Entropy-Safe Reasoning
- Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22611
- GitHub repo: https://github.com/junkangwu/QAE
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between {entropy collapse} and {entropy explosion}. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose {Quantile Advantage Estimation} (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove {two-sided entropy safety}, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify {baseline design} – rather than token-level heuristics – as the primary mechanism for scaling RLVR.
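The core change is small: within each group of sampled responses, subtract a K-quantile of the rewards instead of the mean before using them as advantages. A minimal sketch, with any further normalization omitted:

```python
# Group-wise K-quantile baseline replacing the mean baseline of GRPO/DAPO.
import numpy as np

def quantile_advantages(rewards, k=0.5):
    """rewards: 1-D array of rewards for one group of sampled responses."""
    baseline = np.quantile(rewards, k)      # group-wise K-quantile baseline
    return rewards - baseline

group = np.array([0.0, 0.0, 0.0, 1.0])      # e.g. binary verifier rewards on a hard query
print(quantile_advantages(group, k=0.5))    # only the rare success keeps a positive advantage
```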
LongCodeZip: Compress Long Context for Code Language Models
- Title: LongCodeZip: Compress Long Context for Code Language Models
- Authors: Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu
- Date: 2025-10-01
- ArXiv page: https://arxiv.org/abs/2510.00446
- GitHub repo: https://github.com/YerbaPage/LongCodeZip
Abstract
Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
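The coarse-grained stage can be sketched as: split the context into function-level chunks, score each chunk's relevance to the instruction, and keep the best chunks under a budget. In the sketch below, pseudo_perplexity is a hypothetical stand-in for the conditional-perplexity scoring that a real system would obtain from a language model.

```python
# Schematic of coarse-grained, function-level context compression under a budget.
import re

def split_functions(code):
    """Split Python source into function-level chunks (very naive splitter)."""
    chunks, cur = [], []
    for line in code.splitlines():
        if line.startswith("def ") and cur:
            chunks.append("\n".join(cur)); cur = []
        cur.append(line)
    if cur:
        chunks.append("\n".join(cur))
    return chunks

def pseudo_perplexity(instruction, chunk):
    """Placeholder: lower means 'more relevant'. A real system would query an LLM."""
    overlap = len(set(re.findall(r"\w+", instruction)) & set(re.findall(r"\w+", chunk)))
    return 1.0 / (1 + overlap)

def compress(code, instruction, budget_chars=200):
    ranked = sorted(split_functions(code), key=lambda c: pseudo_perplexity(instruction, c))
    kept, used = [], 0
    for chunk in ranked:                        # greedily keep the most relevant chunks
        if used + len(chunk) <= budget_chars:
            kept.append(chunk); used += len(chunk)
    return "\n".join(kept)

code = "def add(a, b):\n    return a + b\ndef mul(a, b):\n    return a * b\n"
print(compress(code, "implement add", budget_chars=60))
```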
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- Title: Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
- Date: 2025-10-02
- ArXiv page: https://arxiv.org/abs/2510.02283
- Paper link: https://arxiv.org/pdf/2510.02283
- Project page: https://self-forcing-plus-plus.github.io/
- GitHub repo: https://github.com/justincui03/Self-Forcing-Plus-Plus
Abstract
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher’s capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
GEM: A Gym for Agentic LLMs
- Title: GEM: A Gym for Agentic LLMs
- Authors: Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin
- Date: 2025-10-01
- ArXiv page: https://arxiv.org/abs/2510.01051
- Paper link: https://arxiv.org/pdf/2510.01051
- Project page: https://axon-rl.github.io/
- GitHub repo: https://github.com/axon-rl/gem
Abstract
The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating using GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which – unlike GRPO – is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apple-to-apple benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.
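Since GEM is described as an OpenAI-Gym-style interface, the interaction loop presumably looks roughly like the reset/step sketch below. The class, method signatures, and reward shaping here are illustrative assumptions; consult the GEM repository for the actual API.

```python
# Hedged sketch of a Gym-style loop for a text environment (not GEM's real API).
import random

class ToyTextEnv:
    """Stand-in environment: the agent must output 'stop' within 5 turns."""
    def reset(self):
        self.turns = 0
        return "Say 'stop' to finish."          # initial observation (a prompt)

    def step(self, action):
        self.turns += 1
        done = action.strip().lower() == "stop" or self.turns >= 5
        # small per-turn penalty plus a terminal bonus gives a dense per-turn signal
        reward = 1.0 if action.strip().lower() == "stop" else -0.1
        return f"turn {self.turns}", reward, done, {}

def agent_policy(observation):
    return random.choice(["continue", "stop"])  # placeholder for an LLM call

env = ToyTextEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(agent_policy(obs))
    total += reward
print("episode return:", round(total, 2))
```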
ExGRPO: Learning to Reason from Experience
- Title: ExGRPO: Learning to Reason from Experience
- Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
- Date: 2025-10-02
- ArXiv page: https://arxiv.org/abs/2510.02245
- Paper link: https://arxiv.org/pdf/2510.02245
Abstract
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
- Title: More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
- Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
- Date: 2025-09-30
- ArXiv page: https://arxiv.org/abs/2509.25848
- Paper link: https://arxiv.org/pdf/2509.25848
- Project page: https://xytian1008.github.io/VAPO/
- GitHub repo: https://github.com/xytian1008/VAPO
Abstract
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model’s reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
- Title: SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
- Authors: Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22944
- Paper link: https://arxiv.org/pdf/2509.22944
- Project page: https://github.com/huawei-csl/SINQ
- GitHub repo: https://github.com/huawei-csl/SINQ
Abstract
Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.
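The dual-scaling idea can be pictured as alternately rescaling rows and columns of a weight matrix so that no single outlier dominates the shared quantization scales. The sketch below is a toy version under stated assumptions (standard deviation as the balancing statistic, a fixed iteration count, one uniform grid for the balanced matrix); it is not the released SINQ code.

```python
# Toy Sinkhorn-Knopp-style alternation of per-row and per-column scales,
# followed by naive uniform quantization of the balanced matrix.
import numpy as np

def dual_scale_quantize(W, n_iter=10, bits=4):
    r = np.ones((W.shape[0], 1))                 # per-row scales
    c = np.ones((1, W.shape[1]))                 # per-column scales
    for _ in range(n_iter):                      # alternate row / column balancing
        B = W / (r * c)
        r *= B.std(axis=1, keepdims=True) + 1e-8
        B = W / (r * c)
        c *= B.std(axis=0, keepdims=True) + 1e-8
    B = W / (r * c)
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(B).max() / qmax                # one uniform step for the balanced matrix
    Q = np.clip(np.round(B / step), -qmax - 1, qmax)
    W_hat = Q * step * r * c                     # dequantized reconstruction
    return Q, W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)); W[0, 0] = 25.0    # inject an outlier weight
_, W_hat = dual_scale_quantize(W)
print("reconstruction RMSE:", float(np.sqrt(((W - W_hat) ** 2).mean())))
```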
Language Models Can Learn from Verbal Feedback Without Scalar Rewards
- Title: Language Models Can Learn from Verbal Feedback Without Scalar Rewards
- Authors: Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22638
- GitHub repo: https://github.com/sail-sg/feedback-conditional-policy
Abstract
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
Variational Reasoning for Language Models
- Title: Variational Reasoning for Language Models
- Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22637
Abstract
We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
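As a reference, the single-trace bound the framework starts from can be written with the thinking trace z treated as a latent variable (the paper's multi-trace objective and forward-KL formulation refine this):

```latex
% Standard evidence lower bound with the thinking trace z as a latent variable.
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \right)
```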
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
- Title: Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
- Authors: Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang
- Date: 2025-09-28
- ArXiv page: https://arxiv.org/abs/2509.23873
- Paper link: https://arxiv.org/pdf/2509.23873
- Project page: https://gszfwsb.github.io/Q-tuning/
Abstract
As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies–high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
- Title: VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
- Authors: Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su
- Date: 2025-10-01
- ArXiv page: https://arxiv.org/abs/2510.00406
- Paper link: https://arxiv.org/pdf/2510.00406
- Project page: https://vla-rft.github.io/
- GitHub repo: https://github.com/OpenHelix-Team/VLA-RFT
Abstract
Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- Title: StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
- Date: 2025-09-26
- ArXiv page: https://arxiv.org/abs/2509.22220
- Paper link: https://arxiv.org/pdf/2509.22220
Abstract
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
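The consensus mechanism can be illustrated with a bit-wise majority vote across parallel branches. The sketch below assumes integer token codes, a fixed bit width, and simple majority counting; the paper's actual voting scheme may differ.

```python
# Minimal sketch of a bit-wise majority vote over codes from parallel branches.
import numpy as np

def bitwise_vote(branch_codes, bits=8):
    """branch_codes: (n_branches, n_frames) integer codes from parallel branches."""
    codes = np.asarray(branch_codes)
    shifts = np.arange(bits)
    branch_bits = (codes[..., None] >> shifts) & 1                       # (branches, frames, bits)
    voted = (branch_bits.sum(axis=0) * 2 > codes.shape[0]).astype(int)   # majority per bit
    return (voted << shifts).sum(axis=-1)                                # reassemble integer tokens

branches = [[0b1011, 0b0001], [0b1011, 0b0011], [0b1010, 0b0011]]        # branches disagree on single bits
print(bitwise_vote(branches))                                            # -> [11  3]
```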
ReviewScore: Misinformed Peer Review Detection with Large Language Models
- Title: ReviewScore: Misinformed Peer Review Detection with Large Language Models
- Authors: Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
- Date: 2025-09-25
- ArXiv page: https://arxiv.org/abs/2509.21679
- Paper link: https://arxiv.org/pdf/2509.21679
Abstract
Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either “weaknesses” in a review that contain incorrect premises, or “questions” in a review that can be already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports a potential of fully automated ReviewScore evaluation.
Multiplayer Nash Preference Optimization
- Title: Multiplayer Nash Preference Optimization
- Authors: Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
- Date: 2025-09-27
- ArXiv page: https://arxiv.org/abs/2509.23102
- GitHub repo: https://github.com/smiles724/MNPO
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
- Title: StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
- Authors: Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu
- Date: 2025-10-02
- ArXiv page: https://arxiv.org/abs/2510.02314
- Paper link: https://arxiv.org/pdf/2510.02314
- Project page: https://hentci.github.io/stealthattack/
- GitHub repo: https://github.com/Hentci/StealthAttack_official
Abstract
3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method’s superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/
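The density-guided step rests on estimating where a point cloud is sparse. The snippet below shows only that statistic, ranking candidate locations by a kernel density estimate over 3D points; it does not implement the poisoning attack itself, and the data here is synthetic.

```python
# Rank candidate 3-D locations by estimated point-cloud density (illustration only).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
points = rng.normal(size=(3, 2000))              # stand-in for 3DGS point centers (3 x N)
kde = gaussian_kde(points)                       # fit a kernel density estimate

candidates = rng.uniform(-3, 3, size=(3, 500))   # candidate insertion locations
density = kde(candidates)
lowest = candidates[:, np.argsort(density)[:10]] # the ten lowest-density candidates
print(lowest.shape, float(density.min()))
```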
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- Title: TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- Authors: Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong
- Date: 2025-09-30
- ArXiv page: https://arxiv.org/abs/2509.25760
- Paper link: https://arxiv.org/pdf/2509.25760
Abstract
While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy – models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. In-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
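The ternary reward can be sketched as: positive for a correct answer, negative for an incorrect (hallucinated) one, and neutral for an abstention. The numeric values and the abstention check below are assumptions for illustration, not the exact TruthRL reward.

```python
# Hedged sketch of a ternary reward: correct / hallucination / abstention.
def ternary_reward(answer, gold, abstain_phrases=("i don't know", "unsure")):
    if answer.strip().lower() in abstain_phrases:
        return 0.0        # abstention: neither rewarded nor punished
    # any non-abstaining wrong answer is treated as a hallucination and penalized
    return 1.0 if answer.strip().lower() == gold.strip().lower() else -1.0

for a in ["Paris", "Lyon", "I don't know"]:
    print(a, "->", ternary_reward(a, "Paris"))
```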
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
- Title: OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
- Authors: Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang
- Date: 2025-09-29
- ArXiv page: https://arxiv.org/abs/2509.24900
- Paper link: https://arxiv.org/pdf/2509.24900
Abstract
The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.