[Paper Digest] Week 16 of 2025 (Apr 13-19) (Robotics / Embodied AI / LLM)
Note: Chinese passages in this digest were machine-translated with googletrans; wherever the translation is inaccurate, the English text is authoritative.
Contents
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
- prima.cpp: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
- Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
- BitNet b1.58 2B4T Technical Report
- Seedream 3.0 Technical Report
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- Antidistillation Sampling
- Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
- Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
- Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
- ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
- GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
- MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft
- How Instruction and Reasoning Data Shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
- FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
- Perception Encoder: The best visual embeddings are not at the output of the network
- Iterative Self-Training for Code Generation via Reinforced Re-Ranking
- WORLDMEM: Long-term Consistent World Simulation with Memory
- Heimdall: Test-Time Scaling of Generative Verification
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
- 70% Size, 100% Accuracy: Lossless LLM Compression via Dynamic-Length Float
- SQL-R1: Training Natural Language to SQL Reasoning Model by Reinforcement Learning
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
- Title: InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
- Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
- Date: 2025-04-14
- ArXiv page: https://arxiv.org/abs/2504.10479
- Paper link: https://arxiv.org/pdf/2504.10479
- Project page: https://internvl.github.io/blog/2025-04-11-InternVL-3.0/
English abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
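The abstract mentions variable visual position encoding (V2PE) for extended multimodal contexts. A minimal sketch of the core idea follows: visual tokens advance the position index by a fractional stride instead of 1. The function name, interface, and the 0.25 default stride are illustrative assumptions, not the paper's implementation.

```python
def v2pe_positions(token_types, delta=0.25):
    """Assign position indices: text tokens advance the index by 1, visual
    tokens by a fractional stride delta, so long multimodal contexts occupy
    a compressed positional range (the idea behind V2PE)."""
    pos, positions = 0.0, []
    for kind in token_types:
        positions.append(pos)
        pos += 1.0 if kind == "text" else delta
    return positions
```

With delta=0.25, four image tokens consume only as much positional range as a single text token, which is how a fixed position budget can cover much longer interleaved sequences.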
prima.cpp: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
- Title: PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
- Authors: Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
- Date: 2025-04-07
- ArXiv page: https://arxiv.org/abs/2504.08791
- Paper link: https://arxiv.org/pdf/2504.08791
- Project page: https://github.com/Lizonghang/prima.cpp
- GitHub repo: https://github.com/fengwenjiao/Prima.cpp
English abstract
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device’s CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
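prima.cpp itself is a C/C++ system; as a stdlib-only illustration of the mmap idea in the abstract (weights are paged in from disk on first access instead of being resident in RAM), here is a toy sketch that assumes a flat little-endian float32 file layout, which is an illustrative simplification of real weight formats like GGUF:

```python
import mmap
import struct
import tempfile

def open_weights(path):
    """Memory-map a weight file: the OS pages data in from disk on first
    access, so a model larger than RAM can still be served."""
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def read_f32(mm, index):
    """Read one little-endian float32 weight without loading the whole file."""
    return struct.unpack_from("<f", mm, index * 4)[0]

# Tiny demo: write three float32 weights, then read one back through the map.
_path = tempfile.mkstemp()[1]
with open(_path, "wb") as _f:
    _f.write(struct.pack("<3f", 1.5, -2.0, 3.25))
weights = open_weights(_path)
```

The point of the design is that untouched layers cost no physical memory, which is what lets the paper keep memory pressure low while still addressing a 70B-scale weight file.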
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
- Title: Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
- Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang
- Date: 2025-04-11
- ArXiv page: https://arxiv.org/abs/2504.08685
- Paper link: https://arxiv.org/pdf/2504.08685
- Project page: https://seaweed.video/
English abstract
This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpasses, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continue training. See the project page at https://seaweed.video/
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
- Title: CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
- Authors: Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
- Date: 2025-04-17
- ArXiv page: https://arxiv.org/abs/2504.13161
- Paper link: https://arxiv.org/pdf/2504.13161
English abstract
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: this https URL
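The search loop described in the abstract (propose a mixture over data clusters, score it cheaply, keep the best) can be sketched as follows. The linear `proxy` function is a stand-in for CLIMB's small proxy model and trained predictor, and the random proposal strategy is an illustrative simplification of its iterative search:

```python
import random

def search_mixture(cluster_scores, iters=200, seed=0):
    """Randomly propose mixture weights over data clusters and keep the one
    a (mock) proxy scores best. `cluster_scores` is a per-cluster quality
    estimate; the proxy pretends downstream accuracy is their weighted sum."""
    rng = random.Random(seed)
    k = len(cluster_scores)

    def proxy(w):
        return sum(wi * si for wi, si in zip(w, cluster_scores))

    best_w, best_s = None, float("-inf")
    for _ in range(iters):
        raw = [rng.random() for _ in range(k)]
        total = sum(raw)
        w = [r / total for r in raw]  # normalize onto the simplex
        s = proxy(w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

More search iterations can only improve the best proxy score found, which is the monotonicity that makes the bootstrap loop worthwhile; the real system replaces both the proposal step and the proxy with learned components.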
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
- Title: xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
- Authors: Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li
- Date: 2025-04-14
- ArXiv page: https://arxiv.org/abs/2504.10481
- GitHub repo: https://github.com/IAAR-Shanghai/xVerify
English abstract
With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
BitNet b1.58 2B4T Technical Report
- Title: BitNet b1.58 2B4T Technical Report
- Authors: Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei
- Date: 2025-04-16
- ArXiv page: https://arxiv.org/abs/2504.12285
- GitHub repo: https://github.com/microsoft/bitnet
English abstract
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.
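The "1.58-bit" weight format can be illustrated with the absmean ternary quantizer described in the BitNet line of work: each weight is scaled by the mean absolute value and rounded into {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight). This is a sketch of the scheme, not the released kernels:

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a weight vector to {-1, 0, +1} with an absmean scale,
    the W1.58 scheme used by BitNet-style models."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale
```

Dequantization is just `q[i] * scale`, so matrix multiplies reduce to additions and subtractions plus one rescale, which is where the memory, energy, and latency savings in the abstract come from.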
Seedream 3.0 Technical Report
- Title: Seedream 3.0 Technical Report
- Authors: Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
- Date: 2025-04-15
- ArXiv page: https://arxiv.org/abs/2504.11346
- Paper link: https://arxiv.org/pdf/2504.11346
- Project page: https://team.doubao.com/zh/tech/seedream3_0
English abstract
We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that well align with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- Title: ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- Authors: Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong
- Date: 2025-04-15
- ArXiv page: https://arxiv.org/abs/2504.11536
- Paper link: https://arxiv.org/pdf/2504.11536
- Project page: https://retool-rl.github.io/
- GitHub repo: https://github.com/ReTool-RL/ReTool
English abstract
While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model in learning when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model’s tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool’s superiority: Our 32B model achieves 67% accuracy with 400 training steps, outperforming text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI’s o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
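The interleaved rollout described in feature (1) can be sketched as a loop: generate until the model emits a code block, execute it, append the interpreter output to the context, and continue. The `model_step` callable, turn limit, and unsandboxed `exec` are illustrative stand-ins for the policy LLM and a real sandbox:

```python
import contextlib
import io
import re

FENCE = chr(96) * 3  # the ``` code-fence marker, built to nest cleanly here
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.S)

def rollout(model_step, prompt, max_turns=4):
    """Alternate model generation with code execution: whenever the model
    emits a python code block, run it and feed stdout back as context."""
    context = prompt
    for _ in range(max_turns):
        chunk = model_step(context)
        context += chunk
        m = CODE_RE.search(chunk)
        if not m:
            return context  # no tool call: treat as the final answer
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(m.group(1), {})  # toy interpreter; real systems isolate this
        context += "\n[interpreter] " + buf.getvalue().strip() + "\n"
    return context
```

In ReTool's RL setup, trajectories produced by such a loop are scored by task outcome, so the policy learns when invoking the interpreter actually pays off.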
Antidistillation Sampling
- Title: Antidistillation Sampling
- Authors: Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter
- Date: 2025-04-17
- ArXiv page: https://arxiv.org/abs/2504.13146
- Paper link: https://arxiv.org/pdf/2504.13146
- Project page: https://antidistillation.com
English abstract
Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model’s next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model’s practical utility. For further details, see https://antidistillation.com.
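The mechanism in the abstract, strategically modifying the next-token distribution, can be illustrated by penalizing each token's logit by a poisoning score before renormalizing. In the paper the scores are derived from a proxy student's gradients; here they are simply given, so this is a shape-level sketch only:

```python
import math

def antidistill_dist(logits, poison_scores, lam=1.0):
    """Shift next-token logits away from tokens a would-be student gains
    most from imitating, then renormalize with a stable softmax."""
    adj = [l - lam * p for l, p in zip(logits, poison_scores)]
    m = max(adj)
    exps = [math.exp(a - m) for a in adj]
    z = sum(exps)
    return [e / z for e in exps]
```

With lam=0 the sampler reduces to ordinary softmax sampling, which is the knob that trades off teacher utility against how poisoned the traces are for distillation.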
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
- Title: Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
- Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
- Date: 2025-04-11
- ArXiv page: https://arxiv.org/abs/2504.08672
- Paper link: https://arxiv.org/pdf/2504.08672
- Project page: https://github.com/xufangzhi/Genius
- GitHub repo: https://github.com/xufangzhi/Genius
English abstract
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary, Genius requires to seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces the intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques together, Genius provides an advanced initial step towards self-improve LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
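The stepwise foresight re-sampling in the abstract amounts to: for each candidate next step, simulate a few future completions, average their values, and commit to the best-scoring step. A skeletal sketch, where `simulate` is a stand-in for rolling out the LLM and scoring the outcome:

```python
def foresight_step(candidates, simulate, n_rollouts=3):
    """Pick the candidate next reasoning step whose simulated futures have
    the highest mean value (stepwise foresight re-sampling, sketched)."""
    def value(step):
        return sum(simulate(step, i) for i in range(n_rollouts)) / n_rollouts
    return max(candidates, key=value)
```

The paper additionally calibrates these noisy value estimates with an advantage-calibrated optimization (ACO) loss during training; that part is not sketched here.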
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
- Title: Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
- Authors: Lvmin Zhang, Maneesh Agrawala
- Date: 2025-04-17
- ArXiv page: https://arxiv.org/abs/2504.12626
- Paper link: https://arxiv.org/pdf/2504.12626
- Project page: https://lllyasviel.github.io/frame_pack_gitpage
- GitHub repo: https://github.com/lllyasviel/FramePack
English abstract
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
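The bounded-context idea (compress older frames more aggressively so transformer context stays roughly fixed regardless of video length) can be illustrated with a geometric token budget. FramePack's actual compression schedule differs; the halving rule, budgets, and floor below are illustrative assumptions:

```python
def packed_context_lengths(num_frames, full_tokens=1536, min_tokens=16):
    """Token budget per input frame: the newest frame (age 0) keeps the full
    budget and each older frame gets half as many tokens, down to a small
    floor, so total context grows far slower than the number of frames."""
    return [max(min_tokens, full_tokens >> age) for age in range(num_frames)]
```

Because the budgets form a near-geometric series, a thousand-frame history costs only a small multiple of a single full-resolution frame, which is what keeps the computation bottleneck comparable to image diffusion.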
Have we unified image generation and understanding yet? An empirical study of GPT-4o’s image generation ability
- Title: Have we unified image generation and understanding yet? An empirical study of GPT-4o’s image generation ability
- Authors: Ning Li, Jingran Zhang, Justin Cui
- Date: 2025-04-09
- ArXiv page: https://arxiv.org/abs/2504.08003
- Paper link: https://arxiv.org/pdf/2504.08003
English abstract
OpenAI’s multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis (seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence) remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o’s strong capabilities in image generation and editing, our evaluation reveals GPT-4o’s persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o’s unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
- Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
- Authors: Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
- Date: 2025-04-10
- ArXiv page: https://arxiv.org/abs/2504.10514
- Paper link: https://arxiv.org/pdf/2504.10514
- Project page: https://huggingface.co/datasets/umd-zhou-lab/ColorBench
- GitHub repo: https://github.com/tianyi-lab/ColorBench
English abstract
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
- Title: GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
- Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu
- Date: 2025-04-11
- ArXiv page: https://arxiv.org/abs/2504.08736
- Paper link: https://arxiv.org/pdf/2504.08736
- Project page: https://silentview.github.io/GigaTok/
- GitHub repo: https://github.com/SilentView/GigaTok
English abstract
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality, a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
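The semantic regularization term in the abstract is an alignment loss between tokenizer features and a frozen pretrained encoder's features. A minimal sketch of the combined objective, where the (1 - cosine similarity) form, the weight `lam`, and plain-list vectors are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tokenizer_loss(recon_loss, tok_feat, teacher_feat, lam=0.5):
    """Total tokenizer objective: reconstruction plus a semantic
    regularization term pulling tokenizer features toward a frozen
    pretrained visual encoder's features."""
    return recon_loss + lam * (1.0 - cosine(tok_feat, teacher_feat))
```

The regularizer is minimized when tokenizer features align with the teacher's, which is the constraint the paper credits with keeping latent-space complexity in check as the tokenizer scales.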
VL-Rethinker:通过强化学习激励视觉语言模型的自我反思
- 标题: VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
- 作者: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen
- 日期: 2025-04-10
- ArXiv主页: https://arxiv.org/abs/2504.08837
- 论文链接: https://arxiv.org/pdf/2504.08837
- 项目链接: https://tiger-ai-lab.github.io/VL-Rethinker/
英文摘要
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1’s performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.
中文摘要
最近,像GPT-o1和DeepSeek-R1这样的慢思考系统展示了通过显式反思解决挑战性问题的巨大潜力。在各种数学和科学基准上,它们显著超过了GPT-4o等最好的快思考模型。然而,它们的多模态推理能力仍与快思考模型相当。例如,GPT-o1在MathVista、MathVerse和MathVision等基准上的表现与快思考模型相近。在本文中,我们旨在使用强化学习(不依赖蒸馏)来增强视觉语言模型的慢思考能力,以推进最新水平。首先,我们用一种称为选择性样本重播(SSR)的新技术改进GRPO算法,以解决优势消失问题。虽然这种方法性能强劲,但由此训练出的RL模型表现出有限的自我反思或自我验证能力。为进一步鼓励慢思考,我们引入了强制重思(Forced Rethinking),即在RL训练的初始rollout末尾附加文本化的重思触发语,显式强制执行一步自我反思推理。结合这两种技术,我们的模型VL-Rethinker将MathVista、MathVerse和MathVision上的最先进分数分别提升至80.3%、61.8%和43.9%。VL-Rethinker还在MMMU-Pro、EMMA和MEGA-Bench等多学科基准上取得开源SoTA,缩小了与GPT-o1的差距。
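摘要中的选择性样本重播(SSR)针对的是GRPO的"优势消失"问题:同组奖励相同会使优势为零、梯度无信号。下面是一个纯Python的极简示意(非官方实现,缓冲与重播策略均为假设):

```python
def selective_sample_replay(batch, buffer):
    """SSR 示意:batch 为 [(样本, 优势值), ...]。
    将非零优势样本存入缓冲区;当前批次全为零优势(无梯度信号)时,
    从缓冲区重播历史上有信号的样本。"""
    informative = [(s, a) for s, a in batch if abs(a) > 1e-8]
    buffer.extend(informative)
    if informative:
        return informative
    return list(buffer)  # 当前批次无信号时重播

buffer = []
print(selective_sample_replay([("s1", 0.0), ("s2", 0.7)], buffer))  # → [('s2', 0.7)]
print(selective_sample_replay([("s3", 0.0)], buffer))               # → [('s2', 0.7)]
```

强制重思则更简单:在rollout末尾拼接一句重思触发文本,再让模型继续生成自我反思步骤。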
MineWorld:Minecraft上的实时和开源互动世界模型
- 标题: MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
- 作者: Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian
- 日期: 2025-04-11
- ArXiv主页: https://arxiv.org/abs/2504.08388
- gitHub仓库: https://github.com/microsoft/MineWorld
英文摘要
World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer correspondingly, we consist the model input with the concatenation of the two kinds of ids interleaved. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatial redundant tokens in each frame at the same time, letting models in different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, outperforming SoTA open-sourced diffusion based world models significantly. The code and model have been released.
中文摘要
世界建模是使智能体能够有效与人类交互并在动态环境中运行的关键任务。在这项工作中,我们提出了MineWorld,一个基于Minecraft的实时交互式世界模型;Minecraft是一款开放式沙盒游戏,常被用作世界建模的通用测试平台。MineWorld由一个视觉-动作自回归Transformer驱动,它以成对的游戏场景和相应动作为输入,生成动作之后的新场景。具体而言,我们用图像分词器和动作分词器分别将视觉游戏场景和动作转换为离散词元ID,并将两种ID交错拼接作为模型输入。然后以下一词元预测训练模型,使其同时学习游戏状态的丰富表示以及状态与动作之间的关联。在推理时,我们开发了一种新颖的并行解码算法,同时预测每帧中的空间冗余词元,使不同规模的模型每秒能生成4到7帧,实现与玩家的实时交互。在评估中,我们提出了新的指标,不仅衡量生成新场景的视觉质量,还衡量其对动作的遵循能力,这对世界模型至关重要。全面评估表明了MineWorld的有效性,其显著优于SoTA开源的基于扩散的世界模型。代码和模型均已发布。
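摘要中"将视觉词元与动作词元交错拼接作为模型输入"的构造可以用几行Python示意(词元ID均为虚构示例):

```python
def interleave_tokens(scene_ids, action_ids):
    """将逐帧的视觉词元与动作词元交错拼接为一条序列(示意):
    [帧1视觉词元..., 帧1动作词元..., 帧2视觉词元..., 帧2动作词元..., ...]
    scene_ids / action_ids: 每帧一个词元ID列表。"""
    seq = []
    for scene, action in zip(scene_ids, action_ids):
        seq.extend(scene)
        seq.extend(action)
    return seq

# 两帧画面(各2个视觉词元)与对应动作(各1个动作词元)
print(interleave_tokens([[1, 2], [5, 6]], [[3], [7]]))  # → [1, 2, 3, 5, 6, 7]
```

模型对这条序列做下一词元预测,即可同时学到"状态"与"状态+动作→下一状态"两类条件关系。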
指令与推理数据如何塑造后训练:从层级梯度视角审视数据质量
- 标题: How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
- 作者: Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou
- 日期: 2025-04-14
- ArXiv主页: https://arxiv.org/abs/2504.10766
英文摘要
As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients’ singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
中文摘要
随着大语言模型(LLM)的后训练从指令跟随发展到复杂推理任务,不同数据如何影响微调动态在很大程度上仍未被探索。在本文中,我们对低/高质量指令和推理数据在LLM后训练中引起的层级梯度进行了谱分析。我们的分析表明,广泛研究的数据评估指标(例如IFD、InsTag、Difficulty和Reward)可以通过对梯度做奇异值分解(SVD)得到的谱属性来解释和统一。具体而言,高质量数据通常与较低的核范数和较高的有效秩相关。值得注意的是,在捕捉细微质量差异方面,有效秩比核范数表现出更好的鲁棒性和分辨率。例如,推理数据的有效秩显著高于指令数据,意味着更复杂任务上的梯度结构更丰富。我们的实验还表明,同一家族内的模型无论规模大小都具有相似的梯度模式,而不同模型家族之间差异显著。这项工作为指令和推理数据的数据质量效应提供了统一视角,阐明了数据质量与训练稳定性之间的相互作用,为开发更好的后训练数据探索策略提供了新的见解。
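摘要中的两个谱指标都可直接由梯度矩阵的SVD算出:核范数是奇异值之和,有效秩通常定义为归一化奇异值分布香农熵的指数。一个简短的NumPy示意(通用定义,非论文官方代码):

```python
import numpy as np

def nuclear_norm_and_effective_rank(grad):
    """由梯度矩阵的 SVD 计算核范数与有效秩:
    核范数 = 奇异值之和;有效秩 = exp(归一化奇异值分布的香农熵)。"""
    s = np.linalg.svd(grad, compute_uv=False)
    nuclear = float(s.sum())
    p = s / s.sum()
    p = p[p > 0]                      # 避免 log(0)
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))
    return nuclear, eff_rank

# 单位矩阵的奇异值全为 1:核范数为维度 4,有效秩也恰好为 4
n, r = nuclear_norm_and_effective_rank(np.eye(4))
print(round(n, 6), round(r, 6))  # → 4.0 4.0
```

按论文结论,高质量(尤其是推理)数据诱导的梯度应得到更低的核范数和更高的有效秩。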
生成但验证:通过回顾性重采样减少视觉语言模型中的幻觉
- 标题: Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
- 作者: Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
- 日期: 2025-04-17
- ArXiv主页: https://arxiv.org/abs/2504.13169
- 论文链接: https://arxiv.org/pdf/2504.13169
- 项目链接: https://reverse-vlm.github.io/
- gitHub仓库: https://github.com/tsunghan-wu/reverse_vlm
英文摘要
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.
中文摘要
视觉语言模型(VLM)在视觉理解方面表现出色,但常常出现视觉幻觉,即生成对不存在的对象、动作或概念的描述,这在安全关键应用中带来重大风险。现有的幻觉缓解方法通常遵循两种范式之一:生成调整,即修改解码行为使文本与视觉输入对齐;以及事后验证,即由外部模型评估并纠正输出。生成调整方法虽然有效,但往往依赖启发式规则且缺乏纠正机制;事后验证则较为复杂,通常需要多个模型,并且倾向于拒绝输出而非改进输出。在这项工作中,我们提出了REVERSE,一个将幻觉感知训练与即时自我验证相结合的统一框架。通过利用一个包含超过130万半合成样本的新幻觉验证数据集,以及一种新颖的推理时回顾性重采样技术,我们的方法使VLM既能在生成过程中检测幻觉,又能动态修正这些幻觉。我们的评估表明,REVERSE实现了最先进的幻觉消减效果,在CHAIR-MSCOCO上比现有最佳方法高出12%,在HaloQuest上高出28%。我们的数据集、模型和代码见:https://reverse-vlm.github.io。
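"推理时回顾性重采样"的核心循环可以抽象为:逐词元生成,一旦某一步被判为疑似幻觉,就不提交该词元而是回退重采样。一个纯Python的极简示意(非官方实现,检测与重采样接口均为假设):

```python
def generate_with_retrospection(step_fn, max_len, max_retries=3):
    """回顾性重采样示意:step_fn(prefix) 返回 (词元, 是否疑似幻觉)。
    检测到疑似幻觉时回退该词元并重采样,最多重试 max_retries 次;
    重试耗尽后接受当前词元以保证生成能终止。"""
    out, retries = [], 0
    while len(out) < max_len:
        tok, hallucinated = step_fn(out)
        if hallucinated and retries < max_retries:
            retries += 1     # 回退:不提交该词元,重新采样
            continue
        out.append(tok)
        retries = 0
    return out

calls = {"n": 0}
def fake_step(prefix):
    # 假设的采样器:第一次采样产出被标记为幻觉的词元,之后产出正常词元
    calls["n"] += 1
    return ("tok_bad", True) if calls["n"] == 1 else ("tok_ok", False)

print(generate_with_retrospection(fake_step, max_len=2))  # → ['tok_ok', 'tok_ok']
```

REVERSE中的检测信号来自幻觉感知训练得到的模型自身,而非此处这种外部标记。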
FUSION:完全整合视觉-语言表示以实现深度跨模态理解
- 标题: FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
- 作者: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
- 日期: 2025-04-14
- ArXiv主页: https://arxiv.org/abs/2504.09925
英文摘要
We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales-3B, 8B-and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. https://github.com/starriver030515/FUSION
中文摘要
我们提出了FUSION,一个采用完全视觉-语言对齐与整合范式的多模态大语言模型(MLLM)家族。与主要依赖LLM解码阶段后期模态交互的现有方法不同,我们的方法在整个处理管线中实现了深度的动态整合。为此,我们提出了文本引导的统一视觉编码,在视觉编码中融入文本信息以实现像素级整合。我们进一步设计了上下文感知的递归对齐解码,在解码过程中以文本上下文为条件递归聚合视觉特征,实现细粒度的、问题级的语义整合。为了指导特征映射并减轻模态差异,我们开发了双重监督语义映射损失。此外,我们通过一种新的数据合成方法构建了合成语言驱动的问答(QA)数据集,优先选取高质量QA对以优化文本引导的特征整合。在这些基础上,我们在3B和8B两个规模上训练FUSION,并证明我们的全模态整合方法仅用630个视觉词元就显著优于现有方法。值得注意的是,FUSION 3B在大多数基准上超过了Cambrian-1 8B和Florence-VL 8B;即使限制为300个视觉词元,FUSION 3B仍优于Cambrian-1 8B。我们的消融研究表明,在相同配置且不使用动态分辨率的情况下,FUSION在超过一半的基准上胜过LLaVA-NeXT,突显了我们方法的有效性。我们发布了代码、模型权重和数据集。https://github.com/starriver030515/FUSION
感知编码器:最好的视觉嵌入并不在网络的输出端
- 标题: Perception Encoder: The best visual embeddings are not at the output of the network
- 作者: Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer
- 日期: 2025-04-17
- ArXiv主页: https://arxiv.org/abs/2504.13181
- gitHub仓库: https://github.com/facebookresearch/perception_models
英文摘要
We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
中文摘要
我们提出了感知编码器(PE),这是一种通过简单的视觉-语言学习训练的、用于图像和视频理解的最先进编码器。传统上,视觉编码器依赖多种预训练目标,每种目标都针对分类、描述生成或定位等特定下游任务。令人惊讶的是,在扩展我们精心调优的图像预训练配方并用稳健的视频数据引擎加以完善之后,我们发现仅靠对比式视觉-语言训练就能为所有这些下游任务产生强大的通用嵌入。唯一的问题是:这些嵌入隐藏在网络的中间层中。为了将它们提取出来,我们引入了两种对齐方法:用于多模态语言建模的语言对齐,以及用于密集预测的空间对齐。连同核心的对比式checkpoint,我们的PE模型家族在各类任务上都取得了最先进的表现,包括零样本图像和视频分类与检索;文档、图像和视频问答;以及检测、深度估计和跟踪等空间任务。为了促进进一步研究,我们将发布模型、代码以及一个由合成与人工标注视频构成的新数据集。
通过强化重排进行代码生成的迭代自训练
- 标题: Iterative Self-Training for Code Generation via Reinforced Re-Ranking
- 作者: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
- 日期: 2025-04-13
- ArXiv主页: https://arxiv.org/abs/2504.09643
- 论文链接: https://arxiv.org/pdf/2504.09643
英文摘要
Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, that boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.
中文摘要
生成能解决复杂编程任务的高质量代码颇具挑战,尤其是当前基于解码器的模型会产生高度随机的输出。在代码生成中,即使是微小的错误也很容易破坏整个解决方案。利用多个采样解可以显著提高整体输出质量。增强代码生成的一种有效方式是将代码生成模型与重排(reranker)模型配对,由后者从生成的样本中选出最佳解。我们提出了一种新颖的迭代自训练方法,使用近端策略优化(PPO)训练重排模型,旨在同时提高重排准确率和整体代码生成质量。与侧重于用奖励模型优化生成模型的传统PPO方法不同,我们的方法强调构建一个稳健的奖励/重排模型。该模型通过重排提升生成代码的质量,并解决奖励模型在与重排器进行PPO对齐时可能忽略的问题和错误。我们的方法通过重新评估输出、识别高分负例并将其纳入训练循环来迭代精炼训练数据集,从而提升模型性能。我们在MultiPL-E数据集上的评估表明,我们的13.4B参数模型在代码生成质量上优于一个33B模型,同时速度快三倍。此外,它取得了与GPT-4相当的性能,并在一种编程语言上超过了GPT-4。
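"生成多个采样解,再由重排模型选优"这一流程在推理端非常简单,可以几行代码示意(打分函数为假设示例,实际应为训练好的重排模型):

```python
def rerank_best(solutions, reward_fn):
    """重排示意:对每个候选解用奖励/重排模型打分,返回得分最高者。"""
    return max(solutions, key=reward_fn)

# 假设的打分函数:这里用"通过的测试数"代替重排模型的输出
cands = ["sol_a", "sol_b", "sol_c"]
scores = {"sol_a": 1, "sol_b": 3, "sol_c": 2}
print(rerank_best(cands, scores.get))  # → sol_b
```

论文的贡献在于如何用PPO迭代自训练出这个 `reward_fn`,尤其是把高分负例重新纳入训练循环。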
WORLDMEM:具有记忆的长期一致世界模拟
- 标题: WORLDMEM: Long-term Consistent World Simulation with Memory
- 作者: Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan
- 日期: 2025-04-16
- ArXiv主页: https://arxiv.org/abs/2504.12369
- 论文链接: https://arxiv.org/pdf/2504.12369
- 项目链接: https://xizaoqu.github.io/worldmem/
- gitHub仓库: https://github.com/xizaoqu/WorldMem
英文摘要
World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
中文摘要
世界模拟因其能够对虚拟环境建模并预测动作后果而日益流行。然而,有限的时间上下文窗口常常导致难以保持长期一致性,尤其是在保持3D空间一致性方面。在这项工作中,我们提出了WorldMem,一个用记忆库增强场景生成的框架,记忆库由存储记忆帧及其状态(例如姿态和时间戳)的记忆单元组成。通过采用一种基于状态从这些记忆帧中有效提取相关信息的记忆注意力机制,我们的方法即使在视角或时间间隔显著的情况下,也能够准确重建先前观察过的场景。此外,通过将时间戳纳入状态,我们的框架不仅能建模静态世界,还能捕捉其随时间的动态演变,从而在模拟世界中同时实现感知与交互。在虚拟和真实场景中的大量实验验证了我们方法的有效性。
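"按状态(姿态+时间戳)从记忆库中取出相关记忆帧"这一步可以用最近邻检索来示意(纯Python草图,非官方实现;论文用的是可学习的记忆注意力,而非这里假设的欧氏距离):

```python
def retrieve_memory(query_state, memory_bank, k=2):
    """记忆检索示意:memory_bank 为 [(状态向量, 记忆帧), ...],
    以与查询状态的欧氏距离衡量相关性,返回最相关的 k 个记忆帧。"""
    def dist(state):
        return sum((a - b) ** 2 for a, b in zip(query_state, state)) ** 0.5
    ranked = sorted(memory_bank, key=lambda unit: dist(unit[0]))
    return [frame for _, frame in ranked[:k]]

# 状态这里简化为二维姿态;f1 距离查询最远,不会被检索到
bank = [((0.0, 0.0), "f0"), ((5.0, 5.0), "f1"), ((0.1, 0.0), "f2")]
print(retrieve_memory((0.0, 0.1), bank))  # → ['f0', 'f2']
```

检索出的记忆帧随后作为条件参与场景生成,使模型能在大视角或时间间隔后重建旧场景。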
Heimdall:生成式验证的测试时扩展
- 标题: Heimdall: test-time scaling on the generative verification
- 作者: Wenlei Shi, Xing Jin
- 日期: 2025-04-14
- ArXiv主页: https://arxiv.org/abs/2504.10337
- 论文链接: https://arxiv.org/pdf/2504.10337
英文摘要
An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought reasoning has demonstrated great potential of LLMs on solving competitive problems, but their verification ability remains to be weak and not sufficiently investigated. In this paper, we propose Heimdall, the long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. Through human evaluation, Heimdall demonstrates impressive generalization capabilities, successfully detecting most issues in challenging math proofs, the type of which is not included during training. Furthermore, we propose Pessimistic Verification to extend the functionality of Heimdall to scaling up the problem solving. It calls Heimdall to judge the solutions from a solver model and based on the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with 16x compute budget and to 83.3% with more compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system where one poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with the recent ablation studies from NuminaMath.
中文摘要
AI系统只能在其能够自行验证知识的范围内创建和维护知识。近期关于长链思维推理的工作展示了LLM在解决竞赛级问题上的巨大潜力,但其验证能力仍然薄弱且研究不足。在本文中,我们提出了Heimdall,一个能够准确判断解答正确性的长链思维(CoT)验证LLM。通过纯强化学习,我们在竞赛数学问题上将验证准确率从62.5%提升到94.5%;通过重复采样扩展,准确率进一步提高到97.5%。人工评估显示,Heimdall具有令人印象深刻的泛化能力,能成功检测出具有挑战性的数学证明中的大多数问题,而这类证明在训练中并未出现。此外,我们提出了悲观验证(Pessimistic Verification),将Heimdall的功能扩展到放大问题求解:它调用Heimdall来评判求解器模型给出的解,并基于悲观原则选出不确定性最低、最可能正确的解。以DeepSeek-R1-Distill-Qwen-32B为求解器模型,悲观验证在16倍计算预算下将AIME2025上的解题准确率从54.2%提升到70.0%,在更大计算预算下提升到83.3%;换用更强的求解器Gemini 2.5 Pro,分数达到93.0%。最后,我们构建了一个自动知识发现系统的原型,这是一个三元系统:一方提出问题,另一方给出解答,第三方验证解答。将数据合成工作NuminaMath用于前两个组件,Heimdall有效识别出数据集中有问题的记录,并揭示近一半的数据存在缺陷;有趣的是,这与NuminaMath最近的消融研究相吻合。
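悲观验证的骨架可以这样示意:对每个候选解重复采样验证器若干次,以"判对比例"作为置信度,选出置信度最高(不确定性最低)的解。以下是一个极简草图(非官方实现,投票聚合方式为假设):

```python
def pessimistic_verify(solutions, verify_fn, n_votes=4):
    """悲观验证示意:verify_fn(solution) -> bool,
    可以是带随机性的生成式验证器(如 Heimdall 的一次采样)。
    对每个候选解投票 n_votes 次,返回判对比例最高的解及其置信度。"""
    best, best_conf = None, -1.0
    for sol in solutions:
        conf = sum(verify_fn(sol) for _ in range(n_votes)) / n_votes
        if conf > best_conf:
            best, best_conf = sol, conf
    return best, best_conf

# 假设的确定性验证器:s1 总被判对,s2 总被判错
votes = {"s1": True, "s2": False}
print(pessimistic_verify(["s1", "s2"], votes.get))  # → ('s1', 1.0)
```

把求解器的多次采样作为 `solutions`、验证器的重复采样作为投票,就得到摘要中"以验证放大求解"的测试时扩展。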
VLM-R1:稳定且可泛化的R1风格大型视觉语言模型
- 标题: VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
- 作者: Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao
- 日期: 2025-04-10
- ArXiv主页: https://arxiv.org/abs/2504.07615
- gitHub仓库: https://github.com/om-ai-lab/VLM-R1
英文摘要
Recently DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs’ performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the “OD aha moment”, the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
中文摘要
最近,DeepSeek R1表明,强化学习(RL)可以通过简单而有效的设计显著提升大语言模型(LLM)的推理能力。R1的核心在于其基于规则的奖励设计,即利用具有确定性标准答案的任务来实现精确且稳定的奖励计算。在视觉领域,我们同样观察到,大量视觉理解任务天然具备定义明确的标注真值。这一特性使它们自然地与基于规则的奖励机制兼容。受此观察启发,我们研究了将R1风格强化学习扩展到视觉语言模型(VLM),以增强其视觉推理能力。为此,我们开发了VLM-R1,一个专门利用RL提升VLM在通用视觉语言任务上性能的框架。借助该框架,我们进一步探索了在视觉领域应用RL的可行性。实验结果表明,基于RL的模型不仅在视觉理解任务上具有竞争力,而且在泛化能力上超过了监督微调(SFT)。此外,我们进行了全面的消融研究,揭示了一系列值得注意的发现,包括目标检测中的奖励欺骗(reward hacking)现象、"OD aha时刻"的出现、训练数据质量的影响,以及RL在不同模型规模下的扩展行为。通过这些分析,我们旨在加深对强化学习如何增强视觉语言模型能力的理解,并希望我们的发现和开源贡献能支持视觉语言RL社区的持续进展。我们的代码和模型见 https://github.com/om-ai-lab/VLM-R1
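对目标检测这类有确定真值的任务,基于规则的奖励可以直接由预测框与真值框的IoU给出。下面是一个示意(阈值化的打分方式为假设,仅说明"规则奖励"这一思路):

```python
def iou(a, b):
    """计算两个框 (x1, y1, x2, y2) 的交并比 IoU。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def od_reward(pred_box, gt_box, thr=0.5):
    """基于规则的目标检测奖励示意:IoU 达到阈值记 1 分,否则 0 分。"""
    return 1.0 if iou(pred_box, gt_box) >= thr else 0.0

print(od_reward((0, 0, 10, 10), (0, 0, 10, 10)))  # → 1.0
```

摘要中提到的奖励欺骗正发生在这类规则奖励上,例如模型学会输出"讨好"阈值而非真正准确的框,这也是论文消融研究讨论的问题之一。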
70%大小,100%准确率:通过动态长度浮点实现高效GPU推理的无损LLM压缩
- 标题: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
- 作者: Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, Anshumali Shrivastava
- 日期: 2025-04-15
- ArXiv主页: https://arxiv.org/abs/2504.11651
- gitHub仓库: https://github.com/LeanModels/DFloat11
英文摘要
Large Language Models (LLMs) have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validates our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at https://github.com/LeanModels/DFloat11.
中文摘要
大语言模型(LLM)的规模迅速增长,为在资源受限硬件上高效部署带来了重大挑战。在本文中,我们提出了动态长度浮点(DFloat11),一个无损压缩框架,可将LLM大小减少30%,同时保留与原始模型逐位相同的输出。DFloat11的动机来自LLM的BFloat16权重表示中的低熵,这揭示了现有存储格式的显著低效。通过应用熵编码,DFloat11根据出现频率为权重分配动态长度编码,在不损失任何精度的情况下实现接近信息论最优的压缩。为了支持动态长度编码的高效推理,我们开发了用于快速在线解压的自定义GPU内核。我们的设计包含:(i)将占用内存的查找表(LUT)分解为适合放入GPU SRAM的紧凑LUT;(ii)使用轻量辅助变量协调线程读/写位置的两阶段内核;(iii)以Transformer块为粒度的解压以最小化延迟。在Llama-3.1、Qwen-2.5和Gemma-3等近期模型上的实验验证了我们的假设:DFloat11实现约30%的模型大小缩减,同时保持逐位精确的输出。与把未压缩模型的一部分卸载到CPU以满足内存约束的替代方案相比,DFloat11在词元生成上的吞吐量高出1.9-38.8倍。在固定GPU显存预算下,DFloat11支持比未压缩模型长5.3-13.17倍的上下文。值得注意的是,我们的方法使得810GB的Llama-3.1-405B能够在配备8x80GB GPU的单个节点上进行无损推理。我们的代码和模型见 https://github.com/LeanModels/DFloat11。
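DFloat11的出发点是:BFloat16权重的8比特指数字段分布高度集中,香农熵远低于8比特,因此熵编码就能无损压缩。可以用几行代码估算这一熵(指数数据为虚构示例,仅说明原理):

```python
import math
from collections import Counter

def exponent_entropy(exponents):
    """估算指数字段的香农熵(比特/权重):
    熵越低于 8 比特,熵编码(如 Huffman)能省的空间就越多。"""
    total = len(exponents)
    counts = Counter(exponents)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# 假设数据:指数集中在 125/126/127 三个值上
exps = [126] * 60 + [127] * 30 + [125] * 10
print(round(exponent_entropy(exps), 3))  # → 1.295
```

此例中每个指数约1.3比特即可编码,而固定格式要用8比特,这正是"约30%整体压缩"背后的空间来源(真实权重的分布与具体编码以论文为准)。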
SQL-R1:通过强化学习训练自然语言到SQL的推理模型
- 标题: SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
- 作者: Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, Jian Guo
- 日期: 2025-04-11
- ArXiv主页: https://arxiv.org/abs/2504.08600
- 论文链接: https://arxiv.org/pdf/2504.08600
- 项目链接: https://huggingface.co/collections/MPX0222forHF/sql-r1-682ae0b3483ed07ae622d8a4
- gitHub仓库: https://github.com/IDEA-FinAI/SQL-R1
英文摘要
Natural Language to SQL (NL2SQL) enables intuitive interactions with databases by transforming natural language queries into structured SQL statements. Despite recent advancements in enhancing human-computer interaction within database applications, significant challenges persist, particularly regarding the inference performance in complex scenarios involving multi-table joins and nested queries. Current methodologies primarily utilize supervised fine-tuning (SFT) to train the NL2SQL model, which may limit adaptability and interpretability in new environments (e.g., finance and healthcare). In order to enhance the reasoning performance of the NL2SQL model in the above complex situations, we introduce SQL-R1, a novel NL2SQL reasoning model trained by the reinforcement learning (RL) algorithms. We design a specialized RL-based reward function tailored for NL2SQL tasks and discussed the impact of cold start on the effectiveness of intensive training. In addition, we achieve competitive accuracy using only a tiny amount of synthetic NL2SQL data for augmented training and further explore data engineering for RL. In existing experiments, SQL-R1 achieves execution accuracy of 88.6% and 66.6% on the benchmark Spider and BIRD, respectively, only using the 7B base model.
中文摘要
自然语言到SQL(NL2SQL)通过将自然语言查询转换为结构化SQL语句,实现与数据库的直观交互。尽管在增强数据库应用中的人机交互方面近来已有进展,但重大挑战仍然存在,尤其是在涉及多表连接和嵌套查询的复杂场景中的推理性能方面。当前方法主要利用监督微调(SFT)来训练NL2SQL模型,这可能限制模型在新环境(例如金融和医疗)中的适应性和可解释性。为了增强NL2SQL模型在上述复杂情形下的推理性能,我们提出了SQL-R1,一个通过强化学习(RL)算法训练的新型NL2SQL推理模型。我们设计了针对NL2SQL任务量身定制的基于RL的奖励函数,并讨论了冷启动对强化训练有效性的影响。此外,我们仅用少量合成NL2SQL数据进行增强训练即达到有竞争力的准确率,并进一步探索了面向RL的数据工程。在现有实验中,SQL-R1仅使用7B基础模型,就在基准Spider和BIRD上分别达到88.6%和66.6%的执行准确率。
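NL2SQL任务的奖励天然可以由"执行结果是否一致"给出,这也是执行准确率的度量方式。下面用sqlite3给出一个执行匹配奖励的示意(非论文官方奖励函数;这里用集合比较,忽略了行序与重复行等细节):

```python
import sqlite3

def execution_reward(pred_sql, gold_sql, db):
    """执行匹配奖励示意:在同一数据库上执行预测 SQL 与标准 SQL,
    结果集一致记 1 分;执行出错或结果不一致记 0 分。"""
    try:
        pred = set(db.execute(pred_sql).fetchall())
        gold = set(db.execute(gold_sql).fetchall())
        return 1.0 if pred == gold else 0.0
    except sqlite3.Error:
        return 0.0

# 构造一个最小的内存数据库做演示
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t(x INT)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
print(execution_reward("SELECT x FROM t", "SELECT x FROM t ORDER BY x", db))  # → 1.0
```

论文的奖励函数在此基础上还考虑了语法、推理格式等更多信号,具体设计以原文为准。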