
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Abstract

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model’s ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
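
The two-stage loop is simple enough to sketch. Below is a minimal, illustrative Python sketch of the reflect-retry-reward scheme; `model.generate`, `task.check`, and `reinforce_update` are hypothetical stand-ins, not the paper's code, and the reward is applied only to the self-reflection tokens, as the abstract describes.

```python
def reflect_retry_reward(model, task, reinforce_update):
    """One episode of the two-stage scheme (hypothetical helper APIs)."""
    # Stage 0: first attempt, no reflection in context.
    attempt1 = model.generate(task.prompt)
    if task.check(attempt1):
        return  # success on the first try: nothing to train on here

    # Stage 1: the model critiques its own failed attempt.
    reflection = model.generate(
        task.prompt + attempt1 + "\nYour answer was wrong. Reflect on why."
    )

    # Stage 2: retry with the self-reflection in context.
    attempt2 = model.generate(task.prompt + reflection)

    # Binary feedback: if the retry succeeds, reward ONLY the
    # reflection tokens (not the answer tokens), so the model is
    # incentivized to write reflections that actually help.
    if task.check(attempt2):
        reinforce_update(model, tokens=reflection, reward=1.0)
```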



Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model’s entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME’25 and +7.71 on AIME’24) and Qwen3-14B (+4.79 on AIME’25 and +5.21 on AIME’24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
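
The core intervention, restricting policy-gradient updates to the highest-entropy 20% of tokens, is easy to express. A minimal PyTorch-style sketch under stated assumptions (per-token logits and advantages already computed; the 80/20 split taken per batch rather than the paper's exact batching):

```python
import torch
import torch.nn.functional as F

def forking_token_pg_loss(logits, actions, advantages, keep_ratio=0.2):
    """Policy-gradient loss restricted to high-entropy 'forking' tokens.

    logits:     [batch, seq, vocab] per-token logits from the policy
    actions:    [batch, seq] sampled token ids
    advantages: [batch, seq] per-token advantage estimates
    """
    logp = F.log_softmax(logits, dim=-1)
    probs = logp.exp()
    # Shannon entropy of the next-token distribution at every position.
    entropy = -(probs * logp).sum(-1)                      # [batch, seq]

    # Keep only the top `keep_ratio` highest-entropy positions per batch.
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = entropy.flatten().topk(k).values.min()
    mask = (entropy >= threshold).float()                  # 1 = forking token

    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # REINFORCE-style objective, masked so low-entropy tokens get no gradient.
    loss = -(mask * advantages * token_logp).sum() / mask.sum().clamp(min=1)
    return loss
```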



ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Abstract

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model’s reasoning capabilities or merely amplifies high-reward outputs already latent in the base model’s distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model’s task competence and the training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
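
Two of ProRL's ingredients, KL-divergence control against a reference policy and periodic reference resetting, can be sketched as follows. This is illustrative only; `kl_coef`, the reset schedule, and the helper names are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(policy_logits, ref_logits, pg_loss, kl_coef=0.01):
    """Add a KL(policy || reference) penalty to a policy-gradient loss."""
    p_logp = F.log_softmax(policy_logits, dim=-1)
    r_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (p_logp.exp() * (p_logp - r_logp)).sum(-1).mean()
    return pg_loss + kl_coef * kl

def maybe_reset_reference(step, policy, reference, reset_every=1000):
    """Periodically re-anchor the reference policy to the current policy,
    so the KL term constrains local drift instead of pinning the model
    to the original base distribution for the whole prolonged run."""
    if step % reset_every == 0:
        reference.load_state_dict(policy.state_dict())
```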



SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Abstract

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive–often with billions of parameters–leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.
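
The asynchronous inference idea, predicting the next chunk of actions while the robot is still executing the previous one, can be sketched with a thread and a queue. This is an illustrative skeleton, not SmolVLA's actual stack; `policy.predict_chunk` and the `robot` interface are hypothetical.

```python
import queue
import threading

def async_control_loop(policy, robot, horizon=50):
    """Decouple action-chunk prediction from action execution."""
    chunks = queue.Queue(maxsize=1)

    def predictor():
        while True:
            obs = robot.latest_observation()          # non-blocking snapshot
            chunks.put(policy.predict_chunk(obs))     # slow VLA forward pass

    threading.Thread(target=predictor, daemon=True).start()

    for _ in range(horizon):
        chunk = chunks.get()          # a short sequence of low-level actions
        for action in chunk:
            robot.execute(action)     # fast control loop keeps running while
                                      # the predictor prepares the next chunk
```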



AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Abstract

This paper presents AlphaOne (alpha1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. alpha1 first introduces alpha moment, which represents the scaled thinking phase with a universal parameter alpha. Within this scaled pre-alpha moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the alpha moment, alpha1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate alpha1’s superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/
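
A minimal sketch of the alpha1 schedule at decode time, assuming a hypothetical `generate_token` helper and token constants: before the alpha moment, slow-thinking transition tokens (e.g., "wait") are inserted as Bernoulli draws; after it, the end-of-thinking token is forced so the model answers quickly. The real method schedules the Bernoulli rate; a constant `p_slow` is used here for brevity.

```python
import random

def alpha1_decode(generate_token, think_budget, alpha=1.4, p_slow=0.1):
    """Illustrative alpha1-style scheduling of slow/fast thinking.

    think_budget: average thinking length of the unmodified model
    alpha:        scales the thinking phase; alpha * budget = 'alpha moment'
    p_slow:       Bernoulli rate of inserting a reasoning-transition token
    """
    alpha_moment = int(alpha * think_budget)
    tokens = []
    for _ in range(alpha_moment):
        # Pre-alpha-moment: stochastically insert a transition token
        # ("wait") to encourage slow, reflective thinking.
        if random.random() < p_slow:
            tokens.append("wait")
        tokens.append(generate_token(tokens))

    # Post-alpha-moment: deterministically end the thinking phase,
    # forcing a switch to fast reasoning / answer generation.
    tokens.append("</think>")
    while tokens[-1] != "<eos>":
        tokens.append(generate_token(tokens))
    return tokens
```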



MiMo-VL Technical Report

  • Title: MiMo-VL Technical Report

  • Authors: Xiaomi LLM-Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia

  • Date: 2025-06-04

  • arXiv page: https://arxiv.org/abs/2506.03569

  • Paper: https://arxiv.org/pdf/2506.03569

  • GitHub repo: https://github.com/XiaomiMiMo/lmms-eval

Abstract

We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.



Time Blindness: Why Video-Language Models Can't See What Humans Can?

Abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code have been made available on our project website: https://timeblindness.github.io/.
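
The benchmark's encoding is easy to reproduce in spirit: every individual frame looks like pure noise, but pixels inside a target shape flicker with a shared temporal pattern, so the signal exists only across time. A small NumPy sketch (parameters and scheme are illustrative, not the authors' generator):

```python
import numpy as np

def temporal_shape_video(mask, num_frames=60, seed=0):
    """Every frame looks like i.i.d. binary noise, but pixels inside
    `mask` flicker coherently over time, so the shape exists only
    temporally.

    mask: boolean [H, W] array marking the hidden shape.
    """
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    base = rng.integers(0, 2, size=(h, w))         # frozen noise for the shape
    pattern = rng.integers(0, 2, size=num_frames)  # shared flicker sequence
    frames = np.empty((num_frames, h, w), dtype=np.uint8)
    for t in range(num_frames):
        frame = rng.integers(0, 2, size=(h, w))    # fresh background noise
        # In-shape pixels: frozen noise, globally flipped by the pattern.
        frame[mask] = base[mask] ^ pattern[t]
        frames[t] = frame * 255
    return frames

# Per frame the shape is invisible (noise on noise); across frames the
# in-shape pixels move in lockstep while the background decorrelates,
# which is the purely temporal signal humans can read and VLMs miss.
```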



ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Abstract

We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.



Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Abstract

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs’ robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.
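
For context, retrieval with an embedding model of this kind typically reduces to encoding queries and documents and ranking by cosine similarity. A hedged sketch using sentence-transformers, assuming the released checkpoints load through it under the model id shown (check the official model card for the exact id, prompts, and usage):

```python
from sentence_transformers import SentenceTransformer

# Assumed model id; the official card may specify query/document prompts.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of China?"]
docs = [
    "The capital of China is Beijing.",
    "Gravity makes objects fall toward the Earth.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

# With normalized vectors, the dot product is cosine similarity.
scores = q_emb @ d_emb.T
print(scores)  # the Beijing document should score highest
```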



Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Abstract

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
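
The core pattern, a parameterized generator paired with a programmatic verifier, looks like this in miniature. This toy arithmetic environment is written for illustration and is not code from the library:

```python
import random

class ArithmeticEnv:
    """Toy RG-style environment: infinite data, adjustable difficulty,
    and a verifier that makes the reward checkable."""

    def __init__(self, difficulty=1, seed=0):
        self.rng = random.Random(seed)
        self.terms = difficulty + 1          # the difficulty knob

    def generate(self):
        nums = [self.rng.randint(1, 10 ** self.terms)
                for _ in range(self.terms)]
        question = " + ".join(map(str, nums)) + " = ?"
        return {"question": question, "answer": str(sum(nums))}

    @staticmethod
    def verify(item, model_output):
        """Binary, verifiable reward for RLVR-style training."""
        return model_output.strip() == item["answer"]

env = ArithmeticEnv(difficulty=2, seed=42)
item = env.generate()
print(item["question"], env.verify(item, item["answer"]))  # ... True
```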



UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Abstract

Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, their models are limited in exploring image perception and manipulation tasks, which are urgently desired by users for wide applications. Recently, OpenAI released their powerful GPT-4o-Image model for comprehensive image perception and manipulation, achieving expressive capability and attracting community interests. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders instead of VAE, while VAEs are considered essential components in many image manipulation models. Motivated by such inspiring observations, we present a unified generative framework named UniWorld based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% amount of BAGEL’s data, which consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.



VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

Abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents’ ability to predict others’ future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments of fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
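
The two evaluation dimensions reduce to simple statistics: offline next-action prediction accuracy and online normalized episode return. A sketch of both, where the min/max returns used for normalization are assumed to be environment-specific bounds supplied by the benchmark:

```python
def prediction_accuracy(predicted, actual):
    """Offline strategic reasoning: fraction of opponents' next actions
    the VLM predicted correctly."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

def normalized_return(episode_return, min_return, max_return):
    """Online decision-making: scale an episode's return into [0, 1]
    using environment-specific bounds."""
    return (episode_return - min_return) / (max_return - min_return)

print(prediction_accuracy(["rock", "paper"], ["rock", "rock"]))   # 0.5
print(normalized_return(3.0, min_return=-10.0, max_return=10.0))  # 0.65
```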



SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Abstract

Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.
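
The adaptive window idea is to pick a window size that tiles the actual output resolution exactly, instead of padding a fixed window. A hedged helper illustrating one way to do it; the paper's exact rule may differ:

```python
def adaptive_window_size(length, preferred=8, max_size=16):
    """Pick a window size near `preferred` that divides `length` exactly,
    so attention windows tile the feature map with no padding or partial
    windows (the source of inconsistency at high resolutions)."""
    candidates = [s for s in range(1, max_size + 1) if length % s == 0]
    return min(candidates, key=lambda s: abs(s - preferred))

# e.g. a 45-token-wide feature map cannot be tiled by 8-wide windows;
# choose the divisor closest to the preferred size instead.
print(adaptive_window_size(45))  # -> 9
print(adaptive_window_size(64))  # -> 8
```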



Video World Models with Long-term Spatial Memory

Abstract

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework to enhancing long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.



GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Abstract

One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
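
The attention-based action head can be sketched directly: a dedicated action token attends over all visual patch tokens, and the resulting attention distribution is the action-region proposal, so no (x, y) text coordinates are ever generated. A minimal PyTorch module under stated assumptions; dimensions and names are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    """Score every visual patch against a dedicated action token; the
    softmax over patches proposes one or more action regions."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the action-token state
        self.k = nn.Linear(dim, dim)   # projects patch-token states

    def forward(self, action_tok, patch_toks):
        # action_tok: [batch, dim], patch_toks: [batch, num_patches, dim]
        q = self.q(action_tok).unsqueeze(1)                  # [B, 1, D]
        k = self.k(patch_toks)                               # [B, P, D]
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5        # [B, P]
        return scores.softmax(-1)    # distribution over patches

head = AttentionActionHead(dim=64)
probs = head(torch.randn(2, 64), torch.randn(2, 196, 64))
print(probs.shape, probs.sum(-1))    # torch.Size([2, 196]), ~1.0 each
```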



SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Abstract

Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose SynthRL-a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL’s scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL’s effectiveness in eliciting deeper and more complex reasoning patterns.



CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

  • Title: CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
  • Authors: Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song
  • Date: 2025-05-30
  • arXiv page: https://arxiv.org/abs/2505.24120
  • Paper: https://arxiv.org/pdf/2505.24120

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.



Large Language Models for Data Synthesis

Abstract

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.
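
The iterative loop is the heart of the framework: compare summary statistics of the synthetic pool with those of the real data and feed the discrepancies back to the LLM as guidance for the next proposal batch. An illustrative skeleton with hypothetical helpers (`summary_stats`, `llm_propose_batch`), not LLMSynthor's actual API:

```python
def llmsynthor_loop(real_data, llm_propose_batch, summary_stats,
                    rounds=10, tol=0.01):
    """Sketch of discrepancy-driven synthesis (helper APIs are assumed).

    summary_stats(data)       -> dict of statistic name -> value
    llm_propose_batch(gaps)   -> list of synthetic records
    """
    target = summary_stats(real_data)
    synthetic = []
    for _ in range(rounds):
        current = summary_stats(synthetic) if synthetic else {}
        # Discrepancies become natural-language-ready feedback for the LLM.
        gaps = {k: target[k] - current.get(k, 0.0) for k in target}
        if synthetic and max(abs(g) for g in gaps.values()) < tol:
            break                      # statistics aligned closely enough
        synthetic.extend(llm_propose_batch(gaps))
    return synthetic
```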



SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Abstract

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.



Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Abstract

Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.



OpenThoughts: Data Recipes for Reasoning Models

  • Title: OpenThoughts: Data Recipes for Reasoning Models
  • Authors: Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
  • Date: 2025-06-04
  • arXiv page: https://arxiv.org/abs/2506.04178
  • Paper: https://arxiv.org/pdf/2506.04178

Abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. All of our datasets and models are available on https://openthoughts.ai.



AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

Abstract

As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), the fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.



The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

  • Title: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
  • Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
  • Date: 2025-06-05
  • arXiv page: https://arxiv.org/abs/2506.05209
  • Paper: https://arxiv.org/pdf/2506.05209
  • Project page: https://huggingface.co/common-pile
  • GitHub repo: https://github.com/r-three/common-pile

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.



RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Abstract

Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.



HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Abstract

Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
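
The precision and recall numbers describe how well a synthesized test suite classifies candidate programs. Given solutions labeled correct or wrong, the computation is standard; a small sketch, treating "wrong solution rejected by the tests" as a true positive:

```python
def test_suite_quality(labels, rejected):
    """labels[i] is True if solution i is actually wrong;
    rejected[i] is True if the test suite rejected solution i."""
    tp = sum(l and r for l, r in zip(labels, rejected))        # wrong, caught
    fp = sum((not l) and r for l, r in zip(labels, rejected))  # good, rejected
    fn = sum(l and (not r) for l, r in zip(labels, rejected))  # wrong, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four candidate solutions: two wrong, two correct.
print(test_suite_quality([True, True, False, False],
                         [True, False, False, True]))  # (0.5, 0.5)
```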



CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Abstract

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on https://huggingface.co/datasets/MBZUAI/cass, with code at https://github.com/GustavoStahl/CASS.


