中文摘要由 googletrans 机器翻译,如有不准确之处,以英文摘要为准。


用视频思考:视频生成作为一种有前途的多模态推理范式

英文摘要

The “Thinking with Text” and “Thinking with Images” paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce “Thinking with Video”, a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning “thinking with video” as a unified multimodal reasoning paradigm.

中文摘要

“用文本思考”和“用图像思考”范式显着提高了大语言模型(LLM)和视觉语言模型(VLM)的推理能力。然而,这些范式具有固有的局限性。(1)图像仅捕捉单个时刻,无法表示动态过程或连续变化;(2)文本和视觉作为不同模态的分离,阻碍了统一的多模态理解和生成。为了克服这些限制,我们引入了“用视频思考”,这是一种利用视频生成模型(例如 Sora-2)在统一时间框架中桥接视觉和文本推理的新范式。为了支持这一探索,我们开发了视频思维基准(VideoThinkBench)。VideoThinkBench 包含两个任务类别:(1) 以视觉为中心的任务(例如,眼球拼图),以及 (2) 以文本为中心的任务(例如,GSM8K、MMMU 的子集)。我们的评估表明 Sora-2 是一个有能力的推理者。在以视觉为中心的任务上,Sora-2 通常可以与最先进的 (SOTA) VLM 相媲美,甚至在一些任务上超过了 VLM,例如 Eyeballing Games。在以文本为中心的任务中,Sora-2 在 MATH 上实现了 92% 的准确率,在 MMMU 上实现了 75.53% 的准确率。此外,我们系统地分析了这些能力的来源。我们还发现自我一致性和情境学习可以提高 Sora-2 的性能。总之,我们的研究结果表明,视频生成模型是潜在的统一多模态理解和生成模型,将“用视频思考”定位为统一的多模态推理范式。
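
The abstract above notes that self-consistency improves Sora-2's accuracy. Below is a minimal Python sketch of that aggregation step, assuming a final answer can be parsed from each sampled generation's transcript; the regex-based `extract_answer` is an illustrative placeholder, not the paper's parser.

```python
from collections import Counter
import re

def extract_answer(generation: str) -> str:
    """Hypothetical parser: take the last number mentioned in a transcript."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation)
    return numbers[-1] if numbers else ""

def self_consistency_vote(generations: list[str]) -> str:
    """Majority vote over answers extracted from k independently sampled generations."""
    answers = [a for a in (extract_answer(g) for g in generations) if a]
    if not answers:
        return ""
    return Counter(answers).most_common(1)[0][0]

# Example: 5 sampled transcripts for a GSM8K-style question (made-up strings).
samples = ["... so the total is 42", "answer: 42", "I get 40", "42 apples", "total = 42"]
print(self_consistency_vote(samples))  # -> "42"
```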


扩散语言模型是超级数据学习者

英文摘要

Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.

中文摘要

在严格控制的预训练设置下,我们观察到一个交叉点:当唯一数据有限时,扩散语言模型(DLM)通过训练更多轮次(epoch)而始终超越自回归(AR)模型。随着数据更多或质量更高,交叉点会出现得更晚,而模型更大时则出现得更早,并且在密集和稀疏架构中均持续存在。我们将收益归因于三个复合因素:(1)任意顺序建模,(2)迭代双向去噪带来的超密集计算,以及(3)内置蒙特卡罗增强;输入或参数噪声在数据受限下可以改善 AR,但无法缩小差距。在更大规模上,在 10B 个唯一 Python 令牌上以约 1.5T 令牌计算预算训练的 1.7B DLM 超过了使用严格匹配设置训练的 AR 代码模型。此外,仅使用 1B 令牌,1B 参数的 DLM 在 HellaSwag 上实现了 > 56% 的准确率,在 MMLU 上实现了 > 33% 的准确率,无需任何特殊技巧,只需重复标准预训练数据即可。我们还表明,在这种情形下,验证交叉熵的上升并不意味着下游性能的下降。
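
A minimal PyTorch sketch of the masked-denoising objective behind the any-order modeling and built-in Monte Carlo augmentation described above: each step samples a fresh masking ratio, corrupts that fraction of tokens, and supervises only the masked positions, so repeated epochs over the same data see new corruptions. The tiny bidirectional encoder, vocabulary, and uniform mask-ratio schedule are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, D = 1000, 999, 128

class TinyBidirectionalDenoiser(nn.Module):
    """Toy stand-in for a bidirectional Transformer denoiser."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, x):
        return self.head(self.encoder(self.emb(x)))

def diffusion_lm_loss(model, tokens):
    """One training step: mask a random fraction of tokens, predict only those."""
    ratio = torch.rand(tokens.size(0), 1, device=tokens.device)   # t ~ U(0, 1) per sequence
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio     # per-token corruption
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], tokens[mask])

model = TinyBidirectionalDenoiser()
batch = torch.randint(0, VOCAB - 1, (4, 32))   # repeated epochs resample fresh masks
print(diffusion_lm_loss(model, batch).item())
```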


VCode:以 SVG 作为符号视觉表示的多模态编码基准

英文摘要

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model’s intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.

中文摘要

代码已成为代理时代推理和行动的精确且可执行的媒介。然而,进展主要集中在以语言为中心的任务上,例如程序合成和调试,而以视觉为中心的编码尚未得到充分探索。受到人类如何推理草图的启发,我们提倡将 SVG 代码作为一种紧凑、可解释且可执行的视觉表示形式。我们引入了 VCode,这是一个将多模态理解重新构建为代码生成的基准:给定图像,模型必须生成 SVG,为下游推理保留符号含义。VCode涵盖三个领域——一般常识(MM-Vet)、专业学科(MMMU)和以视觉为中心的感知(CV-Bench)。为了评估符号保真度,我们提出了 CodeVQA,这是一种新颖的评估协议,其中策略模型回答有关渲染 SVG 的问题;正确的答案表明忠实的象征性保存。根据经验,前沿 VLM 很难生成忠实的 SVG,揭示了以语言为中心的编码和以视觉为中心的编码之间持续存在的差距。为了弥补这一差距,我们引入了 VCoder,这是一个代理框架,它沿着两个轴增强了 VLM:(i)通过修订思考,迭代地分析差异并改进 SVG 代码;(ii) 使用视觉工具进行操作,其中检测器和解析器提供超出模型内在能力的结构化线索,例如对象、形状和文本。在各个基准测试中,具有强大推理能力的前沿 VLM 总体得分较高,但在专业知识和 3D 推理方面仍然有限。VCoder 的整体增益比表现最佳的 Claude-4-Opus 提高了 12.3 点。人类研究表明,人类和 VLM 在渲染 SVG 上的表现都较差,但它们的一致性揭示了符号视觉表示的前景。基准测试和代码可在 https://github.com/CSU-JPG/VCode 获取。


V-Thinker:图像互动思维

英文摘要

Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.

中文摘要

使大型多模态模型(LMM)将图像交互与长视野推理能力深度集成仍然是该领域长期存在的挑战。以视觉为中心的推理的最新进展探索了一种有前途的 LMM 的“用图像思考”范式,标志着从图像辅助推理到图像交互思维的转变。虽然这一里程碑使模型能够专注于细粒度图像区域,但进展仍然受到有限的视觉工具空间和特定于任务的工作流程设计的限制。为了弥补这一差距,我们推出了 V-Thinker,这是一种通用多模态推理助手,可通过端到端强化学习实现交互式、以视觉为中心的思维。V-Thinker 包含两个关键组件:(1)数据进化飞轮,自动合成、进化和验证跨三个维度(多样性、质量和难度)的交互式推理数据集;(2) 视觉渐进训练课程,首先通过点级监督调整感知,然后通过两阶段强化学习框架整合交互式推理。此外,我们还引入了 VTBench,这是一个经过专家验证的基准测试,针对以视觉为中心的交互式推理任务。大量实验表明,V-Thinker 在一般推理和交互式推理场景中始终优于基于 LMM 的强大基线,为推进图像交互式推理应用提供了宝贵的见解。


不要蒙蔽你的 VLA:为 OOD 泛化对齐视觉表示

英文摘要

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA’s hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io

中文摘要

视觉语言动作(VLA)模型的日益成功源于这样的承诺:预训练的视觉语言模型(VLM)可以赋予智能体可转移的世界知识和视觉语言(VL)基础,为具有更广泛泛化能力的动作模型奠定基础。然而,当这些 VLM 适应行动模式时,仍不清楚它们原始的 VL 表示和知识在多大程度上得以保留。在这项工作中,我们对 VLA 微调过程中的表示保留进行了系统研究,表明朴素动作微调会导致视觉表示的退化。为了表征和测量这些效果,我们探究了 VLA 的隐藏表示并分析了注意力图,此外,我们设计了一组有针对性的任务和方法,将 VLA 模型与其对应的 VLM 进行对比,隔离由动作微调引起的 VL 能力的变化。我们进一步评估了一系列对齐视觉表示的策略,并引入了一种简单而有效的方法,可以减轻退化并提高对分布外(OOD)场景的泛化能力。总而言之,我们的分析阐明了动作微调和 VL 表示退化之间的权衡,并强调了恢复继承的 VL 功能的实用方法。代码公开:https://blind-vla-paper.github.io
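
A minimal sketch of the kind of linear-probe comparison used to measure representation retention: freeze features taken from the same layer before and after action fine-tuning and compare held-out probe accuracy on an identical labelled task. The synthetic features and logistic-regression probe are stand-ins for the paper's VLM/VLA encoders and probing protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen features and report held-out accuracy."""
    tr_x, te_x, tr_y, te_y = train_test_split(features, labels, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(tr_x, tr_y)
    return clf.score(te_x, te_y)

# Stand-ins for hidden states pulled from the same layer of the VLM and of the
# action-fine-tuned VLA on an identical probe dataset (e.g. object categories).
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)
class_signal = np.eye(5)[labels] @ rng.normal(size=(5, 64))
vlm_feats = class_signal + rng.normal(size=(500, 64))
vla_feats = class_signal + rng.normal(scale=4.0, size=(500, 64))  # noisier after fine-tuning

print("VLM probe acc:", probe_accuracy(vlm_feats, labels))
print("VLA probe acc:", probe_accuracy(vla_feats, labels))        # degradation shows up as a drop
```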


每次激活皆有提升:将通用推理器扩展至 1 万亿参数的开放语言基础模型

  • 标题: Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
  • 作者: Ling-Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chili Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Xu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang
  • 日期: 2025-10-25
  • ArXiv主页: https://arxiv.org/abs/2510.22115
  • 论文链接: https://arxiv.org/pdf/2510.22115

英文摘要

We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.

中文摘要

我们推出 Ling 2.0,这是一系列面向推理的语言基础模型,其原则是每次激活都会增强推理能力。Ling 2.0 旨在在统一的专家混合 (MoE) 范式下将参数从数百亿扩展到一万亿,强调高度稀疏性、跨尺度一致性和由经验缩放定律指导的效率。该系列包括三种非思考(指令)模型 - Ling-mini-2.0、Ling-flash-2.0 和 Ling-1T - 总参数范围从 16B 到 1T,与密集型同类模型相比,激活计算效率高达 7 倍。Ling 2.0 整合了跨模型架构、预训练、后训练和基础设施的协调创新:具有 MTP 的高稀疏 MoE 用于高效推理、面向推理的数据和训练中期的 CoT 激活、基于强化的微调(DFT、Evo-CoT)以及具有细粒度异构流水线的全量 FP8 训练。在万亿规模上,Ling-1T 建立了推理准确性与计算效率的新帕累托前沿,证明稀疏激活在与推理目标正确对齐时可以实现可扩展且高效的智能。总的来说,Ling 2.0 为推进未来的推理和思维模型(包括建立在同一基础上的 Ring 系列)提供了连贯、开放和高效的基础。
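
A minimal PyTorch sketch of the high-sparsity Mixture-of-Experts idea underlying the series: a router selects the top-k experts per token, so only a small fraction of parameters is active in each forward pass. Expert count, k, and dimensions are illustrative; Ling 2.0's actual architecture, MTP heads, and FP8 pipeline are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Token-level top-k routing over a pool of small FFN experts."""
    def __init__(self, d_model=64, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weight, idx = gate.topk(self.k, dim=-1)    # keep only k experts per token
        weight = weight / weight.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                hit = idx[:, slot] == e            # tokens routed to expert e in this slot
                if hit.any():
                    out[hit] += weight[hit, slot, None] * expert(x[hit])
        return out

tokens = torch.randn(8, 64)
print(SparseMoE()(tokens).shape)   # torch.Size([8, 64]); only 2 of 16 experts active per token
```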


通过经验合成扩展代理学习

  • 标题: Scaling Agent Learning via Experience Synthesis
  • 作者: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh
  • 日期: 2025-11-05
  • ArXiv主页: https://arxiv.org/abs/2511.03773
  • 论文链接: https://arxiv.org/pdf/2511.03773

英文摘要

While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.

中文摘要

虽然强化学习 (RL) 可以通过交互实现自我改进,从而增强大型语言模型 (LLM) 代理的能力,但由于推演(rollout)成本高昂、任务多样性有限、奖励信号不可靠以及基础设施复杂,其实际采用仍然具有挑战性,所有这些都阻碍了可扩展经验数据的收集。为了应对这些挑战,我们推出了 DreamGym,这是第一个以可扩展性为目标来合成多样化经验的统一框架,以便为自主代理提供有效的在线 RL 训练。DreamGym 没有依赖昂贵的真实环境推演,而是将环境动态提炼为基于推理的经验模型,该模型通过逐步推理导出一致的状态转换和反馈信号,从而为 RL 实现可扩展的代理推演收集。为了提高状态转换的稳定性和质量,DreamGym 利用以离线真实数据初始化的经验重放缓冲区,并通过新的交互不断丰富,以积极支持代理训练。为了改善知识获取,DreamGym 自适应地生成挑战当前代理策略的新任务,从而实现更有效的在线课程学习。跨不同环境和代理主干的实验表明,无论是在完全合成的环境中还是在模拟到真实(sim-to-real)的迁移场景中,DreamGym 都显著改善了 RL 训练。在 WebArena 等尚不支持 RL 的任务中,DreamGym 的性能优于所有基线 30% 以上。在支持 RL 但成本高昂的设置中,它仅使用合成交互即可匹配 GRPO 和 PPO 的性能。当将纯粹基于合成经验训练的策略迁移到真实环境强化学习时,DreamGym 产生了显著的额外性能提升,同时需要的真实环境交互少得多,为通用强化学习提供了可扩展的热启动策略。
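
A minimal sketch of the experience-synthesis loop described above, under strong simplifying assumptions: a stand-in "experience model" proposes the next state and reward, trajectories fill a replay buffer seeded with offline data, and sampled batches would feed an RL update. The dataclass and the toy policy/experience functions are hypothetical placeholders, not DreamGym's LLM-based components.

```python
import random
from dataclasses import dataclass
from collections import deque

@dataclass
class Transition:
    state: str
    action: str
    reward: float
    next_state: str

class ReplayBuffer:
    def __init__(self, capacity=10_000, seed_data=()):
        self.buf = deque(seed_data, maxlen=capacity)   # initialized with offline real-world data
    def push(self, t: Transition):
        self.buf.append(t)
    def sample(self, n):
        return random.sample(list(self.buf), min(n, len(self.buf)))

def experience_model(state: str, action: str):
    """Stand-in for the reasoning-based model that derives next state and reward."""
    next_state = f"{state}->{action}"
    reward = 1.0 if action == "click_submit" else 0.0
    return next_state, reward

def policy(state: str) -> str:
    """Stand-in agent policy."""
    return random.choice(["click_submit", "scroll", "type_query"])

buffer = ReplayBuffer(seed_data=[Transition("home", "scroll", 0.0, "home->scroll")])
for episode in range(20):                      # synthetic rollouts, no real environment needed
    state = "home"
    for _ in range(5):
        action = policy(state)
        next_state, reward = experience_model(state, action)
        buffer.push(Transition(state, action, reward, next_state))
        state = next_state

batch = buffer.sample(8)                       # a batch like this would feed a PPO/GRPO update
print(len(buffer.buf), batch[0])
```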


ThinkMorph:多模态交错思维链推理中的涌现属性

英文摘要

Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.

中文摘要

多模态推理需要语言和视觉之间的迭代协调,但目前尚不清楚什么构成了有意义的交错思维链。我们认为文本和图像思维应该作为互补而非同构的模式发挥作用,从而相互促进推理。遵循这一原则,我们构建了 ThinkMorph,这是一个在 24K 高质量交错推理轨迹上进行微调的统一模型,涵盖具有不同视觉参与度的任务。ThinkMorph 学习生成渐进式文本图像推理步骤,具体操作视觉内容,同时保持连贯的语言逻辑。它在以视觉为中心的基准测试中取得了巨大的进步(平均比基本模型提高了 34.7%),并推广到域外任务,匹配或超越更大的专有 VLM。除了性能之外,ThinkMorph 还展示了新兴的多模态智能,包括看不见的视觉操作技能、推理模式之间的自适应切换以及通过多样化的多模态思维更好地扩展测试时间。这些发现为表征多模态推理统一模型的新兴能力提供了有希望的方向。


INT vs. FP:细粒度低比特量化格式的综合研究

英文摘要

Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.

中文摘要

现代人工智能硬件,例如 Nvidia 的 Blackwell 架构,越来越多地采用低精度浮点 (FP) 格式来处理大型语言模型 (LLM) 中普遍存在的激活异常值。尽管存在这种行业趋势,但仍缺乏跨不同粒度的 FP 和整数 (INT) 量化的统一比较,导致算法和硬件协同设计缺乏明确的指导。本文通过系统地研究 FP 和 INT 格式之间的权衡来填补这一空白。我们揭示了一个关键的性能交叉:虽然 FP 在粗粒度量化方面表现出色,但细粒度(按块)级别的比较更加细致。我们的全面比较表明,对于流行的 8 位细粒度格式(例如块大小为 32 的 MX),MXINT8 在算法准确性和硬件效率方面均优于其 FP 对应项。然而,对于 4 位格式,FP(例如 MXFP4、NVFP4)通常具有精度优势,尽管我们表明,当应用哈达玛旋转等异常值缓解技术时,NVINT4 可以超越 NVFP4。我们还引入了一种对称裁剪方法,可以解决细粒度低位 INT 训练中的梯度偏差,从而为 MXINT8 训练提供近乎无损的性能。这些发现对当前的硬件轨迹提出了挑战,表明一刀切的 FP 方法并不是最优的,并提倡细粒度的 INT 格式,特别是 MXINT8,为未来的 AI 加速器提供准确性、功率和效率之间的更好平衡。
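
A minimal NumPy sketch of fine-grained block-wise INT8 quantization in the spirit of the MX formats discussed above: values are grouped into blocks of 32, each block shares one scale, and rounding is symmetric. Real MXINT8 constrains the shared scale to a power of two and includes other details not modelled here.

```python
import numpy as np

def quantize_blockwise_int8(x: np.ndarray, block: int = 32):
    """Symmetric per-block INT8 quantization: one scale per block of 32 values."""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)                 # avoid division by zero
    q = np.clip(np.round(flat / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_blockwise(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

x = np.random.randn(4, 128).astype(np.float32)
x[0, 0] = 12.0                                               # an activation outlier
q, s = quantize_blockwise_int8(x)
x_hat = dequantize_blockwise(q, s, x.shape)
# The outlier only inflates the scale of its own 32-value block, which is why
# fine-grained INT formats tolerate outliers better than coarse per-tensor INT.
print("max abs error:", np.abs(x - x_hat).max())
```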


OS-Sentinel:通过现实工作流程中的混合验证实现安全增强型移动 GUI 代理

英文摘要

Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.

中文摘要

由视觉语言模型(VLM)支持的计算机使用代理已经在操作移动平台等数字环境中展示了类似人类的能力。虽然这些代理在推进数字自动化方面前景广阔,但它们潜在的不安全操作(例如系统被攻破和隐私泄露)引起了人们的严重担忧。在移动环境广阔而复杂的操作空间中检测这些安全问题是一项艰巨的挑战,而针对这一挑战的探索仍然严重不足。为了给移动代理安全研究奠定基础,我们引入了 MobileRisk-Live,这是一个动态沙箱环境,附带安全检测基准,其中包含带有细粒度注释的真实轨迹。在此基础上,我们提出了 OS-Sentinel,这是一种新颖的混合安全检测框架,它将用于检测显式系统级违规的形式验证器与用于评估上下文风险和代理操作的基于 VLM 的上下文判断器协同结合。实验表明,OS-Sentinel 在多个指标上比现有方法实现了 10%-30% 的改进。进一步的分析提供了重要的见解,可以促进更安全、更可靠的自主移动代理的开发。


连续自回归语言模型

英文摘要

The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.

中文摘要

大型语言模型 (LLM) 的效率从根本上受到其顺序、逐个标记生成过程的限制。我们认为,克服这一瓶颈需要 LLM 扩展的新设计轴:增加每个生成步骤的语义带宽。为此,我们引入连续自回归语言模型(CALM),这是从离散下一个标记预测到连续下一个向量预测的范式转变。CALM 使用高保真自动编码器将 K 个标记块压缩为单个连续向量,从中可以以超过 99.9% 的准确度重建原始标记。这使我们能够将语言建模为连续向量序列而不是离散标记,从而将生成步骤的数量减少了 K 倍。范式转变需要一个新的建模工具包;因此,我们开发了一个全面的无似然框架,可以在连续域中实现稳健的训练、评估和可控采样。实验表明,CALM 显着改善了性能与计算的权衡,以显着降低的计算成本实现了强离散基线的性能。更重要的是,这些发现将下一个向量预测确立为实现超高效语言模型的强大且可扩展的途径。代码:https://github.com/shaochenze/calm。项目:https://shaochenze.github.io/blog/2025/CALM。
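
A minimal PyTorch sketch of the core CALM ingredient: an autoencoder that compresses a chunk of K token embeddings into one continuous vector and reconstructs the K tokens from it, so a sequence model could predict one vector per K-token step. K, the dimensions, and the MLP encoder/decoder are illustrative assumptions; the paper's likelihood-free training and sampling machinery is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, K, D, Z = 1000, 4, 64, 128

class ChunkAutoencoder(nn.Module):
    """Compress K tokens into one latent vector; reconstruct the K tokens from it."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.encoder = nn.Sequential(nn.Linear(K * D, 256), nn.GELU(), nn.Linear(256, Z))
        self.decoder = nn.Sequential(nn.Linear(Z, 256), nn.GELU(), nn.Linear(256, K * VOCAB))

    def encode(self, tokens):                       # tokens: (batch, K) -> (batch, Z)
        return self.encoder(self.emb(tokens).flatten(1))

    def decode_logits(self, z):                     # z: (batch, Z) -> (batch, K, VOCAB)
        return self.decoder(z).view(-1, K, VOCAB)

model = ChunkAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, VOCAB, (32, K))
for step in range(500):                             # toy reconstruction training
    logits = model.decode_logits(model.encode(tokens))
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

acc = (model.decode_logits(model.encode(tokens)).argmax(-1) == tokens).float().mean()
print(f"toy reconstruction accuracy: {acc:.3f}")    # the paper reports >99.9% at scale
```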


π_RL:基于流的视觉-语言-动作模型的在线强化学习微调

英文摘要

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., pi_0, pi_{0.5}) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with pi_{RL}, an open-source framework for training flow-based VLAs in parallel simulation. pi_{RL} implements two RL algorithms: (1) {Flow-Noise} models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) {Flow-SDE} integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate pi_{RL} on LIBERO and ManiSkill benchmarks. On LIBERO, pi_{RL} boosts few-shot SFT models pi_0 and pi_{0.5} from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train pi_{RL} in 320 parallel environments, improving pi_0 from 41.6% to 85.7% and pi_{0.5} from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, pi_{RL} achieves significant performance gains and stronger generalization over SFT-models, validating the effectiveness of online RL for flow-based VLAs.

中文摘要

视觉-语言-动作 (VLA) 模型使机器人能够理解并执行来自多模式输入的复杂任务。尽管最近的工作探索使用强化学习 (RL) 来自动化扩展监督微调 (SFT) 中繁琐的数据收集过程,但由于迭代去噪带来的棘手的动作对数似然,将大规模 RL 应用于基于流的 VLA(例如 pi_0、pi_{0.5})仍然具有挑战性。我们使用 pi_{RL} 来应对这一挑战,这是一个开源框架,用于在并行仿真中训练基于流的 VLA。pi_{RL} 实现了两种 RL 算法:(1) {Flow-Noise} 将去噪过程建模为离散时间 MDP,并使用可学习噪声网络进行精确的对数似然计算。(2) {Flow-SDE} 将去噪与智能体-环境交互相结合,制定了一个两层 MDP,采用 ODE 到 SDE 转换来实现高效的 RL 探索。我们在 LIBERO 和 ManiSkill 基准上评估 pi_{RL}。在 LIBERO 上,pi_{RL} 将小样本 SFT 模型 pi_0 和 pi_{0.5} 分别从 57.6% 提高到 97.6% 和从 77.1% 提高到 98.3%。在 ManiSkill 中,我们在 320 个并行环境中训练 pi_{RL},在 4352 个拾放任务中将 pi_0 从 41.6% 提高到 85.7%,将 pi_{0.5} 从 40.0% 提高到 84.8%,展示了异构模拟下的可扩展多任务 RL。总体而言,与 SFT 模型相比,pi_{RL} 实现了显着的性能提升和更强的泛化能力,验证了在线 RL 对于基于流的 VLA 的有效性。


当可视化是推理的第一步时:MIRA,视觉思维链的基准

英文摘要

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through “drawing to think”. To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.

中文摘要

我们提出了 MIRA,这是一个新的基准,旨在评估生成中间视觉图像对于成功推理至关重要的场景中的模型。与仅依赖文本的传统 CoT 方法不同,MIRA 中的任务需要模型生成和利用中间图像(例如草图、结构图或路径图)来指导其推理过程。这种设置密切反映了人类如何通过“绘画思考”来解决复杂的问题。为了解决这个问题,MIRA 专注于本质上具有挑战性的任务,涉及复杂的结构、空间关系或推理步骤,这些很难仅通过语言来表达。为了确保我们的评估数据是高质量的,我们包含了 546 个多模态问题,并用中间视觉图像和最终答案进行了注释。我们还提出了 MIRA 的统一评估协议,涵盖三个评估输入级别:仅包含图像和问题的直接输入、包含图像和思维提示的纯文本 CoT 输入、以及包含带注释的图像线索和文本思维提示的 Visual-CoT 输入。为了探究我们基准上模型容量的上限,我们还报告了不同 k 设置下的 pass@k 和多数投票准确性。实验结果表明,现有的多模态大语言模型,包括最强的私有模型和强大的开放权重模型,在仅依赖文本提示时表现不佳。然而,当提供中间视觉提示时,模型性能持续提高,所有模型和任务的平均相对增益为 33.7%。我们还通过扩展搜索空间和设计与 Visual-CoT 一致的文本提示来探测上限,但与我们的 Visual-CoT 设置相比,两者仅产生有限的改进。这些结果强调了想象的视觉信息在 MIRA 成功推理中的关键作用。
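
A minimal sketch of the pass@k metric reported above, using the standard unbiased estimator 1 - C(n-c, k)/C(n, k) over n sampled attempts of which c are correct; the per-problem tallies below are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts drawn
    (without replacement) from n attempts with c correct ones is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-problem (n attempts, c correct) tallies; values are illustrative only.
results = [(16, 3), (16, 0), (16, 10), (16, 1)]
for k in (1, 4, 8):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```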


UniAVGen:具有非对称跨模式交互的统一音频和视频生成

英文摘要

Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen’s robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

中文摘要

由于缺乏有效的跨模态建模,现有的开源音视频生成方法经常表现出唇形同步受损和语义一致性不足的问题。为了减轻这些缺点,我们提出了 UniAVGen,这是一个用于联合音频和视频生成的统一框架。UniAVGen 以双分支联合合成架构为基础,结合两个并行的扩散 Transformer (DiT) 来构建一个有凝聚力的跨模态潜在空间。其核心在于非对称跨模态交互机制,可实现双向、时间对齐的交叉注意力,从而确保精确的时空同步和语义一致性。此外,这种跨模态交互通过面部感知调制模块得到增强,该模块在交互过程中动态地优先关注显著区域。为了增强推理过程中的生成保真度,我们还引入了模态感知无分类器引导,这是一种显式放大跨模态相关信号的新颖策略。值得注意的是,UniAVGen 强大的联合合成设计可以在单个模型中无缝统一关键的音频-视频任务,例如联合音频-视频生成和续写、视频到音频配音以及音频驱动的视频合成。综合实验验证,在训练样本少得多的情况下(1.3M vs. 30.1M),UniAVGen 在音视频同步、音色一致性和情感一致性方面具有整体优势。
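
A minimal PyTorch sketch of bidirectional cross-attention between audio and video token streams, the basic mechanism the interaction module above builds on: each stream attends to the other and is residually updated. The asymmetric weighting, temporal alignment, and Face-Aware Modulation are not modelled; shapes are illustrative.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalBlock(nn.Module):
    """Audio tokens attend to video tokens and vice versa, then residual-add + norm."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio, video):               # (B, Ta, D), (B, Tv, D)
        a_ctx, _ = self.a2v(query=audio, key=video, value=video)
        v_ctx, _ = self.v2a(query=video, key=audio, value=audio)
        return self.norm_a(audio + a_ctx), self.norm_v(video + v_ctx)

audio = torch.randn(2, 50, 128)    # e.g. 50 audio latent frames
video = torch.randn(2, 24, 128)    # e.g. 24 video latent frames
a_out, v_out = BidirectionalCrossModalBlock()(audio, video)
print(a_out.shape, v_out.shape)    # torch.Size([2, 50, 128]) torch.Size([2, 24, 128])
```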


EBT-Policy:能量释放涌现的物理推理能力

英文摘要

Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy’s 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.

中文摘要

由生成模型参数化的隐式策略(例如扩散策略)已成为机器人领域中策略学习和视觉-语言-动作(VLA)模型的标准。然而,这些方法通常面临计算成本高、暴露偏差和推理动态不稳定的问题,从而导致分布变化下的发散。基于能量的模型 (EBM) 通过端到端学习能量地形并对平衡动态进行建模来解决这些问题,从而提高稳健性并减少暴露偏差。然而,由 EBM 参数化的策略历来难以有效扩展。最近关于基于能量的 Transformer(EBT)的工作证明了 EBM 在高维空间中的可扩展性,但它们解决物理具身模型核心挑战的潜力仍未得到充分探索。我们引入了一种新的基于能量的架构 EBT-Policy,它解决了机器人和现实世界环境中的核心问题。在模拟和现实世界的任务中,EBT-Policy 始终优于基于扩散的策略,同时需要更少的训练和推理计算。值得注意的是,在某些任务上,它只需两个推理步骤即可收敛,与 Diffusion Policy 的 100 步相比减少了 50 倍。此外,EBT-Policy 展现了先前模型中未见的涌现能力,例如仅使用行为克隆且无需显式重试训练即可从失败的动作序列中进行零样本恢复。通过利用其标量能量进行不确定性感知推理和动态计算分配,EBT-Policy 为在分布变化下实现稳健、可泛化的机器人行为提供了一条有前途的道路。
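
A minimal PyTorch sketch of the energy-based inference loop such a policy relies on: start from an initial action and take a few gradient steps that lower a learned scalar energy E(state, action), mirroring the few-step convergence highlighted above. The tiny MLP energy network, dimensions, and step count are assumptions; EBT-Policy's training procedure is not reproduced.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E(state, action); low energy = plausible action."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.SiLU(),
            nn.Linear(64, 64), nn.SiLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def infer_action(energy, state, steps=2, lr=0.1):
    """Few-step energy minimization over the action (cf. the 2-step inference above)."""
    action = torch.zeros(state.size(0), 2, requires_grad=True)
    for _ in range(steps):
        e = energy(state, action).sum()
        (grad,) = torch.autograd.grad(e, action)
        action = (action - lr * grad).detach().requires_grad_(True)
    return action.detach()

energy = EnergyNet()
state = torch.randn(4, 8)
print(infer_action(energy, state))   # refined actions after 2 energy-descent steps
```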


LEGO-Eval:通过工具增强对合成 3D 具身环境进行细粒度评估

英文摘要

Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.

中文摘要

尽管最近在使用大型语言模型 (LLM) 自动生成 3D 场景方面取得了进展,但生成的场景通常缺乏现实环境中的真实空间布局和对象属性。由于这个问题源于不够详细的粗粒度指令,因此在反映现实世界环境的更详细、细粒度指令的指导下推进 3D 场景合成变得至关重要。如果没有这样的现实场景,在不现实的环境中训练具身代理可能会导致它们学习与现实世界的物理和语义显著不同的先验知识,从而降低它们在部署时的性能。因此,验证细粒度指令和生成场景之间的对齐对于有效学习至关重要。然而,当前的评估方法,例如 CLIPScore 和视觉语言模型 (VLM),通常无法可靠地评估这种一致性。这一缺点主要源于它们对 3D 场景的浅层理解,这常常导致场景组件的接地(grounding)不当。为了解决这个问题,我们引入了 LEGO-Eval,这是一个配备多种工具的评估框架,旨在对场景组件进行显式接地,从而实现更准确的对齐评估。我们还推出了 LEGO-Bench,这是一个由详细指令构成的基准,这些指令指定了现实世界环境的复杂布局和属性。实验表明,在评估场景与指令对齐方面,LEGO-Eval 的 F1 分数比 VLM-as-a-judge 高出 0.41。使用 LEGO-Bench 进行基准测试揭示了当前生成方法的重大局限性。在所有评估的方法中,生成与细粒度指令完全一致的场景的成功率最多只有 10%。


将测试时计算最优缩放推广为可优化图

  • 标题: Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
  • 作者: Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang
  • 日期: 2025-10-29
  • ArXiv主页: https://arxiv.org/abs/2511.00086
  • 论文链接: https://arxiv.org/pdf/2511.00086

英文摘要

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.

中文摘要

测试时间缩放 (TTS) 通过在推理期间分配额外的计算(通常通过并行、顺序或混合缩放)来改进大型语言模型 (LLM)。然而,先前的研究通常假设固定的协作架构(例如拓扑)和单一模型的使用,而忽略了最佳架构和模型组合可能因任务而异。因此,我们研究了在固定预算下在 TTS 中搜索计算最优模型组合和架构的新问题。我们将其形式化为多 LLM 协作图,其中节点对角色和 LLM 模型分配进行编码,边捕获信息流。这个问题具有挑战性,因为(i)组合搜索空间非常大,并且(ii)特定于任务的要求需要量身定制的设计。为了解决这些问题,我们将问题重新表述为概率图优化,并通过试点实验得出了 TTS 协作图的三个实证见解。在这些见解的指导下,我们提出了 Agent-REINFORCE,这是一个 LLM 代理增强框架,通过将采样梯度更新映射到采样反馈更新来镜像 REINFORCE 管道,其中反馈充当文本梯度来更新概率图并有效搜索最佳的多 LLM 协作图。实验表明,Agent-REINFORCE 在样本效率和搜索性能方面优于传统和基于 LLM 的基线,并在准确性和推理延迟的联合目标下有效地识别最佳图。
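
A minimal numeric analogue of the sampling-feedback-update loop described above: each candidate edge of the collaboration graph gets an inclusion probability, sampled graphs are scored by a utility that trades accuracy against latency, and a REINFORCE-style score-function update raises the probability of edges that appear in above-average graphs. The synthetic utility and edge set are made up; the paper's LLM-agent, textual-gradient variant is not reproduced.

```python
import torch

n_edges = 6                       # candidate edges in the multi-LLM collaboration graph
logits = torch.zeros(n_edges, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def utility(mask: torch.Tensor) -> float:
    """Synthetic objective: edges 0-2 help accuracy, every edge costs latency."""
    accuracy = 0.3 * mask[:3].sum().item()
    latency_penalty = 0.1 * mask.sum().item()
    return accuracy - latency_penalty

for step in range(300):
    probs = torch.sigmoid(logits)
    dist = torch.distributions.Bernoulli(probs=probs)
    samples = dist.sample((16,))                       # 16 sampled graph topologies
    rewards = torch.tensor([utility(m) for m in samples])
    baseline = rewards.mean()                          # variance reduction
    log_prob = dist.log_prob(samples).sum(dim=-1)
    loss = -((rewards - baseline) * log_prob).mean()   # REINFORCE / score-function gradient
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.sigmoid(logits).detach().numpy().round(2))  # useful edges -> ~1, costly ones -> ~0
```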


使用物理 AI 视频基础模型进行世界模拟

  • 标题: World Simulation with Video Foundation Models for Physical AI
  • 作者: NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu
  • 日期: 2025-10-28
  • ArXiv主页: https://arxiv.org/abs/2511.00062
  • 论文链接: https://arxiv.org/pdf/2511.00062
  • GitHub仓库: https://github.com/nvidia-cosmos/cosmos-transfer2.5

英文摘要

We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5times smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.

中文摘要

我们介绍 [Cosmos-Predict2.5],最新一代面向物理 AI 的 Cosmos 世界基础模型。[Cosmos-Predict2.5] 基于流的架构构建,将 Text2World、Image2World 和 Video2World 生成统一在一个模型中,并利用物理 AI 视觉语言模型 [Cosmos-Reason1] 来提供更丰富的文本基础和对世界模拟的更精细控制。[Cosmos-Predict2.5] 经过 2 亿个精选视频剪辑的训练,并通过基于强化学习的后训练进行了改进,在视频质量和指令对齐方面比 [Cosmos-Predict1] 取得了实质性改进,并以 2B 和 14B 规模发布了模型。这些能力可以为机器人和自主系统提供更可靠的合成数据生成、策略评估和闭环仿真。我们通过 [Cosmos-Transfer2.5] 进一步扩展了该系列,这是一个用于 Sim2Real 和 Real2Real 世界转换的 ControlNet 风格框架。尽管比 [Cosmos-Transfer1] 小 3.5 倍,但它提供了更高的保真度和稳健的长时程视频生成。这些进步共同确立了 [Cosmos-Predict2.5] 和 [Cosmos-Transfer2.5] 作为扩展具身智能的多功能工具。为了加速物理 AI 的研究和部署,我们根据 NVIDIA 开放模型许可证在 https://github.com/nvidia-cosmos/cosmos-predict2.5 和 https://github.com/nvidia-cosmos/cosmos-transfer2.5 上发布了源代码、预训练检查点和精选基准。我们希望这些开放资源能够降低采用门槛,并促进构建下一代具身智能的创新。


Cambrian-S:迈向视频中的空间超感

英文摘要

We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.

中文摘要

我们认为,真正的多模态智能的进步需要从反应性、任务驱动系统和强力长上下文转向更广泛的超感知范式。我们将空间超感知视为超越纯语言理解的四个阶段:语义感知(命名所看到的内容)、流事件认知(在连续体验中维护记忆)、隐式 3D 空间认知(推断像素背后的世界)和预测世界建模(创建过滤和组织信息的内部模型)。目前的基准测试主要只测试早期阶段,提供了狭窄的空间认知覆盖范围,并且很少以需要真实世界建模的方式挑战模型。为了推动空间超感知的进步,我们提出了 VSI-SUPER,这是一个由两部分组成的基准:VSR(长视距视觉空间回忆)和 VSC(连续视觉空间计数)。这些任务需要任意长的视频输入,但可以抵抗暴力上下文扩展。然后,我们通过策划 VSI-590K 和训练 Cambrian-S 来测试数据扩展限制,在 VSI-Bench 上实现 +30% 的绝对改进,而无需牺牲一般功能。然而,VSI-SUPER 的性能仍然有限,这表明仅靠规模不足以实现空间超感知。我们提出预测传感作为前进的道路,提出了一种概念验证,其中自监督的下一个潜在帧预测器利用意外(预测误差)来驱动记忆和事件分割。在 VSI-SUPER 上,这种方法大大优于领先的专有基线,这表明空间超感知需要的模型不仅能够看到而且能够预测、选择和组织经验。
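
A minimal sketch of the surprise-driven idea in the proof-of-concept above: a predictor guesses the next latent frame, the prediction error ("surprise") is monitored, and an event boundary is declared when the error spikes, which is where memory would be consolidated. The synthetic latents, identity predictor, and threshold rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_latents(n_frames=300, dim=16):
    """Latent frames that drift slowly but jump at hidden scene changes."""
    latents, anchor = [], rng.normal(size=dim)
    for t in range(n_frames):
        if t in (100, 200):                      # hidden event boundaries
            anchor = rng.normal(size=dim)
        latents.append(anchor + 0.05 * rng.normal(size=dim))
    return np.stack(latents)

def segment_by_surprise(latents, threshold=2.0):
    """Predict the next frame as the current one; a large error = surprise = boundary."""
    boundaries = []
    for t in range(1, len(latents)):
        surprise = np.linalg.norm(latents[t] - latents[t - 1])   # prediction error
        if surprise > threshold:
            boundaries.append(t)                 # a memory/event segment would be closed here
    return boundaries

print(segment_by_surprise(synthetic_latents()))  # -> [100, 200]
```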


UniREditBench:基于统一推理的图像编辑基准

英文摘要

Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.

中文摘要

多模态生成模型的最新进展推动了图像编辑的显著改进。然而,当前的生成模型仍然难以处理需要隐式推理的多样化且复杂的图像编辑任务,这凸显了需要一个全面的基准来系统地评估其在各种推理场景中的性能。现有的基准主要关注现实场景中的单对象属性转换,虽然有效,但遇到两个关键挑战:(1)它们在很大程度上忽视了多对象交互以及涉及人类定义规则的游戏世界场景,而这在现实生活应用中很常见;(2)它们仅依赖文本参考来评估生成的图像,可能导致系统性误判,尤其是在复杂的推理场景中。为此,本工作提出了 UniREditBench,一个用于基于推理的图像编辑评估的统一基准。它包含 2,700 个精心策划的样本,涵盖 8 个主要维度和 18 个子维度的现实和游戏世界场景。为了提高评估的可靠性,我们引入了多模态双参考评估,为每个样本评估提供文本和真实图像参考。此外,我们设计了一个自动化的多场景数据合成管道,并构建了 UniREdit-Data-100K,这是一个具有高质量思维链(CoT)推理注释的大规模合成数据集。我们在此数据集上微调 Bagel 并开发了 UniREdit-Bagel,在域内和分布外设置中均展示了显著改进。通过对开源和闭源图像编辑模型进行彻底的基准测试,我们揭示了它们在各个方面的优点和缺点。


UniLumos:快速、统一的图像和视频重新照明与物理合理的反馈

英文摘要

Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.

中文摘要

重新照明是一项兼具实用需求和艺术价值的关键任务,最近的扩散模型通过实现丰富且可控的照明效果而显示出强大的潜力。然而,由于它们通常在语义潜在空间中进行优化,而邻近性并不能保证视觉空间中的物理正确性,因此它们通常会产生不切实际的结果,例如过度曝光的高光、未对齐的阴影和不正确的遮挡。我们通过 UniLumos 解决了这个问题,这是一个适用于图像和视频的统一重新照明框架,可将 RGB 空间几何反馈引入流匹配主干。通过使用从输出中提取的深度和法线贴图来监督模型,我们明确地将光照效果与场景结构对齐,从而增强了物理合理性。然而,这种反馈需要高质量的输出来进行视觉空间的监督,这使得标准的多步去噪计算成本高昂。为了缓解这种情况,我们采用路径一致性学习,即使在几步训练制度下,监督也能保持有效。为了实现细粒度的重新照明控制和监督,我们设计了一个结构化的六维注释协议来捕获核心照明属性。在此基础上,我们提出了 LumosBench,这是一种解开的属性级基准,可通过大型视觉语言模型评估照明可控性,从而能够自动且可解释地评估各个维度的重新照明精度。大量实验表明,UniLumos 实现了最先进的重新照明质量,显着提高了物理一致性,同时将图像和视频重新照明的速度提高了 20 倍。代码可在 https://github.com/alibaba-damo-academy/Lumos-Custom 获取。


视觉模型在图结构理解方面的力量未被充分认识

英文摘要

Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models’ ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.

中文摘要

图神经网络通过自下而上的消息传递进行操作,这与人类视觉感知有根本不同,人类视觉感知首先直观地捕获全局结构。我们研究了视觉模型在图理解方面未被充分认识的潜力,发现它们在既定基准上实现了与 GNN 相当的性能,同时表现出明显不同的学习模式。这些不同的行为,再加上现有基准的局限性(将领域特征与拓扑理解混为一谈),促使我们引入 GraphAbstract。该基准评估模型像人类一样感知全局图属性的能力:识别组织原型、检测对称性、感知连接强度和识别关键元素。我们的结果表明,视觉模型在需要整体结构理解并在不同图尺度上保持泛化性的任务上显着优于 GNN,而 GNN 则难以进行全局模式抽象,并随着图大小的增加而退化。这项工作表明,视觉模型在图结构理解方面具有显着但未充分利用的能力,特别是对于需要全局拓扑意识和尺度不变推理的问题。这些发现开辟了新的途径,可以利用这种未被充分认识的潜力,为整体模式识别主导的任务开发更有效的图基础模型。


ROVER:全模态生成的互惠跨模态推理基准

英文摘要

Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.

中文摘要

统一多模态模型 (UMM) 已成为无缝统一文本和图像理解和生成的强大范例。然而,主流的评估是孤立地对待这些能力的,例如具有多模态输入和输出的任务主要通过单模态推理进行评分,即文本基准强调基于语言的推理,而视觉基准则强调像素中体现的推理结果。我们引入 ROVER 是为了解决测试互惠跨模态推理的迫切需求,即使用一种模态来指导、验证或完善另一种模态的输出,这是统一多模态智能愿景的核心能力。ROVER 是一个人工注释的基准测试,明确针对交互跨模态推理,其中包含基于 1876 个图像的 1312 个任务,跨越两个互补的设置。用于视觉生成的言语增强推理评估模型是否可以使用言语提示和推理链来指导忠实的图像合成。用于言语生成的视觉增强推理评估模型是否可以生成中间可视化,以加强其自身的问答推理过程。对 17 个统一模型的实验揭示了两个关键发现:(i)跨模态推理决定视觉生成质量,交错模型的性能显着优于非交错模型;值得注意的是,结合强大的单峰模型无法实现可比较的推理。(ii) 模型显示物理推理和符号推理之间的分离:它们成功地从字面上解释感知概念,但无法为符号任务构建视觉抽象,而错误的推理会损害性能。这些结果强调了交互跨模态推理是实现真正全模态生成的关键前沿。


MR-Align:大型推理模型的元推理知情事实对齐

英文摘要

Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

中文摘要

大型推理模型(LRM)在复杂推理方面表现出强大的能力,但它们在依赖证据的事实问题上的边际收益有限。我们发现这种限制部分归因于推理-答案命中差距,即模型在推理过程中识别出正确的事实,但未能将它们纳入最终的响应中,从而降低了事实保真度。为了解决这个问题,我们提出了 MR-ALIGN,这是一种元推理知情对齐框架,可以在不依赖外部验证者的情况下增强事实性。MR-ALIGN 量化模型思维过程中的状态转换概率,并构建一个转换感知的隐式奖励,以强化有益的推理模式,同时抑制原子思维部分的缺陷推理模式。这种重新加权将标记级信号重塑为概率感知的分段分数,鼓励更有利于事实正确性的连贯推理轨迹。对四个事实 QA 数据集和一个长格式事实性基准的实证评估表明,MR-ALIGN 不断提高准确性和真实性,同时减少误导性推理。这些结果强调,调整推理过程本身,而不仅仅是输出,对于提高 LRM 的真实性至关重要。


通过 FP16 克服训练-推理不匹配问题

英文摘要

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that breaks the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.

中文摘要

由于训练和推理策略之间的数值不匹配,大型语言模型 (LLM) 的强化学习 (RL) 微调经常会出现不稳定的情况。虽然之前的工作试图通过算法修正或工程调整来缓解这个问题,但我们表明其根本原因在于浮点精度本身。广泛采用的 BF16 尽管具有较大的动态范围,但会引入较大的舍入误差,从而破坏了训练和推理之间的一致性。在这项工作中,我们证明了简单地恢复到 FP16 可以有效消除这种不匹配。更改很简单,只需更改几行代码即可得到现代框架的全面支持,并且无需修改模型架构或学习算法。我们的结果表明,统一使用 FP16 可以在不同的任务、算法和框架中产生更稳定的优化、更快的收敛以及更强的性能。我们希望这些发现能够激发人们对强化学习微调中的精度权衡进行更广泛的重新考虑。
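
A minimal PyTorch check of the underlying numerical claim: for values in a typical probability range, BF16's 7-bit mantissa rounds much more coarsely than FP16's 10-bit mantissa, which is the training-inference gap the paper attributes to BF16. Tensor size and value range are arbitrary.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1_000_000, dtype=torch.float32)          # e.g. token probabilities in (0, 1)

for dtype in (torch.bfloat16, torch.float16):
    roundtrip = x.to(dtype).to(torch.float32)           # cast down, then back up
    err = (x - roundtrip).abs()
    print(f"{str(dtype):16s} mean abs rounding error = {err.mean().item():.2e}, "
          f"max = {err.max().item():.2e}")

# Expected ordering: bfloat16 errors are roughly 8x larger than float16 errors here
# (3 fewer mantissa bits), illustrating why a policy evaluated in one precision and
# updated in another drifts less under FP16.
```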


PHUMA:基于物理的人形运动数据集

英文摘要

Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.

中文摘要

运动模仿是一种很有前途的人形运动方法,使智能体能够获得类人行为。现有方法通常依赖于高质量的动作捕捉数据集,例如 AMASS,但这些数据集稀缺且昂贵,限制了可扩展性和多样性。最近的研究试图通过转换大规模互联网视频来扩展数据收集,以 Humanoid-X 为例。然而,它们经常引入物理伪影,例如漂浮、穿透和脚滑,这阻碍了稳定的模仿。作为回应,我们引入了 PHUMA,一个基于物理的 HUMAnoid 运动数据集,它大规模利用人类视频,同时通过仔细的数据管理和物理约束的重定向来解决物理伪影。PHUMA 强制关节限制,确保地面接触,并消除脚滑,产生大规模且物理可靠的运动。我们在两组条件下评估 PHUMA:(i) 模仿自录测试视频中看不见的运动;(ii) 仅使用骨盆引导进行路径跟踪。在这两种情况下,经过 PHUMA 训练的策略都优于 Humanoid-X 和 AMASS,在模仿各种运动方面取得了显着的成果。该代码可在 https://davian-robotics.github.io/PHUMA 获取。
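
A minimal sketch of two of the physics-oriented checks described above: clamping retargeted joint angles to their limits and flagging foot skating, i.e. horizontal foot motion while the foot is in ground contact. The joint limits, contact rule, and thresholds are illustrative assumptions, not PHUMA's actual curation pipeline.

```python
import numpy as np

def enforce_joint_limits(angles: np.ndarray, lower: np.ndarray, upper: np.ndarray):
    """Clamp per-frame joint angles (T, J) into their allowed ranges."""
    return np.clip(angles, lower, upper)

def foot_skate_frames(foot_pos: np.ndarray, contact_height=0.02, max_slide=0.005):
    """Flag frames where a foot is near the ground yet slides horizontally."""
    heights = foot_pos[:, 2]
    horizontal_step = np.linalg.norm(np.diff(foot_pos[:, :2], axis=0), axis=1)
    in_contact = heights[1:] < contact_height
    return np.where(in_contact & (horizontal_step > max_slide))[0] + 1

T, J = 120, 3
rng = np.random.default_rng(0)
angles = rng.uniform(-2.5, 2.5, size=(T, J))
lower, upper = np.full(J, -2.0), np.full(J, 2.0)
clamped = enforce_joint_limits(angles, lower, upper)

foot = np.zeros((T, 3))
foot[:, 0] = np.linspace(0, 1.2, T)        # foot translates forward the whole time
foot[:60, 2] = 0.1                          # airborne in the first half
foot[60:, 2] = 0.0                          # on the ground in the second half -> skating
print("joint-limit violations clamped:", int((angles != clamped).sum()))
print("skating frames flagged:", len(foot_skate_frames(foot)))
```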

