【论文速递】2025年第47周(Nov-16-22)(Robotics/Embodied AI/LLM)
中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- Kandinsky 5.0:图像和视频生成的一系列基础模型
- MiroThinker:通过模型、上下文和交互式扩展突破开源研究代理的性能界限
- Souper-Model:简单算术如何解锁最先进的 LLM 性能
- P1:通过强化学习掌握物理奥林匹克竞赛
- VIDEOP2R:从感知到推理的视频理解
- SAM 3D: 3Dfy Anything in Images
- Agent0:通过工具集成推理从零数据中释放自我进化代理
- Think-at-Hard:通过选择性潜在迭代改进推理语言模型
- Uni-MoE-2.0-Omni:利用高级 MoE、训练和数据扩展以语言为中心的全模态大型模型
- DoPE:旋转位置嵌入去噪
- 以视频进行推理:首次通过迷宫求解任务评估视频模型的推理能力
- AraLingBench 用于评估大型语言模型的阿拉伯语语言能力的人工注释基准
- Part-X-MLLM:部件感知 3D 多模态大语言模型
- MMaDA-Parallel:用于思维感知编辑和生成的多模态大扩散语言模型
- 回到基础:让去噪生成模型去噪
- 一种风格值得一个代码:用离散风格空间解锁代码到风格的图像生成
- 成为一名优秀的人工智能研究代理需要什么?探究构思多样性的作用
- GroupRank:强化学习驱动的分组重排序范式
- V-ReasonBench:面向视频生成模型的统一推理基准套件
- Step-Audio-R1技术报告
- 第一帧是视频内容定制的最佳选择
- PhysX-Anything:来自单个图像的模拟就绪物理 3D 资产
- 使用多模态基础模型扩展空间智能
- WEAVE:释放上下文交错理解和生成并对其进行基准测试
- VisPlay:来自图像的自我进化视觉语言模型
- TiViBench:视频生成模型的视频思考推理基准测试
- 大型语言模型遇上极端多标签分类:扩展与多模态框架
- 改进方法,而不是提示:LLM 越狱攻击的进化式合成
- 虚拟宽度网络
Kandinsky 5.0:图像和视频生成的一系列基础模型
- 标题: Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- 作者: Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov
- 日期: 2025-11-19
- ArXiv主页: https://arxiv.org/abs/2511.14993
- 论文链接: https://arxiv.org/pdf/2511.14993
- 项目链接: https://kandinskylab.ai/
- gitHub仓库: https://github.com/kandinskylab/kandinsky-5
英文摘要
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
中文摘要
本报告介绍了 Kandinsky 5.0,这是一系列用于高分辨率图像和 10 秒视频合成的最先进的基础模型。该框架包括三个核心模型系列:Kandinsky 5.0 Image Lite - 一系列 6B 参数图像生成模型、Kandinsky 5.0 Video Lite - 快速、轻量级的 2B 参数文本到视频和图像到视频模型,以及 Kandinsky 5.0 Video Pro - 实现卓越视频生成质量的 19B 参数模型。我们对多阶段训练管线的数据整理生命周期(包括收集、处理、过滤和聚类)进行了全面回顾,该管线涉及大规模预训练,并结合了自监督微调 (SFT) 和基于强化学习 (RL) 的后训练等质量增强技术。我们还提出了新颖的架构、训练和推理优化,使 Kandinsky 5.0 能够在各类任务中实现高生成速度和最先进的性能,这一点已由人工评估证实。作为一个大规模、公开可用的生成框架,Kandinsky 5.0 充分利用其预训练和后续阶段的潜力,可适配广泛的生成应用。我们希望这份报告,连同我们开源代码和训练检查点的发布,将大大促进研究界高质量生成模型的发展与可及性。
MiroThinker:通过模型、上下文和交互式扩展突破开源研究代理的性能界限
- 标题: MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
- 作者: MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu
- 日期: 2025-11-14
- ArXiv主页: https://arxiv.org/abs/2511.11793
- 论文链接: https://arxiv.org/pdf/2511.11793
- 项目链接: https://dr.miromind.ai/
- gitHub仓库: https://github.com/MiroMindAI/MiroThinker
英文摘要
We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks-GAIA, HLE, BrowseComp, and BrowseComp-ZH-the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.
中文摘要
我们推出了 MiroThinker v1.0,这是一个开源研究代理,旨在提升工具增强推理与信息检索能力。与以往仅扩大模型规模或上下文长度的代理不同,MiroThinker 在模型层面探索交互扩展,系统地训练模型处理更深、更频繁的代理-环境交互,并将其作为性能提升的第三个维度。与孤立运行、且在推理链过长时存在退化风险的 LLM 测试时扩展不同,交互扩展利用环境反馈和外部信息获取来纠正错误并完善轨迹。通过强化学习,该模型实现了高效的交互扩展:在 256K 上下文窗口下,每个任务最多可执行 600 次工具调用,从而支持持续的多轮推理和复杂的现实研究工作流。在四个代表性基准(GAIA、HLE、BrowseComp 和 BrowseComp-ZH)上,72B 版本分别取得高达 81.9%、37.7%、47.1% 和 55.6% 的准确率,超越了此前的开源代理,并接近 GPT-5-high 等商业产品。我们的分析表明,MiroThinker 能够稳定地从交互扩展中获益:随着模型进行更深入、更频繁的代理-环境交互,研究性能可预测地提升,说明交互深度表现出与模型规模和上下文长度类似的扩展规律。这些发现将交互扩展确立为构建下一代开放研究代理的第三个关键维度,与模型容量和上下文窗口形成互补。
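下面给出一个极简的 Python 草图(非论文官方实现,call_llm、tools 等接口均为假设),用于说明“交互扩展”的基本形态:代理在工具调用预算内反复与环境交互,并把环境反馈写回上下文,以便纠错和完善轨迹。

```python
# 示意性草图:在固定的工具调用预算内进行多轮"代理-环境"交互(接口均为假设,仅用于说明思路)

def run_research_agent(task: str, call_llm, tools: dict, max_tool_calls: int = 600) -> str:
    """call_llm(messages) 约定返回 {"tool": 名称, "args": 参数字典} 或 {"answer": 最终答案}(假设的接口)。"""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_tool_calls):
        decision = call_llm(messages)                  # 模型决定:继续调用工具,还是给出最终答案
        if "answer" in decision:
            return decision["answer"]
        name, args = decision["tool"], decision.get("args", {})
        observation = tools[name](**args)              # 环境反馈:搜索结果、网页内容、代码执行输出等
        messages.append({"role": "assistant", "content": f"调用 {name}({args})"})
        messages.append({"role": "tool", "content": str(observation)})  # 反馈写回上下文,供下一轮纠错与改进轨迹
    return "(在工具调用预算内未得到最终答案)"
```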
Souper-Model:简单算术如何解锁最先进的 LLM 性能
- 标题: Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
- 作者: Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13254
英文摘要
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping-the practice of averaging weights from multiple models of the same architecture-has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies “expert” models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
中文摘要
大型语言模型 (LLM) 在不同领域展示了卓越的能力,但它们的训练仍然是资源和时间密集型的,需要大量的计算能力和对训练流程的精心编排。模型汤(model souping,即对同一架构的多个模型的权重进行平均的做法)已成为一种有前途的预训练和后训练技术,可以在无需昂贵重新训练的情况下提升性能。在本文中,我们介绍了类别专家汤 (SoCE),这是一种有原则的模型汤方法,它利用基准的组成结构来识别最优候选模型,并应用非均匀加权平均来最大化性能。与以往的均匀平均方法不同,我们的方法利用了如下观察:不同基准类别上的模型性能往往呈现较低的相互相关性。SoCE 为每个弱相关的类别簇识别“专家”模型,并使用优化后的加权平均而非均匀权重将它们组合起来。我们证明,所提出的方法在多语言能力、工具调用和数学等多个领域提升了性能和鲁棒性,并在伯克利函数调用排行榜(Berkeley Function Calling Leaderboard)上取得了最先进的结果。
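下面是一个最小的 PyTorch 草图,演示摘要中“非均匀加权平均(模型汤)”的核心算术;其中专家检查点的文件名与权重数值均为假设,实际的类别划分与权重搜索流程以论文为准。

```python
import torch

def soup_state_dicts(state_dicts, weights):
    """对同一架构的多个模型参数做加权平均(模型汤);weights 需归一化,使其和为 1。"""
    assert abs(sum(weights) - 1.0) < 1e-6, "权重应当和为 1"
    souped = {}
    for name, tensor in state_dicts[0].items():
        souped[name] = sum(w * sd[name].to(torch.float32) for w, sd in zip(weights, state_dicts)).to(tensor.dtype)
    return souped

# 用法示意(假设已为弱相关的基准类别分别选出"专家"检查点,非均匀权重由验证集搜索得到):
# experts = [torch.load(p, map_location="cpu") for p in ["expert_math.pt", "expert_tools.pt", "expert_multilingual.pt"]]
# model.load_state_dict(soup_state_dicts(experts, weights=[0.5, 0.3, 0.2]))
```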
P1:通过强化学习掌握物理奥林匹克竞赛
- 标题: P1: Mastering Physics Olympiads with Reinforcement Learning
- 作者: Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13612
- 论文链接: https://arxiv.org/pdf/2511.13612
- 项目链接: https://prime-rl.github.io/P1/
- gitHub仓库: https://github.com/PRIME-RL/P1
英文摘要
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalibility of P1 series.
中文摘要
大型语言模型 (LLM) 的最新进展已将前沿从解谜推进到科学级推理,即解决那些答案必须经得起自然检验、而不仅仅是符合评分标准的问题。物理学是对这种转变最严格的检验:它以最根本的方式将符号与现实绑定,是大多数现代技术的基石。在这项工作中,我们通过开发具有卓越物理推理能力、尤其擅长解决奥林匹克级别物理问题的大型语言模型来推进物理研究。我们介绍 P1,这是一系列完全通过强化学习 (RL) 训练的开源物理推理模型。其中,P1-235B-A22B 是首个在最新一届国际物理奥林匹克(IPhO 2025)上达到金牌水平的开源模型,并在 2024/2025 年的 13 项国际/地区物理竞赛中获得 12 枚金牌。P1-30B-A3B 也在 IPhO 2025 上超越了几乎所有其他开源模型,获得银牌。进一步搭配代理框架 PhysicsMinions 后,P1-235B-A22B+PhysicsMinions 在 IPhO 2025 上取得总分第一,并在 13 项物理竞赛中获得最高平均分。除物理之外,P1 模型在数学和编程等其他推理任务上也表现出色,显示了 P1 系列强大的泛化能力。
VIDEOP2R:从感知到推理的视频理解
- 标题: VIDEOP2R: Video Understanding from Perception to Reasoning
- 作者: Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan
- 日期: 2025-11-14
- ArXiv主页: https://arxiv.org/abs/2511.11113
- 论文链接: https://arxiv.org/pdf/2511.11113
英文摘要
Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model’s perception output is information-sufficient for downstream reasoning.
中文摘要
强化微调(RFT)是一个由监督微调(SFT)和强化学习(RL)组成的两阶段框架,在提升大型语言模型(LLM)推理能力方面已显示出良好效果。然而,将 RFT 扩展到大型视频语言模型 (LVLM) 仍然具有挑战性。我们提出 VideoP2R,一种新颖的过程感知视频 RFT 框架,它通过将感知和推理建模为两个不同的过程来增强视频推理。在 SFT 阶段,我们开发了一个三步管线来生成 VideoP2R-CoT-162K,这是一个面向感知与推理的高质量、过程感知的思维链 (CoT) 数据集。在 RL 阶段,我们引入了一种新颖的过程感知组相对策略优化(PA-GRPO)算法,为感知和推理分别提供奖励。大量实验表明,VideoP2R 在七个视频推理与理解基准中的六个上取得了最先进的 (SotA) 性能。消融研究进一步证实了过程感知建模和 PA-GRPO 的有效性,并表明模型的感知输出为下游推理提供了充分的信息。
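以下是一个示意性的 NumPy 片段,说明“为感知和推理分别提供奖励”的一种朴素组合方式:对同一问题的一组 rollout,分别对两种奖励做组内标准化后再相加作为优势。具体的奖励设计与组合方式是这里的假设,细节以论文为准。

```python
import numpy as np

def pa_grpo_advantages(perception_rewards, reasoning_rewards, eps=1e-6):
    """对同一条样本的一组 rollout:感知奖励与推理奖励分别做 GRPO 式的组内标准化,再求和作为优势。"""
    advantages = []
    for rewards in (perception_rewards, reasoning_rewards):
        r = np.asarray(rewards, dtype=float)
        advantages.append((r - r.mean()) / (r.std() + eps))   # 组相对优势:减组均值、除组标准差
    return advantages[0] + advantages[1]

# 用法示意:8 条 rollout,感知段与推理段分别打分(例如感知描述是否覆盖关键事件、最终答案是否正确)
adv = pa_grpo_advantages([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1, 0, 0])
```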
SAM 3D: 3Dfy Anything in Images
- 标题: SAM 3D: 3Dfy Anything in Images
- 作者: SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik
- 日期: 2025-11-20
- ArXiv主页: https://arxiv.org/abs/2511.16624
- 论文链接: https://arxiv.org/pdf/2511.16624
- 项目链接: https://ai.meta.com/sam3d/
- gitHub仓库: https://github.com/facebookresearch/sam-3d-objects
英文摘要
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D “data barrier”. We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
中文摘要
我们提出了 SAM 3D,这是一种用于视觉锚定的 3D 物体重建的生成模型,可从单张图像预测几何形状、纹理和布局。SAM 3D 尤其擅长自然图像,这类图像中遮挡和场景杂乱很常见,来自上下文的视觉识别线索发挥着更大的作用。我们通过一个人类与模型协同在环的管线来标注物体形状、纹理和姿态,以前所未有的规模提供视觉锚定的 3D 重建数据。我们在一个现代的多阶段训练框架中利用这些数据学习,该框架将合成数据预训练与真实世界对齐相结合,打破了 3D“数据壁垒”。与近期工作相比,我们取得了显著提升,在针对真实世界物体和场景的人类偏好测试中胜率至少为 5:1。我们将发布代码和模型权重、在线演示,以及一个新的、具有挑战性的野外 3D 物体重建基准。
Agent0:通过工具集成推理从零数据中释放自我进化代理
- 标题: Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
- 作者: Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao
- 日期: 2025-11-20
- ArXiv主页: https://arxiv.org/abs/2511.16043
- gitHub仓库: https://github.com/aiming-lab/Agent0
英文摘要
Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model’s inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor’s problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming-lab/Agent0.
中文摘要
通常使用强化学习 (RL) 训练的大型语言模型 (LLM) 代理受制于对人工整理数据的依赖,这限制了可扩展性,并使人工智能被束缚在人类知识的范围内。现有的自进化框架提供了一种替代方案,但通常受限于模型自身的固有能力和单轮交互,阻碍了涉及工具使用或动态推理的复杂课程的发展。我们引入 Agent0,这是一个完全自主的框架,通过多步协同进化和无缝的工具集成,无需外部数据即可进化出高性能代理。Agent0 在由同一基础 LLM 初始化的两个代理之间建立共生竞争:一个提出越来越具挑战性的前沿任务的课程代理,以及一个学习解决这些任务的执行代理。我们集成外部工具以增强执行代理解决问题的能力;这种提升反过来又促使课程代理构建更复杂、更具工具感知的任务。通过这一迭代过程,Agent0 建立起一个自我强化的循环,持续产出高质量课程。实验上,Agent0 大幅提升了推理能力,使 Qwen3-8B-Base 模型在数学推理上提高 18%,在通用推理基准上提高 24%。代码可在 https://github.com/aiming-lab/Agent0 获取。
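下面用一个示意性的控制流草图说明摘要描述的共生演化循环(课程代理出题、执行代理借助工具解题、双方各自用 RL 更新);propose_task、solve、rl_update 等接口均为假设,并非官方实现(官方代码见上方仓库)。

```python
# 示意性草图:课程代理与执行代理的共生演化循环(接口均为假设,仅用于说明迭代结构)

def agent0_coevolve(curriculum_agent, executor_agent, tools, rounds=5, tasks_per_round=64):
    for _ in range(rounds):
        tasks = [curriculum_agent.propose_task() for _ in range(tasks_per_round)]   # 课程代理提出前沿任务
        records = []
        for task in tasks:
            trajectory = executor_agent.solve(task, tools=tools)      # 执行代理进行工具集成推理
            records.append((task, trajectory, trajectory.success))    # 以可自动判定的成功信号作为奖励来源
        executor_agent.rl_update([(t, traj) for t, traj, _ in records])    # 执行代理:奖励解题成功
        curriculum_agent.rl_update([(t, ok) for t, _, ok in records])      # 课程代理:奖励"可解但不容易"的任务
```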
Think-at-Hard:通过选择性潜在迭代改进推理语言模型
- 标题: Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
- 作者: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
- 日期: 2025-11-11
- ArXiv主页: https://arxiv.org/abs/2511.08577
- gitHub仓库: https://github.com/thu-nics/TaH
英文摘要
Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
中文摘要
提升大型语言模型(LLM)的推理能力,尤其是在参数受限的情况下,对于实际应用至关重要。先前的工作提出了循环 Transformer,为每个 token 分配固定次数的额外迭代以提升生成质量。在第一次标准前向传递之后,最后一层隐藏状态不经语言化,而是作为额外迭代的输入反馈给模型,以细化 token 预测。然而,我们发现了一种潜在的“过度思考”现象:第一次前向传递后本已正确的简单 token 预测,有时会在额外迭代中被改成错误。为了解决这个问题,我们提出了 Think-at-Hard (TaH),一种动态潜在思考方法,仅在困难 token 上进行更深的迭代。它采用一个轻量级神经决策器,仅在标准前向传递后可能预测错误的 token 处触发潜在迭代。在潜在迭代期间,低秩适应 (LoRA) 模块将 LLM 的目标从通用的下一个 token 预测转向聚焦的困难 token 精炼。我们进一步引入了一种双因果注意力机制,将注意力从 token 序列维度扩展到额外的迭代深度维度,在保持完全序列并行性的同时实现跨迭代的信息流动。实验表明,TaH 在五个具有挑战性的基准上提升了 LLM 推理性能,同时保持参数量不变。与对所有输出 token 都迭代两次的基线相比,TaH 带来 8.1-11.3% 的准确率提升,同时让 94% 的 token 免于第二次迭代。与使用相同数据微调的强单次迭代 Qwen3 模型相比,它也带来 4.0-5.0% 的准确率提升。当允许 LoRA 和迭代决策器引入不到 3% 的额外参数时,增益分别提升到 8.5-12.6% 和 5.3-5.4%。我们的代码可在 https://github.com/thu-nics/TaH 获取。
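下面是一个高度简化的解码循环草图,说明“仅在困难 token 处触发潜在迭代”的控制流;其中 forward_hidden、refine_hidden、lora_adapter.enabled 等接口均为假设,并非官方实现(官方代码见上方 gitHub 仓库)。

```python
import torch

@torch.no_grad()
def decode_with_selective_latent_iteration(model, decider, lora_adapter, input_ids,
                                           max_new_tokens=64, threshold=0.5):
    """标准前向得到隐状态后,由轻量决策器判断当前 token 是否"困难";仅对困难 token 启用 LoRA 分支再迭代(假设 batch=1)。"""
    for _ in range(max_new_tokens):
        hidden = model.forward_hidden(input_ids)            # 假设接口:返回最后一层隐状态 [B, T, D]
        h_last = hidden[:, -1]                              # 当前待预测位置的隐状态
        if torch.sigmoid(decider(h_last)).item() > threshold:    # 决策器判定该 token 可能出错,触发潜在迭代
            with lora_adapter.enabled():                    # LoRA 把目标从通用下一词预测切换为困难 token 精炼
                h_last = model.refine_hidden(h_last, hidden)      # 假设接口:以隐状态为输入再前向一次
        next_id = model.lm_head(h_last).argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=1)
    return input_ids
```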
Uni-MoE-2.0-Omni:利用高级 MoE、训练和数据扩展以语言为中心的全模态大型模型
- 标题: Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
- 作者: Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang
- 日期: 2025-11-16
- ArXiv主页: https://arxiv.org/abs/2511.12609
- 论文链接: https://arxiv.org/pdf/2511.12609
- 项目链接: https://idealistxy.github.io/Uni-MoE-v2.github.io/
- gitHub仓库: https://github.com/HITsz-TMG/Uni-MoE
英文摘要
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
中文摘要
我们推出来自 Lychee 家族的 Uni-MoE 2.0。作为完全开源的全模态大模型(OLM),它在以语言为中心的多模态理解、推理和生成方面大幅推进了 Lychee 的 Uni-MoE 系列。基于 Qwen2.5-7B 稠密架构,我们通过三个核心贡献从头构建 Uni-MoE-2.0-Omni:动态容量的专家混合 (MoE) 设计、以迭代强化策略增强的渐进式训练策略,以及精心构建的多模态数据匹配技术。它既能进行全模态理解,也能生成图像、文本和语音。在架构上,我们新的 MoE 框架使用共享专家、路由专家和空专家来平衡 10 种跨模态输入的计算效率与能力,而全模态 3D RoPE 确保自注意力层中的时空跨模态对齐。在训练上,跨模态预训练之后,我们采用渐进式监督微调策略激活特定模态的专家,并通过均衡的数据配比和迭代式 GSPO-DPO 方法加以增强,以稳定 RL 训练并改进推理。在数据方面,基础模型在约 75B 个开源多模态数据 token 上训练,并配备了特殊的语音和图像生成 token,使其能够通过以语言线索为条件来学习这些生成任务。对 85 个基准的广泛评估表明,我们的模型相对领先的 OLM 达到 SOTA 或高度有竞争力的性能,在 76 个基准中的 50 余个上超越了 Qwen2.5-Omni(其训练使用了 1.2T token)。主要优势包括视频理解(8 项平均 +7%)、全模态理解(4 项平均 +7%)和视听推理(+4%)。它还在长语音处理上取得进展(将 WER 降低 4.2%),并在低层图像处理和可控生成的 5 项指标上处于领先地位。
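下面用 PyTorch 给出一个通用的“共享/路由/空专家”MoE 层草图,仅用于说明这三类专家的分工方式;具体的动态容量设计、专家数量与路由细节以论文为准。为清晰起见,这里对所有路由专家做了稠密计算再按门控加权,真实实现通常按 top-k 稀疏分发。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRoutedNullMoE(nn.Module):
    """示意:共享专家对所有 token 生效;路由专家按 top-k 门控加权;"空专家"无参数、输出为零,被选中即近似跳过计算。"""
    def __init__(self, dim, n_routed=8, n_null=2, n_shared=1, k=2, mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(dim, dim * mult), nn.GELU(), nn.Linear(dim * mult, dim))
        self.shared = nn.ModuleList([ffn() for _ in range(n_shared)])
        self.routed = nn.ModuleList([ffn() for _ in range(n_routed)])
        self.router = nn.Linear(dim, n_routed + n_null)    # 末尾 n_null 个 logit 对应空专家
        self.k = k

    def forward(self, x):                                  # x: [N, dim],已展平的 token
        out = sum(expert(x) for expert in self.shared)     # 共享专家:始终启用
        probs = F.softmax(self.router(x), dim=-1)          # [N, n_routed + n_null]
        top_p, top_i = probs.topk(self.k, dim=-1)          # 每个 token 选 k 个专家(可能包含空专家)
        for e, expert in enumerate(self.routed):
            # 若专家 e 被该 token 选中则取对应门控权重,否则为 0;空专家没有分支,等价于零输出
            gate = (top_p * (top_i == e)).sum(dim=-1, keepdim=True)   # [N, 1]
            out = out + gate * expert(x)
        return out
```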
DoPE:旋转位置嵌入去噪
- 标题: DoPE: Denoising Rotary Position Embedding
- 作者: Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong
- 日期: 2025-11-12
- ArXiv主页: https://arxiv.org/abs/2511.09146
- 论文链接: https://arxiv.org/pdf/2511.09146
- 项目链接: https://The-physical-picture-of-LLMs.github.io
英文摘要
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is Project: https://The-physical-picture-of-LLMs.github.io
中文摘要
Transformer 模型中的旋转位置嵌入 (RoPE) 存在削弱长度外推能力的固有限制。我们将带有位置编码的注意力图重新解释为带噪的特征图,并提出去噪位置编码(DoPE),这是一种基于截断矩阵熵的免训练方法,用于检测特征图中的离群频带。利用特征图的噪声特性,我们进一步用无参数的高斯分布对其进行重参数化,以实现鲁棒的外推。我们的方法从理论上揭示了注意力下沉(attention sink)现象的根本成因及其与截断矩阵熵的联系。在“大海捞针”和多样本上下文学习任务上的实验表明,DoPE 显著提升了扩展上下文(最长 64K token)下的检索准确率和推理稳定性。结果表明,针对位置嵌入的去噪策略有效缓解了注意力下沉并恢复了均衡的注意力模式,为改进长度泛化提供了一种简单而强大的解决方案。我们的项目页面:https://The-physical-picture-of-LLMs.github.io
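下面是一个粗略的示意片段,按常见的“截断谱熵”思路计算注意力矩阵的截断矩阵熵,用以理解“检测离群频带/注意力下沉”的信号来源;论文中截断矩阵熵的确切定义与使用方式以原文为准。

```python
import torch

def truncated_matrix_entropy(attn_matrix: torch.Tensor, k: int = 8) -> torch.Tensor:
    """对单个注意力头的注意力矩阵 A [T, T]:取前 k 个奇异值,归一化后计算谱熵(示意性定义)。"""
    s = torch.linalg.svdvals(attn_matrix.float())[:k]      # 截断:仅保留前 k 个奇异值
    p = s / s.sum()
    return -(p * torch.log(p + 1e-12)).sum()

# 用法示意:逐头计算截断矩阵熵;熵异常(例如能量高度集中于个别奇异值)的头/频带
# 可视为被噪声或注意力下沉主导,随后可按论文做法用无参数高斯分布对其重参数化。
# entropies = torch.stack([truncated_matrix_entropy(A_h) for A_h in attn_per_head])
```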
以视频进行推理:首次通过迷宫求解任务评估视频模型的推理能力
- 标题: Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks
- 作者: Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu
- 日期: 2025-11-19
- ArXiv主页: https://arxiv.org/abs/2511.15065
- 论文链接: https://arxiv.org/pdf/2511.15065
- 项目链接: https://imyangc7.github.io/VRBench_Web/
- gitHub仓库: https://github.com/ImYangC7/VR-Bench
英文摘要
Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench – a comprehensive benchmark designed to systematically evaluate video models’ reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video model. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10–20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
中文摘要
视频模型在具有连贯运动动态的高保真视频生成方面取得了显著成功。类比语言建模中从文本生成到基于文本推理的发展,视频模型的发展促使我们提出一个问题:视频模型能否通过视频生成来推理?与离散的文本语料相比,视频将推理植根于明确的空间布局和时间连续性之中,是空间推理的理想载体。在这项工作中,我们探索“以视频进行推理”(reasoning via video)范式,并提出 VR-Bench,这是一个旨在系统评估视频模型推理能力的综合基准。VR-Bench 以本质上需要空间规划和多步推理的迷宫求解任务为基础,包含 7,920 个程序化生成的视频,覆盖五种迷宫类型和多样的视觉风格。我们的实证分析表明,SFT 可以有效激发视频模型的推理能力。视频模型在推理过程中表现出更强的空间感知,优于领先的 VLM,并能很好地泛化到不同的场景、任务和复杂度。我们进一步发现了测试时扩展效应:推理阶段的多样化采样可将推理可靠性提高 10–20%。这些发现凸显了以视频进行推理在空间推理任务上的独特潜力和可扩展性。
AraLingBench 用于评估大型语言模型的阿拉伯语语言能力的人工注释基准
- 标题: AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
- 作者: Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem
- 日期: 2025-11-18
- ArXiv主页: https://arxiv.org/abs/2511.14295
英文摘要
We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
中文摘要
我们推出 AraLingBench:一个完全由人工标注的基准,用于评估大型语言模型 (LLM) 的阿拉伯语语言能力。该基准通过 150 道专家设计的多项选择题,覆盖语法、词法、拼写、阅读理解和句法五个核心类别,直接考察结构化的语言理解。对 35 个阿拉伯语及双语 LLM 的评估表明,当前模型表现出较强的表层能力,但在更深层的语法和句法推理上仍有困难。AraLingBench 凸显了知识型基准高分与真正语言掌握之间长期存在的差距,表明许多模型是依靠记忆或模式识别而非真实理解取得成功。通过分离并度量基础语言技能,AraLingBench 为开发阿拉伯语 LLM 提供了一个诊断框架。完整的评估代码已在 GitHub 上公开。
Part-X-MLLM:部件感知 3D 多模态大语言模型
- 标题: Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
- 作者: Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13647
- 论文链接: https://arxiv.org/pdf/2511.13647
- 项目链接: https://chunshi.wang/Part-X-MLLM/
- gitHub仓库: https://github.com/AiEson/Part-X-MLLM
英文摘要
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
中文摘要
我们引入 Part-X-MLLM,一种原生 3D 多模态大语言模型,它把多样的 3D 任务表述为一种结构化、可执行语法中的程序,从而将这些任务统一起来。给定 RGB 点云和自然语言提示,我们的模型以自回归方式生成单一、连贯的 token 序列,其中编码了部件级边界框、语义描述和编辑命令。这种结构化输出可作为通用接口,驱动下游的几何感知模块进行基于部件的生成与编辑。通过将符号规划与几何合成解耦,我们的方法允许通过单一的、语言原生的前端控制任何兼容的几何引擎。我们预训练了一个双编码器架构以解耦结构与语义,并在大规模、以部件为中心的数据集上对模型进行指令微调。实验表明,我们的模型擅长生成高质量的结构化计划,通过一个统一接口在有依据的问答(grounded Q&A)、组合式生成和局部编辑上实现了最先进的性能。项目页面:https://chunshi.wang/Part-X-MLLM/
MMaDA-Parallel:用于思维感知编辑和生成的多模态大扩散语言模型
- 标题: MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
- 作者: Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
- 日期: 2025-11-12
- ArXiv主页: https://arxiv.org/abs/2511.09611
- 论文链接: https://arxiv.org/pdf/2511.09611
- 项目链接: https://tyfeld.github.io/mmadaparellel.github.io/
- gitHub仓库: https://github.com/tyfeld/MMaDA-Parallel
英文摘要
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel
中文摘要
尽管思维感知(thinking-aware)生成旨在提升复杂任务上的表现,我们发现了一种关键的失败模式:现有的顺序自回归方法可能因错误传播反而降低性能。为系统分析这一问题,我们提出 ParaBench,这是一个同时评估文本与图像两种输出模态的新基准。基于 ParaBench 的分析表明,这种性能退化与生成的推理和最终图像之间的对齐不佳密切相关。为解决这一问题,我们提出并行多模态扩散框架 MMaDA-Parallel,它使文本和图像在整个去噪轨迹中持续进行双向交互。MMaDA-Parallel 先经监督微调训练,再通过并行强化学习 (ParaRL) 进一步优化;后者是一种沿轨迹施加语义奖励以强化跨模态一致性的新策略。实验验证,我们的模型显著提升了跨模态对齐和语义一致性,在 ParaBench 的输出对齐(Output Alignment)指标上比最先进的模型 Bagel 提高 6.9%,为思维感知的图像合成建立了更稳健的范式。我们的代码已开源:https://github.com/tyfeld/MMaDA-Parallel
回到基础:让去噪生成模型去噪
- 标题: Back to Basics: Let Denoising Generative Models Denoise
- 作者: Tianhong Li, Kaiming He
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13720
- gitHub仓库: https://github.com/LTH14/JiT
英文摘要
Today’s denoising diffusion models do not “denoise” in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than “Just image Transformers”, or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
中文摘要
当今的去噪扩散模型并不进行经典意义上的“去噪”,即它们并不直接预测干净的图像,而是预测噪声或某种带噪的量。在本文中,我们指出预测干净数据与预测带噪的量有着本质区别。根据流形假设,自然数据应位于低维流形上,而带噪的量则不然。基于这一假设,我们主张让模型直接预测干净数据,这使得看似容量不足的网络也能在非常高维的空间中有效工作。我们展示了作用于像素的简单大 patch Transformer 可以成为强大的生成模型:不使用 tokenizer、无需预训练、也不引入额外损失。我们的方法在概念上不过是“Just image Transformers”,即我们所称的 JiT。我们报告了 JiT 在 ImageNet 256 和 512 分辨率上、使用 16 和 32 的大 patch 尺寸时具有竞争力的结果,而在这种设定下预测高维带噪量可能会灾难性地失败。通过让网络回到流形的本源,我们的研究回归基础,追求一种在原始自然数据上进行基于 Transformer 扩散的自洽范式。
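下面用一个极简的训练目标片段对比“预测干净数据”与“预测噪声”的差别;加噪方式采用常见的线性插值形式,仅为示意,并非论文的精确公式。

```python
import torch
import torch.nn.functional as F

def x_prediction_loss(model, x0):
    """示意:网络直接回归干净图像 x0(x-prediction),而非预测噪声 eps。"""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)   # t ∈ (0, 1)
    x_t = (1 - t) * x0 + t * noise                  # 线性插值加噪(常见的流匹配式选择,仅作示意)
    x0_pred = model(x_t, t.flatten())               # 关键差别:网络输出与 x0 同形,直接预测干净数据
    return F.mse_loss(x0_pred, x0)                  # 若改为预测噪声,则回归目标换成 noise
```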
一种风格值得一个代码:用离散风格空间解锁代码到风格的图像生成
- 标题: A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
- 作者: Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang
- 日期: 2025-11-13
- ArXiv主页: https://arxiv.org/abs/2511.10555
- 论文链接: https://arxiv.org/pdf/2511.10555
- 项目链接: https://Kwai-Kolors.github.io/CoTyle/
- gitHub仓库: https://github.com/Kwai-Kolors/CoTyle
英文摘要
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.
中文摘要
创新的视觉风格化是艺术创作的基石,但生成新颖且一致的视觉风格仍是一个重大挑战。现有的生成方法通常依赖冗长的文本提示、参考图像或参数高效微调来引导风格感知的图像生成,但往往面临风格一致性不足、创造力受限和风格表示复杂等问题。在本文中,我们通过引入新任务“代码到风格的图像生成”(code-to-style image generation)来印证“一种风格值得一个代码”:仅以一个数字风格代码为条件,生成具有新颖且一致视觉风格的图像。迄今为止,该方向主要由业界探索(例如 Midjourney),学术界尚无开源研究。为填补这一空白,我们提出了 CoTyle,这是该任务的首个开源方法。具体来说,我们首先从图像集合中训练一个离散风格码本以提取风格嵌入,这些嵌入作为文本到图像扩散模型(T2I-DM)生成风格化图像的条件。随后,我们在离散风格嵌入上训练一个自回归风格生成器来建模其分布,从而能够合成新的风格嵌入。推理时,风格生成器将一个数字风格代码映射为唯一的风格嵌入,该嵌入引导 T2I-DM 生成对应风格的图像。与现有方法不同,我们的方法以最少的输入解锁了广阔且可复现的风格空间,简单性与多样性兼备。大量实验验证了 CoTyle 能有效地将一个数字代码变成风格控制器,证明了一种风格值得一个代码。
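以下片段勾勒摘要描述的三段式推理流程(数字风格代码到风格嵌入,再到条件化 T2I 生成);style_generator 与 t2i_model 的接口均为假设,实际用法以上方开源仓库为准。

```python
# 示意性流程:数字风格代码 -> 自回归风格生成器给出风格嵌入 -> 作为条件驱动 T2I 扩散模型(接口均为假设)

def generate_in_style(style_code: int, prompt: str, style_generator, t2i_model, num_images: int = 4):
    style_emb = style_generator.embed(style_code)            # 同一风格代码映射到同一风格嵌入,保证风格可复现
    return [t2i_model.generate(prompt=prompt, style_embedding=style_emb, seed=i)
            for i in range(num_images)]                      # 同一嵌入、不同提示/种子 -> 风格一致的多张图像

# 用法示意:
# images_a = generate_in_style(20250101, "a cat playing guitar", style_generator, t2i_model)
# images_b = generate_in_style(20250101, "a castle at dusk", style_generator, t2i_model)  # 与 images_a 风格一致
```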
成为一名优秀的人工智能研究代理需要什么?探究构思多样性的作用
- 标题: What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
- 作者: Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach
- 日期: 2025-11-19
- ArXiv主页: https://arxiv.org/abs/2511.15593
- 论文链接: https://arxiv.org/pdf/2511.15593
英文摘要
AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.
中文摘要
人工智能研究代理有望通过自动化机器学习模型的设计、实现和训练来加速科学进步。然而,该领域仍处于起步阶段,决定代理轨迹成败的关键因素尚不完全清楚。我们研究了构思多样性(ideation diversity)在代理性能中所起的作用。首先,我们在 MLE-bench(一个评估人工智能研究代理的知名基准)上,针对不同模型和代理脚手架分析了代理轨迹。分析表明,不同的模型和代理脚手架会产生不同程度的构思多样性,而表现更好的代理往往具有更高的构思多样性。进一步地,我们进行了一项控制实验,改变构思多样性的程度,证明更高的构思多样性会带来更强的性能。最后,我们在 MLE-bench 标准的基于奖牌的评分之外考察了额外的评估指标,以强化我们的结论,结果表明我们的发现在其他代理性能指标上同样成立。
GroupRank:强化学习驱动的分组重排序范式
- 标题: GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning
- 作者: Duolin Sun, Meixiu Long, Dan Yang, Yihan Jiao, Zhehao Tan, Jie Feng, Junjie Wang, Yue Shen, Peng Wei, Jian Wang, Jinjie Gu
- 日期: 2025-11-10
- ArXiv主页: https://arxiv.org/abs/2511.11653
- gitHub仓库: https://github.com/AQ-MedAI/Diver
英文摘要
Large Language Models have shown strong potential as rerankers to enhance the overall performance of RAG systems. However, existing reranking paradigms are constrained by a core theoretical and practical dilemma: Pointwise methods, while simple and highly flexible, evaluate documents independently, making them prone to the Ranking Myopia Trap, overlooking the relative importance between documents. In contrast, Listwise methods can perceive the global ranking context, but suffer from inherent List Rigidity, leading to severe scalability and flexibility issues when handling large candidate sets. To address these challenges, we propose Groupwise, a novel reranking paradigm. In this approach, the query and a group of candidate documents are jointly fed into the model, which performs within-group comparisons to assign individual relevance scores to each document. This design retains the flexibility of Pointwise methods while enabling the comparative capability of Listwise methods. We further adopt GRPO for model training, equipped with a heterogeneous reward function that integrates ranking metrics with a distributional reward aimed at aligning score distributions across groups. To overcome the bottleneck caused by the scarcity of high quality labeled data, we further propose an innovative pipeline for synthesizing high quality retrieval and ranking data. The resulting data can be leveraged not only for training the reranker but also for training the retriever. Extensive experiments validate the effectiveness of our approach. On two reasoning intensive retrieval benchmarks, BRIGHT and R2MED.
中文摘要
大型语言模型作为重排序器已展现出提升 RAG 系统整体性能的强大潜力。然而,现有的重排序范式受制于一个核心的理论与实践困境:Pointwise 方法虽然简单且高度灵活,但独立地评估每篇文档,容易陷入“排序近视陷阱”,忽略文档之间的相对重要性;相比之下,Listwise 方法能够感知全局排序上下文,但受固有的“列表刚性”影响,在处理大规模候选集时面临严重的可扩展性和灵活性问题。为应对这些挑战,我们提出 Groupwise,一种新颖的重排序范式:将查询与一组候选文档一并输入模型,由模型进行组内比较,并为每篇文档给出各自的相关性分数。这一设计保留了 Pointwise 方法的灵活性,同时具备 Listwise 方法的比较能力。我们进一步采用 GRPO 进行模型训练,并配备一个异构奖励函数,将排序指标与旨在对齐各组分数分布的分布性奖励相结合。为克服高质量标注数据稀缺带来的瓶颈,我们还提出了一种合成高质量检索与排序数据的创新管线,所得数据不仅可用于训练重排序器,也可用于训练检索器。在 BRIGHT 和 R2MED 这两个推理密集型检索基准上的大量实验验证了我们方法的有效性。
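下面给出“分组重排序”的控制流草图:把查询与一组候选文档一起交给模型做组内比较并逐文档打分,再全局排序;score_group 为假设的打分接口,分组大小等细节以论文为准。

```python
def groupwise_rerank(query: str, docs: list, score_group, group_size: int = 8) -> list:
    """score_group(query, group) -> List[float]:由模型对组内文档进行比较并返回各自的相关性分数(假设接口)。"""
    scored = []
    for start in range(0, len(docs), group_size):
        group = docs[start:start + group_size]
        scores = score_group(query, group)        # 组内比较:兼顾 Pointwise 的灵活性与 Listwise 的可比性
        scored.extend(zip(group, scores))
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]
```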
V-ReasonBench:面向视频生成模型的统一推理基准套件
- 标题: V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
- 作者: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
- 日期: 2025-11-20
- ArXiv主页: https://arxiv.org/abs/2511.16668
- 论文链接: https://arxiv.org/pdf/2511.16668
- 项目链接: https://oahzxl.github.io/VReasonBench/
- gitHub仓库: https://github.com/yangluo7/V-ReasonBench
英文摘要
Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
中文摘要
Veo-3 等生成视频模型的最新进展显示出令人惊讶的零样本推理能力,从而对系统性和可靠的评估产生了日益增长的需求。我们推出了 V-ReasonBench,这是一个旨在评估四个关键维度的视频推理的基准:结构化问题解决、空间认知、基于模式的推理和物理动力学。该基准测试是根据合成图像序列和真实世界图像序列构建的,并提供了一组多样化的可验证答案的任务,这些任务是可重复的、可扩展的且明确的。对六种最先进的视频模型的评估揭示了明显的维度差异,在结构、空间、基于模式和物理推理方面存在巨大差异。我们进一步将视频模型与强大的图像模型进行比较,分析常见的幻觉行为,并研究视频持续时间如何影响帧链推理。总体而言,V-ReasonBench 为测量视频推理提供了一个统一且可重复的框架,旨在支持开发具有更可靠、更符合人类推理技能的模型。
Step-Audio-R1技术报告
- 标题: Step-Audio-R1 Technical Report
- 作者: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
- 日期: 2025-11-19
- ArXiv主页: https://arxiv.org/abs/2511.15848
- 论文链接: https://arxiv.org/pdf/2511.15848
- 项目链接: https://stepaudiollm.github.io/step-audio-r1/
- gitHub仓库: https://github.com/stepfun-ai/Step-Audio-R1
英文摘要
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
中文摘要
推理模型的最新进展通过扩展的思维链推敲,在文本和视觉领域取得了显著成功。然而,音频语言模型中长期存在一个令人困惑的现象:它们在极少推理甚至不推理时反而表现更好。这引出一个根本问题:音频智能真的能从深思熟虑中获益吗?我们推出 Step-Audio-R1,这是首个成功在音频领域解锁推理能力的音频推理模型。通过我们提出的模态锚定推理蒸馏(Modality-Grounded Reasoning Distillation, MGRD)框架,Step-Audio-R1 学会生成真正扎根于声学特征、而非凭空臆造脱节推敲的音频相关推理链。我们的模型展现出强大的音频推理能力,超越了 Gemini 2.5 Pro,并在涵盖语音、环境声和音乐的全面音频理解与推理基准上取得了与最先进的 Gemini 3 Pro 相当的性能。这些结果表明,只要适当锚定,推理是一种可跨模态迁移的能力,能把延长的推敲从负担转变为音频智能的强大资产。作为首个成功的音频推理模型,Step-Audio-R1 为构建能够在所有感官模态上深入思考的真正多模态推理系统开辟了新路径。
第一帧是视频内容定制的最佳选择
- 标题: First Frame Is the Place to Go for Video Content Customization
- 作者: Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos
- 日期: 2025-11-19
- ArXiv主页: https://arxiv.org/abs/2511.15700
- 论文链接: https://arxiv.org/pdf/2511.15700
- 项目链接: http://firstframego.github.io
- gitHub仓库: https://github.com/zli12321/FFGO-Video-Customization
英文摘要
What role does the first frame play in video generation models? Traditionally, it’s viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it’s possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
中文摘要
第一帧在视频生成模型中扮演什么角色?传统上,它被视为视频的时空起点,仅仅是后续动画的种子。在这项工作中,我们揭示了一个截然不同的视角:视频模型隐式地把第一帧当作概念性记忆缓冲区,存储视觉实体以供生成过程中复用。利用这一洞察,我们表明仅用 20-50 个训练样本、无需架构改动或大规模微调,就能在多种场景中实现稳健且可泛化的视频内容定制。这揭示了视频生成模型中一项强大却被忽视的能力,可用于基于参考的视频定制。
PhysX-Anything:来自单个图像的模拟就绪物理 3D 资产
- 标题: PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
- 作者: Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13648
- 论文链接: https://arxiv.org/pdf/2511.13648
- 项目链接: https://physx-anything.github.io/
- gitHub仓库: https://github.com/ziangcao0312/PhysX-Anything
英文摘要
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
中文摘要
3D 建模正从静态视觉表示转向可直接用于仿真和交互的、具有物理属性的铰接资产。然而,大多数现有 3D 生成方法忽视了关键的物理和关节属性,限制了它们在具身智能中的实用性。为弥补这一差距,我们引入 PhysX-Anything,这是首个可直接用于仿真(simulation-ready)的物理 3D 生成框架:给定单张真实场景图像,即可生成具有显式几何、关节和物理属性的高质量可仿真 3D 资产。具体而言,我们提出了首个基于 VLM 的物理 3D 生成模型,以及一种能高效对几何进行 token 化的新 3D 表示。它将 token 数量减少了 193 倍,使得在标准 VLM token 预算内即可进行显式几何学习,无需在微调中引入任何特殊 token,并显著提升了生成质量。此外,为克服现有物理 3D 数据集多样性有限的问题,我们构建了新数据集 PhysX-Mobility,它将此前物理 3D 数据集中的物体类别扩充了 2 倍以上,包含超过 2K 个带有丰富物理标注的常见真实物体。在 PhysX-Mobility 和真实场景图像上的大量实验表明,PhysX-Anything 具有强大的生成性能和稳健的泛化能力。此外,在 MuJoCo 风格环境中的仿真实验验证了我们的可仿真资产可直接用于接触丰富的机器人策略学习。我们相信 PhysX-Anything 能够大幅赋能广泛的下游应用,尤其是在具身智能和基于物理的仿真领域。
使用多模态基础模型扩展空间智能
- 标题: Scaling Spatial Intelligence with Multimodal Foundation Models
- 作者: Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13719
- 论文链接: https://arxiv.org/pdf/2511.13719
- 项目链接: https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B
- gitHub仓库: https://github.com/OpenSenseNova/SenseNova-SI
英文摘要
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
中文摘要
尽管取得了显著进步,多模态基础模型在空间智能方面仍然表现出令人惊讶的缺陷。在这项工作中,我们探索通过扩展多模态基础模型来培育空间智能,即 SenseNova-SI 系列;它构建于已有的多模态基础之上,包括视觉理解模型(Qwen3-VL 和 InternVL3)以及统一理解与生成模型(Bagel)。我们采取有原则的方式构建高性能且稳健的空间智能:在严格的空间能力分类体系下系统整理 SenseNova-SI-8M,即 800 万个多样化的数据样本。SenseNova-SI 在广泛的空间智能基准上展现出前所未有的性能:VSI-Bench 68.7%、MMSI 43.3%、MindCube 85.6%、ViewSpatial 54.6%、SITE 50.1%,同时保持了较强的通用多模态理解能力(例如 MMBench-En 84.9%)。更重要的是,我们分析了数据扩展的影响,讨论了多样化数据训练带来的泛化能力涌现的早期迹象,分析了过拟合和语言捷径的风险,给出了空间思维链推理的初步研究,并验证了潜在的下游应用。SenseNova-SI 是一个持续进行的项目,本报告将不断更新。所有新训练的多模态基础模型均已公开发布,以促进该方向的进一步研究。
WEAVE:释放上下文交错理解和生成并对其进行基准测试
- 标题: WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
- 作者: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua
- 日期: 2025-11-14
- ArXiv主页: https://arxiv.org/abs/2511.11434
- 论文链接: https://arxiv.org/pdf/2511.11434
- 项目链接: https://weichow23.github.io/weave/
- gitHub仓库: https://github.com/weichow23/weave
英文摘要
Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models’ abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for multi-modal community.
中文摘要
统一多模态模型(UMM)的最新进展在视觉理解和生成方面取得了令人瞩目的进展。然而,现有的数据集和基准主要关注单轮交互,未能捕捉现实世界图像创建和编辑的多轮、上下文相关的性质。为了解决这一差距,我们推出了 WEAVE,这是第一个用于上下文交错的跨模态理解和生成的套件。我们的套件由两个互补的部分组成。WEAVE-100k 是一个包含 100K 交错样本的大型数据集,涵盖超过 370K 对话回合和 500K 图像,涵盖需要对历史背景进行推理的理解、编辑和生成任务。WEAVEBench 是一个人工注释的基准测试,包含基于 480 张图像的 100 个任务,具有基于参考图像以及原始图像与编辑指令相结合的混合 VLM 判断器评估框架,可评估模型在多轮生成、视觉记忆和跨不同领域的世界知识推理方面的能力。实验表明,在 WEAVE-100k 上进行训练可以实现视觉理解、图像编辑和理解生成协作功能。此外,它有助于 UMM 开发新兴的视觉记忆功能,而对 WEAVEBench 的广泛评估暴露了当前多轮、上下文感知图像生成和编辑方法的持续局限性和挑战。我们相信 WEAVE 为研究多模态社区的上下文交错理解和生成提供了一个观点和基础。
VisPlay:来自图像的自我进化视觉语言模型
- 标题: VisPlay: Self-Evolving Vision-Language Models from Images
- 作者: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
- 日期: 2025-11-19
- ArXiv主页: https://arxiv.org/abs/2511.15661
- 论文链接: https://arxiv.org/pdf/2511.15661
- 项目链接: https://bruno686.github.io/VisPlay/
- gitHub仓库: https://github.com/bruno686/VisPlay
英文摘要
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
中文摘要
强化学习(RL)为在复杂推理任务上改进视觉语言模型(VLM)提供了一个有原则的框架。然而,现有的 RL 方法通常依赖人工标注的标签或特定任务的启发式规则来定义可验证奖励,二者均成本高昂且难以扩展。我们提出了 VisPlay,一种自我进化的 RL 框架,使 VLM 能够利用大量无标注图像数据自主提升推理能力。从单个基础 VLM 出发,VisPlay 将模型分配为两个相互交互的角色:负责提出有挑战性但可回答的视觉问题的“图像条件提问者(Image-Conditioned Questioner)”,以及负责生成银标(silver)回答的“多模态推理器(Multimodal Reasoner)”。这两个角色通过组相对策略优化(GRPO)联合训练,该方法引入多样性奖励和难度奖励,以在生成问题的复杂度与银标答案的质量之间取得平衡。VisPlay 能在两个模型系列上高效扩展。在 Qwen2.5-VL 和 MiMo-VL 上训练后,VisPlay 在包括 MM-Vet 和 MMMU 在内的八个基准上,于视觉推理、组合泛化和减少幻觉方面均取得一致提升,展示了一条通往自我进化多模态智能的可扩展路径。项目页面位于 https://bruno686.github.io/VisPlay/
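下面是一个仅供理解的最小化草图,展示如何把“难度奖励 + 多样性奖励”组合后再做 GRPO 风格的组内相对归一化。其中难度(组内答案一致率)与多样性(词重叠率)的具体定义以及各函数名均为假设,并非 VisPlay 的官方奖励设计。

```python
# 最小化示意草图:组合"难度奖励 + 多样性奖励"并做 GRPO 风格的组内相对归一化。
# 难度(组内答案一致率)与多样性(词重叠率)的具体定义均为假设,并非 VisPlay 官方奖励设计。
import math
from collections import Counter

def difficulty_reward(reasoner_answers: list[str]) -> float:
    """用组内多数答案的占比近似问题难度:一致率过高(太易)或过低(不可答)都给低奖励。"""
    agree = Counter(reasoner_answers).most_common(1)[0][1] / len(reasoner_answers)
    return 1.0 - abs(agree - 0.5) * 2.0  # agree 约为 0.5 时奖励最高

def diversity_reward(question: str, group_questions: list[str]) -> float:
    """用简单的词重叠率惩罚与同组其他问题过于相似的问题。"""
    q = set(question.split())
    overlaps = [
        len(q & set(other.split())) / max(len(q | set(other.split())), 1)
        for other in group_questions if other != question
    ]
    return 1.0 - (sum(overlaps) / len(overlaps) if overlaps else 0.0)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """组相对优势:对同组奖励做减均值、除标准差的归一化。"""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-6
    return [(r - mean) / std for r in rewards]
```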
TiViBench:视频生成模型的视频思考推理基准测试
- 标题: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
- 作者: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13704
- 论文链接: https://arxiv.org/pdf/2511.13704
- 项目链接: https://haroldchen19.github.io/TiViBench-Page/
- gitHub仓库: https://github.com/EnVision-Research/TiViBench
英文摘要
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3’s chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
中文摘要
视频生成模型的快速演进已将其关注点从产生视觉上合理的输出,转向处理需要物理合理性与逻辑一致性的任务。然而,尽管最近出现了诸如 Veo 3 的帧链(chain-of-frames)推理等突破,这些模型是否能表现出类似大型语言模型(LLM)的推理能力仍不清楚。现有基准主要评估视觉保真度与时间连贯性,未能衡量更高阶的推理能力。为弥补这一差距,我们提出 TiViBench,一个专门用于评估图像到视频(I2V)生成模型推理能力的分层基准。TiViBench 从四个维度系统地评估推理:i) 结构推理与搜索,ii) 空间与视觉模式推理,iii) 符号与逻辑推理,iv) 行动规划与任务执行,覆盖 3 个难度级别下的 24 种不同任务场景。通过大量评估,我们发现商业模型(如 Sora 2、Veo 3.1)展现出更强的推理潜力,而开源模型则存在尚未释放的潜力,但仍受限于训练规模和数据多样性。为进一步释放这一潜力,我们提出 VideoTPO,一种受偏好优化启发、简单而有效的测试时策略。它通过让 LLM 对生成的候选结果进行自我分析、识别其优劣,在无需额外训练、数据或奖励模型的情况下显著提升推理表现。TiViBench 与 VideoTPO 共同为评估和推进视频生成模型的推理能力铺平了道路,并为这一新兴领域的未来研究奠定了基础。
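下面给出一个对“测试时让 LLM 自我分析候选并择优”思路的极简草图:生成多个候选视频,让 LLM 对每个候选的推理表现打分,最后返回得分最高者。`generate_video` 与 `llm_critique` 只是占位接口(假设),真实 VideoTPO 的偏好式比较细节请以论文为准。

```python
# 最小化示意草图:测试时对多个生成候选做 LLM 自我分析后择优。
# generate_video 与 llm_critique 是占位接口(假设);真实 VideoTPO 的偏好式比较细节以论文为准。
from typing import Callable

def videotpo_select(
    prompt: str,
    generate_video: Callable[[str], str],       # 提示词 -> 候选视频路径
    llm_critique: Callable[[str, str], float],  # (提示词, 视频路径) -> 推理质量分数
    num_candidates: int = 4,
) -> str:
    """生成多个候选,让 LLM 逐一分析优缺点并打分,返回得分最高的候选。"""
    candidates = [generate_video(prompt) for _ in range(num_candidates)]
    scored = [(llm_critique(prompt, video), video) for video in candidates]
    _, best_video = max(scored, key=lambda pair: pair[0])
    return best_video
```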
大型语言模型遇上极端多标签分类:扩展与多模态框架
- 标题: Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
- 作者: Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel
- 日期: 2025-11-17
- ArXiv主页: https://arxiv.org/abs/2511.13189
- gitHub仓库: https://github.com/DiegoOrtego/vixml
英文摘要
Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals’ effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML’s code is available at https://github.com/DiegoOrtego/vixml.
中文摘要
基础模型已在众多领域彻底改变了人工智能,但其变革潜力在极端多标签分类(XMC)中仍远未得到充分发挥。XMC 中的查询需要与极其庞大的标签空间中的相关标签关联,在效率与性能之间取得平衡至关重要。因此,许多近期方法将 XMC 高效地表述为在小型 encoder-only Transformer 架构所学嵌入之间进行最大内积搜索。在本文中,我们讨论了 XMC 的两个重要方面:如何有效利用更大的 decoder-only 模型,以及如何在保持计算效率的同时利用视觉信息。我们证明二者分别在 XMC 中发挥关键作用,并且可以结合以进一步提升性能。我们表明,一个数十亿参数规模的解码器可以在保持计算开销可控的同时带来显著改进。此外,我们的视觉增强极端多标签学习框架(ViXML)通过对每张图像池化出单个嵌入来高效整合基础视觉模型,在解锁多模态能力的同时限制了计算量的增长。值得注意的是,在大多数情况下,使用小型编码器的 ViXML 优于纯文本解码器,表明一张图像抵得上数十亿参数。最后,我们对现有的纯文本数据集进行了扩展以利用视觉元数据,并将其开放用于未来的基准测试。在四个公开纯文本数据集及其相应的图像增强版本上进行的综合实验验证了我们所提方法的有效性,在最大的数据集上将 P@1 较此前最优方法最多提升 +8.21%。ViXML 的代码可从 https://github.com/DiegoOrtego/vixml 获取。
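下面用 NumPy 给出“把 XMC 当作最大内积检索(MIPS),并把每张图的单一池化视觉嵌入与文本嵌入拼接为查询”的示意草图。拼接这种融合方式及各维度大小均为本示例的假设,并非 ViXML 的确切实现;大规模标签空间下通常还会配合近似最近邻索引。

```python
# 最小化示意草图:把 XMC 视为查询嵌入与标签嵌入间的最大内积检索(MIPS),
# 查询由文本嵌入与"每张图一个"的池化视觉嵌入拼接而成。
# 拼接这种融合方式及各维度大小均为本示例的假设,并非 ViXML 的确切实现。
import numpy as np

def top_k_labels(text_emb: np.ndarray, image_emb: np.ndarray,
                 label_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """text_emb: (d_t,)  image_emb: (d_v,)  label_embs: (L, d_t + d_v) -> top-k 标签索引。"""
    query = np.concatenate([text_emb, image_emb])  # 文本 + 单一视觉嵌入拼接为查询
    scores = label_embs @ query                    # 与全部标签嵌入做内积
    return np.argsort(-scores)[:k]

# 随机数据用法示例
rng = np.random.default_rng(0)
print(top_k_labels(rng.normal(size=256), rng.normal(size=128),
                   rng.normal(size=(10_000, 384)), k=5))
```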
改进方法,而不是提示:LLM 越狱攻击的进化合成
- 标题: Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
- 作者: Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma
- 日期: 2025-11-16
- ArXiv主页: https://arxiv.org/abs/2511.12710
英文摘要
Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.
中文摘要
面向大型语言模型(LLM)的自动化红队框架已日趋复杂,但它们共有一个根本局限:其越狱逻辑局限于对既有攻击策略的选择、组合或改进。这束缚了它们的创造力,使其无法自主发明全新的攻击机制。为克服这一差距,我们提出 EvoSynth,一个将范式从攻击规划转向越狱方法进化合成的自主框架。EvoSynth 不是去细化提示,而是采用多智能体系统来自主设计、演化并执行新颖的、基于代码的攻击算法。关键的是,它具备代码级的自我纠正循环,能够在失败时迭代重写自身的攻击逻辑。通过大量实验,我们证明 EvoSynth 不仅对 Claude-Sonnet-4.5 等高度鲁棒的模型取得了 85.5% 的攻击成功率(ASR)、创造了新的最先进水平,而且其生成的攻击也比现有方法显著更为多样。我们开源该框架,以促进越狱方法进化合成这一新方向的未来研究。代码位于:https://github.com/dongdongunique/EvoSynth。
虚拟宽度网络
- 标题: Virtual Width Networks
- 作者: Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang
- 日期: 2025-11-14
- ArXiv主页: https://arxiv.org/abs/2511.11238
- 论文链接: https://arxiv.org/pdf/2511.11238
英文摘要
We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
中文摘要
我们提出虚拟宽度网络(VWN),这一框架能够在不付出因增大隐藏维度而带来的二次方成本的情况下,获得更宽表示所带来的收益。VWN 将表示宽度与主干宽度解耦,在保持主干计算量几乎不变的同时扩展嵌入空间。在我们的大规模实验中,8 倍的宽度扩展使下一词元(next-token)预测的优化加速超过 2 倍,使下两词元(next-2-token)预测加速 3 倍。随着损失差距的扩大和收敛加速比的提升,这一优势在训练过程中不断放大,表明 VWN 不仅更省词元,而且随规模增大越发有效。此外,我们发现虚拟宽度与损失下降之间存在近似对数线性的缩放关系,为把虚拟宽度缩放作为大模型效率的一个新维度进行探索提供了初步的经验基础与动机。
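基于摘要中“表示宽度与主干宽度解耦、主干计算量几乎不变”的描述,下面用 PyTorch 给出一种最小化的假设性实现草图:词嵌入使用更宽的虚拟维度,进入主干前线性压缩到 d_model,输出端再映射回虚拟宽度并与嵌入矩阵共享权重计算 logits。这只是对思路的示意,并非 VWN 论文的确切结构。

```python
# 最小化示意草图(假设性实现,并非 VWN 论文的确切结构):
# 词嵌入使用更宽的虚拟维度 d_virtual,进入主干前压缩到 d_model,
# 输出端再映射回虚拟宽度并与嵌入矩阵共享权重计算 logits,主干计算量保持不变。
import torch
import torch.nn as nn

class VirtualWidthEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, expansion: int = 8):
        super().__init__()
        d_virtual = d_model * expansion                 # 例如 8 倍虚拟宽度
        self.embed = nn.Embedding(vocab_size, d_virtual)
        self.down = nn.Linear(d_virtual, d_model)       # 输入侧:虚拟宽度 -> 主干宽度
        self.up = nn.Linear(d_model, d_virtual)         # 输出侧:主干宽度 -> 虚拟宽度

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.down(self.embed(token_ids))         # 主干仍以 d_model 宽度计算

    def output_logits(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.up(hidden) @ self.embed.weight.T    # 与嵌入矩阵共享权重得到词表 logits

layer = VirtualWidthEmbedding(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 16))
h = layer(tokens)                              # torch.Size([2, 16, 512])
print(h.shape, layer.output_logits(h).shape)   # 后者为 torch.Size([2, 16, 1000])
```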