【论文速递】2025年第46周(Nov-09-15)(Robotics/Embodied AI/LLM)
中文使用 googletrans 翻译,翻译不对的地方以英文为准
目录
- Lumine:在 3D 开放世界中构建多面手智能体的开放配方
- 小模型,大逻辑:多样性驱动的优化在 VibeThinker-1.5B 中激发大模型推理能力
- 潜在空间一小步,像素一大步:适用于你的扩散模型的快速潜在上采样适配器
- TiDAR:用扩散思考,用自回归说话
- 基于人类演示的计算机使用智能体接地(Grounding)
- Depth Anything 3:从任意视图恢复视觉空间
- HaluMem:评估智能体记忆系统中的幻觉
- PAN:面向通用、可交互、长时程世界模拟的世界模型
- IterResearch:通过马尔可夫状态重建重新思考长时程智能体
- MADD:多代理药物发现乐团
- Time-to-Move:通过双时钟去噪实现免训练的运动可控视频生成
- 好到无法变坏:论大语言模型扮演反派角色的失败
- DRIVE:竞争性代码生成中可验证奖励强化学习的数据整理最佳实践
- 视觉空间调整
- 大型语言模型的黑盒在线策略蒸馏
- DeepEyesV2:迈向代理多模态模型
- 对话系统中的自适应多代理响应细化
- Motif 2 12.7B技术报告
- UniVA:面向开源下一代视频多面手的通用视频代理
- KLASS:掩码扩散模型中的 KL 引导快速推理
- The Station:人工智能驱动探索的开放世界环境
- VeriCoT:通过逻辑一致性检查进行神经符号思维链验证
- 未走的路:RLVR 可证明地偏离主方向学习
- 超越英语:大语言模型迈向包容且可扩展的多语言机器翻译
- Wasm:构建结构化阿拉伯语交错多模态语料库的管道
Lumine:在 3D 开放世界中构建多面手智能体的开放配方
- 标题: Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
- 作者: Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi
- 日期: 2025-11-12
- ArXiv主页: https://arxiv.org/abs/2511.08892
- 论文链接: https://arxiv.org/pdf/2511.08892
- 项目链接: https://www.lumine-ai.org/
英文摘要
We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine’s effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
中文摘要
我们推出 Lumine,这是第一个用于开发多面手智能体的开放配方,能够在具有挑战性的 3D 开放世界环境中实时完成长达数小时的复杂任务。Lumine 采用类人交互范式,在视觉语言模型的支持下,以端到端的方式统一感知、推理和行动。它以 5 Hz 的频率处理原始像素,产生精确的 30 Hz 键盘鼠标操作,并仅在必要时自适应地调用推理。在《原神》(Genshin Impact)中训练后,Lumine 以与人类相当的效率成功完成了长达五小时的蒙德(Mondstadt)主线剧情,并能遵循自然语言指令,在 3D 开放世界探索和 2D GUI 操作中执行涵盖收集、战斗、解谜和 NPC 交互的广泛任务。除了域内性能外,Lumine 还表现出强大的零样本跨游戏泛化能力:无需任何微调,它就能完成《鸣潮》(Wuthering Waves)中 100 分钟的任务以及《崩坏:星穹铁道》(Honkai: Star Rail)完整的五小时第一章。这些有希望的结果凸显了 Lumine 在不同世界和交互动态中的有效性,标志着向开放环境中的通才智能体迈出了具体的一步。
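下面用一段极简的 Python 控制循环示意摘要中“5 Hz 感知、30 Hz 键鼠动作、按需推理”的时序关系。其中 `capture_frame`、`need_reasoning`、`reason`、`policy_step`、`execute` 均为假设的占位函数,仅用于说明思路,并非 Lumine 官方实现。

```python
import time

PERCEPTION_HZ = 5      # 感知频率:每秒处理 5 帧原始像素
ACTION_HZ = 30         # 动作频率:每秒输出 30 次键鼠操作

def capture_frame():
    """占位:抓取当前游戏画面(原始像素)。"""
    return None

def need_reasoning(obs) -> bool:
    """占位:判断当前局面是否需要显式推理(摘要称“仅在必要时”)。"""
    return False

def reason(obs, memory):
    """占位:调用 VLM 进行一段较慢的推理,更新高层计划。"""
    return memory

def policy_step(obs, memory):
    """占位:为接下来的一个感知周期生成一小段键鼠动作序列。"""
    return [{"type": "noop"}] * (ACTION_HZ // PERCEPTION_HZ)

def execute(action):
    """占位:把单个键鼠动作发送给环境。"""
    pass

def control_loop(max_seconds=1.0):
    memory = {}
    t_end = time.time() + max_seconds
    while time.time() < t_end:
        obs = capture_frame()                  # 5 Hz:低频感知
        if need_reasoning(obs):
            memory = reason(obs, memory)       # 按需调用的慢速推理
        actions = policy_step(obs, memory)     # 本周期生成 30/5 = 6 个动作
        for a in actions:                      # 30 Hz:高频执行
            execute(a)
            time.sleep(1.0 / ACTION_HZ)

if __name__ == "__main__":
    control_loop(max_seconds=0.5)
```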
小模型,大逻辑:多样性驱动的优化在 VibeThinker-1.5B 中激发大模型推理能力
- 标题: Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
- 作者: Sen Xu, Yi Zhou, Wei Wang, Jixin Min, Zhibin Yin, Yingwei Dai, Shixi Liu, Lianyu Pang, Yirong Chen, Junlin Zhang
- 日期: 2025-11-09
- ArXiv主页: https://arxiv.org/abs/2511.06221
- 论文链接: https://arxiv.org/pdf/2511.06221
- 项目链接: https://github.com/WeiboAI/VibeThinker
- gitHub仓库: https://github.com/WeiboAI/VibeThinker
英文摘要
Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). This challenges the prevailing approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs a Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium’s 50.3 and its base model’s 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
中文摘要
本报告挑战了“小模型本质上缺乏稳健推理能力”的普遍共识,介绍了 VibeThinker-1.5B,这是一个通过我们的频谱到信号原理(SSP)开发的 1.5B 参数稠密模型。这对以扩大模型参数规模来增强能力的主流做法(如 DeepSeek R1(671B)和 Kimi k2(>1T)等模型)提出了挑战。SSP 框架首先采用两阶段多样性探索蒸馏(SFT)来生成覆盖面广的解答频谱,然后采用 MaxEnt 引导策略优化(RL)来放大正确的信号。VibeThinker-1.5B 的总训练成本仅为 7,800 美元,与 Magistral Medium 和 Claude Opus 4 等闭源模型相比表现出更优的推理能力,并且性能与 GPT OSS-20B Medium 等开源模型相当。值得注意的是,它在三个数学基准上超过了规模约为其 400 倍的 DeepSeek R1:AIME24(80.3 vs. 79.8)、AIME25(74.4 vs. 70.0)和 HMMT25(50.4 vs. 41.7),相比其基础模型(分别为 6.7、4.3 和 0.6)有大幅提升。在 LiveCodeBench V6 上,它的得分为 51.1,优于 Magistral Medium 的 50.3 及其基础模型的 0.0。这些发现表明,小模型可以达到与大模型相当的推理能力,大幅降低训练和推理成本,从而让先进的人工智能研究更加普及。
潜在空间一小步,像素一大步:适用于你的扩散模型的快速潜在上采样适配器
- 标题: One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
- 作者: Aleksandr Razin, Danil Kazantsev, Ilya Makarov
- 日期: 2025-11-13
- ArXiv主页: https://arxiv.org/abs/2511.10629
- gitHub仓库: https://github.com/vaskers5/LUA
英文摘要
Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator’s latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
中文摘要
扩散模型很难扩展到超出其训练分辨率:直接进行高分辨率采样既慢又昂贵,而事后图像超分辨率(ISR)由于在解码之后才进行处理,会引入伪影和额外延迟。我们提出了潜在上采样适配器(Latent Upscaler Adapter,LUA),这是一个轻量级模块,可在最终 VAE 解码步骤之前直接对生成器的潜在编码执行超分辨率。LUA 作为即插即用组件集成,无需修改基础模型或增加额外的扩散阶段,并通过潜在空间中的一次前馈实现高分辨率合成。共享的 Swin 风格主干配合针对不同倍率的 pixel-shuffle 头,支持 2 倍和 4 倍放大,并与图像空间超分基线保持兼容,在解码与上采样时间降低近 3 倍的情况下达到可比的感知质量(从 512 像素生成 1024 像素仅增加 0.42 秒,而使用相同 SwinIR 架构的像素空间超分需要 1.87 秒)。此外,LUA 在不同 VAE 的潜在空间中表现出很强的泛化能力,无需为每个新解码器从头重新训练即可轻松部署。大量实验表明,LUA 的保真度与原生高分辨率生成非常接近,为现代扩散管线中可扩展的高保真图像合成提供了一条实用且高效的路径。
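下面给出一个极简的潜在上采样模块示意(PyTorch):卷积主干加 pixel-shuffle 头,作用于 VAE 解码前的潜变量。层数、通道数等结构细节均为假设,仅用于说明“先在潜在空间超分、再解码”的接入位置,并非 LUA 的官方实现。

```python
import torch
import torch.nn as nn

class TinyLatentUpscaler(nn.Module):
    """极简潜在空间上采样器示意:卷积主干 + pixel-shuffle 头(结构为假设)。"""

    def __init__(self, latent_channels: int = 4, scale: int = 2, width: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(latent_channels, width, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(width, width, 3, padding=1),
            nn.GELU(),
        )
        # pixel-shuffle 头:先输出 C*scale^2 个通道,再重排为 scale 倍分辨率
        self.head = nn.Sequential(
            nn.Conv2d(width, latent_channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(latent))

if __name__ == "__main__":
    # 假设扩散模型在 64x64 潜在分辨率(约对应 512px 图像)上完成采样
    low_res_latent = torch.randn(1, 4, 64, 64)
    upscaler = TinyLatentUpscaler(scale=2)
    hi_res_latent = upscaler(low_res_latent)   # -> (1, 4, 128, 128),随后送入 VAE 解码
    print(hi_res_latent.shape)
```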
TiDAR:用扩散思考,用自回归说话
- 标题: TiDAR: Think in Diffusion, Talk in Autoregression
- 作者: Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov
- 日期: 2025-11-12
- ArXiv主页: https://arxiv.org/abs/2511.08923
- 论文链接: https://arxiv.org/pdf/2511.08923
- 项目链接: https://tidarlm.github.io
英文摘要
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
中文摘要
扩散语言模型有望实现快速并行生成,而自回归(AR)模型由于其因果结构与语言建模天然契合,通常在质量上更胜一筹。这就提出了一个根本问题:我们能否同时获得高吞吐量、更高的 GPU 利用率和 AR 级别的质量?现有方法未能有效平衡这两方面:要么以 AR 为主、用较弱的模型进行顺序起草(推测解码),导致起草效率较低;要么让扩散采用某种从左到右(类 AR)的解码逻辑,但仍然存在质量下降并丧失其潜在的并行性。我们提出 TiDAR,一种序列级混合架构,它利用专门设计的结构化注意力掩码,在一次前向传递中先用扩散起草 token(Thinking),再自回归地采样最终输出(Talking)。该设计利用了空闲的 GPU 计算密度,在起草能力和验证能力之间取得了良好平衡。此外,TiDAR 作为独立模型被设计为服务友好(低开销)。我们在 1.5B 和 8B 规模上,针对生成与似然任务,将 TiDAR 与 AR 模型、推测解码以及多种扩散变体进行了广泛比较。得益于并行的起草与采样以及对精确 KV 缓存的支持,TiDAR 在实测吞吐量上优于推测解码,并在效率和质量上超过 Dream 和 Llada 等扩散模型。最值得注意的是,TiDAR 是第一个在弥合与 AR 模型质量差距的同时,将每秒生成 token 数提高 4.71 到 5.91 倍的架构。
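下面给出“扩散起草 + 自回归验证”这一环节的简化示意(Python):假设一次前向同时得到草稿 token 与自回归分支的逐位置 logits,则可以接受与自回归贪心预测一致的最长前缀。这里的前缀匹配接受规则借自常见的推测解码做法,属于我们的假设,并非论文中结构化注意力掩码的具体实现。

```python
import torch

def accept_drafted_prefix(draft_tokens: torch.Tensor,
                          ar_logits: torch.Tensor) -> torch.Tensor:
    """
    draft_tokens: (T,)   扩散分支起草的 T 个候选 token
    ar_logits:    (T, V) 自回归分支在同一次前向中对这 T 个位置给出的 logits
    返回被接受的前缀(与 AR 贪心预测一致的最长前缀;贪心接受规则为简化假设)。
    """
    ar_greedy = ar_logits.argmax(dim=-1)        # AR 分支的逐位置贪心预测
    match = (draft_tokens == ar_greedy)
    mismatch = (~match).nonzero(as_tuple=True)[0]   # 第一个不一致的位置
    n_accept = int(mismatch[0]) if len(mismatch) > 0 else len(draft_tokens)
    return draft_tokens[:n_accept]

if __name__ == "__main__":
    vocab = 100
    draft = torch.tensor([7, 23, 56, 4])
    logits = torch.randn(4, vocab)
    logits[0, 7] += 10.0    # 让前两个位置与草稿一致
    logits[1, 23] += 10.0
    print(accept_drafted_prefix(draft, logits))  # 通常会接受前两个 token
```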
基于人类演示的计算机使用智能体接地(Grounding)
- 标题: Grounding Computer Use Agents on Human Demonstrations
- 作者: Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar
- 日期: 2025-11-10
- ArXiv主页: https://arxiv.org/abs/2511.07332
- 论文链接: https://arxiv.org/pdf/2511.07332
- 项目链接: https://groundcua.github.io/
英文摘要
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
中文摘要
构建可靠的计算机使用智能体需要接地(grounding)能力:将自然语言指令准确对应到正确的屏幕元素。虽然网页和移动端交互已有大量数据集,但桌面环境的高质量资源仍然有限。为了填补这一空白,我们推出了 GroundCUA,这是一个基于专家人类演示构建的大规模桌面接地数据集。它涵盖 12 个类别的 87 个应用程序,包括 5.6 万张屏幕截图,每个屏幕元素都经过仔细标注,共有超过 356 万条经人工验证的标注。基于这些演示,我们生成了覆盖广泛现实任务的多样化指令,为模型训练提供高质量数据。使用 GroundCUA,我们开发了将指令映射到目标 UI 元素的 GroundNext 系列模型。在 3B 和 7B 两个规模上,GroundNext 仅通过监督微调就在五个基准上取得了最先进的结果,而所需训练数据不到先前工作的十分之一。强化学习后训练进一步提升了性能;在 OSWorld 基准上以 o3 作为规划器进行智能体设置评估时,GroundNext 取得了与使用远多数据训练的模型相当或更好的结果。这些结果证明了高质量、专家驱动的数据集在推进通用计算机使用智能体方面的关键作用。
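桌面 grounding 的一种常见评测方式是检查模型预测的点击坐标是否落在目标 UI 元素的标注框内。下面的记录格式与判定函数均为示意性假设,并非 GroundCUA 的官方评测脚本。

```python
from dataclasses import dataclass

@dataclass
class GroundingExample:
    instruction: str        # 自然语言指令
    target_bbox: tuple      # 目标元素标注框 (x1, y1, x2, y2),像素坐标

def click_hits_target(pred_xy: tuple, bbox: tuple) -> bool:
    """判断预测点击点是否落在目标元素框内。"""
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(examples, predictions) -> float:
    hits = sum(click_hits_target(p, ex.target_bbox)
               for ex, p in zip(examples, predictions))
    return hits / max(len(examples), 1)

if __name__ == "__main__":
    data = [GroundingExample("点击“保存”按钮", (100, 40, 180, 70))]
    preds = [(130, 55)]
    print(grounding_accuracy(data, preds))   # 1.0
```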
Depth Anything 3:从任意视图恢复视觉空间
- 标题: Depth Anything 3: Recovering the Visual Space from Any Views
- 作者: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang
- 日期: 2025-11-13
- ArXiv主页: https://arxiv.org/abs/2511.10647
- 论文链接: https://arxiv.org/pdf/2511.10647
- 项目链接: https://depth-anything-3.github.io/
- gitHub仓库: https://github.com/ByteDance-Seed/depth-anything-3
英文摘要
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
中文摘要
我们提出了 Depth Anything 3 (DA3),这是一个模型,可以根据任意数量的视觉输入预测空间一致的几何形状,无论是否有已知的相机姿势。为了追求最小化建模,DA3 产生了两个关键见解:单个普通变压器(例如,vanilla DINO 编码器)足以作为主干,无需架构专门化;单个深度射线预测目标消除了复杂的多任务学习的需要。通过我们的师生培训范例,该模型达到了与 Depth Anything 2 (DA2) 相当的细节和概括水平。我们建立了一个新的视觉几何基准,涵盖相机姿态估计、任意视图几何和视觉渲染。在此基准测试中,DA3 在所有任务上都创下了新的最先进水平,相机姿态准确度比之前的 SOTA VGGT 平均高出 44.3%,几何准确度高出 25.1%。此外,它在单目深度估计方面优于 DA2。所有模型都专门在公共学术数据集上进行训练。
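摘要所说的“恢复视觉空间”依赖于把逐像素几何还原为 3D 结构。下面是一个常规的深度图反投影示意:给定相机内参,把每个像素还原为相机坐标系下的 3D 点。注意论文实际采用的是深度-射线(depth-ray)预测目标,此处仅为通用几何步骤的说明,并非 DA3 的参数化方式。

```python
import numpy as np

def unproject_depth(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """把 (H, W) 深度图反投影为相机坐标系下的 (H, W, 3) 点云。"""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # 像素网格
    x = (u - cx) / fx * depth                        # X = (u - cx) / fx * Z
    y = (v - cy) / fy * depth                        # Y = (v - cy) / fy * Z
    return np.stack([x, y, depth], axis=-1)

if __name__ == "__main__":
    depth = np.full((480, 640), 2.0)                 # 假设所有像素深度为 2 米
    points = unproject_depth(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(points.shape)                              # (480, 640, 3)
```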
HaluMem:评估智能体记忆系统中的幻觉
- 标题: HaluMem: Evaluating Hallucinations in Memory Systems of Agents
- 作者: Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li
- 日期: 2025-11-05
- ArXiv主页: https://arxiv.org/abs/2511.03506
- gitHub仓库: https://github.com/MemTensor/HaluMem
英文摘要
Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
中文摘要
记忆系统是使大语言模型(LLM)和 AI 智能体等人工智能系统能够实现长期学习和持续交互的关键组件。然而,在记忆存储和检索过程中,这些系统经常表现出记忆幻觉,包括捏造、错误、冲突和遗漏。现有的记忆幻觉评估主要是端到端的问答,难以定位幻觉产生于记忆系统的哪个操作阶段。为了解决这个问题,我们提出了记忆幻觉基准(HaluMem),这是第一个针对记忆系统量身定制的操作级幻觉评估基准。HaluMem 定义了三个评估任务(记忆提取、记忆更新和记忆问答),以全面揭示交互不同操作阶段的幻觉行为。为了支持评估,我们构建了以用户为中心的多轮人机交互数据集 HaluMem-Medium 和 HaluMem-Long,两者均包含约 1.5 万个记忆点和 3500 个多类型问题,每个用户的平均对话长度分别达到 1.5k 和 2.6k 轮,上下文长度超过 100 万 token,从而能够在不同上下文规模和任务复杂度下评估幻觉。基于 HaluMem 的实证研究表明,现有记忆系统往往在提取和更新阶段产生并积累幻觉,随后将错误传播到问答阶段。未来的研究应着力于开发可解释且受约束的记忆操作机制,系统性地抑制幻觉并提高记忆可靠性。
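操作级评测的一个直观例子是:在记忆提取阶段,把系统抽取出的记忆点与人工标注对比,分别统计“捏造”(多抽)与“遗漏”(漏抽)。下面的集合式指标与精确匹配方式均为简化假设,基准实际的判定要更细致。

```python
def extraction_metrics(extracted: set, gold: set) -> dict:
    """
    extracted: 记忆系统从对话中抽取出的记忆点集合
    gold:      人工标注的真实记忆点集合
    捏造(fabrication)≈ 抽取了但标注中不存在;遗漏(omission)≈ 标注中有但未抽取。
    """
    fabricated = extracted - gold
    omitted = gold - extracted
    precision = len(extracted & gold) / max(len(extracted), 1)
    recall = len(extracted & gold) / max(len(gold), 1)
    return {
        "precision": precision,
        "recall": recall,
        "fabrication_rate": len(fabricated) / max(len(extracted), 1),
        "omission_rate": len(omitted) / max(len(gold), 1),
    }

if __name__ == "__main__":
    gold = {"用户住在上海", "用户养了一只猫", "用户下周出差"}
    extracted = {"用户住在上海", "用户养了一只狗"}   # 一条正确、一条捏造、两条遗漏
    print(extraction_metrics(extracted, gold))
```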
PAN:面向通用、可交互、长时程世界模拟的世界模型
- 标题: PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
- 作者: PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing
- 日期: 2025-11-12
- ArXiv主页: https://arxiv.org/abs/2511.09057
- 论文链接: https://arxiv.org/pdf/2511.09057
- 项目链接: https://ifm.mbzuai.ac.ae/pan/
英文摘要
A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
中文摘要
世界模型使智能体能够想象、预测和推理世界如何因其行为而演变,并据此进行规划和制定策略。虽然最近的视频生成模型能够产生逼真的视觉序列,但它们通常以“提示到完整视频”的方式运行,缺乏有目的推理所需的因果控制、交互性或长时程一致性。另一方面,现有的世界建模工作通常局限于受限领域(例如物理、游戏或 3D 场景动态),深度和可控性有限,难以在不同环境和交互形式之间泛化。在这项工作中,我们提出 PAN,一种通用、可交互、长时程的世界模型,它通过以历史和自然语言动作为条件的高质量视频模拟来预测未来的世界状态。PAN 采用生成式潜在预测(GLP)架构:以大语言模型(LLM)为基础的自回归潜在动态主干,将模拟建立在广泛的文本知识之上,并支持以语言指定的动作为条件;再配合视频扩散解码器,重建感知细节丰富且时间连贯的视觉观察,从而实现潜在空间推理(想象)与可实现的世界动态(现实)之间的统一。PAN 在跨越不同领域的大规模视频-动作对上训练,支持具有连贯长期动态的开放域、动作条件模拟。大量实验表明,与其他视频生成器和世界模型相比,PAN 在动作条件世界模拟、长时程预测和模拟推理方面表现强劲,朝着能够对未来世界状态进行预测模拟、用于推理和行动的通用世界模型迈出了一步。
IterResearch:通过马尔可夫状态重建重新思考长时程智能体
- 标题: IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
- 作者: Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
- 日期: 2025-11-10
- ArXiv主页: https://arxiv.org/abs/2511.07327
- 论文链接: https://arxiv.org/pdf/2511.07327
英文摘要
Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
中文摘要
深度研究智能体的最新进展显示出通过对外部来源进行动态推理实现自主知识构建的潜力。然而,现有方法依赖单一上下文范式,把所有信息累积在一个不断膨胀的上下文窗口中,导致上下文窒息和噪声污染,限制了它们在长时程任务上的有效性。我们提出 IterResearch,一种新颖的迭代式深度研究范式,它将长时程研究重新表述为带有策略性工作空间重建的马尔可夫决策过程。通过维护一份不断演化的报告作为记忆并定期综合见解,我们的方法可以在任意探索深度上保持一致的推理能力。我们进一步开发了效率感知策略优化(EAPO),这是一种通过几何奖励折扣激励高效探索、并通过自适应下采样实现稳定分布式训练的强化学习框架。大量实验表明,IterResearch 相比现有开源智能体取得了显著提升,在六个基准上平均提高 14.5 个百分点,并缩小了与前沿专有系统的差距。值得注意的是,我们的范式展现出前所未有的交互扩展能力,可扩展到 2048 次交互并带来大幅性能提升(从 3.5% 到 42.5%);它同时也是一种有效的提示策略,在长时程任务上可将前沿模型相对 ReAct 最多提升 19.2 个百分点。这些发现使 IterResearch 成为长时程推理的通用解决方案,既可以作为训练得到的智能体,也可以作为前沿模型的提示范式。
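下面用一个极简循环示意“马尔可夫式工作空间重建”:每一轮的提示只由问题、不断演化的报告和最新观察构成,而不是累积全部历史。`call_llm` 与 `run_tool` 为假设的占位函数,并非论文实现。

```python
def iter_research(question: str, max_rounds: int = 8) -> str:
    report = ""                                    # 作为记忆的演化报告
    observation = ""
    for _ in range(max_rounds):
        # 马尔可夫式工作空间:每轮提示只包含问题、当前报告与最新观察
        prompt = (
            f"问题:{question}\n"
            f"当前报告:{report}\n"
            f"最新观察:{observation}\n"
            "请更新报告,并给出下一步动作(SEARCH <查询> 或 FINISH <答案>)。"
        )
        report, action = call_llm(prompt)          # 占位:返回(新报告, 动作)
        if action.startswith("FINISH"):
            return action.removeprefix("FINISH").strip()
        observation = run_tool(action)             # 占位:执行搜索/浏览等工具
    return report

def call_llm(prompt):
    """占位:实际应调用一个深度研究模型。"""
    return "已综合现有信息。", "FINISH 示例答案"

def run_tool(action):
    """占位:实际应执行检索并返回结果摘要。"""
    return ""

if __name__ == "__main__":
    print(iter_research("示例问题"))
```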
MADD:多代理药物发现乐团
- 标题: MADD: Multi-Agent Drug Discovery Orchestra
- 作者: Gleb V. Solovev, Alina B. Zhidkovskaya, Anastasia Orlova, Nina Gubina, Anastasia Vepreva, Rodion Golovinskii, Ilya Tonkii, Ivan Dubrovsky, Ivan Gurev, Dmitry Gilemkhanov, Denis Chistiakov, Timur A. Aliev, Ivan Poddiakov, Galina Zubkova, Ekaterina V. Skorb, Vladimir Vinogradov, Alexander Boukhanovsky, Nikolay Nikitin, Andrei Dmitrenko, Anna Kalyuzhnaya, Andrey Savchenko
- 日期: 2025-11-11
- ArXiv主页: https://arxiv.org/abs/2511.08217
- gitHub仓库: https://github.com/sb-ai-lab/MADD
英文摘要
Hit identification is a central challenge in early drug discovery, traditionally requiring substantial experimental resources. Recent advances in artificial intelligence, particularly large language models (LLMs), have enabled virtual screening methods that reduce costs and improve efficiency. However, the growing complexity of these tools has limited their accessibility to wet-lab researchers. Multi-agent systems offer a promising solution by combining the interpretability of LLMs with the precision of specialized models and tools. In this work, we present MADD, a multi-agent system that builds and executes customized hit identification pipelines from natural language queries. MADD employs four coordinated agents to handle key subtasks in de novo compound generation and screening. We evaluate MADD across seven drug discovery cases and demonstrate its superior performance compared to existing LLM-based solutions. Using MADD, we pioneer the application of AI-first drug design to five biological targets and release the identified hit molecules. Finally, we introduce a new benchmark of query-molecule pairs and docking scores for over three million compounds to contribute to the agentic future of drug design.
中文摘要
命中化合物鉴定是早期药物发现的核心挑战,传统上需要大量实验资源。人工智能特别是大语言模型(LLM)的最新进展,使得虚拟筛选方法能够降低成本并提高效率。然而,这些工具日益复杂,限制了湿实验室研究人员的使用。多智能体系统将 LLM 的可解释性与专业模型和工具的精度相结合,提供了一种有前景的解决方案。在这项工作中,我们提出了 MADD,一个能够根据自然语言查询构建并执行定制化命中鉴定管线的多智能体系统。MADD 采用四个相互协调的智能体来处理从头化合物生成与筛选中的关键子任务。我们在七个药物发现案例上评估了 MADD,并证明其性能优于现有的基于 LLM 的方案。利用 MADD,我们率先将 AI 优先的药物设计应用于五个生物靶点,并公开发布所鉴定的命中分子。最后,我们引入了一个包含超过 300 万种化合物的查询-分子对与对接分数的新基准,为药物设计的智能体化未来贡献力量。
Time-to-Move:通过双时钟去噪实现免训练的运动可控视频生成
- 标题: Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
- 作者: Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany
- 日期: 2025-11-09
- ArXiv主页: https://arxiv.org/abs/2511.08633
- 论文链接: https://arxiv.org/pdf/2511.08633
- 项目链接: https://time-to-move.github.io/
- gitHub仓库: https://github.com/time-to-move/TTM
英文摘要
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit’s use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
中文摘要
基于扩散的视频生成可以创建逼真的视频,但现有的基于图像和文本的条件控制无法提供精确的运动控制。以往的运动条件合成方法通常需要针对特定模型进行微调,计算成本高且限制较多。我们提出了 Time-to-Move(TTM),一个免训练、即插即用的框架,用于在图像到视频(I2V)扩散模型上实现运动与外观可控的视频生成。我们的核心思路是使用通过用户友好操作(例如剪切拖拽或基于深度的重投影)得到的粗糙参考动画。受 SDEdit 使用粗略布局线索进行图像编辑的启发,我们将粗糙动画视为粗略的运动线索,并把这一机制适配到视频领域。我们通过图像条件保留外观,并引入双时钟去噪:这是一种区域相关的策略,在指定运动的区域强制严格对齐,而在其他区域保留灵活性,从而在忠实于用户意图和自然动态之间取得平衡。这种对采样过程的轻量级修改不产生额外的训练或运行时开销,并且与任何主干模型兼容。在物体运动和相机运动基准上的大量实验表明,TTM 在真实感和运动控制方面达到或超过了现有基于训练的基线。除此之外,TTM 还带来一项独特能力:通过像素级条件实现精确的外观控制,超越了纯文本提示的限制。请访问我们的项目页面获取视频示例和代码:https://time-to-move.github.io/。
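双时钟去噪的直觉可以用一个区域相关的初始化步骤来示意:在用户指定运动的区域,从更小的时间步(更少噪声)注入参考动画的潜变量,使采样贴合更紧;其余区域注入更多噪声以保留自由度。下面的混合方式与时间步取值均为假设,并非论文的精确算法。

```python
import torch

def dual_clock_init(ref_latent: torch.Tensor, motion_mask: torch.Tensor,
                    add_noise, t_strong: int, t_weak: int) -> torch.Tensor:
    """
    ref_latent:  粗糙参考动画经 VAE 编码后的潜变量 (B, C, T, H, W)
    motion_mask: 运动指定区域的掩码 (B, 1, T, H, W),1 表示需要严格跟随参考
    add_noise:   扩散模型的前向加噪函数 add_noise(latent, t)
    t_strong < t_weak:运动区域使用更小的时间步(噪声更少),对齐更紧
    """
    strong = add_noise(ref_latent, t_strong)    # 运动区域:噪声少 -> 强对齐
    weak = add_noise(ref_latent, t_weak)        # 其他区域:噪声多 -> 更自由
    return motion_mask * strong + (1 - motion_mask) * weak

if __name__ == "__main__":
    def toy_add_noise(x, t):                    # 玩具加噪:t 越大噪声越多
        return x + 0.01 * t * torch.randn_like(x)

    ref = torch.randn(1, 4, 8, 32, 32)
    mask = torch.zeros(1, 1, 8, 32, 32)
    mask[..., 8:24, 8:24] = 1.0                 # 假设中间区域是用户拖拽的物体
    init_latent = dual_clock_init(ref, mask, toy_add_noise, t_strong=200, t_weak=800)
    print(init_latent.shape)
```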
好到无法变坏:论大语言模型扮演反派角色的失败
- 标题: Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
- 作者: Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus
- 日期: 2025-11-07
- ArXiv主页: https://arxiv.org/abs/2511.04962
- 论文链接: https://arxiv.org/pdf/2511.04962
- 项目链接: https://github.com/Tencent/digitalhuman/tree/main/RolePlay_Villain
英文摘要
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
中文摘要
大型语言模型(LLM)越来越多地承担创意生成任务,包括模拟虚构人物。然而,它们塑造非亲社会、反派角色的能力在很大程度上仍未得到检验。我们假设:现代 LLM 的安全对齐与真实地扮演道德模糊或反派角色这一任务存在根本冲突。为了验证这一点,我们提出了道德角色扮演(Moral RolePlay)基准,这是一个新的数据集,包含四级道德对齐量表和用于严格评估的平衡测试集。我们让最先进的 LLM 扮演从道德楷模到纯粹反派的各类角色。我们的大规模评估表明,随着角色道德水平的下降,角色扮演的保真度呈现一致且单调的下降。我们发现,模型最难处理与安全原则直接对立的特质,例如“欺骗性”和“操纵性”,常常用浮于表面的攻击性来替代细腻的恶意。此外,我们证明通用聊天机器人能力并不能很好地预测反派角色扮演能力,安全对齐程度高的模型表现尤其差。我们的工作为这一关键局限提供了首个系统性证据,凸显了模型安全性与创作保真度之间的核心张力。我们的基准和研究结果为开发更细致、具备上下文感知的对齐方法铺平了道路。
DRIVE:竞争性代码生成中可验证奖励强化学习的数据整理最佳实践
- 标题: DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
- 作者: Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
- 日期: 2025-11-09
- ArXiv主页: https://arxiv.org/abs/2511.06307
英文摘要
Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR. Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards: first, training on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation; second, we perform Pre-GRPO: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
中文摘要
最近的推理优先模型(例如 OpenAI o1、DeepSeek R1)重新激起了人们对 RLVR 的兴趣。然而,相关进展主要集中在数学领域(例如 AIME),竞争性编程代码生成的探索不足,数据整理也比强化学习算法设计受到的关注更少。我们研究如何构建 RLVR 数据集(即 RL 提示),并给出能在竞争性编程代码生成上取得强劲性能的实用训练技巧。我们的管线始于从强大开源模型蒸馏得到的监督微调(SFT),并辅以通用数据和推理密集型数据。随后的强化学习采用可执行、由测试用例驱动奖励的两阶段流程:第一阶段,使用组相对策略优化(GRPO)在大规模、分布均匀的竞争性编程题目上训练,每个提示 8 次 rollout,并使用相对较短的响应生成窗口(例如 SFT 期间为 32k、该阶段为 24k),以扩大熵并缓解重复与截断;第二阶段执行 Pre-GRPO:在一小批高质量的难题上,以较大的 rollout 预算(每个提示 64 次 rollout)进行更新,并采用聚焦难题的课程,在整个训练过程中持续保留最难的实例。我们在 Qwen2.5-32B 上实现了该方法,并在 LeetCode 和 Codeforces 周赛上评估以避免数据泄露。所得模型在同等规模的模型中达到最先进水平,可与 DeepSeek v3.1 和 Doubao-1.5-Thinking 等领先系统相媲美。我们还考察了扩展趋势,并在内部大规模 MoE 模型上观察到强劲的 RL 扩展性。我们的研究提炼了 RLVR 用于竞争性编程代码生成时在数据整理、熵扩展和课程设计方面的简明最佳实践。
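Pre-GRPO 阶段“持续保留最难实例”的课程可以用一个按经验通过率筛选的函数来示意;下面的阈值与统计方式均为假设,仅说明思路。

```python
def hard_focus_filter(problems, pass_rates, keep_ratio: float = 0.25):
    """
    problems:   题目列表
    pass_rates: 与题目对应的经验通过率(由每题多次 rollout 的判题结果统计得到)
    返回通过率最低的 keep_ratio 比例的题目,供下一轮继续训练。
    """
    ranked = sorted(zip(problems, pass_rates), key=lambda x: x[1])
    k = max(1, int(len(ranked) * keep_ratio))
    return [p for p, _ in ranked[:k]]

if __name__ == "__main__":
    probs = ["p1", "p2", "p3", "p4"]
    rates = [0.9, 0.1, 0.4, 0.0]          # 假设由每题 64 次 rollout 统计得到
    print(hard_focus_filter(probs, rates, keep_ratio=0.5))   # ['p4', 'p2']
```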
视觉空间调整
- 标题: Visual Spatial Tuning
- 作者: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
- 日期: 2025-11-07
- ArXiv主页: https://arxiv.org/abs/2511.05491
- 论文链接: https://arxiv.org/pdf/2511.05491
- 项目链接: https://yangr116.github.io/vst_project/
- gitHub仓库: https://github.com/Yangr116/VST
英文摘要
Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without the side-effect to general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
中文摘要
从视觉输入中捕获空间关系是类人通用智能的基石。先前的一些研究试图通过添加额外的专家编码器来增强视觉语言模型(VLM)的空间感知能力,但这会带来额外开销,并且通常会损害通用能力。为了在通用架构中增强空间能力,我们提出了视觉空间调整(VST),这是一个从空间感知到空间推理、培养 VLM 类人视觉空间能力的综合框架。我们首先通过构建一个名为 VST-P 的大规模数据集来增强 VLM 的空间感知能力,该数据集包含 410 万个样本,涵盖单视图、多图像和视频场景下的 19 项技能。然后,我们提出了 VST-R,一个包含 13.5 万个样本的精选数据集,用于指导模型进行空间推理。特别地,我们采用渐进式训练流程:先通过监督微调构建基础空间知识,再通过强化学习进一步提高空间推理能力。在不损害通用能力的前提下,所提出的 VST 在多个空间基准上持续取得最先进的结果,包括在 MMSI-Bench 上的 34.8% 和 VSIBench 上的 61.2%。事实证明,所提出的空间调整范式也能显著增强视觉-语言-动作模型,为更具物理基础的人工智能铺平道路。
大型语言模型的黑盒在线策略蒸馏
- 标题: Black-Box On-Policy Distillation of Large Language Models
- 作者: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei
- 日期: 2025-11-13
- ArXiv主页: https://arxiv.org/abs/2511.10643
- 论文链接: https://arxiv.org/pdf/2511.10643
- 项目链接: https://aka.ms/GAD-project
英文摘要
Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model’s text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM’s, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
中文摘要
黑盒蒸馏仅通过学习专有教师模型的文本输出来构建学生大语言模型(LLM),而无需访问其内部 logits 或参数。在这项工作中,我们提出了生成对抗蒸馏(GAD),它同时实现了在线策略(on-policy)与黑盒蒸馏。GAD 将学生 LLM 视为生成器,并训练一个判别器来区分学生与教师 LLM 的响应,从而构成一个极小极大博弈。判别器充当与学生共同演化的在线策略奖励模型,提供稳定、自适应的反馈。实验结果表明,GAD 始终优于常用的序列级知识蒸馏。特别地,使用 GAD 训练的 Qwen2.5-14B-Instruct(学生)在 LMSYS-Chat 自动评估上达到了与其教师 GPT-5-Chat 相当的水平。这些结果表明 GAD 是黑盒 LLM 蒸馏的一个有前景且有效的范式。
DeepEyesV2:迈向代理多模态模型
- 标题: DeepEyesV2: Toward Agentic Multimodal Model
- 作者: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
- 日期: 2025-11-07
- ArXiv主页: https://arxiv.org/abs/2511.05271
- 论文链接: https://arxiv.org/pdf/2511.05271
- 项目链接: https://visual-agent.github.io/
- gitHub仓库: https://github.com/Visual-Agent/DeepEyes
英文摘要
Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows model to selectively invoke tools based on context. We hope our study can provide guidance for community in developing agentic multimodal models.
中文摘要
代理多模态模型不仅应该理解文本和图像,还应该主动调用外部工具,例如代码执行环境和网络搜索,并将这些操作集成到推理中。在这项工作中,我们介绍了 DeepEyesV2,并从数据构建、训练方法和模型评估的角度探讨了如何构建代理多模态模型。我们观察到,仅直接强化学习无法诱导稳健的工具使用行为。这种现象催生了两阶段的训练流程:冷启动阶段用于建立工具使用模式,强化学习阶段用于进一步细化工具调用。我们策划了一个多样化、中等挑战性的训练数据集,特别包括使用工具有益的示例。我们进一步介绍了 RealX-Bench,这是一个旨在评估现实世界多模态推理的综合基准,其本质上需要集成多种能力,包括感知、搜索和推理。我们在 RealX-Bench 和其他代表性基准测试上评估 DeepEyesV2,证明其在现实世界理解、数学推理和搜索密集型任务方面的有效性。此外,DeepEyesV2展示了任务自适应工具调用,倾向于使用图像操作进行感知任务,使用数值计算进行推理任务。强化学习进一步实现了复杂的工具组合,并允许模型根据上下文有选择地调用工具。我们希望我们的研究能为社区开发代理多模式模型提供指导。
对话系统中的自适应多代理响应细化
- 标题: Adaptive Multi-Agent Response Refinement in Conversational Systems
- 作者: Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko
- 日期: 2025-11-11
- ArXiv主页: https://arxiv.org/abs/2511.08319
- 论文链接: https://arxiv.org/pdf/2511.08319
英文摘要
Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that ours significantly outperforms relevant baselines, particularly in tasks involving knowledge or user’s persona, or both.
中文摘要
大型语言模型(LLM)通过生成类人回复,在对话系统中取得了显著成功。然而,它们有时会达不到要求,尤其是在需要考虑个性化或特定知识时。在现实场景中,依靠用户来发现这些错误并重新请求回复是不切实际的。解决此问题的一种方法是在将回复返回给用户之前对其进行细化。现有方法主要在单个 LLM 内部完善回复,难以兼顾有效对话所需的多个方面。在这项工作中,我们提出通过多智能体框架来细化回复,其中每个智能体针对某一方面承担特定角色。我们关注对话质量至关重要的三个关键方面:事实性、个性化和连贯性。每个智能体负责审查并细化其中一个方面,随后将它们的反馈合并以改进整体回复。为了加强它们之间的协作,我们引入了动态通信策略:不遵循固定的智能体顺序,而是根据每个查询的具体需求自适应地选择并协调最相关的智能体。我们在具有挑战性的对话数据集上验证了该框架,结果表明其显著优于相关基线,尤其是在涉及知识或用户画像(或两者兼有)的任务中。
Motif 2 12.7B技术报告
- 标题: Motif 2 12.7B technical report
- 作者: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
- 日期: 2025-11-07
- ArXiv主页: https://arxiv.org/abs/2511.07464
- 论文链接: https://arxiv.org/pdf/2511.07464
英文摘要
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
中文摘要
我们推出了 Motif-2-12.7B,这是一个新的开放权重基础模型,通过将架构创新与系统级优化相结合,推进大型语言模型的效率前沿。Motif-2-12.7B 专为在受限计算预算下实现可扩展的语言理解和稳健的指令泛化而设计,它以 Motif-2.6B 为基础,集成了分组差分注意力(GDA),通过解耦信号通路与噪声控制注意力通路来提高表示效率。该模型使用课程驱动、逐步调整数据构成比例的数据调度器,在涵盖多种语言、数学、科学和编程领域的 5.5 万亿 token 上进行了预训练。训练系统利用 MuonClip 优化器以及定制的高性能内核(包括融合 PolyNorm 激活和并行 Muon 算法),在大规模分布式环境中获得显著的吞吐量和内存效率提升。后训练采用三阶段监督微调流程,依次增强通用指令遵循、组合理解和语言精确性。Motif-2-12.7B 在各类基准上展现出有竞争力的性能,表明精心设计的架构扩展与优化的训练方案可以与规模大得多的模型相媲美。
UniVA:面向开源下一代视频多面手的通用视频代理
- 标题: UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
- 作者: Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
- 日期: 2025-11-11
- ArXiv主页: https://arxiv.org/abs/2511.08521
- 论文链接: https://arxiv.org/pdf/2511.08521
- 项目链接: https://univa.online/
- gitHub仓库: https://github.com/univa-agent/univa
英文摘要
While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
中文摘要
虽然专门的 AI 模型擅长生成或理解等单一视频任务,但现实应用需要把这些能力组合成复杂、可迭代的工作流程。为了弥补这一差距,我们推出了 UniVA,这是一个面向下一代视频通才的开源、全能力多智能体框架,它将视频理解、分割、编辑和生成统一到连贯的工作流程中。UniVA 采用“计划-执行”双智能体架构,驱动高度自动化且主动的工作流:计划智能体解释用户意图并将其分解为结构化的视频处理步骤,执行智能体则通过模块化、基于 MCP 的工具服务器(用于分析、生成、编辑、跟踪等)执行这些步骤。借助分层多级记忆(全局知识、任务上下文和用户特定偏好),UniVA 能够维持长时程推理、上下文连续性和智能体间通信,实现具有完整可追溯性的交互式、自我反思式视频创作。这种设计支持迭代的、任意条件的视频工作流(例如文本/图像/视频条件生成 → 多轮编辑 → 目标分割 → 组合式合成),而这些以前用单一用途模型或单体视频语言模型实现起来都很麻烦。我们还推出了 UniVA-Bench,这是一个涵盖理解、编辑、分割和生成的多步骤视频任务基准套件,用于严格评估此类智能体化视频系统。UniVA 和 UniVA-Bench 均完全开源,旨在促进面向下一代多模态 AI 系统的交互式、智能体化、通用视频智能研究。(https://univa.online/)
KLASS:掩码扩散模型中的 KL 引导快速推理
- 标题: KLASS: KL-Guided Fast Inference in Masked Diffusion Models
- 作者: Seo Hyun Kim, Sunwoo Hong, Hojung Jung, Youngrok Park, Se-Young Yun
- 日期: 2025-11-07
- ArXiv主页: https://arxiv.org/abs/2511.05664
- gitHub仓库: https://github.com/shkim0116/KLASS
英文摘要
Masked diffusion models have demonstrated competitive results on various tasks including language generation. However, due to its iterative refinement process, the inference is often bottlenecked by slow and static sampling speed. To overcome this problem, we introduce KL-Adaptive Stability Sampling (KLASS), a fast yet effective sampling method that exploits token-level KL divergence to identify stable, high-confidence predictions. By unmasking multiple tokens in each iteration without any additional model training, our approach speeds up generation significantly while maintaining sample quality. On reasoning benchmarks, KLASS achieves up to 2.78x wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers. We further validate KLASS across diverse domains, including text, image, and molecular generation, showing its effectiveness as a broadly applicable sampler across different models.
中文摘要
掩码扩散模型在包括语言生成在内的各类任务上都展现出有竞争力的结果。然而,由于其迭代式细化过程,推理常常受制于缓慢且固定的采样速度。为了解决这个问题,我们提出了 KL 自适应稳定性采样(KLASS),这是一种快速而有效的采样方法,利用 token 级 KL 散度来识别稳定且高置信度的预测。通过在每次迭代中同时解除多个 token 的掩码,且无需任何额外的模型训练,我们的方法在保持样本质量的同时显著加快了生成速度。在推理基准测试中,KLASS 实现了最高 2.78 倍的实际运行时间加速,同时性能优于标准贪心解码,在基于扩散的采样器中取得了最先进的结果。我们进一步在文本、图像和分子生成等多个领域验证了 KLASS,表明它是一种可广泛适用于不同模型的采样器。
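KL 引导的核心是判断某个仍被掩码的位置的预测是否已经“稳定”:比较相邻两次迭代的预测分布,KL 很小且置信度足够高时即可解除掩码。下面是这一选择规则的极简示意(PyTorch),阈值与具体判据为假设,并非官方实现。

```python
import torch
import torch.nn.functional as F

def select_stable_positions(prev_logits: torch.Tensor,
                            curr_logits: torch.Tensor,
                            still_masked: torch.Tensor,
                            kl_thresh: float = 1e-3,
                            conf_thresh: float = 0.9) -> torch.Tensor:
    """
    prev_logits / curr_logits: (L, V) 相邻两次去掩码迭代在每个位置上的 logits
    still_masked:              (L,)  布尔张量,True 表示该位置仍被掩码
    返回本轮可以同时解除掩码的位置索引。
    """
    log_p = F.log_softmax(curr_logits, dim=-1)
    log_q = F.log_softmax(prev_logits, dim=-1)
    p = log_p.exp()
    kl = (p * (log_p - log_q)).sum(dim=-1)        # 每个位置的 KL(curr || prev)
    confidence = p.max(dim=-1).values              # 每个位置的最大预测概率
    stable = (kl < kl_thresh) & (confidence > conf_thresh) & still_masked
    return stable.nonzero(as_tuple=True)[0]

if __name__ == "__main__":
    L, V = 6, 50
    prev = torch.randn(L, V)
    curr = prev.clone()
    curr[0, 3] += 8.0                              # 位置 0 变得高置信……
    prev[0, 3] += 8.0                              # ……且与上一轮一致(低 KL)
    masked = torch.tensor([True, True, True, False, False, False])
    print(select_stable_positions(prev, curr, masked))   # tensor([0])
```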
The Station:人工智能驱动探索的开放世界环境
- 标题: The Station: An Open-World Environment for AI-Driven Discovery
- 作者: Stephen Chung, Wenyu Du
- 日期: 2025-11-09
- ArXiv主页: https://arxiv.org/abs/2511.06309
- 论文链接: https://arxiv.org/pdf/2511.06309
- 项目链接: https://dualverse-ai.github.io/station_data/
- gitHub仓库: https://github.com/dualverse-ai/station
英文摘要
We introduce the STATION, an open-world multi-agent environment that models a miniature scientific ecosystem. Leveraging their extended context windows, agents in the Station can engage in long scientific journeys that include reading papers from peers, formulating hypotheses, submitting code, performing analyses, and publishing results. Importantly, there is no centralized system coordinating their activities - agents are free to choose their own actions and develop their own narratives within the Station. Experiments demonstrate that AI agents in the Station achieve new state-of-the-art performance on a wide range of benchmarks, spanning from mathematics to computational biology to machine learning, notably surpassing AlphaEvolve in circle packing. A rich tapestry of narratives emerges as agents pursue independent research, interact with peers, and build upon a cumulative history. From these emergent narratives, novel methods arise organically, such as a new density-adaptive algorithm for scRNA-seq batch integration. The Station marks a first step towards autonomous scientific discovery driven by emergent behavior in an open-world environment, representing a new paradigm that moves beyond rigid optimization.
中文摘要
我们介绍 STATION,一个模拟微型科学生态系统的开放世界多智能体环境。借助扩展的上下文窗口,Station 中的智能体可以进行漫长的科学旅程,包括阅读同行的论文、提出假设、提交代码、执行分析和发表结果。重要的是,没有中心化系统来协调它们的活动:智能体可以自由选择自己的行动,并在 Station 中发展自己的叙事。实验表明,Station 中的 AI 智能体在从数学到计算生物学再到机器学习的广泛基准上取得了新的最先进性能,尤其在圆填充(circle packing)问题上超越了 AlphaEvolve。当智能体进行独立研究、与同行互动并在累积的历史之上继续构建时,丰富多彩的叙事随之涌现。从这些涌现的叙事中,新方法自然产生,例如一种用于 scRNA-seq 批次整合的新型密度自适应算法。Station 标志着由开放世界环境中的涌现行为驱动的自主科学发现迈出了第一步,代表了一种超越僵化优化的新范式。
VeriCoT:通过逻辑一致性检查进行神经符号思维链验证
- 标题: VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
- 作者: Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala
- 日期: 2025-11-06
- ArXiv主页: https://arxiv.org/abs/2511.04662
- 论文链接: https://arxiv.org/pdf/2511.04662
英文摘要
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT’s verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
中文摘要
大语言模型可以通过思维链(CoT)进行多步推理,但无法可靠地验证自身的逻辑。即使得出了正确答案,其底层推理也可能存在缺陷,从而削弱在高风险场景中的可信度。为缓解这一问题,我们提出了 VeriCoT,这是一种神经符号方法,用于从 CoT 推理中提取并验证形式化的逻辑论证。VeriCoT 将每个 CoT 推理步骤形式化为一阶逻辑,并识别将论证建立在源文上下文、常识知识或先前推理步骤之上的前提。符号表示使自动求解器能够验证逻辑有效性,而自然语言前提则让人和系统能够识别缺乏依据或存在谬误的推理步骤。在 ProofWriter、LegalBench 和 BioASQ 数据集上的实验表明,VeriCoT 能有效识别有缺陷的推理,并且是最终答案正确性的有力预测指标。我们还利用 VeriCoT 的验证信号进行:(1)推理时自我反思;(2)在 VeriCoT 蒸馏数据集上进行监督微调(SFT);(3)使用基于验证的成对奖励,通过直接偏好优化(DPO)进行偏好微调(PFT),进一步提升推理的有效性和准确性。
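“用求解器检查前提是否蕴含某个推理步骤”可以用一个最小的命题逻辑例子来说明:断言前提成立且结论为假,若不可满足(unsat)则说明该步骤有依据。下面用 z3 给出玩具示例,命题与规则均为虚构,并非论文的完整一阶逻辑形式化流程。

```python
from z3 import Bool, Implies, Not, Solver, unsat

def step_is_entailed(premises, conclusion) -> bool:
    """若 premises ∧ ¬conclusion 不可满足,则前提蕴含该推理步骤。"""
    s = Solver()
    for p in premises:
        s.add(p)
    s.add(Not(conclusion))          # 反设结论为假
    return s.check() == unsat

if __name__ == "__main__":
    rain = Bool("rain")             # 命题:下雨
    wet = Bool("wet")               # 命题:地面湿
    premises = [rain, Implies(rain, wet)]
    print(step_is_entailed(premises, wet))   # True:该步骤有依据
    print(step_is_entailed([rain], wet))     # False:缺少前提,属于无依据步骤
```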
未走的路:RLVR 可证明地偏离主方向学习
- 标题: The Path Not Taken: RLVR Provably Learns Off the Principals
- 作者: Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
- 日期: 2025-11-11
- ArXiv主页: https://arxiv.org/abs/2511.08567
- 论文链接: https://arxiv.org/pdf/2511.08567
英文摘要
Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR’s learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR’s training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
中文摘要
具有可验证奖励的强化学习(RLVR)能够可靠地提升大型语言模型的推理性能,但它似乎只修改了一小部分参数。我们重新审视这一悖论,并表明稀疏性只是模型条件化的优化偏置的表面现象:对于固定的预训练模型,更新会一致地集中在偏好的参数区域,在不同运行之间高度一致,并且对数据集和 RL 训练配方基本不变。我们用“三门理论”从机制上解释这些动态:门 I(KL 锚)施加受 KL 约束的更新;门 II(模型几何)将更新步引导离开主方向,进入低曲率、保持谱结构的子空间;门 III(精度)隐藏了非偏好区域中的微小更新,使这种偏离主方向的偏置表现为稀疏性。随后我们验证了该理论,并首次给出 RLVR 学习动态的参数级刻画:RLVR 在权重空间中偏离主方向进行学习,通过极小的谱漂移、更小的主子空间旋转以及非主方向上的更新对齐来获得提升。相比之下,SFT 瞄准主方向权重,扭曲谱结构,效果甚至落后于 RLVR。总之,这些结果首次从参数空间角度刻画了 RLVR 的训练动态,揭示了参数演化的清晰规律。至关重要的是,我们表明 RL 的优化机制与 SFT 截然不同,因此直接套用 SFT 时代的参数高效微调(PEFT)方法可能存在缺陷,我们对高级稀疏微调和 LoRA 变体的案例研究印证了这一点。我们希望这项工作能为白盒化理解 RLVR 以及设计几何感知、面向 RLVR 的原生学习算法开辟道路,而不是重复利用 SFT 时代的启发式方法。
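衡量“更新是否偏离主方向”的一种直观做法是:对某层预训练权重做 SVD,把更新 ΔW 投影到 top-k 奇异方向张成的子空间,看其能量占比。下面的度量只是示意,层的选取、k 的取值以及用奇异方向近似“主方向”的做法均为我们的假设,并非论文的完整测量流程。

```python
import numpy as np

def principal_energy_fraction(W: np.ndarray, dW: np.ndarray, k: int = 8) -> float:
    """
    W:  某一层的预训练权重矩阵 (m, n)
    dW: 训练(RLVR 或 SFT)带来的参数更新 (m, n)
    返回 dW 投影到 W 的 top-k 左/右奇异方向子空间后的能量占比;
    占比低说明更新主要发生在非主方向(off-principal)。
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T
    proj = Uk @ (Uk.T @ dW @ Vk) @ Vk.T          # 投影到主子空间
    return float(np.linalg.norm(proj) ** 2 / (np.linalg.norm(dW) ** 2 + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 64))
    dW = rng.normal(size=(64, 64)) * 1e-3         # 随机小更新:大部分能量在非主方向
    print(round(principal_energy_fraction(W, dW, k=8), 3))
```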
超越英语:大语言模型迈向包容且可扩展的多语言机器翻译
- 标题: Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
- 作者: Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu
- 日期: 2025-11-10
- ArXiv主页: https://arxiv.org/abs/2511.07003
- 论文链接: https://arxiv.org/pdf/2511.07003
- 项目链接: https://github.com/NiuTrans/LMT
- gitHub仓库: https://github.com/NiuTrans/LMT
英文摘要
Large language models have significantly advanced Multilingual Machine Translation (MMT), yet the broad language coverage, consistent translation quality, and English-centric bias remain open challenges. To address these challenges, we introduce LMT, a suite of Large-scale Multilingual Translation models centered on both Chinese and English, covering 60 languages and 234 translation directions. During development, we identify a previously overlooked phenomenon of directional degeneration, where symmetric multi-way fine-tuning data overemphasize reverse directions (X to En/Zh), leading to excessive many-to-one mappings and degraded translation quality. We propose Strategic Downsampling, a simple yet effective method to mitigate this degeneration. In addition, we design Parallel Multilingual Prompting (PMP), which leverages typologically related auxiliary languages to enhance cross-lingual transfer. Through rigorous data curation and refined adaptation strategies, LMT achieves SOTA performance among models of comparable language coverage, with our 4B model (LMT-60-4B) surpassing the much larger Aya-101-13B and NLLB-54B models by a substantial margin. We release LMT in four sizes (0.6B/1.7B/4B/8B) to catalyze future research and provide strong baselines for inclusive, scalable, and high-quality MMT (https://github.com/NiuTrans/LMT).
中文摘要
大型语言模型显著推进了多语言机器翻译(MMT),但广泛的语言覆盖、一致的翻译质量和以英语为中心的偏见仍然是悬而未决的挑战。为了应对这些挑战,我们推出了 LMT,一套以中英文为中心的大规模多语言翻译模型,涵盖 60 种语言、234 个翻译方向。在开发过程中,我们发现了一种以前被忽视的方向退化现象:对称的多向微调数据过分强调反向方向(X 到 En/Zh),导致过多的多对一映射和翻译质量下降。我们提出了战略下采样,这是一种缓解这种退化的简单而有效的方法。此外,我们设计了并行多语言提示(PMP),利用类型学上相关的辅助语言来增强跨语言迁移。通过严格的数据整理和精细的适配策略,LMT 在语言覆盖相当的模型中实现了 SOTA 性能,我们的 4B 模型(LMT-60-4B)大幅超越了规模更大的 Aya-101-13B 和 NLLB-54B。我们发布了四种规模(0.6B/1.7B/4B/8B)的 LMT,以促进未来的研究,并为包容、可扩展、高质量的 MMT 提供强有力的基线(https://github.com/NiuTrans/LMT)。
Wasm:构建结构化阿拉伯语交错多模态语料库的管道
- 标题: Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
- 作者: Khalil Hennara, Ahmad Bastati, Muhammad Hreden, Mohamed Motasim Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
- 日期: 2025-11-10
- ArXiv主页: https://arxiv.org/abs/2511.07080
- 论文链接: https://arxiv.org/pdf/2511.07080
英文摘要
The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents where images and text are interleaved outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre- trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.
中文摘要
大型语言模型(LLM)和大型多模态模型(LMM)的性能在很大程度上取决于其预训练数据集的质量和规模。最近的研究表明,在图像和文本交错的自然文档上训练的大型多模态模型,在各类基准上优于仅在图文对上训练的模型,并利用先进的预训练模型来保证语义对齐、图像序列一致性和文本连贯性。然而,对于阿拉伯语而言,缺乏保留文档结构的高质量多模态数据集限制了进展。在本文中,我们介绍了用于处理 Common Crawl 数据集的 Wasm 管道,以构建一个新的、独特地提供 Markdown 输出的阿拉伯语多模态数据集。与仅关注文本提取的现有阿拉伯语语料库不同,我们的方法在保留网页内容结构完整性的同时,兼顾纯文本和多模态预训练两种场景的灵活性。我们将自己的数据处理管道与主要现有数据集所用的管道进行了全面的比较分析,突出了过滤策略上的共性,并说明了我们特定设计选择的合理性。为支持未来研究,我们公开发布了具有代表性的数据集转储以及面向阿拉伯语的多模态处理管道。
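把网页处理成保留结构的图文交错 Markdown,可以用如下极简抽取流程来示意(这里选用 BeautifulSoup 仅作演示,只处理标题、段落和图片;语言识别、去重、质量过滤等真实管道中的步骤此处省略,具体实现以论文为准)。

```python
from bs4 import BeautifulSoup

def html_to_interleaved_markdown(html: str) -> str:
    """按文档顺序把 HTML 转为图文交错的 Markdown(仅示意:h1-h3、p、img)。"""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for node in soup.find_all(["h1", "h2", "h3", "p", "img"]):
        if node.name == "img":
            src = node.get("src", "")
            alt = node.get("alt", "")
            lines.append(f"![{alt}]({src})")
        elif node.name == "p":
            text = node.get_text(strip=True)
            if text:
                lines.append(text)
        else:                                   # h1/h2/h3 -> Markdown 标题
            level = int(node.name[1])
            lines.append("#" * level + " " + node.get_text(strip=True))
    return "\n\n".join(lines)

if __name__ == "__main__":
    sample = "<h1>عنوان</h1><p>فقرة أولى.</p><img src='a.png' alt='صورة'><p>فقرة ثانية.</p>"
    print(html_to_interleaved_markdown(sample))
```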