
General Agentic Memory via Deep Research

Abstract

Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of “just-in-time (JIT) compilation”, where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
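
The duo-design is easy to picture as a retrieval loop over a lossless page-store guided by a lightweight memory. The sketch below is a minimal illustration under assumed interfaces (keyword-overlap retrieval standing in for GAM's agentic researcher), not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class PageStore:
    """Lossless store of full interaction history, addressed by page id."""
    pages: dict = field(default_factory=dict)

    def add(self, page_id: str, text: str) -> None:
        self.pages[page_id] = text

@dataclass
class Memorizer:
    """Keeps only a lightweight index (here: keyword sets) per page."""
    index: dict = field(default_factory=dict)

    def memorize(self, page_id: str, text: str) -> None:
        self.index[page_id] = set(text.lower().split())

class Researcher:
    """At request time, retrieves pages via the memory and builds a context."""
    def __init__(self, store: PageStore, memory: Memorizer):
        self.store, self.memory = store, memory

    def build_context(self, request: str, k: int = 2) -> str:
        query = set(request.lower().split())
        scored = sorted(self.memory.index.items(),
                        key=lambda kv: len(kv[1] & query), reverse=True)
        hits = [self.store.pages[pid] for pid, _ in scored[:k]]
        return "\n---\n".join(hits)  # context assembled "just in time" for the client

# Usage: offline memorization, online research.
store, memory = PageStore(), Memorizer()
for pid, text in {"p1": "User prefers vegetarian recipes",
                  "p2": "Meeting with Alice moved to Friday"}.items():
    store.add(pid, text)
    memory.memorize(pid, text)
print(Researcher(store, memory).build_context("what recipes does the user like"))
```

In the actual framework, the researcher would be an LLM agent that iteratively queries the page-store and synthesizes the retrieved pages into the runtime context.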


SAM 3: Segment Anything with Concepts

  • Title: SAM 3: Segment Anything with Concepts
  • Authors: Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer
  • Date: 2025-11-20
  • ArXiv page: https://arxiv.org/abs/2511.16719
  • Paper link: https://arxiv.org/pdf/2511.16719
  • Project page: https://ai.meta.com/sam3/
  • GitHub repo: https://github.com/facebookresearch/sam3

Abstract

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.


GigaEvo: An Open-Source Optimization Framework Powered by LLMs and Evolutionary Algorithms

Abstract

Recent advances in LLM-guided evolutionary computation, particularly AlphaEvolve (Novikov et al., 2025; Georgiev et al., 2025), have demonstrated remarkable success in discovering novel mathematical constructions and solving challenging optimization problems. However, the high-level descriptions in published work leave many implementation details unspecified, hindering reproducibility and further research. In this report we present GigaEvo, an extensible open-source framework that enables researchers to study and experiment with hybrid LLM-evolution approaches inspired by AlphaEvolve. Our system provides modular implementations of key components: MAP-Elites quality-diversity algorithms, asynchronous DAG-based evaluation pipelines, LLM-driven mutation operators with insight generation and bidirectional lineage tracking, and flexible multi-island evolutionary strategies. In order to assess reproducibility and validate our implementation we evaluate GigaEvo on challenging problems from the AlphaEvolve paper: Heilbronn triangle placement, circle packing in squares, and high-dimensional kissing numbers. The framework emphasizes modularity, concurrency, and ease of experimentation, enabling rapid prototyping through declarative configuration. We provide detailed descriptions of system architecture, implementation decisions, and experimental methodology to support further research in LLM driven evolutionary methods. The GigaEvo framework and all experimental code are available at https://github.com/AIRI-Institute/gigaevo-core.
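
The MAP-Elites component maintains an archive of elites keyed by a behavior descriptor, so diversity and quality are pursued together. A self-contained toy version (scalar solutions and Gaussian mutation; GigaEvo would instead mutate candidate programs with an LLM and score them through its DAG evaluation pipeline) looks like this:

```python
import random

# Toy MAP-Elites: archive cells are indexed by a behavior descriptor
# (here, the solution bucketed into 10 bins over [0, 1]), and each cell
# keeps the highest-fitness elite seen so far.
def fitness(x: float) -> float:
    return -(x - 0.3) ** 2          # maximize: peak at x = 0.3

def descriptor(x: float) -> int:
    return min(int(x * 10), 9)      # 10 behavior bins

archive = {}                        # bin -> (fitness, solution)
for _ in range(2000):
    if archive and random.random() < 0.9:
        _, parent = random.choice(list(archive.values()))
        child = min(max(parent + random.gauss(0, 0.1), 0.0), 1.0)  # mutate an elite
    else:
        child = random.random()                                    # random restart
    f, b = fitness(child), descriptor(child)
    if b not in archive or f > archive[b][0]:
        archive[b] = (f, child)     # replace the cell's elite only if improved

print(sorted((b, round(f, 4)) for b, (f, _) in archive.items()))
```

The archive ends up holding one elite per behavior bin, which is what gives quality-diversity methods their spread of distinct high-performing solutions.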


Latent Collaboration in Multi-Agent Systems

Abstract

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent’s internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.


GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Abstract

Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task greatly and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
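
At its core, the agentic loop interleaves reasoning with two tools, an image-zoom-in tool and a web-search tool, until the model commits to a location. The schematic below uses stub tools and a hand-written policy purely to show the control flow; all names and signatures are illustrative assumptions, not GeoVista's API:

```python
from typing import Callable

# Stub tools standing in for the two tools described in the paper.
def image_zoom(image: dict, box: tuple) -> dict:
    """Return a crop of the image; a real tool would re-encode pixels."""
    return {"crop_of": image["name"], "box": box}

def web_search(query: str) -> str:
    """Placeholder: a real tool would call a search API."""
    return f"search results for: {query!r}"

def geolocalize(image: dict, policy: Callable, max_steps: int = 5) -> str:
    """Generic agent loop: the policy inspects observations and either
    calls a tool or commits to a final location guess."""
    observations = [image]
    for _ in range(max_steps):
        action, arg = policy(observations)
        if action == "zoom":
            observations.append(image_zoom(image, arg))
        elif action == "search":
            observations.append(web_search(arg))
        else:                       # action == "answer"
            return arg
    return "unknown"

# Trivial hand-written policy just to exercise the loop; GeoVista drives it
# with a VLM trained via cold-start SFT followed by RL.
def demo_policy(obs):
    if len(obs) == 1:
        return "zoom", (0.4, 0.4, 0.9, 0.9)
    if len(obs) == 2:
        return "search", "red-and-white lighthouse on rocky coast"
    return "answer", "toy guess: a coastal village in Nova Scotia, Canada"

print(geolocalize({"name": "query.jpg"}, demo_policy))
```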


OpenMMReasoner: Pushing the Frontier of Multimodal Reasoning with an Open and General Recipe

Abstract

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.


AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

Abstract

Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.
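
Treating environments as factorizable distributions means the transition, observation, and reward factors can be generated and swapped independently. A toy sketch of such a factorized environment (a 1-D random walk, not one of AutoEnv's generated worlds) is shown below:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class FactorizedEnv:
    """An environment assembled from three independently swappable factors."""
    transition: Callable   # (state, action) -> next_state
    observe: Callable      # state -> observation
    reward: Callable       # (state, action, next_state) -> float

    def step(self, state, action):
        nxt = self.transition(state, action)
        return nxt, self.observe(nxt), self.reward(state, action, nxt)

# Factor 1: noisy 1-D random-walk dynamics on positions 0..9.
def walk(state, action):
    move = action if random.random() > 0.1 else -action   # 10% slip noise
    return max(0, min(9, state + move))

# Factor 2: partial observation (only the parity of the position is visible).
def parity_obs(state):
    return state % 2

# Factor 3: sparse reward for reaching the goal cell.
def goal_reward(state, action, nxt):
    return 1.0 if nxt == 9 else 0.0

env = FactorizedEnv(walk, parity_obs, goal_reward)
state = 0
for _ in range(20):
    state, obs, r = env.step(state, random.choice([-1, 1]))
print("final state:", state)
```

Swapping any single factor (e.g., a different observation function) yields a new heterogeneous environment, which is the property the benchmark exploits at scale.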


Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

  • Title: Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
  • Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
  • Date: 2025-11-19
  • ArXiv page: https://arxiv.org/abs/2511.15210
  • Paper link: https://arxiv.org/pdf/2511.15210

Abstract

Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text “representationally simple” while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively “easy”, whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
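
For readers unfamiliar with ID, one common estimator is TwoNN, which fits the distribution of the ratio between each embedding's two nearest-neighbor distances. The simplified maximum-likelihood variant below is illustrative and may differ from the exact estimator used in the paper:

```python
import numpy as np

def two_nn_id(points: np.ndarray) -> float:
    """Simplified TwoNN estimate (after Facco et al., 2017): the ratio
    mu = r2/r1 of each point's two nearest-neighbor distances follows a
    Pareto law whose shape is the intrinsic dimension, so the MLE is
    d = N / sum(log mu)."""
    sq = np.sum(points ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    np.fill_diagonal(d2, np.inf)                       # ignore self-distances
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]  # r1^2, r2^2 per point
    mu = np.sqrt(two_smallest[:, 1] / np.maximum(two_smallest[:, 0], 1e-24))
    mu = mu[np.isfinite(mu) & (mu > 1.0)]              # drop ties and duplicates
    return len(mu) / float(np.sum(np.log(mu)))

# Sanity check: a 3-D cloud linearly embedded in 128-D should give ID close to 3.
rng = np.random.default_rng(0)
embedded = rng.normal(size=(1500, 3)) @ rng.normal(size=(3, 128))
print(round(two_nn_id(embedded), 2))
```

The estimate depends only on the geometry of the embedding point cloud, which is why it is complementary to entropy-based metrics of prediction quality.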


Multimodal Evaluation of Russian-language Architectures

  • Title: Multimodal Evaluation of Russian-language Architectures
  • Authors: Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova
  • Date: 2025-11-19
  • ArXiv page: https://arxiv.org/abs/2511.15552
  • Paper link: https://arxiv.org/pdf/2511.15552
  • Project page: https://mera.a-ai.ru/en/multi

Abstract

Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.


DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Abstract

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
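
One way to realize a frequency-aware loss of the kind described is to measure the training error per frequency sub-band and weight the bands differently. The sketch below uses a one-level Haar split with an assumed weighting; DeCo's actual flow-matching objective may use a different decomposition and weights:

```python
import numpy as np

def haar2d(x: np.ndarray):
    """One-level 2-D Haar transform of an (H, W) array with even H, W.
    Returns (LL, LH, HL, HH) sub-bands of shape (H/2, W/2)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def frequency_weighted_loss(pred: np.ndarray, target: np.ndarray,
                            w_low: float = 1.0, w_high: float = 0.25) -> float:
    """Mean-squared error computed per Haar sub-band, with the finest
    (high-frequency) bands down-weighted relative to the low-frequency band."""
    ll, lh, hl, hh = haar2d(pred - target)
    return float(w_low * np.mean(ll ** 2) +
                 w_high * (np.mean(lh ** 2) + np.mean(hl ** 2) + np.mean(hh ** 2)))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
print(round(frequency_weighted_loss(x, y), 3))
```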


DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Abstract

Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.


Computer-Use Agents as Judges for Generative User Interfaces

Abstract

Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUAs serve as judges to assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.


MedSAM3: Delving into Segment Anything with Medical Concepts

Abstract

Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.


Agent0-VL: Exploring Self-Evolving Agents for Tool-Integrated Vision-Language Reasoning

Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.


Inferix: A Block-Diffusion-Based Next-Generation Inference Engine for World Simulation

Abstract

World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.
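
Semi-autoregressive (block-diffusion) decoding runs an outer loop over blocks and an inner denoising loop within each block, conditioning on the blocks generated so far. The toy sketch below uses a stub denoiser and a plain Python list in place of real KV-cache management; it shows the control pattern only, not Inferix's engine:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(block: np.ndarray, context: list, step: int, total: int) -> np.ndarray:
    """Stub for one diffusion denoising step of the current block,
    conditioned on previously generated blocks (the stand-in 'KV cache')."""
    ctx_mean = np.mean(context, axis=0) if context else np.zeros_like(block)
    alpha = (step + 1) / total
    return (1 - alpha) * block + alpha * ctx_mean   # toy update toward the context

def generate(num_blocks: int = 4, block_shape=(2, 3), denoise_steps: int = 8):
    cache = []                                      # previously decoded blocks
    for _ in range(num_blocks):
        block = rng.normal(size=block_shape)        # start each block from noise
        for step in range(denoise_steps):           # diffusion *within* the block
            block = denoise_step(block, cache, step, denoise_steps)
        cache.append(block)                         # condition future blocks on it
    return np.stack(cache)

video_latents = generate()
print(video_latents.shape)    # (4, 2, 3): four blocks of latent "frames"
```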


Video Generation Models Are Good Latent Reward Models

Abstract

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning~(PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.


ROOT: Robust Orthogonalized Optimizer for Neural Network Training

Abstract

The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.
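
Momentum orthogonalization in Muon-style optimizers is usually performed with Newton-Schulz iterations that drive a matrix toward its orthogonal polar factor. The minimal cubic variant below illustrates the idea only; ROOT's adaptive, size-specific coefficients and its proximal suppression of outliers are not modeled here:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 15) -> np.ndarray:
    """Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X(X^T X), which drives
    a matrix toward its orthogonal polar factor. Scaling by the Frobenius
    norm keeps the spectral norm <= 1, inside the convergence region."""
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x

rng = np.random.default_rng(0)
update = newton_schulz_orthogonalize(rng.normal(size=(8, 5)))
print(np.round(update.T @ update, 2))   # approximately the 5x5 identity
```

In a Muon-style update the orthogonalized momentum replaces the raw gradient direction; per the abstract, ROOT's additions are iteration coefficients tuned to the matrix size and a proximal step that suppresses outlier-induced noise.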


SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Abstract

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.


Soft Adaptive Policy Optimization

Abstract

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
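
The difference between hard clipping and a smooth, temperature-controlled gate can be seen numerically. The gate form below (a smooth bell-shaped window over the log importance ratio) is an assumption chosen for illustration; SAPO's exact gating function is not given in the abstract:

```python
import numpy as np

def hard_clip_weight(ratio: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """PPO/GRPO-style hard clipping of the token-level importance ratio."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def soft_gate_weight(ratio: np.ndarray, tau: float = 0.2) -> np.ndarray:
    """Assumed soft gate: attenuate the ratio smoothly as |log ratio| grows,
    so off-policy tokens are down-weighted instead of having their gradient
    change abruptly at a clip boundary."""
    gate = 1.0 / (1.0 + (np.log(ratio) / tau) ** 2)   # smooth bell centered at ratio = 1
    return ratio * gate

ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5, 3.0])
print("hard:", np.round(hard_clip_weight(ratios), 3))
print("soft:", np.round(soft_gate_weight(ratios), 3))
```

Tokens near the on-policy ratio of 1 pass through almost unchanged, while strongly off-policy tokens are attenuated continuously rather than truncated.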


UltraFlux: Data-Model Co-Design for High-Quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Abstract

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.


Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.


iMontage: Unified, Versatile, and Highly Dynamic Many-to-Many Image Generation

Abstract

Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.


Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Abstract

Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox


STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows

Abstract

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
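
The Jacobi-iteration idea can be shown on a generic causal recurrence: instead of decoding positions one after another, all positions are updated in parallel from the previous sweep until a fixed point is reached, which recovers the sequential result. The toy below illustrates only that equivalence, not STARFlow-V's video-aware scheme:

```python
import numpy as np

def sequential_decode(f, x0: float, length: int) -> np.ndarray:
    """Standard causal decoding: position t depends on position t-1."""
    xs = [x0]
    for _ in range(length - 1):
        xs.append(f(xs[-1]))
    return np.array(xs)

def jacobi_decode(f, x0: float, length: int, iters: int = 50) -> np.ndarray:
    """Jacobi iteration: update every position in parallel from the previous
    sweep's values; causality is preserved because position t only ever reads
    position t-1, and the fixed point equals the sequential solution."""
    xs = np.full(length, x0)
    for _ in range(iters):
        new = xs.copy()
        new[1:] = f(xs[:-1])        # all positions updated in one parallel sweep
        if np.allclose(new, xs):
            break
        xs = new
    return xs

f = lambda x: 0.5 * x + 1.0         # toy contractive "next-frame" map
print(np.allclose(sequential_decode(f, 0.0, 8), jacobi_decode(f, 0.0, 8)))
```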


GigaWorld-0: World Models as Data Engines to Empower Embodied AI

Abstract

World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.


In-Video Instruction: Visual Signals as Generative Control

Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
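
Mechanically, an in-video instruction is just guidance rendered into the conditioning frame before it is handed to the image-to-video model. A minimal Pillow sketch is shown below; the prompt wording, colors, and coordinates are arbitrary examples, not taken from the paper:

```python
from PIL import Image, ImageDraw

def add_in_video_instruction(frame: Image.Image) -> Image.Image:
    """Overlay a textual instruction and an arrow onto the first frame,
    so the downstream image-to-video model can read the guidance visually."""
    canvas = frame.copy()
    draw = ImageDraw.Draw(canvas)
    # Instruction text placed near the subject it refers to.
    draw.text((20, 20), "walk to the red door ->", fill="yellow")
    # Arrow from the subject's position toward the intended target.
    draw.line([(60, 200), (220, 120)], fill="yellow", width=4)
    draw.polygon([(220, 120), (204, 116), (210, 132)], fill="yellow")  # arrow head
    return canvas

first_frame = Image.new("RGB", (320, 240), "gray")   # stand-in for a real frame
add_in_video_instruction(first_frame).save("instructed_frame.png")
```

The annotated frame (not a text prompt) then carries the per-object guidance into the image-to-video generator.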

