[Paper Digest] 2025, Week 24 (Jun 08-14) (Robotics / Embodied AI / LLM)
Contents
- Reinforcement Pre-Training
- Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
- Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
- MiniCPM4: Ultra-Efficient LLMs on End Devices
- PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
- Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
- Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models
- Magistral
- Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
- ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
- SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
- SpatialLM: Training Large Language Models for Structured Indoor Modeling
- Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- Text-Aware Image Restoration with Diffusion Models
- OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
- AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
- Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
- Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning
- PlayerOne: Egocentric World Simulator
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
- MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
- Discrete Audio Tokens: More Than a Survey!
- Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Reinforcement Pre-Training
- Title: Reinforcement Pre-Training
- Authors: Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
- Date: 2025-06-09
- ArXiv page: https://arxiv.org/abs/2506.08007
- Paper link: https://arxiv.org/pdf/2506.08007
Abstract
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
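A minimal sketch of the reward scheme described above, under simplifying assumptions: a tiny mean-pooled toy predictor stands in for the LLM, and a direct next-token guess stands in for the paper's chain-of-thought rollout. The verifiable reward is simply whether the sampled token matches the ground truth, optimized with REINFORCE; this is not the authors' implementation.

```python
# Toy sketch of Reinforcement Pre-Training's reward scheme: sample a next-token
# guess, give reward 1 if it matches the ground-truth token, apply REINFORCE.
# The tiny mean-pooled predictor and random corpus are illustrative assumptions.
import torch

torch.manual_seed(0)
vocab, ctx_len, dim = 50, 8, 32

class TinyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ctx):                 # ctx: (B, ctx_len)
        h = self.emb(ctx).mean(dim=1)       # mean-pool the context
        return self.head(h)                 # logits over the next token

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
corpus = torch.randint(0, vocab, (256, ctx_len + 1))   # fake pre-training text

for step in range(100):
    batch = corpus[torch.randint(0, len(corpus), (32,))]
    ctx, target = batch[:, :-1], batch[:, -1]
    dist = torch.distributions.Categorical(logits=model(ctx))
    guess = dist.sample()                    # one rollout per context
    reward = (guess == target).float()       # verifiable 0/1 reward
    baseline = reward.mean()                 # simple variance-reduction baseline
    pg_loss = -((reward - baseline) * dist.log_prob(guess)).mean()
    opt.zero_grad(); pg_loss.backward(); opt.step()
```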
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
- Title: Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
- Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
- Date: 2025-05-27
- ArXiv page: https://arxiv.org/abs/2505.21115
- Paper link: https://arxiv.org/pdf/2505.21115
- Project page: https://s-nlp.github.io/Evergreen-classification/
- GitHub repo: https://github.com/s-nlp/Evergreen-classification
Abstract
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions – whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
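A hedged sketch of what an evergreen-vs-mutable question classifier could look like: questions are embedded with a multilingual E5 encoder and a linear head is fit on top. The model name, toy labels, and logistic-regression head are assumptions for illustration; the released EG-E5 recipe may differ.

```python
# Minimal sketch of an "evergreen vs. mutable" question classifier in the spirit
# of EG-E5: embed questions with a multilingual E5 encoder and fit a linear head.
# The model name, toy labels, and logistic-regression head are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

questions = [
    "What is the boiling point of water at sea level?",   # evergreen
    "Who is the current president of France?",            # mutable
    "How many continents are there?",                     # evergreen
    "What is the latest iPhone model?",                    # mutable
]
labels = [1, 0, 1, 0]  # 1 = evergreen, 0 = mutable

encoder = SentenceTransformer("intfloat/multilingual-e5-base")
# E5-style models expect a "query: " prefix for query-like inputs.
emb = encoder.encode([f"query: {q}" for q in questions])

clf = LogisticRegression(max_iter=1000).fit(emb, labels)
print(clf.predict(encoder.encode(["query: When does the next Olympics start?"])))
```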
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- Title: Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- Authors: Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets
- Date: 2025-06-05
- ArXiv page: https://arxiv.org/abs/2506.06395
- Paper link: https://arxiv.org/pdf/2506.06395
Abstract
Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model’s own confidence as reward signals-eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on Olympiadbench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.
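One plausible instantiation of "confidence as reward" on a toy bandit-style model: sample several answers per question, treat the model's own probability of each sampled answer as the reward, and take a policy-gradient step. The exact reward formulation in RLSC may differ; everything below is illustrative.

```python
# Toy illustration of the "confidence as reward" idea behind RLSC: sample an
# answer, use the model's own probability for that answer as the reward, and
# take a policy-gradient step with no labels or external reward model.
# The bandit-style model and 16-sample setup are illustrative assumptions.
import torch

torch.manual_seed(0)
n_questions, n_answers, samples_per_q = 8, 5, 16
logits = torch.nn.Parameter(torch.zeros(n_questions, n_answers))
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(20):
    dist = torch.distributions.Categorical(logits=logits)
    draws = dist.sample((samples_per_q,))                  # (S, Q) sampled answers
    logp = dist.log_prob(draws)                            # log pi(answer | question)
    with torch.no_grad():
        reward = logp.exp()                                # self-confidence as reward
        advantage = reward - reward.mean(dim=0, keepdim=True)
    loss = -(advantage * logp).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=-1).max(dim=-1).values)    # distributions sharpen
```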
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
- Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
- Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
- Date: 2025-06-08
- ArXiv page: https://arxiv.org/abs/2506.07044
- Paper link: https://arxiv.org/pdf/2506.07044
- Project page: https://alibaba-damo-academy.github.io/lingshu/
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu’s medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks …
Seedance 1.0: Exploring the Boundaries of Video Generation Models
- Title: Seedance 1.0: Exploring the Boundaries of Video Generation Models
- Authors: Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiaozheng Zheng, Peihao Zhu, Jiaxin Zou, Feilong Zuo
- Date: 2025-06-10
- ArXiv page: https://arxiv.org/abs/2506.09113
- Paper link: https://arxiv.org/pdf/2506.09113
- Project page: https://seed.bytedance.com/seedance
Abstract
Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational model still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with proposed training paradigm, which allows for natively supporting multi-shot generation and jointly learning of both text-to-video and image-to-video tasks. (iii) carefully-optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution only with 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation.
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
- Title: ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
- Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu
- Date: 2025-06-11
- ArXiv page: https://arxiv.org/abs/2506.09513
- GitHub repo: https://github.com/YuSun-Work/ReasonMed
Abstract
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
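A schematic of the verify-then-refine loop the abstract describes, with all LLM calls stubbed out as placeholder callables; the prompts, agents, and acceptance criteria of the actual pipeline are not reproduced here.

```python
# Schematic of a verify-then-refine loop in the spirit of ReasonMed's pipeline:
# generate candidate chains-of-thought, have a verifier flag weak steps, and let
# an "Error Refiner" rewrite only the flagged steps. All three callables below
# are placeholders for LLM calls, not the paper's actual agents or prompts.
from typing import Callable

def build_reasoning_example(
    question: str,
    answer: str,
    generate: Callable[[str], list[str]],        # returns candidate CoT paths
    verify: Callable[[str, str], list[int]],     # returns indices of flagged steps
    refine: Callable[[str, list[int]], str],     # rewrites the flagged steps
    max_rounds: int = 3,
) -> str | None:
    """Return a verified chain-of-thought for (question, answer), or None."""
    for path in generate(question):
        for _ in range(max_rounds):
            bad_steps = verify(path, answer)
            if not bad_steps:                    # verifier accepts the path
                # Pair the detailed CoT with a concise answer summary, the
                # combination the paper found most effective for fine-tuning.
                return f"{path}\n\nFinal answer: {answer}"
            path = refine(path, bad_steps)       # Error-Refiner pass
    return None                                  # discard unverifiable paths
```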
MiniCPM4: Ultra-Efficient LLMs on End Devices
- Title: MiniCPM4: Ultra-Efficient LLMs on End Devices
- Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun
- Date: 2025-06-09
- ArXiv page: https://arxiv.org/abs/2506.07900
- Paper link: https://arxiv.org/pdf/2506.07900
- Project page: https://huggingface.co/collections/openbmb/minicpm4-6841ab29d180257e940baa9b
- GitHub repo: https://github.com/openbmb/minicpm
Abstract
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and data-efficient tenary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.
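A rough sketch of trainable block-sparse attention in the spirit of InfLLM v2: keys are mean-pooled per block, each query keeps only its top-k blocks, and attention is computed inside that subset. Block size, top-k, and the scoring rule are assumptions, not the released kernel.

```python
# Rough sketch of block-sparse attention: score KV blocks with mean-pooled keys,
# keep the top-k blocks per query, attend only inside that subset.
# Block size, top-k, and the scoring rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, topk=4):
    # q: (T, d) queries; k, v: (S, d) keys/values, S divisible by `block`.
    S, d = k.shape
    n_blocks = S // block
    block_keys = k.view(n_blocks, block, d).mean(dim=1)           # (n_blocks, d)
    block_scores = q @ block_keys.T                                # (T, n_blocks)
    keep = block_scores.topk(min(topk, n_blocks), dim=-1).indices  # (T, topk)

    mask = torch.zeros(q.shape[0], n_blocks)
    mask.scatter_(1, keep, 1.0)                                    # chosen blocks
    token_mask = mask.repeat_interleave(block, dim=1).bool()       # (T, S)

    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(8, 32); k = torch.randn(512, 32); v = torch.randn(512, 32)
print(block_sparse_attention(q, k, v).shape)   # torch.Size([8, 32])
```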
PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
- Title: PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
- Authors: Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, Katerina Fragkiadaki
- Date: 2025-06-05
- ArXiv page: https://arxiv.org/abs/2506.05573
- Paper link: https://arxiv.org/pdf/2506.05573
- Project page: https://wgsxm.github.io/projects/partcrafter/
- GitHub repo: https://github.com/wgsxm/PartCrafter
Abstract
We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
- Title: Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
- Authors: Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
- Date: 2025-06-06
- ArXiv page: https://arxiv.org/abs/2506.06444
- Paper link: https://arxiv.org/pdf/2506.06444
- Project page: https://q-rz.github.io/p/saffron
- GitHub repo: https://github.com/q-rz/saffron
Abstract
Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods’ susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration–efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key–value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron , and our project homepage is at https://q-rz.github.io/p/saffron .
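A minimal prefix-trie cache illustrating the key-value sharing idea used during tree search: sequences that share a prefix reuse cached per-prefix state instead of re-evaluating it. The evaluate_step stub stands in for the expensive reward-model computation and is purely illustrative.

```python
# Minimal prefix-trie cache in the spirit of SAFFRON's cache sharing across
# sequences during tree search: shared prefixes are evaluated only once.
# The `evaluate_step` stub stands in for the real (expensive) computation.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    state: float = 0.0                          # cached value for this prefix
    children: dict[int, "TrieNode"] = field(default_factory=dict)

class PrefixCache:
    def __init__(self, evaluate_step):
        self.root = TrieNode()
        self.evaluate_step = evaluate_step      # (prev_state, token) -> new state
        self.misses = 0

    def score(self, tokens: list[int]) -> float:
        node = self.root
        for tok in tokens:
            if tok not in node.children:        # only unseen suffixes are computed
                self.misses += 1
                node.children[tok] = TrieNode(self.evaluate_step(node.state, tok))
            node = node.children[tok]
        return node.state

cache = PrefixCache(lambda prev, tok: prev + (tok % 7) * 0.1)   # toy "reward"
print(cache.score([1, 2, 3, 4]), cache.score([1, 2, 3, 5]))     # shared prefix
print("evaluations:", cache.misses)                             # 5 instead of 8
```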
Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models
- Title: Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models
- Authors: Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina
- Date: 2025-06-07
- ArXiv page: https://arxiv.org/abs/2506.06751
- Paper link: https://arxiv.org/pdf/2506.06751
- Project page: https://airi-institute.github.io/geopolitical_llm_bias
- GitHub repo: https://github.com/AIRI-Institute/geopolitical_llm_bias
Abstract
This paper evaluates geopolitical biases in LLMs with respect to various countries though an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models’ sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.
Magistral
- Title: Magistral
- Authors: Mistral-AI, Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Andy Ehrenberg, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jean-Hadrien Chabran, Jean-Malo Delignon, Joachim Studnia, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Kush Jain, Lingxiao Zhao, Louis Martin, Luyu Gao, Lélio Renard Lavaud, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Maximilian Augustin, Mickaël Seznec, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Romain Sauvestre, Rémi Delacourt, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yunhao Tang
- Date: 2025-06-12
- ArXiv page: https://arxiv.org/abs/2506.10910
- Paper link: https://arxiv.org/pdf/2506.10910
Abstract
We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
- Title: Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
- Authors: Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen
- Date: 2025-06-11
- ArXiv page: https://arxiv.org/abs/2506.09991
- Paper link: https://arxiv.org/pdf/2506.09991
- Project page: https://multiverse4fm.github.io/
Abstract
Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.
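A schematic of the Map, Process, and Reduce stages described above, with the model call stubbed out and a thread pool standing in for the model-triggered parallel decoding; it shows the control flow only, not the trained Multiverse model.

```python
# Schematic of the Map -> Process -> Reduce generation flow that Multiverse
# internalizes: decompose a task into subtasks, run them in parallel, then merge
# the partial results. The LLM call is a stub; the thread pool stands in for the
# model-triggered parallel decoding in the paper.
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:                  # placeholder for a model call
    return f"[answer to: {prompt}]"

def map_stage(task: str) -> list[str]:        # adaptive task decomposition
    return [f"{task} -- subquestion {i}" for i in range(1, 4)]

def process_stage(subtasks: list[str]) -> list[str]:       # parallel subtask execution
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(llm, subtasks))

def reduce_stage(task: str, partials: list[str]) -> str:   # lossless result synthesis
    merged = "\n".join(partials)
    return llm(f"Combine these partial results for '{task}':\n{merged}")

task = "Plan an experiment comparing two optimizers"
print(reduce_stage(task, process_stage(map_stage(task))))
```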
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
- Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
- Authors: Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
- Date: 2025-06-11
- ArXiv page: https://arxiv.org/abs/2506.09790
- Paper link: https://arxiv.org/pdf/2506.09790
- Project page: https://github.com/AIDC-AI/ComfyUI-Copilot
- GitHub repo: https://github.com/AIDC-AI/ComfyUI-Copilot
Abstract
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
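A small sketch of two of the metrics mentioned above, format validity and node-level F1, over a simplified workflow schema; ComfyUI's real graph format and the paper's exact scoring details are not reproduced.

```python
# Sketch of two workflow metrics: JSON format validity and node-level F1 between
# a generated workflow and a reference. The simplified schema
# {"nodes": [{"type": ...}]} is an assumption, not ComfyUI's full graph format.
import json
from collections import Counter

def format_valid(text: str) -> bool:
    try:
        wf = json.loads(text)
        return isinstance(wf, dict) and isinstance(wf.get("nodes"), list)
    except json.JSONDecodeError:
        return False

def node_f1(pred: dict, ref: dict) -> float:
    pred_nodes = Counter(n["type"] for n in pred["nodes"])
    ref_nodes = Counter(n["type"] for n in ref["nodes"])
    overlap = sum((pred_nodes & ref_nodes).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_nodes.values())
    recall = overlap / sum(ref_nodes.values())
    return 2 * precision * recall / (precision + recall)

ref = {"nodes": [{"type": "CheckpointLoader"}, {"type": "KSampler"}, {"type": "VAEDecode"}]}
gen = '{"nodes": [{"type": "CheckpointLoader"}, {"type": "KSampler"}]}'
print(format_valid(gen), node_f1(json.loads(gen), ref))   # True 0.8
```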
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
- Title: SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
- Authors: Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
- Date: 2025-06-12
- ArXiv page: https://arxiv.org/abs/2506.10954
- GitHub repo: https://github.com/DeepSoftwareAnalytics/swe-factory
Abstract
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at 0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of 0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
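A minimal sketch of exit-code-based grading and fail2pass validation as the abstract describes them: run the project's test command before and after applying the candidate patch and grade purely on the process exit code. Paths, commands, and the patch-applying callable are placeholders.

```python
# Minimal sketch of exit-code-based grading and fail2pass checking: grade a task
# instance purely on whether the test command exits with code 0 before and after
# the fix. Command strings and paths are placeholders.
import subprocess

def run_tests(repo_dir: str, test_cmd: list[str]) -> bool:
    """Return True iff the test command exits with code 0."""
    result = subprocess.run(test_cmd, cwd=repo_dir,
                            capture_output=True, timeout=1800)
    return result.returncode == 0

def fail2pass(repo_dir: str, test_cmd: list[str], apply_patch) -> bool:
    """Valid task instance: tests fail before the fix and pass after it."""
    failed_before = not run_tests(repo_dir, test_cmd)
    apply_patch(repo_dir)                       # e.g. `git apply fix.patch`
    passed_after = run_tests(repo_dir, test_cmd)
    return failed_before and passed_after

# Example usage (hypothetical repository and pytest-based test command):
# ok = fail2pass("/tmp/project", ["python", "-m", "pytest", "-x"],
#                lambda d: subprocess.run(["git", "apply", "fix.patch"], cwd=d, check=True))
```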
SpatialLM: Training Large Language Models for Structured Indoor Modeling
- Title: SpatialLM: Training Large Language Models for Structured Indoor Modeling
- Authors: Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou
- Date: 2025-06-09
- ArXiv page: https://arxiv.org/abs/2506.07491
- Paper link: https://arxiv.org/pdf/2506.07491
- Project page: https://manycore-research.github.io/SpatialLM
Abstract
SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- Title: Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- Authors: Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
- Date: 2025-06-11
- ArXiv page: https://arxiv.org/abs/2506.09350
- Paper link: https://arxiv.org/pdf/2506.09350
- Project page: https://seaweed-apt.com/2
Abstract
Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2
Text-Aware Image Restoration with Diffusion Models
- Title: Text-Aware Image Restoration with Diffusion Models
- Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
- Date: 2025-06-11
- ArXiv page: https://arxiv.org/abs/2506.09993
- Paper link: https://arxiv.org/pdf/2506.09993
- Project page: https://cvlab-kaist.github.io/TAIR/
- GitHub repo: https://github.com/cvlab-kaist/TAIR
Abstract
Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
- Title: OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
- Authors: Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, Hai-Bao Chen
- Date: 2025-06-09
- ArXiv page: https://arxiv.org/abs/2506.07977
- Paper link: https://arxiv.org/pdf/2506.07977
- Project page: https://oneig-bench.github.io/
- GitHub repo: https://github.com/OneIG-Bench/OneIG-Benchmark
Abstract
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, lacking comprehensive evaluations, for example, the evaluation on reasoning, text rendering and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show promising results on the image generation problems requiring strong reasoning ability, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Specifically, OneIG-Bench enables flexible evaluation by allowing users to focus on a particular evaluation subset. Instead of generating images for the entire set of prompts, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation accordingly. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
- Title: AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
- Authors: Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang
- Date: 2025-06-12
- ArXiv page: https://arxiv.org/abs/2506.10540
- Paper link: https://arxiv.org/pdf/2506.10540
- Project page: https://animaker-dev.github.io/
- GitHub repo: https://github.com/HITsz-TMG/Anim-Director
Abstract
Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
- Title: Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
- Authors: Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay
- Date: 2025-06-05
- ArXiv page: https://arxiv.org/abs/2506.05629
- Paper link: https://arxiv.org/pdf/2506.05629
Abstract
The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and show the improved zero shot domain transfer capability.
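A rough sketch of input-dependent soft prompting via self-attention: a small set of learnable queries attends over the input token embeddings to produce soft-prompt vectors that are prepended to the input. Dimensions and the single-head attention are illustrative assumptions, not the paper's exact architecture.

```python
# Rough sketch of input-dependent soft prompting in the spirit of ID-SPAM:
# learnable queries attend over the (frozen) input token embeddings to produce
# soft-prompt vectors, which are prepended to the input. Only the prompt module
# is trainable; dimensions and single-head attention are assumptions.
import torch

class InputDependentSoftPrompt(torch.nn.Module):
    def __init__(self, d_model=768, n_prompt=8):
        super().__init__()
        self.queries = torch.nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
        self.key_proj = torch.nn.Linear(d_model, d_model)
        self.value_proj = torch.nn.Linear(d_model, d_model)

    def forward(self, token_embeds):            # (B, T, d) frozen LLM embeddings
        k = self.key_proj(token_embeds)
        v = self.value_proj(token_embeds)
        attn = torch.softmax(self.queries @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        prompts = attn @ v                      # (B, n_prompt, d), input-dependent
        return torch.cat([prompts, token_embeds], dim=1)   # prepend soft prompts

x = torch.randn(2, 16, 768)                     # stand-in for embedded input tokens
print(InputDependentSoftPrompt()(x).shape)      # torch.Size([2, 24, 768])
```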
Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning
- Title: Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning
- Authors: Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
- Date: 2025-06-05
- ArXiv page: https://arxiv.org/abs/2506.04755
- Paper link: https://arxiv.org/pdf/2506.04755
- Project page: https://github.com/Leo-ssl/RAP
- GitHub repo: https://github.com/Leo-ssl/RAP
Abstract
While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample’s potential to stimulate genuine multi-modal reasoning by two complementary estimators: 1) Causal Discrepancy Estimator (CDE) based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
PlayerOne: Egocentric World Simulator
- Title: PlayerOne: Egocentric World Simulator
- Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao
- Date: 2025-06-11
- ArXiv page: https://arxiv.org/abs/2506.09995
- Paper link: https://arxiv.org/pdf/2506.09995
- Project page: https://playerone-hku.github.io/
Abstract
We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and worldconsistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
- Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
- Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
- Date: 2025-06-10
- ArXiv page: https://arxiv.org/abs/2506.09040
- Paper link: https://arxiv.org/pdf/2506.09040
Abstract
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM bacbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
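A toy sketch of the joint training signal: the usual next-token loss on text plus an autoregressive cross-entropy loss on discrete semantic visual tokens (rather than raw pixels). The tiny GRU backbone and random token ids are placeholders for a real LVLM and semantic tokenizer.

```python
# Toy sketch of the ASVR-style training signal: next-token loss on text plus an
# autoregressive loss on *discrete semantic* visual tokens. The tiny GRU
# backbone and random token ids are placeholders, not a real LVLM or tokenizer.
import torch
import torch.nn.functional as F

text_vocab, vis_vocab, d = 1000, 512, 64
backbone = torch.nn.GRU(d, d, batch_first=True)           # stand-in for the LLM
embed = torch.nn.Embedding(text_vocab + vis_vocab, d)
text_head = torch.nn.Linear(d, text_vocab)
vis_head = torch.nn.Linear(d, vis_vocab)                   # semantic-token head

vis_ids = torch.randint(0, vis_vocab, (2, 16))             # semantic visual tokens
txt_ids = torch.randint(0, text_vocab, (2, 12))            # caption/answer tokens
seq = torch.cat([vis_ids + text_vocab, txt_ids], dim=1)    # image tokens first

hidden, _ = backbone(embed(seq))
# Visual branch: predict visual token t from positions < t (shift by one).
vis_logits = vis_head(hidden[:, : vis_ids.shape[1] - 1])
loss_vis = F.cross_entropy(vis_logits.reshape(-1, vis_vocab), vis_ids[:, 1:].reshape(-1))
# Text branch: standard next-token prediction over the text span.
txt_logits = text_head(hidden[:, vis_ids.shape[1] - 1 : -1])
loss_txt = F.cross_entropy(txt_logits.reshape(-1, text_vocab), txt_ids.reshape(-1))
loss = loss_txt + loss_vis                                  # joint autoregressive loss
print(float(loss))
```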
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
- Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
- Authors: Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
- Date: 2025-06-05
- ArXiv page: https://arxiv.org/abs/2506.05523
- Paper link: https://arxiv.org/pdf/2506.05523
- Project page: https://morse-500.github.io/
- GitHub repo: https://github.com/morse-benchmark/morse-500
Abstract
Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills – including abstract, physical, planning, spatial, and temporal capabilities – required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics – enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems – including various Gemini 2.5 Pro and OpenAI o3 which represent the strongest available at the time, alongside strong open-source models – reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
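A small example of the script-driven clip generation style the benchmark relies on: a deterministic script renders a short video whose ground-truth answer is known by construction. It uses matplotlib's animation API, one of the toolchains named above; the question template and difficulty knobs are assumptions.

```python
# Sketch of programmatic, script-controlled clip generation: a deterministic
# script renders a short video whose ground-truth answer is known by
# construction (here: which dot is moving). Uses matplotlib's animation API;
# the question template and difficulty knobs are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(seed=7)           # deterministic => verifiable answer
n_dots, moving_idx = 6, int(rng.integers(6))
positions = rng.uniform(0.1, 0.9, size=(n_dots, 2))
velocity = np.array([0.004, 0.002])

fig, ax = plt.subplots(figsize=(4, 4))
scat = ax.scatter(positions[:, 0], positions[:, 1], s=80)
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_xticks([]); ax.set_yticks([])

def update(frame):
    pts = positions.copy()
    pts[moving_idx] += velocity * frame       # only one dot drifts over time
    scat.set_offsets(pts)
    return (scat,)

anim = FuncAnimation(fig, update, frames=120, interval=1000 / 24)
anim.save("morse_style_clip.mp4", writer="ffmpeg", fps=24)   # requires ffmpeg
print({"question": "Which dot is moving?", "answer": moving_idx})
```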
Discrete Audio Tokens: More Than a Survey!
- Title: Discrete Audio Tokens: More Than a Survey!
- Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
- Date: 2025-06-12
- ArXiv page: https://arxiv.org/abs/2506.10274
- Paper link: https://arxiv.org/pdf/2506.10274
- Project page: https://poonehmousavi.github.io/dates-website/
Abstract
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks.They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
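A minimal residual vector quantization (RVQ) sketch, one of the codec-style tokenizer families the survey covers: each stage quantizes the residual left by the previous stage, yielding one discrete token per stage per frame. Codebooks here are random and untrained, purely to show the mechanism.

```python
# Minimal residual vector quantization (RVQ) sketch: each stage quantizes the
# residual left by the previous stage, giving one token per stage per frame.
# Codebooks are random (untrained) and purely illustrative.
import numpy as np

def rvq_encode(frames, codebooks):
    """frames: (T, d); codebooks: list of (K, d). Returns (T, n_stages) token ids."""
    residual = frames.copy()
    tokens = []
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)   # (T, K)
        ids = d2.argmin(axis=1)
        tokens.append(ids)
        residual = residual - cb[ids]            # pass the remainder to next stage
    return np.stack(tokens, axis=1)

def rvq_decode(tokens, codebooks):
    return sum(cb[tokens[:, i]] for i, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 16)).astype(np.float32)        # fake audio features
codebooks = [rng.normal(size=(256, 16)).astype(np.float32) for _ in range(4)]
codes = rvq_encode(frames, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes.shape, float(((frames - recon) ** 2).mean()))     # (100, 4), recon error
```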
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
- Title: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
- Authors: Or Tal, Felix Kreuk, Yossi Adi
- Date: 2025-06-10
- ArXiv page: https://arxiv.org/abs/2506.08570
- Paper link: https://arxiv.org/pdf/2506.08570
Abstract
Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio sampled examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
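The two objectives being compared can be written side by side on toy data: auto-regressive cross-entropy over discrete tokens, and conditional flow matching that regresses the velocity x1 - x0 along the straight path x_t = (1 - t) * x0 + t * x1. Both "models" below are tiny stand-ins, not music models.

```python
# Side-by-side sketch of the two training objectives compared above, on toy data:
# (a) auto-regressive cross-entropy over discrete tokens, and (b) conditional
# flow matching, which regresses the velocity (x1 - x0) along the straight path
# x_t = (1 - t) * x0 + t * x1. Both "models" are tiny stand-ins.
import torch
import torch.nn.functional as F

# (a) Auto-regressive objective over a discrete token sequence.
vocab, seq = 64, 32
tokens = torch.randint(0, vocab, (8, seq))
ar_logits = torch.randn(8, seq - 1, vocab, requires_grad=True)   # model output stub
ar_loss = F.cross_entropy(ar_logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

# (b) Conditional flow-matching objective over continuous latents.
dim = 128
x1 = torch.randn(8, dim)                      # data latents (e.g. audio VAE frames)
x0 = torch.randn(8, dim)                      # noise sample
t = torch.rand(8, 1)                          # random time in [0, 1]
x_t = (1 - t) * x0 + t * x1                   # linear interpolation path
velocity_net = torch.nn.Sequential(           # toy velocity field v_theta(x_t, t)
    torch.nn.Linear(dim + 1, 256), torch.nn.SiLU(), torch.nn.Linear(256, dim))
pred_v = velocity_net(torch.cat([x_t, t], dim=1))
fm_loss = F.mse_loss(pred_v, x1 - x0)         # regress the target velocity

print(float(ar_loss), float(fm_loss))
```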