[Paper Digest] 2025 Week 10 (Robotics/Embodied AI/LLM)
Contents
- START: Self-taught Reasoner with Tools
- Token-Efficient Long Video Understanding for Multimodal LLMs
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
- Visual-RFT: Visual Reinforcement Fine-Tuning
- LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
- Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
- Predictive Data Selection: The Data That Predicts Is the Data That Teaches
- Chain of Draft: Thinking Faster by Writing Less
- HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
- EgoLife: Towards Egocentric Life Assistant
- Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
- Process-based Self-Rewarding Language Models
- Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
- Multi-Turn Iterative Code Generation via Single-Step Rewards
- From Hours to Minutes: Lossless Acceleration of Ultra-Long Sequence Generation up to 100K Tokens
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents
- DiffRhythm: Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
- SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking
- OneRec: Unifying Retrieval and Ranking with Generative Recommendation and Iterative Preference Alignment
- MPO: Boosting LLM Agents with Meta Plan Optimization
- LLM as a Broken Telephone: Iterative Generation Distorts Information
- Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
- How Far Can We Go with ImageNet for Text-to-Image Generation?
- LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
- GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
- LADDER: Self-Improving LLMs Through Recursive Problem Decomposition
- Wikipedia in the Era of LLMs: Evolution and Risks
- SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
START: Self-taught Reasoner with Tools
- Title: START: Self-taught Reasoner with Tools
- Authors: Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
- Date: 2025-03-06
- arXiv page: https://arxiv.org/abs/2503.04625
- Paper: https://arxiv.org/pdf/2503.04625
Abstract
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
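The Hint-infer idea is mechanically simple: when the reasoner is about to finish thinking, splice in a hand-written hint that nudges it toward tool use, then let it keep decoding. A toy sketch (interfaces and the stop token are hypothetical; the paper's actual insertion points and hint library differ):

```python
# Toy sketch of Hint-infer. `generate_step` stands in for LRM decoding;
# the hint text is one of the examples quoted in the abstract.

HINT = "Wait, maybe using Python here is a good idea."

def hint_infer(generate_step, prompt, max_rounds=2, stop_token="</think>"):
    """generate_step(text) -> a continuation that ends at stop_token."""
    text = prompt
    for _ in range(max_rounds):
        text += generate_step(text)
        if text.endswith(stop_token):
            # Strip the stop token and splice in the hint, forcing the model
            # to keep reasoning; repeating this is also a simple sequential
            # test-time scaling method.
            text = text[: -len(stop_token)] + " " + HINT + " "
    return text

# A stub "model" that always tries to stop immediately.
def stub_model(text):
    return "...some reasoning...</think>"

out = hint_infer(stub_model, "Q: 2+2? ", max_rounds=2)
```

Each round the would-be stop is replaced by the hint, so the trace grows with additional tool-seeking reasoning instead of terminating.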
Token-Efficient Long Video Understanding for Multimodal LLMs
- Title: Token-Efficient Long Video Understanding for Multimodal LLMs
- Authors: Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
- Date: 2025-03-06
- arXiv page: https://arxiv.org/abs/2503.04130
- Paper: https://arxiv.org/pdf/2503.04130
- Project page: https://research.nvidia.com/labs/lpr/storm/
Abstract
Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for fixed numbers of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm
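The token-reduction half of the method is easy to picture: after the temporal encoder has mixed information across frames, whole groups of consecutive frames can be collapsed. A minimal sketch, assuming plain average pooling stands in for the paper's training-based temporal pooling (the real model also runs a Mamba temporal encoder first, omitted here):

```python
# Temporal token pooling: (T, N, D) frame tokens -> (T//stride, N, D),
# shrinking the LLM's input by `stride`x along the time axis.
import numpy as np

def temporal_pool(frame_tokens, stride=4):
    """Average each group of `stride` consecutive frames' tokens."""
    T, N, D = frame_tokens.shape
    T_out = T // stride
    x = frame_tokens[: T_out * stride].reshape(T_out, stride, N, D)
    return x.mean(axis=1)

tokens = np.random.rand(32, 196, 64)   # 32 frames, 196 patch tokens, dim 64
pooled = temporal_pool(tokens, stride=4)
```

With stride 4, the LLM sees 8 frames' worth of tokens instead of 32, which is where the reported compute and latency savings come from.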
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
- Title: Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
- Authors: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou
- Date: 2025-03-03
- arXiv page: https://arxiv.org/abs/2503.01743
- Paper: https://arxiv.org/pdf/2503.01743
- Project page: https://huggingface.co/microsoft/Phi-4-multimodal-instruct
Abstract
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
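The Mixture-of-LoRAs idea can be sketched as a frozen base projection plus per-modality low-rank adapters selected by a router, so adding a modality never perturbs the text path. The shapes, routing rule, and class below are illustrative assumptions, not Phi-4-Multimodal's actual configuration:

```python
# Hedged sketch of a Mixture-of-LoRAs linear layer: base weight W is frozen;
# each non-text modality gets its own LoRA pair (B, A) and the router simply
# dispatches on the modality tag.
import numpy as np

class MoLoRALinear:
    def __init__(self, d_in, d_out, rank=4, modalities=("vision", "speech")):
        rng = np.random.default_rng(0)
        self.W = rng.normal(size=(d_out, d_in))          # frozen base weight
        self.adapters = {
            m: (rng.normal(size=(d_out, rank)) * 0.01,   # B: up-projection
                rng.normal(size=(rank, d_in)) * 0.01)    # A: down-projection
            for m in modalities
        }

    def __call__(self, x, modality=None):
        y = self.W @ x
        if modality in self.adapters:                    # text uses base only
            B, A = self.adapters[modality]
            y = y + B @ (A @ x)                          # low-rank delta
        return y

layer = MoLoRALinear(8, 8)
x = np.ones(8)
y_text = layer(x)               # pure base path: language ability untouched
y_vision = layer(x, "vision")   # base + vision LoRA delta
```

Because the adapters are additive and modality-gated, combinations like (vision + speech) can coexist in one model without interfering with each other's weights.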
Visual-RFT: Visual Reinforcement Fine-Tuning
- Title: Visual-RFT: Visual Reinforcement Fine-Tuning
- Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
- Date: 2025-03-03
- arXiv page: https://arxiv.org/abs/2503.01785
- Paper: https://arxiv.org/pdf/2503.01785
- Project page: https://github.com/Liuziyu77/Visual-RFT
- GitHub repo: https://github.com/Liuziyu77/Visual-RFT
Abstract
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.9 on COCO’s two-shot setting and 15.4 on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
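The IoU reward mentioned for detection is a verifiable signal that needs no learned reward model. A minimal sketch (boxes as `[x1, y1, x2, y2]`; the paper's full reward also accounts for confidence and output formatting, omitted here):

```python
# Verifiable IoU reward for object detection: average, over ground-truth
# boxes, of the best-matching predicted box's IoU.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def iou_reward(pred_boxes, gt_boxes):
    if not gt_boxes:
        return 0.0
    return sum(max((iou(g, p) for p in pred_boxes), default=0.0)
               for g in gt_boxes) / len(gt_boxes)

# One perfect detection, one miss -> reward 0.5.
r = iou_reward([[0, 0, 10, 10]], [[0, 0, 10, 10], [20, 20, 30, 30]])
```

In GRPO-style training, several sampled responses per image would each get such a scalar reward, and the policy update favors the higher-scoring ones.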
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
- Title: LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
- Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
- Date: 2025-03-06
- arXiv page: https://arxiv.org/abs/2503.04724
- Paper: https://arxiv.org/pdf/2503.04724
- Project page: https://mbzuai-oryx.github.io/LLMVoX/
- GitHub repo: https://github.com/mbzuai-oryx/LLMVoX
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page is available at https://mbzuai-oryx.github.io/LLMVoX .
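The decoupling via a token-streaming queue can be pictured as a producer/consumer pair: the LLM pushes text chunks as it decodes, and the TTS model pulls them independently. A toy sketch (punctuation-based chunking and the sentinel protocol are simplifying assumptions; LLMVoX's actual multi-queue system streams at finer granularity):

```python
# Decoupling LLM decoding from speech synthesis with a queue, so TTS can
# start speaking before the full response is generated.
from queue import Queue

def llm_producer(tokens, q):
    buf = []
    for t in tokens:
        buf.append(t)
        if t.endswith((".", ",", "?", "!")):   # flush at phrase boundaries
            q.put(" ".join(buf))
            buf = []
    if buf:
        q.put(" ".join(buf))
    q.put(None)                                # end-of-stream sentinel

def tts_consumer(q, synthesize):
    audio = []
    while (chunk := q.get()) is not None:
        audio.append(synthesize(chunk))        # TTS runs as text arrives
    return audio

q = Queue()
llm_producer("Hello there , how are you ?".split(), q)
audio = tts_consumer(q, synthesize=lambda s: f"<audio:{s}>")
```

In a real deployment the producer and consumer would run in separate threads or processes, which is what allows the base LLM to keep decoding while earlier chunks are already being spoken.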
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
- Title: Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
- Authors: Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang
- Date: 2025-03-02
- arXiv page: https://arxiv.org/abs/2503.00865
- Paper: https://arxiv.org/pdf/2503.00865
- Project page: https://babel-llm.github.io/babel-llm/
- GitHub repo: https://github.com/babel-llm/babel-llm
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce Babel, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continue pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel’s performance ceiling. We introduce two variants: Babel-9B, designed for efficient inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level of commercial models.
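The "layer extension" that raises the parameter count can be sketched as duplicating a contiguous slice of transformer layers and inserting the copies, a common depth up-scaling recipe. This is an assumption about the mechanics; Babel's exact slice placement and initialization may differ:

```python
# Depth up-scaling by layer duplication: copy layers[start:end] and insert
# the copies right after the originals, growing the model without changing
# its architecture elsewhere.

def extend_layers(layers, start, end):
    """Return a new layer stack with layers[start:end] duplicated."""
    return layers[:end] + [f"{l}'" for l in layers[start:end]] + layers[end:]

base = [f"L{i}" for i in range(8)]       # an 8-layer model
extended = extend_layers(base, 4, 8)     # duplicate the upper half -> 12 layers
```

The duplicated layers start from trained weights, so continued pretraining converges from a sensible initialization rather than from scratch.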
Predictive Data Selection: The Data That Predicts Is the Data That Teaches
- Title: Predictive Data Selection: The Data That Predicts Is the Data That Teaches
- Authors: Kashun Shum, Yuzhen Huang, Hongjian Zou, Ding Qi, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
- Date: 2025-03-02
- arXiv page: https://arxiv.org/abs/2503.00808
- GitHub repo: https://github.com/hkust-nlp/PreSelect
Abstract
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data’s Predictive strength (Preselect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpasses the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu on a scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
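The selection loop itself is simple once you have a document scorer: score every candidate document by predictive strength and keep the top fraction. A minimal sketch, assuming a toy scorer in place of the paper's trained fastText classifier (the normalized-loss formula is the generic bits-per-byte style quantity the abstract alludes to):

```python
# PreSelect-style data selection reduced to its skeleton: score documents,
# keep the top `keep_ratio` fraction for pretraining.

def normalized_loss(total_loss, num_bytes):
    """Compression-efficiency style score: loss normalized by document size."""
    return total_loss / num_bytes

def preselect(docs, scorer, keep_ratio=0.3):
    ranked = sorted(docs, key=scorer, reverse=True)
    k = max(1, int(len(docs) * keep_ratio))
    return ranked[:k]

docs = [{"id": i, "score": s} for i, s in enumerate([0.1, 0.9, 0.5, 0.7, 0.2])]
kept = preselect(docs, scorer=lambda d: d["score"], keep_ratio=0.4)
```

The point of the fastText scorer in the paper is that this ranking becomes cheap enough to apply to web-scale corpora, where computing actual model losses per document would be prohibitive.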
Chain of Draft: Thinking Faster by Writing Less
- Title: Chain of Draft: Thinking Faster by Writing Less
- Authors: Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He
- Date: 2025-02-25
- arXiv page: https://arxiv.org/abs/2502.18600
- GitHub repo: https://github.com/sileix/chain-of-draft
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as only 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.
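Since CoD is purely a prompting change, a side-by-side of the two instruction styles shows the whole idea. The wording and the example traces below are illustrative, not the paper's exact prompts:

```python
# CoT vs CoD: same task, different verbosity budget for intermediate steps.

COT_PROMPT = ("Think step by step to answer the question. "
              "Explain each step in detail.")
COD_PROMPT = ("Think step by step, but keep each thinking step to a minimal "
              "draft of at most 5 words. Return the answer after ####.")

# Hypothetical traces for "Jason had 20 lollipops, now has 12; how many given?"
cot_trace = ("First, Jason had 20 lollipops in total. Then he gave some to "
             "Denny, after which he had 12 lollipops left. Therefore the "
             "number given away is 20 - 12 = 8.")
cod_trace = "20 - 12 = 8. #### 8"

savings = 1 - len(cod_trace) / len(cot_trace)   # fraction of characters saved
```

Even on this toy example the draft trace is a small fraction of the verbose one, which is the mechanism behind the reported cost and latency reductions.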
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
- Title: HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
- Authors: Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
- Date: 2025-03-03
- arXiv page: https://arxiv.org/abs/2503.02003
- Paper: https://arxiv.org/pdf/2503.02003
- Project page: https://highlightedchainofthought.github.io/
Abstract
An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then, generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks from arithmetic, reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.
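A HoT response can be checked mechanically: facts are wrapped in numbered XML tags in both the reformatted question and the answer, so each highlighted claim can be traced back to the input. A sketch (the `<factN>` tag scheme follows the paper's description; the exact tag names are an assumption):

```python
# Extract <factN>...</factN> spans and verify the answer only highlights
# facts that were grounded in the (reformatted) question.
import re

TAG = re.compile(r"<fact(\d+)>(.*?)</fact\1>")

def extract_facts(text):
    return {int(n): s for n, s in TAG.findall(text)}

question = ("Q: <fact1>Alice has 3 apples</fact1> and "
            "<fact2>buys 2 more</fact2>. Total?")
answer = ("From <fact1>Alice has 3 apples</fact1> and "
          "<fact2>buys 2 more</fact2>, the total is 5.")

grounded = set(extract_facts(answer)) <= set(extract_facts(question))
```

This tag-matching is also what lets time-limited human verifiers jump straight from a claim in the answer to its supporting span in the input.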
EgoLife: Towards Egocentric Life Assistant
- Title: EgoLife: Towards Egocentric Life Assistant
- Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
- Date: 2025-03-05
- arXiv page: https://arxiv.org/abs/2503.03803
- Paper: https://arxiv.org/pdf/2503.03803
- Project page: https://egolife-ai.github.io/blog/
- GitHub repo: https://github.com/EvolvingLMMs-Lab/EgoLife
Abstract
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.
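EgoRAG's job, answering ultra-long-context questions like "when did I last buy milk", reduces to time-stamped retrieval over an event memory. A toy sketch with hypothetical interfaces (the real system builds hierarchical summaries over a week of multimodal egocentric video rather than keyword-matching text):

```python
# Time-indexed memory retrieval: filter events by time window and query
# terms, return the most recent matches.

def retrieve(memory, query_words, since=None, k=2):
    hits = [e for e in memory
            if (since is None or e["t"] >= since)
            and any(w in e["text"] for w in query_words)]
    return sorted(hits, key=lambda e: e["t"], reverse=True)[:k]

memory = [
    {"t": 1, "text": "bought milk at the store"},
    {"t": 5, "text": "cooked pasta with Alice"},
    {"t": 9, "text": "bought more milk"},
]
latest_milk = retrieve(memory, ["milk"], k=1)   # most recent milk purchase
```

Retrieved events would then be handed to the language model (EgoGPT in the paper) to compose the final answer, keeping the long horizon out of the model's context window.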
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
- Title: Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
- Authors: Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling
- Date: 2025-03-03
- arXiv page: https://arxiv.org/abs/2503.01774
- Paper: https://arxiv.org/pdf/2503.01774
- Project page: https://research.nvidia.com/labs/toronto-ai/difix3d
Abstract
Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis task. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2× improvement in FID score over baselines while maintaining 3D consistency.
Process-based Self-Rewarding Language Models
- Title: Process-based Self-Rewarding Language Models
- Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
- Date: 2025-03-05
- arXiv page: https://arxiv.org/abs/2503.03746
- Paper: https://arxiv.org/pdf/2503.03746
Abstract
Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs’ performance, which is constrained by the upper limit of human performance. Therefore, Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.
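The step-wise LLM-as-a-judge piece can be sketched as: given a reasoning prefix, score several candidate next steps and keep the best and worst as a preference pair for step-wise optimization. The judge below is a stub (in the paper the model judges its own candidates, and the pairs feed a DPO-style update):

```python
# Build a step-wise preference pair: for one reasoning prefix, rank candidate
# next steps by a judge score and keep (chosen, rejected).

def stepwise_pair(prefix_steps, candidates, judge):
    scored = sorted(candidates, key=judge, reverse=True)
    return {"prefix": prefix_steps, "chosen": scored[0], "rejected": scored[-1]}

pair = stepwise_pair(
    prefix_steps=["Let x be the unknown.", "Then 2x + 3 = 11."],
    candidates=["So 2x = 8, x = 4.",
                "So x = 14.",
                "So 2x = 8, hence x = 3."],
    judge=lambda step: 1.0 if "x = 4" in step else 0.0,  # stub judge
)
```

Judging at the step level rather than the whole-answer level is what lets the signal localize which part of a math derivation went wrong, the failure mode of answer-level self-rewarding the abstract points out.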
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- Title: Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- Authors: Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
- Date: 2025-03-03
- arXiv page: https://arxiv.org/abs/2503.01307
Abstract
Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors – verification, backtracking, subgoal setting, and backward chaining – that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor – models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
中文摘要
测试时推理已成为一种强大的范式,使语言模型能够像熟练的人类专家一样,对复杂挑战进行更长、更仔细的"思考"。虽然强化学习(RL)可以在可验证任务上推动语言模型的自我改进,但有些模型提升显著,另一些却很快陷入停滞。例如,我们发现在倒计时(Countdown)游戏上接受完全相同的RL训练时,Qwen-2.5-3B远超Llama-3.2-3B。这一差异引出一个关键问题:哪些内在特性使有效的自我改进成为可能?我们提出一个框架来研究这一问题,分析了专家级人类解题者与成功的语言模型共同采用的四种关键认知行为:验证、回溯、子目标设定与逆向推理(backward chaining)。研究表明,Qwen天然表现出这些推理行为,而Llama起初缺乏它们。在使用受控行为数据集的系统实验中,我们发现,用包含这些推理行为的示例对Llama进行引导(priming),可使其在RL阶段获得大幅提升,达到甚至超过Qwen的表现。重要的是,关键因素在于推理行为的存在,而非答案的正确性:用包含正确推理模式但答案错误的解来引导的模型,与用正确解训练的模型表现相当。最后,利用经过筛选以放大推理行为的OpenWebMath数据继续预训练,可使Llama的自我改进轨迹与Qwen持平。我们的发现确立了初始推理行为与改进能力之间的基本关系,解释了为什么一些语言模型能有效利用额外计算,而另一些则停滞不前。
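论文中的四种认知行为可以用一个极简的关键词探测器来直观感受。下面是一个玩具示意:正则模式与统计方式均为假设示例,论文中的行为识别由专门的分类模型完成,并非这种关键词匹配。

```python
import re

# 四种认知行为的简易关键词探测器(假设的正则模式,仅作示意)
BEHAVIOR_PATTERNS = {
    "verification": r"let me check|verify|double-check",      # 验证
    "backtracking": r"wait|try another|that's wrong",         # 回溯
    "subgoal_setting": r"break (?:it|this) down|subgoal",     # 子目标设定
    "backward_chaining": r"work(?:ing)? backwards?",          # 逆向推理
}

def detect_behaviors(trace: str) -> dict:
    """统计一条推理轨迹中每种认知行为出现的次数。"""
    t = trace.lower()
    return {name: len(re.findall(pat, t)) for name, pat in BEHAVIOR_PATTERNS.items()}

trace = ("First, break this down into two subgoals. "
         "Let me check the sum... wait, that's wrong, let me try another path, "
         "working backwards from the target 24.")
counts = detect_behaviors(trace)
```

这样的计数器可用来粗略比较不同模型的推理轨迹中各类行为的出现频率,呼应论文"行为存在与否比答案对错更关键"的结论。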
DeepSolution:通过基于树的探索与双点思维增强复杂工程方案设计
- 标题: DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
- 作者: Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun
- 日期: 2025-02-28
- ArXiv主页: https://arxiv.org/abs/2502.20730
- gitHub仓库: https://github.com/Li-Z-Q/DeepSolution
英文摘要
Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system’s ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages the tree-based exploration and bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on the SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.
中文摘要
为复杂的工程挑战设计解决方案在人类生产活动中至关重要。然而,检索增强生成(RAG)领域此前的研究尚未充分涉及复杂工程方案设计类任务。为填补这一空白,我们提出新基准SolutionBench,用于评估系统在多重复杂约束下为工程问题生成完整且可行方案的能力。为进一步推进复杂工程方案设计,我们提出新系统SolutionRAG,它利用基于树的探索与双点思维(bi-point thinking)机制来生成可靠的方案。大量实验结果表明,SolutionRAG在SolutionBench上取得了最先进(SOTA)的性能,凸显了其在真实应用中提升复杂工程方案设计自动化程度与可靠性的潜力。
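"双点思维"指方案节点与评审节点在树中交替:评审给出意见与打分,方案据此扩展。下面的极简树搜索只是示意性的玩具实现,propose/review 的具体规则均为假设,实际系统中二者都由 LLM 驱动。

```python
# “双点思维”树搜索的玩具示意:方案节点与评审节点交替扩展,保留最高分方案
def review(solution: str) -> tuple:
    """评审节点:返回改进意见与打分(玩具规则:方案越长视为越完善)。"""
    return "add-constraint-check", min(1.0, len(solution) / 40)

def propose(solution: str, comment: str) -> list:
    """方案节点:依据评审意见扩展出两个改进候选(玩具实现)。"""
    return [f"{solution}+fix({comment})", f"{solution}+alt"]

def tree_search(root: str, depth: int = 3) -> tuple:
    best, best_score = root, 0.0
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for sol in frontier:
            comment, score = review(sol)       # 评审“点”
            if score > best_score:
                best, best_score = sol, score
            next_frontier.extend(propose(sol, comment))  # 方案“点”
        frontier = next_frontier
    return best, best_score

best, score = tree_search("draft-solution")
```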
KodCode:多样、具有挑战性且可验证的代码合成数据集
- 标题: KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
- 作者: Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran
- 日期: 2025-03-04
- ArXiv主页: https://arxiv.org/abs/2503.02951
- 论文链接: https://github.com/KodCode-AI/kodcode/blob/main/paper/kodcode_v1.pdf
- 项目链接: https://kodcode-ai.github.io/
- gitHub仓库: https://github.com/KodCode-AI/kodcode
英文摘要
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based reject sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. KodCode is suitable for supervised fine-tuning and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
中文摘要
我们介绍了KodCode,这是一个合成数据集,它解决了在各种困难和领域中获取高质量、可验证的训练数据以训练大型语言模型进行编码的持续挑战。现有的以代码为中心的资源通常无法确保覆盖范围的广度(例如,从简单的编码任务到高级算法问题)或可验证的正确性(例如,单元测试)。相比之下,KodCode包括通过自我验证程序系统验证的问题解决方案测试三元组。我们的流程首先综合了一系列广泛的编码问题,然后生成解决方案和测试用例,并为具有挑战性的问题分配了额外的尝试。最后,训练后数据合成是通过将问题重写为不同的格式,并在基于测试的拒绝抽样程序下从推理模型(DeepSeek R1)生成响应来完成的。这个管道产生了一个大规模、健壮和多样化的编码数据集。KodCode适用于监督微调,成对的单元测试也为RL调优提供了巨大的潜力。对编码基准(HumanEval(+)、MBPP(+),BigCodeBench和LiveCodeBench)进行的微调实验表明,KodCode调优的模型达到了最先进的性能,超过了Qwen2.5-Coder-32B-Instruct和DeepSeek-R1-Distill-Llama-70B等模型。
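摘要中"基于测试的拒绝采样"一步可以用几行 Python 勾勒:对每道题的多个候选解,只保留能通过配套单元测试的解。下面是玩具示意,候选解用写死的字符串代替推理模型(如 DeepSeek R1)的采样输出。

```python
# 基于测试的拒绝采样示意:通过单元测试的候选解被保留,其余被拒绝
def passes_tests(code: str, tests: list) -> bool:
    """在独立命名空间中执行候选解,并用单元测试验证其正确性。"""
    env = {}
    try:
        exec(code, env)
        return all(env["solve"](inp) == expected for inp, expected in tests)
    except Exception:
        return False   # 语法错误、运行错误都视为未通过

candidates = [
    "def solve(x):\n    return x * 2",    # 正确解
    "def solve(x):\n    return x + 2",    # 逻辑错误
    "def solve(x):\n    return x * 2 +",  # 语法错误
]
unit_tests = [(1, 2), (3, 6)]
accepted = [c for c in candidates if passes_tests(c, unit_tests)]
```

真实管线中还需沙箱隔离与超时控制,此处从略。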
通过单步奖励实现多轮代码生成
- 标题: Multi-Turn Code Generation Through Single-Step Rewards
- 作者: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
- 日期: 2025-02-27
- ArXiv主页: https://arxiv.org/abs/2502.20380
- 论文链接: https://arxiv.org/pdf/2502.20380
- 项目链接: https://portal-cornell.github.io/muCode/
- gitHub仓库: https://github.com/portal-cornell/muCode
英文摘要
We are excited to share our paper "Multi-Turn Code Generation Through Single-Step Rewards". Please find our project page at https://portal-cornell.github.io/muCode/
中文摘要
我们很高兴分享我们的论文"通过单步奖励实现多轮代码生成"。项目主页见 https://portal-cornell.github.io/muCode/。
从数小时到数分钟:长达100K词元的超长序列生成的无损加速
- 标题: From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
- 作者: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.18890
- gitHub仓库: https://github.com/bigai-nlco/TokenSwift
英文摘要
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model’s inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
中文摘要
使用大语言模型(LLM)生成超长序列正变得日益重要,但这仍是一项极其耗时的任务,对长达100K词元的序列尤其如此。虽然已有传统的投机解码方法,但简单地扩大其生成上限并不能加速该过程,甚至可能适得其反。通过深入分析,我们识别出阻碍高效生成的三大挑战:频繁的模型重载、动态键值(KV)缓存管理以及重复生成。为解决这些问题,我们提出TOKENSWIFT,一个旨在大幅加速超长序列生成、同时保持目标模型固有质量的新框架。实验结果表明,TOKENSWIFT在不同规模(1.5B、7B、8B、14B)和不同架构(MHA、GQA)的模型上均实现了3倍以上的加速。这一加速为超长序列生成节省了数小时的时间,使TOKENSWIFT成为前所未有长度下可扩展且有效的解决方案。代码见 https://github.com/bigai-nlco/TokenSwift。
MultiAgentBench:评估LLM智能体的协作与竞争
- 标题: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
- 作者: Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
- 日期: 2025-03-03
- ArXiv主页: https://arxiv.org/abs/2503.01935
英文摘要
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini reaches the average highest task score, graph structure performs the best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are public available at https://github.com/MultiagentBench/MARBLE.
中文摘要
大型语言模型(LLM)作为自主智能体展现出卓越能力,但现有基准要么聚焦单智能体任务,要么局限于狭窄领域,未能刻画多智能体协调与竞争的动态。本文提出MultiAgentBench,一个旨在跨多样交互场景评估基于LLM的多智能体系统的综合基准。我们的框架不仅度量任务完成度,还通过新颖的、基于里程碑的关键绩效指标度量协作与竞争的质量。此外,我们评估了多种协调协议(包括星形、链式、树形和图拓扑)以及小组讨论、认知规划等创新策略。值得注意的是,gpt-4o-mini取得了最高的平均任务得分,在研究场景中图结构在各协调协议中表现最佳,而认知规划将里程碑达成率提升了3%。代码与数据集公开于 https://github.com/MultiagentBench/MARBLE。
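摘要中比较的几种协调拓扑可以直接用邻接表写出来。下面是一个示意性的构造(编号约定与"0号为中心/根"等设定均为假设,并非基准的实际实现):

```python
# 星形、链式、树形、全连通图四种协调拓扑的邻接表构造示意
def make_topology(kind: str, n: int) -> dict:
    edges = {i: set() for i in range(n)}
    def link(a: int, b: int):
        edges[a].add(b); edges[b].add(a)
    if kind == "star":            # 假设 0 号为中心协调者
        for i in range(1, n): link(0, i)
    elif kind == "chain":
        for i in range(n - 1): link(i, i + 1)
    elif kind == "tree":          # 完全二叉树,0 号为根
        for i in range(1, n): link(i, (i - 1) // 2)
    elif kind == "graph":         # 全连通图
        for a in range(n):
            for b in range(a + 1, n): link(a, b)
    return edges

star = make_topology("star", 5)
chain = make_topology("chain", 5)
graph = make_topology("graph", 5)
```

智能体只沿邻接表中的边交换消息,这正是不同协调协议在通信结构上的差别所在。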
DiffRhythm:基于潜在扩散、极速且极简的端到端全长歌曲生成
- 标题: DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
- 作者: Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, Lei Xie
- 日期: 2025-03-03
- ArXiv主页: https://arxiv.org/abs/2503.01183
- 论文链接: https://arxiv.org/pdf/2503.01183
- 项目链接: https://aslp-lab.github.io/DiffRhythm.github.io/
- gitHub仓库: https://github.com/ASLP-lab/DiffRhythm
英文摘要
Recent advancements in music generation have garnered significant attention, yet existing approaches face critical limitations. Some current generative models can only synthesize either the vocal track or the accompaniment track. While some models can generate combined vocal and accompaniment, they typically rely on meticulously designed multi-stage cascading architectures and intricate data pipelines, hindering scalability. Additionally, most systems are restricted to generating short musical segments rather than full-length songs. Furthermore, widely used language model-based methods suffer from slow inference speeds. To address these challenges, we propose DiffRhythm, the first latent diffusion-based song generation model capable of synthesizing complete songs with both vocal and accompaniment for durations of up to 4m45s in only ten seconds, maintaining high musicality and intelligibility. Despite its remarkable capabilities, DiffRhythm is designed to be simple and elegant: it eliminates the need for complex data preparation, employs a straightforward model structure, and requires only lyrics and a style prompt during inference. Additionally, its non-autoregressive structure ensures fast inference speeds. This simplicity guarantees the scalability of DiffRhythm. Moreover, we release the complete training code along with the pre-trained model on large-scale data to promote reproducibility and further research.
中文摘要
音乐生成领域的最新进展备受关注,但现有方法仍面临关键局限。一些现有生成模型只能单独合成人声轨或伴奏轨;部分模型虽能生成人声与伴奏的组合,却通常依赖精心设计的多阶段级联架构和复杂的数据管线,阻碍了可扩展性。此外,多数系统只能生成简短的音乐片段而非完整歌曲,而被广泛使用的基于语言模型的方法又存在推理速度慢的问题。为应对这些挑战,我们提出DiffRhythm,首个基于潜在扩散的歌曲生成模型,能在仅十秒内合成长达4分45秒、包含人声与伴奏的完整歌曲,并保持高度的音乐性与可懂度。尽管能力出众,DiffRhythm的设计却简单而优雅:它无需复杂的数据准备,采用直接的模型结构,推理时只需歌词和一个风格提示。其非自回归结构还确保了快速的推理速度。这种简洁性保证了DiffRhythm的可扩展性。此外,我们公开了完整的训练代码及在大规模数据上预训练的模型,以促进可复现性和后续研究。
SemViQA:面向越南语信息事实核查的语义问答系统
- 标题: SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking
- 作者: Nam V. Nguyen, Dien X. Tran, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le
- 日期: 2025-03-02
- ArXiv主页: https://arxiv.org/abs/2503.00955
英文摘要
The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: https://github.com/DAVID-NGUYEN-S16/SemViQA.
中文摘要
在GPT、Gemini等大语言模型(LLM)的推波助澜下,错误信息的蔓延呼唤稳健的事实核查方案,对越南语这类低资源语言尤为迫切。现有方法难以应对语义歧义、同形异义词和复杂的语言结构,常常以牺牲准确率换取效率。我们提出SemViQA,一个新颖的越南语事实核查框架,集成了基于语义的证据检索(SER)与两步判定分类(TVC)。我们的方法兼顾精度与速度,取得了最先进的结果:在ISE-DSC01上严格准确率达78.97%,在ViWikiFC上达80.82%,并在UIT数据科学挑战赛中获得第一名。此外,SemViQA Faster在保持有竞争力准确率的同时将推理速度提升7倍。SemViQA为越南语事实验证树立了新基准,推进了对抗错误信息的进程。源代码见:https://github.com/DAVID-NGUYEN-S16/SemViQA。
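"证据检索 + 两步判定"这一流水线形态可以用一个玩具例子勾勒:第一步判断证据是否充分(不足则输出 NEI),第二步再判支持或反驳。以下打分方式、阈值和"含否定词即反驳"的规则全是假设的简化,真实系统用的是语义模型。

```python
# SER + TVC 两阶段流程的玩具示意(词重叠打分代替语义检索,规则代替分类器)
def retrieve(claim: str, sentences: list) -> tuple:
    """证据检索:返回与声明词重叠率最高的句子及其得分。"""
    def overlap(s: str) -> float:
        a = set(claim.lower().split())
        b = set(s.lower().split())
        return len(a & b) / max(len(a), 1)
    best = max(sentences, key=overlap)
    return best, overlap(best)

def verdict(claim: str, sentences: list, nei_threshold: float = 0.4) -> str:
    evidence, score = retrieve(claim, sentences)
    if score < nei_threshold:                 # 第一步:证据不足 → NEI
        return "NEI"
    # 第二步:支持 / 反驳(玩具规则:证据含否定词视为反驳)
    return "REFUTED" if " not " in f" {evidence.lower()} " else "SUPPORTED"

corpus = ["Hanoi is the capital of Vietnam.",
          "Da Nang is not the capital of Vietnam."]
```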
OneRec:以生成式推荐与迭代偏好对齐统一检索与排序
- 标题: OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
- 作者: Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, Guorui Zhou
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.18965
- 论文链接: https://arxiv.org/pdf/2502.18965
英文摘要
Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, where the generative model functions only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user’s historical behavior sequences and gradually decodes the videos that the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) a session-wise generation approach. In contrast to traditional next-item prediction, we propose a session-wise generation, which is more elegant and contextually coherent than point-by-point generation that relies on hand-crafted rules to properly combine the generated results. 3) an Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user’s browsing request, making it impossible to obtain positive and negative samples simultaneously. To address this limitation, We design a reward model to simulate user generation and customize the sampling strategy. Extensive experiments have demonstrated that a limited number of DPO samples can align user interest preferences and significantly improve the quality of generated results. We deployed OneRec in the main scene of Kuaishou, achieving a 1.6% increase in watch-time, which is a substantial improvement.
中文摘要
近来,基于生成式检索的推荐系统成为一种有前景的范式。然而,大多数现代推荐系统采用"检索-排序"策略,生成模型仅在检索阶段充当选择器。本文提出OneRec,用统一的生成模型取代级联学习框架。据我们所知,这是首个在真实场景中显著超越当前复杂且精心设计的推荐系统的端到端生成模型。具体而言,OneRec包括:1)编码器-解码器结构,对用户的历史行为序列进行编码,并逐步解码出用户可能感兴趣的视频;我们采用稀疏专家混合(MoE)来扩展模型容量,而无需按比例增加计算FLOPs。2)会话级(session-wise)生成方式:与传统的逐条下一物品预测不同,我们提出会话级生成,它比依赖手工规则拼接生成结果的逐点生成更优雅、上下文也更连贯。3)迭代偏好对齐模块,结合直接偏好优化(DPO)提升生成结果的质量。与NLP中的DPO不同,推荐系统通常对每个用户的浏览请求只有一次展示结果的机会,无法同时获得正、负样本。为此,我们设计了一个奖励模型来模拟用户生成并定制采样策略。大量实验表明,少量DPO样本即可对齐用户兴趣偏好并显著提升生成结果的质量。我们已将OneRec部署于快手主场景,观看时长提升1.6%,是一项可观的改进。
MPO:通过元计划优化增强LLM智能体
- 标题: MPO: Boosting LLM Agents with Meta Plan Optimization
- 作者: Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li
- 日期: 2025-03-04
- ArXiv主页: https://arxiv.org/abs/2503.02682
- gitHub仓库: https://github.com/WeiminXiong/MPO
英文摘要
Recent advancements in large language models (LLMs) have enabled LLM-based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high-level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent’s task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug-and-play solution that enhances both task completion efficiency and generalization capabilities in previous unseen scenarios.
中文摘要
大型语言模型(LLM)的最新进展使基于LLM的智能体能够成功处理交互式规划任务。然而,尽管成绩斐然,现有方法往往受困于规划幻觉,且每个新智能体都需要重新训练。为应对这些挑战,我们提出元计划优化(MPO)框架,通过直接引入显式指导来增强智能体的规划能力。以往依赖复杂知识的方法要么需要大量人力,要么缺乏质量保证;与之不同,MPO通过元计划提供高层级的通用指导来辅助智能体规划,并能依据智能体任务执行的反馈对元计划持续优化。在两个代表性任务上的实验表明,MPO显著优于现有基线。此外,我们的分析表明,MPO提供了一种即插即用的方案,在此前未见过的场景中同时提升任务完成效率与泛化能力。
LLM如同失真的传声筒:迭代生成扭曲信息
- 标题: LLM as a Broken Telephone: Iterative Generation Distorts Information
- 作者: Amr Mohamed, Mingmeng Geng, Michalis Vazirgiannis, Guokan Shang
- 日期: 2025-02-27
- ArXiv主页: https://arxiv.org/abs/2502.20258
- gitHub仓库: https://github.com/amr-mohamedd/LLM-as-a-Broken-Telephone
英文摘要
As large language models are increasingly responsible for online content, concerns arise about the impact of repeatedly processing their own outputs. Inspired by the “broken telephone” effect in chained human communication, this study investigates whether LLMs similarly distort information through iterative generation. Through translation-based experiments, we find that distortion accumulates over time, influenced by language choice and chain complexity. While degradation is inevitable, it can be mitigated through strategic prompting techniques. These findings contribute to discussions on the long-term effects of AI-mediated information propagation, raising important questions about the reliability of LLM-generated content in iterative workflows.
中文摘要
随着大语言模型越来越多地承担在线内容的生产,人们开始担忧其反复处理自身输出所带来的影响。受人类链式传话中"传声筒失真"(broken telephone)效应的启发,本研究考察LLM是否会在迭代生成中类似地扭曲信息。通过基于翻译的实验,我们发现失真会随时间累积,并受语言选择与链条复杂度的影响。退化虽不可避免,但可以通过有策略的提示技术加以缓解。这些发现为关于AI中介信息传播长期影响的讨论提供了依据,并对迭代工作流中LLM生成内容的可靠性提出了重要疑问。
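"失真随迭代累积"这一现象可以用一个极简模拟直观展示:把一段文本反复送入有损信道,度量其与原文的偏离。真实实验中的信道是 LLM 翻译链;下面用"截断长词"这个假设的信道代替,只为演示累积形态。

```python
# “传声筒失真”的玩具模拟:有损信道逐轮改写文本,失真单调累积
def lossy_rewrite(text: str, max_len: int) -> str:
    """假设的有损信道:超过 max_len 的词被截断。"""
    return " ".join(w[:max_len] for w in text.split())

def distortion(a: str, b: str) -> float:
    """逐词不一致率,作为失真度量。"""
    wa, wb = a.split(), b.split()
    return sum(x != y for x, y in zip(wa, wb)) / max(len(wa), 1)

original = "models may distort information over many generations"
text, history = original, []
for step in range(3):
    text = lossy_rewrite(text, max_len=7 - step)  # 信道逐轮变差
    history.append(distortion(original, text))
```

论文中的失真度量基于翻译链的语义偏移,这里的逐词比较只是最粗糙的替身。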
Audio Flamingo 2:具备长音频理解与专家级推理能力的音频语言模型
- 标题: Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
- 作者: Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
- 日期: 2025-03-06
- ArXiv主页: https://arxiv.org/abs/2503.03983
- 论文链接: https://arxiv.org/pdf/2503.03983
- 项目链接: https://huggingface.co/spaces/nvidia/audio-flamingo-2
- gitHub仓库: https://github.com/NVIDIA/audio-flamingo
英文摘要
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
中文摘要
对非语音声音和音乐的理解与推理,对人类和AI智能体与环境的有效交互都至关重要。本文提出Audio Flamingo 2(AF2),一个具备先进音频理解与推理能力的音频-语言模型(ALM)。AF2利用了:(i)定制的CLAP模型,(ii)用于细粒度音频推理的合成音频问答数据,以及(iii)多阶段课程学习策略。AF2仅凭一个3B参数的小语言模型即取得最先进性能,在20多个基准上超越了开源与专有的大模型。接着,我们首次将音频理解扩展到长音频片段(30秒至5分钟),并提出LongAudio,一个用于在长音频描述与问答任务上训练ALM的大规模新数据集。在LongAudio上微调AF2后,模型在我们提出的LongAudioBench(一个由专家标注、用于评估ALM长音频理解能力的基准)上表现出色。我们开展了大量消融实验以验证方法的有效性。项目网站:https://research.nvidia.com/labs/adlr/AF2/。
仅用ImageNet做文本到图像生成,我们能走多远?
- 标题: How far can we go with ImageNet for Text-to-Image generation?
- 作者: L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton
- 日期: 2025-02-28
- ArXiv主页: https://arxiv.org/abs/2502.21318
- 论文链接: https://arxiv.org/pdf/2502.21318
- 项目链接: https://lucasdegeorge.github.io/projects/t2i_imagenet/
- gitHub仓库: https://github.com/lucasdegeorge/T2I-ImageNet
英文摘要
Recent text-to-image (T2I) generation models have achieved remarkable results by training on billion-scale datasets, following a `bigger is better’ paradigm that prioritizes data quantity over quality. We challenge this established paradigm by demonstrating that strategic data augmentation of small, well-curated datasets can match or outperform models trained on massive web-scraped collections. Using only ImageNet enhanced with well-designed text and image augmentations, we achieve a +2 overall score over SD-XL on GenEval and +5 on DPGBench while using just 1/10th the parameters and 1/1000th the training images. Our results suggest that strategic data augmentation, rather than massive datasets, could offer a more sustainable path forward for T2I generation.
中文摘要
近期的文本到图像(T2I)生成模型遵循"越大越好"、重数量轻质量的范式,通过在十亿规模数据集上训练取得了显著成果。我们对这一既定范式提出挑战:对小而精的数据集进行有策略的数据增强,可以媲美甚至超越在海量网络爬取数据上训练的模型。仅使用经过精心设计的文本与图像增强的ImageNet,我们在GenEval上比SD-XL高出2分、在DPGBench上高出5分,而参数量仅为其1/10、训练图像仅为其1/1000。我们的结果表明,有策略的数据增强而非海量数据集,可能为T2I生成提供一条更可持续的前进道路。
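文本侧增强的基本思路是把类别名扩写成多样化的合成描述。下面的模板与风格词均为假设示例,并非论文实际使用的增强管线;图像侧增强(裁剪、翻转等)此处从略。

```python
import random

# 文本增强示意:用模板把 ImageNet 类别名扩写为多样化 caption(模板为假设)
TEMPLATES = [
    "a photo of a {label}",
    "a {style} painting of a {label}",
    "a close-up shot of a {label} in the wild",
]
STYLES = ["watercolor", "oil", "minimalist"]

def augment_caption(label: str, rng: random.Random) -> str:
    """随机选模板与风格词,生成一条合成描述(str.format 会忽略多余的关键字)。"""
    template = rng.choice(TEMPLATES)
    return template.format(label=label, style=rng.choice(STYLES))

rng = random.Random(0)   # 固定种子,便于复现
captions = [augment_caption("tabby cat", rng) for _ in range(4)]
```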
LINGOLY-TOO:通过语言模板化与正字法混淆将记忆与推理解耦
- 标题: LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
- 作者: Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacs, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
- 日期: 2025-03-04
- ArXiv主页: https://arxiv.org/abs/2503.02972
- 论文链接: https://arxiv.org/pdf/2503.02972
- 项目链接: https://huggingface.co/spaces/jkhouja/lingoly-too
- gitHub仓库: https://github.com/jkhouja/L2
英文摘要
Effective evaluation of the reasoning capabilities of large language models (LLMs) is susceptible to overestimation due to data exposure of evaluation benchmarks. We introduce a framework for producing linguistic reasoning problems that reduces the effect of memorisation in model performance estimates and apply this framework to develop LINGOLY-TOO, a challenging evaluation benchmark for linguistic reasoning. By developing orthographic templates, we dynamically obfuscate the writing systems of real languages to generate numerous question variations. These variations preserve the reasoning steps required for each solution while reducing the likelihood of specific problem instances appearing in model training data. Our experiments demonstrate that frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with advanced reasoning. Our analysis also shows that LLMs exhibit noticeable variance in accuracy across permutations of the same problem, and on average perform better on questions appearing in their original orthography. Our findings highlight the opaque nature of response generation in LLMs and provide evidence that prior data exposure contributes to overestimating the reasoning capabilities of frontier models.
中文摘要
由于评估基准的数据泄露,对大语言模型(LLM)推理能力的有效评估容易被高估。我们提出一个构造语言学推理问题的框架,以减少记忆效应对模型性能估计的影响,并应用该框架开发了LINGOLY-TOO,一个具有挑战性的语言学推理评估基准。通过开发正字法模板,我们动态混淆真实语言的书写系统,从而生成大量题目变体。这些变体保留了每个解法所需的推理步骤,同时降低了特定题目实例出现在模型训练数据中的可能性。实验表明,包括OpenAI o1-preview和DeepSeek R1在内的前沿模型在高阶推理上表现吃力。我们的分析还显示,LLM在同一问题的不同变换之间准确率差异明显,且在以原始正字法呈现的问题上平均表现更好。这些发现凸显了LLM生成回答过程的不透明性,并提供证据表明先前的数据暴露会导致对前沿模型推理能力的高估。
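正字法混淆的核心是:用一张固定的一一替换表改写书写系统,题目的结构(重复字母的位置、词边界)原样保留,字面形式却不再与训练数据逐字重合。下面是一个示意性实现;替换表的生成方式为假设,论文使用的是按语言设计的正字法模板。

```python
import random
import string

# 正字法混淆示意:固定随机替换表逐字符改写文本,结构保持不变
def make_ruleset(seed: int) -> dict:
    """生成一张小写字母的一一随机替换表(假设的生成方式)。"""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def obfuscate(text: str, rules: dict) -> str:
    """逐字符替换;空格等非字母字符保持不变。"""
    return "".join(rules.get(ch, ch) for ch in text.lower())

rules = make_ruleset(seed=7)
plain = "ata kan ro"          # 某个假想语言的一句话
cipher = obfuscate(plain, rules)
```

注意重复字母(此处的三个 a)在混淆后仍落在相同位置,因此解题所需的对应关系推理不受影响。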
GEN3C:具备精确相机控制、3D信息引导且世界一致的视频生成
- 标题: GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
- 作者: Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, Jun Gao
- 日期: 2025-03-05
- ArXiv主页: https://arxiv.org/abs/2503.03751
- 论文链接: https://arxiv.org/pdf/2503.03751
英文摘要
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! https://research.nvidia.com/labs/toronto-ai/GEN3C/
中文摘要
我们提出GEN3C,一个具备精确相机控制与时序3D一致性的生成式视频模型。此前的视频模型已能生成逼真的视频,但往往很少利用3D信息,导致物体忽隐忽现等不一致现象。即便实现了相机控制,其控制也不精确,因为相机参数只是神经网络的普通输入,网络还须自行推断视频与相机的依赖关系。相比之下,GEN3C由一个3D缓存引导:即通过预测种子图像或先前生成帧的逐像素深度而得到的点云。生成后续帧时,GEN3C以该3D缓存按用户给定的新相机轨迹进行的2D渲染为条件。关键在于,这意味着GEN3C既无需记住它此前生成了什么,也无需从相机位姿推断图像结构;模型可以把全部生成能力集中在此前未观测到的区域,以及把场景状态推进到下一帧上。我们的结果展示了比以往工作更精确的相机控制,并在稀疏视角的新视角合成上取得最先进的结果,即便在驾驶场景、单目动态视频等具有挑战性的设定下也是如此。结果最好通过视频查看,欢迎访问我们的网页!https://research.nvidia.com/labs/toronto-ai/gen3c/
LADDER:通过递归问题分解实现自我改进的LLM
- 标题: LADDER: Self-Improving LLMs Through Recursive Problem Decomposition
- 作者: Toby Simonds, Akira Yoshiyama
- 日期: 2025-03-02
- ArXiv主页: https://arxiv.org/abs/2503.00735
英文摘要
We introduce LADDER (Learning through Autonomous Difficulty-Driven Example Recursion), a framework which enables Large Language Models to autonomously improve their problem-solving capabilities through self-guided learning by recursively generating and solving progressively simpler variants of complex problems. Unlike prior approaches that require curated datasets or human feedback, LADDER leverages a model’s own capabilities to generate easier question variants. We demonstrate LADDER’s effectiveness in the subject of mathematical integration, improving Llama 3.2 3B’s accuracy from 1% to 82% on undergraduate-level problems and enabling Qwen2.5 7B Deepseek-R1 Distilled to achieve 73% on the MIT Integration Bee qualifying examination. We also introduce TTRL (Test-Time Reinforcement Learning), where we perform reinforcement learning on variants of test problems at inference time. TTRL enables Qwen2.5 7B Deepseek-R1 Distilled to achieve a state-of-the-art score of 90% on the MIT Integration Bee qualifying examination, surpassing OpenAI o1’s performance. These results show how self-directed strategic learning can achieve significant capability improvements without relying on architectural scaling or human supervision.
中文摘要
我们提出LADDER(Learning through Autonomous Difficulty-Driven Example Recursion),一个使大语言模型能够通过自我引导学习来自主提升解题能力的框架:递归地生成并求解复杂问题的逐步简化变体。与需要精选数据集或人类反馈的既有方法不同,LADDER利用模型自身的能力生成更容易的问题变体。我们在数学积分任务上验证了LADDER的有效性:将Llama 3.2 3B在本科水平问题上的准确率从1%提升到82%,并使Qwen2.5 7B Deepseek-R1 Distilled在MIT Integration Bee资格考试中达到73%。我们还提出TTRL(测试时强化学习),即在推理时对测试问题的变体进行强化学习。TTRL使Qwen2.5 7B Deepseek-R1 Distilled在MIT Integration Bee资格考试中取得90%的最先进成绩,超越了OpenAI o1的表现。这些结果表明,自我导向的策略性学习无需依赖架构规模扩张或人工监督,即可带来显著的能力提升。
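"递归生成更简单变体、自下而上学回原题"的自举思路可以压缩成一个玩具递归。下面把难度与能力都简化为假设的标量,"能力 +1"是对 RL 更新的极端简化,仅为展示递归结构:

```python
# LADDER 自举思路的玩具示意:先解简单变体提升能力,再回头攻克原题
def simpler_variant(difficulty: int) -> int:
    return difficulty // 2        # 假设:每次生成的变体把难度减半

def ladder_solve(difficulty: int, skill: int = 1) -> tuple:
    if difficulty <= skill:
        return True, skill + 1    # 能直接解出,能力提升(玩具化的 RL 更新)
    # 先递归求解更简单的变体
    solved, skill = ladder_solve(simpler_variant(difficulty), skill)
    if solved and difficulty <= skill * 2:   # 能力足够后再攻克原题(假设的门槛)
        return True, skill + 1
    return False, skill

solved, final_skill = ladder_solve(8)
```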
LLM时代的Wikipedia:进化与风险
- 标题: Wikipedia in the Era of LLMs: Evolution and Risks
- 作者: Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
- 日期: 2025-03-04
- ArXiv主页: https://arxiv.org/abs/2503.02879
- gitHub仓库: https://github.com/HSM316/LLM_Wikipedia
英文摘要
In this paper, we present a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing page views and article content to study Wikipedia’s recent changes and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models might shift as well. Moreover, the effectiveness of RAG might decrease if the knowledge base becomes polluted by LLM-generated content. While LLMs have not yet fully changed Wikipedia’s language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks.
中文摘要
本文深入分析了大语言模型(LLM)对Wikipedia的影响:通过现有数据考察Wikipedia的演变,并用模拟来探究潜在风险。我们首先分析页面浏览量与条目内容,研究Wikipedia的近期变化并评估LLM的影响。随后,我们评估LLM如何影响与Wikipedia相关的各类自然语言处理(NLP)任务,包括机器翻译和检索增强生成(RAG)。我们的发现与模拟结果表明,Wikipedia条目已受到LLM的影响,在某些类别中影响约为1%-2%。如果基于Wikipedia的机器翻译基准受到LLM影响,模型得分可能被抬高,模型之间的比较结果也可能随之改变。此外,如果知识库被LLM生成的内容污染,RAG的有效性可能下降。虽然LLM尚未彻底改变Wikipedia的语言与知识结构,但我们认为这些实证发现表明,需要审慎考虑潜在的未来风险。
SoS1:类o1与R1的推理LLM是平方和(Sum-of-Squares)求解器
- 标题: SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
- 作者: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu
- 日期: 2025-02-27
- ArXiv主页: https://arxiv.org/abs/2502.20545
- gitHub仓库: https://github.com/Joe-2002/SoS1
英文摘要
Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to perform rigorous mathematical problem solving remains an open challenge. In this work, we investigate a fundamental yet computationally intractable problem: determining whether a given multivariate polynomial is nonnegative. This problem, closely related to Hilbert’s Seventeenth Problem, plays a crucial role in global polynomial optimization and has applications in various fields. First, we introduce SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials, along with expert-designed reasoning instructions based on five progressively challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that without structured guidance, all models perform only slightly above the random guess baseline 50%. However, high-quality reasoning instructions significantly improve accuracy, boosting performance up to 81%. Furthermore, our 7B model, SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3 and GPT-4o-mini in accuracy while only requiring 1.8% and 5% of the computation time needed for letters, respectively. Our findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
中文摘要
大语言模型(LLM)已在多种任务上达到人类水平,但其进行严格数学问题求解的能力仍是一个悬而未决的挑战。在这项工作中,我们研究一个基础却在计算上难解的问题:判定给定的多元多项式是否非负。该问题与希尔伯特第十七问题密切相关,在全局多项式优化中举足轻重,并在诸多领域有所应用。首先,我们提出SoS-1K,一个精心构建、包含约1000个多项式的数据集,并附有依据五条难度递进标准由专家设计的推理指令。在评估多个最先进的LLM后,我们发现若无结构化指导,所有模型的表现仅略高于50%的随机猜测基线;而高质量的推理指令可显著提高准确率,最高可达81%。此外,我们的7B模型SoS-7B在SoS-1K上仅微调4小时,准确率便超过671B的DeepSeek-V3和GPT-4o-mini,而所需计算时间分别仅为二者的1.8%与5%。我们的发现凸显了LLM拓展数学推理边界、攻克NP难问题的潜力。
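"判定多项式是否非负"为何棘手,可以从最朴素的否证手段体会:随机采样,若找到使多项式取负值的点即可断定非负性不成立;但找不到并不构成证明,完整的正性证明需要平方和(SOS)分解等工具,这正是论文让LLM推理的对象。下面的示例多项式与采样范围均为假设:

```python
import random

# 采样式“否证”示意:能找到反例则非负性不成立;找不到 ≠ 证明非负
def find_negative_point(poly, dim: int, trials: int = 2000, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-3, 3) for _ in range(dim)]
        if poly(x) < 0:
            return x              # 反例:多项式在该点取负值
    return None                   # 未找到反例(并不等于证明非负)

sos = lambda x: (x[0] - x[1]) ** 2 + (x[0] * x[1]) ** 2   # 平方和,必然非负
indefinite = lambda x: x[0] ** 2 - 2 * x[0] * x[1]        # 可取负值

counterexample = find_negative_point(indefinite, dim=2)
no_counterexample = find_negative_point(sos, dim=2)
```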