QUEST: Query Stream for Practical Cooperative Perception (Paper Reading Notes)
This is my final report for a multi-agent systems course. Combining the course with my own computer-vision research direction, I read two papers; this post covers the first one, which we will go through section by section. Each part consists of the original text, a translation, and my own analysis. (The analysis only reflects my personal opinions and understanding; corrections are welcome.)
1. Abstract
Original: Abstract— Cooperative perception can effectively enhance individual perception performance by providing additional viewpoint and expanding the sensing field. Existing cooperation paradigms are either interpretable (result cooperation) or flexible (feature cooperation). In this paper, we propose the concept of query cooperation to enable interpretable instance-level flexible feature interaction. To specifically explain the concept, we propose a cooperative perception framework, termed QUEST, which let query stream flow among agents. The cross-agent queries are interacted via fusion for co-aware instances and complementation for individual unaware instances. Taking camera-based vehicle-infrastructure perception as a typical practical application scene, the experimental results on the real-world dataset, DAIR-V2X-Seq, demonstrate the effectiveness of QUEST and further reveal the advantage of the query cooperation paradigm on transmission flexibility and robustness to packet dropout. We hope our work can further facilitate the cross-agent representation interaction for better cooperative perception in practice.
Translation: Cooperative perception can effectively improve individual perception performance by providing additional viewpoints and extending the sensing field. Existing cooperation paradigms are either interpretable (result cooperation) or flexible (feature cooperation). This paper proposes the concept of query cooperation to enable interpretable, instance-level, flexible feature interaction. To explain the concept concretely, a cooperative perception framework named QUEST is proposed, which lets a query stream flow among agents; cross-agent queries interact via fusion for co-aware instances and complementation for instances an individual agent is unaware of. Taking camera-based vehicle-infrastructure perception as a typical practical application scenario, experimental results on the real-world dataset DAIR-V2X-Seq demonstrate the effectiveness of QUEST and further reveal the advantages of the query cooperation paradigm in transmission flexibility and robustness to packet dropout. The authors hope the work can further facilitate cross-agent representation interaction for better cooperative perception in practice.
Analysis: This follows the standard abstract structure: background + objective + method + results + conclusion. The background is cooperative perception: it provides extra viewpoints and an extended sensing field to improve a single agent's perception ability. Existing methods, however, face a dilemma: roughly, a choice between "easy to interpret but low in information" and "information-rich but opaque". We can pick A (result cooperation, clear but crude) or B (feature cooperation, rich but hard to interpret). The authors therefore propose a new method, a multi-agent cooperative perception framework; the results and conclusion then follow as usual.
2. Introduction
Original: Despite the significant progress have been made in individual perception, intelligent vehicles still have to face challenges of unobservable dangers caused by occlusion and limited perception range. Different from the individual perception which senses the surrounding with its own onboard sensor system, cooperative perception, especially vehicle-infrastructure cooperative perception (VICP), shed light on reliable autonomous driving in a complex traffic environment and have achieved increasing attention recently [1], [2]. Leveraging the roadside sensor system with more flexible mounting height and posture, the cooperative perception field is effectively extended, and some challenging individual perception cases (e.g., long-range small object detection) can be readily tackled in VICP setting [3], [4].
Translation: Despite significant progress in individual perception, intelligent vehicles still face the challenge of unobservable dangers caused by occlusion and a limited perception range. Unlike individual perception, which senses the surroundings with the vehicle's own onboard sensor system, cooperative perception, especially vehicle-infrastructure cooperative perception (VICP), sheds light on reliable autonomous driving in complex traffic environments and has attracted increasing attention recently [1], [2]. By leveraging roadside sensor systems with more flexible mounting heights and postures, the cooperative sensing field is effectively extended, and some challenging individual perception cases (e.g., long-range small-object detection) can be readily tackled in the VICP setting [3], [4].
Analysis: Background, quickly telling readers and reviewers why this research direction matters. Individual perception is already quite good, but the pressing problems are not ones individual perception can solve. The paragraph also sketches the current state of cooperative perception research, quickly drawing the boundary of the field, which sets up the next paragraph to point out the unsolved problems beyond that boundary.
Original: Advantages are usually followed by new challenges. Naturally, the first and foremost question is how to cooperate between multiple agents. According to what is shared among agents, there are three typical cooperation paradigms [1], [2], [5], including data cooperation (early fusion), feature cooperation (intermediate fusion), and result cooperation (late fusion). Data cooperation [6], [7] is regarded as the upper bound of performance since the comprehensive information is interchanged along with raw data across agents. However, the high transmission cost of massive data is unbearable in practical applications. Result cooperation is widely deployed in practice due to the advantages of bandwidth-economic, where agents only share predictions [3], [6]. Nevertheless, the significant information loss in result cooperation makes it highly reliant on accurate individual predictions. Compared with those two paradigms, feature cooperation [8]–[15] is more flexible and performance-bandwidth balanced, as the information loss is controllable via feature selection and compression. Even though some of them have achieved region-level feature selection [16], the interpretability of feature selection and fusion are still limited, since the scene-level features abstractly represent the whole observable region. It is worth noting that the interaction between predictions in result cooperation is instance-level, resulting in physically interpretable cooperation targets.
Translation: Advantages are usually accompanied by new challenges. Naturally, the first and foremost question is how multiple agents should cooperate. Depending on what is shared among agents, there are three typical cooperation paradigms [1], [2], [5]: data cooperation (early fusion), feature cooperation (intermediate fusion), and result cooperation (late fusion). Data cooperation [6], [7] is regarded as the performance upper bound, since comprehensive information is exchanged among agents along with the raw data. However, the high transmission cost of massive data is unbearable in practical applications. Result cooperation is widely deployed in practice thanks to its bandwidth economy, where agents only share predictions [3], [6]. Nevertheless, the significant information loss in result cooperation makes it highly reliant on accurate individual predictions. Compared with those two paradigms, feature cooperation [8]–[15] is more flexible and better balances performance and bandwidth, since the information loss is controllable via feature selection and compression. Even though some of these methods achieve region-level feature selection [16], the interpretability of feature selection and fusion remains limited, because scene-level features abstractly represent the whole observable region. It is worth noting that the interaction between predictions in result cooperation is at the instance level, resulting in physically interpretable cooperation targets.
Analysis: The problem statement was already mentioned in the abstract. Here the authors ground it in the existing literature to justify the necessity of their research and spell out in detail the problem they want to solve. This is a very standard, well-formed Introduction structure.
Original: Addressing that, we naturally come up with a question: is there an eclectic approach for cooperative perception, which is both interpretable and flexible?
Translation: Addressing this, a natural question arises: is there an eclectic approach to cooperative perception that is both interpretable and flexible?
Analysis: This is the setup to introduce their own method.
Original: Inspired by the success of transformer-based methods in individual perception tasks [17]–[19], we propose the concept of query cooperation, which is an instance-level feature interaction paradigm based on the query stream across agents, standing on the midpoint between scene-level feature cooperation and instance-level result cooperation (Figure 1). The instance-level cooperation makes it more physically interpretable, and feature interaction introduces more information elasticity. Specifically, we propose a framework, named QUEST, as a representative approach to describe the concept, where queries flow in the stream among agents. Firstly, each agent performs individual transformer-based perception. Every query output from the decoder corresponds to a possible detected object, and the query will be shared if its confidence score meets the requirement of the request agent. As the cross-agent queries arrive, they are utilized for both query fusion and complementation. Theoretically, query fusion can enhance the feature of the sensed instance with the feature from other viewpoints, while query complementation can directly complement the unaware instance of the local perception system. Then, the queries are used for cooperative perception, resulting in the final perception results. To evaluate the performance of QUEST, we generate the camera-centric cooperation labels on DAIR-V2X-Seq based on the single-side groundtruth labeled at the image-captured timestamps *.
Our contributions are summarized as follows:
• We propose the concept of query cooperation paradigm for cooperative perception task, which is more interpretable than scene-level feature cooperation and more flexible than result cooperation.
• A query cooperation framework, termed QUEST, is proposed as a representative approach. The cross-agent queries interact at the instance level via fusion and complementation.
• We take the camera-based vehicle-infrastructure cooperative object detection as a typical application scene. The experimental results on the real-world dataset, DAIR-V2X-Seq, demonstrate the effectiveness of QUEST and further show the advantage of the query cooperation paradigm on flexibility and robustness. Besides, the camera-centric cooperation labels are generated to facilitate the further development of the related researches.
Translation: Inspired by the success of transformer-based methods in individual perception tasks [17]–[19], the authors propose the concept of query cooperation, an instance-level feature interaction paradigm based on a query stream across agents, which sits midway between scene-level feature cooperation and instance-level result cooperation (Figure 1). Instance-level cooperation makes the paradigm more physically interpretable, while feature interaction introduces more information elasticity. Concretely, a framework named QUEST is proposed as a representative approach to describe the concept; in it, queries flow as a stream among agents.
First, each agent performs individual transformer-based perception. Every query output by the decoder corresponds to a possibly detected object; if its confidence score meets the requirement of the requesting agent, the query is shared. When the cross-agent queries arrive, they are used for both query fusion and complementation. In theory, query fusion enhances the feature of an already-sensed instance with features from other viewpoints, while query complementation directly supplements instances the local perception system is unaware of (i.e., objects in blind zones). The queries are then used for cooperative perception to produce the final perception results. To evaluate QUEST, camera-centric cooperation labels are generated on DAIR-V2X-Seq from the single-side ground truth labeled at the image-capture timestamps.
The contributions are summarized as follows:
- The concept of the query cooperation paradigm is proposed for the cooperative perception task; it is more interpretable than scene-level feature cooperation and more flexible than result cooperation.
- A query cooperation framework, termed QUEST, is proposed as a representative approach. Cross-agent queries interact at the instance level via fusion and complementation.
- Camera-based vehicle-infrastructure cooperative object detection is taken as a typical application scenario. Experimental results on the real-world dataset DAIR-V2X-Seq demonstrate the effectiveness of QUEST and further reveal the advantages of the query cooperation paradigm in transmission flexibility and robustness. In addition, camera-centric cooperation labels are generated to facilitate further development of related research.
Analysis: "An instance-level feature interaction paradigm based on a cross-agent query stream" is still abstract at this point. It seems the results of different agents (multi-view data) are passed through a Transformer encoder to obtain query features, which are then interacted and fused. What I really want to know is how the same object seen from different viewpoints is associated; that has to wait for the Method section.
3. Related Works
Original: To break the sensing range limitation of onboard sensor systems and eliminate the influences of unobservable dangers, cooperative perception has attracted increasing attention in recent years. The most intuitive approach is data cooperation, which transmits raw sensor data and fundamentally overcomes the occlusion and long-range perception problem. Since 3D data can be directly aggregated, most data cooperation approaches are LiDAR-based [6], [7]. Although raw data reserves comprehensive information, the high transmission cost makes it challenging to deploy in practice. For the convenience of communication, result cooperation only transmits perception predictions, which is the most bandwidth-economic [3], [6]. In addition, the instance-level bounding box aggregation makes the cooperation more physically interpretable. However, the performance of result cooperation highly relies on the accurate individual perception and precise parameters for coordinate system transformation. Therefore, recent methods pay more attention to feature cooperation, which can achieve better performance-bandwidth balance [8]–[16]. Compared with the simple bounding box, the feature map is more flexible for both fusion and compression, but the scene-level feature cooperation is redundant for object perception and less explainable. Aiming on interpretable flexible cooperation, we propose the concept of query cooperation, which transmits instance-level features across agents.
Translation: To break through the sensing-range limitation of onboard sensor systems and eliminate the influence of unobservable dangers, cooperative perception has attracted increasing attention in recent years. The most intuitive approach is data cooperation, which transmits raw sensor data and fundamentally overcomes the occlusion and long-range perception problems. Since 3D data can be directly aggregated, most data cooperation approaches are LiDAR-based [6], [7]. Although raw data preserves comprehensive information, the high transmission cost makes it difficult to deploy in practice. For the convenience of communication, result cooperation only transmits perception predictions, which is the most bandwidth-economic option [3], [6]. In addition, instance-level bounding-box aggregation makes the cooperation more physically interpretable. However, the performance of result cooperation relies heavily on accurate individual perception and precise parameters for coordinate-system transformation. Therefore, recent methods pay more attention to feature cooperation, which achieves a better performance-bandwidth balance [8]–[16]. Compared with a simple bounding box, the feature map is more flexible for both fusion and compression, but scene-level feature cooperation is redundant for object perception and less explainable. Aiming at interpretable and flexible cooperation, the authors propose the concept of query cooperation, which transmits instance-level features across agents.
Analysis: Related work, part one: the three cooperation paradigms, restating the comparison from the introduction.
Original: Since the pioneering work DETR [17] is proposed for 2D object detection, the object query has been adopted for more and more perception tasks, including 3D detection and tracking. Query-based methods typically utilize sparse learnable queries for attentive feature aggregation. DETR3D [18] predicts 3D locations of queries and obtains the corresponding image features via projection. PETR [20] turns to embed image features with 3D position and directly learns the mapping relations using the attention mechanism. BEVFormer [21], [22] tackles the perception from a bird-eye view with grid-shaped queries and manages to realize spatial-temporal feature interaction through the deformable transformer. Leveraging temporal information, query-based methods are also beneficial to object tracking. To model cross-frame object association, MOTR [19] and TrackFormer [23] propose track query based on single frame object query. MUTR [24] and PF-Track [25] utilizes track query and achieve promising tracking performance for multi-view tasks. All of the existing query-based methods are developed for individual perception, we further extend it to cooperative perception in this paper. Specifically, we propose the QUEST framework to achieve a query stream across agents and design the cross-agent query interaction module for query fusion and complementation.
Translation: Since the pioneering work DETR [17] was proposed for 2D object detection, object queries have been adopted for more and more perception tasks, including 3D detection and tracking. Query-based methods typically use sparse learnable queries for attentive feature aggregation. DETR3D [18] predicts the 3D locations of queries and obtains the corresponding image features via projection. PETR [20] instead embeds image features with 3D positions and directly learns the mapping relations using the attention mechanism. BEVFormer [21], [22] tackles perception from the bird's-eye view with grid-shaped queries and realizes spatial-temporal feature interaction through a deformable transformer. By leveraging temporal information, query-based methods also benefit object tracking. To model cross-frame object association, MOTR [19] and TrackFormer [23] propose track queries built on single-frame object queries. MUTR [24] and PF-Track [25] use track queries and achieve promising tracking performance in multi-view tasks.
All existing query-based methods were developed for individual perception; this paper extends them to cooperative perception. Specifically, the QUEST framework realizes a query stream across agents, and a cross-agent query interaction module is designed for query fusion and complementation.
Analysis: Related work, part two: query-based perception methods (DETR and its successors), which QUEST extends from individual to cooperative perception.
4. Method
Original: What to share and how to cooperate are the two main concerns for practical cooperative perception, especially considering the limited bandwidth of the wireless communication. To design a better cooperation strategy, it is expected to be both interpretable and flexible, since interpretability leads to controllable cooperation and flexibility provides more operation space and possibilities. Considering that, we propose the query cooperation paradigm, which shares features across agents and performs cooperation via instance-level feature interaction. For clarity, we take vehicle-infrastructure cooperative perception as an example.
Query Generation. Both vehicle and infrastructure perform individual perception all the time, and each perception prediction P is corresponded to an object query Q, according to the theory of transformer-based perception.
Query Transmission. The query cooperation is triggered when the vehicle requests additional information from infrastructure side. Noting that the query request can be along with a specific instance-level requirement, like confidence threshold and region mask. Then, the queries met the requirement Qinf are posted to the vehicle side.
Query Interaction. Both the received queries Qinf and local queries Qveh are leveraged for further cooperative perception, and the query interaction strategy is to determine how to enhance and complement the Qveh with Qinf .
Query-based Prediction. Qcoop is further fed into query-based prediction heads for perception tasks, resulting in the final cooperative perception predictions Pcoop.
Translation: "What to share" and "how to cooperate" are the two main concerns for practical cooperative perception, especially considering the limited bandwidth of wireless communication. A better cooperation strategy is expected to be both interpretable and flexible, since interpretability leads to controllable cooperation while flexibility provides more room for operation and more possibilities. With this in mind, the query cooperation paradigm is proposed: it shares features across agents and performs cooperation via instance-level feature interaction. For ease of exposition, vehicle-infrastructure cooperative perception is taken as the example.
Query Generation. Both the vehicle and the infrastructure perform individual perception all the time, and according to the theory of transformer-based perception, each perception prediction P corresponds to an object query Q.
Query Transmission. Query cooperation is triggered when the vehicle requests additional information from the infrastructure side. Note that the query request can carry a specific instance-level requirement, such as a confidence threshold or a region mask. The queries that meet the requirement, Qinf, are then sent to the vehicle side.
Query Interaction. Both the received queries Qinf and the local queries Qveh are used for further cooperative perception; the query interaction strategy determines how to enhance and complement Qveh with Qinf.
Query-based Prediction. Qcoop is then fed into query-based prediction heads for the perception tasks, producing the final cooperative perception predictions Pcoop.
Analysis:

1) Design logic: why queries? (The "Why")

This part answers the two most fundamental questions in multi-agent system (MAS) design: what to share and how to cooperate.

- Pain points:
  - Practical constraint: wireless communication bandwidth is limited (raw images cannot be transmitted, and even feature maps are on the heavy side).
  - Existing dilemma:
    - Result cooperation (sharing only boxes): interpretable (you know it is a car) but inflexible (the features are discarded, so deep fusion is impossible).
    - Feature cooperation (sharing feature maps): flexible (rich information) but hard to interpret (a black box that is difficult to control).
- Core argument:
  - We need a middle-ground solution that is both understandable/controllable and information-rich/flexible.
  - Solution: query cooperation.
  - Rationale: a query is instance-level (it corresponds to a concrete object, hence interpretable) and at the same time a vector (it carries high-dimensional features, hence flexible).
2) Workflow breakdown (The "How")

This passage clearly describes the four stages of the data flow, which makes a good script for walking through the framework figure:

Step 1: Query Generation
- Action: the vehicle and the roadside each run their own transformer-based perception.
- Essence: exploiting the nature of transformer detectors, the perception results are encoded into a set of object queries.
- Key point: behind every prediction there is a corresponding feature vector, as the sketch below illustrates.
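To make "one query per prediction" concrete, here is a minimal, self-contained sketch (not the authors' code) of a DETR-style prediction head: each decoder query embedding is mapped to a box and a confidence score, and the embedding is kept next to its prediction so it can later be shared as an instance-level feature. The dimensions and the single-layer heads are illustrative assumptions.

```python
# Minimal sketch of "every prediction P_i keeps its query Q_i" (assumed shapes).
import numpy as np

rng = np.random.default_rng(0)
num_queries, embed_dim = 100, 256                        # typical DETR-style sizes (assumed)
queries = rng.normal(size=(num_queries, embed_dim))      # decoder output embeddings

# Toy linear heads; real models use small learned MLPs.
W_box = rng.normal(size=(embed_dim, 7)) * 0.01           # (x, y, z, w, l, h, yaw)
W_cls = rng.normal(size=(embed_dim, 1)) * 0.01

boxes = queries @ W_box                                  # one 3D box per query
scores = (1.0 / (1.0 + np.exp(-(queries @ W_cls)))).ravel()  # one confidence per query

# The point of this step: prediction and query stay paired, so sharing the query
# later still tells the receiver which physical instance it describes.
instances = [
    {"query": queries[i], "box": boxes[i], "score": float(scores[i])}
    for i in range(num_queries)
]
print(len(instances), instances[0]["box"].shape, round(instances[0]["score"], 3))
```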
Step 2: Query Transmission (the high-scoring point for an MAS course)
- Action:
  - On-demand triggering: queries are not broadcast all the time; the vehicle issues a request only when it decides it needs help.
  - Conditional filtering: the vehicle can attach requirements (for example, "only queries with confidence > 0.6" or "only queries in my front-left region").
  - Sending: the roadside transmits only the queries Qinf that satisfy the condition.
- Analysis: this reflects very high communication efficiency and intelligent decision-making; it avoids blind broadcasting and greatly saves bandwidth, as the sketch below shows.
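The sketch below illustrates the instance-level request described above, under assumed shapes and field names: the roadside keeps only the queries whose confidence and BEV position satisfy the vehicle's request before packing them for transmission, and a rough byte count is compared against shipping a dense BEV feature map. The 0.6 threshold and the region box are the illustrative values from this analysis, not values prescribed by the paper.

```python
# Minimal sketch of request-driven, instance-level query transmission (assumed fields).
import numpy as np

rng = np.random.default_rng(1)
num_queries, embed_dim = 100, 256
inf_queries = rng.normal(size=(num_queries, embed_dim)).astype(np.float32)    # roadside queries
inf_centers = rng.uniform(-50, 50, size=(num_queries, 2)).astype(np.float32)  # BEV (x, y) in metres
inf_scores = rng.uniform(size=num_queries).astype(np.float32)                 # confidences

# Instance-level request from the vehicle: confidence threshold + region mask.
request = {"min_score": 0.6, "x_range": (0.0, 50.0), "y_range": (-10.0, 10.0)}

keep = (
    (inf_scores >= request["min_score"])
    & (inf_centers[:, 0] >= request["x_range"][0]) & (inf_centers[:, 0] <= request["x_range"][1])
    & (inf_centers[:, 1] >= request["y_range"][0]) & (inf_centers[:, 1] <= request["y_range"][1])
)
payload = {"queries": inf_queries[keep], "centers": inf_centers[keep], "scores": inf_scores[keep]}

# Rough bandwidth comparison against a dense scene-level BEV feature map.
query_bytes = sum(v.nbytes for v in payload.values())
bev_bytes = 200 * 200 * embed_dim * 4        # assumed 200x200 float32 BEV grid
print(f"kept {int(keep.sum())}/{num_queries} queries: "
      f"{query_bytes / 1024:.1f} KiB vs {bev_bytes / 1024 / 1024:.1f} MiB")
```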
Step 3: Query Interaction (the core of occlusion handling)
- Action: the vehicle takes the received queries Qinf and processes them together with its own queries Qveh.
- Strategy:
  - Enhance (fusion): for objects both sides can see (co-visible), the features are fused so that detection becomes more accurate.
  - Complement: for objects the vehicle misses but the roadside sees (occlusion / blind zones), the roadside query directly fills in what the vehicle lacks.
- Analysis: this is the physical layer where the occlusion problem is actually solved; a conceptual sketch follows below.
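How QUEST actually associates co-visible instances is defined by its cross-agent query interaction module in the paper; as a stand-in, the sketch below assumes a simple nearest-center matching in a shared coordinate frame and fuses matched queries by averaging. It only illustrates the enhance-plus-complement idea, not the real module.

```python
# Conceptual sketch of query fusion (co-visible objects) and complementation
# (roadside-only objects). The matching rule and fusion operator are assumptions.
import numpy as np

rng = np.random.default_rng(2)
embed_dim = 256
veh_q = rng.normal(size=(5, embed_dim)); veh_c = rng.uniform(-20, 20, size=(5, 2))
inf_q = rng.normal(size=(8, embed_dim)); inf_c = rng.uniform(-20, 20, size=(8, 2))
inf_c[0], inf_c[1] = veh_c[0] + 0.3, veh_c[1] - 0.2   # pretend two objects are co-visible

MATCH_RADIUS = 1.0          # metres; assumed association threshold
fused = veh_q.copy()
used = np.zeros(len(inf_q), dtype=bool)

for i, center in enumerate(veh_c):
    dists = np.linalg.norm(inf_c - center, axis=1)
    j = int(dists.argmin())
    if dists[j] < MATCH_RADIUS and not used[j]:
        fused[i] = 0.5 * (veh_q[i] + inf_q[j])        # fusion for co-aware instances
        used[j] = True

# Complementation: unmatched roadside queries cover the vehicle's blind zones.
coop_q = np.concatenate([fused, inf_q[~used]], axis=0)
coop_c = np.concatenate([veh_c, inf_c[~used]], axis=0)
print(f"{len(veh_q)} local + {int((~used).sum())} complemented "
      f"-> {coop_q.shape[0]} cooperative queries, centers {coop_c.shape}")
```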
Step 4: Query-based Prediction
- Action: the fused set of queries is fed to the decoder (prediction head).
- Result: the final boxes and classes are output.
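Closing the loop, here is a minimal sketch of the last step: the cooperative query set (local plus complemented queries) goes through the same kind of query-based heads as in step 1, and predictions above a score threshold become the final detections. The head weights and the 0.3 output threshold are illustrative assumptions.

```python
# Minimal sketch of query-based prediction on the cooperative query set (assumed shapes).
import numpy as np

rng = np.random.default_rng(3)
num_coop, embed_dim = 108, 256
coop_queries = rng.normal(size=(num_coop, embed_dim))    # Q_coop after interaction

W_box = rng.normal(size=(embed_dim, 7)) * 0.01           # (x, y, z, w, l, h, yaw)
W_cls = rng.normal(size=(embed_dim, 1)) * 0.01

boxes = coop_queries @ W_box
scores = (1.0 / (1.0 + np.exp(-(coop_queries @ W_cls)))).ravel()

keep = scores > 0.3                                      # assumed output threshold
final = [{"box": boxes[i], "score": float(scores[i])} for i in np.flatnonzero(keep)]
print(f"{len(final)} cooperative detections out of {num_coop} queries")
```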