How to Understand Denoising Training in Object Detection

Felaim · published 2025-12-13 12:37:13

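1. What Is Denoising?

Denoising Training is a training technique introduced in the DETR family of detectors to speed up the convergence of Transformer-based detectors.

1.1 Core Idea

The core idea is to feed the model "noisy GT (Ground Truth)" boxes and have it learn to recover the correct GT from them.

This is similar to:

Self-supervised learning: the model must recover the original information from a corrupted input
Data augmentation: adding noise increases the diversity of training samples
Regularization: it helps prevent overfitting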

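1.2 Why Is Denoising Needed?

Problem: DETR-style detectors converge slowly and need many training epochs to reach good accuracy.

Causes:

Unstable bipartite matching: early in training the predictions are poor, so the Hungarian matching result keeps changing
No explicit supervision signal: the model has to learn from randomly initialized queries, with no clear starting point
Imbalanced positives and negatives: most queries are not matched to any GT, so only a few are positive samples

Solution: Denoising

Noisy GT boxes are fed in as extra inputs, and the model is told which GT each noisy anchor should regress to
This gives the model an explicit supervision signal and speeds up convergence
It also trains the model on perturbed inputs, which improves robustness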

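2. What Is a Noisy GT?

2.1 Concept

A noisy GT (Noisy Ground Truth) is built as follows:

Start from the real GT boxes
Add random noise (offsets, scaling, and so on)
Use the result as noisy anchor boxes
Keep the original GT as the target of each noisy anchor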

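2.2 Concrete Steps

import torch

# Illustrative sizes and random stand-in data; real values come from the dataset
batch, num_gt = 2, 4
gt_boxes = torch.randn(batch, num_gt, 10)          # real GT boxes, [batch, num_gt, 10]
gt_labels = torch.randint(0, 10, (batch, num_gt))  # GT class labels, [batch, num_gt]

# 1. Per-dimension noise scale, e.g. [2.0, 2.0, 2.0, 0.5, 0.5, ...]
noise_scale = torch.tensor([2.0] * 3 + [0.5] * 7)

# 2. Add random noise
noise = torch.rand_like(gt_boxes) * 2 - 1  # uniform in [-1, 1)
noise = noise * noise_scale                # broadcast over the last dimension
noisy_anchors = gt_boxes + noise

# 3. The targets of the noisy anchors are still the original GT
target_boxes = gt_boxes    # targets unchanged
target_labels = gt_labels  # labels unchanged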

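2.3 Noise Scale

Different dimensions use different noise scales:

Position (x, y, z): larger noise (e.g. 2.0), because positions vary over a wide range
Size (w, l, h): smaller noise (e.g. 0.5), because sizes vary over a narrow range
Angle (yaw): described here as medium noise, although the Sparse4D configuration below gives it the same 0.5 as size and velocity
Velocity (vx, vy, vz): smaller noise

# Configuration in Sparse4D
dn_noise_scale = [2.0] * 3 + [0.5] * 7
#                position    size + angle + velocity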
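
The per-dimension scaling above can be illustrated without any tensor library. The following is a minimal sketch, not Sparse4D code: `add_box_noise` and the box values are made up for illustration, and the only thing taken from the text is the `[2.0] * 3 + [0.5] * 7` scale. It shows that dimension i of a noisy anchor never moves further than its own scale.

```python
import random

# Scale from the text: 3 position dims, then 7 remaining dims.
DN_NOISE_SCALE = [2.0] * 3 + [0.5] * 7


def add_box_noise(box, scale, rng):
    """Return a noisy copy of `box`; dimension i moves by at most scale[i]."""
    if len(box) != len(scale):
        raise ValueError("box and scale must have the same number of dimensions")
    return [v + rng.uniform(-1.0, 1.0) * s for v, s in zip(box, scale)]


rng = random.Random(0)
gt_box = [10.0, -4.0, 1.0, 1.5, 0.6, 0.4, 0.0, 1.0, 0.3, 0.0]  # made-up 10-dim box
noisy_box = add_box_noise(gt_box, DN_NOISE_SCALE, rng)

# Absolute offset per dimension; the regression target stays gt_box.
offsets = [abs(n - g) for n, g in zip(noisy_box, gt_box)]
```

With tensors the same logic is one broadcast multiply, as in section 2.2.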

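3. Denoising Workflow in Detail

3.1 Generating Noisy Anchors

# Simplified pseudocode. pad_to_length, random_sign, compute_box_cost,
# hungarian_matching and create_group_attention_mask are placeholder helpers;
# num_dn_groups, dn_noise_scale and add_neg_dn come from the sampler configuration.
def get_dn_anchors(cls_target, box_target, gt_instance_id=None):
    """
    Generate noisy anchors for denoising training.

    Inputs:
        cls_target: per-sample GT class labels, padded below to [batch, num_gt]
        box_target: per-sample GT boxes, padded below to [batch, num_gt, 10]
        gt_instance_id: [batch, num_gt] - GT instance IDs (used for temporal denoising)

    Outputs:
        dn_anchor: [batch, num_dn_groups * num_gt, 10] - noisy anchors
        dn_box_target: [batch, num_dn_groups * num_gt, 10] - matching GT boxes
        dn_cls_target: [batch, num_dn_groups * num_gt] - matching GT labels
        attn_mask: [num_dn, num_dn] - attention mask
        valid_mask: [batch, num_dn_groups * num_gt] - validity mask
        dn_id_target: [batch, num_dn_groups * num_gt] - instance IDs
    """

    # 1. Prepare GT (pad every sample to the same length)
    max_dn_gt = max([len(x) for x in cls_target])
    cls_target = pad_to_length(cls_target, max_dn_gt)
    box_target = pad_to_length(box_target, max_dn_gt)

    # 2. Repeat num_dn_groups times (each group gets different noise).
    # Note: tile(num_dn_groups, ...) repeats along dim 0; the groups still have
    # to be folded into dim 1 to reach the output shapes listed above.
    if num_dn_groups > 1:
        cls_target = cls_target.tile(num_dn_groups, 1)
        box_target = box_target.tile(num_dn_groups, 1, 1)

    # 3. Generate random noise
    noise = torch.rand_like(box_target) * 2 - 1  # uniform in [-1, 1)
    noise *= box_target.new_tensor(dn_noise_scale)  # apply the per-dimension scale
    dn_anchor = box_target + noise  # noisy anchors

    # 4. (Optional) add negative samples, offset by 1x to 2x the noise scale
    if add_neg_dn:
        noise_neg = torch.rand_like(box_target) + 1  # [1, 2)
        noise_neg *= random_sign() * box_target.new_tensor(dn_noise_scale)
        dn_anchor = torch.cat([dn_anchor, box_target + noise_neg], dim=1)

    # 5. Match noisy anchors to GT with Hungarian matching
    # (the correspondence is known in advance, but matching handles edge cases)
    cost = compute_box_cost(dn_anchor, box_target)
    anchor_idx, gt_idx = hungarian_matching(cost)

    # 6. Assign targets; unmatched anchors keep the defaults (zero box, label -1)
    dn_box_target = torch.zeros_like(dn_anchor)
    dn_cls_target = cls_target.new_full(dn_anchor.shape[:2], -1)
    dn_box_target[anchor_idx] = box_target[gt_idx]
    dn_cls_target[anchor_idx] = cls_target[gt_idx]
    valid_mask = dn_cls_target >= 0  # simplification: -1 marks padding or no match
    dn_id_target = None              # instance-ID handling omitted in this sketch

    # 7. Build the attention mask (anchors in the same group may attend to each other);
    # num_gt here is the number of anchors per group
    attn_mask = create_group_attention_mask(num_dn_groups, num_gt)

    return dn_anchor, dn_box_target, dn_cls_target, attn_mask, valid_mask, dn_id_target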
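
To make steps 3 to 5 concrete, here is a small dependency-free sketch. It is an illustration under stated assumptions, not the Sparse4D implementation: boxes are reduced to 3 dimensions, `SCALE`, `perturb`, `l1_cost` and `match` are hypothetical names, the cost is a plain L1 distance, and a brute-force search over assignments stands in for Hungarian matching (it finds the same minimum-cost assignment, but only scales to a handful of boxes).

```python
import itertools
import random

SCALE = [2.0, 2.0, 0.5]  # hypothetical 3-dim boxes keep the example small


def perturb(box, rng, negative=False):
    """Positive noise stays within SCALE; negative noise lies in [1, 2] x SCALE."""
    out = []
    for v, s in zip(box, SCALE):
        if negative:
            mag = rng.uniform(1.0, 2.0) * rng.choice([-1.0, 1.0])
        else:
            mag = rng.uniform(-1.0, 1.0)
        out.append(v + mag * s)
    return out


def l1_cost(a, b):
    """Sum of absolute differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(anchors, gts):
    """Brute-force minimum-cost one-to-one assignment (stand-in for Hungarian)."""
    best = min(
        itertools.permutations(range(len(anchors)), len(gts)),
        key=lambda perm: sum(l1_cost(anchors[a], gts[g]) for g, a in enumerate(perm)),
    )
    return list(best)  # best[g] is the anchor index assigned to GT g


rng = random.Random(0)
gts = [[0.0, 0.0, 1.0], [50.0, 50.0, 1.0]]  # two well-separated GT boxes
positives = [perturb(g, rng) for g in gts]
negatives = [perturb(g, rng, negative=True) for g in gts]
anchors = positives + negatives  # indices 0-1 are positive, 2-3 are negative
assignment = match(anchors, gts)
```

Because each negative anchor is pushed at least one full noise scale away in every dimension, the positive anchor of each GT wins the assignment here, and the negatives are left unmatched.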

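3.2 Using Them in the Model

# Inside forward (simplified pseudocode; "..." marks elided code)
def forward(self, feature_maps, metas):
    # 1. Fetch the regular learnable instances
    instance_feature, anchor = self.instance_bank.get(...)
    num_learnable = anchor.shape[1]

    # 2. Generate noisy anchors (training only)
    if self.training:
        dn_metas = self.sampler.get_dn_anchors(
            metas["gt_labels_3d"],
            metas["gt_bboxes_3d"],
            gt_instance_id  # GT instance IDs (lookup elided)
        )
        (dn_anchor, dn_box_target, dn_cls_target,
         attn_mask, valid_mask, dn_id_target) = dn_metas
        # Denoising instances start from all-zero features
        dn_feature = instance_feature.new_zeros(
            instance_feature.shape[0], dn_anchor.shape[1], instance_feature.shape[-1]
        )

        # 3. Concatenate learnable instances and noisy instances
        anchor = torch.cat([anchor, dn_anchor], dim=1)
        instance_feature = torch.cat([instance_feature, dn_feature], dim=1)

        # 4. Set up the attention mask
        # - learnable instances may attend to each other
        # - noisy instances may attend within their own group
        # - different groups may not attend to each other

    # 5. Run the decoder
    for layer in decoder_layers:
        instance_feature = layer(instance_feature, anchor, ...)
    model_output = ...  # predictions decoded from instance_feature (elided)

    # 6. Split the predictions
    prediction = model_output[:, :num_learnable]  # regular predictions
    dn_prediction = model_output[:, num_learnable:]  # denoising predictions

    return {
        "prediction": prediction,
        "dn_prediction": dn_prediction,
        "dn_reg_target": dn_box_target,
        "dn_cls_target": dn_cls_target,
        ...
    }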
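
The concatenate-then-split pattern in steps 3 and 6 is easy to get wrong by one index, so here is a tiny sketch of it using nested lists in place of tensors. The helper names `concat_instances` and `split_outputs` are hypothetical, and the string labels only stand in for per-instance features.

```python
def concat_instances(learnable, dn):
    """Append the denoising instances of each sample after its learnable ones."""
    return [a + b for a, b in zip(learnable, dn)]


def split_outputs(outputs, num_learnable):
    """Undo the concatenation along the instance dimension."""
    prediction = [row[:num_learnable] for row in outputs]
    dn_prediction = [row[num_learnable:] for row in outputs]
    return prediction, dn_prediction


# One sample with 3 learnable instances and 2 denoising instances.
learnable = [["q0", "q1", "q2"]]
dn = [["dn0", "dn1"]]

merged = concat_instances(learnable, dn)
prediction, dn_prediction = split_outputs(merged, num_learnable=3)
```

The key point is that `num_learnable` has to be recorded before the concatenation, since afterwards the instance dimension covers both kinds of instances.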

3.3 Computing the Loss

# Simplified pseudocode; compute_loss is a placeholder
def loss(self, model_outs, data):
    # 1. Loss on the regular predictions
    normal_loss = compute_loss(
        prediction,
        reg_target,
        cls_target
    )

    # 2. Denoising loss
    dn_loss = compute_loss(
        dn_prediction,
        dn_reg_target,  # original GT boxes
        dn_cls_target,  # original GT labels
        weight=dn_loss_weight  # usually a fairly large weight, e.g. 5.0
    )

    total_loss = normal_loss + dn_loss
    return total_loss
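
As a worked example of the weighting, the sketch below computes a denoising regression loss on toy numbers. It assumes a plain mean L1 error and a `valid_mask` that excludes padded or unmatched anchors; the function name `masked_l1_loss` and all values are made up, and the actual loss functions used by a given detector may differ.

```python
def masked_l1_loss(pred, target, valid, weight=1.0):
    """Weighted mean L1 error over the anchors flagged as valid."""
    total, count = 0.0, 0
    for p_box, t_box, ok in zip(pred, target, valid):
        if not ok:
            continue  # padded or unmatched anchor: contributes nothing
        total += sum(abs(p - t) for p, t in zip(p_box, t_box))
        count += len(p_box)
    return weight * total / count if count else 0.0


dn_prediction = [[1.0, 2.0], [0.0, 0.0], [3.0, 5.0]]
dn_reg_target = [[1.5, 2.0], [9.0, 9.0], [3.0, 4.0]]
valid_mask = [True, False, True]  # the middle anchor is padding

normal_loss = 0.8  # made-up value for the regular branch
dn_loss = masked_l1_loss(dn_prediction, dn_reg_target, valid_mask, weight=5.0)
total_loss = normal_loss + dn_loss
```

Here the valid errors sum to 1.5 over 4 values, so the weighted denoising loss is 5.0 * 1.5 / 4 = 1.875 and the total is 2.675.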

4. Denoising in Sparse4D

4.1 Configuration Parameters

In sparse4dv3_temporal_r50_1x8_bs6_256x704.py:

sampler=dict(
    type="SparseBox3DTarget",
    num_dn_groups=5,              # number of denoising groups
    num_temp_dn_groups=2,         # number of temporal denoising groups
    dn_noise_scale=[2.0] * 3 + [0.5] * 7,  # noise scale
    max_dn_gt=32,                 # use at most 32 GT boxes
    add_neg_dn=True,              # whether to add negative samples
)

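4.2 Key Parameters

Parameter	Description	Value in the config above
`num_dn_groups`	Number of denoising groups; each group gets different noise	5
`num_temp_dn_groups`	Number of temporal denoising groups (used for temporal fusion)	2
`dn_noise_scale`	Noise scale, set per dimension	`[2.0] * 3 + [0.5] * 7`
`max_dn_gt`	Maximum number of GT boxes used (caps the compute cost)	32
`add_neg_dn`	Whether to add negative samples (noisy anchors far from any GT)	True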

4.3 Workflow

1. Input GT
   └─> gt_boxes: [batch, num_gt, 10]
   └─> gt_labels: [batch, num_gt]

2. Generate noisy anchors
   └─> repeat num_dn_groups times
   └─> add different random noise to each group
   └─> dn_anchor: [batch, num_dn_groups * num_gt, 10]

3. Match and assign targets
   └─> run Hungarian matching
   └─> assign the matching GT boxes and labels

4. Concatenate with the model input
   └─> learnable instances: [batch, 900, ...]
   └─> noisy instances: [batch, num_dn_groups * num_gt, ...]
   └─> total input: [batch, 900 + num_dn_groups * num_gt, ...]

5. Run the decoder
   └─> all instances are processed together
   └─> the attention mask controls their interaction

6. Split the predictions
   └─> normal_prediction: [batch, 900, ...]
   └─> dn_prediction: [batch, num_dn_groups * num_gt, ...]

7. Compute the loss
   └─> normal_loss: loss on the regular predictions
   └─> dn_loss: denoising loss (weight 5.0)
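
The shapes in this workflow reduce to simple arithmetic, sketched below with the numbers quoted in this article (900 learnable instances, 5 groups, at most 32 GT boxes). The helper `dn_instance_counts` is hypothetical. The doubling for negatives is an assumption read off the pseudocode in section 3.1, where the negative anchors are concatenated along the instance dimension.

```python
def dn_instance_counts(num_learnable, num_gt, num_dn_groups, add_neg_dn=False):
    """Return (num_dn, total): denoising instances and all decoder instances."""
    per_group = num_gt * (2 if add_neg_dn else 1)  # negatives double each group
    num_dn = num_dn_groups * per_group
    return num_dn, num_learnable + num_dn


# Without negative samples: 5 groups x 32 GT boxes.
num_dn, total = dn_instance_counts(900, 32, 5)

# With negative samples: each group holds 32 positives and 32 negatives.
num_dn_neg, total_neg = dn_instance_counts(900, 32, 5, add_neg_dn=True)
```

This gives 160 denoising instances (1060 in total) without negatives and 320 (1220 in total) with them, which is why `max_dn_gt` is useful as a cap on compute cost.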

4.4 Temporal Denoising

Sparse4D also supports temporal denoising, which is used for temporal fusion:

# After the single-frame decoder
if len(prediction) == num_single_frame_decoder:
    # Cache the denoising instances of the current frame
    self.sampler.cache_dn(
        dn_instance_feature,
        dn_anchor,
        dn_cls_target,
        valid_mask,
        dn_id_target
    )

# In the temporal decoder:
# the cached temporal denoising instances are
# matched and fused with the denoising instances of the current frame

5. Attention Mask

Denoising uses a dedicated attention mask to control how different instances interact:

# Attention mask layout (0 = attention allowed, 1 = attention blocked)
attn_mask = [
    # Learnable instances (900 rows)
    [0, 0, ..., 1, 1, ...],  # may attend to each other
    [0, 0, ..., 1, 1, ...],
    ...
    # Noisy instances (num_dn_groups * num_gt rows)
    [1, 1, ..., 0, 0, ...],  # may attend within the same group
    [1, 1, ..., 0, 0, ...],
    ...
]

# Rules:
# - Between learnable instances: attention allowed (mask=0)
# - Noisy instances in the same group: attention allowed (mask=0)
# - Noisy instances in different groups: attention blocked (mask=1)
# - Between learnable and noisy instances: attention blocked (mask=1)
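
The four rules can be turned into a small mask builder. The sketch below uses nested lists and tiny sizes so the whole matrix can be printed; `build_attn_mask` is a hypothetical helper, and `group_size` means the number of noisy anchors per denoising group.

```python
def build_attn_mask(num_learnable, num_dn_groups, group_size):
    """Return an N x N matrix where 1 blocks attention and 0 allows it."""
    n = num_learnable + num_dn_groups * group_size
    mask = [[1] * n for _ in range(n)]  # start with everything blocked

    # Learnable instances attend to each other.
    for i in range(num_learnable):
        for j in range(num_learnable):
            mask[i][j] = 0

    # Each denoising group attends only within itself.
    for g in range(num_dn_groups):
        start = num_learnable + g * group_size
        for i in range(start, start + group_size):
            for j in range(start, start + group_size):
                mask[i][j] = 0
    return mask


mask = build_attn_mask(num_learnable=3, num_dn_groups=2, group_size=2)
for row in mask:
    print(row)
```

The result is block diagonal: one 3 x 3 block of zeros for the learnable instances, followed by one 2 x 2 block of zeros per denoising group, with ones everywhere else. Blocking attention between learnable and noisy instances keeps GT information carried by the noisy anchors from leaking into the regular predictions.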

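6. Why Does Denoising Work?

6.1 It Provides an Explicit Supervision Signal

Regular training: the model learns from randomly initialized queries, and matching is unstable early on
Denoising: the model is told which GT each noisy anchor should regress to, which gives explicit supervision

6.2 It Increases the Number of Positive Samples

Regular training: only matched queries are positive samples, and there are few of them
Denoising: every noisy anchor generated around a GT has a known target and counts as a positive sample (the optional negatives added by add_neg_dn are the exception)

6.3 It Improves Robustness

The model learns to recover the correct output from noisy inputs
This makes it more robust to imperfect inputs

6.4 It Speeds Up Convergence

Reported experiments indicate that denoising can cut the number of training epochs by 50% or more
Final accuracy improves as well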

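7. Code Examples

7.1 Generating Noisy Anchors

# In SparseBox3DTarget.get_dn_anchors

# 1. Prepare GT
box_target = encode_reg_target(gt_boxes)  # [batch, num_gt, 10]

# 2. Repeat num_dn_groups times
if num_dn_groups > 1:
    box_target = box_target.tile(num_dn_groups, 1, 1)
    # tile repeats along dim 0 here: [num_dn_groups * batch, num_gt, 10];
    # the groups still have to be folded into dim 1 to reach
    # [batch, num_dn_groups * num_gt, 10]

# 3. Generate noise
noise = torch.rand_like(box_target) * 2 - 1  # uniform in [-1, 1)
# new_tensor keeps the scale on the same device and dtype as box_target
noise *= box_target.new_tensor([2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
dn_anchor = box_target + noise

# 4. Match and assign
cost = compute_box_cost(dn_anchor, box_target)
anchor_idx, gt_idx = hungarian_matching(cost)
dn_box_target = torch.zeros_like(dn_anchor)  # unmatched anchors keep zeros
dn_box_target[anchor_idx] = box_target[gt_idx]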

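7.2 Using Them in Forward

# In Sparse4DHead.forward

# 1. Fetch the regular instances
instance_feature, anchor = self.instance_bank.get(...)

# 2. Generate denoising instances
if self.training:
    dn_metas = self.sampler.get_dn_anchors(
        metas["gt_labels_3d"],
        metas["gt_bboxes_3d"]
    )
    dn_anchor, dn_box_target, dn_cls_target, attn_mask, valid_mask, dn_id_target = dn_metas

    # 3. Concatenate; denoising instances start from all-zero features.
    # new_zeros is used because slicing instance_feature would return too few
    # rows whenever there are more denoising anchors than regular instances.
    anchor = torch.cat([anchor, dn_anchor], dim=1)
    instance_feature = torch.cat([
        instance_feature,
        instance_feature.new_zeros(
            instance_feature.shape[0], dn_anchor.shape[1], instance_feature.shape[-1]
        )
    ], dim=1)

    # 4. Attention mask
    # attn_mask from get_dn_anchors covers only the denoising block
    # ([num_dn, num_dn]); the learnable rows and columns from section 5
    # still have to be added around it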

7.3 Computing the Loss

# In Sparse4DHead.loss

# 1. Loss on the regular predictions
normal_loss = compute_loss(prediction, reg_target, cls_target)
total_loss = normal_loss

# 2. Denoising loss (only present during training with denoising enabled)
if "dn_prediction" in model_outs:
    dn_loss = compute_loss(
        model_outs["dn_prediction"],
        model_outs["dn_reg_target"],
        model_outs["dn_cls_target"],
        weight=self.dn_loss_weight  # 5.0
    )
    total_loss = total_loss + dn_loss

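8. Summary

The core of Denoising Training:

Generate noisy GT: add random noise to the real GT boxes
Explicit supervision: the target of each noisy anchor is the original GT
More positive samples: every noisy anchor generated around a GT is a positive sample
Faster convergence: the model receives a better training signal

Key points:

It is not denoising in the literal sense; the model learns to recover the GT from a perturbed version of it
The noise scale needs careful design (large for position, small for size)
An attention mask controls how different instances interact
It can noticeably speed up training and improve accuracy

An analogy:

It is like handing the model some "blurry answers" (noisy GT)
The model learns how to "correct" those blurry answers into the right ones
This way the model learns detection and also learns to cope with imperfect inputs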
