YOLOv5: A Detailed Guide

weixin_45690427 · Published 2025-11-14 16:33:54

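1. Introduction to YOLOv5

1.1 What Is YOLO?

YOLO (You Only Look Once) is a real-time object detection algorithm. Unlike traditional two-stage detectors such as the R-CNN family, YOLO treats detection as a regression problem and produces the locations and classes of all objects in a single forward pass.

1.2 Key Features of YOLOv5

Fast: a single-stage detector with low inference latency
Accurate: keeps detection accuracy high while running in real time
Easy to use: clear code structure that is straightforward to train and deploy
Multiple sizes: five model variants, n, s, m, l, and x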

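1.3 YOLOv5 Model Comparison

Model	Parameters	mAP@0.5:0.95 (COCO val, 640 px)	Inference speed	Typical use
YOLOv5n	1.9M	28.0	Fastest	Edge devices
YOLOv5s	7.2M	37.4	Fast	Mobile devices
YOLOv5m	21.2M	45.4	Medium	General purpose
YOLOv5l	46.5M	49.0	Slower	High-accuracy needs
YOLOv5x	86.7M	50.7	Slow	Highest accuracy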

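2. Object Detection Basics

2.1 Bounding Box

A bounding box marks the location of an object in an image and is usually described by four values:

- (x, y): coordinates of the box center
- w: box width
- h: box height

Two common representations:

Center format (x_center, y_center, width, height): used by YOLO
Corner format (x_min, y_min, x_max, y_max): also called the xyxy format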
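
Converting between the two representations is simple arithmetic. The following is a minimal sketch in plain Python; the function names `xywh_to_xyxy` and `xyxy_to_xywh` are chosen here for illustration and operate on single boxes rather than tensors.

```python
def xywh_to_xyxy(box):
    """Convert (x_center, y_center, width, height) to (x_min, y_min, x_max, y_max)."""
    xc, yc, w, h = box
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def xyxy_to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (x_center, y_center, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corner = xywh_to_xyxy((100.0, 80.0, 40.0, 20.0))
print(corner)                # (80.0, 70.0, 120.0, 90.0)
print(xyxy_to_xywh(corner))  # (100.0, 80.0, 40.0, 20.0)
```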

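2.2 Anchor Boxes

Anchor boxes are a predefined set of boxes with different sizes and aspect ratios that help the network learn objects of different shapes.

Why are anchor boxes needed?

Objects come in different shapes and sizes
Anchors give the detector a starting point
The network only has to predict offsets relative to an anchor rather than absolute coordinates

Default YOLOv5 anchors:

# three detection layers, three anchors each (width, height pairs in pixels)
anchors = [
    [10,13, 16,30, 33,23],   # P3/8  - small-object layer
    [30,61, 62,45, 59,119],  # P4/16 - medium-object layer
    [116,90, 156,198, 373,326]  # P5/32 - large-object layer
]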

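2.3 IoU (Intersection over Union)

IoU measures how much two bounding boxes overlap:

IoU = (Area of Intersection) / (Area of Union)

It is used for:

Matching predicted boxes to ground-truth boxes
Non-maximum suppression (NMS)
Evaluating detection accuracy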
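
As a concrete illustration of the formula, here is a small plain-Python sketch for two boxes in xyxy format. The helper name `box_iou` is an assumption made for this example; it is not the repository's tensor implementation.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 4 + 4 - 1 = 7.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # about 0.1429 (1/7)
print(box_iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0 (no overlap)
```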

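2.4 Non-Maximum Suppression (NMS)

NMS removes duplicate detections of the same object and keeps only the highest-scoring box.

Procedure:

1. Sort all predicted boxes by confidence
2. Pick the box A with the highest confidence
3. Compute the IoU between A and every remaining box
4. Remove boxes whose IoU with A exceeds the threshold (they are treated as duplicates)
5. Repeat steps 2-4 until every box has been processed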
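
The five steps above can be sketched in plain Python as follows. This is a minimal, unoptimized illustration: the names `iou` and `nms` are chosen for this example, and real pipelines would use a batched routine instead.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS; returns the indices of the boxes that are kept."""
    # Step 1: sort indices by descending confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # Step 2: highest-scoring remaining box
        keep.append(best)
        # Steps 3-4: drop boxes that overlap the chosen box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep              # Step 5 is the while loop itself


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU of about 0.68
```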

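2.5 mAP (mean Average Precision)

mAP is the most widely used evaluation metric in object detection:

Precision: the fraction of predicted positives that are true positives
Recall: the fraction of all ground-truth positives that are correctly detected
AP: the area under the precision-recall curve for one class, that is, precision averaged over recall levels
mAP: the mean of the AP values over all classes

Common notation:

mAP@0.5: mAP at an IoU threshold of 0.5
mAP@0.5:0.95: mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05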
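
To make the AP and mAP definitions concrete, the sketch below integrates a precision-recall curve using all-point interpolation. This is one common variant and is an assumption of this example; evaluation toolkits differ in interpolation details (for instance COCO-style sampling at fixed recall points), so absolute numbers may not match a given tool. The function name `average_precision` is hypothetical.

```python
def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation).

    recall must be sorted in ascending order; precision holds the matching values.
    """
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [1.0] + list(precision) + [0.0]
    # Build the precision envelope: each point takes the best precision to its right.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


ap_per_class = {
    "cat": average_precision([0.5, 1.0], [1.0, 0.5]),  # 0.5 * 1.0 + 0.5 * 0.5
    "dog": average_precision([1.0], [1.0]),            # a perfect detector
}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)
print(ap_per_class, mean_ap)
```

Repeating this at IoU thresholds 0.5, 0.55, ..., 0.95 and averaging the results gives mAP@0.5:0.95.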

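3. YOLOv5 Network Architecture

The YOLOv5 network can be divided into four parts (input, backbone, neck, and detection head), which together produce the output:

Input → Backbone → Neck → Head → Output

3.1 Overall Architecture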

Input (640x640x3)
    ↓
Backbone (CSPDarknet53)
    ├─ Focus/Conv
    ├─ CSP1_1 → P1
    ├─ CSP1_3 → P2
    ├─ CSP2_3 → P3 (80x80) ────┐
    ├─ CSP2_3 → P4 (40x40) ────┼─→ Neck (PANet)
    └─ CSP2_1+SPPF → P5 (20x20)─┘
                ↓
    ┌───────────────────────┐
    │   Neck (PANet/FPN)    │
    │  ┌─────────────────┐  │
    │  │  P5 → Up → P4   │  │
    │  │  P4 → Up → P3   │  │
    │  │  P3 → Down → P4 │  │
    │  │  P4 → Down → P5 │  │
    │  └─────────────────┘  │
    └───────────────────────┘
                ↓
         Detection Head
    ┌────────┬────────┬────────┐
    │  P3    │  P4    │  P5    │
    │ 80x80  │ 40x40  │ 20x20  │
    │ small  │ medium │ large  │
    └────────┴────────┴────────┘
                ↓
    [x, y, w, h, conf, class_probs]

3.2 Layer-by-Layer Shapes

Using YOLOv5s as an example (640x640 input image). Every stride-2 Conv halves the spatial size:

Layer	Type	Output size	Parameters
0	Focus	320×320×32	-
1	Conv	160×160×64	k=3, s=2
2	C3	160×160×64	n=1
3	Conv	80×80×128	k=3, s=2
4	C3	80×80×128	n=2 → P3
5	Conv	40×40×256	k=3, s=2
6	C3	40×40×256	n=3 → P4
7	Conv	20×20×512	k=3, s=2
8	C3	20×20×512	n=1
9	SPPF	20×20×512	k=5 → P5
10	Conv	20×20×256	k=1, s=1
11	Upsample	40×40×256	scale=2
12	Concat	40×40×512	[layer 6 (P4), layer 11]
13	C3	40×40×256	n=1
14	Conv	40×40×128	k=1, s=1
15	Upsample	80×80×128	scale=2
…	…	…	…

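4. Key Modules in Detail

4.1 Focus Module

Purpose: halve the spatial resolution without discarding any pixels, which reduces the cost of the layers that follow

Principle: move spatial information into the channel dimension

Split the H×W×C image into 4 parts
Each part samples every other pixel; the parts are then concatenated along the channel dimension
The output is H/2×W/2×4C

Implementation:

import torch
import torch.nn as nn

class Focus(nn.Module):
    """
    Focus spatial information into channel space.
    Input:  (b, c, h, w)
    Output: (b, 4c, h/2, w/2) before the convolution
    Conv is the standard convolution block defined in section 4.2.
    """
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super().__init__()
        # four slices are concatenated, so the conv sees 4x the input channels
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act=act)

    def forward(self, x):
        # strided sampling; the last two dims are (rows, cols),
        # so [::2, ::2] takes even rows and even columns
        return self.conv(
            torch.cat([
                x[..., ::2, ::2],    # even rows, even cols (top-left of each 2x2 patch)
                x[..., 1::2, ::2],   # odd rows, even cols  (bottom-left)
                x[..., ::2, 1::2],   # even rows, odd cols  (top-right)
                x[..., 1::2, 1::2]   # odd rows, odd cols   (bottom-right)
            ], 1)
        )

Example:

Input: 640×640×3
↓
Strided sampling yields four 320×320×3 feature maps
↓
Concatenation: 320×320×12
↓
Convolution: 320×320×32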
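
The slicing pattern is easier to see on a tiny grid. The sketch below reproduces the four strided slices on a 4×4 nested list in plain Python (no PyTorch); `focus_slices` is a name invented for this illustration. It shows that every pixel ends up in exactly one slice, so no information is lost.

```python
def focus_slices(img):
    """img: 2-D list (H x W, both even). Returns four (H/2 x W/2) sub-grids,
    in the same order as the torch.cat call in Focus.forward."""
    even_rows, odd_rows = img[0::2], img[1::2]
    return [
        [row[0::2] for row in even_rows],  # even rows, even cols
        [row[0::2] for row in odd_rows],   # odd rows, even cols
        [row[1::2] for row in even_rows],  # even rows, odd cols
        [row[1::2] for row in odd_rows],   # odd rows, odd cols
    ]


img = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
for part in focus_slices(img):
    print(part)
# [[0, 2], [8, 10]]
# [[4, 6], [12, 14]]
# [[1, 3], [9, 11]]
# [[5, 7], [13, 15]]
```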

4.2 Conv Module (Standard Convolution Block)

Composition: convolution + batch normalization + activation

Implementation:

class Conv(nn.Module):
    """Standard convolution block: Conv2d + BatchNorm + Activation"""
    default_act = nn.SiLU()  # default activation: SiLU (Swish)
    
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """
        Arguments:
        c1: input channels
        c2: output channels
        k: kernel size
        s: stride
        p: padding (computed automatically when None)
        g: groups
        d: dilation
        act: activation (True selects the default SiLU)
        """
        super().__init__()
        # convolution (no bias, because BatchNorm follows)
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), 
                             groups=g, dilation=d, bias=False)
        # batch normalization
        self.bn = nn.BatchNorm2d(c2)
        # activation
        self.act = self.default_act if act is True else \
                   act if isinstance(act, nn.Module) else nn.Identity()
    
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

Automatic padding helper:

def autopad(k, p=None, d=1):
    """
    Compute "same" padding so that the output size is unchanged when stride=1.
    """
    if d > 1:
        # effective kernel size of a dilated convolution
        k = d * (k - 1) + 1 if isinstance(k, int) else \
            [d * (x - 1) + 1 for x in k]
    if p is None:
        # same padding
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p
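
To check what `autopad` implies for feature-map sizes, the sketch below applies the standard convolution output-size formula, out = floor((n + 2p - d(k - 1) - 1) / s) + 1, in plain Python. The integer-only `autopad` copy and the helper name `conv_out_size` are simplifications made for this example.

```python
def autopad(k, p=None, d=1):
    """Integer-only copy of the padding helper above."""
    if d > 1:
        k = d * (k - 1) + 1  # effective kernel size under dilation
    return k // 2 if p is None else p


def conv_out_size(n, k, s=1, p=None, d=1):
    """Spatial output size of a convolution applied to an input of size n."""
    pad = autopad(k, p, d)
    return (n + 2 * pad - d * (k - 1) - 1) // s + 1


print(conv_out_size(640, k=3, s=1))       # 640: stride 1 keeps the size
print(conv_out_size(640, k=3, s=2))       # 320: stride 2 halves it
print(conv_out_size(320, k=3, s=2))       # 160
print(conv_out_size(640, k=3, s=1, d=2))  # 640: dilation is compensated by the padding
```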

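4.3 Bottleneck Module

Purpose: a ResNet-style bottleneck that reduces the parameter count

Characteristics:

1×1 convolution to reduce channels → 3×3 convolution to restore them (two convolutions in total)
Optional residual connection (shortcut)

Implementation:

class Bottleneck(nn.Module):
    """
    Standard bottleneck with an optional residual connection
    """
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):
        """
        c1: input channels
        c2: output channels
        shortcut: whether to use the residual connection
        g: groups for the 3×3 convolution
        e: channel expansion ratio (hidden channels = c2 * e)
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)      # 1×1 channel reduction
        self.cv2 = Conv(c_, c2, 3, 1, g=g) # 3×3 convolution
        # the residual is used only when shortcut=True and input/output channels match
        self.add = shortcut and c1 == c2
    
    def forward(self, x):
        # with residual:    out = x + conv(x)
        # without residual: out = conv(x)
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))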

4.4 C3 Module (CSP Bottleneck)

Purpose: the core building block of YOLOv5, based on the CSPNet idea

Advantages of CSP (Cross Stage Partial):

Less computation
Better gradient flow
Faster inference

Structure:

Input x
  ├─→ cv1 → Bottleneck sequence → cv3(concat) →┐
  │                                             ├→ output
  └─→ cv2 ──────────────────────────────────→┘

Implementation:

class C3(nn.Module):
    """
    CSP Bottleneck with 3 convolutions
    """
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        """
        c1: input channels
        c2: output channels
        n: number of repeated bottlenecks
        shortcut: whether the bottlenecks use residual connections
        g: groups
        e: channel expansion ratio
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)  # first branch
        self.cv2 = Conv(c1, c_, 1, 1)  # second branch (bypass)
        self.cv3 = Conv(2 * c_, c2, 1) # fusion layer
        # n bottlenecks in series
        self.m = nn.Sequential(*(
            Bottleneck(c_, c_, shortcut, g, e=1.0) 
            for _ in range(n)
        ))
    
    def forward(self, x):
        # concatenate the two branches, then fuse
        return self.cv3(torch.cat((
            self.m(self.cv1(x)),  # through the bottleneck sequence
            self.cv2(x)           # bypass branch
        ), 1))

Why use C3?

The split design reduces duplicated gradient information
It makes learning more efficient
It lowers the computational cost while preserving accuracy

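4.5 SPPF Module (Spatial Pyramid Pooling - Fast)

Purpose: fuse multi-scale features and enlarge the receptive field

SPP vs SPPF:

SPP: several pooling kernels in parallel (5×5, 9×9, 13×13)
SPPF: the same 5×5 pooling kernel applied in series, which is faster

Structure comparison:

SPP:
  input → conv → ┬─ maxpool(5) ─┐
                 ├─ maxpool(9) ─┤→ concat → conv → output
                 └─ maxpool(13)─┘

SPPF:
  input → conv → maxpool(5) → maxpool(5) → maxpool(5) → concat → conv → output
                     ↓            ↓            ↓
                    keep         keep         keep

Implementation:

class SPPF(nn.Module):
    """
    Fast spatial pyramid pooling.
    Equivalent to SPP(k=(5, 9, 13)) but faster.
    """
    def __init__(self, c1, c2, k=5):
        """
        c1: input channels
        c2: output channels
        k: pooling kernel size (default 5)
        """
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)  # channel reduction
        self.cv2 = Conv(c_ * 4, c2, 1, 1)  # channel expansion (4x because four features are concatenated)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
    
    def forward(self, x):
        x = self.cv1(x)
        # pooling in series
        y1 = self.m(x)
        y2 = self.m(y1)
        y3 = self.m(y2)
        # concatenate the original x with the three pooled results
        return self.cv2(torch.cat((x, y1, y2, y3), 1))

Receptive field:

Single 5×5 MaxPool:   receptive field = 5
Two in series:        receptive field = 5 + 4 = 9
Three in series:      receptive field = 5 + 4 + 4 = 13

This matches SPP with k=(5, 9, 13) while reusing the intermediate results, so it is cheaper to compute.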
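
The equivalence between stacked 5-wide pools and single 9-wide and 13-wide pools can be checked numerically. The sketch below does this in one dimension with plain Python; `maxpool1d` is a helper written for this example, and clipping the window at the borders plays the role of the padding in `nn.MaxPool2d`, which never contributes to a maximum.

```python
def maxpool1d(x, k):
    """Stride-1 max pooling with 'same' output length; windows are clipped at the borders."""
    r = k // 2
    return [max(x[max(0, i - r): i + r + 1]) for i in range(len(x))]


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

y1 = maxpool1d(x, 5)   # one 5-wide pool
y2 = maxpool1d(y1, 5)  # two in series
y3 = maxpool1d(y2, 5)  # three in series

print(y2 == maxpool1d(x, 9))   # True: two 5-pools act like one 9-pool
print(y3 == maxpool1d(x, 13))  # True: three 5-pools act like one 13-pool
```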

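4.6 PANet (Path Aggregation Network)

Purpose: the neck of YOLOv5, used for multi-scale feature fusion

Design:

Top-down path (FPN): high-level features → low-level features
Bottom-up path (the extra path added by PAN): low-level features → high-level features

Flow (channel counts shown for YOLOv5s):

Backbone outputs:
  P3 (80×80×128)   P4 (40×40×256)   P5 (20×20×512)
      ↓                 ↓                 ↓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      │  FPN (top-down)  │
      │   P5 → Up → P4   │
      │   P4 → Up → P3   │
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      │  PAN (bottom-up) │
      │   P3 → Down → P4 │
      │   P4 → Down → P5 │
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      ↓                 ↓                 ↓
  Head P3           Head P4           Head P5
  (small)           (medium)          (large)

Why is PANet needed?

Objects at different scales: small, medium, and large objects need features from different levels
Feature enrichment: low-level features (detail) are combined with high-level features (semantics)
Localization accuracy: low-level features help locate objects precisely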

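4.7 Detect Module (Detection Head)

Purpose: turn the feature maps into detection results

Input: three feature maps at different scales

P3: 80×80 - detects small objects
P4: 40×40 - detects medium objects
P5: 20×20 - detects large objects

Output: every grid cell predicts 3 anchor boxes, and every anchor box predicts:

(x, y): offset of the box center relative to the grid cell
(w, h): box width and height
confidence: objectness score
class_probs: per-class probabilities (80 classes assumed here)

Output shape: (bs, na, ny, nx, no)

bs: batch size
na: anchors per grid cell = 3
ny, nx: feature-map height and width
no: outputs per anchor = 5 + nc (4 coordinates + 1 confidence + nc classes)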
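
A quick sanity check of these dimensions: the sketch below computes the grid sizes, the number of output channels per detection layer, and the total number of predicted boxes for a 640×640 input. The function name `detect_output_shape` is made up for this example; the strides 8, 16, and 32 correspond to P3, P4, and P5.

```python
def detect_output_shape(img_size=640, strides=(8, 16, 32), na=3, nc=80):
    """Return (grid sizes, conv output channels, total boxes, outputs per anchor)."""
    no = nc + 5                                # 4 box values + 1 objectness + nc classes
    grids = [img_size // s for s in strides]   # feature-map side length per layer
    total_boxes = sum(na * g * g for g in grids)
    return grids, na * no, total_boxes, no


grids, channels, total_boxes, no = detect_output_shape()
print(grids)        # [80, 40, 20]
print(channels)     # 255 = 3 anchors * 85 outputs
print(total_boxes)  # 25200 = 3 * (80*80 + 40*40 + 20*20)
print(no)           # 85
```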

Implementation:

class Detect(nn.Module):
    """YOLOv5 detection head"""
    stride = None  # downsampling factor of each layer relative to the input image
    
    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):
        """
        nc: number of classes
        anchors: anchor configuration [[10,13, 16,30, 33,23],
                                       [30,61, 62,45, 59,119],
                                       [116,90, 156,198, 373,326]]
        ch: list of input channels, e.g. [256, 512, 1024]
            (YOLOv5l widths; YOLOv5s uses [128, 256, 512])
        """
        super().__init__()
        self.nc = nc                          # number of classes
        self.no = nc + 5                      # outputs per anchor
        self.nl = len(anchors)                # number of detection layers = 3
        self.na = len(anchors[0]) // 2        # anchors per layer = 3
        self.grid = [torch.empty(0) for _ in range(self.nl)]
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)]
        
        # registered as a buffer so it is saved with the model
        self.register_buffer('anchors', 
            torch.tensor(anchors).float().view(self.nl, -1, 2))
        
        # output convolutions: map each feature map to predictions
        # input ch[i], output na * no
        self.m = nn.ModuleList(
            nn.Conv2d(x, self.no * self.na, 1) for x in ch
        )
        self.inplace = inplace
    
    def forward(self, x):
        """
        x: list of 3 feature maps (shapes shown for ch=[256, 512, 1024])
           x[0]: (bs, 256, 80, 80) - P3
           x[1]: (bs, 512, 40, 40) - P4
           x[2]: (bs, 1024, 20, 20) - P5
        
        Returns:
           training:  x - raw predictions
           inference: (inference_output, x) - decoded predictions + raw predictions
        """
        z = []  # inference output
        
        for i in range(self.nl):  # loop over the 3 detection layers
            x[i] = self.m[i](x[i])  # prediction convolution
            bs, _, ny, nx = x[i].shape  
            # e.g. x[i]: (bs, 255, 80, 80) -> (bs, 3, 85, 80, 80)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx)\
                       .permute(0, 1, 3, 4, 2).contiguous()
            # now x[i]: (bs, 3, 80, 80, 85)
            
            if not self.training:  # inference mode
                # build the grid if the feature-map size changed
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = \
                        self._make_grid(nx, ny, i)
                
                # decode the predictions
                xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                # xy: sigmoid output in (0, 1); the grid already contains the -0.5 offset
                xy = (xy * 2 + self.grid[i]) * self.stride[i]  # to input-image pixels
                # wh: width and height
                wh = (wh * 2) ** 2 * self.anchor_grid[i]  # scale relative to the anchor
                y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))
        
        return x if self.training else (torch.cat(z, 1), x)
    
    def _make_grid(self, nx=20, ny=20, i=0):
        """
        Build the grid coordinates and the anchor grid.
        nx, ny: feature-map width and height
        i: index of the detection layer
        
        Returns: (grid, anchor_grid)
        """
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2  # (1, 3, 20, 20, 2)
        
        # grid coordinates
        y, x = torch.arange(ny, device=d, dtype=t), \
               torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing='ij')
        # grid: cell indices with the -0.5 decode offset folded in
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5
        
        # anchor_grid: anchor size × stride
        anchor_grid = (self.anchors[i] * self.stride[i])\
                      .view((1, self.na, 1, 1, 2)).expand(shape)
        
        return grid, anchor_grid

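Decoding the predictions:

Raw predictions (tx, ty, tw, th, conf, cls)

Center decoding:

# sigmoid limits the value to 0~1
# *2 - 0.5 widens the range to -0.5~1.5
# +grid_x, +grid_y adds the grid-cell index
# *stride converts to input-image pixels
cx = (sigmoid(tx) * 2 - 0.5 + grid_x) * stride
cy = (sigmoid(ty) * 2 - 0.5 + grid_y) * stride

Width and height decoding:

# sigmoid limits the value to 0~1
# *2 widens the range to 0~2
# **2 squares it, giving a range of 0~4
# *anchor_w/h scales relative to the anchor
w = (sigmoid(tw) * 2) ** 2 * anchor_w
h = (sigmoid(th) * 2) ** 2 * anchor_h

Confidence and classes:

conf = sigmoid(conf)      # objectness score
cls = sigmoid(cls_logits) # per-class probabilities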
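
The decoding formulas can be tried on a single prediction with plain Python. In this sketch the function name `decode_box` and the chosen grid cell, stride, and anchor are illustrative assumptions; the anchor is given in pixels, matching `anchor_grid` above.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Decode one raw prediction into (cx, cy, w, h) in input-image pixels."""
    cx = (sigmoid(tx) * 2 - 0.5 + grid_x) * stride
    cy = (sigmoid(ty) * 2 - 0.5 + grid_y) * stride
    w = (sigmoid(tw) * 2) ** 2 * anchor_w
    h = (sigmoid(th) * 2) ** 2 * anchor_h
    return cx, cy, w, h


# All-zero logits: sigmoid = 0.5, so the box sits at the cell center
# and has exactly the anchor's size.
print(decode_box(0, 0, 0, 0, grid_x=10, grid_y=7,
                 anchor_w=16, anchor_h=30, stride=8))
# (84.0, 60.0, 16.0, 30.0)
```

Because the width and height factor is bounded by 4, a prediction can never grow beyond four times its anchor, which is consistent with the anchor matching threshold of 4.0 used during training.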

5. Loss Function

The YOLOv5 loss has three components:

Total Loss = λ₁ × Box Loss + λ₂ × Object Loss + λ₃ × Class Loss

5.1 Box Loss

YOLOv5 uses the CIoU loss.

CIoU takes into account:

Overlap area
Distance between centers
Aspect ratio

import math
import torch

def bbox_iou(box1, box2, CIoU=True):
    """
    Compute the IoU or CIoU of bounding boxes.
    box1, box2: (x, y, w, h) format
    """
    # convert to (x1, y1, x2, y2)
    b1_x1, b1_y1 = box1[..., 0] - box1[..., 2] / 2, box1[..., 1] - box1[..., 3] / 2
    b1_x2, b1_y2 = box1[..., 0] + box1[..., 2] / 2, box1[..., 1] + box1[..., 3] / 2
    b2_x1, b2_y1 = box2[..., 0] - box2[..., 2] / 2, box2[..., 1] - box2[..., 3] / 2
    b2_x2, b2_y2 = box2[..., 0] + box2[..., 2] / 2, box2[..., 1] + box2[..., 3] / 2
    
    # intersection area
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)
    
    # union area
    w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1
    w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1
    union = w1 * h1 + w2 * h2 - inter + 1e-16
    
    iou = inter / union
    
    if CIoU:
        # smallest enclosing box
        cw = torch.max(b1_x2, b2_x2) - torch.min(b1_x1, b2_x1)
        ch = torch.max(b1_y2, b2_y2) - torch.min(b1_y1, b2_y1)
        # squared diagonal of the enclosing box
        c2 = cw ** 2 + ch ** 2 + 1e-16
        # squared distance between the two centers
        rho2 = ((b2_x1 + b2_x2 - b1_x1 - b1_x2) ** 2 + 
                (b2_y1 + b2_y2 - b1_y1 - b1_y2) ** 2) / 4
        # aspect-ratio consistency
        v = (4 / math.pi ** 2) * torch.pow(
            torch.atan(w2 / (h2 + 1e-16)) - torch.atan(w1 / (h1 + 1e-16)), 2)
        # alpha is a weighting factor and is treated as a constant during backprop
        with torch.no_grad():
            alpha = v / (v - iou + 1 + 1e-16)
        # CIoU
        return iou - (rho2 / c2 + v * alpha)
    
    return iou
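
For reference, the quantities computed in `bbox_iou` correspond to the following formulas, written with the same names as the code (`rho2` is ρ², `c2` is c², and boxes 1 and 2 have sizes (w₁, h₁) and (w₂, h₂)). This is a restatement of the code above, not an additional derivation.

```latex
\begin{aligned}
\mathrm{IoU} &= \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|} \\
v &= \frac{4}{\pi^2}\left(\arctan\frac{w_2}{h_2} - \arctan\frac{w_1}{h_1}\right)^2 \\
\alpha &= \frac{v}{(1 - \mathrm{IoU}) + v} \\
\mathrm{CIoU} &= \mathrm{IoU} - \left(\frac{\rho^2}{c^2} + \alpha v\right) \\
\mathcal{L}_{\text{box}} &= 1 - \mathrm{CIoU}
\end{aligned}
```

Here ρ² is the squared distance between the two box centers and c² is the squared diagonal of the smallest box enclosing both.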

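Computing the box loss:

# predicted box vs ground-truth box
pbox = torch.cat((pxy, pwh), 1)  # predicted (x, y, w, h)
iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()
lbox += (1.0 - iou).mean()  # CIoU loss

5.2 Objectness Loss

Objectness uses BCE loss. PyTorch's nn.BCEWithLogitsLoss applies the sigmoid internally, so it receives raw logits:

# objectness loss; h is the hyperparameter dict
BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']]))

# target assignment: the detached IoU becomes the objectness target
tobj[b, a, gj, gi] = iou.detach().clamp(0).type(tobj.dtype)

# per-layer loss, weighted by the layer balance factor
lobj += self.BCEobj(pi[..., 4], tobj) * self.balance[i]

Why use IoU as the target?

High IoU → the predicted box overlaps the ground truth well → confidence should be high
Low IoU → the predicted box overlaps the ground truth poorly → confidence should be low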

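5.3 Classification Loss

Classification also uses BCE loss (multi-label classification):

# class targets (label smoothing is disabled when eps=0.0)
cp, cn = smooth_BCE(eps=0.0)  # positive, negative targets
t = torch.full_like(pcls, cn)  # initialize everything to the negative target
t[range(n), tcls[i]] = cp       # set the true class to the positive target

# classification loss
lcls += self.BCEcls(pcls, t)

Label smoothing:

def smooth_BCE(eps=0.1):
    """
    Label smoothing, which helps against overfitting. With eps=0.1:
    positive target: 1.0 → 1.0 - 0.5*eps = 0.95
    negative target: 0.0 → 0.5*eps = 0.05
    """
    return 1.0 - 0.5 * eps, 0.5 * eps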

5.4 Full Loss Computation

class ComputeLoss:
    """YOLOv5 loss computation"""
    
    def __init__(self, model, autobalance=False):
        device = next(model.parameters()).device
        h = model.hyp  # hyperparameters
        
        # loss criteria
        BCEcls = nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor([h['cls_pw']], device=device))
        BCEobj = nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor([h['obj_pw']], device=device))
        
        # label smoothing
        self.cp, self.cn = smooth_BCE(eps=h.get('label_smoothing', 0.0))
        
        # focal loss (optional)
        g = h['fl_gamma']
        if g > 0:
            BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)
        
        m = model.model[-1]  # Detect module
        self.balance = {3: [4.0, 1.0, 0.4]}.get(m.nl, [4.0, 1.0, 0.25, 0.06, 0.02])
        self.BCEcls, self.BCEobj = BCEcls, BCEobj
        self.hyp = h
        self.na = m.na  # number of anchors
        self.nc = m.nc  # number of classes
        self.nl = m.nl  # number of detection layers
        self.anchors = m.anchors
        self.device = device
    
    def __call__(self, p, targets):
        """
        p: predictions, a list with the outputs of the 3 detection layers
        targets: ground-truth labels (image_idx, class, x, y, w, h)
        
        Returns: (total_loss, loss_items)
        """
        lcls = torch.zeros(1, device=self.device)  # class loss
        lbox = torch.zeros(1, device=self.device)  # box loss
        lobj = torch.zeros(1, device=self.device)  # objectness loss
        
        # build the targets
        tcls, tbox, indices, anchors = self.build_targets(p, targets)
        
        # loop over the detection layers
        for i, pi in enumerate(p):
            b, a, gj, gi = indices[i]  # image, anchor, gridy, gridx
            tobj = torch.zeros(pi.shape[:4], dtype=pi.dtype, device=self.device)
            
            n = b.shape[0]  # number of targets
            if n:
                # pick the predictions at the assigned positions
                pxy, pwh, _, pcls = pi[b, a, gj, gi].split((2, 2, 1, self.nc), 1)
                
                # === box loss ===
                pxy = pxy.sigmoid() * 2 - 0.5
                pwh = (pwh.sigmoid() * 2) ** 2 * anchors[i]
                pbox = torch.cat((pxy, pwh), 1)
                iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()
                lbox += (1.0 - iou).mean()
                
                # === objectness target ===
                tobj[b, a, gj, gi] = iou.detach().clamp(0).type(tobj.dtype)
                
                # === class loss (only when there is more than one class) ===
                if self.nc > 1:
                    t = torch.full_like(pcls, self.cn, device=self.device)
                    t[range(n), tcls[i]] = self.cp
                    lcls += self.BCEcls(pcls, t)
            
            # objectness loss over all positions
            obji = self.BCEobj(pi[..., 4], tobj)
            lobj += obji * self.balance[i]
        
        # apply the loss gains
        lbox *= self.hyp['box']
        lobj *= self.hyp['obj']
        lcls *= self.hyp['cls']
        bs = tobj.shape[0]
        
        return (lbox + lobj + lcls) * bs, torch.cat((lbox, lobj, lcls)).detach()

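5.5 Target Assignment

How is it decided which anchor is responsible for which target?

def build_targets(self, p, targets):
    """
    Assign suitable anchors to every target.
    
    Strategy:
    1. Anchor matching: keep every anchor whose width and height are within
       a factor of anchor_t of the target's width and height
    2. Cross-grid matching: neighboring grid cells may also predict the target
    """
    na, nt = self.na, targets.shape[0]
    tcls, tbox, indices, anch = [], [], [], []
    
    gain = torch.ones(7, device=self.device)
    # replicate every target na times, once per anchor, and append the anchor index
    ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt)
    targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2)
    
    g = 0.5  # offset bias
    # offsets for 5 cells: center, left, top, right, bottom
    off = torch.tensor([
        [0, 0],
        [1, 0], [0, 1], [-1, 0], [0, -1],
    ], device=self.device).float() * g
    
    for i in range(self.nl):  # loop over the detection layers
        anchors = self.anchors[i]
        gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # grid size (nx, ny, nx, ny)
        
        # scale the normalized targets to the current feature map
        t = targets * gain
        
        if nt:
            # === anchor matching ===
            r = t[..., 4:6] / anchors[:, None]  # target/anchor width and height ratios
            j = torch.max(r, 1 / r).max(2)[0] < self.hyp['anchor_t']  # ratio threshold
            t = t[j]  # keep the matched target-anchor pairs
            
            # === cross-grid matching ===
            gxy = t[:, 2:4]  # center coordinates
            gxi = gain[[2, 3]] - gxy  # coordinates measured from the right/bottom edge
            j, k = ((gxy % 1 < g) & (gxy > 1)).T  # close to the left / top cell border
            l, m = ((gxi % 1 < g) & (gxi > 1)).T  # close to the right / bottom cell border
            j = torch.stack((torch.ones_like(j), j, k, l, m))
            t = t.repeat((5, 1, 1))[j]
            offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
        else:
            t = targets[0]
            offsets = 0
        
        # unpack the target fields
        bc, gxy, gwh, a = t.chunk(4, 1)
        a, (b, c) = a.long().view(-1), bc.long().T
        gij = (gxy - offsets).long()
        gi, gj = gij.T
        
        # store the results
        indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))
        tbox.append(torch.cat((gxy - gij, gwh), 1))
        anch.append(anchors[a])
        tcls.append(c)
    
    return tcls, tbox, indices, anch

Key points:

Anchor matching: an anchor matches only if the target-to-anchor ratio of both width and height stays within the threshold (4.0 by default)
Cross-grid prediction: anchors in neighboring cells also take part, which increases the number of positive samples
Multi-layer prediction: a target may be assigned to more than one detection layer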
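
The anchor-matching rule can be illustrated on a single target without tensors. The sketch below applies the same test as `torch.max(r, 1 / r).max(2)[0] < anchor_t` to one target and the three P3 anchors; `matching_anchors` is a hypothetical helper, and sizes are given in pixels (the ratio is the same in grid units as long as both sides use the same unit).

```python
def matching_anchors(target_wh, anchors, anchor_t=4.0):
    """Return the indices of anchors whose width and height are both within
    a factor of anchor_t of the target's width and height."""
    matched = []
    for idx, (aw, ah) in enumerate(anchors):
        rw, rh = target_wh[0] / aw, target_wh[1] / ah
        worst = max(rw, 1 / rw, rh, 1 / rh)  # largest mismatch in either direction
        if worst < anchor_t:
            matched.append(idx)
    return matched


p3_anchors = [(10, 13), (16, 30), (33, 23)]

# A 40x50 target: anchor 0 is exactly 4x too narrow (4.0 is not < 4.0), so it is rejected.
print(matching_anchors((40, 50), p3_anchors))    # [1, 2]
# A 200x200 target is too large for every P3 anchor and is left to deeper layers.
print(matching_anchors((200, 200), p3_anchors))  # []
```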

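6. Training

6.1 Data Preparation

Data format (YOLO format):

# image file: images/train/img1.jpg
# label file: labels/train/img1.txt

# label format (one object per line):
class_id x_center y_center width height

Example:

0 0.5 0.5 0.3 0.4  # class 0, center (0.5, 0.5), width 0.3, height 0.4 (normalized coordinates)
2 0.2 0.3 0.1 0.15 # class 2, center (0.2, 0.3), width 0.1, height 0.15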
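
Since the label coordinates are normalized to the image size, converting a label line back to pixel coordinates only needs the image width and height. The sketch below does this in plain Python; the function name `yolo_label_to_pixels` and the 640×480 image size are assumptions made for the example.

```python
def yolo_label_to_pixels(line, img_w, img_h):
    """Parse one 'class xc yc w h' label line (normalized) into (class, pixel xyxy box)."""
    fields = line.split()
    cls = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:5])  # anything after field 5 is ignored
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return cls, (x1, y1, x2, y2)


cls, box = yolo_label_to_pixels("0 0.5 0.5 0.3 0.4", img_w=640, img_h=480)
print(cls, [round(v, 1) for v in box])  # 0 [224.0, 144.0, 416.0, 336.0]
```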

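6.2 Data Augmentation

YOLOv5 uses several data augmentation techniques:

1. Mosaic augmentation:

# stitch 4 images into one
# ┌─────┬─────┐
# │ img1│ img2│
# ├─────┼─────┤
# │ img3│ img4│
# └─────┴─────┘

Advantages:

More small objects per training sample
More varied backgrounds
Less dependence on large batches and many GPUs, because each sample already combines four images

2. Other augmentations:

Random Flip
Random Scale
Random Crop
Random HSV (color jitter)
MixUp
CutOut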

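6.3 Training Configuration

Example hyperparameter file (hyp.yaml):

# optimizer
lr0: 0.01          # initial learning rate
lrf: 0.1           # final learning rate factor (final lr = lr0 * lrf)
momentum: 0.937    # SGD momentum
weight_decay: 0.0005  # weight decay

# loss gains
box: 0.05          # box loss gain
cls: 0.5           # class loss gain
obj: 1.0           # objectness loss gain

# anchors
anchor_t: 4.0      # anchor matching threshold

# augmentation
hsv_h: 0.015       # HSV hue augmentation
hsv_s: 0.7         # HSV saturation augmentation
hsv_v: 0.4         # HSV value augmentation
degrees: 0.0       # rotation (degrees)
translate: 0.1     # translation
scale: 0.5         # scaling
shear: 0.0         # shear
perspective: 0.0   # perspective transform
flipud: 0.0        # probability of a vertical flip
fliplr: 0.5        # probability of a horizontal flip
mosaic: 1.0        # probability of mosaic augmentation
mixup: 0.0         # probability of mixup augmentation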

6.4 Training Loop

Skeleton of the full training code:

def train(hyp, opt):
    """
    Main YOLOv5 training function (skeleton).
    
    hyp: hyperparameter dict
    opt: training options
    train_path, val_path, imgsz, batch_size, stride, and validate() are
    assumed to be provided elsewhere (options, data YAML, validation module).
    """
    # ==================== 1. Initialization ====================
    # random seed
    torch.manual_seed(0)
    
    # device
    device = select_device(opt.device)
    epochs = opt.epochs
    best_fitness = 0.0
    
    # model
    model = Model(opt.cfg, ch=3, nc=opt.nc).to(device)
    model.hyp = hyp  # ComputeLoss reads the hyperparameters from model.hyp
    
    # freeze layers (optional)
    freeze = [f'model.{x}.' for x in range(opt.freeze)]
    for k, v in model.named_parameters():
        v.requires_grad = True
        if any(x in k for x in freeze):
            v.requires_grad = False
    
    # ==================== 2. Optimizer ====================
    # parameter groups
    g0, g1, g2 = [], [], []
    for v in model.modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):
            g2.append(v.bias)  # biases
        if isinstance(v, nn.BatchNorm2d):
            g0.append(v.weight)  # BN weights (no weight decay)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):
            g1.append(v.weight)  # conv weights (with weight decay)
    
    # build the optimizer
    optimizer = optim.SGD(g0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)
    optimizer.add_param_group({'params': g1, 'weight_decay': hyp['weight_decay']})
    optimizer.add_param_group({'params': g2})  # biases
    
    # ==================== 3. LR scheduler ====================
    # linear decay from 1.0 to lrf
    lf = lambda x: (1 - x / epochs) * (1.0 - hyp['lrf']) + hyp['lrf']
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
    
    # ==================== 4. Data loaders ====================
    # create_dataloader returns a (dataloader, dataset) pair
    train_loader, dataset = create_dataloader(
        train_path, imgsz, batch_size, stride,
        hyp=hyp, augment=True, cache=opt.cache
    )
    
    val_loader = create_dataloader(
        val_path, imgsz, batch_size, stride,
        hyp=hyp, augment=False, cache=opt.cache
    )[0]
    
    # ==================== 5. Loss ====================
    compute_loss = ComputeLoss(model)
    
    # ==================== 6. Training loop ====================
    for epoch in range(epochs):
        model.train()
        
        pbar = tqdm(enumerate(train_loader), total=len(train_loader))
        for i, (imgs, targets, paths, _) in pbar:
            # move the data to the device
            imgs = imgs.to(device).float() / 255.0  # normalize to 0-1
            targets = targets.to(device)
            
            # === forward ===
            pred = model(imgs)  # predictions
            loss, loss_items = compute_loss(pred, targets)  # loss
            
            # === backward ===
            optimizer.zero_grad()  # clear gradients
            loss.backward()         # backpropagate
            optimizer.step()        # update parameters
            
            # === logging ===
            pbar.set_description(
                f'Epoch {epoch}/{epochs} '
                f'loss: {loss.item():.4f} '
                f'box: {loss_items[0]:.4f} '
                f'obj: {loss_items[1]:.4f} '
                f'cls: {loss_items[2]:.4f}'
            )
        
        # ==================== 7. Validation ====================
        if epoch % opt.eval_interval == 0:
            # maps is treated here as a scalar fitness score
            results, maps = validate(
                model, val_loader, device, compute_loss
            )
            
            # save the best model
            if maps > best_fitness:
                best_fitness = maps
                torch.save({
                    'epoch': epoch,
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                }, 'best.pt')
        
        # ==================== 8. LR update ====================
        scheduler.step()
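
The `lf` lambda in the skeleton above defines a linear decay of the learning-rate multiplier from 1.0 down to `lrf`. The sketch below evaluates it in plain Python for a few epochs; the values `epochs=100`, `lr0=0.01`, and `lrf=0.1` are example settings, and `linear_lr_factor` is a name chosen for this illustration.

```python
def linear_lr_factor(epoch, epochs, lrf):
    """Multiplier applied to the initial learning rate at a given epoch."""
    return (1 - epoch / epochs) * (1.0 - lrf) + lrf


lr0, lrf, epochs = 0.01, 0.1, 100
for epoch in (0, 50, 100):
    lr = lr0 * linear_lr_factor(epoch, epochs, lrf)
    print(epoch, round(lr, 6))
# 0 0.01
# 50 0.0055
# 100 0.001
```

Because `scheduler.step()` runs once per epoch inside `range(epochs)`, the last factor actually used for training is the one for epoch `epochs - 1`, slightly above `lrf`.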

6.5 Key Training Techniques

1. Warmup:

# use a smaller learning rate during the first few epochs
if epoch < warmup_epochs:
    xi = [0, warmup_epochs]
    for j, x in enumerate(optimizer.param_groups):
        # the bias group (j == 2) starts at warmup_bias_lr, the other groups at 0.0
        x['lr'] = np.interp(epoch, xi, [warmup_bias_lr if j == 2 else 0.0, x['initial_lr'] * lf(epoch)])

2. EMA (exponential moving average):

# keep a moving average of the model parameters for more stable results
ema = ModelEMA(model)
for epoch in range(epochs):
    # train...
    ema.update(model)  # update the EMA model

3. Auto-anchor:

# check the anchors against the dataset and recompute them if they fit poorly
from utils.autoanchor import check_anchors
check_anchors(dataset, model, thr=4.0, imgsz=640)

4. Mixed-precision training:

# use FP16 to speed up training
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    pred = model(imgs)
    loss, loss_items = compute_loss(pred, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
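
The idea behind the EMA update can be shown on a plain list of numbers standing in for model parameters. This is a simplified sketch with a constant decay; the repository's `ModelEMA` class may additionally adjust the decay during early updates, which is not reproduced here. The function name `ema_update` and the decay value 0.9 are chosen for the example.

```python
def ema_update(ema_params, model_params, decay=0.9999):
    """One EMA step: ema = decay * ema + (1 - decay) * current."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, model_params)]


ema = [0.0]
history = []
for _ in range(3):
    ema = ema_update(ema, [1.0], decay=0.9)  # the "model" parameter stays at 1.0
    history.append(round(ema[0], 3))

print(history)  # [0.1, 0.19, 0.271]: the average moves slowly toward 1.0
```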

7. Inference

7.1 Inference Pipeline

def detect(model, source, device, conf_thres=0.25, iou_thres=0.45):
    """
    YOLOv5 inference function.
    
    model: trained model
    source: path to an image
    conf_thres: confidence threshold
    iou_thres: IoU threshold for NMS
    names (class names) and colors (per-class colors) are assumed to be
    defined at module level.
    """
    # ==================== 1. Prepare the model ====================
    model.eval()
    model.to(device)
    
    # ==================== 2. Load the image ====================
    # read the image
    img0 = cv2.imread(source)  # BGR
    
    # preprocessing
    img = letterbox(img0, new_shape=640)[0]  # resize while keeping the aspect ratio
    img = img.transpose((2, 0, 1))[::-1]     # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)          # contiguous memory
    img = torch.from_numpy(img).to(device)
    img = img.float() / 255.0                # normalize
    if img.ndimension() == 3:
        img = img.unsqueeze(0)               # add the batch dimension
    
    # ==================== 3. Inference ====================
    with torch.no_grad():
        pred = model(img)[0]  # forward pass
        # pred shape for a 640×640 input: (1, 25200, 85)
        # 25200 = 80×80×3 + 40×40×3 + 20×20×3
        # 85 = 4 (box) + 1 (confidence) + 80 (classes)
    
    # ==================== 4. NMS ====================
    pred = non_max_suppression(
        pred, 
        conf_thres=conf_thres,  # confidence threshold
        iou_thres=iou_thres,    # IoU threshold
        max_det=300             # maximum number of detections
    )
    
    # ==================== 5. Post-processing ====================
    for i, det in enumerate(pred):  # loop over images
        if len(det):
            # map the boxes from the network input size back to the original image
            det[:, :4] = scale_boxes(img.shape[2:], det[:, :4], img0.shape).round()
            
            # draw the results
            for *xyxy, conf, cls in reversed(det):
                label = f'{names[int(cls)]} {conf:.2f}'
                plot_one_box(xyxy, img0, label=label, color=colors[int(cls)])
    
    # ==================== 6. Save the result ====================
    cv2.imwrite('result.jpg', img0)
    
    return img0
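
The geometry behind `letterbox` and `scale_boxes` is a uniform resize followed by padding, and the inverse of both for the detected boxes. The sketch below is a simplified plain-Python version that always pads to a full square; the repository functions have more options (for example padding only to a stride multiple), so the helper names `letterbox_params` and `to_original` and their behavior are assumptions of this example.

```python
def letterbox_params(h0, w0, new=640):
    """Scale ratio and per-side padding that fit an h0 x w0 image into new x new."""
    r = min(new / h0, new / w0)              # uniform scale keeps the aspect ratio
    new_w, new_h = round(w0 * r), round(h0 * r)
    pad_x, pad_y = (new - new_w) / 2, (new - new_h) / 2
    return r, pad_x, pad_y


def to_original(box, r, pad_x, pad_y):
    """Map an xyxy box from letterboxed coordinates back to the original image."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / r, (y1 - pad_y) / r, (x2 - pad_x) / r, (y2 - pad_y) / r)


r, pad_x, pad_y = letterbox_params(h0=720, w0=1280)
print(r, pad_x, pad_y)                                      # 0.5 0.0 140.0
print(to_original((100, 200, 300, 400), r, pad_x, pad_y))   # (200.0, 120.0, 600.0, 520.0)
```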

7.2 NMS Implementation in Detail

import torch
import torchvision

# xywh2xyxy is the box-format helper from utils.general in the YOLOv5 repository

def non_max_suppression(prediction, conf_thres=0.25, iou_thres=0.45, 
                       classes=None, max_det=300):
    """
    Non-maximum suppression.
    
    prediction: (batch_size, num_boxes, 85) model predictions
    conf_thres: confidence threshold
    iou_thres: IoU threshold for NMS
    classes: keep only these classes (None keeps all classes)
    max_det: maximum number of detections per image
    
    Returns: a list with one (n, 6) tensor per image: [x1, y1, x2, y2, conf, cls]
    """
    # ==================== 1. Pre-filter ====================
    # keep candidates whose objectness exceeds the threshold
    xc = prediction[..., 4] > conf_thres  # candidates
    
    # settings
    min_wh, max_wh = 2, 7680  # min/max box width and height in pixels (min_wh is unused here)
    max_nms = 30000           # maximum number of boxes passed to NMS
    time_limit = 10.0         # timeout in seconds (unused in this simplified version)
    
    output = [torch.zeros((0, 6), device=prediction.device)] * prediction.shape[0]
    
    # ==================== 2. Loop over images ====================
    for xi, x in enumerate(prediction):  # per image
        x = x[xc[xi]]  # apply the candidate mask
        
        if not x.shape[0]:
            continue
        
        # === final confidence ===
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf
        
        # === box format ===
        box = xywh2xyxy(x[:, :4])  # (center_x, center_y, w, h) → (x1, y1, x2, y2)
        
        # === best class only (single label per box) ===
        conf, j = x[:, 5:].max(1, keepdim=True)  # highest class confidence and its index
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]
        
        # === class filter ===
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]
        
        # === limit the number of boxes ===
        n = x.shape[0]
        if not n:
            continue
        elif n > max_nms:
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]
        
        # ==================== 3. NMS ====================
        c = x[:, 5:6] * max_wh  # per-class offset
        boxes, scores = x[:, :4] + c, x[:, 4]  # offset boxes so only same-class boxes suppress each other
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        if i.shape[0] > max_det:
            i = i[:max_det]
        
        output[xi] = x[i]
    
    return output

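How NMS works:

1. Sort by confidence: [0.9, 0.8, 0.7, 0.6, ...]
2. Pick the highest-scoring box A (0.9)
3. Compute the IoU between A and the other boxes
4. Remove boxes with IoU > threshold
5. Repeat steps 2-4

Illustration:

Start:   [A:0.9]  [B:0.8]  [C:0.7]  [D:0.6]
              ↓
Pick A:  [A:✓]   [B:?]    [C:?]    [D:?]
              ↓
IoU(A,B)=0.6 > 0.45 → remove B
IoU(A,C)=0.2 < 0.45 → keep C
IoU(A,D)=0.7 > 0.45 → remove D
              ↓
Result:  [A:✓]           [C:✓]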
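
The NMS implementation above makes suppression class-aware by adding `cls * max_wh` to every box coordinate, so boxes of different classes can no longer overlap. The plain-Python sketch below illustrates that trick; `offset_boxes` and `iou` are helpers written for this example.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def offset_boxes(dets, max_wh=7680):
    """dets: list of (x1, y1, x2, y2, conf, cls). Shift each box by cls * max_wh."""
    return [(x1 + c * max_wh, y1 + c * max_wh, x2 + c * max_wh, y2 + c * max_wh)
            for x1, y1, x2, y2, _, c in dets]


# Two heavily overlapping boxes that belong to different classes (0 and 1).
dets = [(10, 10, 50, 50, 0.9, 0), (12, 12, 52, 52, 0.8, 1)]

print(iou(dets[0][:4], dets[1][:4]) > 0.45)  # True: plain NMS would drop the second box
shifted = offset_boxes(dets)
print(iou(shifted[0], shifted[1]))           # 0.0: after the offset both boxes survive
```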

7.3 Visualizing the Results

def plot_one_box(xyxy, img, color=None, label=None, line_thickness=3):
    """
    Draw one bounding box on an image.
    
    xyxy: box coordinates (x1, y1, x2, y2)
    img: image (numpy array)
    color: box color (B, G, R)
    label: label text
    """
    tl = line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1
    color = color or [random.randint(0, 255) for _ in range(3)]
    c1, c2 = (int(xyxy[0]), int(xyxy[1])), (int(xyxy[2]), int(xyxy[3]))
    
    # draw the rectangle
    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
    
    # draw the label
    if label:
        tf = max(tl - 1, 1)  # font thickness
        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled background
        cv2.putText(img, label, (c1[0], c1[1] - 2), 0, tl / 3, 
                   [225, 255, 255], thickness=tf, lineType=cv2.LINE_AA)

8. Worked Code Examples

8.1 A Complete Training Example

"""
train.py - YOLOv5 training script
"""

import argparse
import torch
from pathlib import Path
from models.yolo import Model
from utils.loss import ComputeLoss
from utils.dataloaders import create_dataloader

def train(opt):
    # === configuration ===
    epochs = opt.epochs
    batch_size = opt.batch_size
    img_size = opt.img_size
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    # === model ===
    model = Model(opt.cfg, ch=3, nc=opt.nc).to(device)
    print(f'Model: {opt.cfg}')
    print(f'Classes: {opt.nc}')
    
    # === optimizer ===
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.937,
        weight_decay=0.0005
    )
    
    # === learning-rate schedule ===
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs
    )
    
    # === data loading ===
    # create_dataloader needs the model stride and returns (dataloader, dataset)
    stride = int(model.stride.max())
    train_loader, _ = create_dataloader(
        path=opt.data / 'train',
        imgsz=img_size,
        batch_size=batch_size,
        stride=stride,
        augment=True
    )
    
    val_loader, _ = create_dataloader(
        path=opt.data / 'val',
        imgsz=img_size,
        batch_size=batch_size,
        stride=stride,
        augment=False
    )
    
    # === loss ===
    # ComputeLoss reads model.hyp, so the hyperparameter dict from section 6.3
    # must be attached to the model (model.hyp = ...) before this line
    compute_loss = ComputeLoss(model)
    
    # === training loop ===
    best_fitness = 0.0
    for epoch in range(epochs):
        model.train()
        
        # train for one epoch
        for batch_i, (imgs, targets, paths, _) in enumerate(train_loader):
            imgs = imgs.to(device).float() / 255.0
            targets = targets.to(device)
            
            # forward
            pred = model(imgs)
            loss, loss_items = compute_loss(pred, targets)
            
            # backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # logging
            if batch_i % 10 == 0:
                print(f'Epoch {epoch}/{epochs} '
                      f'Batch {batch_i}/{len(train_loader)} '
                      f'Loss {loss.item():.4f}')
        
        # validation (validate() stands for an evaluation routine that returns a scalar fitness)
        if epoch % 5 == 0:
            fitness = validate(model, val_loader, device)
            if fitness > best_fitness:
                best_fitness = fitness
                torch.save(model.state_dict(), 'best.pt')
                print(f'Saved best model with fitness {fitness:.4f}')
        
        scheduler.step()
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='yolov5s.yaml')
    parser.add_argument('--data', type=Path, default='data/coco')
    parser.add_argument('--nc', type=int, default=80)
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--img-size', type=int, default=640)
    opt = parser.parse_args()
    
    train(opt)
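
The script above steps a `CosineAnnealingLR` scheduler once per epoch. To see what that does to the learning rate, the closed-form cosine annealing formula can be evaluated directly. This is a small sketch: the function name `cosine_annealing_lr` is an assumption, `base_lr=0.01` and `t_max=300` mirror the defaults used above, and `eta_min=0.0` is PyTorch's default lower bound.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.01, t_max=300, eta_min=0.0):
    """Learning rate at a given epoch: base_lr at epoch 0, eta_min at epoch t_max."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# Learning rate at the start, the quarter points and the end of training
checkpoints = (0, 75, 150, 225, 300)
schedule = [cosine_annealing_lr(e) for e in checkpoints]
for e, lr in zip(checkpoints, schedule):
    print(f'epoch {e:3d}: lr = {lr:.6f}')
```

The rate falls slowly at first, fastest around the midpoint (where it is exactly half of `base_lr`), and flattens out again as it approaches zero.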

8.2 A Complete Inference Example

"""
detect.py - simplified YOLOv5 inference script
"""

import argparse

import cv2
import numpy as np
import torch
from models.experimental import attempt_load
from utils.augmentations import letterbox
from utils.general import non_max_suppression, scale_boxes

# plot_one_box is the helper from section 7.3. Recent YOLOv5 releases no longer
# ship it in utils.plots, so this sketch assumes it was saved locally in a
# hypothetical module named plot_utils.py.
from plot_utils import plot_one_box


def detect(opt):
    # === Configuration ===
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # === Load the model ===
    model = attempt_load(opt.weights, device=device)
    model.eval()
    stride = int(model.stride.max())
    names = model.names

    # === Load the image ===
    img0 = cv2.imread(opt.source)  # BGR, HWC
    assert img0 is not None, f'Image Not Found {opt.source}'

    # Preprocess
    img = letterbox(img0, new_shape=opt.img_size, stride=stride)[0]
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)
    img = torch.from_numpy(img).to(device)
    img = img.float() / 255.0  # uint8 0-255 to float 0.0-1.0
    if img.ndimension() == 3:
        img = img.unsqueeze(0)  # add the batch dimension

    # === Inference ===
    with torch.no_grad():
        pred = model(img)[0]

    # === NMS ===
    pred = non_max_suppression(
        pred,
        conf_thres=opt.conf_thres,
        iou_thres=opt.iou_thres
    )

    # === Process detections ===
    for i, det in enumerate(pred):  # one entry per image in the batch
        if len(det):
            # Map box coordinates back to the original image
            det[:, :4] = scale_boxes(img.shape[2:], det[:, :4], img0.shape).round()

            # Print and draw each detection
            for *xyxy, conf, cls in reversed(det):
                label = f'{names[int(cls)]} {conf:.2f}'
                print(f'Detected: {label} at {[int(v) for v in xyxy]}')

                # Draw on the original image
                plot_one_box(xyxy, img0, label=label)

    # === Save the result ===
    cv2.imwrite(opt.output, img0)
    print(f'Results saved to {opt.output}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='yolov5s.pt')
    parser.add_argument('--source', type=str, default='data/images/bus.jpg')
    parser.add_argument('--output', type=str, default='result.jpg')
    parser.add_argument('--img-size', type=int, default=640)
    parser.add_argument('--conf-thres', type=float, default=0.25)
    parser.add_argument('--iou-thres', type=float, default=0.45)
    opt = parser.parse_args()

    detect(opt)
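
The step that maps coordinates back to the original image undoes the letterbox resize: subtract the padding, then divide by the scale factor. The sketch below shows that arithmetic for a single box in plain Python. It assumes the padding is split evenly on both sides, and the function name `scale_box_to_original` and the example sizes are illustrative, not part of YOLOv5.

```python
def scale_box_to_original(box, net_shape, orig_shape):
    """Map an (x1, y1, x2, y2) box from the letterboxed input to the original image.

    net_shape and orig_shape are (height, width) tuples.
    """
    # Scale factor that was applied to fit the original image into the network input
    gain = min(net_shape[0] / orig_shape[0], net_shape[1] / orig_shape[1])
    # Padding added on each side (assumed symmetric)
    pad_x = (net_shape[1] - orig_shape[1] * gain) / 2
    pad_y = (net_shape[0] - orig_shape[0] * gain) / 2

    def clip(value, upper):
        return min(max(value, 0.0), float(upper))

    x1, y1, x2, y2 = box
    height, width = orig_shape
    return (
        clip((x1 - pad_x) / gain, width),
        clip((y1 - pad_y) / gain, height),
        clip((x2 - pad_x) / gain, width),
        clip((y2 - pad_y) / gain, height),
    )


# A 1280x720 image letterboxed into 640x640: gain 0.5, 140 px of padding top and bottom
mapped = scale_box_to_original((160, 200, 480, 440), net_shape=(640, 640), orig_shape=(720, 1280))
print(mapped)  # (320.0, 120.0, 960.0, 600.0)
```

Clipping at the end keeps boxes that extend into the padded area inside the original image bounds.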

8.3 Training on a Custom Dataset

1. Prepare the dataset:

dataset/
├── images/
│   ├── train/
│   │   ├── img1.jpg
│   │   └── img2.jpg
│   └── val/
│       ├── img3.jpg
│       └── img4.jpg
└── labels/
    ├── train/
    │   ├── img1.txt
    │   └── img2.txt
    └── val/
        ├── img3.txt
        └── img4.txt
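
Each image has a label file with the same base name. A YOLO-format label file holds one line per object: `class x_center y_center width height`, with all four box values normalized to the range 0-1 by the image width and height. If your annotations are pixel corner coordinates (xyxy), a conversion along these lines produces such a line. This is a minimal sketch, and the function name `to_yolo_label` is an assumption.

```python
def to_yolo_label(class_id, box_xyxy, img_w, img_h):
    """Convert a pixel (x1, y1, x2, y2) box into one YOLO label line."""
    x1, y1, x2, y2 = box_xyxy
    x_center = (x1 + x2) / 2 / img_w   # normalized box center, x
    y_center = (y1 + y2) / 2 / img_h   # normalized box center, y
    width = (x2 - x1) / img_w          # normalized box width
    height = (y2 - y1) / img_h         # normalized box height
    return f'{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}'


# A class-1 object covering the central quarter of a 640x480 image
line = to_yolo_label(1, (160, 120, 480, 360), img_w=640, img_h=480)
print(line)  # 1 0.500000 0.500000 0.500000 0.500000
```

Writing one such line per object into, for example, labels/train/img1.txt pairs the annotations with images/train/img1.jpg.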

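2. Create the dataset config file (data.yaml):

# Dataset paths
path: ../dataset  # dataset root directory
train: images/train  # training images, relative to path
val: images/val      # validation images, relative to path

# Classes
nc: 3  # number of classes
names: ['cat', 'dog', 'bird']  # class names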
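
A common mistake in this file is an `nc` value that does not match the number of entries in `names`. The sketch below is a hypothetical pre-flight check, not part of YOLOv5: `check_data_config` takes the already-parsed configuration as a dictionary and returns a list of problems.

```python
def check_data_config(cfg):
    """Return a list of problems found in a parsed data.yaml dictionary."""
    problems = []
    # Keys that the config shown above relies on
    for key in ('train', 'val', 'names'):
        if key not in cfg:
            problems.append(f'missing key: {key}')
    # nc, when present, must agree with the number of class names
    if 'nc' in cfg and 'names' in cfg and cfg['nc'] != len(cfg['names']):
        problems.append(f"nc={cfg['nc']} but {len(cfg['names'])} names given")
    return problems


config = {
    'path': '../dataset',
    'train': 'images/train',
    'val': 'images/val',
    'nc': 3,
    'names': ['cat', 'dog', 'bird'],
}
problems = check_data_config(config)
print(problems or 'config looks consistent')
```

To use it with the real file, load data.yaml with a YAML parser first and pass the resulting dictionary in.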

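3. Create the model config file (custom.yaml):

# YOLOv5 custom model

nc: 3  # number of classes
depth_multiple: 0.33  # depth multiplier (scales module repeat counts)
width_multiple: 0.50  # width multiplier (scales channel counts)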

anchors:
  - [10,13, 16,30, 33,23]
  - [30,61, 62,45, 59,119]
  - [116,90, 156,198, 373,326]

backbone:
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],    # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],    # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],    # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],   # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],      # 9
  ]

head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],
   [-1, 3, C3, [512, False]],
   
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
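
The repeat counts and channel numbers in this file are nominal values. When the model is built they are scaled by `depth_multiple` and `width_multiple`. The sketch below illustrates that scaling under the assumption that repeats are rounded with a minimum of 1 and channels are rounded up to a multiple of 8. The function names are hypothetical.

```python
import math


def scale_depth(repeats, depth_multiple):
    """Scaled repeat count of a module; modules with a single repeat are left alone."""
    return max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats


def scale_width(channels, width_multiple, divisor=8):
    """Scaled channel count, rounded up to a multiple of divisor."""
    return math.ceil(channels * width_multiple / divisor) * divisor


# Backbone C3 repeats and channel widths from the config above, scaled by 0.33 / 0.50
c3_repeats = [scale_depth(n, 0.33) for n in (3, 6, 9, 3)]
channels = [scale_width(c, 0.50) for c in (64, 128, 256, 512, 1024)]

# Each detection layer predicts, per anchor: 4 box values + 1 objectness + nc class scores
num_anchors, nc = 3, 3
detect_channels = num_anchors * (nc + 5)

print(c3_repeats)       # [1, 2, 3, 1]
print(channels)         # [32, 64, 128, 256, 512]
print(detect_channels)  # 24
```

With these multipliers the nominal 1024-channel backbone shrinks to 512 channels, which is how the same file layout can describe models of different sizes.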

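4. Run training (this uses the train.py from the official YOLOv5 repository, not the simplified script in 8.1):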

python train.py --data data.yaml --cfg custom.yaml --weights yolov5s.pt --epochs 100

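Summary

This document walked through the main parts of YOLOv5:

Key Takeaways

Network architecture:
- Backbone: CSPDarknet53, extracts features
- Neck: PANet, fuses features across scales
- Head: Detect, predicts bounding boxes and classes
Key modules:
- Focus: reduces computation while preserving information (newer releases use a 6x6 Conv instead, as in the custom.yaml above)
- C3: CSP Bottleneck, improves efficiency
- SPPF: pools at multiple scales to enlarge the receptive field
- PANet: top-down and bottom-up feature fusion
Training strategy:
- Data augmentation (Mosaic, MixUp, etc.)
- Several loss terms (CIoU for boxes, BCE for objectness and classification)
- Learning-rate warmup and cosine annealing
- EMA and mixed-precision training
Inference:
- Non-maximum suppression (NMS)
- Multi-scale prediction
- Post-processing and visualization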

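Study Suggestions

Theory: first understand the basic detection concepts (IoU, NMS, mAP, etc.)
Code reading: go from simple to complex, understanding one module at a time
Hands-on training: start with a small dataset to understand the training process
Hyperparameter tuning: try different hyperparameters and observe their effect
Improvement: once you understand the baseline, try modifying the network structure or the training strategy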

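Further Directions

Lightweight models: MobileNet, ShuffleNet and similar backbones
Attention mechanisms: modules such as SE and CBAM
Transformer: introducing self-attention
Post-processing: Soft-NMS, Weighted-NMS, etc.
Specific scenarios: small-object detection, occlusion handling, etc.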
