Recap of the previous post: in Part 3 we deployed the on-device LLM Llama 2 7B, focusing on GPTQ quantization and KV-cache optimization. This post goes back to fundamentals: the math behind quantization, QNN's four quantization schemes, and how to tune NPU performance systematically.

Preface

In earlier posts we mentioned "quantization" and "NPU acceleration" many times, but mostly in passing. This post digs into the details of quantization techniques and into understanding and optimizing model inference performance at the Hexagon NPU architecture level.

If you have run into any of these problems, this post has answers:

  • Accuracy collapses after quantization; now what?
  • How do you configure INT8/INT16 mixed precision?
  • Why is a model slower on the NPU than on the GPU?
  • How do you tell whether the inference bottleneck is compute or memory bandwidth?

1. Quantization Fundamentals: The Math from Floating Point to Fixed Point

1.1 The Linear Quantization Formula
Quantize (float → fixed point): q = clip(round(x/scale + zero_point), qmin, qmax)

Dequantize (fixed point → float): x_hat = (q - zero_point) * scale

where:
scale = (xmax - xmin) / (qmax - qmin)
zero_point = round(qmin - xmin/scale)

INT8: qmin=0, qmax=255 (asymmetric) or qmin=-128, qmax=127 (symmetric)
INT16: qmin=-32768, qmax=32767
INT4: qmin=-8, qmax=7
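As a quick sanity check, the formulas above can be applied by hand to a tiny tensor. The values below are illustrative; only numpy is assumed.

```python
import numpy as np

# Asymmetric INT8 quantization of a small tensor, following the
# formulas above step by step.
x = np.array([-0.5, 0.0, 0.75, 1.5], dtype=np.float32)

qmin, qmax = 0, 255                            # asymmetric INT8 range
scale = (x.max() - x.min()) / (qmax - qmin)    # (1.5 - (-0.5)) / 255
zero_point = np.round(qmin - x.min() / scale)  # 63.75 -> 64

q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
x_hat = (q - zero_point) * scale               # dequantize

print(scale)       # ~0.007843
print(zero_point)  # 64.0
print(q)           # quantized codes: 0, 64, 160, 255
print(np.abs(x - x_hat).max())  # reconstruction error, below scale/2
```

Note that without clipping, the worst-case reconstruction error of linear quantization is half a step, i.e. scale/2.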
1.2 Symmetric vs. Asymmetric Quantization
import numpy as np

def symmetric_quantize(x, num_bits=8):
    """Symmetric quantization: zero_point = 0, range symmetric about 0"""
    qmax = 2**(num_bits - 1) - 1  # 127
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    x_hat = q.astype(np.float32) * scale
    return q, x_hat, scale

def asymmetric_quantize(x, num_bits=8):
    """Asymmetric quantization: uses the quantization range more fully"""
    qmin, qmax = 0, 2**num_bits - 1  # 0-255
    scale = (np.max(x) - np.min(x)) / (qmax - qmin)
    zero_point = np.round(qmin - np.min(x) / scale)
    zero_point = np.clip(zero_point, qmin, qmax)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    x_hat = (q.astype(np.float32) - zero_point) * scale
    return q, x_hat, scale, zero_point

# Test: ReLU activations (non-negative distribution)
np.random.seed(42)
activation = np.abs(np.random.randn(1000)) * 2  # positive values only

_, recon_sym, _ = symmetric_quantize(activation)
mse_sym = np.mean((activation - recon_sym)**2)

_, recon_asym, _, _ = asymmetric_quantize(activation)
mse_asym = np.mean((activation - recon_asym)**2)

print(f"Symmetric quantization MSE: {mse_sym:.6f}")
print(f"Asymmetric quantization MSE: {mse_asym:.6f}")
print(f"Asymmetric error reduction: {(1 - mse_asym/mse_sym)*100:.1f}%")
# For non-negative ReLU activations, the asymmetric scheme roughly halves
# the step size (255 codes over the same span instead of 127), cutting
# the MSE by roughly 75%
1.3 Per-Tensor vs. Per-Channel vs. Per-Group
Accuracy comparison (same model, different quantization granularity):
Per-Tensor MSE: 0.0124     (coarsest granularity)
Per-Channel MSE: 0.0089    (recommended for convolutions)
Per-Group(128) MSE: 0.0051 (recommended for Transformers)
Per-Group(32) MSE: 0.0032  (finest granularity, slowest)

QNN HTP support:
- Per-Tensor: fully supported, fastest
- Per-Channel: fully supported, the default recommendation for Conv2D
- Per-Group: supported on V75 and later; Group=128 recommended for Linear/MatMul
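The MSE figures above come from one specific model, but the reason per-channel beats per-tensor is easy to reproduce: when channel ranges differ widely, a single shared scale spends nearly all of its resolution on the largest channel. A minimal sketch with synthetic weights, numpy only:

```python
import numpy as np

np.random.seed(0)
# Two output channels with very different ranges: channel 0 is about
# 100x larger than channel 1, so one shared scale fits channel 1 badly.
w = np.stack([np.random.randn(64) * 1.0,
              np.random.randn(64) * 0.01])

def quant_dequant(x, scale):
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

scale_tensor = np.abs(w).max() / 127.0                     # per-tensor
scale_chan = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-channel

err_tensor = (w - quant_dequant(w, scale_tensor)) ** 2
err_chan = (w - quant_dequant(w, scale_chan)) ** 2

print(f"overall MSE    per-tensor={err_tensor.mean():.2e}  "
      f"per-channel={err_chan.mean():.2e}")
print(f"channel-1 MSE  per-tensor={err_tensor[1].mean():.2e}  "
      f"per-channel={err_chan[1].mean():.2e}")
# Under the shared scale, the small channel is almost entirely rounded
# away; per-channel quantization preserves it.
```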

2. QNN Quantization in Practice: Four Main Approaches

2.1 Approach 1: Post-Training Quantization (PTQ)
# The most basic PTQ, using the QNN toolchain
qnn-onnx-converter \
    --input_network model.onnx \
    --output_path model_qnn.cpp \
    --input_dim input 1,3,224,224 \
    --input_list calibration/input_list.txt \
    --act_bw 8 \
    --weight_bw 8 \
    --bias_bw 32 \
    --algorithms cle \
    --use_per_channel_quantization
2.2 Approach 2: Enhanced PTQ
# QNN enhanced-quantization configuration (driven by quantization_overrides.json)
import json

overrides = {
    "activation_encodings": {
        # Use INT16 for accuracy-sensitive activations
        "/model/layer.0/attention/Softmax_output_0": {
            "dtype": "int16",
            "is_symmetric": True
        },
        "/model/layer.0/attention/MatMul_1_output_0": {
            "dtype": "int16",
            "is_symmetric": True
        }
    },
    "param_encodings": {
        # Keep first- and last-layer weights at INT8 (accuracy-sensitive)
        "model.embed_tokens.weight": {
            "dtype": "int8",
            "is_symmetric": True
        },
        "lm_head.weight": {
            "dtype": "int8",
            "is_symmetric": True
        },
        # Intermediate-layer weights at INT4
        "model.layers.*.self_attn.q_proj.weight": {
            "dtype": "int4",
            "is_symmetric": True
        }
    },
    "supergroups": [
        {"op_list": ["Conv", "Relu"], "fuse": True},
        {"op_list": ["Conv", "BatchNormalization"], "fuse": True}
    ]
}

with open("quantization_overrides.json", "w") as f:
    json.dump(overrides, f, indent=2)
# Run the converter with the enhanced quantization config
qnn-onnx-converter \
    --input_network model.onnx \
    --output_path model_enhanced.cpp \
    --input_dim input 1,3,224,224 \
    --input_list calibration/input_list.txt \
    --act_bw 8 --weight_bw 8 \
    --algorithms cle adaround \
    --quantization_overrides quantization_overrides.json \
    --use_per_channel_quantization \
    --param_quantizer enhanced \
    --act_quantizer enhanced
2.3 Approach 3: Mixed-Precision Quantization
# Automatic mixed-precision search tool
import torch
import numpy as np
from typing import Dict, List, Tuple

class MixedPrecisionSearcher:
    def __init__(self, model, calibration_loader, metric_fn):
        self.model = model
        self.cal_loader = calibration_loader
        self.metric_fn = metric_fn
    
    def compute_layer_sensitivity(self) -> Dict[str, float]:
        """Per-layer quantization sensitivity analysis"""
        baseline_metric = self.metric_fn(self.model, self.cal_loader)
        print(f"FP32 baseline metric: {baseline_metric:.4f}")
        
        sensitivities = {}
        for name, module in self.model.named_modules():
            if not isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
                continue
            
            original_weight = module.weight.data.clone()
            quantized_weight = self._simulate_int8(module.weight.data)
            module.weight.data = quantized_weight
            
            metric = self.metric_fn(self.model, self.cal_loader)
            sensitivity = baseline_metric - metric
            sensitivities[name] = sensitivity
            
            module.weight.data = original_weight
            print(f"  {name}: sensitivity = {sensitivity:.6f}")
        
        return sensitivities
    
    def generate_mixed_precision_config(
        self,
        sensitivities: Dict[str, float],
        int16_ratio: float = 0.2
    ) -> Dict[str, str]:
        """Generate a mixed-precision config from layer sensitivities"""
        sorted_layers = sorted(
            sensitivities.items(), key=lambda x: x[1], reverse=True
        )
        
        num_int16 = int(len(sorted_layers) * int16_ratio)
        config = {}
        for i, (name, sens) in enumerate(sorted_layers):
            if i < num_int16:
                config[name] = "int16"
                print(f"  INT16: {name} (sensitivity: {sens:.6f})")
            else:
                config[name] = "int8"
        return config
    
    @staticmethod
    def _simulate_int8(tensor):
        """Simulate symmetric INT8 quantization"""
        scale = tensor.abs().max() / 127.0
        quantized = torch.clamp(torch.round(tensor / scale), -128, 127)
        return quantized * scale
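The searcher above needs a real model, dataloader, and metric function. The INT16/INT8 selection step in generate_mixed_precision_config can be exercised on its own, though; the layer names and sensitivity values below are made up purely for illustration:

```python
# Standalone version of the INT16/INT8 split: sensitivity here means
# "accuracy drop when this layer alone is quantized" (hypothetical values).
sensitivities = {
    "layer1.conv1": 0.0002,
    "layer2.conv1": 0.0150,   # most sensitive -> gets INT16
    "layer3.conv1": 0.0008,
    "layer4.conv1": 0.0090,
    "fc":           0.0001,
}

int16_ratio = 0.4  # protect the top 40% most sensitive layers
ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
num_int16 = int(len(ranked) * int16_ratio)

config = {name: ("int16" if i < num_int16 else "int8")
          for i, name in enumerate(ranked)}
print(config)
# layer2.conv1 and layer4.conv1 end up at int16, everything else at int8
```

The int16_ratio knob trades accuracy against latency: each layer promoted to INT16 roughly halves that layer's HTP throughput (see section 3.1).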
2.4 Approach 4: Quantization-Aware Training (QAT)
# QAT (Quantization-Aware Training) in practice
import torch
import torch.nn as nn
from aimet_torch.quantsim import QuantizationSimModel
from aimet_common.defs import QuantScheme

def run_qat(model, train_loader, val_loader, num_epochs=10):
    """Run QAT with Qualcomm AIMET"""
    
    # Step 1: Create the quantization simulation model
    dummy_input = torch.randn(1, 3, 224, 224).cuda()
    
    sim = QuantizationSimModel(
        model=model.cuda(),
        dummy_input=dummy_input,
        quant_scheme=QuantScheme.training_range_learning_with_tf_init,
        default_output_bw=8,   # 8-bit activations
        default_param_bw=8,    # 8-bit weights
        config_file="qnn_quantsim_config.json"
    )
    
    # Step 2: Calibration
    def calibration_callback(model, _):
        model.eval()
        with torch.no_grad():
            for i, (images, _) in enumerate(train_loader):
                if i >= 100: break
                model(images.cuda())
    
    sim.compute_encodings(calibration_callback, None)
    print("Calibration complete; starting QAT fine-tuning...")
    
    # Step 3: QAT fine-tuning
    optimizer = torch.optim.SGD(
        sim.model.parameters(), lr=0.001,
        momentum=0.9, weight_decay=1e-4
    )
    criterion = nn.CrossEntropyLoss()
    
    best_acc = 0
    for epoch in range(num_epochs):
        sim.model.train()
        running_loss = 0
        
        for images, labels in train_loader:
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            outputs = sim.model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        
        # Validation
        sim.model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.cuda(), labels.cuda()
                outputs = sim.model(images)
                _, predicted = outputs.max(1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)
        
        acc = correct / total
        print(f"Epoch {epoch+1}/{num_epochs} "
              f"Loss: {running_loss/len(train_loader):.4f} "
              f"Acc: {acc:.4f}")
        
        if acc > best_acc:
            best_acc = acc
            sim.export(
                path="./qat_export",
                filename_prefix="model_qat",
                dummy_input=dummy_input
            )
    
    return sim

3. Hexagon NPU Performance Tuning

3.1 The NPU Performance Model
NPU execution time = max(compute time, memory time) + scheduling overhead

compute time = total MACs / HTA throughput
memory time = bytes moved / memory bandwidth

Key figures for Snapdragon 8 Gen 3 (SM8650):
HTA compute:
  INT8: 36 TOPS (Tensor)
  INT16: 18 TOPS
  INT4: 72 TOPS

Memory bandwidth:
  LPDDR5X: 68.3 GB/s (theoretical)
  Available to the NPU: ~50 GB/s (measured effective bandwidth)

On-chip memory:
  VTCM: 8 MB (Vector Tightly Coupled Memory)
  L2: 1-2 MB
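Plugging the spec-sheet numbers above into this performance model gives the "ridge point" that the Roofline analysis in the next subsection is built around. A back-of-the-envelope sketch (the FC layer shape is borrowed from MobileNetV2's classifier):

```python
peak_ops = 36e12   # INT8 compute, ops/s (from the figures above)
mem_bw = 50e9      # effective memory bandwidth, bytes/s

# Layers with arithmetic intensity above the ridge point are
# compute-bound; below it, bandwidth-bound.
ridge = peak_ops / mem_bw
print(f"ridge point: {ridge:.0f} ops/byte")   # 720 ops/byte

# Example: a fully connected layer 1280 -> 1000 at INT8
ops = 2 * 1280 * 1000              # multiply + accumulate per weight
data = 1280 * 1000 + 1280 + 1000   # weights + in/out activations (bytes)
intensity = ops / data
print(f"FC intensity: {intensity:.1f} ops/byte")  # ~2: far below 720
```

At ~2 ops/byte against a 720 ops/byte ridge point, the FC layer is deep in bandwidth-bound territory, which is why fully connected and 1x1 layers rarely benefit from more TOPS.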
3.2 Roofline Analysis
# Roofline analysis tool: is a model compute-bound or bandwidth-bound?
import numpy as np

class RooflineAnalyzer:
    def __init__(self,
                 peak_tops=36.0,           # HTA INT8 compute (TOPS)
                 memory_bandwidth_gbs=50.0 # effective memory bandwidth (GB/s)
    ):
        self.peak_ops = peak_tops * 1e12   # ops/s
        self.mem_bw = memory_bandwidth_gbs * 1e9  # bytes/s
        self.ridge_point = self.peak_ops / self.mem_bw  # ops/byte
    
    def analyze_layer(self, name, ops, data_bytes):
        """Analyze one layer's performance characteristics"""
        arithmetic_intensity = ops / data_bytes  # ops/byte
        
        compute_time = ops / self.peak_ops
        memory_time = data_bytes / self.mem_bw
        
        if arithmetic_intensity >= self.ridge_point:
            bottleneck = "compute-bound"
            actual_time = compute_time
            utilization = min(1.0, memory_time / compute_time)
        else:
            bottleneck = "bandwidth-bound"
            actual_time = memory_time
            utilization = min(1.0, compute_time / memory_time)
        
        print(f"[{name}]")
        print(f"  ops: {ops/1e9:.2f} GOPS | "
              f"data: {data_bytes/1e6:.2f} MB")
        print(f"  arithmetic intensity: {arithmetic_intensity:.1f} ops/byte "
              f"(ridge point: {self.ridge_point:.1f})")
        print(f"  bottleneck: {bottleneck} | "
              f"estimated time: {actual_time*1000:.3f} ms | "
              f"utilization: {utilization*100:.1f}%")
        
        return {
            "name": name,
            "intensity": arithmetic_intensity,
            "bottleneck": bottleneck,
            "time_ms": actual_time * 1000,
            "utilization": utilization
        }
    
    def analyze_model(self, layers_info):
        """Analyze the whole model"""
        results = []
        total_time = 0
        print("=" * 60)
        print("Roofline analysis report")
        print("=" * 60)
        for info in layers_info:
            result = self.analyze_layer(**info)
            results.append(result)
            total_time += result["time_ms"]
            print()
        
        compute_bound = sum(1 for r in results if r["bottleneck"] == "compute-bound")
        memory_bound = len(results) - compute_bound
        
        print("=" * 60)
        print(f"Total estimated time: {total_time:.3f} ms")
        print(f"Compute-bound layers: {compute_bound} | bandwidth-bound layers: {memory_bound}")
        return results

# Analyze typical MobileNetV2 layers
analyzer = RooflineAnalyzer(peak_tops=36.0, memory_bandwidth_gbs=50.0)

# data_bytes is estimated as INT8 weights + input/output activations,
# derived from the layer shapes implied by each ops formula
mobilenet_layers = [
    {"name": "Conv1 3x3/2", "ops": 2*3*3*3*32*112*112,
     "data_bytes": 3*3*3*32 + 3*224*224 + 32*112*112},
    {"name": "DWConv 3x3", "ops": 2*3*3*32*112*112,
     "data_bytes": 3*3*32 + 2*32*112*112},
    {"name": "PWConv 1x1", "ops": 2*32*16*112*112,
     "data_bytes": 32*16 + 32*112*112 + 16*112*112},
    {"name": "BottleNeck Conv", "ops": 2*96*96*3*3*56*56,
     "data_bytes": 3*3*96*96 + 2*96*56*56},
    {"name": "Final FC", "ops": 2*1280*1000,
     "data_bytes": 1280*1000 + 1280 + 1000},
]
analyzer.analyze_model(mobilenet_layers)
3.3 QNN Profiling in Practice
#!/bin/bash
# profile_model.sh - detailed performance profiling

QNN_SDK=$QNN_SDK_ROOT

# 1. Run inference and collect detailed profiling data
qnn-net-run \
    --model model_libs/aarch64-android/libmodel_qnn.so \
    --backend $QNN_SDK/lib/aarch64-android/libQnnHtp.so \
    --input_list test_inputs/input_list.txt \
    --perf_profile burst \
    --profiling_level detailed \
    --log_level verbose \
    --output_dir profiling_results \
    --num_inferences 100   # run 100 iterations and average

# 2. Parse the profiling data
echo ""
echo "==== Per-layer time, top 20 ===="
python3 parse_profiling.py profiling_results/qnn-profiling-data.log
# parse_profiling.py - parse and visualize QNN profiling data
import re
import sys
from collections import defaultdict

def parse_qnn_profiling(log_path):
    layer_times = defaultdict(list)
    
    with open(log_path, "r") as f:
        for line in f:
            match = re.search(
                r"Node\s+(\S+)\s+.*execute.*?(\d+\.?\d*)\s*us", line
            )
            if match:
                node_name = match.group(1)
                time_us = float(match.group(2))
                layer_times[node_name].append(time_us)
    
    results = []
    total_us = 0
    for name, times in layer_times.items():
        avg_us = sum(times) / len(times)
        results.append((name, avg_us))
        total_us += avg_us
    
    results.sort(key=lambda x: x[1], reverse=True)
    
    print(f"{'Layer':<50} {'Avg time (us)':>12} {'Share':>8}")
    print("-" * 72)
    
    for name, avg_us in results[:20]:
        pct = avg_us / total_us * 100
        bar = "*" * int(pct / 2)
        print(f"{name:<50} {avg_us:>10.1f} us {pct:>6.1f}% {bar}")
    
    print("-" * 72)
    print(f"{'Total':<50} {total_us:>10.1f} us {100:>6.1f}%")
    print(f"\nAverage inference latency: {total_us/1000:.2f} ms")
    return results

if __name__ == "__main__":
    parse_qnn_profiling(sys.argv[1])
3.4 Common Performance Problems and Fixes

Problem 1: Operators falling back to the CPU

def check_fallback_ops(profiling_log):
    """Detect operators that fell back to the CPU in the profiling log"""
    cpu_ops = []
    htp_ops = []
    
    with open(profiling_log) as f:
        for line in f:
            if "CPU" in line and "execute" in line:
                cpu_ops.append(line.strip())
            elif "HTP" in line and "execute" in line:
                htp_ops.append(line.strip())
    
    if cpu_ops:
        print(f"Found {len(cpu_ops)} operators falling back to the CPU:")
        for op in cpu_ops[:10]:
            print(f"  {op}")
        print("\nSuggested fixes:")
        print("1. Check for an HTP-supported equivalent operator to substitute")
        print("2. Simplify the graph structure with onnxsim")
        print("3. Consider rewriting unsupported ops in the model")
    else:
        print("All operators run on the HTP")

Problem 2: Low VTCM utilization

// htp_optimization.json
{
    "graph_config": {
        "vtcm_mb": 8,
        "enable_dlbc": true,
        "dlbc_config": {
            "weight_cache_strategy": "greedy",
            "activation_cache_strategy": "lru",
            "max_weight_cache_mb": 4
        },
        "tiling_config": {
            "prefer_vtcm_tiling": true,
            "max_tile_height": 64,
            "max_tile_width": 64,
            "tile_channels": 32
        }
    }
}

Problem 3: Memory bandwidth bottleneck

# Memory layout optimization - NHWC vs. NCHW
# The Hexagon NPU natively uses the NHWC layout; NCHW inputs incur extra transpose overhead

def optimize_memory_layout(onnx_path, output_path):
    """Inspect 4D model inputs and suggest an NHWC layout conversion"""
    import onnx
    
    model = onnx.load(onnx_path)
    
    for inp in model.graph.input:
        shape = [d.dim_value for d in inp.type.tensor_type.shape.dim]
        if len(shape) == 4:
            n, c, h, w = shape
            print(f"Input {inp.name}: NCHW[{n},{c},{h},{w}]")
            print(f"Suggest NHWC[{n},{h},{w},{c}] to avoid NPU transpose overhead")
    
    print("\nWhen converting with QNN, add:")
    print("  qnn-onnx-converter --input_layout input NHWC")
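At the tensor level, the layout change itself is just an axis permutation; the real win is doing it once at the graph boundary (via the converter flag above) rather than on every inference. A numpy sketch of the mapping:

```python
import numpy as np

# NCHW -> NHWC is the axis permutation (N, C, H, W) -> (N, H, W, C)
x_nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

print(x_nchw.shape)  # (2, 3, 4, 4)
print(x_nhwc.shape)  # (2, 4, 4, 3)

# Same element, different addressing: [n, c, h, w] maps to [n, h, w, c]
assert x_nchw[0, 1, 2, 3] == x_nhwc[0, 2, 3, 1]
```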

4. Case Study: Recovering ResNet50 Accuracy

4.1 The Problem
Original FP32 ResNet50: Top-1 = 76.1%
Direct INT8 PTQ: Top-1 = 74.3% (a 1.8% drop; unacceptable)
Target: Top-1 ≥ 75.8% (drop < 0.3%)
4.2 Debugging Workflow
# Complete quantization accuracy-recovery workflow
import torch
import torchvision.models as models
import numpy as np

# Step 1: per-layer accuracy analysis
print("==== Step 1: Per-layer quantization sensitivity analysis ====")
model = models.resnet50(pretrained=True).eval()

sensitivity_results = {}
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        weight = module.weight.data
        
        # Simulate INT8 quantization
        scale = weight.abs().max() / 127.0
        q_weight = torch.clamp(torch.round(weight / scale), -128, 127) * scale
        
        # Compute the quantization error
        error = torch.mean((weight - q_weight) ** 2).item()
        snr = 10 * np.log10(
            torch.sum(weight**2).item() /
            (torch.sum((weight - q_weight)**2).item() + 1e-10)
        )
        
        sensitivity_results[name] = {"mse": error, "snr_db": snr}
        
        if snr < 25:  # flag layers with SNR < 25 dB as sensitive
            print(f"  {name}: SNR={snr:.1f}dB (sensitive!)")

# Step 2: use INT16 for the sensitive layers
print("\n==== Step 2: Generate mixed-precision config ====")
sensitive_layers = {
    name: info for name, info in sensitivity_results.items()
    if info["snr_db"] < 25
}

overrides = {"activation_encodings": {}, "param_encodings": {}}
for name in sensitive_layers:
    overrides["param_encodings"][name + ".weight"] = {
        "dtype": "int16",
        "is_symmetric": True
    }
    print(f"  {name} -> INT16")

import json
with open("resnet50_mixed_precision.json", "w") as f:
    json.dump(overrides, f, indent=2)

# Step 3: optimize with AdaRound
print("\n==== Step 3: Apply AdaRound ====")
print("AdaRound learns an optimal rounding policy to minimize quantization error")
print("Command: qnn-onnx-converter --algorithms adaround ...")

# Step 4: apply CLE (Cross-Layer Equalization)
print("\n==== Step 4: Apply CLE ====")
print("CLE equalizes weights across layers to reduce disparities in quantization range")
print("Command: qnn-onnx-converter --algorithms cle ...")
4.3 Final Results

Optimization                               | Top-1 | NPU latency
FP32 baseline                              | 76.1% | N/A
Direct INT8 PTQ                            | 74.3% | 2.1 ms
+ CLE                                      | 75.2% | 2.1 ms
+ CLE + AdaRound                           | 75.6% | 2.1 ms
+ Mixed precision (sensitive layers INT16) | 75.9% | 2.4 ms
+ QAT (10 epochs)                          | 76.0% | 2.1 ms

5. Advanced Optimization Techniques

5.1 Operator Fusion
Before fusion (three memory round trips):
Conv2d → write to DRAM → read back → BatchNorm → write to DRAM → read back → ReLU → write back

After fusion (one memory round trip):
[Conv2d + BatchNorm + ReLU] → write to DRAM

Memory traffic drops by 60%+, which matters most for bandwidth-bound layers.

Fusion patterns QNN applies automatically:

Fusion pattern        | Speedup
Conv + BN + ReLU      | ~30%
Conv + Add (residual) | ~15%
MatMul + Add + GELU   | ~25%
LayerNorm fusion      | ~20%
Multi-Head Attention  | ~40%
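Conv + BN fusion works because BatchNorm at inference time is just an affine transform, which can be folded into the convolution's weights and bias ahead of time. A numpy sketch using a 1x1 convolution (which is simply a matmul); shapes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
cin, cout, n = 8, 4, 16
x = rng.standard_normal((n, cin))
W = rng.standard_normal((cout, cin))  # 1x1 conv == matmul
b = rng.standard_normal(cout)
gamma, beta = rng.standard_normal(cout), rng.standard_normal(cout)
mean, var = rng.standard_normal(cout), rng.random(cout) + 0.5
eps = 1e-5

# Unfused: conv writes out, BN reads it back
y = x @ W.T + b
bn = gamma * (y - mean) / np.sqrt(var + eps) + beta

# Fused: fold BN's per-channel scale and shift into the conv parameters
s = gamma / np.sqrt(var + eps)
W_fused = W * s[:, None]
b_fused = (b - mean) * s + beta
fused = x @ W_fused.T + b_fused

print(np.abs(bn - fused).max())  # difference at float rounding level
```

The fused form produces identical outputs with one matmul and no intermediate tensor, which is exactly the memory-traffic saving the diagram above describes.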
5.2 Performance Configuration Best Practices
# QNN HTP backend performance configuration, explained

# perf_profile: performance level
# - burst: maximum performance and power; for short runs only
# - sustained_high: sustained high performance
# - balanced: balance performance and power
# - power_saver: power-saving mode

# Recommended config for real-time inference (e.g. camera detection)
qnn-net-run \
    --backend libQnnHtp.so \
    --perf_profile sustained_high \
    --profiling_level basic \
    --duration 0 \
    --num_inferences 1000

# Recommended config for batch processing
qnn-net-run \
    --backend libQnnHtp.so \
    --perf_profile burst \
    --profiling_level off \
    --num_inferences 10000
5.3 Memory Optimization (Zero-Copy)
/**
 * QNN zero-copy memory management.
 * Uses the ION/DMA-BUF allocator so the CPU and NPU share one buffer
 * instead of copying data between them.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/ion.h>

class ZeroCopyBuffer {
public:
    bool allocate(size_t size) {
        int ion_dev = open("/dev/ion", O_RDONLY);
        if (ion_dev < 0) return false;
        
        struct ion_allocation_data alloc_data = {};
        alloc_data.len = size;
        alloc_data.heap_id_mask = ION_HEAP_SYSTEM_MASK;
        alloc_data.flags = ION_FLAG_CACHED;
        if (ioctl(ion_dev, ION_IOC_ALLOC, &alloc_data) < 0) {
            close(ion_dev);
            return false;
        }
        close(ion_dev);  // the allocation fd stays valid on its own
        
        buffer_ = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, alloc_data.fd, 0);
        size_ = size;
        ion_fd_ = alloc_data.fd;
        return buffer_ != MAP_FAILED;
    }
    
    void* data() { return buffer_; }
    int fd() const { return ion_fd_; }
    
    ~ZeroCopyBuffer() {
        if (buffer_ != MAP_FAILED) munmap(buffer_, size_);
        if (ion_fd_ >= 0) close(ion_fd_);
    }
    
private:
    void* buffer_ = MAP_FAILED;
    size_t size_ = 0;
    int ion_fd_ = -1;
};

6. Tuning Checklist

□ Model quantization tuning checklist:

□ Calibration data
  □ Sample count ≥ 200 (500-1000 recommended)
  □ Data distribution covers real usage scenarios
  □ Edge-case samples included

□ Quantization config
  □ Prefer per-channel weight quantization
  □ Use a suitable activation observer (MinMax / Percentile / MSE)
  □ Symmetric quantization for weights, asymmetric for ReLU activations

□ Accuracy validation
  □ Cosine similarity > 0.995
  □ SNR > 30 dB
  □ Task-metric loss < 1%

□ Performance optimization
  □ Use profiling to find the 10 slowest layers
  □ Confirm no operators fall back to the CPU
  □ VTCM utilization > 70%
  □ Input layout is NHWC (avoids transpose overhead)

□ Pre-deployment validation
  □ Run 1000 inference iterations of stability testing on the target device
  □ Peak memory stays under 60% of the device's available memory
  □ No significant performance drop after thermals stabilize
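The cosine-similarity and SNR thresholds in the checklist can be computed directly from a pair of FP32/quantized output tensors. A numpy sketch, where a simulated INT8 round trip stands in for a real quantized model's output:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def snr_db(ref, approx):
    # Signal-to-noise ratio in dB; epsilon guards a perfect match
    noise = ref - approx
    return float(10 * np.log10(np.sum(ref**2) / (np.sum(noise**2) + 1e-12)))

np.random.seed(0)
fp32_out = np.random.randn(1000).astype(np.float32)
# Simulated INT8 quantize/dequantize of the output tensor
scale = np.abs(fp32_out).max() / 127.0
int8_out = np.clip(np.round(fp32_out / scale), -127, 127) * scale

cos = cosine_similarity(fp32_out, int8_out)
snr = snr_db(fp32_out, int8_out)
print(f"cosine: {cos:.6f}  SNR: {snr:.1f} dB")
# Checklist thresholds: cosine > 0.995 and SNR > 30 dB
```

In a real pipeline, fp32_out and int8_out would be the outputs of the FP32 reference model and the quantized model on the same calibration inputs.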

7. Summary

Method                        | Best for                      | Accuracy loss | Effort
Standard PTQ                  | Most CNNs                     | < 1%          | ★
Enhanced PTQ (CLE + AdaRound) | Accuracy-sensitive models     | < 0.5%        | ★★
Mixed precision               | Transformers / complex models | < 0.3%        | ★★★
QAT                           | Extreme accuracy requirements | < 0.1%        | ★★★★

Bottom line: start with standard PTQ; if that falls short, add CLE + AdaRound; escalate to mixed precision next; reach for QAT only as a last resort. Ninety percent of cases are solved by the first two steps.
