Qwen3.5-MoE Multimodal Large Model: An In-Depth Architecture Analysis

m0_63217963 · Published 2026-02-23 17:14:25

Document version: v1.0
Analysis date: 2026-02-22
Analysis source: config.json + quant_model_weights.safetensors.index.json
Architecture identifier: Qwen3_5MoeForConditionalGeneration

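1. Model Overview

Dimension	Value
Architecture type	`Qwen3_5MoeForConditionalGeneration`
Model category	Multimodal (vision-language) MoE
Total weight size	~420.7 GB (after quantization)
Shard files	99 safetensors files
Weight entries	279,374
Context window	262,144 tokens (256K)
Vocabulary size	248,320
Precision	bfloat16 (some components quantized to lower precision)
Transformers version	4.57.0.dev0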

1.1 The Four Modules

┌─────────────────────────────────────────────────────────┐
│                   Qwen3.5-MoE                           │
├──────────────┬──────────────┬──────────┬────────────────┤
│ Vision       │ Language     │ MTP      │ LM Head        │
│ Encoder      │ Model        │ Module   │                │
│ (27-layer    │ (60-layer    │ (1-layer │ (linear proj.) │
│  ViT)        │  Hybrid-MoE) │  MoE)    │                │
├──────────────┴──────────────┴──────────┴────────────────┤
│ Shard 分布: Vision=1 | LM=96 | MTP=4 | LM Head=1       │
└─────────────────────────────────────────────────────────┘

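2. Vision Encoder

Based on the ViT (Vision Transformer) architecture; it encodes images and video frames into a sequence of visual tokens.

2.1 Core Parameters

Parameter	Value	Description
depth	27	Number of Transformer blocks
hidden_size	1152	Hidden dimension
num_heads	16	Number of attention heads (head_dim = 72)
intermediate_size	4304	FFN intermediate dimension
patch_size	16	Spatial patch size
temporal_patch_size	2	Temporal patch size (video)
spatial_merge_size	2	Spatial merge factor
in_channels	3	Input channels (RGB)
num_position_embeddings	2304	Number of position embeddings
out_hidden_size	4096	Output dimension (matches the language model)
hidden_act	gelu_pytorch_tanh	Activation function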

2.2 Structure in Detail

Input Image/Video
    │
    ▼
┌──────────────────────┐
│  Patch Embedding     │  Conv2d(3, 1152, kernel=16×16)
│  + Position Embed    │  Learnable position embedding (2304 positions)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  ViT Block × 27      │  Each block:
│  ├─ LayerNorm        │    - Pre-norm (weight + bias)
│  ├─ Fused QKV Attn   │    - Multi-head attention with fused QKV (16 heads)
│  ├─ LayerNorm        │    - Pre-norm (weight + bias)
│  └─ MLP (FC1→FC2)    │    - 1152 → 4304 → 1152
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Merger              │  Maps vision features into the language space
│  ├─ LayerNorm        │    Normalization
│  ├─ FC1              │    Projection after spatial_merge
│  └─ FC2              │    → out_hidden_size (4096)
└──────────┬───────────┘
           │
           ▼
    Visual Tokens (dim=4096)

2.3 Special Vision Tokens

Token	ID	Purpose
`vision_start`	248053	Marks the start of a vision sequence
`vision_end`	248054	Marks the end of a vision sequence
`image_token`	248056	Image placeholder token
`video_token`	248057	Video placeholder token

3. Language Model

3.1 Global Parameters

Parameter	Value	Description
num_hidden_layers	60	Total number of layers
hidden_size	4096	Hidden dimension
vocab_size	248,320	Vocabulary size
max_position_embeddings	262,144	Maximum position (256K)
rms_norm_eps	1e-6	RMSNorm epsilon
hidden_act	silu	FFN activation function
tie_word_embeddings	false	Embedding and LM Head are not tied

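3.2 Hybrid Attention Architecture

Qwen3.5-MoE mixes linear attention with full self-attention. Layers are grouped in fours: the first three layers of each group use linear attention (SSM/Mamba-style), and the fourth uses full self-attention.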

full_attention_interval = 4

Layer:  0   1   2  [3]  4   5   6  [7]  8   9  10 [11] ... 56  57  58 [59]
Type:   L   L   L   F   L   L   L   F   L   L   L   F  ...  L   L   L   F

L = Linear Attention (45 layers, 75%)
F = Full Self-Attention (15 layers, 25%)
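
The layer pattern above follows directly from `full_attention_interval`. A minimal sketch that regenerates it; the function name `layer_types` and the string labels are illustrative choices, not identifiers taken from the model code:

```python
def layer_types(num_layers: int = 60, full_attention_interval: int = 4) -> list[str]:
    """Label each layer 'linear' or 'full'; every interval-th layer is full attention."""
    return [
        "full" if (i + 1) % full_attention_interval == 0 else "linear"
        for i in range(num_layers)
    ]

types = layer_types()
full_layers = [i for i, t in enumerate(types) if t == "full"]
print(full_layers[:4], types.count("linear"), types.count("full"))
# [3, 7, 11, 15] 45 15
```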

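3.2.1 Linear Attention (SSM), 45 layers

A linear attention layer in the Mamba-2 / state space model (SSM) family, with O(n) time complexity in sequence length. The config keys and weight names below (`A_log`, `dt_bias`, `conv1d`, separate `a`/`b` projections) also closely resemble those of the Gated DeltaNet layer in Qwen3-Next, so "Mamba" is best read as a description of the layer family rather than a confirmed implementation.

Parameter	Value	Description
linear_key_head_dim	128	Key head dimension
linear_num_key_heads	16	Number of key heads (each shared by 4 value heads)
linear_value_head_dim	128	Value head dimension
linear_num_value_heads	64	Number of value heads
linear_conv_kernel_dim	4	1D convolution kernel size

Weights:

Weight name	Description
`in_proj_qkv.weight`	Fused QKV input projection
`in_proj_z.weight`	Gate projection Z
`in_proj_a.weight`	Projection for SSM parameter A
`in_proj_b.weight`	Projection for SSM parameter B
`A_log`	State transition matrix (stored in log space)
`dt_bias`	Bias for the time step Δ
`conv1d.weight`	Local convolution (kernel=4)
`norm.weight`	Normalization
`out_proj.weight`	Output projection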
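
The head counts above pin down the projection widths and the size of the recurrent state. The sketch below works them out under two assumptions that the index file alone does not confirm: that `in_proj_qkv` emits Q, K, and V concatenated, and that each value head keeps one key_dim × value_dim state matrix. The state size has no sequence-length term, which is where the O(n) cost comes from.

```python
# Values from the linear-attention parameter table.
key_head_dim, num_key_heads = 128, 16
value_head_dim, num_value_heads = 128, 64

key_dim = key_head_dim * num_key_heads          # width of Q and of K
value_dim = value_head_dim * num_value_heads    # width of V
value_heads_per_key_head = num_value_heads // num_key_heads

# ASSUMPTION: in_proj_qkv outputs Q, K and V side by side.
qkv_out_features = 2 * key_dim + value_dim

# ASSUMPTION: one (key_head_dim x value_head_dim) state matrix per value head.
state_elements_per_layer = num_value_heads * key_head_dim * value_head_dim

print(key_dim, value_dim, value_heads_per_key_head)   # 2048 8192 4
print(qkv_out_features, state_elements_per_layer)     # 12288 1048576
```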

3.2.2 Full Self-Attention, 15 layers

Uses GQA (Grouped-Query Attention) + QK-Norm + partial rotary position embedding.

Parameter	Value	Description
num_attention_heads	32	Number of Q heads
num_key_value_heads	2	Number of KV heads (GQA ratio 16:1)
head_dim	256	Dimension per head
Total Q dimension	32 × 256 = 8192
Total KV dimension	2 × 256 = 512
attn_output_gate	true	Output gating
attention_bias	false	No attention bias

Weights: q_proj, k_proj, v_proj, o_proj, q_norm, k_norm

Of these, q_proj, k_proj, and v_proj carry quantization parameters (weight_scale, weight_offset).
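
With only 2 KV heads and only 15 full-attention layers, the KV cache stays small even at the full 256K context. A rough estimate, assuming a bfloat16 cache (2 bytes per element) and counting only the full-attention layers; the helper `kv_cache_bytes` is hypothetical:

```python
def kv_cache_bytes(seq_len, full_attn_layers=15, num_kv_heads=2,
                   head_dim=256, bytes_per_element=2):
    """Bytes of K and V cached by the full-attention layers for one sequence."""
    # Factor 2 = one K tensor plus one V tensor per layer.
    per_token = 2 * num_kv_heads * head_dim * full_attn_layers * bytes_per_element
    return per_token * seq_len

hybrid = kv_cache_bytes(262_144)                          # 15 full-attention layers
all_full = kv_cache_bytes(262_144, full_attn_layers=60)   # hypothetical: no linear layers
print(hybrid / 2**30, all_full / 2**30)  # 7.5 30.0 (GiB)
```

Under these assumptions the hybrid layout caches a quarter of what a 60-layer full-attention stack with the same GQA settings would.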

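3.2.3 Position Encoding (RoPE)

Parameter	Value	Description
rope_type	default	Standard RoPE
rope_theta	10,000,000	Frequency base (10M, for very long contexts)
partial_rotary_factor	0.25	Rotation applied to only 25% of dimensions
Rotary dimensions	256 × 0.25 = 64	Dimensions that actually take part in RoPE
mrope_interleaved	true	Interleaved multimodal RoPE
mrope_section	[11, 11, 10]	Allocation across the height/width/time axes (Qwen2-VL documents this field in temporal, height, width order)

M-RoPE (Multimodal RoPE) splits the 64 rotary dimensions into 3 segments (11+11+10 = 32 pairs, 2 dimensions per pair), one per axis: spatial height, spatial width, and temporal position. This gives the model native position awareness for images and video.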
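
The arithmetic behind the partial rotary split can be checked in a few lines. This sketch uses the standard RoPE inverse-frequency formula; how the interleaved layout assigns each frequency pair to an axis is not modeled here.

```python
head_dim = 256
partial_rotary_factor = 0.25
rope_theta = 10_000_000
mrope_section = [11, 11, 10]

rotary_dim = int(head_dim * partial_rotary_factor)   # dimensions that get rotated
num_pairs = rotary_dim // 2                          # RoPE rotates dimensions in pairs
assert sum(mrope_section) == num_pairs               # the sections must cover every pair

# Standard RoPE inverse frequencies, one per rotated pair.
inv_freq = [rope_theta ** (-2 * i / rotary_dim) for i in range(num_pairs)]

dims_per_section = [2 * n for n in mrope_section]
unrotated_dims = head_dim - rotary_dim

print(rotary_dim, num_pairs, dims_per_section, unrotated_dims)
# 64 32 [22, 22, 20] 192
```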

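3.3 MoE FFN (Mixture-of-Experts Feed-Forward Network)

Every layer (all 60) uses an MoE FFN.

Parameter	Value	Description
num_experts	512	Total number of experts
num_experts_per_tok	10	Experts activated per token
moe_intermediate_size	1024	Expert intermediate dimension
shared_expert_intermediate_size	1024	Shared expert intermediate dimension
router_aux_loss_coef	0.001	Router auxiliary loss coefficient

Structure of a single MoE layer: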

hidden_states (4096)
       │
       ├───────────────────────────────────────────┐
       │                                           │
       ▼                                           ▼
┌──────────────┐                          ┌──────────────────┐
│ Router Gate  │                          │ Shared Expert    │
│ (4096 → 512) │                          │ gate_proj (4096→1024)
│ Top-10 select│                          │ up_proj   (4096→1024)
└──────┬───────┘                          │ down_proj (1024→4096)
       │                                  └────────┬─────────┘
       ▼                                           │
┌─────────────────────┐                            │
│ Expert_i × 10       │                            │
│ (10 of 512 chosen)  │                            │
│ gate_proj (4096→1024)│                            │
│ up_proj   (4096→1024)│          ┌────────────────┐│
│ down_proj (1024→4096)│          │shared_expert   ││
└──────────┬──────────┘          │_gate (scalar)  ││
           │                     └───────┬────────┘│
           │                             │         │
           ▼                             ▼         │
     expert_output ──────────────── + ◄────────────┘
                                   │
                                   ▼
                             output (4096)
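
The routing step in the diagram can be sketched in plain Python. Two details are assumptions rather than facts read from the config: that the Top-10 weights are renormalized to sum to 1, and that the shared expert's scalar gate is a sigmoid. The names `route` and `moe_output` are illustrative.

```python
import math

def route(logits, top_k=10):
    """Pick the top_k experts for one token; return (indices, weights)."""
    # Softmax over all experts, stabilised by subtracting the max logit.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # ASSUMPTION: weights are renormalised over the chosen experts.
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]

def moe_output(expert_outs, weights, shared_out, shared_gate_logit):
    """Weighted sum of the chosen experts plus the gated shared expert."""
    # ASSUMPTION: the scalar shared-expert gate is a sigmoid.
    gate = 1.0 / (1.0 + math.exp(-shared_gate_logit))
    dim = len(shared_out)
    mixed = [sum(w * out[d] for w, out in zip(weights, expert_outs)) for d in range(dim)]
    return [mixed[d] + gate * shared_out[d] for d in range(dim)]

logits = [math.sin(i) for i in range(512)]   # toy router logits for one token
chosen, weights = route(logits)
```

Only the 10 chosen experts run for a given token, while the shared expert runs for every token.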

Parameter count estimate (per MoE FFN layer):

Component	Parameters
512 experts	512 × 3 × 4096 × 1024 = 6.44B
Shared expert	3 × 4096 × 1024 = 12.6M
Router gate	4096 × 512 = 2.1M
Per-layer MoE subtotal	~6.46B
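
The per-layer figures can be reproduced exactly. This sketch counts only the three projection matrices per expert and the router weight, ignoring the small shared-expert gate:

```python
hidden, inter, num_experts = 4096, 1024, 512

per_expert = 3 * hidden * inter        # gate_proj + up_proj + down_proj
experts = num_experts * per_expert     # all routed experts
shared = 3 * hidden * inter            # the single shared expert
router = hidden * num_experts          # router gate weight
layer_total = experts + shared + router

print(f"{experts / 1e9:.2f}B  {shared / 1e6:.1f}M  {router / 1e6:.1f}M  {layer_total / 1e9:.2f}B")
# 6.44B  12.6M  2.1M  6.46B
```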

3.4 Full Structure of a Single Layer

                   input
                     │
                     ▼
              ┌─────────────┐
              │ RMSNorm     │  input_layernorm
              └──────┬──────┘
                     │
            ┌────────┴────────┐
            │                 │
     (layer % 4 != 3)  (layer % 4 == 3)
            │                 │
            ▼                 ▼
   ┌─────────────┐  ┌──────────────┐
   │ Linear Attn │  │ Full Self-   │
   │ (Mamba SSM) │  │ Attention    │
   │             │  │ (GQA+RoPE)   │
   └──────┬──────┘  └──────┬───────┘
          │                │
          └────────┬───────┘
                   │
                   + ← residual
                   │
                   ▼
            ┌─────────────┐
            │ RMSNorm     │  post_attention_layernorm
            └──────┬──────┘
                   │
                   ▼
          ┌────────────────┐
          │ MoE FFN        │
          │ 512 Experts    │
          │ + Shared Expert│
          │ Top-10 Routing │
          └────────┬───────┘
                   │
                   + ← residual
                   │
                   ▼
                output
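
The control flow in the diagram is a pre-norm residual block whose token mixer depends on the layer index. A minimal sketch with stand-in callables: every function passed in is a hypothetical stub, and vectors are plain Python lists.

```python
def decoder_layer(x, layer_idx, norm1, linear_attn, full_attn, norm2, moe_ffn, interval=4):
    """Pre-norm residual block; the mixer is chosen by layer index."""
    # Layers 3, 7, 11, ... use full attention; all others use linear attention.
    mixer = full_attn if layer_idx % interval == interval - 1 else linear_attn
    x = [a + b for a, b in zip(x, mixer(norm1(x)))]     # attention + residual
    x = [a + b for a, b in zip(x, moe_ffn(norm2(x)))]   # MoE FFN + residual
    return x

# Toy stubs that make the chosen branch visible in the output.
identity = lambda h: h
linear_stub = lambda h: [1 for _ in h]
full_stub = lambda h: [2 for _ in h]
zero_ffn = lambda h: [0 for _ in h]

print(decoder_layer([10], 0, identity, linear_stub, full_stub, identity, zero_ffn))  # [11]
print(decoder_layer([10], 3, identity, linear_stub, full_stub, identity, zero_ffn))  # [12]
```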

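3.5 Quantization Strategy

Quantization is applied only to compute-heavy components; critical components stay in full precision:

Component	Quantized	Notes
MoE Expert FFN (`gate/up/down_proj`)	Yes	Carries `weight_scale` + `weight_offset`
Self-Attention QKV (`q/k/v_proj`)	Yes	Carries `weight_scale` + `weight_offset`
Self-Attention Output (`o_proj`)	No	Kept in full precision
Linear Attention, all weights	No	SSM is precision-sensitive
Shared Expert	No	Always active, so kept in full precision
Router Gate	No	Routing precision directly affects expert selection
RMSNorm	No	Kept in full precision
Embedding / LM Head	No	Kept in full precision

There are 184,410 quantization parameter entries, about 66% of all weight entries.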
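
One decomposition that reproduces the 184,410 figure, offered as a plausible reading rather than a confirmed breakdown: it assumes each expert projection is stored as its own tensor, that every quantized weight carries exactly two extra entries (`weight_scale` and `weight_offset`), and that only the 60 main-model layers are counted.

```python
layers, experts, projs_per_expert = 60, 512, 3     # gate/up/down_proj per expert
full_attn_layers, quantized_attn_projs = 15, 3     # q_proj, k_proj, v_proj
extra_entries_per_weight = 2                       # weight_scale + weight_offset
total_weight_entries = 279_374                     # from the overview table

quantized_weights = (layers * experts * projs_per_expert
                     + full_attn_layers * quantized_attn_projs)
quant_param_entries = quantized_weights * extra_entries_per_weight
share = quant_param_entries / total_weight_entries

print(quantized_weights, quant_param_entries, f"{share:.0%}")
# 92205 184410 66%
```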

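4. MTP Module (Multi-Token Prediction)

The MTP module is confirmed present: 1,553 weight entries spread across 4 shard files.

4.1 MTP Configuration

Parameter	Value	Description
mtp_num_hidden_layers	1	Number of MTP Transformer layers
mtp_use_dedicated_embeddings	false	Reuses the main model's embedding

4.2 Full MTP Structure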

                    ┌─────────────────┐    ┌──────────────────┐
                    │ Embedding of    │    │ Hidden state from│
                    │ current token   │    │ last LM layer    │
                    └────────┬────────┘    └────────┬─────────┘
                             │                      │
                             ▼                      ▼
                    ┌─────────────────┐    ┌──────────────────┐
                    │pre_fc_norm_     │    │pre_fc_norm_      │
                    │embedding        │    │hidden            │
                    │(RMSNorm)        │    │(RMSNorm)         │
                    └────────┬────────┘    └────────┬─────────┘
                             │                      │
                             └──────────┬───────────┘
                                        │ concat / combine
                                        ▼
                               ┌─────────────────┐
                               │ fc.weight       │
                               │ (fusion proj.)  │
                               └────────┬────────┘
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │   MTP Transformer     │
                            │   Layer 0             │
                            │ ┌───────────────────┐ │
                            │ │ input_layernorm   │ │
                            │ ├───────────────────┤ │
                            │ │ Self-Attention    │ │
                            │ │ (q/k/v/o_proj +   │ │
                            │ │  q_norm, k_norm)  │ │
                            │ ├───────────────────┤ │
                            │ │post_attn_layernorm│ │
                            │ ├───────────────────┤ │
                            │ │ MoE FFN           │ │
                            │ │ ├ Gate (→512)     │ │
                            │ │ ├ Expert ×512     │ │
                            │ │ ├ Shared Expert   │ │
                            │ │ └ Shared Gate     │ │
                            │ └───────────────────┘ │
                            └───────────┬───────────┘
                                        │
                                        ▼
                               ┌─────────────────┐
                               │ mtp.norm        │
                               │ (RMSNorm)       │
                               └────────┬────────┘
                                        │
                                        ▼
                               ┌─────────────────┐
                               │ lm_head (reused)│
                               │ predicts the    │
                               │ next-next token │
                               └─────────────────┘

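4.3 Key MTP Design Features

Feature	Description
Depth	Only 1 Transformer layer; a lightweight design
Attention type	Full Self-Attention (not Linear Attention)
FFN type	MoE with the same structure as the main model (512 experts + shared expert)
Embedding	Reuses the main model's embedding (`mtp_use_dedicated_embeddings=false`)
LM Head	Reuses the main model's `lm_head.weight`
Fusion	Embedding and hidden state are each passed through RMSNorm, then fused by an FC layer

4.4 MTP Usage

Training: provides an auxiliary next-next-token prediction signal that strengthens representation learning
Inference: can be used for speculative decoding. The MTP head drafts a candidate token and the main model verifies it in parallel, which raises inference throughput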
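
The draft-and-verify idea can be illustrated with a toy greedy decoder. This is a sketch under simplifying assumptions, not the serving implementation: `main_next` and `mtp_draft` are hypothetical stand-ins for model calls, one token is drafted per step, and a single "pass" is assumed to score both the position after the confirmed token and the position after the draft.

```python
def generate(prompt, main_next, mtp_draft, num_new_tokens):
    """Greedy decoding with one drafted token per step.

    Returns (tokens, main_model_passes). The output always matches plain
    greedy decoding with main_next; drafting only changes the pass count.
    """
    tokens = list(prompt)
    pending = main_next(tokens)          # first pass: next token for the prompt
    passes = 1
    while len(tokens) - len(prompt) < num_new_tokens:
        tokens.append(pending)           # token already confirmed by the main model
        draft = mtp_draft(tokens)        # cheap guess for the following token
        passes += 1                      # one main-model pass scores both positions
        verified = main_next(tokens)
        if draft == verified:
            tokens.append(draft)         # draft accepted: two tokens from one pass
            pending = main_next(tokens)  # available from the same pass
        else:
            pending = verified           # draft rejected: keep the verified token
    return tokens[: len(prompt) + num_new_tokens], passes

toy_main = lambda toks: (toks[-1] + 1) % 100        # toy "model": count upward
good_draft = lambda toks: (toks[-1] + 1) % 100      # always agrees with toy_main
bad_draft = lambda toks: -1                         # never agrees

print(generate([0], toy_main, good_draft, 8))  # ([0, 1, 2, 3, 4, 5, 6, 7, 8], 5)
print(generate([0], toy_main, bad_draft, 8))   # ([0, 1, 2, 3, 4, 5, 6, 7, 8], 9)
```

With a perfect draft the toy needs 5 main-model passes for 8 tokens, against 9 when every draft is rejected; real acceptance rates fall somewhere in between.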

5. Parameter Count Estimates

5.1 Total Parameters

Module	Estimated parameters	Notes
Embedding	~1.02B	248,320 × 4,096
LM Head	~1.02B	248,320 × 4,096 (not tied)
LM Layers, MoE FFN	~387.3B	60 × 512 × 3 × 4096 × 1024 + shared
LM Layers, Self-Attn (×15)	~1.07B	15 × (Q+K+V+O+norms)
LM Layers, Linear-Attn (×45)	a few B	45 × SSM parameters
LM Layers, Norms/Router	~0.13B	60 × (2×layernorm + gate)
Vision Encoder	~0.4B+	27-layer ViT + Merger
MTP	~6.5B	1-layer MoE Transformer
Total (estimated)	~400B+
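
The rows that depend only on stated config values can be recomputed directly. This sketch covers just those rows and leaves out linear attention, the vision encoder, and MTP, whose exact shapes are not listed here. It also ignores the tiny q_norm/k_norm vectors and any extra width `attn_output_gate` may add to `q_proj`.

```python
hidden, vocab, layers = 4096, 248_320, 60

embedding = vocab * hidden
lm_head = vocab * hidden                              # not tied to the embedding
moe = layers * (512 + 1) * 3 * hidden * 1024          # routed experts + shared expert
q_dim, kv_dim = 32 * 256, 2 * 256
self_attn = 15 * (hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden)  # Q, K, V, O
router_norms = layers * (hidden * 512 + 2 * hidden)   # router gate + two RMSNorms

rows = {"embedding": embedding, "lm_head": lm_head, "moe": moe,
        "self_attn": self_attn, "router_norms": router_norms}
for name, count in rows.items():
    print(f"{name:13s} {count / 1e9:8.2f}B")
print(f"{'subtotal':13s} {sum(rows.values()) / 1e9:8.2f}B")
```

The subtotal comes to about 390.5B, so the remaining modules account for the gap to the ~400B+ total.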

5.2 Activated Parameters per Token

Module	Activated parameters
MoE FFN (10/512 experts)	60 × 10 × 3 × 4096 × 1024 ≈ 7.55B
Shared Expert	60 × 3 × 4096 × 1024 ≈ 0.76B
Attention (average)	a few B
Embedding + LM Head	~2.03B
Activated per token (estimated)	~15-20B

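6. Summary of Architectural Innovations

6.1 Core Innovations

Innovation	Description
Hybrid Attention	Mixes Linear Attention (Mamba SSM) and Full Self-Attention at a 3:1 ratio, balancing O(n) efficiency with global modeling
Very large MoE	512 experts + shared expert with Top-10 routing per layer; ~400B total parameters but only ~15-20B activated
M-RoPE	Multimodal rotary position encoding in three segments (height/width/time), natively covering spatio-temporal positions in images and video
MTP	DeepSeek-V3-style single-layer multi-token prediction head, used as a training signal and for speculative-decoding speedup at inference
Partial Rotary	Rotary encoding on only 25% of dimensions (64/256); the remaining dimensions carry no positional rotation, balancing position awareness against semantic expressiveness
Selective quantization	Quantizes only Expert FFN and Self-Attn QKV; SSM, Shared Expert, Router, and other critical components stay in full precision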

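6.2 Comparison

Dimension	Qwen3.5-MoE (this model)	DeepSeek-V3	Qwen3-235B
Total parameters	~400B+	671B	235B
Activated parameters	~15-20B	37B	22B
Experts	512	256	128
Activated experts	10	8	8
Attention	Hybrid (SSM+Attn)	Full Attention	Full Attention
MTP	1 layer	1 layer	None
Multimodal	Native (ViT + M-RoPE)	No (text only)	No (text only)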

7. Weight File Distribution

quant_model_weights-00001-of-00099.safetensors  ← Vision Encoder (all)
quant_model_weights-00002~00094.safetensors     ← Language Model Layers (93 files)
quant_model_weights-00095~00098.safetensors     ← MTP Module (4 files)
quant_model_weights-00098.safetensors           ← Embedding + LM Norm
quant_model_weights-00099.safetensors           ← LM Head
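
A distribution like this can be derived from the `weight_map` dictionary in a safetensors index file. The sketch below groups shard files by weight-name prefix. The prefixes and tensor names in the toy example are hypothetical placeholders; the real checkpoint's naming should be read from the index itself.

```python
import json
from collections import defaultdict

def load_weight_map(index_path):
    """Read the `weight_map` dict (tensor name -> shard file) from an index file."""
    with open(index_path, encoding="utf-8") as f:
        return json.load(f)["weight_map"]

def shards_per_module(weight_map, prefixes):
    """Map each module prefix to the sorted shard files holding its weights."""
    shards = defaultdict(set)
    for name, filename in weight_map.items():
        for prefix in prefixes:
            if name.startswith(prefix):
                shards[prefix].add(filename)
                break   # first matching prefix wins
    return {prefix: sorted(files) for prefix, files in shards.items()}

# Toy weight map with hypothetical tensor names.
toy_map = {
    "visual.blocks.0.attn.qkv.weight": "quant_model_weights-00001-of-00099.safetensors",
    "mtp.fc.weight": "quant_model_weights-00095-of-00099.safetensors",
    "mtp.norm.weight": "quant_model_weights-00098-of-00099.safetensors",
    "lm_head.weight": "quant_model_weights-00099-of-00099.safetensors",
}
result = shards_per_module(toy_map, ["visual.", "mtp.", "lm_head."])
print({prefix: len(files) for prefix, files in result.items()})
# {'visual.': 1, 'mtp.': 2, 'lm_head.': 1}
```

Because one shard can hold weights from several modules, per-module shard counts obtained this way may add up to more than the number of files.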

Appendix A: Key Config Fields Quick Reference

{
  "text_config": {
    "hidden_size": 4096,
    "num_hidden_layers": 60,
    "num_attention_heads": 32,
    "num_key_value_heads": 2,
    "head_dim": 256,
    "num_experts": 512,
    "num_experts_per_tok": 10,
    "moe_intermediate_size": 1024,
    "shared_expert_intermediate_size": 1024,
    "full_attention_interval": 4,
    "max_position_embeddings": 262144,
    "mtp_num_hidden_layers": 1,
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "mrope_section": [11, 11, 10]
  },
  "vision_config": {
    "depth": 27,
    "hidden_size": 1152,
    "num_heads": 16,
    "patch_size": 16,
    "temporal_patch_size": 2,
    "out_hidden_size": 4096
  }
}

Appendix B: Full layer_types Mapping

Layer	Type	Layer	Type	Layer	Type	Layer	Type
0	Linear	15	Full	30	Linear	45	Linear
1	Linear	16	Linear	31	Full	46	Linear
2	Linear	17	Linear	32	Linear	47	Full
3	Full	18	Linear	33	Linear	48	Linear
4	Linear	19	Full	34	Linear	49	Linear
5	Linear	20	Linear	35	Full	50	Linear
6	Linear	21	Linear	36	Linear	51	Full
7	Full	22	Linear	37	Linear	52	Linear
8	Linear	23	Full	38	Linear	53	Linear
9	Linear	24	Linear	39	Full	54	Linear
10	Linear	25	Linear	40	Linear	55	Full
11	Full	26	Linear	41	Linear	56	Linear
12	Linear	27	Full	42	Linear	57	Linear
13	Linear	28	Linear	43	Full	58	Linear
14	Linear	29	Linear	44	Linear	59	Full