LLaVA-OneVision-1.5 is a recently open-sourced multimodal large-model training framework from EvolvingLMMs-Lab (Reference 1). It includes the datasets, models, and training recipes, and is very helpful for training multimodal LLMs, so I am sharing my walkthrough here; comments and corrections are welcome.

The environment is as follows:

Ubuntu 22.04, conda 23.7.4, CUDA 12.6, Python 3.11; kernel: Linux ubuntu-workstation 6.8.0-54-generic #56~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb  8 11:41:24 UTC 2 x86_64 x86_64 x86_64 GNU/Linux; Intel(R) Core(TM) i9-14900K; a single RTX 4090 GPU

The finetunning virtual environment contains:

torch 2.9.1+cu116, transformer_engine 2.10

Following the README.md, each training step is walked through below.

The README.md recommends running everything via Docker; this walkthrough does not use Docker.

1 Download an existing model or merge the initial weights manually

You have two options to get started with LLaVA-OneVision-1.5-stage-0:

Option 1: Download pre-trained model from Hugging Face

Download our LLaVA-OneVision-1.5-4B-stage0 model directly from Hugging Face.

Option 2: Merge initial weights yourself

Alternatively, you can merge the initial weights from the original ViT and LLM:

python ds/merge_model.py \
--vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
--llm_path Qwen/Qwen3-4B-Instruct-2507 \
--output LLaVA-OneVision-1.5-4B-stage0

Note: When merging weights, the adapter component will be initialized with default values.

This step initializes the model. You can either download it directly from Hugging Face, or merge the two base models yourself, initialize the adapter parameters, and export the result to the target directory LLaVA-OneVision-1.5-4B-stage0. The model looks roughly like this:

(Pdb) self.model
LLaVAOneVision1_5_Model(
  (visual): RiceTransformerPretrainedModel(
    (patch_embed): RicePatchEmbed(
      (proj): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    )
    (rotary_pos_emb): RiceRotaryEmbedding()
    (pre_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (blocks): ModuleList(
      (0-23): 24 x RiceBlock(
        (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): RiceSdpaAttention(
          (qkv): Linear(in_features=1024, out_features=3072, bias=True)
          (proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (mlp): RiceMlp(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (act): GELUActivation()
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (merger): RicePatchMerger(
      (ln_q): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=4096, out_features=4096, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=4096, out_features=2560, bias=True)
      )
    )
  )
  (language_model): LLaVAOneVision1_5_TextModel(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x LLaVAOneVision1_5_DecoderLayer(
        (self_attn): LLaVAOneVision1_5_SdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): LLaVAOneVision1_5_RMSNorm((128,), eps=1e-06)
          (k_norm): LLaVAOneVision1_5_RMSNorm((128,), eps=1e-06)
        )
        (mlp): LLaVAOneVision1_5_MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
        (post_attention_layernorm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
      )
    )
    (norm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
    (rotary_emb): LLaVAOneVision1_5_RotaryEmbedding()
  )
)

The model consists of two sub-models: a language model and a vision model. The vision model is rice-vit and the language model is Qwen. I tried the manual merge; during verification one of the vision model's blocks produced NaN outputs, so the vision model check failed. I have not found the cause yet, so I used the downloaded model instead.
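If you want to debug a manual merge, a reasonable first step is to scan the merged weights for non-finite values before running the full verification. The sketch below is my own, not part of the repo; it assumes the merged checkpoint in LLaVA-OneVision-1.5-4B-stage0 can be loaded through transformers' AutoModel with trust_remote_code.

# Sketch (not from the repo): scan the merged weights for NaN/Inf before verification.
# Assumes the checkpoint loads via transformers AutoModel with trust_remote_code=True.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("LLaVA-OneVision-1.5-4B-stage0", trust_remote_code=True)
bad = [name for name, param in model.named_parameters() if not torch.isfinite(param).all()]
print("parameters containing NaN/Inf:", bad or "none")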

2 Alignment training on image-text pairs

(1) Dataset

The dataset consists of image-text pairs: each image is paired with one caption, migrated from other datasets. Take LLaVA-558K-Webdataset as an example. For training convenience, its layout follows the Megatron Energon data format: it contains several archives, pretrain-000000.tar through pretrain-000004.tar, together with the corresponding idx files. Unpacking pretrain-000000.tar shows grouped images and their matching JSON files:

-r--r--r-- 1 ubuntu ubuntu  33378 Sep 18 17:31 ps_00022075.img0.jpg
-r--r--r-- 1 ubuntu ubuntu  31665 Sep 18 17:31 ps_00022075.img10.jpg
-r--r--r-- 1 ubuntu ubuntu  47355 Sep 18 17:31 ps_00022075.img11.jpg
-r--r--r-- 1 ubuntu ubuntu  36372 Sep 18 17:31 ps_00022075.img12.jpg
-r--r--r-- 1 ubuntu ubuntu  35609 Sep 18 17:31 ps_00022075.img13.jpg
-r--r--r-- 1 ubuntu ubuntu  37153 Sep 18 17:31 ps_00022075.img14.jpg
-r--r--r-- 1 ubuntu ubuntu  19173 Sep 18 17:31 ps_00022075.img15.jpg
-r--r--r-- 1 ubuntu ubuntu  42421 Sep 18 17:31 ps_00022075.img16.jpg
-r--r--r-- 1 ubuntu ubuntu  29117 Sep 18 17:31 ps_00022075.img17.jpg
-r--r--r-- 1 ubuntu ubuntu  13242 Sep 18 17:31 ps_00022075.img18.jpg
-r--r--r-- 1 ubuntu ubuntu  16562 Sep 18 17:31 ps_00022075.img1.jpg
-r--r--r-- 1 ubuntu ubuntu  22301 Sep 18 17:31 ps_00022075.img2.jpg
-r--r--r-- 1 ubuntu ubuntu  47226 Sep 18 17:31 ps_00022075.img3.jpg
-r--r--r-- 1 ubuntu ubuntu  27445 Sep 18 17:31 ps_00022075.img4.jpg
-r--r--r-- 1 ubuntu ubuntu  41999 Sep 18 17:31 ps_00022075.img5.jpg
-r--r--r-- 1 ubuntu ubuntu  16453 Sep 18 17:31 ps_00022075.img6.jpg
-r--r--r-- 1 ubuntu ubuntu  20248 Sep 18 17:31 ps_00022075.img7.jpg
-r--r--r-- 1 ubuntu ubuntu  35984 Sep 18 17:31 ps_00022075.img8.jpg
-r--r--r-- 1 ubuntu ubuntu  42419 Sep 18 17:31 ps_00022075.img9.jpg
-r--r--r-- 1 ubuntu ubuntu   3779 Sep 18 17:31 ps_00022075.json
You can see a set of source images and one corresponding JSON file. The JSON file has three keys: prompts, captions, and images. prompts holds the instruction for what to describe, captions holds the caption answers, and images holds the image paths. Each key's value is a list, and entries at the same position correspond to the same image. Pick one image and see what the pair looks like.

Its prompt is: Write a terse but informative summary of the picture.

Its caption is: the most cat litter and feeding basket gift basket

Rather terse.
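Because each sample is just a set of images plus one JSON file with prompts, captions, and images lists, you can peek at the data with nothing but the standard library. A rough sketch, assuming the tar layout shown above (the path and member names are illustrative):

# Rough sketch: read one sample of LLaVA-558K-Webdataset with tarfile + json only.
# Paths and member names follow the listing above; adjust to your local layout.
import json
import tarfile

with tarfile.open("LLaVA-558K-Webdataset/pretrain-000000.tar") as tar:
    json_name = next(m.name for m in tar.getmembers() if m.name.endswith(".json"))
    sample = json.load(tar.extractfile(json_name))

# Each key holds a list; entries at the same index belong to the same image.
for prompt, caption, image in zip(sample["prompts"], sample["captions"], sample["images"]):
    print(image, "|", prompt, "|", caption)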

Before starting the training, install the required packages: pip install -r requirements.txt

Then launch:

AIAK_TRAINING_PATH=../LLaVA-OneVision-1.5-main DATA_PATH=LLaVA-558K-Webdataset TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh

(2) Errors during the run

Error 1:

ModuleNotFoundError: No module named 'transformer_engine'

For installing transformer_engine, see Reference 4.

Error 2:

AIAK_TRAINING_PATH=../LLaVA-OneVision-1.5-main DATA_PATH=LLaVA-558K-Webdataset TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh

NameError: name 'ApexFusedRMSNorm' is not defined. Locating ApexFusedRMSNorm shows it is imported like this:

from apex.normalization.fused_layer_norm import FusedRMSNorm as ApexFusedRMSNorm

The apex package is not installed. Installing it with pip install apex and running again still fails:

  File "/home/ubuntu/anaconda3/envs/finetunning/lib/python3.11/site-packages/apex/__init__.py", line 13, in <module>
    from pyramid.session import UnencryptedCookieSessionFactoryConfig
ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)
This is the wrong apex package. Following Reference 2, apex has to be installed from source as follows:

git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install 
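To confirm the source install, the import that originally failed can be re-checked in isolation; a quick sketch:

# Quick re-check (sketch): the import that raised the NameError should now resolve.
from apex.normalization.fused_layer_norm import FusedRMSNorm as ApexFusedRMSNorm
print(ApexFusedRMSNorm)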

With that, the error is gone; continue with the stage_1 training.

Error 3

  File "/home/ubuntu/anaconda3/envs/finetunning/lib/python3.11/site-packages/torch/cuda/__init__.py", line 567, in set_device
        torch._C._cuda_setDevice(device)torch._C._cuda_setDevice(device)

torchtorch..AcceleratorErrorAcceleratorError: : CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?

This machine has only one 4090 GPU, and the error says the device ordinal is out of range (it should be 0), so the number of GPUs the training script tries to use is probably wrong. Looking at stage_1_alignment_llava_ov_4b.sh, there is a line

GPUS_PER_NODE=8

which assigns 8 GPUs per node, far more than this setup has. Change it to 1 (the sketch below shows how to confirm how many devices torch actually sees) and run again.
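# Quick check: how many CUDA devices can torch see?
# "invalid device ordinal" means some rank asked for an index >= this count.
import torch

n = torch.cuda.device_count()
print("visible CUDA devices:", n)               # 1 on this single-4090 machine
print("valid device ordinals:", list(range(n)))  # only 0 is valid here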

Error 4

/usr/include/pybind11/detail/common.h:215:10: fatal error: Python.h: No such file or directory
  215 | #include <Python.h>

Following Reference 3, the directory containing Python.h has to be added to the include path. Here it is not /usr/local/python3.10 but the virtual environment's include directory, /home/ubuntu/anaconda3/envs/finetunning/include/python3.11 (the sketch below shows how to look it up). Run again.
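# Print the include directory of the running interpreter; this is the path the
# compiler needs to see (e.g. exported via CPATH) so that Python.h can be found.
import sysconfig

print(sysconfig.get_paths()["include"])
# on this machine: /home/ubuntu/anaconda3/envs/finetunning/include/python3.11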

Error 5

ModuleNotFoundError: No module named 'fused_layer_norm_cuda' when instantiating LocalNorm

apex was built without the CUDA and C++ extensions; change the install command to:

python setup.py install --cpp_ext --cuda_ext
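After rebuilding with the extensions, the compiled module named in the error can be imported directly to verify the build; a quick sketch:

# Quick check (sketch): the CUDA extension built by --cuda_ext should now import.
import fused_layer_norm_cuda
print("fused_layer_norm_cuda is importable")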

Error 6

WARNING:DotProductAttention:flash-attn may provide important feature support or performance improvement. Please install flash-attn >= 2.1.1, <= 2.8.3 by pip3 install flash-attn==<version>.
WARNING:megatron.core.utils:No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.

[rank0]: ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
 

Install flash-attn: pip install flash-attn --no-build-isolation. Alternatively, install a pre-built wheel or build from source; a pre-built wheel requires a matching environment, i.e. the torch, CUDA, and Python versions must all line up.

In the current environment, torch is 2.9.0 and transformer_engine / transformer_engine_torch are 2.10.0. The current flash_attn 2 release is 2.8.1; its wheel flash_attn-2.8.1+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl supports torch 2.9.0 but requires Python 3.12. Downgrading also solves the problem, provided the CUDA, cuDNN, torch, flash_attn, and transformer_engine versions all match. The versions after the change are:

cuda                      12.6
torch                     2.7.1+cu126
flash_attn                2.8.1 (installed from flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl)
transformer_engine        2.7.0
transformer_engine_cu12   2.7.0
transformer_engine_torch  2.7.0
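Before relaunching, it is worth confirming that the installed versions really match this combination. A small sketch using importlib.metadata (the distribution names are the ones shown by pip list above; adjust them if your wheels use different names):

# Sanity check (sketch): compare installed versions against the combination above.
from importlib.metadata import PackageNotFoundError, version

expected = {
    "torch": "2.7.1",
    "flash_attn": "2.8.1",
    "transformer_engine": "2.7.0",
    "transformer_engine_torch": "2.7.0",
}

for pkg, want in expected.items():
    try:
        got = version(pkg)
    except PackageNotFoundError:
        got = "NOT INSTALLED"
    status = "ok" if got.startswith(want) else "MISMATCH"
    print(f"{pkg:<26} installed={got:<16} expected={want}  [{status}]")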
Running again, this stage now trains normally:

[2025-12-23 19:35:20] iteration       41/    2500 | consumed samples:          328 | elapsed time per iteration (ms): 12330.9 | throughput (token/sec/GPU): 3925.3 | learning rate: 9.994915E-05 | global batch size:     8 | lm loss: 4.042104E+00 | loss scale: 1.0 | grad norm: 1.191 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:35:33] iteration       42/    2500 | consumed samples:          336 | elapsed time per iteration (ms): 12731.7 | throughput (token/sec/GPU): 3978.4 | learning rate: 9.994629E-05 | global batch size:     8 | lm loss: 4.256773E+00 | loss scale: 1.0 | grad norm: 0.940 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:35:46] iteration       43/    2500 | consumed samples:          344 | elapsed time per iteration (ms): 13025.3 | throughput (token/sec/GPU): 3899.3 | learning rate: 9.994335E-05 | global batch size:     8 | lm loss: 4.183313E+00 | loss scale: 1.0 | grad norm: 0.834 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:35:58] iteration       44/    2500 | consumed samples:          352 | elapsed time per iteration (ms): 12038.4 | throughput (token/sec/GPU): 3876.6 | learning rate: 9.994033E-05 | global batch size:     8 | lm loss: 4.031400E+00 | loss scale: 1.0 | grad norm: 0.714 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:11] iteration       45/    2500 | consumed samples:          360 | elapsed time per iteration (ms): 12927.5 | throughput (token/sec/GPU): 3798.3 | learning rate: 9.993723E-05 | global batch size:     8 | lm loss: 4.208323E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:24] iteration       46/    2500 | consumed samples:          368 | elapsed time per iteration (ms): 12839.2 | throughput (token/sec/GPU): 3844.2 | learning rate: 9.993405E-05 | global batch size:     8 | lm loss: 4.360492E+00 | loss scale: 1.0 | grad norm: 2.003 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:37] iteration       47/    2500 | consumed samples:          376 | elapsed time per iteration (ms): 12916.7 | throughput (token/sec/GPU): 3762.7 | learning rate: 9.993080E-05 | global batch size:     8 | lm loss: 4.059428E+00 | loss scale: 1.0 | grad norm: 1.276 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:49] iteration       48/    2500 | consumed samples:          384 | elapsed time per iteration (ms): 12855.7 | throughput (token/sec/GPU): 3811.3 | learning rate: 9.992746E-05 | global batch size:     8 | lm loss: 3.937055E+00 | loss scale: 1.0 | grad norm: 0.701 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:37:02] iteration       49/    2500 | consumed samples:          392 | elapsed time per iteration (ms): 12480.5 | throughput (token/sec/GPU): 3861.7 | learning rate: 9.992405E-05 | global batch size:     8 | lm loss: 4.157461E+00 | loss scale: 1.0 | grad norm: 0.605 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |

(3) Additional notes

This step is the stage 1 training. In this stage, per the training configuration (trainable_modules=['adapter']), only the adapter parameters are trained and everything else is frozen, which is why a single machine with a single GPU is enough.

The adapter looks like this:

      (adapter): Adapter(
        (layernorm): FusedLayerNorm()
        (linear_fc1): TELinear()
        (linear_fc2): TELinear()
      )

As you can see, the adapter is simple: a layer norm followed by two linear layers, with 27,271,680 parameters in total, which is quite small. A quick way to confirm that count is sketched below.
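The helper below is a minimal sketch of my own (the function name is not from the repo); it counts only the parameters that still require gradients once everything outside the adapter is frozen.

# Minimal sketch: count the parameters that will actually be updated, i.e. the adapter
# when trainable_modules=['adapter'] freezes everything else.
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Number of elements across all parameters with requires_grad=True."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage, after the training script has applied the freezing:
#   print(count_trainable_parameters(model))   # reported as 27271680 in this run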

References:

1. https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5

2. https://zhuanlan.zhihu.com/p/654726878

3. https://blog.csdn.net/weixin_40511249/article/details/109136597

4. https://blog.csdn.net/robator/article/details/156082990?spm=1011.2415.3001.5331

5. https://blog.csdn.net/li_jiaoyang/article/details/117431876
