LLaVA-OneVision-1.5 is a recently open-sourced multimodal large-model training framework from EvolvingLMMs-Lab (Reference 1). It includes the datasets, models, and training recipes, and is very helpful for training multimodal LLMs, so I am sharing my walkthrough here; comments and corrections are welcome.

The environment is as follows:

Ubuntu 22.04, conda 23.7.4, CUDA 12.6, Python 3.11; kernel: Linux ubuntu-workstation 6.8.0-54-generic #56~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb  8 11:41:24 UTC 2 x86_64 x86_64 x86_64 GNU/Linux; Intel(R) Core(TM) i9-14900K; a single RTX 4090 GPU

The finetunning virtual environment contains:

torch 2.9.1+cu116, transformer_engine 2.10

Following the README.md, each training step is walked through below.

The README.md recommends running everything via Docker; this walkthrough does not use Docker.

1 Download an existing model or merge the initial weights manually

You have two options to get started with LLaVA-OneVision-1.5-stage-0:

Option 1: Download pre-trained model from Hugging Face

Download our LLaVA-OneVision-1.5-4B-stage0 model directly from Hugging Face.

Option 2: Merge initial weights yourself

Alternatively, you can merge the initial weights from the original ViT and LLM:

python ds/merge_model.py \
--vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
--llm_path Qwen/Qwen3-4B-Instruct-2507 \
--output LLaVA-OneVision-1.5-4B-stage0

Note: When merging weights, the adapter component will be initialized with default values.

This step initializes the model. You can either download it directly from Hugging Face, or merge the two base models yourself, initialize the adapter parameters, and export the result to the target directory LLaVA-OneVision-1.5-4B-stage0. The model looks roughly like this:

(Pdb) self.model
LLaVAOneVision1_5_Model(
  (visual): RiceTransformerPretrainedModel(
    (patch_embed): RicePatchEmbed(
      (proj): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    )
    (rotary_pos_emb): RiceRotaryEmbedding()
    (pre_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (blocks): ModuleList(
      (0-23): 24 x RiceBlock(
        (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): RiceSdpaAttention(
          (qkv): Linear(in_features=1024, out_features=3072, bias=True)
          (proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (mlp): RiceMlp(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (act): GELUActivation()
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (merger): RicePatchMerger(
      (ln_q): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=4096, out_features=4096, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=4096, out_features=2560, bias=True)
      )
    )
  )
  (language_model): LLaVAOneVision1_5_TextModel(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x LLaVAOneVision1_5_DecoderLayer(
        (self_attn): LLaVAOneVision1_5_SdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): LLaVAOneVision1_5_RMSNorm((128,), eps=1e-06)
          (k_norm): LLaVAOneVision1_5_RMSNorm((128,), eps=1e-06)
        )
        (mlp): LLaVAOneVision1_5_MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
        (post_attention_layernorm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
      )
    )
    (norm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
    (rotary_emb): LLaVAOneVision1_5_RotaryEmbedding()
  )
)

The model consists of two sub-models: a language model and a vision model. The vision model is rice-vit and the language model is Qwen. I tried the manual merge; during verification one of the vision model's blocks produced NaN outputs, so the vision model check failed. I have not found the cause yet, so I used the downloaded model instead.
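If you want to debug a manual merge, a reasonable first step is to scan the merged weights for non-finite values before running the full verification. The sketch below is my own, not part of the repo; it assumes the merged checkpoint in LLaVA-OneVision-1.5-4B-stage0 can be loaded through transformers' AutoModel with trust_remote_code.

# Sketch (not from the repo): scan the merged weights for NaN/Inf before verification.
# Assumes the checkpoint loads via transformers AutoModel with trust_remote_code=True.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("LLaVA-OneVision-1.5-4B-stage0", trust_remote_code=True)
bad = [name for name, param in model.named_parameters() if not torch.isfinite(param).all()]
print("parameters containing NaN/Inf:", bad or "none")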

2 Alignment training on image-text pairs

(1) Dataset

The dataset consists of image-text pairs: each image is paired with one caption, migrated from other datasets. Take LLaVA-558K-Webdataset as an example. For training convenience, its layout follows the Megatron Energon data format: it contains several archives, pretrain-000000.tar through pretrain-000004.tar, together with the corresponding idx files. Unpacking pretrain-000000.tar shows grouped images and their matching JSON files:

-r--r--r-- 1 ubuntu ubuntu  33378 Sep 18 17:31 ps_00022075.img0.jpg
-r--r--r-- 1 ubuntu ubuntu  31665 Sep 18 17:31 ps_00022075.img10.jpg
-r--r--r-- 1 ubuntu ubuntu  47355 Sep 18 17:31 ps_00022075.img11.jpg
-r--r--r-- 1 ubuntu ubuntu  36372 Sep 18 17:31 ps_00022075.img12.jpg
-r--r--r-- 1 ubuntu ubuntu  35609 Sep 18 17:31 ps_00022075.img13.jpg
-r--r--r-- 1 ubuntu ubuntu  37153 Sep 18 17:31 ps_00022075.img14.jpg
-r--r--r-- 1 ubuntu ubuntu  19173 Sep 18 17:31 ps_00022075.img15.jpg
-r--r--r-- 1 ubuntu ubuntu  42421 Sep 18 17:31 ps_00022075.img16.jpg
-r--r--r-- 1 ubuntu ubuntu  29117 Sep 18 17:31 ps_00022075.img17.jpg
-r--r--r-- 1 ubuntu ubuntu  13242 Sep 18 17:31 ps_00022075.img18.jpg
-r--r--r-- 1 ubuntu ubuntu  16562 Sep 18 17:31 ps_00022075.img1.jpg
-r--r--r-- 1 ubuntu ubuntu  22301 Sep 18 17:31 ps_00022075.img2.jpg
-r--r--r-- 1 ubuntu ubuntu  47226 Sep 18 17:31 ps_00022075.img3.jpg
-r--r--r-- 1 ubuntu ubuntu  27445 Sep 18 17:31 ps_00022075.img4.jpg
-r--r--r-- 1 ubuntu ubuntu  41999 Sep 18 17:31 ps_00022075.img5.jpg
-r--r--r-- 1 ubuntu ubuntu  16453 Sep 18 17:31 ps_00022075.img6.jpg
-r--r--r-- 1 ubuntu ubuntu  20248 Sep 18 17:31 ps_00022075.img7.jpg
-r--r--r-- 1 ubuntu ubuntu  35984 Sep 18 17:31 ps_00022075.img8.jpg
-r--r--r-- 1 ubuntu ubuntu  42419 Sep 18 17:31 ps_00022075.img9.jpg
-r--r--r-- 1 ubuntu ubuntu   3779 Sep 18 17:31 ps_00022075.json
You can see a set of source images and one corresponding JSON file. The JSON file has three keys: prompts, captions, and images. prompts holds the instruction for what to describe, captions holds the caption answers, and images holds the image paths. Each key's value is a list, and entries at the same position correspond to the same image. Pick one image and see what the pair looks like.

Its prompt is: Write a terse but informative summary of the picture.

Its caption is: the most cat litter and feeding basket gift basket

Rather terse.
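Because each sample is just a set of images plus one JSON file with prompts, captions, and images lists, you can peek at the data with nothing but the standard library. A rough sketch, assuming the tar layout shown above (the path and member names are illustrative):

# Rough sketch: read one sample of LLaVA-558K-Webdataset with tarfile + json only.
# Paths and member names follow the listing above; adjust to your local layout.
import json
import tarfile

with tarfile.open("LLaVA-558K-Webdataset/pretrain-000000.tar") as tar:
    json_name = next(m.name for m in tar.getmembers() if m.name.endswith(".json"))
    sample = json.load(tar.extractfile(json_name))

# Each key holds a list; entries at the same index belong to the same image.
for prompt, caption, image in zip(sample["prompts"], sample["captions"], sample["images"]):
    print(image, "|", prompt, "|", caption)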

Before starting the training, install the required packages: pip install -r requirements.txt

Then launch:

AIAK_TRAINING_PATH=../LLaVA-OneVision-1.5-main DATA_PATH=LLaVA-558K-Webdataset TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh

(2) Errors during the run

Error 1:

ModuleNotFoundError: No module named 'transformer_engine'

For installing transformer_engine, see Reference 4.

Error 2:

AIAK_TRAINING_PATH=../LLaVA-OneVision-1.5-main DATA_PATH=LLaVA-558K-Webdataset TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh

NameError: name 'ApexFusedRMSNorm' is not defined. Locating ApexFusedRMSNorm shows it is imported like this:

from apex.normalization.fused_layer_norm import FusedRMSNorm as ApexFusedRMSNorm

The apex package is not installed. Installing it with pip install apex and running again still fails:

  File "/home/ubuntu/anaconda3/envs/finetunning/lib/python3.11/site-packages/apex/__init__.py", line 13, in <module>
    from pyramid.session import UnencryptedCookieSessionFactoryConfig
ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)
This is the wrong apex package. Following Reference 2, apex has to be installed from source as follows:

git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install 
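To confirm the source install, the import that originally failed can be re-checked in isolation; a quick sketch:

# Quick re-check (sketch): the import that raised the NameError should now resolve.
from apex.normalization.fused_layer_norm import FusedRMSNorm as ApexFusedRMSNorm
print(ApexFusedRMSNorm)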

With that, the error is gone; continue with the stage_1 training.

Error 3

  File "/home/ubuntu/anaconda3/envs/finetunning/lib/python3.11/site-packages/torch/cuda/__init__.py", line 567, in set_device
        torch._C._cuda_setDevice(device)torch._C._cuda_setDevice(device)

torchtorch..AcceleratorErrorAcceleratorError: : CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?

This machine has only one 4090 GPU, and the error says the device ordinal is out of range (it should be 0), so the number of GPUs the training script tries to use is probably wrong. Looking at stage_1_alignment_llava_ov_4b.sh, there is a line

GPUS_PER_NODE=8

which assigns 8 GPUs per node, far more than this setup has. Change it to 1 (the sketch below shows how to confirm how many devices torch actually sees) and run again.
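# Quick check: how many CUDA devices can torch see?
# "invalid device ordinal" means some rank asked for an index >= this count.
import torch

n = torch.cuda.device_count()
print("visible CUDA devices:", n)               # 1 on this single-4090 machine
print("valid device ordinals:", list(range(n)))  # only 0 is valid here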

Error 4

/usr/include/pybind11/detail/common.h:215:10: fatal error: Python.h: No such file or directory
  215 | #include <Python.h>

Following Reference 3, the directory containing Python.h has to be added to the include path. Here it is not /usr/local/python3.10 but the virtual environment's include directory, /home/ubuntu/anaconda3/envs/finetunning/include/python3.11 (the sketch below shows how to look it up). Run again.
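# Print the include directory of the running interpreter; this is the path the
# compiler needs to see (e.g. exported via CPATH) so that Python.h can be found.
import sysconfig

print(sysconfig.get_paths()["include"])
# on this machine: /home/ubuntu/anaconda3/envs/finetunning/include/python3.11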

Error 5

ModuleNotFoundError: No module named 'fused_layer_norm_cuda' when instantiating LocalNorm

apex was built without the CUDA and C++ extensions; change the install command to:

python setup.py install --cpp_ext --cuda_ext
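After rebuilding with the extensions, the compiled module named in the error can be imported directly to verify the build; a quick sketch:

# Quick check (sketch): the CUDA extension built by --cuda_ext should now import.
import fused_layer_norm_cuda
print("fused_layer_norm_cuda is importable")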

Error 6

WARNING:DotProductAttention:flash-attn may provide important feature support or performance improvement. Please install flash-attn >= 2.1.1, <= 2.8.3 by pip3 install flash-attn==<version>.
WARNING:megatron.core.utils:No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.

[rank0]: ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
 

Install flash-attn: pip install flash-attn --no-build-isolation. Alternatively, install a pre-built wheel or build from source; a pre-built wheel requires a matching environment, i.e. the torch, CUDA, and Python versions must all line up.

In the current environment, torch is 2.9.0 and transformer_engine / transformer_engine_torch are 2.10.0. The current flash_attn 2 release is 2.8.1; its wheel flash_attn-2.8.1+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl supports torch 2.9.0 but requires Python 3.12. Downgrading also solves the problem, provided the CUDA, cuDNN, torch, flash_attn, and transformer_engine versions all match. The versions after the change are:

cuda                      12.6
torch                     2.7.1+cu126
flash_attn                2.8.1 (installed from flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl)
transformer_engine        2.7.0
transformer_engine_cu12   2.7.0
transformer_engine_torch  2.7.0
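Before relaunching, it is worth confirming that the installed versions really match this combination. A small sketch using importlib.metadata (the distribution names are the ones shown by pip list above; adjust them if your wheels use different names):

# Sanity check (sketch): compare installed versions against the combination above.
from importlib.metadata import PackageNotFoundError, version

expected = {
    "torch": "2.7.1",
    "flash_attn": "2.8.1",
    "transformer_engine": "2.7.0",
    "transformer_engine_torch": "2.7.0",
}

for pkg, want in expected.items():
    try:
        got = version(pkg)
    except PackageNotFoundError:
        got = "NOT INSTALLED"
    status = "ok" if got.startswith(want) else "MISMATCH"
    print(f"{pkg:<26} installed={got:<16} expected={want}  [{status}]")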
Running again, this stage now trains normally:

[2025-12-23 19:35:20] iteration       41/    2500 | consumed samples:          328 | elapsed time per iteration (ms): 12330.9 | throughput (token/sec/GPU): 3925.3 | learning rate: 9.994915E-05 | global batch size:     8 | lm loss: 4.042104E+00 | loss scale: 1.0 | grad norm: 1.191 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:35:33] iteration       42/    2500 | consumed samples:          336 | elapsed time per iteration (ms): 12731.7 | throughput (token/sec/GPU): 3978.4 | learning rate: 9.994629E-05 | global batch size:     8 | lm loss: 4.256773E+00 | loss scale: 1.0 | grad norm: 0.940 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:35:46] iteration       43/    2500 | consumed samples:          344 | elapsed time per iteration (ms): 13025.3 | throughput (token/sec/GPU): 3899.3 | learning rate: 9.994335E-05 | global batch size:     8 | lm loss: 4.183313E+00 | loss scale: 1.0 | grad norm: 0.834 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:35:58] iteration       44/    2500 | consumed samples:          352 | elapsed time per iteration (ms): 12038.4 | throughput (token/sec/GPU): 3876.6 | learning rate: 9.994033E-05 | global batch size:     8 | lm loss: 4.031400E+00 | loss scale: 1.0 | grad norm: 0.714 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:11] iteration       45/    2500 | consumed samples:          360 | elapsed time per iteration (ms): 12927.5 | throughput (token/sec/GPU): 3798.3 | learning rate: 9.993723E-05 | global batch size:     8 | lm loss: 4.208323E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:24] iteration       46/    2500 | consumed samples:          368 | elapsed time per iteration (ms): 12839.2 | throughput (token/sec/GPU): 3844.2 | learning rate: 9.993405E-05 | global batch size:     8 | lm loss: 4.360492E+00 | loss scale: 1.0 | grad norm: 2.003 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:37] iteration       47/    2500 | consumed samples:          376 | elapsed time per iteration (ms): 12916.7 | throughput (token/sec/GPU): 3762.7 | learning rate: 9.993080E-05 | global batch size:     8 | lm loss: 4.059428E+00 | loss scale: 1.0 | grad norm: 1.276 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:36:49] iteration       48/    2500 | consumed samples:          384 | elapsed time per iteration (ms): 12855.7 | throughput (token/sec/GPU): 3811.3 | learning rate: 9.992746E-05 | global batch size:     8 | lm loss: 3.937055E+00 | loss scale: 1.0 | grad norm: 0.701 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-23 19:37:02] iteration       49/    2500 | consumed samples:          392 | elapsed time per iteration (ms): 12480.5 | throughput (token/sec/GPU): 3861.7 | learning rate: 9.992405E-05 | global batch size:     8 | lm loss: 4.157461E+00 | loss scale: 1.0 | grad norm: 0.605 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |

(3) Additional notes

This step is the stage 1 training. In this stage, per the training configuration (trainable_modules=['adapter']), only the adapter parameters are trained and everything else is frozen, which is why a single machine with a single GPU is enough.

The adapter looks like this:

      (adapter): Adapter(
        (layernorm): FusedLayerNorm()
        (linear_fc1): TELinear()
        (linear_fc2): TELinear()
      )

As you can see, the adapter is simple: a layer norm followed by two linear layers, with 27,271,680 parameters in total, which is quite small. A quick way to confirm that count is sketched below.
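The helper below is a minimal sketch of my own (the function name is not from the repo); it counts only the parameters that still require gradients once everything outside the adapter is frozen.

# Minimal sketch: count the parameters that will actually be updated, i.e. the adapter
# when trainable_modules=['adapter'] freezes everything else.
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Number of elements across all parameters with requires_grad=True."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage, after the training script has applied the freezing:
#   print(count_trainable_parameters(model))   # reported as 27271680 in this run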

References:

1. https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5

2. https://zhuanlan.zhihu.com/p/654726878

3. https://blog.csdn.net/weixin_40511249/article/details/109136597

4. https://blog.csdn.net/robator/article/details/156082990?spm=1011.2415.3001.5331

5. https://blog.csdn.net/li_jiaoyang/article/details/117431876
