The fine-tuning steps are simple:

1. Download two repositories

- SenseVoice
- FunASR

Reading SenseVoice's README_zh shows that fine-tuning SenseVoice is just a matter of running finetune.sh (everything lives inside the SenseVoice folder). The README_zh also notes: "remember to change `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` inside the FunASR installation from the earlier step", which is why FunASR needs to be downloaded as well.

2. Install the funasr package inside the FunASR folder as a local (editable) Python package. Once that succeeds, `pip list` shows:

funasr             1.2.7         /home/****/*****/*****/FunASR

You may have noticed from this that the funasr mentioned in the README above is actually a subfolder of the FunASR folder; nothing else in FunASR is used.
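To confirm from Python that the editable install took effect (i.e. that `funasr` resolves to your local FunASR checkout rather than a copy in site-packages), a quick check:

```python
import importlib.util

# Locate the funasr module without importing it fully.
spec = importlib.util.find_spec("funasr")
if spec is None:
    print("funasr is not importable - the editable install did not take effect")
else:
    # For an editable install this path should sit inside your FunASR clone,
    # not under site-packages.
    print("funasr loaded from:", spec.origin)
```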

3. Data preparation. This is the critical part. At first I forgot where I had seen the input format, only half-remembered it, and on a whim hand-rolled a jsonl with my own code:

# The standard format
{"key": "000559", "source": "/to_your_path/a.mp3", "source_len": 74, "target": "你的音频内容", "target_len": 2, "with_or_wo_itn": "<|woitn|>", "text_language": "<|zh|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>"}

# My hand-rolled, incomplete version
{"key": "000559", "source": "/to_your_path/a.mp3","target": "你的音频内容"}

This data format is also spelled out in README_zh, so just reuse it. Of course, your own data may not carry as many labels as the standard example; that is fine. Per the data-preparation section of the README, when you only have source and target, prepare the txt and scp files as specified and run:

sensevoice2jsonl \
++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
++data_type_list='["source", "target"]' \
++jsonl_file_out="../../../data/list/train.jsonl" \
++model_dir='iic/SenseVoiceSmall'

That generates the jsonl file finetune needs. If you have additional labels, read the README for the other commands.
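If your transcripts start out as a plain list of (key, wav path, text) triples, writing the two input files is straightforward. A minimal sketch, assuming the usual Kaldi-style "key path" / "key transcript" line layout (the sample entries below are placeholders):

```python
# Sketch: produce the train_wav.scp / train_text.txt pair that sensevoice2jsonl
# consumes. The (key, wav path, transcript) samples below are placeholders.
samples = [
    ("000559", "/abs/path/a.mp3", "你的音频内容"),
    ("000560", "/abs/path/b.mp3", "第二条音频"),
]

with open("train_wav.scp", "w", encoding="utf-8") as scp, \
     open("train_text.txt", "w", encoding="utf-8") as txt:
    for key, wav, text in samples:
        scp.write(f"{key} {wav}\n")   # one "key path" per line
        txt.write(f"{key} {text}\n")  # one "key transcript" per line
```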

The data format really matters: my hand-rolled dataset above would train, but across several runs the accuracy never changed at all. Also, prefer absolute paths everywhere; I forget the exact pitfall, but the conclusion I came away with was that absolute paths are safer.
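One way to catch the incomplete-jsonl trap early is a quick field check before training. The required field names are copied from the standard example above; the sample line is my stripped-down version:

```python
import json

# Field names taken from the standard SenseVoice training jsonl example.
REQUIRED = {
    "key", "source", "source_len", "target", "target_len",
    "with_or_wo_itn", "text_language", "emo_target", "event_target",
}

def check_jsonl_line(line: str) -> set:
    """Return the set of fields missing from one jsonl record."""
    rec = json.loads(line)
    return REQUIRED - rec.keys()

# A stripped-down record like the one I hand-rolled fails the check:
bad = '{"key": "000559", "source": "/to_your_path/a.mp3", "target": "你的音频内容"}'
print(sorted(check_jsonl_line(bad)))
```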

4. Fine-tuning

bash finetune.sh

Navigate to the directory in a terminal and run the command above.

First, a look at the launch parameters:

torchrun $DISTRIBUTED_ARGS \
${train_tool} \
++model="${model_name_or_model_dir}" \
++trust_remote_code=true \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=600  \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=4 \
++train_conf.max_epoch=200 \
++train_conf.log_interval=10 \
++train_conf.resume=true \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++train_conf.use_deepspeed=false \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}

The SenseVoice README does not explain these, but the README under FunASR's docs covers most of them; adapt them to your needs, or have an AI explain them one by one.
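For example, `dataset_conf.batch_type="token"` with `batch_size=600` means batches are packed by total length (tokens/frames) rather than by utterance count. A rough sketch of the idea, not FunASR's actual sampler:

```python
# Sketch of token-type batching: greedily pack utterances into a batch until
# the summed length would exceed max_tokens, then start a new batch.
def token_batches(lengths, max_tokens=600):
    batches, cur, cur_tokens = [], [], 0
    for i, n in enumerate(lengths):
        if cur and cur_tokens + n > max_tokens:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(i)
        cur_tokens += n
    if cur:
        batches.append(cur)
    return batches

print(token_batches([300, 200, 250, 100]))  # → [[0, 1], [2, 3]]
```

This is why short clips end up many to a batch while long ones go almost alone, which keeps per-step memory roughly constant.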

Only two things actually need changing:

1. the dataset paths

2. the path to train_ds.py (as the README above notes, it calls Alibaba's funasr)

Here is my finetune.sh; it works with minor tweaks. `pwd` refers to the SenseVoice folder, so put the dataset under SenseVoice's data folder.

# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
#  MIT License  (https://opensource.org/licenses/MIT)

workspace=`pwd`

# which gpu to train or finetune
export CUDA_VISIBLE_DEVICES="0,1,2,3"
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

# model_name from model_hub, or model_dir in local path

## option 1, download model automatically
model_name_or_model_dir="iic/SenseVoiceSmall"

## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}


# data dir, which contains: train.jsonl, val.jsonl
train_data=${workspace}/data/train.jsonl
val_data=${workspace}/data/val.jsonl

# exp output dir
output_dir="./outputs"
log_file="${output_dir}/log.txt"

deepspeed_config=${workspace}/deepspeed_conf/ds_stage1.json

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

DISTRIBUTED_ARGS="
    --nnodes ${WORLD_SIZE:-1} \
    --nproc_per_node $gpu_num \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-26669}
"

echo $DISTRIBUTED_ARGS

# funasr trainer path

# if [ -f `dirname $(which funasr)`/train_ds.py ]; then
#     train_tool=`dirname $(which funasr)`/train_ds.py
# elif [ -f `dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py ]; then
#     train_tool=`dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py
# else
#     echo "Error: train_ds.py not found in funasr bin directory."
#     exit 1
# fi
# ABSOLUTE_PATH=$(cd $(dirname $train_tool); pwd)

train_tool=/to/your/path/FunASR/funasr/bin/train_ds.py

# check that the trainer script actually exists
if [ ! -f "$train_tool" ]; then
    echo "Error: train_ds.py not found at $train_tool"
    exit 1
fi
# ABSOLUTE_PATH=/to/your/path/FunASR-main/funasr/bin
# train_tool=${ABSOLUTE_PATH}/train_ds.py
# echo "Using funasr trainer: ${train_tool}" 

# ++dataset_conf.batch_sampler="example" \
torchrun $DISTRIBUTED_ARGS \
${train_tool} \
++model="${model_name_or_model_dir}" \
++trust_remote_code=true \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=600  \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=4 \
++train_conf.max_epoch=200 \
++train_conf.log_interval=10 \
++train_conf.resume=true \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++train_conf.use_deepspeed=false \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}

5. After training, SenseVoice gains an outputs folder; the .pt models inside are ready to use. A quick way to test them: open webui.py

model = "iic/SenseVoiceSmall"

# This loads the original model
# model = AutoModel(model=model,
# 				  vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
# 				  vad_kwargs={"max_single_segment_time": 30000},
# 				  trust_remote_code=True,
# 				  )


# The two calls below load the fine-tuned model
model = AutoModel(
    model=model,
    init_param="/to/your/path/SenseVoice-main/outputs/model.pt",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_kwargs={"max_single_segment_time": 30000},
    trust_remote_code=True,
    disable_update=True,
)

# model = AutoModel(
#     model="/to/your/path/SenseVoice-main/outputs"
# )

6. The tensorboard command, for reference:

tensorboard --port 6007 --logdir outputs/tensorboard

This one made me pull my hair out. I skimmed the README the first time and missed part of it, so the data structure was wrong; training ran fine every time, but testing showed no effect and tensorboard stayed empty. I combed through the logs again and again without finding anything abnormal; only on rereading the README did I realize the dataset was probably the problem.
