1. Introduction to Transformers

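Here, Transformers refers to the model library developed by Hugging Face, which provides inference, training, and related functionality for the tens of thousands of pretrained models hosted on Hugging Face.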

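  • Transformers provides thousands of pretrained models for text classification, information extraction, question answering, summarization, translation, and text generation in more than 100 languages. Its goal is to make state-of-the-art NLP easy for everyone to use.
  • Transformers provides APIs for quickly downloading and using pretrained models: you can apply them to a given text, fine-tune them on your own dataset, and then share them with the community through the model hub. In addition, each Python module that defines an architecture is fully standalone, which makes it easy to modify for quick research experiments.
  • Transformers supports the three most popular deep learning libraries, Jax, PyTorch, and TensorFlow, and integrates seamlessly with them. You can train a model with one framework and then load it for inference with another.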

2. Running a Qwen3 model example

from transformers import AutoModelForCausalLM, AutoTokenizer
import time

# Path to the local directory that contains the model files
model_path = "/home/claudius/qwen/"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True  # Often set for Qwen models; permits custom code shipped with the checkpoint to run
)

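# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",  # Use the dtype stored in the checkpoint, typically bfloat16 or float16 to save GPU memory
    device_map="auto",           # Automatically place the model layers on the available GPU devices
    trust_remote_code=True      # Often set for Qwen models; permits custom code shipped with the checkpoint to run
)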

print("Model and tokenizer loaded successfully!")

# Prepare the conversation
messages = [
    {"role": "user", "content": "Tell me about the character Claudius"}
]

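# Apply the chat template to format the conversation as text the model accepts
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # Do not tokenize here; return only the formatted text string
    add_generation_prompt=True  # Append the prompt that tells the model to start generating a reply
)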

# Convert the text into input tensors the model accepts
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

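start_time = time.time()
# Generate the model's reply
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512  # Upper bound on the number of newly generated tokens
)

# generate() returns the prompt tokens followed by the new tokens; keep only the new ones
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

# Decode the generated tokens into readable text, skipping special tokens
response = tokenizer.decode(output_ids, skip_special_tokens=True)
end_time = time.time()
execution_time = end_time - start_time
print(f"Generation and decoding time: {execution_time:.6f} s")
print("Model reply:", response)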
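
To make the chat-template step more concrete, here is a rough, hand-written approximation of the ChatML-style layout that Qwen-family templates produce. The function name render_chatml is hypothetical. The real template is a Jinja string stored with the tokenizer and handles more cases (system prompts, tool calls, reasoning content), so treat this as a sketch rather than the exact output of apply_chat_template.

```python
# Simplified approximation of a ChatML-style chat template (assumption:
# the checkpoint's real Jinja template may add or change parts of this).

def render_chatml(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts as ChatML-style text."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Tell me about the character Claudius"}]
text = render_chatml(messages)
print(text)
```

With add_generation_prompt=False the text would end right after the user turn, and the model would have no cue that an assistant reply should follow.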

The files that make up the Qwen model are listed below:

LAPTOP-QO2TSM24:~/qwen$ ls -l
-rwxrwxrwx 1         726 Nov  1 17:36 config.json
-rwxrwxrwx 1         239 Nov  1 17:36 generation_config.json
-rwxrwxrwx 1     1671853 Nov  1 17:36 merges.txt
-rwxrwxrwx 1  1503300328 Nov  1 17:25 model.safetensors
-rwxrwxrwx 1    11422654 Nov  1 17:36 tokenizer.json
-rwxrwxrwx 1        9732 Nov  1 17:36 tokenizer_config.json
-rwxrwxrwx 1     2776833 Nov  1 17:36 vocab.json
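
Before loading, it can help to confirm that the directory actually contains these files. The following is a minimal sketch using only the standard library. The helper names describe_model_dir and missing_files are hypothetical, and the one-line role descriptions are a summary for illustration rather than a complete specification.

```python
import tempfile
from pathlib import Path

# Files seen in the listing above, with a short note on what each one holds.
EXPECTED_FILES = {
    "config.json": "model architecture and hyperparameters",
    "generation_config.json": "default generation settings",
    "model.safetensors": "model weights",
    "tokenizer.json": "serialized fast tokenizer",
    "tokenizer_config.json": "tokenizer class and options",
    "vocab.json": "BPE vocabulary",
    "merges.txt": "BPE merge rules",
}


def describe_model_dir(model_dir):
    """Return {filename: size in bytes, or None if the file is missing}."""
    root = Path(model_dir)
    report = {}
    for name in EXPECTED_FILES:
        path = root / name
        report[name] = path.stat().st_size if path.is_file() else None
    return report


def missing_files(model_dir):
    """List the expected files that are absent from model_dir."""
    return [name for name, size in describe_model_dir(model_dir).items() if size is None]


# Demo on a throwaway directory that contains only config.json.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "config.json").write_text("{}")
    demo_missing = missing_files(demo_dir)
print(demo_missing)
```

For the directory used in this post, the call would be missing_files("/home/claudius/qwen/"), and an empty list would mean every file in the listing is present.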

3. AutoTokenizer

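AutoTokenizer is a tokenizer loader in the Hugging Face transformers library. Given the name or local path of a pretrained model, it automatically selects the appropriate tokenizer class. Its main purpose is to spare users from specifying the tokenization scheme by hand: the matching tokenizer is loaded automatically from the model name or path.
In the transformers library, each model type has a corresponding tokenizer. The TOKENIZER_MAPPING_NAMES dictionary records this mapping: each model type maps to a pair of class names, the slow (pure Python) tokenizer and the fast one backed by the tokenizers library.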

For example, the qwen3 model type maps to the Qwen2Tokenizer and Qwen2TokenizerFast tokenizers.

Looking at config.json, you can see: "model_type": "qwen3"

Looking at tokenizer_config.json, you can see: "tokenizer_class": "Qwen2Tokenizer",
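
The way these two files feed into the tokenizer choice can be sketched roughly as follows. This is a simplified illustration, not the library's actual implementation: the helper resolve_tokenizer_class and the three-entry TOKENIZER_MAPPING_EXCERPT are hypothetical, and the sketch assumes that a "Fast" variant exists whenever a fast tokenizer is requested, whereas the real loader falls back to the slow class when it does not.

```python
import json
import tempfile
from pathlib import Path

# Tiny hand-copied excerpt of the model_type -> (slow, fast) mapping.
TOKENIZER_MAPPING_EXCERPT = {
    "bert": ("BertTokenizer", "BertTokenizerFast"),
    "qwen2": ("Qwen2Tokenizer", "Qwen2TokenizerFast"),
    "qwen3": ("Qwen2Tokenizer", "Qwen2TokenizerFast"),
}


def _read_json(path):
    """Load a JSON file, or return an empty dict if it does not exist."""
    return json.loads(path.read_text(encoding="utf-8")) if path.is_file() else {}


def resolve_tokenizer_class(model_dir, use_fast=True):
    """Hypothetical helper: pick a tokenizer class name for a checkpoint directory."""
    root = Path(model_dir)
    # 1. Prefer an explicit class name from tokenizer_config.json.
    name = _read_json(root / "tokenizer_config.json").get("tokenizer_class")
    if name is not None:
        if use_fast and not name.endswith("Fast"):
            return name + "Fast"  # assumption: the fast variant exists
        return name
    # 2. Otherwise fall back to model_type from config.json and the mapping.
    model_type = _read_json(root / "config.json").get("model_type")
    if model_type not in TOKENIZER_MAPPING_EXCERPT:
        raise ValueError(f"No tokenizer mapping for model_type={model_type!r}")
    slow, fast = TOKENIZER_MAPPING_EXCERPT[model_type]
    return fast if use_fast and fast is not None else slow


# Demo on a throwaway directory that mimics the two config files.
with tempfile.TemporaryDirectory() as demo_dir:
    root = Path(demo_dir)
    (root / "config.json").write_text(json.dumps({"model_type": "qwen3"}))
    from_mapping = resolve_tokenizer_class(demo_dir)
    (root / "tokenizer_config.json").write_text(
        json.dumps({"tokenizer_class": "Qwen2Tokenizer"})
    )
    from_tokenizer_config = resolve_tokenizer_class(demo_dir)
    slow_only = resolve_tokenizer_class(demo_dir, use_fast=False)
print(from_mapping, from_tokenizer_config, slow_only)
```

Either route ends at the same Qwen2 tokenizer classes, which is why a qwen3 checkpoint can reuse the tokenizer originally written for qwen2.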

The tokenizers for qwen3 also appear in the mapping table below.

# Explicit rather than inferred generics to significantly improves completion suggestion performance for language servers.
TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](
    [
        (
            "aimv2",
            (
                "CLIPTokenizer",
                "CLIPTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "albert",
            (
                "AlbertTokenizer" if is_sentencepiece_available() else None,
                "AlbertTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("align", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("arcee", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("aria", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("aya_vision", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
        ("bark", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("bart", ("BartTokenizer", "BartTokenizerFast")),
        (
            "barthez",
            (
                "BarthezTokenizer" if is_sentencepiece_available() else None,
                "BarthezTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("bartpho", ("BartphoTokenizer", None)),
        ("bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("bert-generation", ("BertGenerationTokenizer" if is_sentencepiece_available() else None, None)),
        ("bert-japanese", ("BertJapaneseTokenizer", None)),
        ("bertweet", ("BertweetTokenizer", None)),
        (
            "big_bird",
            (
                "BigBirdTokenizer" if is_sentencepiece_available() else None,
                "BigBirdTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("bigbird_pegasus", ("PegasusTokenizer", "PegasusTokenizerFast" if is_tokenizers_available() else None)),
        ("biogpt", ("BioGptTokenizer", None)),
        ("bitnet", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("blenderbot", ("BlenderbotTokenizer", "BlenderbotTokenizerFast")),
        ("blenderbot-small", ("BlenderbotSmallTokenizer", None)),
        ("blip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("blip-2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("bloom", (None, "BloomTokenizerFast" if is_tokenizers_available() else None)),
        ("blt", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("bridgetower", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
        ("bros", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("byt5", ("ByT5Tokenizer", None)),
        (
            "camembert",
            (
                "CamembertTokenizer" if is_sentencepiece_available() else None,
                "CamembertTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("canine", ("CanineTokenizer", None)),
        (
            "chameleon",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("chinese_clip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "clap",
            (
                "RobertaTokenizer",
                "RobertaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "clip",
            (
                "CLIPTokenizer",
                "CLIPTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "clipseg",
            (
                "CLIPTokenizer",
                "CLIPTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("clvp", ("ClvpTokenizer", None)),
        (
            "code_llama",
            (
                "CodeLlamaTokenizer" if is_sentencepiece_available() else None,
                "CodeLlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("codegen", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
        ("cohere", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
        ("cohere2", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
        ("colpali", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("colqwen2", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        ("convbert", ("ConvBertTokenizer", "ConvBertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "cpm",
            (
                "CpmTokenizer" if is_sentencepiece_available() else None,
                "CpmTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("cpmant", ("CpmAntTokenizer", None)),
        ("csm", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("ctrl", ("CTRLTokenizer", None)),
        ("data2vec-audio", ("Wav2Vec2CTCTokenizer", None)),
        ("data2vec-text", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
        ("dbrx", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("deberta", ("DebertaTokenizer", "DebertaTokenizerFast" if is_tokenizers_available() else None)),
        (
            "deberta-v2",
            (
                "DebertaV2Tokenizer" if is_sentencepiece_available() else None,
                "DebertaV2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "deepseek_v2",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "deepseek_v3",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "deepseek_vl",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "deepseek_vl_hybrid",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("dia", ("DiaTokenizer", None)),
        (
            "diffllama",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("distilbert", ("DistilBertTokenizer", "DistilBertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "dpr",
            (
                "DPRQuestionEncoderTokenizer",
                "DPRQuestionEncoderTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("electra", ("ElectraTokenizer", "ElectraTokenizerFast" if is_tokenizers_available() else None)),
        ("emu3", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("ernie", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("ernie4_5", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("ernie4_5_moe", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)),
        ("esm", ("EsmTokenizer", None)),
        (
            "exaone4",
            (
                "GPT2Tokenizer" if is_tokenizers_available() else None,
                "GPT2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("falcon", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("falcon_mamba", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        (
            "fastspeech2_conformer",
            ("FastSpeech2ConformerTokenizer" if is_g2p_en_available() else None, None),
        ),
        ("flaubert", ("FlaubertTokenizer", None)),
        ("flex_olmo", (None, "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("fnet", ("FNetTokenizer", "FNetTokenizerFast" if is_tokenizers_available() else None)),
        ("fsmt", ("FSMTTokenizer", None)),
        ("funnel", ("FunnelTokenizer", "FunnelTokenizerFast" if is_tokenizers_available() else None)),
        (
            "gemma",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "gemma2",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "gemma3",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "gemma3_text",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "gemma3n",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "gemma3n_text",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("glm", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("glm4", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("glm4_moe", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("glm4v", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("glm4v_moe", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
        ("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("gpt_bigcode", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("gpt_neox", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("gpt_neox_japanese", ("GPTNeoXJapaneseTokenizer", None)),
        ("gpt_oss", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("gptj", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("gptsan-japanese", ("GPTSanJapaneseTokenizer", None)),
        ("granite", ("GPT2Tokenizer", None)),
        ("granitemoe", ("GPT2Tokenizer", None)),
        ("granitemoehybrid", ("GPT2Tokenizer", None)),
        ("granitemoeshared", ("GPT2Tokenizer", None)),
        ("grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("groupvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
        ("helium", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("herbert", ("HerbertTokenizer", "HerbertTokenizerFast" if is_tokenizers_available() else None)),
        ("hubert", ("Wav2Vec2CTCTokenizer", None)),
        ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
        ("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("idefics2", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("idefics3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("instructblipvideo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("internvl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        (
            "jamba",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("janus", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        (
            "jetmoe",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("jukebox", ("JukeboxTokenizer", None)),
        (
            "kosmos-2",
            (
                "XLMRobertaTokenizer" if is_sentencepiece_available() else None,
                "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("kosmos-2.5", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
        ("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
        ("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
        ("layoutxlm", ("LayoutXLMTokenizer", "LayoutXLMTokenizerFast" if is_tokenizers_available() else None)),
        ("led", ("LEDTokenizer", "LEDTokenizerFast" if is_tokenizers_available() else None)),
        ("lilt", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
        (
            "llama",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "llama4",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "llama4_text",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("llava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("llava_next", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("llava_next_video", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("llava_onevision", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("longformer", ("LongformerTokenizer", "LongformerTokenizerFast" if is_tokenizers_available() else None)),
        (
            "longt5",
            (
                "T5Tokenizer" if is_sentencepiece_available() else None,
                "T5TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("luke", ("LukeTokenizer", None)),
        ("lxmert", ("LxmertTokenizer", "LxmertTokenizerFast" if is_tokenizers_available() else None)),
        ("m2m_100", ("M2M100Tokenizer" if is_sentencepiece_available() else None, None)),
        ("mamba", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("mamba2", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("marian", ("MarianTokenizer" if is_sentencepiece_available() else None, None)),
        (
            "mbart",
            (
                "MBartTokenizer" if is_sentencepiece_available() else None,
                "MBartTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "mbart50",
            (
                "MBart50Tokenizer" if is_sentencepiece_available() else None,
                "MBart50TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("mega", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
        ("megatron-bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "metaclip_2",
            (
                "XLMRobertaTokenizer",
                "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("mgp-str", ("MgpstrTokenizer", None)),
        (
            "minimax",
            (
                "GPT2Tokenizer" if is_sentencepiece_available() else None,
                "GPT2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "mistral",
            (
                "MistralCommonTokenizer"
                if is_mistral_common_available()
                else ("LlamaTokenizer" if is_sentencepiece_available() else None),
                "LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
            ),
        ),
        (
            "mixtral",
            (
                "MistralCommonTokenizer"
                if is_mistral_common_available()
                else ("LlamaTokenizer" if is_sentencepiece_available() else None),
                "LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
            ),
        ),
        ("mllama", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
        ("mm-grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
        ("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("moonshine", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("moshi", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
        ("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("mra", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
        (
            "mt5",
            (
                "MT5Tokenizer" if is_sentencepiece_available() else None,
                "MT5TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("musicgen", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
        ("musicgen_melody", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
        ("mvp", ("MvpTokenizer", "MvpTokenizerFast" if is_tokenizers_available() else None)),
        ("myt5", ("MyT5Tokenizer", None)),
        ("nemotron", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("nezha", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "nllb",
            (
                "NllbTokenizer" if is_sentencepiece_available() else None,
                "NllbTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "nllb-moe",
            (
                "NllbTokenizer" if is_sentencepiece_available() else None,
                "NllbTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "nystromformer",
            (
                "AlbertTokenizer" if is_sentencepiece_available() else None,
                "AlbertTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("olmo", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("olmo2", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("olmo3", (None, "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("olmoe", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        (
            "omdet-turbo",
            ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None),
        ),
        ("oneformer", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
        (
            "openai-gpt",
            ("OpenAIGPTTokenizer", "OpenAIGPTTokenizerFast" if is_tokenizers_available() else None),
        ),
        ("opt", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        ("owlv2", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
        ("owlvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
        ("paligemma", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("parakeet", ("ParakeetCTCTokenizer", None)),
        (
            "pegasus",
            (
                "PegasusTokenizer" if is_sentencepiece_available() else None,
                "PegasusTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "pegasus_x",
            (
                "PegasusTokenizer" if is_sentencepiece_available() else None,
                "PegasusTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "perceiver",
            (
                "PerceiverTokenizer",
                None,
            ),
        ),
        (
            "persimmon",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("phi", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
        ("phi3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("phimoe", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("phobert", ("PhobertTokenizer", None)),
        ("pix2struct", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
        (
            "pixtral",
            (
                None,
                "MistralCommonTokenizer"
                if is_mistral_common_available()
                else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
            ),
        ),
        ("plbart", ("PLBartTokenizer" if is_sentencepiece_available() else None, None)),
        ("prophetnet", ("ProphetNetTokenizer", None)),
        ("qdqbert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "qwen2",
            (
                "Qwen2Tokenizer",
                "Qwen2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("qwen2_5_omni", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        ("qwen2_5_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        ("qwen2_audio", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        (
            "qwen2_moe",
            (
                "Qwen2Tokenizer",
                "Qwen2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("qwen2_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        (
            "qwen3",
            (
                "Qwen2Tokenizer",
                "Qwen2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "qwen3_moe",
            (
                "Qwen2Tokenizer",
                "Qwen2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "qwen3_next",
            (
                "Qwen2Tokenizer",
                "Qwen2TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("qwen3_omni_moe", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        ("qwen3_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        ("qwen3_vl_moe", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
        ("rag", ("RagTokenizer", None)),
        ("realm", ("RealmTokenizer", "RealmTokenizerFast" if is_tokenizers_available() else None)),
        (
            "recurrent_gemma",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "reformer",
            (
                "ReformerTokenizer" if is_sentencepiece_available() else None,
                "ReformerTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "rembert",
            (
                "RemBertTokenizer" if is_sentencepiece_available() else None,
                "RemBertTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("retribert", ("RetriBertTokenizer", "RetriBertTokenizerFast" if is_tokenizers_available() else None)),
        ("roberta", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
        (
            "roberta-prelayernorm",
            ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None),
        ),
        ("roc_bert", ("RoCBertTokenizer", None)),
        ("roformer", ("RoFormerTokenizer", "RoFormerTokenizerFast" if is_tokenizers_available() else None)),
        ("rwkv", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        (
            "seamless_m4t",
            (
                "SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
                "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "seamless_m4t_v2",
            (
                "SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
                "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "shieldgemma2",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("siglip", ("SiglipTokenizer" if is_sentencepiece_available() else None, None)),
        (
            "siglip2",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("speech_to_text", ("Speech2TextTokenizer" if is_sentencepiece_available() else None, None)),
        ("speech_to_text_2", ("Speech2Text2Tokenizer", None)),
        ("speecht5", ("SpeechT5Tokenizer" if is_sentencepiece_available() else None, None)),
        ("splinter", ("SplinterTokenizer", "SplinterTokenizerFast")),
        (
            "squeezebert",
            ("SqueezeBertTokenizer", "SqueezeBertTokenizerFast" if is_tokenizers_available() else None),
        ),
        ("stablelm", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        ("starcoder2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        (
            "switch_transformers",
            (
                "T5Tokenizer" if is_sentencepiece_available() else None,
                "T5TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "t5",
            (
                "T5Tokenizer" if is_sentencepiece_available() else None,
                "T5TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "t5gemma",
            (
                "GemmaTokenizer" if is_sentencepiece_available() else None,
                "GemmaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("tapas", ("TapasTokenizer", None)),
        ("tapex", ("TapexTokenizer", None)),
        ("transfo-xl", ("TransfoXLTokenizer", None)),
        ("tvp", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        (
            "udop",
            (
                "UdopTokenizer" if is_sentencepiece_available() else None,
                "UdopTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "umt5",
            (
                "T5Tokenizer" if is_sentencepiece_available() else None,
                "T5TokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("video_llava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("vipllava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("vits", ("VitsTokenizer", None)),
        (
            "voxtral",
            (
                "MistralCommonTokenizer"
                if is_mistral_common_available()
                else ("LlamaTokenizer" if is_sentencepiece_available() else None),
                "LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
            ),
        ),
        ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
        ("wav2vec2-bert", ("Wav2Vec2CTCTokenizer", None)),
        ("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
        ("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
        ("whisper", ("WhisperTokenizer", "WhisperTokenizerFast" if is_tokenizers_available() else None)),
        ("xclip", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
        (
            "xglm",
            (
                "XGLMTokenizer" if is_sentencepiece_available() else None,
                "XGLMTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("xlm", ("XLMTokenizer", None)),
        ("xlm-prophetnet", ("XLMProphetNetTokenizer" if is_sentencepiece_available() else None, None)),
        (
            "xlm-roberta",
            (
                "XLMRobertaTokenizer" if is_sentencepiece_available() else None,
                "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "xlm-roberta-xl",
            (
                "XLMRobertaTokenizer" if is_sentencepiece_available() else None,
                "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "xlnet",
            (
                "XLNetTokenizer" if is_sentencepiece_available() else None,
                "XLNetTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("xlstm", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        (
            "xmod",
            (
                "XLMRobertaTokenizer" if is_sentencepiece_available() else None,
                "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "yoso",
            (
                "AlbertTokenizer" if is_sentencepiece_available() else None,
                "AlbertTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "zamba",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "zamba2",
            (
                "LlamaTokenizer" if is_sentencepiece_available() else None,
                "LlamaTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
    ]
)

3.1 AutoTokenizer.from_pretrained

Calling the class method AutoTokenizer.from_pretrained reads the model's configuration files and returns the tokenizer that matches the model. For this checkpoint that is Qwen2TokenizerFast.

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True  # only needed for checkpoints that ship custom code; Qwen2/Qwen3 are supported natively
)

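The function instantiates a tokenizer class from a pretrained model's vocabulary files. It first looks for the `tokenizer_class` field in tokenizer_config.json. Only if that field is missing does it fall back to the `model_type` property in the model's config.json and look it up in the mapping table listed above.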
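The lookup order can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `resolve_tokenizer_name`, `read_json`, and the two-entry `MAPPING` are hypothetical stand-ins for the logic inside AutoTokenizer and for the full mapping table.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical two-entry stand-in for the mapping table: model_type -> (slow, fast)
MAPPING = {
    "qwen2": ("Qwen2Tokenizer", "Qwen2TokenizerFast"),
    "t5": ("T5Tokenizer", "T5TokenizerFast"),
}

def read_json(path):
    """Return the parsed JSON file, or an empty dict if the file is missing."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def resolve_tokenizer_name(model_dir, use_fast=True):
    """Simplified lookup: tokenizer_config.json first, then config.json's model_type."""
    model_dir = Path(model_dir)
    name = read_json(model_dir / "tokenizer_config.json").get("tokenizer_class")
    if name is not None:
        return name
    model_type = read_json(model_dir / "config.json").get("model_type")
    slow, fast = MAPPING[model_type]
    return fast if use_fast and fast is not None else slow

with tempfile.TemporaryDirectory() as d:
    Path(d, "config.json").write_text(json.dumps({"model_type": "qwen2"}))
    from_model_type = resolve_tokenizer_name(d)  # no tokenizer_config.json yet
    Path(d, "tokenizer_config.json").write_text(
        json.dumps({"tokenizer_class": "Qwen2Tokenizer"})
    )
    from_tokenizer_config = resolve_tokenizer_name(d)

print(from_model_type, from_tokenizer_config)
```

In the walkthrough that follows, tokenizer_config.json is present, so the first branch is the one taken.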

The implementation details are as follows:

3.1.1 Reading the tokenizer_config file

Read tokenizer_config.json and return its contents as a Python dict, tokenizer_config:

tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)

3.1.2 Getting the tokenizer_class field

Read the "tokenizer_class" field from the tokenizer_config dict. Here config_tokenizer_class = "Qwen2Tokenizer".

config_tokenizer_class = tokenizer_config.get("tokenizer_class")

3.1.3 Resolving the tokenizer class from the tokenizer_class name

if use_fast and not config_tokenizer_class.endswith("Fast"):
    tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)

Here use_fast defaults to True, so the Fast variant is tried first: tokenizer_class_candidate = "Qwen2TokenizerFast".

The class is then looked up by that name, so tokenizer_class is:

<class 'transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast'>
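The preference for the Fast variant, with a fallback to the configured name, can be sketched like this. The placeholder classes, `REGISTRY`, `class_from_name`, and `pick_tokenizer_class` are hypothetical; in the library the lookup by name is done by tokenizer_class_from_name.

```python
# Placeholder classes standing in for what the library can import by name.
class Qwen2Tokenizer: ...
class Qwen2TokenizerFast: ...
class TapasTokenizer: ...  # listed in the mapping table without a Fast counterpart

REGISTRY = {cls.__name__: cls for cls in (Qwen2Tokenizer, Qwen2TokenizerFast, TapasTokenizer)}

def class_from_name(name):
    """Return the class registered under `name`, or None if there is none."""
    return REGISTRY.get(name)

def pick_tokenizer_class(config_tokenizer_class, use_fast=True):
    tokenizer_class = None
    if use_fast and not config_tokenizer_class.endswith("Fast"):
        # Try the Fast variant first
        tokenizer_class = class_from_name(f"{config_tokenizer_class}Fast")
    if tokenizer_class is None:
        # Fall back to the name exactly as written in tokenizer_config.json
        tokenizer_class = class_from_name(config_tokenizer_class)
    if tokenizer_class is None:
        raise ValueError(f"Tokenizer class {config_tokenizer_class} does not exist")
    return tokenizer_class

print(pick_tokenizer_class("Qwen2Tokenizer").__name__)
print(pick_tokenizer_class("TapasTokenizer").__name__)
```

The second call shows why the fallback matters: a tokenizer with no Fast implementation still resolves to its slow class.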

3.1.4 Calling from_pretrained on the resolved tokenizer class

Call from_pretrained on Qwen2TokenizerFast, the tokenizer class for this model.

return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)

This is actually the from_pretrained function inherited from the base class PreTrainedTokenizerBase. The inheritance chain is:

class Qwen2TokenizerFast(PreTrainedTokenizerFast):
class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
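A minimal sketch of why a from_pretrained defined on the base class still yields the subclass: a classmethod receives the class it was called on as `cls`. The class names and the path below are placeholders, not the library's classes.

```python
class Base:
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_pretrained(cls, path):
        # `cls` is the class the method was called on, not necessarily Base
        return cls(path)

class Fast(Base):
    pass

class QwenFast(Fast):
    pass

tok = QwenFast.from_pretrained("/tmp/model")  # placeholder path
print(type(tok).__name__)
```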

3.1.5 PreTrainedTokenizerBase::from_pretrained

The relevant file paths are collected in the resolved_vocab_files dict. Printed, it looks like this:

'vocab_file': '/home/boy/qwen/vocab.json', 
'merges_file': '/home/boy/qwen/merges.txt',
'tokenizer_file': '/home/boy/qwen/tokenizer.json',
'added_tokens_file': None,
'special_tokens_map_file': None, 
'tokenizer_config_file': '/home/boy/qwen/tokenizer_config.json',
'chat_template_file': None

It then calls the class method PreTrainedTokenizerBase::_from_pretrained, which instantiates the Qwen2TokenizerFast object and returns it.

return cls._from_pretrained(resolved_vocab_files,pretrained_model_name_or_path,

3.1.6 PreTrainedTokenizerBase::_from_pretrained

1. Read the contents of the tokenizer_config file into the init_kwargs dict

with open(tokenizer_config_file, encoding="utf-8") as tokenizer_config_handle:
    init_kwargs = json.load(tokenizer_config_handle)

2. Convert the entries of added_tokens_decoder into AddedToken objects and store them back in init_kwargs

#### Handle tokenizer serialization of added and special tokens
added_tokens_decoder: dict[int, AddedToken] = {}
added_tokens_map: dict[str, AddedToken] = {}
# if we have info on the slow added tokens
if "added_tokens_decoder" in init_kwargs:
    for idx, token in init_kwargs["added_tokens_decoder"].items():
        if isinstance(token, dict):
            token = AddedToken(**token)
        if isinstance(token, AddedToken):
            added_tokens_decoder[int(idx)] = token
            added_tokens_map[str(token)] = token
        else:
            raise TypeError(
                f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary or an AddedToken instance"
            )

3. Replace the special tokens

The special-token fields in init_kwargs (the dict loaded from tokenizer_config) are replaced by the corresponding AddedToken objects.

For example:

Before: "eos_token": "<|im_end|>" in tokenizer_config

After: "eos_token": AddedToken
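Steps 2 and 3 together can be sketched as below. This is an illustration only: `Token` is a hypothetical stand-in for AddedToken so the example runs without the tokenizers package, and the token ID and field values in `init_kwargs` are illustrative rather than copied from the real file.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:  # hypothetical stand-in for AddedToken
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = False
    special: bool = False

    def __str__(self):
        return self.content

# Shape of the relevant part of tokenizer_config.json (ID and values illustrative)
init_kwargs = {
    "added_tokens_decoder": {
        "151645": {"content": "<|im_end|>", "special": True},
    },
    "eos_token": "<|im_end|>",
}

# Step 2: turn each serialized entry into a token object, indexed by its text
added_tokens_map = {}
for idx, token in init_kwargs["added_tokens_decoder"].items():
    token = Token(**token)
    added_tokens_map[str(token)] = token

# Step 3: swap plain-string special tokens for the matching token object
for key in ("bos_token", "eos_token", "unk_token", "pad_token"):
    value = init_kwargs.get(key)
    if value is not None:
        init_kwargs[key] = added_tokens_map.get(str(value), value)

print(repr(init_kwargs["eos_token"]))
```

After the swap, eos_token carries its flags (special, normalized, and so on) instead of being a bare string.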

4. Instantiate the Qwen2TokenizerFast object

The tokenizer object for the model is instantiated from init_kwargs, the dict built from tokenizer_config in the steps above:

tokenizer = cls(*init_inputs, **init_kwargs)

3.1.7 Qwen2TokenizerFast

The class and its __init__ function:

class Qwen2TokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" Qwen2 tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
    Byte-Pair-Encoding.

    Same with GPT2Tokenizer, this tokenizer has been trained to treat spaces like parts of the tokens so a word will
    be encoded differently whether it is at the beginning of the sentence (without space) or not:

    ```python
    >>> from transformers import Qwen2TokenizerFast

    >>> tokenizer = Qwen2TokenizerFast.from_pretrained("Qwen/Qwen-tokenizer")
    >>> tokenizer("Hello world")["input_ids"]
    [9707, 1879]

    >>> tokenizer(" Hello world")["input_ids"]
    [21927, 1879]
    ```
    """

    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]
    slow_tokenizer_class = Qwen2Tokenizer

    def __init__(
        self,
        vocab_file=None,
        merges_file=None,
        tokenizer_file=None,
        unk_token="<|endoftext|>",
        bos_token=None,
        eos_token="<|endoftext|>",
        pad_token="<|endoftext|>",
        **kwargs,
    ):
        # We need to at least pass vocab_file and merges_file to base class
        # in case a slow tokenizer needs to be initialized; other can be
        # configured through files.
        # following GPT2TokenizerFast, also adding unk_token, bos_token, and eos_token

        bos_token = (
            AddedToken(bos_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(bos_token, str)
            else bos_token
        )
        eos_token = (
            AddedToken(eos_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(eos_token, str)
            else eos_token
        )
        unk_token = (
            AddedToken(unk_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(unk_token, str)
            else unk_token
        )
        pad_token = (
            AddedToken(pad_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(pad_token, str)
            else pad_token
        )

        super().__init__(
            vocab_file=vocab_file,
            merges_file=merges_file,
            tokenizer_file=tokenizer_file,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            **kwargs,
        )

3.1.8 PreTrainedTokenizerFast __init__ function

This function is long. The part that matters here is:

fast_tokenizer_file='/home/boy/qwen/tokenizer.json'

fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
self._tokenizer = fast_tokenizer

In short, the Qwen2TokenizerFast object is initialized from the model's tokenizer files and returned.

3.2 apply_chat_template

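# Prepare the conversation (the prompt means "Introduce the qwen3 model")
messages = [
    {"role": "user", "content": "介绍一下qwen3模型"}
]
print(tokenizer.chat_template)
# Apply the chat template to format the conversation as text the model accepts
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # do not tokenize here; return only the formatted string
    add_generation_prompt=True  # append the marker that tells the model to start its reply
)
print(text)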

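apply_chat_template is a core tokenizer method in the Hugging Face Transformers library. It converts a list of conversation messages into the text format the model expects. Key points:

Core functionality

  • Input: each message must contain a role field and a content field. Which roles are accepted depends on the model's template; the Qwen3 template printed in 3.2.1 handles system, user, assistant, and tool.
  • Output: the conversation is rendered through the tokenizer's chat_template attribute into a single input string (for Qwen, each turn looks like <|im_start|>user\ncontent<|im_end|>).

Parameters

  • tokenize: if True (the default), returns a list of token IDs; if False, returns the formatted text string.
  • add_generation_prompt: if True, appends the marker that opens the assistant's reply (for Qwen, <|im_start|>assistant\n).

Custom templates
Templates follow Jinja2 syntax and typically use three kinds of constructs:

  • Role: {{ message.role }}
  • Content: {{ message.content }} (inserted as-is, without HTML escaping)
  • Control structures: {% if add_generation_prompt %} for conditional rendering.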

Running the code above prints the following:

3.2.1 chat_template

{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if message.content is string %}
        {%- set content = message.content %}
    {%- else %}
        {%- set content = '' %}
    {%- endif %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}

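The above is a Jinja template that formats the conversation history and supports tool calling (function calling) and reasoning content. It is fairly complex, so we go through it step by step.

The template uses Jinja2 syntax throughout: conditionals, loops, variable assignment, and filters.

Its main structure is:

  1. Handle tool definitions (if tools are present)

    • If tools are present, a system message containing the tool definitions is emitted first.

    • If the first message is a system message, its content is included as well, followed by the tool instructions and the tool list.

    • If there are no tools but the first message is a system message, only that system message is emitted.

  2. Set up a namespace variable to carry state across loop iterations.

    • ns is created with multi_step_tool initially true and last_query_index initially the length of the message list minus one.

    • The message list is then scanned from the end to find the last user query, that is, the last user message that is not a tool response.

  3. Iterate over the message list in order and format each message by role and content:

    • If the message content is a string it is used directly; otherwise it is set to the empty string.

    • User messages (user) and system messages (system, when not the first message) are formatted directly as: <|im_start|>{role}\n{content}<|im_end|>\n

    • Assistant messages (assistant) are more involved:

      • The template tries to extract the reasoning content. If the message has a reasoning_content attribute, it is used; otherwise the template tries to extract the text inside the <think> tags from the content.

      • If the index of the current message is greater than last_query_index (the message comes after the last user query), and it is either the last message or has reasoning content, the reasoning content is placed inside <think> tags, followed by the remaining content.

      • Otherwise the content is emitted as-is.

      • If the message has tool calls (tool_calls), each one is emitted as <tool_call>\n{"name": tool_name, "arguments": tool_arguments}\n</tool_call>

    • Tool messages (tool) are formatted as tool responses. Consecutive tool messages are merged into a single user message, each wrapped in <tool_response> tags.

  4. If add_generation_prompt is set, the assistant start marker is appended at the end. An empty think block is added only when enable_thinking is defined and false.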
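Two pieces of the template logic are easier to follow in Python: the backward scan for the last user query and the extraction of reasoning text from `<think>` tags. The sketch below mirrors the template expressions directly; the function names and the sample messages are invented for illustration.

```python
def find_last_query_index(messages):
    """Index of the last user message that is not a wrapped tool response."""
    for index in range(len(messages) - 1, -1, -1):
        message = messages[index]
        content = message.get("content")
        if (
            message["role"] == "user"
            and isinstance(content, str)
            and not (content.startswith("<tool_response>") and content.endswith("</tool_response>"))
        ):
            return index
    # Same default as the template: the index of the last message
    return len(messages) - 1

def split_reasoning(content):
    """Split '<think>...</think>answer' into (reasoning, answer)."""
    reasoning = ""
    if "</think>" in content:
        reasoning = content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
        content = content.split("</think>")[-1].lstrip("\n")
    return reasoning, content

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nSimple arithmetic.\n</think>\n\n4"},
    {"role": "user", "content": "<tool_response>\nok\n</tool_response>"},
]
last_query_index = find_last_query_index(messages)
reasoning, answer = split_reasoning(messages[1]["content"])
print(last_query_index, repr(reasoning), repr(answer))
```

The third message is skipped by the scan because it is a wrapped tool response, so the last real query is the first message.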

3.2.2 The formatted text string:

<|im_start|>user
介绍一下qwen3模型<|im_end|>
<|im_start|>assistant
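For this simplest case (a single user turn, no tools, no earlier assistant turns) the output can be reproduced without Jinja. `render_chatml` is a hypothetical helper written for this illustration; it is not a replacement for the full template.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render system/user turns only; no tools and no <think> handling."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

text = render_chatml([{"role": "user", "content": "介绍一下qwen3模型"}])
print(text)
```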

3.3 encode

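# Convert the formatted text into the input format the model accepts
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

Encoding serves several purposes:

  • Turning text into model input: machine learning models, neural networks in particular, operate on numbers rather than raw text, so the text has to be converted into token IDs before the model can process it.
  • Normalization and preprocessing: depending on the tokenizer, encoding may include normalization (such as lowercasing or accent stripping), tokenization (splitting text into words or subword units), and adding special tokens (such as [CLS] and [SEP] for BERT-style models).
  • Handling unknown words: subword algorithms such as BPE and WordPiece break an unseen word into known subword units, so the model does not face a completely out-of-vocabulary word.
  • Preserving context: special tokens can mark the beginning and end of a sentence or separate two sentences, which helps the model interpret context.
  • Batching: texts of different lengths are brought to the same length by padding or truncation so they can be processed as one batch.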
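The batching point can be illustrated with a small padding routine. `pad_batch` is a hypothetical helper and the token IDs are made up; the walkthrough encodes a single text, so no padding actually happens there.

```python
def pad_batch(batch_ids, pad_id, max_length=None, padding_side="right"):
    """Truncate to max_length (if given), pad to the longest row, build the mask."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        pad = [pad_id] * (width - len(ids))
        mask_pad = [0] * len(pad)  # 0 marks padding positions the model should ignore
        if padding_side == "right":
            input_ids.append(ids + pad)
            attention_mask.append([1] * len(ids) + mask_pad)
        else:
            input_ids.append(pad + ids)
            attention_mask.append(mask_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(batch)
```

For generation with decoder-only models, left padding is usually preferred so that the newly generated tokens directly follow the real prompt tokens.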

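The call first enters:

the __call__ function of class PreTrainedTokenizerBase in transformers/tokenization_utils_base.py

and ends up in:

the _batch_encode_plus function of class PreTrainedTokenizerFast in transformers/tokenization_utils_fast.py

Inside that function:

Step 1: call encode_batch

_tokenizer is an instance of the Tokenizer class from the tokenizers library. It is a Python binding over a high-performance Rust implementation.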

encodings = self._tokenizer.encode_batch(
    batch_text_or_text_pairs,
    add_special_tokens=add_special_tokens,
    is_pretokenized=is_split_into_words,
)

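Encoding inside Tokenizer involves several steps:

  1. Normalization: depending on the model, this may include lowercasing, Unicode normalization, accent stripping, and so on.
  2. Tokenization: split the text into tokens (subwords, words, or characters).
  3. Conversion to IDs: map each token to its integer ID in the vocabulary.
  4. Adding special tokens: for example [CLS], [SEP], [PAD].
  5. Padding and truncation: make all sequences the same length.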
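Steps 2 and 3 can be illustrated with a toy character-level BPE. The merge rules and vocabulary below are invented for the example; Qwen2's real tokenizer is byte-level BPE whose rules are loaded from tokenizer.json and applied in Rust.

```python
# Invented merge rules, highest priority first, and an invented vocabulary
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
VOCAB = {"low": 0, "er": 1, "e": 2, "s": 3, "t": 4}

def bpe(word, merges):
    """Repeatedly apply the highest-priority merge until none applies."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge matches any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def encode(word):
    """Tokenize with BPE, then map each token to its vocabulary ID."""
    return [VOCAB[token] for token in bpe(word, MERGES)]

print(bpe("lower", MERGES), encode("lower"))
print(bpe("lowest", MERGES), encode("lowest"))
```

"lowest" never appears as a whole in the vocabulary, yet it is still encoded because it decomposes into known pieces.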

# A Python binding; the underlying implementation is high-performance Rust
from tokenizers import Tokenizer

Step 2: convert the encodings into input_ids and attention_mask and collect them in sanitized_tokens

tokens_and_encodings = [
    self._convert_encoding(
        encoding=encoding,
        return_token_type_ids=return_token_type_ids,
        return_attention_mask=return_attention_mask,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_offsets_mapping=return_offsets_mapping,
        return_length=return_length,
        verbose=verbose,
    )
    for encoding in encodings
]

sanitized_tokens is a dict containing input_ids and attention_mask. Finally a BatchEncoding object is returned:

return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
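How the per-text results end up in one dict can be sketched as follows. The data is invented, and `per_example` only mimics the dict part of what _convert_encoding produces for each text; treat it as an approximation of the regrouping step rather than the library's exact code.

```python
# One dict per input text; each value holds one list per produced sequence.
per_example = [
    {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]},
    {"input_ids": [[4, 5]], "attention_mask": [[1, 1]]},
]

sanitized_tokens = {}
for key in per_example[0]:
    # Flatten "one list per text" into "one list per sequence in the batch"
    sanitized_tokens[key] = [seq for item in per_example for seq in item[key]]

print(sanitized_tokens)
```

With return_tensors="pt", these lists are what get converted into tensors by BatchEncoding.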

After encoding, the next step is generate.

input_ids and attention_mask are taken from model_inputs and passed to generate as inputs.

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512  # upper bound on the number of newly generated tokens
)
