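The tokenizer in transformers
1. Introduction to transformers
Transformers here refers to the model library developed by Hugging Face, which provides inference, training, and related functionality for the tens of thousands of pretrained models hosted on the Hugging Face Hub.
- Transformers provides thousands of pretrained models for text classification, information extraction, question answering, summarization, translation, and text generation in more than 100 languages. Its aim is to make state-of-the-art NLP easy for everyone to use.
- Transformers provides APIs for quickly downloading pretrained models, using them on a given text, fine-tuning them on your own datasets, and then sharing them with the community through the model hub. Each Python module that defines an architecture is fully standalone, which makes it easy to modify for quick research experiments.
- Transformers supports the three most popular deep learning libraries, Jax, PyTorch, and TensorFlow, and integrates seamlessly with them. You can train a model with one framework and then load it for inference with another.
2. Running a qwen3 model example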
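from transformers import AutoModelForCausalLM, AutoTokenizer
import time
# Local directory that holds the model files
model_path = "/home/claudius/qwen/"
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True  # needed only when the checkpoint ships custom code; natively supported models generally load without it
)
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",  # use the dtype recorded in the checkpoint (often bfloat16 or float16, which saves GPU memory)
    device_map="auto",  # automatically place the model layers on the available GPU devices
    trust_remote_code=True  # see the note on the tokenizer above
)
print("Model and tokenizer loaded successfully!")
# Prepare the conversation
messages = [
    {"role": "user", "content": "Tell me about the character claudius"}
]
# Apply the chat template to format the conversation as text the model accepts
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # do not tokenize here; return only the formatted string
    add_generation_prompt=True  # append the prompt that tells the model to start its reply
)
# Convert the text into the input format the model accepts
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
start_time = time.time()
# Generate the reply
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512  # upper bound on the number of newly generated tokens
)
# Decode the generated tokens into readable text, skipping special tokens
# (generated_ids[0] starts with the prompt tokens, so the prompt text is decoded too)
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.6f} s")
print("Model reply:", response)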
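Because a decoder-only model returns the prompt tokens followed by the newly generated ones, the reply printed above also contains the prompt. A minimal sketch of trimming the prompt before decoding, using plain Python lists in place of tensors; the token ids and the helper name `strip_prompt` are invented for illustration:

```python
# Toy stand-ins for model_inputs.input_ids[0] and generated_ids[0].
# The ids below are made up for illustration only.
prompt_ids = [101, 102, 103]
generated = [101, 102, 103, 7, 8, 9]  # prompt followed by new tokens


def strip_prompt(output_ids, input_ids):
    """Return only the tokens produced after the prompt."""
    return output_ids[len(input_ids):]


new_tokens = strip_prompt(generated, prompt_ids)
print(new_tokens)  # [7, 8, 9]
```

With real tensors the same slice would be applied to `generated_ids[0]`, using the length of `model_inputs.input_ids[0]`, before calling `tokenizer.decode`.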
The files that make up the qwen model are listed below:
LAPTOP-QO2TSM24:~/qwen$ ls -l
-rwxrwxrwx 1 726 Nov 1 17:36 config.json
-rwxrwxrwx 1 239 Nov 1 17:36 generation_config.json
-rwxrwxrwx 1 1671853 Nov 1 17:36 merges.txt
-rwxrwxrwx 1 1503300328 Nov 1 17:25 model.safetensors
-rwxrwxrwx 1 11422654 Nov 1 17:36 tokenizer.json
-rwxrwxrwx 1 9732 Nov 1 17:36 tokenizer_config.json
-rwxrwxrwx 1 2776833 Nov 1 17:36 vocab.json
3. AutoTokenizer
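AutoTokenizer is the automatic tokenizer loader in the Hugging Face transformers library. Given the name or local path of a pretrained model, it selects the matching tokenizer class, so users do not have to specify the tokenization scheme by hand.
In the transformers library, each model type has a corresponding tokenizer. The TOKENIZER_MAPPING_NAMES ordered dictionary records the mapping from model type to tokenizer classes.
For example, the qwen3 model type maps to the Qwen2Tokenizer and Qwen2TokenizerFast tokenizers.
In config.json you can see: "model_type": "qwen3"
In tokenizer_config.json you can see: "tokenizer_class": "Qwen2Tokenizer",
The qwen3 entry also appears in the mapping table below.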
# Explicit rather than inferred generics to significantly improves completion suggestion performance for language servers.
TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](
[
(
"aimv2",
(
"CLIPTokenizer",
"CLIPTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"albert",
(
"AlbertTokenizer" if is_sentencepiece_available() else None,
"AlbertTokenizerFast" if is_tokenizers_available() else None,
),
),
("align", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("arcee", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("aria", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("aya_vision", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
("bark", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("bart", ("BartTokenizer", "BartTokenizerFast")),
(
"barthez",
(
"BarthezTokenizer" if is_sentencepiece_available() else None,
"BarthezTokenizerFast" if is_tokenizers_available() else None,
),
),
("bartpho", ("BartphoTokenizer", None)),
("bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("bert-generation", ("BertGenerationTokenizer" if is_sentencepiece_available() else None, None)),
("bert-japanese", ("BertJapaneseTokenizer", None)),
("bertweet", ("BertweetTokenizer", None)),
(
"big_bird",
(
"BigBirdTokenizer" if is_sentencepiece_available() else None,
"BigBirdTokenizerFast" if is_tokenizers_available() else None,
),
),
("bigbird_pegasus", ("PegasusTokenizer", "PegasusTokenizerFast" if is_tokenizers_available() else None)),
("biogpt", ("BioGptTokenizer", None)),
("bitnet", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("blenderbot", ("BlenderbotTokenizer", "BlenderbotTokenizerFast")),
("blenderbot-small", ("BlenderbotSmallTokenizer", None)),
("blip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("blip-2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("bloom", (None, "BloomTokenizerFast" if is_tokenizers_available() else None)),
("blt", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("bridgetower", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
("bros", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("byt5", ("ByT5Tokenizer", None)),
(
"camembert",
(
"CamembertTokenizer" if is_sentencepiece_available() else None,
"CamembertTokenizerFast" if is_tokenizers_available() else None,
),
),
("canine", ("CanineTokenizer", None)),
(
"chameleon",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("chinese_clip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
(
"clap",
(
"RobertaTokenizer",
"RobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"clip",
(
"CLIPTokenizer",
"CLIPTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"clipseg",
(
"CLIPTokenizer",
"CLIPTokenizerFast" if is_tokenizers_available() else None,
),
),
("clvp", ("ClvpTokenizer", None)),
(
"code_llama",
(
"CodeLlamaTokenizer" if is_sentencepiece_available() else None,
"CodeLlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("codegen", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
("cohere", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
("cohere2", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
("colpali", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("colqwen2", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
("convbert", ("ConvBertTokenizer", "ConvBertTokenizerFast" if is_tokenizers_available() else None)),
(
"cpm",
(
"CpmTokenizer" if is_sentencepiece_available() else None,
"CpmTokenizerFast" if is_tokenizers_available() else None,
),
),
("cpmant", ("CpmAntTokenizer", None)),
("csm", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("ctrl", ("CTRLTokenizer", None)),
("data2vec-audio", ("Wav2Vec2CTCTokenizer", None)),
("data2vec-text", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
("dbrx", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("deberta", ("DebertaTokenizer", "DebertaTokenizerFast" if is_tokenizers_available() else None)),
(
"deberta-v2",
(
"DebertaV2Tokenizer" if is_sentencepiece_available() else None,
"DebertaV2TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"deepseek_v2",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"deepseek_v3",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"deepseek_vl",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"deepseek_vl_hybrid",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("dia", ("DiaTokenizer", None)),
(
"diffllama",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("distilbert", ("DistilBertTokenizer", "DistilBertTokenizerFast" if is_tokenizers_available() else None)),
(
"dpr",
(
"DPRQuestionEncoderTokenizer",
"DPRQuestionEncoderTokenizerFast" if is_tokenizers_available() else None,
),
),
("electra", ("ElectraTokenizer", "ElectraTokenizerFast" if is_tokenizers_available() else None)),
("emu3", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("ernie", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("ernie4_5", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("ernie4_5_moe", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)),
("esm", ("EsmTokenizer", None)),
(
"exaone4",
(
"GPT2Tokenizer" if is_tokenizers_available() else None,
"GPT2TokenizerFast" if is_tokenizers_available() else None,
),
),
("falcon", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("falcon_mamba", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
(
"fastspeech2_conformer",
("FastSpeech2ConformerTokenizer" if is_g2p_en_available() else None, None),
),
("flaubert", ("FlaubertTokenizer", None)),
("flex_olmo", (None, "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("fnet", ("FNetTokenizer", "FNetTokenizerFast" if is_tokenizers_available() else None)),
("fsmt", ("FSMTTokenizer", None)),
("funnel", ("FunnelTokenizer", "FunnelTokenizerFast" if is_tokenizers_available() else None)),
(
"gemma",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"gemma2",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"gemma3",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"gemma3_text",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"gemma3n",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"gemma3n_text",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("glm", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("glm4", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("glm4_moe", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("glm4v", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("glm4v_moe", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gpt_bigcode", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gpt_neox", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("gpt_neox_japanese", ("GPTNeoXJapaneseTokenizer", None)),
("gpt_oss", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("gptj", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gptsan-japanese", ("GPTSanJapaneseTokenizer", None)),
("granite", ("GPT2Tokenizer", None)),
("granitemoe", ("GPT2Tokenizer", None)),
("granitemoehybrid", ("GPT2Tokenizer", None)),
("granitemoeshared", ("GPT2Tokenizer", None)),
("grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("groupvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
("helium", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("herbert", ("HerbertTokenizer", "HerbertTokenizerFast" if is_tokenizers_available() else None)),
("hubert", ("Wav2Vec2CTCTokenizer", None)),
("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("idefics2", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("idefics3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("instructblipvideo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("internvl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
(
"jamba",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("janus", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
(
"jetmoe",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("jukebox", ("JukeboxTokenizer", None)),
(
"kosmos-2",
(
"XLMRobertaTokenizer" if is_sentencepiece_available() else None,
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
("kosmos-2.5", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
("layoutxlm", ("LayoutXLMTokenizer", "LayoutXLMTokenizerFast" if is_tokenizers_available() else None)),
("led", ("LEDTokenizer", "LEDTokenizerFast" if is_tokenizers_available() else None)),
("lilt", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
(
"llama",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"llama4",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"llama4_text",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("llava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("llava_next", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("llava_next_video", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("llava_onevision", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("longformer", ("LongformerTokenizer", "LongformerTokenizerFast" if is_tokenizers_available() else None)),
(
"longt5",
(
"T5Tokenizer" if is_sentencepiece_available() else None,
"T5TokenizerFast" if is_tokenizers_available() else None,
),
),
("luke", ("LukeTokenizer", None)),
("lxmert", ("LxmertTokenizer", "LxmertTokenizerFast" if is_tokenizers_available() else None)),
("m2m_100", ("M2M100Tokenizer" if is_sentencepiece_available() else None, None)),
("mamba", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("mamba2", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("marian", ("MarianTokenizer" if is_sentencepiece_available() else None, None)),
(
"mbart",
(
"MBartTokenizer" if is_sentencepiece_available() else None,
"MBartTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"mbart50",
(
"MBart50Tokenizer" if is_sentencepiece_available() else None,
"MBart50TokenizerFast" if is_tokenizers_available() else None,
),
),
("mega", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
("megatron-bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
(
"metaclip_2",
(
"XLMRobertaTokenizer",
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
("mgp-str", ("MgpstrTokenizer", None)),
(
"minimax",
(
"GPT2Tokenizer" if is_sentencepiece_available() else None,
"GPT2TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"mistral",
(
"MistralCommonTokenizer"
if is_mistral_common_available()
else ("LlamaTokenizer" if is_sentencepiece_available() else None),
"LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
),
),
(
"mixtral",
(
"MistralCommonTokenizer"
if is_mistral_common_available()
else ("LlamaTokenizer" if is_sentencepiece_available() else None),
"LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
),
),
("mllama", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
("mm-grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moonshine", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moshi", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("mra", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
(
"mt5",
(
"MT5Tokenizer" if is_sentencepiece_available() else None,
"MT5TokenizerFast" if is_tokenizers_available() else None,
),
),
("musicgen", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
("musicgen_melody", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
("mvp", ("MvpTokenizer", "MvpTokenizerFast" if is_tokenizers_available() else None)),
("myt5", ("MyT5Tokenizer", None)),
("nemotron", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("nezha", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
(
"nllb",
(
"NllbTokenizer" if is_sentencepiece_available() else None,
"NllbTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"nllb-moe",
(
"NllbTokenizer" if is_sentencepiece_available() else None,
"NllbTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"nystromformer",
(
"AlbertTokenizer" if is_sentencepiece_available() else None,
"AlbertTokenizerFast" if is_tokenizers_available() else None,
),
),
("olmo", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("olmo2", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("olmo3", (None, "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("olmoe", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
(
"omdet-turbo",
("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None),
),
("oneformer", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
(
"openai-gpt",
("OpenAIGPTTokenizer", "OpenAIGPTTokenizerFast" if is_tokenizers_available() else None),
),
("opt", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("owlv2", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
("owlvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
("paligemma", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("parakeet", ("ParakeetCTCTokenizer", None)),
(
"pegasus",
(
"PegasusTokenizer" if is_sentencepiece_available() else None,
"PegasusTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"pegasus_x",
(
"PegasusTokenizer" if is_sentencepiece_available() else None,
"PegasusTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"perceiver",
(
"PerceiverTokenizer",
None,
),
),
(
"persimmon",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("phi", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
("phi3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("phimoe", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("phobert", ("PhobertTokenizer", None)),
("pix2struct", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
(
"pixtral",
(
None,
"MistralCommonTokenizer"
if is_mistral_common_available()
else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
),
),
("plbart", ("PLBartTokenizer" if is_sentencepiece_available() else None, None)),
("prophetnet", ("ProphetNetTokenizer", None)),
("qdqbert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
(
"qwen2",
(
"Qwen2Tokenizer",
"Qwen2TokenizerFast" if is_tokenizers_available() else None,
),
),
("qwen2_5_omni", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
("qwen2_5_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
("qwen2_audio", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
(
"qwen2_moe",
(
"Qwen2Tokenizer",
"Qwen2TokenizerFast" if is_tokenizers_available() else None,
),
),
("qwen2_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
(
"qwen3",
(
"Qwen2Tokenizer",
"Qwen2TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"qwen3_moe",
(
"Qwen2Tokenizer",
"Qwen2TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"qwen3_next",
(
"Qwen2Tokenizer",
"Qwen2TokenizerFast" if is_tokenizers_available() else None,
),
),
("qwen3_omni_moe", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
("qwen3_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
("qwen3_vl_moe", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
("rag", ("RagTokenizer", None)),
("realm", ("RealmTokenizer", "RealmTokenizerFast" if is_tokenizers_available() else None)),
(
"recurrent_gemma",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"reformer",
(
"ReformerTokenizer" if is_sentencepiece_available() else None,
"ReformerTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"rembert",
(
"RemBertTokenizer" if is_sentencepiece_available() else None,
"RemBertTokenizerFast" if is_tokenizers_available() else None,
),
),
("retribert", ("RetriBertTokenizer", "RetriBertTokenizerFast" if is_tokenizers_available() else None)),
("roberta", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
(
"roberta-prelayernorm",
("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None),
),
("roc_bert", ("RoCBertTokenizer", None)),
("roformer", ("RoFormerTokenizer", "RoFormerTokenizerFast" if is_tokenizers_available() else None)),
("rwkv", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
(
"seamless_m4t",
(
"SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
"SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"seamless_m4t_v2",
(
"SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
"SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"shieldgemma2",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
("siglip", ("SiglipTokenizer" if is_sentencepiece_available() else None, None)),
(
"siglip2",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("speech_to_text", ("Speech2TextTokenizer" if is_sentencepiece_available() else None, None)),
("speech_to_text_2", ("Speech2Text2Tokenizer", None)),
("speecht5", ("SpeechT5Tokenizer" if is_sentencepiece_available() else None, None)),
("splinter", ("SplinterTokenizer", "SplinterTokenizerFast")),
(
"squeezebert",
("SqueezeBertTokenizer", "SqueezeBertTokenizerFast" if is_tokenizers_available() else None),
),
("stablelm", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("starcoder2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
(
"switch_transformers",
(
"T5Tokenizer" if is_sentencepiece_available() else None,
"T5TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"t5",
(
"T5Tokenizer" if is_sentencepiece_available() else None,
"T5TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"t5gemma",
(
"GemmaTokenizer" if is_sentencepiece_available() else None,
"GemmaTokenizerFast" if is_tokenizers_available() else None,
),
),
("tapas", ("TapasTokenizer", None)),
("tapex", ("TapexTokenizer", None)),
("transfo-xl", ("TransfoXLTokenizer", None)),
("tvp", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
(
"udop",
(
"UdopTokenizer" if is_sentencepiece_available() else None,
"UdopTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"umt5",
(
"T5Tokenizer" if is_sentencepiece_available() else None,
"T5TokenizerFast" if is_tokenizers_available() else None,
),
),
("video_llava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("vipllava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("vits", ("VitsTokenizer", None)),
(
"voxtral",
(
"MistralCommonTokenizer"
if is_mistral_common_available()
else ("LlamaTokenizer" if is_sentencepiece_available() else None),
"LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
),
),
("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2-bert", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
("whisper", ("WhisperTokenizer", "WhisperTokenizerFast" if is_tokenizers_available() else None)),
("xclip", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
(
"xglm",
(
"XGLMTokenizer" if is_sentencepiece_available() else None,
"XGLMTokenizerFast" if is_tokenizers_available() else None,
),
),
("xlm", ("XLMTokenizer", None)),
("xlm-prophetnet", ("XLMProphetNetTokenizer" if is_sentencepiece_available() else None, None)),
(
"xlm-roberta",
(
"XLMRobertaTokenizer" if is_sentencepiece_available() else None,
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"xlm-roberta-xl",
(
"XLMRobertaTokenizer" if is_sentencepiece_available() else None,
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"xlnet",
(
"XLNetTokenizer" if is_sentencepiece_available() else None,
"XLNetTokenizerFast" if is_tokenizers_available() else None,
),
),
("xlstm", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
(
"xmod",
(
"XLMRobertaTokenizer" if is_sentencepiece_available() else None,
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"yoso",
(
"AlbertTokenizer" if is_sentencepiece_available() else None,
"AlbertTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"zamba",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"zamba2",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
]
)
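Each entry in the table is a (slow class name, fast class name) pair, and either side may be None. The sketch below shows one way such a pair could be resolved. It is a simplified illustration, not the library's code: `TOY_MAPPING` is a three-entry excerpt of the table with the availability checks dropped, and `pick_tokenizer_name` is a hypothetical helper.

```python
from collections import OrderedDict
from typing import Optional

# Three entries copied from the table above, without the availability checks.
TOY_MAPPING = OrderedDict(
    [
        ("qwen3", ("Qwen2Tokenizer", "Qwen2TokenizerFast")),
        ("bloom", (None, "BloomTokenizerFast")),
        ("byt5", ("ByT5Tokenizer", None)),
    ]
)


def pick_tokenizer_name(model_type: str, use_fast: bool = True) -> Optional[str]:
    """Choose between the (slow, fast) names registered for a model type."""
    slow_name, fast_name = TOY_MAPPING[model_type]
    # Prefer the fast class when requested, or when no slow class exists.
    if fast_name is not None and (use_fast or slow_name is None):
        return fast_name
    return slow_name


print(pick_tokenizer_name("qwen3"))                  # Qwen2TokenizerFast
print(pick_tokenizer_name("qwen3", use_fast=False))  # Qwen2Tokenizer
print(pick_tokenizer_name("byt5"))                   # ByT5Tokenizer
```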
3.1 AutoTokenizer.from_pretrained
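Calling the class method AutoTokenizer.from_pretrained inspects the model's configuration files and returns the tokenizer that matches the model, which here is a Qwen2TokenizerFast tokenizer.
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True  # needed only when the checkpoint ships custom code
)
This function instantiates a tokenizer class from the pretrained model's vocabulary files. As the steps below show, for this checkpoint the class is chosen from the "tokenizer_class" field of tokenizer_config.json; the `model_type` attribute in config.json, looked up in TOKENIZER_MAPPING_NAMES, serves as the fallback when that field is absent.
The implementation details follow:
3.1.1 Reading the tokenizer_config file
Read tokenizer_config.json and return its contents as the Python dictionary tokenizer_config: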
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
3.1.2 Getting the tokenizer_class field
Get the "tokenizer_class" field from the tokenizer_config dictionary; here config_tokenizer_class = "Qwen2Tokenizer".
config_tokenizer_class = tokenizer_config.get("tokenizer_class")
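Steps 3.1.1 and 3.1.2 amount to loading a JSON file and reading one key. The sketch below reproduces that with the standard library only. `read_tokenizer_class` is a hypothetical helper, and it handles only a local directory, whereas the library's get_tokenizer_config also deals with remote repositories and caching.

```python
import json
import os
import tempfile


def read_tokenizer_class(model_dir):
    """Return the "tokenizer_class" entry of tokenizer_config.json, or None."""
    config_path = os.path.join(model_dir, "tokenizer_config.json")
    if not os.path.isfile(config_path):
        return None
    with open(config_path, encoding="utf-8") as handle:
        tokenizer_config = json.load(handle)
    return tokenizer_config.get("tokenizer_class")


# Build a throwaway model directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as model_dir:
    config_file = os.path.join(model_dir, "tokenizer_config.json")
    with open(config_file, "w", encoding="utf-8") as handle:
        json.dump({"tokenizer_class": "Qwen2Tokenizer"}, handle)
    config_tokenizer_class = read_tokenizer_class(model_dir)

print(config_tokenizer_class)  # Qwen2Tokenizer
```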
3.1.3 Getting the tokenizer class from the tokenizer_class name
if use_fast and not config_tokenizer_class.endswith("Fast"):
    tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
Here use_fast defaults to True, so the Fast variant is tried first, i.e. tokenizer_class_candidate = "Qwen2TokenizerFast".
The tokenizer class is then looked up by that name, so tokenizer_class is:
<class 'transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast'>
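The name resolution in this step can be sketched without the library. In the sketch below, `KNOWN_CLASSES` and `lookup` are hypothetical stand-ins for the classes the library can import and for tokenizer_class_from_name, and the fallback to the plain name when no Fast variant is found is an assumption of this sketch rather than something shown in the excerpt above.

```python
# Hypothetical registry standing in for the classes the library can import.
KNOWN_CLASSES = {"Qwen2Tokenizer", "Qwen2TokenizerFast", "ByT5Tokenizer"}


def lookup(name):
    """Toy name lookup: return the name if it is registered, else None."""
    return name if name in KNOWN_CLASSES else None


def resolve_class_name(config_tokenizer_class, use_fast=True):
    resolved = None
    if use_fast and not config_tokenizer_class.endswith("Fast"):
        # Try the Fast variant first.
        resolved = lookup(f"{config_tokenizer_class}Fast")
    if resolved is None:
        # Assumed fallback: use the name exactly as written in the config.
        resolved = lookup(config_tokenizer_class)
    return resolved


print(resolve_class_name("Qwen2Tokenizer"))                  # Qwen2TokenizerFast
print(resolve_class_name("Qwen2Tokenizer", use_fast=False))  # Qwen2Tokenizer
print(resolve_class_name("ByT5Tokenizer"))                   # ByT5Tokenizer
```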
3.1.4 Calling from_pretrained on the resolved tokenizer class
Call the from_pretrained function of Qwen2TokenizerFast, the tokenizer class for this model.
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
This is in fact the from_pretrained function of the base class PreTrainedTokenizerBase. The class hierarchy is:
class Qwen2TokenizerFast(PreTrainedTokenizerFast):
class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
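The reason a method defined on the base class still returns a Qwen2TokenizerFast object is ordinary Python classmethod behavior: `cls` is bound to the class the call was made on. A minimal sketch with toy classes (the `Toy*` names are hypothetical and mirror only the shape of the hierarchy above):

```python
class ToyBase:
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_pretrained(cls, path):
        # `cls` is whichever class the method was called on,
        # so subclasses get an instance of themselves.
        return cls(path)


class ToyFast(ToyBase):
    pass


class ToyQwen2Fast(ToyFast):
    pass


tok = ToyQwen2Fast.from_pretrained("/tmp/model")
print(type(tok).__name__)  # ToyQwen2Fast
print([c.__name__ for c in ToyQwen2Fast.__mro__])
# ['ToyQwen2Fast', 'ToyFast', 'ToyBase', 'object']
```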
3.1.5 PreTrainedTokenizerBase::from_pretrained
The relevant file paths are stored in the resolved_vocab_files dictionary, which prints as:
'vocab_file': '/home/boy/qwen/vocab.json',
'merges_file': '/home/boy/qwen/merges.txt',
'tokenizer_file': '/home/boy/qwen/tokenizer.json',
'added_tokens_file': None,
'special_tokens_map_file': None,
'tokenizer_config_file': '/home/boy/qwen/tokenizer_config.json',
'chat_template_file': None
It then calls the class method PreTrainedTokenizerBase::_from_pretrained, which instantiates the Qwen2TokenizerFast object and returns it.
return cls._from_pretrained(resolved_vocab_files,pretrained_model_name_or_path,
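For a local directory, the resolved_vocab_files printout above is consistent with a simple rule: each logical name maps to a path if the file exists and to None otherwise. The sketch below illustrates that rule only. `resolve_vocab_files` is a hypothetical helper, the file names for the two entries printed as None are assumptions, and the library's own resolution also covers remote repositories and caching.

```python
import os
import tempfile

# Logical name -> file name. The names for added_tokens_file and
# special_tokens_map_file are assumptions made for this sketch.
VOCAB_FILES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
    "tokenizer_file": "tokenizer.json",
    "added_tokens_file": "added_tokens.json",
    "special_tokens_map_file": "special_tokens_map.json",
    "tokenizer_config_file": "tokenizer_config.json",
}


def resolve_vocab_files(model_dir):
    """Map each logical name to an existing path, or None if the file is missing."""
    resolved = {}
    for key, file_name in VOCAB_FILES.items():
        path = os.path.join(model_dir, file_name)
        resolved[key] = path if os.path.isfile(path) else None
    return resolved


with tempfile.TemporaryDirectory() as model_dir:
    # Create empty placeholders for the files present in the listing above.
    for present in ("vocab.json", "merges.txt", "tokenizer.json", "tokenizer_config.json"):
        open(os.path.join(model_dir, present), "w").close()
    resolved_vocab_files = resolve_vocab_files(model_dir)

for key, path in resolved_vocab_files.items():
    print(key, "->", path)
```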
3.1.6 PreTrainedTokenizerBase::_from_pretrained
1. Read the contents of the tokenizer_config file into the init_kwargs dictionary
with open(tokenizer_config_file, encoding="utf-8") as tokenizer_config_handle:
    init_kwargs = json.load(tokenizer_config_handle)
2. Convert the entries of the added_tokens_decoder field to AddedToken objects and store them in the init_kwargs dictionary
#### Handle tokenizer serialization of added and special tokens
added_tokens_decoder: dict[int, AddedToken] = {}
added_tokens_map: dict[str, AddedToken] = {}
# if we have info on the slow added tokens
if "added_tokens_decoder" in init_kwargs:
    for idx, token in init_kwargs["added_tokens_decoder"].items():
        if isinstance(token, dict):
            token = AddedToken(**token)
        if isinstance(token, AddedToken):
            added_tokens_decoder[int(idx)] = token
            added_tokens_map[str(token)] = token
        else:
            raise TypeError(
                f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary or an AddedToken instance"
            )
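The loop above builds two lookup tables: token id to token object, and token text to token object. The sketch below runs the same conversion on toy data. `ToyAddedToken` is a hypothetical stand-in for AddedToken that keeps only a few fields, and the ids and entries in `raw` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAddedToken:
    """Hypothetical stand-in for AddedToken, keeping only a few fields."""
    content: str
    special: bool = False
    lstrip: bool = False
    rstrip: bool = False

    def __str__(self):
        return self.content


# Shaped like an "added_tokens_decoder" section loaded from JSON.
# JSON object keys are strings, so the ids arrive as str.
raw = {
    "0": {"content": "<|endoftext|>", "special": True},
    "1": {"content": "<|im_end|>", "special": True},
}

added_tokens_decoder = {}
added_tokens_map = {}
for idx, token in raw.items():
    if isinstance(token, dict):
        token = ToyAddedToken(**token)
    added_tokens_decoder[int(idx)] = token  # id -> token object
    added_tokens_map[str(token)] = token    # text -> token object

print(added_tokens_decoder[1].content)  # <|im_end|>
print(sorted(added_tokens_map))         # ['<|endoftext|>', '<|im_end|>']
```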
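3. Replace the special tokens
The special-token fields in init_kwargs (the dictionary loaded from tokenizer_config) are replaced with the corresponding token objects.
For example:
Before: "eos_token": "<|im_end|>" in tokenizer_config,
After: "eos_token": an AddedToken object in tokenizer_config,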
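A sketch of this replacement, under stated assumptions: `SPECIAL_KEYS` lists only a few special-token fields, a plain tuple stands in for an AddedToken object, and the inputs are invented. It illustrates the idea and is not the library's code.

```python
# Hypothetical inputs: a text -> token-object map, and the kwargs read from
# tokenizer_config.json. A plain tuple stands in for an AddedToken object.
added_tokens_map = {"<|im_end|>": ("AddedToken", "<|im_end|>")}
init_kwargs = {
    "eos_token": "<|im_end|>",
    "bos_token": None,
    "clean_up_tokenization_spaces": False,
}

# Only a subset of the special-token fields, for illustration.
SPECIAL_KEYS = ("bos_token", "eos_token", "unk_token", "pad_token")

for key in SPECIAL_KEYS:
    value = init_kwargs.get(key)
    if value is not None:
        # Keep the original value when no matching token object exists.
        init_kwargs[key] = added_tokens_map.get(str(value), value)

print(init_kwargs["eos_token"])  # ('AddedToken', '<|im_end|>')
print(init_kwargs["bos_token"])  # None
```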
4. Instantiate the Qwen2TokenizerFast object
The tokenizer object for the model is instantiated from init_kwargs, the dictionary built from tokenizer_config above:
tokenizer = cls(*init_inputs, **init_kwargs)
3.1.7 Qwen2TokenizerFast
The initialization function:
class Qwen2TokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" Qwen2 tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
    Byte-Pair-Encoding.
    Same with GPT2Tokenizer, this tokenizer has been trained to treat spaces like parts of the tokens so a word will
    be encoded differently whether it is at the beginning of the sentence (without space) or not:
    ```python
    >>> from transformers import Qwen2TokenizerFast
    >>> tokenizer = Qwen2TokenizerFast.from_pretrained("Qwen/Qwen-tokenizer")
    >>> tokenizer("Hello world")["input_ids"]
    [9707, 1879]
    >>> tokenizer(" Hello world")["input_ids"]
    [21927, 1879]
    ```
    """
    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]
    slow_tokenizer_class = Qwen2Tokenizer
    def __init__(
        self,
        vocab_file=None,
        merges_file=None,
        tokenizer_file=None,
        unk_token="<|endoftext|>",
        bos_token=None,
        eos_token="<|endoftext|>",
        pad_token="<|endoftext|>",
        **kwargs,
    ):
        # We need to at least pass vocab_file and merges_file to base class
        # in case a slow tokenizer needs to be initialized; other can be
        # configured through files.
        # following GPT2TokenizerFast, also adding unk_token, bos_token, and eos_token
        bos_token = (
            AddedToken(bos_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(bos_token, str)
            else bos_token
        )
        eos_token = (
            AddedToken(eos_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(eos_token, str)
            else eos_token
        )
        unk_token = (
            AddedToken(unk_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(unk_token, str)
            else unk_token
        )
        pad_token = (
            AddedToken(pad_token, lstrip=False, rstrip=False, special=True, normalized=False)
            if isinstance(pad_token, str)
            else pad_token
        )
        super().__init__(
            vocab_file=vocab_file,
            merges_file=merges_file,
            tokenizer_file=tokenizer_file,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            **kwargs,
        )
3.1.8 PreTrainedTokenizerFast __init__ function
This function is long. The part that matters here is:
fast_tokenizer_file='/home/boy/qwen/tokenizer.json'
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
self._tokenizer = fast_tokenizer
In short, the Qwen2TokenizerFast object is initialized from the model's tokenizer files and then returned.
3.2. apply_chat_template
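# Prepare the conversation
messages = [
    {"role": "user", "content": "介绍一下qwen3模型"}
]
print(tokenizer.chat_template)
# Apply the chat template to format the conversation into the text the model expects
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # do not tokenize here; only return the formatted text string
    add_generation_prompt=True  # append the marker that tells the model to start its reply
)
print(text)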
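apply_chat_template is a core Tokenizer method in the Hugging Face Transformers library. It converts a list of conversation messages into the text format the model expects. Key points:
Core functionality
- Input: each message is a dictionary with role and content fields. Which roles are accepted depends on the model's template; the Qwen3 template shown below handles system, user, assistant, and tool.
- Output: the conversation is formatted into a model input string according to the tokenizer's chat_template attribute (for Qwen, <|im_start|>user\ncontent<|im_end|>).
Parameters
- tokenize: if True (the default), token IDs are returned; if False, the formatted text string is returned.
- add_generation_prompt: if True, the marker that starts the assistant's reply is appended at the end (for Qwen, <|im_start|>assistant\n).
Custom templates
Templates follow Jinja2 syntax and typically rely on three kinds of constructs:
- Role: {{ message.role }}, read from each entry of messages
- Content: {{ message.content }} (inserted verbatim; the template below applies no escape filter)
- Control structures: {% if add_generation_prompt %} for conditional rendering.
Running the code above prints the following: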
3.2.1 chat_template
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}
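The above is a Jinja template that formats the conversation history and supports tool calls (function calling) and reasoning content. It is fairly complex, so we walk through it step by step.
The template uses Jinja2 syntax throughout: conditionals, loops, variable assignment, and filters.
Its main structure is as follows:
- Handle tool definitions (if tools are provided)
  - If tools are provided, a system message containing the tool definitions is emitted first.
  - If the first message is a system message, its content is included too, followed by the tool-usage instructions and the tool list.
  - If no tools are provided but the first message is a system message, only that system message is emitted.
- Set up a namespace variable to carry state across loop iterations (a plain set inside a Jinja loop does not persist outside the loop).
  - The variable ns is created with multi_step_tool initially true and last_query_index initially the message count minus one.
  - The message list is then traversed from back to front to locate the last user query, that is, the last user message that is not a tool response.
- Traverse the message list forward and format each message by role and content:
  - If the content is a string it is used directly; otherwise it is set to an empty string.
  - User messages, and system messages other than the first, are formatted as <|im_start|>{role}\n{content}<|im_end|>\n
  - Assistant messages are more involved:
    - The reasoning content is extracted. If the message has a reasoning_content attribute it is used; otherwise the template tries to extract the text inside the <think> tags from the content.
    - If the message index is greater than last_query_index (the message comes after the last user query) and the message is either the last one or has reasoning content, the reasoning content is wrapped in <think> tags and followed by the remaining content.
    - Otherwise the content is emitted directly.
    - If the message has tool calls (tool_calls), each one is emitted as <tool_call>\n{"name": tool_name, "arguments": tool_arguments}\n</tool_call>.
  - Tool messages (tool) are formatted as tool responses. Consecutive tool messages are merged into a single user turn, each wrapped in <tool_response> tags.
- If add_generation_prompt is set, the assistant start marker is appended at the end, and an empty think block is added when enable_thinking is explicitly false.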
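Two pieces of the template logic are easier to follow in plain Python: the backward scan that sets last_query_index, and the string splitting that separates reasoning from the answer. This is a minimal sketch that mirrors the Jinja expressions above; the function names `find_last_query_index` and `split_reasoning` and the sample conversation are illustrative, not part of transformers.

```python
def find_last_query_index(messages):
    """Index of the last user message that is not a wrapped tool response."""
    for index in range(len(messages) - 1, -1, -1):
        message = messages[index]
        content = message.get("content")
        if message["role"] != "user" or not isinstance(content, str):
            continue
        is_tool_response = content.startswith("<tool_response>") and content.endswith(
            "</tool_response>"
        )
        if not is_tool_response:
            return index
    # Same default as the template: the index of the last message.
    return len(messages) - 1


def split_reasoning(content):
    """Split '<think>...</think>answer' into (reasoning, answer)."""
    if "</think>" not in content:
        return "", content
    reasoning = content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    answer = content.split("</think>")[-1].lstrip("\n")
    return reasoning, answer


# Illustrative conversation: a query, an assistant turn, then a tool response.
conversation = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<think>\nNeed the tool.\n</think>\n\nCalling the tool."},
    {"role": "user", "content": "<tool_response>\nsunny\n</tool_response>"},
]

print(find_last_query_index(conversation))
print(split_reasoning(conversation[1]["content"]))
```

The scan skips the final user message because it is a wrapped tool response, so the real query at index 0 is treated as the last query. Assistant turns after that index are the ones whose reasoning the template keeps.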
3.2.2 Formatted text string:
<|im_start|>user
介绍一下qwen3模型<|im_end|>
<|im_start|>assistant
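The string above can be reproduced by hand for the simple case of plain-text user and system turns. The following is a simplified sketch, not the real template: `render_chatml` is a hypothetical helper that skips tools, assistant turns, and reasoning content entirely.

```python
def render_chatml(messages, add_generation_prompt=False, enable_thinking=None):
    """Render plain-text user/system turns in the ChatML layout used by Qwen."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user"):
            # Assistant and tool turns need the extra logic of the full template.
            raise ValueError(f"role {role!r} is not covered by this sketch")
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        if enable_thinking is False:
            # An empty think block tells the model to skip its reasoning phase.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "介绍一下qwen3模型"}]
text = render_chatml(messages, add_generation_prompt=True)
print(text)
```

Passing enable_thinking=False appends the empty think block, which matches the last branch of the template.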
3.3 encode
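# Convert the formatted text into the input format the model accepts
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
- Turn text into model input: neural networks operate on numbers, not raw text, so the text must be converted into token IDs before the model can process it.
- Normalization and preprocessing: encoding typically includes text normalization (such as lowercasing or accent stripping, depending on the tokenizer), tokenization (splitting text into words or subword units), and adding special tokens (such as [CLS] and [SEP] in BERT-style models).
- Handling unknown words: subword algorithms such as BPE and WordPiece break unseen words into known subword units, so the model rarely meets a token outside its vocabulary.
- Preserving context: special tokens can mark the start or end of a sequence or separate two sentences, which helps the model interpret context.
- Batching: texts of different lengths are converted into sequences of equal length (by padding or truncation) so they can be processed as one batch.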
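The batching point can be made concrete with plain lists. This is a minimal sketch of padding and truncation that produces input_ids and attention_mask. `pad_batch` is an illustrative helper, the pad ID 0 is arbitrary, and padding is done on the right for simplicity.

```python
def pad_batch(batch_ids, pad_id, max_length=None):
    """Pad (and optionally truncate) token ID lists to one common length."""
    target = max(len(ids) for ids in batch_ids)
    if max_length is not None:
        target = min(target, max_length)

    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:target]                # truncate sequences that are too long
        padding = target - len(ids)       # number of pad positions to add
        input_ids.append(ids + [pad_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside input_ids.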
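The call first enters:
the __call__ function of class PreTrainedTokenizerBase in transformers/tokenization_utils_base.py
and finally reaches:
the _batch_encode_plus function of class PreTrainedTokenizerFast in transformers/tokenization_utils_fast.py
Inside this function:
Step 1: call the encode_batch function
_tokenizer is an instance of the Tokenizer class. It is a Python binding; the underlying implementation is a high-performance Rust library.
encodings = self._tokenizer.encode_batch(
    batch_text_or_text_pairs,
    add_special_tokens=add_special_tokens,
    is_pretokenized=is_split_into_words,
)
The Tokenizer's encoding process involves several steps:
- Normalization: depending on the model, this may include lowercasing, Unicode normalization, accent stripping, and so on.
- Tokenization: split the text into tokens (subwords, words, or characters).
- Conversion to IDs: map each token to its integer ID in the vocabulary.
- Adding special tokens: such as [CLS], [SEP], and [PAD] in BERT-style models.
- Padding and truncation: make all sequences the same length.
# This is a Python binding; the underlying implementation is a high-performance Rust library
from tokenizers import Tokenizer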
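The tokenization and ID-conversion steps can be illustrated with a toy BPE encoder. This is a character-level sketch with a made-up merge table and vocabulary, not the Rust implementation. The real Qwen tokenizer works on bytes, applies a pre-tokenization step first, and reads its merges and vocabulary from the tokenizer files.

```python
def bpe_encode(word, merges, vocab):
    """Apply BPE merges to one word, then map the resulting pieces to IDs."""
    # Earlier entries in the merge list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)

    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair can be merged any further

        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

    return symbols, [vocab[symbol] for symbol in symbols]


# Made-up merge table and vocabulary, for illustration only.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
toy_vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

pieces, ids = bpe_encode("lower", toy_merges, toy_vocab)
print(pieces, ids)
```

Because every single character is in the vocabulary, a word the merge table has never seen still encodes into known pieces. This is the property that lets subword tokenizers avoid unknown tokens.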
Step 2: convert the encodings to obtain input_ids and attention_mask, which are stored in sanitized_tokens
tokens_and_encodings = [
    self._convert_encoding(
        encoding=encoding,
        return_token_type_ids=return_token_type_ids,
        return_attention_mask=return_attention_mask,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_offsets_mapping=return_offsets_mapping,
        return_length=return_length,
        verbose=verbose,
    )
    for encoding in encodings
]
sanitized_tokens is a dictionary containing input_ids and attention_mask. Finally a BatchEncoding object is returned:
return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
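The shape change in step 2 is from one dictionary per sequence to a single dictionary of lists. A minimal sketch of that reshaping is shown here. `collate` is an illustrative name and the token IDs are made up; the real _batch_encode_plus also handles overflowing tokens and tensor conversion, which this sketch leaves out.

```python
def collate(per_sequence):
    """Merge a list of per-sequence dicts into one dict of lists."""
    batch = {}
    for item in per_sequence:
        for key, value in item.items():
            batch.setdefault(key, []).append(value)
    return batch


# Made-up per-sequence results, shaped like simplified converted encodings.
per_sequence = [
    {"input_ids": [101, 102, 103], "attention_mask": [1, 1, 1]},
    {"input_ids": [201, 202], "attention_mask": [1, 1]},
]

sanitized = collate(per_sequence)
print(sanitized["input_ids"])
print(sanitized["attention_mask"])
```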
After encoding, the next step is generate.
input_ids and attention_mask are taken from model_inputs and passed to generate as its inputs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512  # maximum number of newly generated tokens
)