Study Plan for January 14, 2026

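This post mainly discusses BPE algorithm optimization and the GPT-2 tokenizer implementation. The author plans to finish optimizing the BPE algorithm this week (the update algorithm and multithreaded training), wrap up the happy_llm project, and start studying minimind. It looks closely at GPT-2's BPE pre-tokenization rule and finds that it keeps the space in front of each word, possibly to preserve context boundaries. On the optimization side, it notes that merging best_pair into the sequences can be parallelized, while determining the globally most frequent best_pair still has to be done in a single thread. The code example shows how to use a regular expression to implement…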

victory0431 · Published 2026-01-14 21:37:41

Contents

This Week's Plan
GPT-2 BPE Matching Rules

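This Week's Plan

Master the BPE algorithm thoroughly, build it by hand, and optimize two parts: (1) the update algorithm, (2) multithreaded training.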

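To-do	Progress	Status
Why spaces are not stripped	1. They mark word boundaries 2. Generation needs no separate step to re-insert spaces	✅
Optimize the update function	1. Incrementally update the symbols of the merged pair 2. Update the counts of the pairs before and after it	✅
Optimize multithreaded computation	Counting is done globally and serially; the update part can run in parallel	✅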
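The incremental update listed in the table can be sketched roughly as follows. This is a minimal illustration, not the author's actual update function: the name `merge_with_incremental_counts` and the choice of single-byte `bytes` symbols are assumptions. Instead of recounting every pair after a merge, it adjusts only the counts of the merged pair and of the pairs immediately before and after each merge site.

```python
from collections import Counter


def merge_with_incremental_counts(seq, pair, pair_counts):
    """Merge `pair` in `seq` and adjust `pair_counts` in place (hypothetical helper)."""
    a, b = pair
    new_sym = a + b
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            if out:
                # The pair formed with the left neighbour changes.
                pair_counts[(out[-1], a)] -= 1
                pair_counts[(out[-1], new_sym)] += 1
            if i + 2 < len(seq):
                # The pair formed with the right neighbour changes.
                pair_counts[(b, seq[i + 2])] -= 1
                pair_counts[(new_sym, seq[i + 2])] += 1
            pair_counts[pair] -= 1  # the merged pair itself disappears
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


seq = (b"a", b"b", b"a", b"b", b"c")
counts = Counter(zip(seq, seq[1:]))
merged = merge_with_incremental_counts(seq, (b"a", b"b"), counts)
# Unary plus drops zero entries, leaving only pairs that still occur.
print(merged, +counts)
```

A full recount of `merged` with `Counter(zip(merged, merged[1:]))` should give the same non-zero counts, which is an easy way to check an incremental implementation against a naive one.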

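I relied on Doubao to write the code. The upside is that the code gets written quickly; the downside is that errors in the details get in the way of understanding the whole. I should find a reliable implementation, such as https://github.com/karpathy/minbpe, and study against it directly.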

Official GPT-2 implementation: https://github.com/openai/gpt-2/blob/master/src/encoder.py

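Wrap up happy_llm; train the tokenizer on a small batch ❌
Pretrain happy_llm with a small parameter count. Goal: it can hold a conversation and the loss visibly converges
Train on both the happy_llm and minimind pretraining datasets ✅
Start studying minimind; get through pretraining quickly and start SFT and reinforcement learning as soon as possible.
Pretraining ✅
Read the article on parameter counting to understand the architecture in depth ✅ https://blog.csdn.net/victory0431/article/details/156993193
SFT ❌
RLHF ❌
Wednesday: I must start watching Hung-yi Lee's reinforcement learning course today! ✅
Read technical reports ❌
Llama 2 / Llama 3 Technical Reports
https://ai.meta.com/blog/meta-llama-3-1/
OpenAI's 2020 paper "Scaling Laws for Neural Language Models"
Chinchilla: "Training Compute-Optimal Large Language Models"
Deploy 5 model services ❌
Read the SFT part of happy_llm to fill in the SFT theory, and discuss it with Gemini ❌
Complete full and LoRA training ❌
Friday: watch 1 hour of the reinforcement learning course https://www.youtube.com/playlist?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49 ❌
https://www.youtube.com/watch?v=OAKAZhFmYoI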

GPT-2 BPE Matching Rules

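import regex as re  # third-party `regex` package; provides \p{L} and \p{N}

# GPT-2 pre-tokenization pattern: contractions, then optional space + letters,
# optional space + digits, optional space + other symbols, then whitespace runs.
GPT2_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
compiled_pattern = re.compile(GPT2_PATTERN, re.UNICODE)


def pretokenize(doc_segments):  # function name assumed; the original snippet omits the def line
    pretokenized = []
    for segment in doc_segments:
        if not segment.strip():
            continue  # skip empty or whitespace-only segments
        pre_tokens = compiled_pattern.findall(segment)

        print(f"pretokens: {pre_tokens}")
        for pt in pre_tokens:
            if not pt.strip():
                continue  # drop whitespace-only pre-tokens such as '\n'
            # Split the pre-token into single-byte symbols, one per UTF-8 byte,
            # so a multi-byte character becomes several symbols.
            byte_seq = tuple(bytes([b]) for b in pt.encode("utf-8"))
            pretokenized.append(byte_seq)
    return pretokenized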

Only when I printed pre_tokens did I notice that every word has a space in front of it:

…ater', ' to', ' make', ' it', ' nice', ' and', ' bubbly', '.', ' He', ' relaxed', ' again', ' and', ' felt', ' all', ' the', ' worries', ' wash', ' away', '.', '\n', 'The', ' king', ' was', ' so', ' happy', ' that', ' he', ' had', ' been', ' able', ' to', ' clean', ' up', ' the', ' mess', ' he', ' had', ' made', ' and', ' enjoy', ' a', ' nice', ' soak', '.', ' He', ' dried', ' off', ' and', ' wrapped', ' himself', ' up', ' in', ' a', ' big', ' towel', '.', ' Then', ',', ' the', ' king', ' wen…

Is this because I used the GPT-2 pattern for splitting? And what is the deeper reason GPT-2 does it this way?
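The leading-space behaviour can be reproduced on a short string. The sketch below uses only the standard library, so the pattern is an approximation of the GPT-2 one (assumption: `[^\W\d_]` stands in for `\p{L}` and `\d` for `\p{N}`); the sample text is made up for illustration. It also shows the second point from the to-do table: because each space lives inside the token that follows it, concatenating the pre-tokens restores the text exactly.

```python
import re

# Stdlib approximation of the GPT-2 pattern (assumption: [^\W\d_] replaces
# \p{L} and \d replaces \p{N}; the real pattern needs the `regex` package).
APPROX_GPT2_PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)

text = "He relaxed again.\nThe king was happy."
pre_tokens = APPROX_GPT2_PATTERN.findall(text)
print(pre_tokens)

# The space is stored inside the token that follows it, so plain concatenation
# restores the text without any rule for re-inserting spaces.
restored = "".join(pre_tokens)
print(restored == text)
```

Words in the middle of a line come out with their leading space, while a word at the start of the text or right after a newline has none, which matches the 'The' following '\n' in the printed output above.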

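II. Core optimization direction 2: multithreading / multiprocessing (for independent sequences)
Your judgment is right: the merge logic for each pre-tokenized sequence (a single word or phrase, for example) is completely independent, so the sequences can be processed in parallel. Note, however:
❌ The merge rule (choosing the most frequent best_pair) is global: it depends on counts summed over all sequences, so it must be determined in a single thread;
✅ Applying the best_pair merge to each sequence (_merge_byte_pair) and counting byte-pair frequencies within each sequence can both run in parallel.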
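One training step with this split between parallel and serial work might look like the sketch below. It is an illustration under assumptions, not the code discussed above: `count_pairs`, `merge_pair`, `merge_chunk`, and `train_step` are hypothetical names, the sequences are pre-split into chunks, and a thread pool is used only to keep the example short.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def count_pairs(seqs):
    """Count adjacent symbol pairs in one chunk of sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


def merge_chunk(seqs, pair):
    return [merge_pair(s, pair) for s in seqs]


def train_step(chunks, pool):
    # Parallel: every chunk counts its own pairs independently.
    partial_counts = pool.map(count_pairs, chunks)
    # Serial: sum the counts and choose the single global best pair.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    if not total:
        return None, chunks
    best_pair = max(total, key=lambda p: (total[p], p))  # ties broken by pair value
    # Parallel: apply the chosen merge to every chunk.
    new_chunks = list(pool.map(partial(merge_chunk, pair=best_pair), chunks))
    return best_pair, new_chunks


chunks = [
    [(b"l", b"o", b"w"), (b"l", b"o", b"w", b"e", b"r")],
    [(b"l", b"o", b"t")],
]
with ThreadPoolExecutor(max_workers=2) as pool:
    best_pair, new_chunks = train_step(chunks, pool)
print(best_pair, new_chunks)
```

On CPython builds with the GIL, threads are unlikely to speed up this pure-Python counting; `ProcessPoolExecutor` exposes the same `map` interface and is the more likely candidate for a real speedup, at the cost of pickling the chunks on every step.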