Hugging Face Transformers Library Manual (3): A Guide to Fine-Tuning Pretrained Models (API: Trainer)
Fine-tuning models with the Hugging Face Transformers library
Fine-tuning a model is, at its core, still training a model; you simply continue training from a pretrained checkpoint.
There are several frameworks you can train with:
- 🤗 Transformers Trainer: in my experience this API feels somewhat narrow and less flexible than native PyTorch. It is workable, though: training loops involve many hyperparameters that are hard to organize cleanly, so native PyTorch code tends to get messy too, and I have yet to see a training-code pattern I am fully happy with.
- TensorFlow + Keras
- native PyTorch
A minimal example
This guide only covers the 🤗 Transformers library. The rough steps are listed below, along with a minimal example (example source):
- Prepare the dataset
# 1. Download a ready-made dataset
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
>>> dataset["train"][100]
{'label': 0,
'text': 'My expectations for McDonalds are t rarely high. But ...'}
# 2. Define a preprocessor (for text data: tokenize + pad + truncate)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
# Calling datasets.map() applies the preprocessing function to the whole dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 3. Optionally, sample a smaller subset for faster training
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
- Train with transformers.Trainer
The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
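For reference, those features map onto TrainingArguments options roughly as follows (a hedged aside, not part of the original example; the parameter names are standard TrainingArguments fields, the values are arbitrary):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_trainer",
    logging_steps=50,                # log training metrics every 50 update steps
    gradient_accumulation_steps=4,   # accumulate gradients over 4 mini-batches per update
    fp16=True,                       # mixed-precision training (requires a CUDA GPU)
)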
# 1. Get a model (with the all-purpose from_pretrained())
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
# 2. Set the training arguments (note: AutoConfig holds model hyperparameters; here we need training hyperparameters, i.e. a TrainingArguments object)
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")
# 3. Define an evaluation function, so you can see which epoch performs best during training
import numpy as np
import evaluate  # Hugging Face also provides the handy evaluate package
# accuracy is enough for this classification task
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
# 4. Instantiate a Trainer and start training
from transformers import Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
trainer.train()
From the example above, the key ingredients become clear:
- dataset
- model
- args
- metrics
Keeping metrics and args separate makes sense. In native PyTorch training code, the piece that usually sits closest to the metrics is the optimizer, but the optimizer belongs in args because it is part of training, whereas metrics are not about training at all; they are about evaluation.
- For the dataset: the 🤗 Datasets library is introduced below;
- For the model: write your own class that inherits from PreTrainedModel. Since this is a hand-holding tutorial, how to write your own Model is also covered below;
- For the args: you need the TrainingArguments class, also part of the 🤗 Transformers library, introduced below;
- For the metrics: use the 🤗 Evaluate library, also introduced below.
🤗 Datasets
Dataset
The base class Dataset implements a Dataset backed by an Apache Arrow table.
Feature
The Features format is simple: dict[column_name, column_type]. It is a dictionary of column name and column type pairs.
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
}
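As a quick illustration of working with these feature objects (a hedged example reusing the MRPC dataset loaded above; ClassLabel.names and ClassLabel.int2str are standard 🤗 Datasets APIs):
>>> dataset.features["label"].names
['not_equivalent', 'equivalent']
>>> dataset.features["label"].int2str(0)
'not_equivalent'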
quick start
- Use load_dataset() to get a Dataset object
from datasets import load_dataset, Image
dataset = load_dataset("beans", split="train")
- Set up data augmentation with any library you like; torchvision is used here (which is what most experienced PyTorch users reach for anyway).
Images and text are handled differently:
from torchvision.transforms import Compose, ColorJitter, ToTensor
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# For images
jitter = Compose(
[ColorJitter(brightness=0.5, hue=0.5), ToTensor()]
)
# For text
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
- Write a function that applies your transform to the dataset and generates the model input pixel_values:
# For images
def transforms(examples):
examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
return examples
# For text
def encode(examples):
return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")
- For image augmentation, call train_dataset.with_transform() and pass in your function; for text tokenization, call train_dataset.map() and pass in your function.
# For images
dataset = dataset.with_transform(transforms)
# For text
dataset = dataset.map(encode, batched=True)
>>> dataset[0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, ...]),
'token_type_ids': array([0, 0, 0, 0, 0, ..., 0, 1, 1, ...]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, ...])}
- Set the dataset format according to the machine learning framework you're using (for PyTorch this means passing the dataset to a DataLoader).
import torch
from torch.utils.data import DataLoader
def collate_fn(examples):
images = []
labels = []
for example in examples:
images.append((example["pixel_values"]))
labels.append(example["labels"])
pixel_values = torch.stack(images)
labels = torch.tensor(labels)
return {"pixel_values": pixel_values, "labels": labels}
dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataloader = DataLoader(dataset, batch_size=32)
Inspect a dataset's info before deciding whether to download it
Use load_dataset_builder() to inspect the info, and load_dataset() to actually load (download) it.
from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")
>>> # Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.
>>> # Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
'text': Value(dtype='string', id=None)}
Once the info above confirms it is the dataset you want, download (load) it with load_dataset():
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
A dataset may be divided into train/test/validation splits; use get_dataset_split_names() to see which splits exist:
from datasets import get_dataset_split_names
>>> get_dataset_split_names("rotten_tomatoes")
['train', 'validation', 'test']
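If you only want part of a split, load_dataset() also supports split slicing (a small illustrative example, not from the original article):
from datasets import load_dataset
# first 10% of the train split
train_10pct = load_dataset("rotten_tomatoes", split="train[:10%]")
# validation and test concatenated into one split
val_plus_test = load_dataset("rotten_tomatoes", split="validation+test")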
Create an image dataset
Two options:
- Create an image dataset with ImageFolder and some metadata: a no-code, painless way to build a dataset.
- Create an image dataset by writing a loading script: more work, but more freedom; you decide how the dataset is defined and downloaded.
ImageFolder
Organize your image files in the following structure:
folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
Then call load_dataset() with the "imagefolder" builder, pass the folder path, and you get a dataset:
from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
Going beyond classification: captioning / object detection
If you want to attach additional information to the dataset, such as text captions or bounding boxes, put it in a metadata.csv file. The dataset can then be used not only for classification but also for text captioning or object detection. The layout becomes:
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
It even works if the images are packed in zip archives:
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
The csv file can look like this:
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
A jsonl file works instead of csv as well:
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
Important
The column name here is the key you will later use to retrieve the value. For example, for a captioning task, replace additional_feature with text:
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
>>> dataset[0]["text"] # if captions were provided, they can be read like this
"This is a golden retriever playing with a ball"
For an object detection task, replace additional_feature with objects:
(Note: the file names here must include the train/test subfolder prefix, otherwise you will get an error, which is painful to debug.)
{"file_name": "train/0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "train/0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "train/0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["objects"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
Writing a dataset script
Organize your dataset files like this:
my_dataset/
├── README.md
├── my_dataset.py
└── data/ # optional, may contain your images or TAR archives
Then use your dataset like this:
from datasets import load_dataset
dataset = load_dataset("path/to/my_dataset")
Template example: new dataset template
Create a dataset builder class.
Inherit from datasets.GeneratorBasedBuilder and implement 3 methods:
- _info(self) stores information about your dataset like its description, license, and features.
- _split_generators(self, dl_manager) downloads the dataset and defines its splits.
- _generate_examples(self, images, metadata_path) generates the images and labels for each split.
class Food101(datasets.GeneratorBasedBuilder):
"""Food-101 Images dataset"""
# concrete implementations are given below
def _info(self):
def _split_generators(self, dl_manager):
def _generate_examples(self, images, metadata_path):
Create dataset configurations.
You may need multiple dataset configurations (e.g. your dataset consists of several subsets).
Inherit from datasets.BuilderConfig and add 2 attributes:
- data_url
- metadata_urls
class Food101Config(datasets.BuilderConfig):
"""Builder Config for Food-101"""
def __init__(self, data_url, metadata_urls, **kwargs):
"""BuilderConfig for Food-101.
Args:
data_url: `string`, url to download the zip file from.
metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
**kwargs: keyword arguments forwarded to super.
"""
super(Food101Config, self).__init__(version=datasets.Version("1.0.0"), **kwargs)
self.data_url = data_url
self.metadata_urls = metadata_urls
Suppose your dataset (Food-101) consists of two subsets (breakfast and dinner):
class Food101(datasets.GeneratorBasedBuilder):
"""Food-101 Images dataset"""
BUILDER_CONFIGS = [
Food101Config(
name="breakfast",
description="Food types commonly eaten during breakfast.",
data_url="https://link-to-breakfast-foods.zip",
metadata_urls={
"train": "https://link-to-breakfast-foods-train.txt",
"validation": "https://link-to-breakfast-foods-validation.txt"
},
),
Food101Config(
name="dinner",
description="Food types commonly eaten during dinner.",
data_url="https://link-to-dinner-foods.zip",
metadata_urls={
"train": "https://link-to-dinner-foods-train.txt",
"validation": "https://link-to-dinner-foods-validation.txt"
},
),
]
Then it can be used like this:
from datasets import load_dataset
ds = load_dataset("food101", "breakfast", split="train")
Add dataset metadata.
This is the information a user gets when inspecting the dataset's info (it is actually a DatasetInfo object):
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("food101")
ds_builder.info
The information to provide includes:
- description provides a concise description of the dataset.
- features specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the Image feature.
- supervised_keys specify the input feature and label.
- homepage provides a link to the dataset homepage.
- citation is a BibTeX citation of the dataset.
- license states the dataset’s license.
Example:
def _info(self):
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=datasets.Features(
{
"image": datasets.Image(),
"label": datasets.ClassLabel(names=_NAMES),
}
),
supervised_keys=("image", "label"),
homepage=_HOMEPAGE,
citation=_CITATION,
license=_LICENSE,
task_templates=[ImageClassification(image_column="image", label_column="label")],
)
Download and define the dataset splits.
def _split_generators(self, dl_manager):
archive_path = dl_manager.download(_BASE_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
gen_kwargs={
"images": dl_manager.iter_archive(archive_path),
"metadata_path": split_metadata_paths["train"],
},
),
datasets.SplitGenerator(
name=datasets.Split.VALIDATION,
gen_kwargs={
"images": dl_manager.iter_archive(archive_path),
"metadata_path": split_metadata_paths["test"],
},
),
]
Generate the dataset.
_generate_examples accepts the images and metadata_path from the previous method as arguments.
def _generate_examples(self, images, metadata_path):
"""Generate images and labels for splits."""
with open(metadata_path, encoding="utf-8") as f:
files_to_keep = set(f.read().split("\n"))
for file_path, file_obj in images:
if file_path.startswith(_IMAGES_DIR):
if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
label = file_path.split("/")[2]
yield file_path, {
"image": {"path": file_path, "bytes": file_obj.read()},
"label": label,
}
Create a dataset loading script
Not needed for now; here is the link to study when you do need it: Create a dataset loading script
Adding preprocessing
Depending on your dataset modality, you'll need to:
- Tokenize a text dataset.
- Resample an audio dataset.
- Apply transforms to an image dataset.
data augmentations
Get a feature extractor with AutoFeatureExtractor.from_pretrained.
How to apply it
from transformers import AutoFeatureExtractor
from datasets import load_dataset, Image
from torchvision.transforms import RandomRotation
# 1. Get the dataset object and the preprocessing handler
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
dataset = load_dataset("beans", split="train")
# 2. Write a function that passes the data through the torchvision augmentation object and returns the result
rotate = RandomRotation(degrees=(0, 90))
def transforms(examples):
# by convention the image input to the model is keyed pixel_values, so rename it accordingly
examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
return examples
# 3. Call train_dataset.set_transform and pass in your function
dataset.set_transform(transforms)
dataset[0]["pixel_values"]
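Note that the feature extractor loaded above is not actually used in this snippet. A hedged variant that also runs it (so resizing and normalization match what the ViT checkpoint expects) could look like this; the function name transforms_with_extractor is made up for illustration:
def transforms_with_extractor(examples):
    images = [rotate(image.convert("RGB")) for image in examples["image"]]
    # the feature extractor resizes/normalizes and returns pixel_values tensors
    examples["pixel_values"] = feature_extractor(images, return_tensors="pt")["pixel_values"]
    return examples

dataset.set_transform(transforms_with_extractor)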
tokenizer
Get a tokenizer with AutoTokenizer.from_pretrained. Passing text to the tokenizer returns a dictionary with 3 keys:
- input_ids: the numbers representing the tokens in the text.
- token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
- attention_mask: indicates whether a token should be masked or not.
tokenizer(dataset[0]["text"])
{'input_ids': [101, 1103, 2067, 1110, 17348, ...],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, ...],
'attention_mask': [1, 1, 1, 1, 1, 1, ...]}
How to apply it
from transformers import AutoTokenizer
from datasets import load_dataset
# 1. Get the tokenizer and the dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")
# 2. Write a function that passes the data through the tokenizer and returns the result
def tokenization(example):
return tokenizer(example["text"])
# 3. Call train_dataset.map and pass in your function
dataset = dataset.map(tokenization, batched=True)
# 4. Call train_dataset.set_format to set the output format (and select columns)
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset.format['type']
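A common alternative to padding everything to max_length inside the tokenize function is dynamic per-batch padding with DataCollatorWithPadding (a hedged sketch reusing the tokenizer and dataset from above):
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
dataloader = DataLoader(dataset, batch_size=32, collate_fn=data_collator)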
Building the Model
- Write your own sublayers (encoder / decoder, etc.); inheriting from nn.Module is enough.
The key point here is that every sublayer takes config as its __init__ argument.
class MyEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.l = nn.Linear(3, config.hidden_size)
def forward(self, x):
return self.l(x)
class MyDecoder(nn.Module):
...
- Use a Model class to chain the sublayers together; this Model inherits from PreTrainedModel.
The key points here: (a) set config_class properly; (b) note that the __init__ argument is an instance of the corresponding Config class.
class MyModel(PreTrainedModel):
config_class = MyConfig  # this MyConfig inherits from PretrainedConfig
def __init__(self, config: MyConfig):
# note: the argument of __init__ is an instance of the corresponding Config class
super().__init__(config)
self.encoder = MyEncoder(config)
self.decoder = MyDecoder(config)
def forward(self, x):
return self.decoder(self.encoder(x))
- Register your classes with the AutoModel factory (if you do not plan to obtain the model via AutoModel.from_pretrained() but will instead build it with MyModel(myConfig), this step can be skipped).
AutoConfig.register("my-model", MyConfig)
AutoModel.register(MyConfig, MyModel)
- Finally, write MyConfig, which inherits from PretrainedConfig; every hyperparameter related to the model architecture lives here. The most important thing is to set the class attribute model_type = 'my-model'.
class MyConfig(PretrainedConfig):
# set the class-level attribute shared by all instances
model_type = "my-model"
def __init__(self, hidden_size=768, **kwargs):
super().__init__(**kwargs)
self.hidden_size = hidden_size
- Get a model instance
# with the registration above, a saved checkpoint of MyModel can be loaded through AutoModel
model = AutoModel.from_pretrained('path/to/my-model-checkpoint')
# or
myConfig = MyConfig(hidden_size=256)
model = MyModel(myConfig)
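Since MyModel inherits from PreTrainedModel, it also gets the standard serialization methods. A hedged sketch of saving and reloading (the checkpoint path is made up for illustration):
myConfig = MyConfig(hidden_size=256)
model = MyModel(myConfig)
model.save_pretrained("./my-model-checkpoint")       # writes config.json + the weights
model = MyModel.from_pretrained("./my-model-checkpoint")
# with the AutoConfig/AutoModel registration above, this should work too:
# model = AutoModel.from_pretrained("./my-model-checkpoint")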
TrainingArguments
It has 97 constructor parameters, which is rather intimidating.
In practice, you usually pass only the first one:
output_dir (str)— The output directory where the model predictions and checkpoints will be written.
Passing everything to the constructor at once is hard to manage; instead, you can set groups of parameters with the following set_* methods, which is also clearer:
set_dataloader
All dataloader-related parameters are set through this method:
- train_batch_size: int = 8
- eval_batch_size: int = 8
- drop_last: bool = False
- num_workers: int = 0
- pin_memory: bool = True
- auto_find_batch_size: bool = False
- Whether to find a batch size that will fit into memory automatically through exponential decay, avoiding CUDA Out-of-Memory errors. Requires accelerate to be installed (pip install accelerate)
- ignore_data_skip: bool = False
- When resuming training, whether or not to skip the epochs and batches to get the data loading at the same stage as in the previous training.
- sampler_seed: typing.Optional[int] = None
- Random seed to be used with data samplers. If not set, random generators for data sampling will use the same seed as self.seed.
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_dataloader(train_batch_size=16, eval_batch_size=64)
>>> args.per_device_train_batch_size
16
set_evaluate
All parameters for evaluation during training are set through this method:
- strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'no'
- If this is set to anything other than “no”, the do_eval attribute is automatically set to True.
- “no”: No evaluation is done during training.
- “steps”: Evaluation is done (and logged) every steps.
- “epoch”: Evaluation is done at the end of each epoch.
- steps: int = 500
- Number of update steps between two evaluations if strategy=“steps”.
- batch_size: int = 8
- The batch size per device (GPU/TPU core/CPU…) used for evaluation.
- accumulation_steps: typing.Optional[int] = None
- delay: typing.Optional[float] = None
- loss_only: bool = False
- Ignores all outputs except the loss.
- jit_mode: bool = False
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_evaluate(strategy="steps", steps=100)
>>> args.eval_steps
100
set_logging
All logging-related parameters are set through this method:
- strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps'
- “no”: No logging is done during training.
- “epoch”: Logging is done at the end of each epoch.
- “steps”: Logging is done every logging_steps.
- steps: int = 500
- Number of update steps between two logs if strategy=“steps”.
- report_to: typing.Union[str, typing.List[str]] = 'none'
- The list of integrations to report the results and logs to. Supported platforms are “azure_ml”, “comet_ml”, “mlflow”, “neptune”, “tensorboard”, “clearml” and “wandb”. Use “all” to report to all integrations installed, “none” for no integrations.
- level: str = 'passive'
- Logger log level to use on the main process. Possible choices are the log levels as strings: “passive”, “debug”, “info”, “warning”, “error” and “critical”.
- first_step: bool = False
- nan_inf_filter: bool = False
- on_each_node: bool = False
- replica_level: str = 'passive'
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_logging(strategy="steps", steps=100)
>>> args.logging_steps
100
set_lr_scheduler
- name: typing.Union[str, transformers.trainer_utils.SchedulerType] = 'linear'
- “linear”
- “cosine”
- “cosine_with_restarts”
- “polynomial”
- “constant”
- “constant_with_warmup”
- “inverse_sqrt”
- num_epochs: float = 3.0
- max_steps: int = -1
- warmup_ratio: float = 0
- warmup_steps: int = 0
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05)
>>> args.warmup_ratio
0.05
set_optimizer
All hyperparameters of the optimizer used for training are set through this method:
- name: typing.Union[str, transformers.training_args.OptimizerNames] = 'adamw_hf'
- “adamw_hf”
- “adamw_torch”
- “adamw_torch_fused”
- “adamw_apex_fused”
- “adamw_anyprecision”
- “adafactor”
- learning_rate: float = 5e-05
- The initial learning rate.
- weight_decay: float = 0
- The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights.
- beta1: float = 0.9
- The beta1 hyperparameter for the adam optimizer or its variants.
- beta2: float = 0.999
- epsilon: float = 1e-08
- args: typing.Optional[str] = None
- Optional arguments that are supplied to AnyPrecisionAdamW (only useful when optim=“adamw_anyprecision”).
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_optimizer(name="adamw_torch", beta1=0.8)
>>> args.optim
'adamw_torch'
set_save
- strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps'
- “no”: No save is done during training.
- “epoch”: Save is done at the end of each epoch.
- “steps”: Save is done every save_steps.
- steps: int = 500
- total_limit: typing.Optional[int] = None
- on_each_node: bool = False
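set_save follows the same pattern as the other set_* methods; for example (a hedged sketch mirroring the snippets above):
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_save(strategy="steps", steps=100)
>>> args.save_steps
100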
set_training
Sets the basic training parameters; it is less fine-grained than the methods above and only covers the essentials.
- learning_rate: float = 5e-05
- batch_size: int = 8
- weight_decay: float = 0
- num_epochs: float = 3
- max_steps: int = -1
- If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. In case of using a finite iterable dataset the training may stop before reaching the set number of steps when all data is exhausted.
- gradient_accumulation_steps: int = 1
- Number of update steps to accumulate the gradients for, before performing a backward/update pass.
- seed: int = 42
- gradient_checkpointing: bool = False
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_training(learning_rate=1e-4, batch_size=32)
>>> args.learning_rate
1e-4
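Since each set_* method returns the modified TrainingArguments instance, the calls can also be chained (a hedged sketch, not from the original article):
from transformers import TrainingArguments

args = (
    TrainingArguments("working_dir")
    .set_training(learning_rate=1e-4, batch_size=32, num_epochs=5)
    .set_optimizer(name="adamw_torch", weight_decay=0.01)
    .set_lr_scheduler(name="cosine", warmup_ratio=0.05)
    .set_evaluate(strategy="steps", steps=100)
)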
🤗 Evaluate
Any metric, comparison, or measurement is loaded with the evaluate.load function
import evaluate
accuracy = evaluate.load("accuracy")
# explicitly pass the type
word_length = evaluate.load("word_length", module_type="measurement")
Listing the available modules
With list_evaluation_modules() you can check what modules are available on the hub.
>>> evaluate.list_evaluation_modules(
... module_type="comparison",
... include_community=False,
... with_details=True)
[{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
{'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]
How to compute with a metric
The simplest way is to call metric.compute() directly:
>>> accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
{'accuracy': 0.5}
Another scenario: accumulate one prediction at a time and compute the overall score at the end; call metric.add() for each example, then metric.compute():
>>> for ref, pred in zip([0,1,0,1], [1,0,0,1]):
>>> accuracy.add(references=ref, predictions=pred)
>>> accuracy.compute()
{'accuracy': 0.5}
Yet another scenario: accumulate a batch of results at a time; call metric.add_batch() first, then metric.compute():
>>> for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
>>> accuracy.add_batch(references=refs, predictions=preds)
>>> accuracy.compute()
{'accuracy': 0.5}
Applied to your own code:
for model_inputs, gold_standards in evaluation_dataset:
predictions = model(model_inputs)
metric.add_batch(references=gold_standards, predictions=predictions)
metric.compute()
How to compute several metrics at once
Use evaluate.combine(); the returned object can be used like an ordinary metric:
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
>>> clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
{
'accuracy': 0.667,
'f1': 0.667,
'precision': 1.0,
'recall': 0.5
}
How to save evaluation results
Call evaluate.save() (no more saving results by hand, and no more confusion about which result belongs to which run):
result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/"experiment="run 42", **result, **hyperparams)
The saved file looks like this:
{
"experiment": "run 42",
"accuracy": 0.5,
"model": "bert-base-uncased",
"_timestamp": "2022-05-30T22:09:11.959469",
"_git_commit_hash": "123456789abcdefghijkl",
"_evaluate_version": "0.1.0",
"_python_version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]",
"_interpreter_path": "/Users/leandro/git/evaluate/env/bin/python"
}
How to write your own metric
Inherit from evaluate.Metric or evaluate.EvaluationModule.
Create a separate .py file and implement:
- _compute(self, references, predictions)
- Must be overridden; this is the method that actually runs when metric.compute() is called.
- _download_and_prepare(self, dl_manager)
- Some metrics need extra resources that have to be downloaded; handle that here.
- _info(self)
- Must also be overridden. It returns an EvaluationModuleInfo object, on which you define the following attributes:
- EvaluationModuleInfo.description provides a brief description about your evaluation module.
- EvaluationModuleInfo.citation contains a BibTex citation for the evaluation module.
- EvaluationModuleInfo.inputs_description describes the expected inputs and outputs. It may also provide an example usage of the evaluation module.
- EvaluationModuleInfo.features defines the name and type of the predictions and references. This has to be either a single datasets.Features object or a list of datasets.Features objects if multiple input types are allowed. (See the Features discussion in the Dataset section above.)
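Below is a minimal sketch of such a module (an assumption-laden illustration: the class name MyExactMatch, the file name, and the exact-match logic are all made up, not an existing module):
# my_metric/my_metric.py -- a minimal custom metric sketch
import datasets
import evaluate

class MyExactMatch(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description="Fraction of predictions that exactly match the reference.",
            citation="",
            inputs_description="predictions: list of int, references: list of int",
            features=datasets.Features(
                {
                    "predictions": datasets.Value("int64"),
                    "references": datasets.Value("int64"),
                }
            ),
        )

    def _compute(self, predictions, references):
        # called by metric.compute() after all add()/add_batch() calls
        correct = sum(int(p == r) for p, r in zip(predictions, references))
        return {"exact_match": correct / len(references)}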
Usage:
Pass the path of your new .py file to evaluate.load():
# documentation for the first parameter of load()
"""
path (``str``):
path to the evaluation processing script with the evaluation builder. Can be either:
- a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. ``'./metrics/rouge'`` or ``'./metrics/rouge/rouge.py'``
- a evaluation module identifier on the HuggingFace evaluate repo e.g. ``'rouge'`` or ``'bleu'`` that are in either ``'metrics/'``, ``'comparisons/'``, or ``'measurements/'`` depending on the provided ``module_type``.
"""
🤗 Trainer
Instantiate a Trainer object, then call its train() method to start training.
Parameters you can pass when instantiating it:
- model (PreTrainedModel or torch.nn.Module, optional)
- The model to train, evaluate or use for predictions.
- args (TrainingArguments, optional)
- The arguments to tweak for training. Will default to a basic instance of TrainingArguments with the output_dir set to a directory named tmp_trainer in the current directory if not provided.
- data_collator (DataCollator, optional)
- The function to use to form a batch from a list of elements of train_dataset or eval_dataset.
- train_dataset (torch.utils.data.Dataset or torch.utils.data.IterableDataset, optional)
- The dataset to use for training. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.
- eval_dataset (Union[torch.utils.data.Dataset, Dict[str, torch.utils.data.Dataset])], optional)
- The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.
- tokenizer (PreTrainedTokenizerBase, optional)
- The tokenizer used to preprocess the data.
- model_init (Callable[[], PreTrainedModel], optional)
- A function that instantiates the model to be used.
- compute_metrics (Callable[[EvalPrediction], Dict], optional)
- The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping metric names to metric values.
- callbacks (List of TrainerCallback, optional)
- A list of callbacks to customize the training loop.
- optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional)
- A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup() controlled by args.
- preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], optional)
- A function that preprocesses the logits right before caching them at each evaluation step. Must take two tensors, the logits and the labels, and return the logits once processed as desired. The modifications made by this function will be reflected in the predictions received by compute_metrics.
To start training, call trainer.train().
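After training, the same Trainer instance can also run evaluation and save the fine-tuned model (a hedged sketch; trainer.evaluate() and trainer.save_model() are standard Trainer methods, the output path is made up):
metrics = trainer.evaluate()              # runs eval_dataset through the model and compute_metrics
print(metrics)
trainer.save_model("test_trainer/final")  # saves the model so it can be reloaded with from_pretrained()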