Hyperparameter Tuning

天涯赤子心71 · Published 2024-10-30 14:48:00

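Hyperparameter tuning is the process of adjusting a model's hyperparameters during machine learning model training to get the best performance.
Unlike model parameters, which are learned from the data, hyperparameters are set by the user before training. The goal of hyperparameter tuning is to find the combination of hyperparameters that makes the model perform best on a specific task.
"Best" here is about generalization: how well the model performs on data it did not see during training. In practice you run the model, adjust the hyperparameters, and repeat until that performance improves.
You can think of a hyperparameter in your own terms: it is a setting that can decide whether the model turns out well or badly, a "super" parameter, a bit like the saying about one man holding the pass against ten thousand.
*: Model performance depends heavily on the hyperparameters.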
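Common hyperparameters
- Learning rate: controls the step size of each model parameter update.
- Batch size: the number of training samples used in each iteration.
- Epochs: the number of times the entire training dataset is passed through.
- Regularization parameters: control model complexity to prevent overfitting.
- Number of neural network layers and neurons per layer: determine the model's capacity and complexity.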
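To make the list above concrete, here is a minimal sketch in plain Python, not tied to any library. The `train` function and the tiny dataset are made up for illustration. The weight `w` is a model parameter learned from the data, while `learning_rate`, `batch_size`, `epochs`, and `l2` are hyperparameters chosen before training.

```python
def train(xs, ys, learning_rate, batch_size, epochs, l2):
    """Fit y ~ w * x by mini-batch gradient descent and return the learned w."""
    w = 0.0  # model parameter: learned from the data
    for _ in range(epochs):  # epochs: full passes over the dataset
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the mean squared error on this batch
            grad = sum(2 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
            grad += 2 * l2 * w          # L2 regularization term
            w -= learning_rate * grad   # learning rate: size of the update step
    return w

xs = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
ys = [3 * x for x in xs]  # the true relationship is y = 3x

w_good = train(xs, ys, learning_rate=0.1, batch_size=4, epochs=50, l2=0.0)
w_bad = train(xs, ys, learning_rate=1.5, batch_size=4, epochs=50, l2=0.0)
print(w_good, w_bad)
```

With the smaller learning rate the weight settles close to 3; with the larger one the updates overshoot and the weight diverges, which is the sense in which performance depends on the hyperparameters.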
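Common hyperparameter tuning methods:

1. Grid Search
- Performs an exhaustive search over a user-specified set of hyperparameter values, trying every possible combination.
- Pros: simple and direct; suitable for small hyperparameter spaces.
- Cons: high computational cost, especially when the hyperparameter space is large.
Because grid search enumerates every combination, the cost grows multiplicatively with each added hyperparameter. The grid below has 3 × 3 × 3 = 27 combinations, and with 5-fold cross-validation that is 135 model fits, plus one final refit of the best combination.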
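from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# X_train and y_train are assumed to be an already prepared training set.
# Candidate values for each hyperparameter; every combination is tried.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# cv=5: each combination is scored with 5-fold cross-validation.
grid_search = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)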
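The idea of "try every combination" can be sketched without any library. This is only an illustration of the enumeration, not how scikit-learn implements it; `validation_score` is a hypothetical stand-in for training a model and scoring it on validation data.

```python
from itertools import product

# Hypothetical stand-in for "train the model and return a validation score".
def validation_score(learning_rate, max_depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 3) ** 2

param_grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [1, 3, 5],
}

names = list(param_grid)
best_score, best_params, n_trials = float("-inf"), None, 0

# product(...) yields every combination of the listed values.
for values in product(*(param_grid[name] for name in names)):
    params = dict(zip(names, values))
    score = validation_score(**params)
    n_trials += 1
    if score > best_score:
        best_score, best_params = score, params

print(n_trials, best_params)
```

Two hyperparameters with three values each already give 9 trials; adding a third hyperparameter with three values would give 27.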

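2. Random Search
- Samples hyperparameter combinations at random from predefined ranges, one per iteration.
- Pros: for the same number of iterations, it can explore a larger hyperparameter space.
- Cons: it may miss the best hyperparameter combination.
3. Bayesian Optimization
- Uses Bayesian inference and a Gaussian process to predict model performance for different hyperparameter combinations, and picks the combination expected to perform best.
- Pros: often needs fewer trials than grid search or random search, because each new trial is chosen using the results of the earlier ones.
- Cons: more complex to implement.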
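Random Search in detail
Random search evaluates hyperparameter combinations chosen at random from predefined ranges. Compared with grid search, it can explore a larger hyperparameter space in the same number of iterations.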
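import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Suppose we have some data:
# X is the feature matrix, y is the target variable.
# Random data is used here as an example, so expect accuracy near chance level.
np.random.seed(42)
X = np.random.rand(100, 10)  # 100 samples, 10 features
y = np.random.randint(0, 2, 100)  # binary classification

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the candidate values for each hyperparameter
C_values = np.logspace(-4, 4, 20)  # 20 values of C, evenly spaced on a log scale
solver_values = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']  # available solvers

# Random search
best_score = 0
best_params = {'C': None, 'solver': None}
num_iterations = 100  # number of random search iterations

for i in range(num_iterations):
    # Randomly pick hyperparameter values (converted to plain Python types)
    C = float(np.random.choice(C_values))
    solver = str(np.random.choice(solver_values))

    # Score the candidate with 5-fold cross-validation on the training set only,
    # so the test set is never used to choose hyperparameters
    model = LogisticRegression(C=C, solver=solver, random_state=42, max_iter=10000)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()

    # Update the best score and best parameters
    if score > best_score:
        best_score = score
        best_params['C'] = C
        best_params['solver'] = solver
        print(f"Iteration {i+1}: Best score updated to {best_score:.4f} with params {best_params}")

# Refit with the best parameters and evaluate once on the held-out test set
final_model = LogisticRegression(**best_params, random_state=42, max_iter=10000)
final_model.fit(X_train, y_train)
test_score = accuracy_score(y_test, final_model.predict(X_test))
print(f"Final best CV score: {best_score:.4f} with params {best_params}")
print(f"Test accuracy: {test_score:.4f}")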
# The same idea with scikit-learn's built-in RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
# n_iter=10: only 10 of the 27 possible combinations are sampled and evaluated
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
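One way to see why random search covers more of the space for the same budget is to count how many distinct values of a single hyperparameter each method tries. The sketch below uses only the standard library; the hyperparameter names and ranges are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 9

# Grid search: a 3 x 3 grid is 9 trials, but only 3 distinct values per hyperparameter.
grid_lr = [0.001, 0.01, 0.1]
grid_dropout = [0.1, 0.3, 0.5]
grid_trials = [(lr, d) for lr in grid_lr for d in grid_dropout]

# Random search: 9 trials, each drawing fresh values.
# The learning rate is sampled log-uniformly between 1e-4 and 1.
random_trials = [
    (10 ** rng.uniform(-4, 0), rng.uniform(0.0, 0.5))
    for _ in range(n_trials)
]

distinct_grid_lr = len({lr for lr, _ in grid_trials})
distinct_random_lr = len({lr for lr, _ in random_trials})
print(distinct_grid_lr, distinct_random_lr)
```

If only one of the two hyperparameters really matters, the grid has effectively tested it at 3 settings, while the random draws have tested it at 9.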
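Bayesian Optimization in detail
Bayesian optimization uses Bayesian inference and a Gaussian process to predict model performance for different hyperparameter combinations, and picks the combination expected to perform best. Because each trial is informed by the earlier results, it often reaches good settings in fewer trials than grid search or random search.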
from skopt import BayesSearchCV  # from the scikit-optimize package
from sklearn.ensemble import RandomForestClassifier

# Each (low, high) tuple is an integer range to search, not a list of fixed values
param_space = {
    'n_estimators': (100, 300),
    'max_depth': (10, 30),
    'min_samples_split': (2, 10)
}
bayes_search = BayesSearchCV(estimator=RandomForestClassifier(), search_spaces=param_space, n_iter=10, cv=5)
bayes_search.fit(X_train, y_train)
print(bayes_search.best_params_)
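The loop behind Bayesian optimization can be sketched in plain Python for a single hyperparameter. This is a simplified illustration, not how scikit-optimize works internally: `score` is a hypothetical validation score, the surrogate is a zero-mean Gaussian process with an RBF kernel, and the next trial is chosen by an upper confidence bound (mean plus two standard deviations), which is one of several possible acquisition rules.

```python
import math

def score(x):
    # Hypothetical validation score as a function of one hyperparameter in [0, 1].
    return -((x - 0.3) ** 2)

def rbf(a, b, length_scale=0.2):
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    # Posterior mean and standard deviation of a zero-mean GP at x_new.
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(0.0, 1.0 - sum(k * w for k, w in zip(k_star, v)))
    return mean, math.sqrt(var)

grid = [i / 20 for i in range(21)]  # candidate values 0.0, 0.05, ..., 1.0
xs = [0.0, 0.5, 1.0]                # initial trials
ys = [score(x) for x in xs]

for _ in range(6):
    candidates = [c for c in grid if c not in xs]

    def ucb(c):
        mean, std = gp_posterior(xs, ys, c)
        return mean + 2.0 * std     # upper confidence bound

    x_next = max(candidates, key=ucb)
    xs.append(x_next)
    ys.append(score(x_next))

best_x = xs[ys.index(max(ys))]
print(best_x, max(ys))
```

Each new trial is placed where the surrogate predicts either a high score or high uncertainty, so the search uses what earlier trials revealed instead of sampling blindly.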
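4. Genetic Algorithm
- Optimizes hyperparameters by simulating the process of natural selection.
- Pros: suits complex optimization problems and can escape local optima.
- Cons: high computational cost, and the algorithm's own parameters are complicated to set.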
import numpy as np

# Toy problem showing the mechanics of a genetic algorithm:
# find GENES_COUNT digits whose sum equals TARGET.
# For hyperparameter tuning, each individual would instead encode a hyperparameter
# combination, and the fitness would be a validation score.

# Settings
POP_SIZE = 100       # population size
GENES_COUNT = 10     # number of genes (how many numbers we optimize)
TARGET = 30          # target sum
GENERATIONS = 100    # number of generations
MUTATION_RATE = 0.01 # mutation rate

# Initialize the population
def initialize_population(pop_size, genes_count):
    return np.random.randint(0, 10, (pop_size, genes_count))

# Fitness: distance from the target sum (lower is better, 0 is a perfect solution)
def fitness(individual, target):
    return abs(target - sum(individual))

# Selection: sample parents, favoring individuals with lower fitness values
def selection(population, fitness_scores):
    # np.random.choice needs non-negative probabilities that sum to 1
    weights = 1.0 / (1.0 + fitness_scores)
    probabilities = weights / weights.sum()
    selected = np.random.choice(len(population), size=len(population), replace=True, p=probabilities)
    return population[selected]

# Crossover: swap the tails of two parents at a random cut point
def crossover(parent1, parent2):
    point = np.random.randint(1, len(parent1)-1)
    child1 = np.concatenate((parent1[:point], parent2[point:]))
    child2 = np.concatenate((parent2[:point], parent1[point:]))
    return child1, child2

# Mutation: replace each gene with a random digit with probability mutation_rate
def mutate(individual, mutation_rate):
    for i in range(len(individual)):
        if np.random.rand() < mutation_rate:
            individual[i] = np.random.randint(0, 10)
    return individual

# Genetic algorithm
def genetic_algorithm():
    population = initialize_population(POP_SIZE, GENES_COUNT)
    for generation in range(GENERATIONS):
        fitness_scores = np.array([fitness(individual, TARGET) for individual in population])

        # Report the best individual of the current population before it is replaced
        best_index = np.argmin(fitness_scores)
        print(f"Generation {generation+1}: Best Fitness = {fitness_scores[best_index]}, Best Individual = {population[best_index]}")

        # Selection
        selected_population = selection(population, fitness_scores)

        # Build the new population
        new_population = []
        for i in range(0, POP_SIZE, 2):
            parent1, parent2 = selected_population[i], selected_population[i+1]
            child1, child2 = crossover(parent1, parent2)
            new_population.append(mutate(child1, MUTATION_RATE))
            new_population.append(mutate(child2, MUTATION_RATE))

        population = np.array(new_population)

    # Score the final population and return its best individual
    fitness_scores = np.array([fitness(individual, TARGET) for individual in population])
    best_index = np.argmin(fitness_scores)
    return fitness_scores[best_index], population[best_index]

# Run the genetic algorithm
best_fitness, best_individual = genetic_algorithm()
print(f"Final Best Fitness = {best_fitness}, Best Individual = {best_individual}")
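The listing above optimizes a toy sum. Applied to hyperparameters, the same steps could look like the following sketch, where each individual is a dictionary of hyperparameter values. Everything here is an assumption for illustration: `validation_score` is a made-up function standing in for "train and evaluate", the two hyperparameters and their ranges are hypothetical, and higher fitness is better (the opposite of the toy example).

```python
import random

rng = random.Random(0)
UNIT_CHOICES = [16, 32, 64, 128, 256]

def validation_score(ind):
    # Hypothetical score; best at learning-rate exponent -2 and 64 hidden units.
    return -((ind["lr_exp"] + 2.0) ** 2) - ((ind["hidden_units"] - 64) / 64) ** 2

def random_individual():
    return {"lr_exp": rng.uniform(-5, 0), "hidden_units": rng.choice(UNIT_CHOICES)}

def crossover(a, b):
    # Each gene (hyperparameter) is copied from one of the two parents.
    return {key: rng.choice([a[key], b[key]]) for key in a}

def mutate(ind, rate=0.2):
    child = dict(ind)
    if rng.random() < rate:
        child["lr_exp"] = min(0.0, max(-5.0, child["lr_exp"] + rng.gauss(0, 0.5)))
    if rng.random() < rate:
        child["hidden_units"] = rng.choice(UNIT_CHOICES)
    return child

population = [random_individual() for _ in range(20)]
initial_best = max(validation_score(ind) for ind in population)

for generation in range(30):
    population.sort(key=validation_score, reverse=True)
    parents = population[:10]  # keep the fitter half unchanged (elitism)
    children = [mutate(crossover(*rng.sample(parents, 2))) for _ in range(10)]
    population = parents + children

best = max(population, key=validation_score)
print(best, validation_score(best))
```

Keeping the fitter half unchanged means the best score can never get worse from one generation to the next; in a real run each call to the score function would be a full training job, which is where the high computational cost comes from.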
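5. Gradient-based Methods
- Use gradient information to adjust the hyperparameters.
- Pros: work well in continuous hyperparameter spaces.
- Cons: require the gradient of the objective function with respect to the hyperparameters, so they do not apply directly to discrete choices such as the number of layers.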
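As a small worked case of the gradient-based idea, the sketch below tunes one continuous hyperparameter, an L2 regularization strength `lam`, by gradient descent on the validation loss. It assumes a 1-D ridge regression without intercept, where the trained weight has the closed form w = Σxy / (Σx² + lam), so the gradient with respect to `lam` can be written out by the chain rule. The data values are made up.

```python
# Made-up 1-D data: a training split and a validation split.
train_x, train_y = [1.0, 2.0, 3.0], [2.2, 4.1, 6.3]
val_x, val_y = [1.5, 2.5], [2.7, 4.6]

s_xx = sum(x * x for x in train_x)
s_xy = sum(x * y for x, y in zip(train_x, train_y))

def fit_weight(lam):
    # Closed-form ridge solution for y ~ w * x (no intercept).
    return s_xy / (s_xx + lam)

def val_loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(val_x, val_y))

def val_loss_grad(lam):
    # Chain rule: dL/dlam = dL/dw * dw/dlam
    w = fit_weight(lam)
    dloss_dw = sum(2 * (w * x - y) * x for x, y in zip(val_x, val_y))
    dw_dlam = -s_xy / (s_xx + lam) ** 2
    return dloss_dw * dw_dlam

lam = 0.0        # initial regularization strength
step_size = 1.0
initial_loss = val_loss(fit_weight(lam))

for _ in range(50):
    lam = max(0.0, lam - step_size * val_loss_grad(lam))  # keep lam non-negative

final_loss = val_loss(fit_weight(lam))
print(f"lambda = {lam:.3f}, validation loss {initial_loss:.4f} -> {final_loss:.4f}")
```

This works only because the trained weight is a differentiable function of `lam`; for most models that relationship has no closed form, which is the practical obstacle the "Cons" bullet refers to.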
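Defining the search space: in the Azure Machine Learning Python SDK, discrete hyperparameters are declared with Choice.
from azure.ai.ml.sweep import Choice

# command_job is assumed to be an existing command job whose inputs include these names
command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32, 64, 128]),
    number_of_hidden_layers=Choice(values=range(1,5)),
)
batch_size takes one of the values [16, 32, 64, 128], and number_of_hidden_layers takes one of the values [1, 2, 3, 4].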
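# Continuous hyperparameters are declared with a distribution instead
from azure.ai.ml.sweep import Normal, Uniform

command_job_for_sweep = command_job(
    learning_rate=Normal(mu=10, sigma=3),
    keep_probability=Uniform(min_value=0.05, max_value=0.1),
)
The code defines a search space with two parameters, learning_rate and keep_probability. learning_rate follows a normal distribution with mean 10 and standard deviation 3. keep_probability follows a uniform distribution with a minimum of 0.05 and a maximum of 0.1.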
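To show what drawing one configuration from such a search space means, here is a plain-Python sketch that mirrors the four definitions above with the standard `random` module. It does not use or imitate the Azure SDK; `sample_configuration` is a hypothetical helper.

```python
import random

rng = random.Random(0)  # fixed seed for repeatability

def sample_configuration():
    return {
        "batch_size": rng.choice([16, 32, 64, 128]),               # discrete choice
        "number_of_hidden_layers": rng.choice(list(range(1, 5))),  # discrete choice
        "learning_rate": rng.gauss(10, 3),                         # normal(mean 10, std 3)
        "keep_probability": rng.uniform(0.05, 0.1),                # uniform on [0.05, 0.1]
    }

for trial in range(3):
    print(trial, sample_configuration())
```

Each call produces one candidate configuration; a sweep then trains one model per configuration and compares the results.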

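Data annotation can be manual or automatic.

1. Manual annotation means that people, either in-house staff or crowdsourced workers, annotate or label the data by hand. It is a relatively accurate method because humans can understand what the data means, but it takes a lot of time and labor.

Manual annotation: high accuracy, but time-consuming and labor-intensive.

2. Automatic annotation means using automated algorithms to label the data. It is fast and economical, but it can introduce errors because the algorithm may not fully understand what the data means.

Automatic annotation: fast and low-cost, but less accurate.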

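1. Define the annotation task and rules: determine the goal of the annotation task, the objects to annotate, and the annotation standard.

2. Choose the annotators: select suitable staff or crowdsourced workers and train them.

3. Assign the tasks: distribute the data to the annotators for labeling.

4. Review the annotation results: apply quality control and quality assurance to the results, and find and correct errors.

5. Collect and organize the data: clean up and format the annotated data so that machine learning algorithms can use it.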
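For the review step, one common check (an illustration, not something the workflow above prescribes) is to have two annotators label the same items and measure how much they agree beyond chance with Cohen's kappa. The labels below are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random with their own label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_a = ["PER", "ORG", "PER", "LOC", "ORG", "PER"]
annotator_b = ["PER", "ORG", "LOC", "LOC", "ORG", "ORG"]

kappa = cohen_kappa(annotator_a, annotator_b)
disagreements = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
print(f"kappa = {kappa:.2f}, items to re-check: {disagreements}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; the listed item indices are the ones a reviewer would look at first.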

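# Automatic annotation example with spaCy
import spacy

# Load the Chinese model (install it first: python -m spacy download zh_core_web_sm)
nlp = spacy.load("zh_core_web_sm")

# Text to annotate (in English: "Lei Jun is a well-known Chinese entrepreneur,
# famous as the founder of Xiaomi.")
text = "雷军是中国的知名企业家，他以小米公司创始人身份闻名。"

# Process the text with the spaCy model
doc = nlp(text)

# Insert a label in front of each entity. Going from the last entity to the first
# keeps the character offsets of the remaining entities valid.
annotated_text = text
for ent in reversed(doc.ents):
    start, end = ent.start_char, ent.end_char
    annotated_text = annotated_text[:start] + f"[[{ent.label_}]]: " + annotated_text[start:end] + annotated_text[end:]

# Print the annotated text
print(annotated_text)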

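# Automatic annotation example with a Hugging Face transformers NER pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load a pretrained NER model
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Define a named entity recognition pipeline;
# aggregation_strategy="simple" merges word pieces into whole entities
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Text to annotate. This model was fine-tuned on English data, so the text is in English.
text = "Lei Jun is a well-known Chinese entrepreneur. He founded Xiaomi and serves as its CEO."

# Run the pipeline on the text and get the annotation results
ner_results = nlp(text)

# Insert a label in front of each entity, from the last entity to the first,
# so earlier character offsets stay valid
annotated_text = text
for entity in sorted(ner_results, key=lambda e: e['start'], reverse=True):
    start, end = entity['start'], entity['end']
    label = entity['entity_group']
    annotated_text = annotated_text[:start] + f"[[{label}]]: " + annotated_text[start:end] + annotated_text[end:]

# Print the annotated text
print(annotated_text)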