Datawhale AI Summer Camp



2401_88241058 · Published 2025-08-10 23:24:38

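Global AI Offense and Defense Challenge, Image Generation Track: Task 2 Notes

Current progress: as a beginner, I am still working through the code (adding comments as I go) and the terminology. I am trying to run t2i end to end and to get the tie and vittie tasks working. Along the way I ran into a failed download of FLUX.1-Kontext-dev and a data disk that ran out of space, both of which I am still trying to resolve. Next I plan to call the models through an API and to use a language model to optimize the prompts.

Competition tasks:

AIGC image generation (t2i): generate a realistic, visually appealing image from a given text prompt.
Natural scene image editing (tie): modify the specified region of the original image according to the provided image and editing instruction.
Visual text editing (vittie): edit or replace the text in the original image according to the provided image and editing instruction.
Deepfake: replace the face in the original image using a given face image (the face from the target image is swapped into the ori image, keeping the background elements of ori).

The baseline uses CogView and OpenCV for text-to-image generation, face swapping, and intelligent editing.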

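Definitions

Diffusion models

A diffusion model is a generative model inspired by the diffusion process in physics. It has two main stages:

Forward diffusion: a fixed process that gradually adds noise. Starting from a clean image, Gaussian noise is added step by step until the image becomes pure random noise. The process is controlled: the amount of noise at each step is set in advance.
Reverse diffusion: the stage the model has to learn. The model learns to denoise step by step, starting from pure noise, so that it reverses the forward process and ends with a clean, meaningful image. The denoising is typically done by a U-Net neural network, which predicts the noise present in the current noisy image so that it can be removed.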
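
To make the forward process concrete, here is a minimal NumPy sketch. It assumes a DDPM-style linear noise schedule, which the notes above do not specify; the function names `make_alpha_bar` and `forward_diffuse` and the schedule values are illustrative choices, not taken from CogView4 or FLUX.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot: sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a normalized image
x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = forward_diffuse(x0, 999, alpha_bar, rng)    # almost pure noise
```

At a small step index the sample stays close to the original image, while at the last step almost none of the original signal remains. A denoising network would be trained to predict the `noise` term returned here.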

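OCR

OCR (Optical Character Recognition) is a technology that automatically recognizes text (printed or handwritten) in images or scanned documents and converts it into editable, searchable text. The full pipeline is image preprocessing → text detection → text recognition → post-processing.

Core OCR pipeline

1. Image preprocessing

Goal: improve image quality and remove interference so that later stages receive a clean input.

Common techniques:
- Binarization: convert a color or grayscale image into a black-and-white image (black text, white background) to bring out the text outlines.
- Skew correction: detect the tilt angle of the text lines with the Hough transform and correct it (for example, a crooked scan).
- Denoising: remove specks and stray lines (such as photocopy stains or background texture), commonly with median or Gaussian filtering.
- Character segmentation: split touching characters (such as "cl" or "ee") or overlapping text so that each character stands alone.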
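
As an illustration of the binarization step, the sketch below implements Otsu's method in plain NumPy. Otsu thresholding is one common way to choose the cut-off automatically; the notes do not say which method the baseline uses, so treat it as an assumption. The helper names and the synthetic "page" are made up for the example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # probability of the "dark" class (<= t)
    mu = np.cumsum(prob * np.arange(256))     # cumulative mean intensity
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                         # skip thresholds that leave one class empty
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

def binarize(gray):
    """Pixels above the threshold become white (255); the rest become black (0)."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)

# Synthetic page: light background (200) with one dark stroke (30)
page = np.full((20, 20), 200, dtype=np.uint8)
page[8:12, 3:17] = 30
binary = binarize(page)
```

The input is expected to be an 8-bit grayscale array. On real scans, denoising before thresholding usually gives a cleaner result.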

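2. Text detection

Goal: locate the position and boundary of text regions in the image (answering "where is the text?").

Mainstream techniques:
- Deep learning models:
  - EAST (Efficient and Accurate Scene Text Detector): quickly predicts rotated rectangular boxes around text regions and supports text in arbitrary orientations (such as billboards and license plates).
  - CTPN (Connectionist Text Proposal Network): detects text lines, chiefly horizontal or near-horizontal ones, and outputs candidate text boxes.
  - Adapted YOLO/SSD: treats text as an "object" and locates it in real time with an object detection framework (such as live camera recognition).
  - Mask R-CNN: outputs a pixel-level mask of the text region, separating text from background precisely in complex scenes.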

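3. Text recognition

Goal: convert the detected text regions into character sequences (answering "what does the text say?").

Mainstream techniques:
- CRNN (Convolutional Recurrent Neural Network):
  - A CNN extracts image features → an RNN/LSTM learns the dependencies in the character sequence → CTC (Connectionist Temporal Classification) handles alignment for variable-length output, so text of any length (such as sentences and paragraphs) is supported.
- Attention-OCR: adds an attention mechanism that focuses on the region of the character currently being recognized, improving accuracy in difficult cases (curved text, low resolution).
- Transformer-OCR: uses a Vision Transformer (ViT) to turn patches of the text image into a character sequence, supporting multilingual and long-text recognition (e.g. Google's ViT-OCR).
- Handwriting recognition: dedicated models (such as RNN+LSTM or GNN) handle the irregular strokes and joined-up writing of handwritten text.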
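
The CTC step in CRNN can be illustrated with greedy (best-path) decoding: take the most likely label per frame, merge consecutive repeats, then drop the blank symbol. The sketch below is a simplified illustration in plain Python; the character set and the choice of index 0 as the blank are assumptions made for the example.

```python
def ctc_greedy_decode(frame_ids, charset, blank=0):
    """Collapse repeated labels, then drop blanks (best-path CTC decoding)."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank
        if idx != prev and idx != blank:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)

charset = ["-", "c", "o", "l"]            # index 0 is the assumed blank symbol
frames = [1, 1, 0, 2, 2, 0, 2, 3, 3]      # per-frame argmax: c c - o o - o l l
decoded = ctc_greedy_decode(frames, charset)
print(decoded)
```

The blank between the two runs of "o" is what allows a genuine double letter to survive the merge step.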

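4. Post-processing

Goal: refine the recognition result and correct errors.

Common techniques:
- Dictionary correction: fix recognition errors using a language model such as N-gram or BERT (for example, "0CR" → "OCR").
- Layout recovery: restore the document's structure (paragraphs, tables, font sizes), as in PDF-to-Word conversion.
- Multilingual adaptation: detect the language automatically and switch recognition models (for example, mixed Chinese and English text).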
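
A very small stand-in for dictionary correction is shown below. It uses string similarity from Python's standard `difflib` module rather than an N-gram or BERT language model, so it only illustrates the idea of snapping a misread token to the closest known word. The vocabulary and the `correct_token` helper are hypothetical.

```python
import difflib

def correct_token(token, vocabulary, cutoff=0.6):
    """Replace a token with its closest vocabulary entry, if one is similar enough."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

vocabulary = ["OCR", "text", "image", "model"]   # hypothetical word list
fixed = [correct_token(tok, vocabulary) for tok in ["0CR", "1mage", "zzz"]]
print(fixed)
```

Tokens with no sufficiently similar entry are left unchanged, which avoids forcing unknown words into the vocabulary.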

"./" (current directory):

Current directory: /home/user/project/
./model.dat → /home/user/project/model.dat

"../" (parent directory):

Current directory: /home/user/project/src/
../model.dat → /home/user/project/model.dat
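
The two resolutions above can be checked with Python's standard `posixpath` module. The `resolve` helper is a hypothetical name introduced for this sketch.

```python
import posixpath

def resolve(current_dir, relative_path):
    """Join a relative path onto a directory and normalize away "." and ".."."""
    return posixpath.normpath(posixpath.join(current_dir, relative_path))

print(resolve("/home/user/project/", "./model.dat"))
print(resolve("/home/user/project/src/", "../model.dat"))
```

Both calls resolve to the same file, /home/user/project/model.dat, which is why the same model file can be referenced as "./" from the project root or "../" from a subdirectory.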

Code

t2i task

from diffusers import CogView4Pipeline
import torch
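# Load the pretrained CogView4 model
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Enable the options below to reduce GPU memory usage
# Offload idle model components to CPU memory and move them to the GPU only when needed, lowering peak GPU memory use
pipe.enable_model_cpu_offload()
# Enable VAE slicing (decode a batch one image at a time)
pipe.vae.enable_slicing()
# Enable VAE tiling (decode each image tile by tile)
pipe.vae.enable_tiling()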

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]
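'''
guidance_scale=3.5: guidance scale, which controls how closely the generated image follows the text (higher values match the description more closely but may reduce diversity). Values of 3-7 are typically recommended for CogView4
num_images_per_prompt=1: generate one image per prompt
num_inference_steps=50: number of diffusion steps, that is, how many denoising steps are taken from pure noise to the final image (more steps give richer detail but slower generation)
width=1024, height=1024: resolution of the generated image (width x height, in pixels)
'''
image.save("cogview4.png")  # Save the generated PIL image to a local file; "cogview4.png" is the output path and file name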
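
The effect of guidance_scale can be sketched with the standard classifier-free guidance formula, which combines an unconditional and a text-conditioned noise prediction at each denoising step. Whether CogView4Pipeline applies exactly this formula internally is an assumption here; the function name and the toy vectors are illustrative only.

```python
import numpy as np

def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

noise_uncond = np.array([0.0, 1.0])   # toy prediction without the prompt
noise_cond = np.array([1.0, 1.0])     # toy prediction with the prompt
guided = apply_guidance(noise_uncond, noise_cond, 3.5)
print(guided)
```

With a scale of 1 the result equals the conditioned prediction. Larger scales amplify the difference the prompt makes, which is why higher values follow the text more closely at the cost of diversity.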

tie and vittie tasks

I ran into problems while downloading FLUX.1-Kontext-dev.

import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image
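# Note: "dev" names the open-weights variant of FLUX.1 Kontext rather than an unfinished build; the other official variants are served through an API
pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)  # model repository and weight dtype (bfloat16, a 16-bit format that saves GPU memory)
pipe.to("cuda")  # move the model to the GPU (CUDA) for faster computation

# Load the input image (the original image to be edited)
input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
'''
load_image: loads an image from the given URL (here, a picture of a cat) and returns a PIL.Image object that can be passed to the model
'''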
image = pipe(
  image=input_image,
  prompt="Add a hat to the cat",
  guidance_scale=2.5
).images[0]
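'''
image=input_image: the original image to edit (required)
prompt="Add a hat to the cat": the text prompt (required)
guidance_scale=2.5: guidance scale, which controls how closely the result follows the text
Higher values (such as 5-7): the edit follows the text prompt more closely
Lower values (such as 1-2): the edit stays closer to the original image
'''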
image.save("1.png")

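deepfake task

# Model initialization
import traceback

import numpy as np

try:
    import cv2
    import dlib
    detector = dlib.get_frontal_face_detector()  # face detector (frontal faces)
    predictor = dlib.shape_predictor("./shape_predictor_68_face_landmarks.dat")  # 68-point facial landmark predictor
except Exception:
    # Report the failure (missing library or model file) instead of silently ignoring it
    traceback.print_exc()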

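# Face swap
'''
Function parameters:
source_img_path: path to the source image (the face to paste in)
aim_img_path: path to the target image (the body whose face is replaced)
save_img_path: path where the result is saved
'''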
def face_swap_using_dlib(source_img_path: str, aim_img_path: str, save_img_path: str):
    # Read the images and convert them to grayscale (preprocessing), which reduces computation for face detection
    face = cv2.imread(source_img_path)  # source image
    body = cv2.imread(aim_img_path)  # target image

    face_gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    body_gray = cv2.cvtColor(body, cv2.COLOR_BGR2GRAY)

    # Create an empty mask with the same height and width as the source image
    height, width = face_gray.shape
    mask = np.zeros((height, width), np.uint8)

    height, width, channels = body.shape

    # --- Source image: facial landmarks and triangulation ---
    # Detect face bounding boxes in the source image
    rect = detector(face_gray)[0]  # use the first detected face (raises IndexError if none is found)

    # Extract the 68 facial landmarks of the source face: 68 pairs of integer (x, y) coordinates
    landmarks = predictor(face_gray, rect)
    landmarks_points = []  # list of landmark coordinates

    # Helper: append the landmark coordinates to a list
    def get_landmarks(landmarks, landmarks_points):
        for n in range(68):
            x = landmarks.part(n).x  # x coordinate of the n-th landmark
            y = landmarks.part(n).y  # y coordinate of the n-th landmark
            landmarks_points.append((x, y))

    get_landmarks(landmarks, landmarks_points)  # fill landmarks_points

    # Convex hull of the source face: the smallest convex polygon enclosing all landmarks (the face outline)
    points = np.array(landmarks_points, np.int32)
    convexhull = cv2.convexHull(points)

    face_cp = face.copy()
    face_image_1 = cv2.bitwise_and(face, face, mask=mask)

    # Delaunay triangulation of the source face (split the face into triangles so each can be warped separately)
    rect = cv2.boundingRect(convexhull)  # bounding box of the convex hull (x, y, w, h)
    subdiv = cv2.Subdiv2D(rect)  # Subdiv2D performs the Delaunay triangulation
    subdiv.insert(landmarks_points)  # insert the landmark points
    # Each triangle is six values: x1, y1, x2, y2, x3, y3
    triangles = subdiv.getTriangleList()
    triangles = np.array(triangles, dtype=np.int32)

    # Filter out invalid triangles (keep only those whose three vertices are landmarks)
    indexes_triangles = []  # landmark indices of each valid triangle
    face_cp = face.copy()  # copy of the source image, used to draw the triangles for debugging
    
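    # Helper: index of the first matching landmark, or None when no landmark matches
    def get_index(arr):
        # arr is the tuple returned by np.where; arr[0] holds the matching row indices
        if arr[0].size > 0:
            return int(arr[0][0])
        return None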
    
    for triangle in triangles:
        # The three vertices of the triangle (pt1, pt2, pt3), as plain Python ints
        pt1 = (int(triangle[0]), int(triangle[1]))
        pt2 = (int(triangle[2]), int(triangle[3]))
        pt3 = (int(triangle[4]), int(triangle[5]))

        # Draw the triangle on the copy of the source image
        cv2.line(face_cp, pt1, pt2, (255, 255, 255), 3, 0)
        cv2.line(face_cp, pt2, pt3, (255, 255, 255), 3, 0)
        cv2.line(face_cp, pt3, pt1, (255, 255, 255), 3, 0)

        # Find the landmark index that matches each vertex
        index_pt1 = np.where((points == pt1).all(axis=1))
        index_pt1 = get_index(index_pt1)
        index_pt2 = np.where((points == pt2).all(axis=1))
        index_pt2 = get_index(index_pt2)
        index_pt3 = np.where((points == pt3).all(axis=1))
        index_pt3 = get_index(index_pt3)

        # Keep valid triangles (all three vertices are landmarks)
        if index_pt1 is not None and index_pt2 is not None and index_pt3 is not None:
            vertices = [index_pt1, index_pt2, index_pt3]
            indexes_triangles.append(vertices)

    # --- Target image: facial landmarks and convex hull ---
    # Detect face bounding boxes in the target image
    rect2 = detector(body_gray)[0]
    # 68 pairs of integer (x, y) coordinates describing the facial structure
    landmarks_2 = predictor(body_gray, rect2)
    landmarks_points2 = []
    # Reuse the helper defined above to collect the landmark coordinates
    get_landmarks(landmarks_2, landmarks_points2)

    # Convex hull of the target face
    points2 = np.array(landmarks_points2, np.int32)
    convexhull2 = cv2.convexHull(points2)

    body_cp = body.copy()
    # --- Triangle-by-triangle warp (map the source face onto the target face) ---
    # Buffers sized like the target image
    lines_space_new_face = np.zeros((height, width, channels), np.uint8)  # space for the warped source-face triangles
    body_new_face = np.zeros((height, width, channels), np.uint8)  # new face region for the target image (holds the warped source face)

    height, width = face_gray.shape
    lines_space_mask = np.zeros((height, width), np.uint8)
    
    
    for triangle in indexes_triangles:

        # Vertices of the triangle in the source image
        pt1 = landmarks_points[triangle[0]]
        pt2 = landmarks_points[triangle[1]]
        pt3 = landmarks_points[triangle[2]]

        # Bounding box of the source triangle
        (x, y, width, height) = cv2.boundingRect(np.array([pt1, pt2, pt3], np.int32))
        cropped_triangle = face[y: y+height, x: x+width]  # crop the source triangle region
        cropped_mask = np.zeros((height, width), np.uint8)  # mask for the source triangle

        # Source triangle mask (filled white, used to isolate the triangle)
        points = np.array([[pt1[0]-x, pt1[1]-y], [pt2[0]-x, pt2[1]-y], [pt3[0]-x, pt3[1]-y]], np.int32)
        cv2.fillConvexPoly(cropped_mask, points, 255)

        # Draw the triangle edges (debugging aid)
        cv2.line(lines_space_mask, pt1, pt2, 255)
        cv2.line(lines_space_mask, pt2, pt3, 255)
        cv2.line(lines_space_mask, pt1, pt3, 255)

        lines_space = cv2.bitwise_and(face, face, mask=lines_space_mask)

        # Vertices of the corresponding triangle in the target image
        pt1 = landmarks_points2[triangle[0]]
        pt2 = landmarks_points2[triangle[1]]
        pt3 = landmarks_points2[triangle[2]]

        # Bounding box of the target triangle
        (x, y, width, height) = cv2.boundingRect(np.array([pt1, pt2, pt3], np.int32))
        cropped_mask2 = np.zeros((height, width), np.uint8)

        # Target triangle mask (filled white)
        points2 = np.array([[pt1[0]-x, pt1[1]-y], [pt2[0]-x, pt2[1]-y], [pt3[0]-x, pt3[1]-y]], np.int32)
        cv2.fillConvexPoly(cropped_mask2, points2, 255)

        # Affine matrix mapping the source triangle onto the target triangle https://docs.opencv.org/3.4/d4/d61/tutorial_warp_affine.html
        points = np.float32(points)
        points2 = np.float32(points2)
        M = cv2.getAffineTransform(points, points2)
        # Warp the source triangle's content to fit the target triangle
        dist_triangle = cv2.warpAffine(cropped_triangle, M, (width, height))
        dist_triangle = cv2.bitwise_and(dist_triangle, dist_triangle, mask=cropped_mask2)

        # Region of the new face that this triangle covers
        body_new_face_rect_area = body_new_face[y: y+height, x: x+width]
        body_new_face_rect_area_gray = cv2.cvtColor(body_new_face_rect_area, cv2.COLOR_BGR2GRAY)

        # Mask out pixels already filled by earlier triangles so that shared edges are not added twice
        masked_triangle = cv2.threshold(body_new_face_rect_area_gray, 1, 255, cv2.THRESH_BINARY_INV)
        dist_triangle = cv2.bitwise_and(dist_triangle, dist_triangle, mask=masked_triangle[1])

        # Add the warped triangle to the new face
        body_new_face_rect_area = cv2.add(body_new_face_rect_area, dist_triangle)
        body_new_face[y: y+height, x: x+width] = body_new_face_rect_area
    
    # --- Blend the new face into the target image ---
    body_face_mask = np.zeros_like(body_gray)  # mask with the same size as the target grayscale image
    body_head_mask = cv2.fillConvexPoly(body_face_mask, convexhull2, 255)  # fill the convex hull with white (face region)
    body_face_mask = cv2.bitwise_not(body_head_mask)  # invert the mask (face black, everything else white)

    # Remove the face region from the target image (keep the background)
    body_maskless = cv2.bitwise_and(body, body, mask=body_face_mask)
    # Combine the background with the new face
    result = cv2.add(body_maskless, body_new_face)

    # Center of the target face (used for seamless cloning)
    (x, y, width, height) = cv2.boundingRect(convexhull2)  # bounding box of the target hull
    center_face2 = (int((x+x+width)/2), int((y+y+height)/2))

    # Seamless cloning: blend the new face into the target so the edges do not look harsh
    seamlessclone = cv2.seamlessClone(result, body, body_head_mask, center_face2, cv2.NORMAL_CLONE)  # normal clone mode
    cv2.imwrite(save_img_path, seamlessclone)
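
The per-triangle warp relies on the fact that three point pairs determine an affine transform. As a rough illustration of what cv2.getAffineTransform computes, the NumPy sketch below solves for the same kind of 2x3 matrix. The function name and the sample triangles are made up for the example, and this is not OpenCV's implementation.

```python
import numpy as np

def affine_from_triangles(src, dst):
    """Solve for the 2x3 matrix M such that M @ [x, y, 1] = [x', y'] for three point pairs."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    A = np.hstack([src, np.ones((3, 1))])   # rows are [x, y, 1]
    # A @ M.T = dst, so M.T is the solution of a 3x3 linear system
    return np.linalg.solve(A, dst).T

src_tri = [(0, 0), (10, 0), (0, 10)]
dst_tri = [(5, 5), (25, 5), (5, 15)]
M = affine_from_triangles(src_tri, dst_tri)
mapped = M @ np.array([5.0, 5.0, 1.0])   # where the source point (5, 5) lands
print(M)
print(mapped)
```

The system has a unique solution only when the three source points are not collinear, which is one reason degenerate triangles need to be filtered out before warping.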

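Weakness: the traditional Dlib + OpenCV approach tends to produce stiff blends when lighting, expression, pose, or viewing angle is complex. Edge seams are visible and the facial contour is easily distorted, which leads to a low PQ Score (something to try improving later).

Directions for improving the score

Next I plan to call models through an API to improve the t2i, tie, and vittie tasks, and to use a language model to optimize the prompts (translating Chinese into English and handling truncated tokens). For deepfake, I will look for a better model. During API calls I also plan to log progress and work around rate limits.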
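
One possible shape for the progress logging and rate-limit handling is a retry wrapper with exponential backoff. The sketch below is only an outline under stated assumptions: `call_with_backoff` and `flaky_request` are hypothetical names, and `RuntimeError` stands in for whatever rate-limit error the actual API client raises.

```python
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying with exponentially growing delays on failure."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RuntimeError as err:  # stand-in for a rate-limit error
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({err}); retrying in {delay:.1f}s")
            sleep(delay)

# Simulated API call that is rate limited twice before succeeding
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

delays = []
result = call_with_backoff(flaky_request, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing the sleep function as a parameter keeps the example fast to run; in real use the default `time.sleep` would apply the delays.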
