CLIP Hands-On Guide: Installation, Faster Downloads, Text/Image Encoding, and Image-Text Matching in One Post

大写-凌祁 · Published 2025-09-15 17:11:15

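Author: 大写-凌祁
Motto: Explain the most complex ideas in the simplest words
Platform: CSDN
Original content; please credit the source when reposting
Last updated: April 2025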

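🎯 What you will learn

✅ What CLIP is and what it can do
✅ How to install it and load a model (including mirror options for faster downloads in mainland China)
✅ Two ways to encode text: calling encode_text directly vs. a small get_text_features helper
✅ Image encoding and a hands-on image-text similarity example
✅ Chinese-language options and common pitfalls
✅ A complete script at the end that you can copy and run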

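🌟 What is CLIP?

CLIP (Contrastive Language–Image Pretraining) is an image-text model released by OpenAI in 2021.

It does not generate images. Instead, it maps images and text into the same embedding space, which lets you:

Enter text → find the best-matching image
Enter an image → find the best-matching text
Compute image-text similarity
Do zero-shot image classification (no task-specific training needed)

In plain terms: it lets a model both "see" an image and "read" text, and then compare the two directly.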
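To make the "shared embedding space" idea concrete, here is a minimal sketch that needs no model at all. The 3-dimensional vectors and file names are made up for illustration (real CLIP embeddings have hundreds of dimensions); the point is only that one similarity function serves both directions, text → image and image → text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: images and texts live in the SAME 3-d space.
image_vecs = {
    "cats.jpg": [0.9, 0.1, 0.0],
    "astronaut.jpg": [0.0, 0.2, 0.9],
}
text_vecs = {
    "two cats on a sofa": [0.8, 0.2, 0.1],
    "an astronaut in space": [0.1, 0.1, 0.95],
}

def best_match(query_vec, candidates):
    """Return the candidate name whose vector is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))

# Image → text and text → image use exactly the same comparison.
best_text_for_image = {img: best_match(vec, text_vecs) for img, vec in image_vecs.items()}
best_image_for_text = {txt: best_match(vec, image_vecs) for txt, vec in text_vecs.items()}

print(best_text_for_image)
print(best_image_for_text)
```

Zero-shot classification is the image → text direction with class names as the candidate texts.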

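🚀 Step 1: Install CLIP and speed up model downloads

✅ 1.1 Install the official CLIP library

The official CLIP library is not published on PyPI, so install it from GitHub:

pip install torch torchvision
pip install git+https://github.com/openai/CLIP.git

💡 Python 3.8+ and PyTorch 1.8+ are recommended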
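Before loading a model it can help to confirm the environment is ready. The sketch below is one way to do that with the standard library only; check_environment is a hypothetical helper name, and the minimum Python version is the recommendation from this post, not a hard requirement of the library.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # recommendation from this post
REQUIRED_PACKAGES = ("torch", "torchvision", "clip")

def check_environment(min_python=MIN_PYTHON, packages=REQUIRED_PACKAGES):
    """Report whether Python is new enough and each package can be found.

    find_spec only locates a top-level package; it does not import it,
    so this stays fast even when torch is installed.
    """
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        report[name] = importlib.util.find_spec(name) is not None
    return report

for item, ok in check_environment().items():
    print(f"{item}: {'OK' if ok else 'MISSING'}")
```

A MISSING entry for clip means the pip install from GitHub above has not been run in the current environment.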

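✅ 1.2 Model downloads too slow? Use a mirror

The official model weights are hosted outside mainland China, and downloads often stall. Here are three ways to speed things up:

🧩 Option 1: Patch the source to use a mirror URL (for developers)

Find the clip/clip.py file in your environment (example path: ~/miniconda3/lib/python3.9/site-packages/clip/clip.py)

In the _download function, add the mirror substitution logic:

# Insert before `with urllib.request.urlopen(url) as source`:

# 👇 Mirror substitution 👇
# Keyed by file name, so the map does not depend on the hash segment of the official URL
mirror_map = {
    "ViT-B-32.pt": "https://clip-as-service-hub.s3.timeweb.com/ViT-B-32.pt",
    "RN50.pt": "https://clip-as-service-hub.s3.timeweb.com/RN50.pt",
}

# Fall back to the official URL for any model that is not mirrored
url = mirror_map.get(os.path.basename(url), url)

The author describes this mirror as maintained by Jina AI and fast; treat it like any third-party host. Because the patch sits after _download has already read the expected SHA256 from the official URL, the downloaded file is still checked against the official hash. Note that edits inside site-packages are lost whenever the package is reinstalled.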
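If you would rather not edit site-packages, the same substitution can be tried out as a standalone function. This is a sketch under two assumptions: official URLs carry the SHA256 as their second-to-last path segment (as the official ViT-B/32 URL does), and the mirror serves files under the same file names. plan_download, MIRROR_BASE, and MIRRORED_FILES are illustrative names, not part of the clip package.

```python
import os

MIRROR_BASE = "https://clip-as-service-hub.s3.timeweb.com"  # mirror host named in this post
MIRRORED_FILES = {"ViT-B-32.pt", "RN50.pt"}                 # files this post lists on the mirror

def plan_download(official_url):
    """Decide where to fetch from while keeping the official checksum.

    Returns (fetch_url, expected_sha256). The checksum always comes from
    the official URL, so a mirrored file is verified against it.
    """
    filename = os.path.basename(official_url)
    expected_sha256 = official_url.split("/")[-2]
    if filename in MIRRORED_FILES:
        fetch_url = f"{MIRROR_BASE}/{filename}"
    else:
        fetch_url = official_url  # not mirrored: fall back to the official host
    return fetch_url, expected_sha256

official = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)
fetch_url, expected = plan_download(official)
print(fetch_url)
print(expected)
```

Keeping the checksum tied to the official URL is the important design choice: the mirror only changes where the bytes come from, not which bytes are accepted.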

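🧩 Option 2: Download manually and place the file in the cache directory (for beginners)

Download the model from the mirror in your browser:

ViT-B/32 → https://clip-as-service-hub.s3.timeweb.com/ViT-B-32.pt
RN50     → https://clip-as-service-hub.s3.timeweb.com/RN50.pt

Then copy the file into CLIP's cache directory:

# Linux / Mac
mkdir -p ~/.cache/clip
cp ViT-B-32.pt ~/.cache/clip/

# Windows (CMD)
mkdir %USERPROFILE%\.cache\clip
copy ViT-B-32.pt %USERPROFILE%\.cache\clip\

Load the model as usual. clip.load finds the cached file and skips the download as long as its checksum matches the official one:

model, preprocess = clip.load("ViT-B/32", device=device)  # uses the cached file, no download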
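A manually downloaded file that is truncated or altered will fail the library's checksum and trigger a fresh download, so it is worth verifying the file yourself first. The sketch below shows one way to do it with hashlib; sha256_of and is_valid_checkpoint are hypothetical helper names, and the self-check runs on a throwaway file so the block works without any real checkpoint.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_valid_checkpoint(path, expected_sha256):
    """True only if the file exists and its SHA256 matches the expected hex digest."""
    return os.path.isfile(path) and sha256_of(path) == expected_sha256.lower()

# Self-check on a throwaway file standing in for a downloaded checkpoint.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.pt")
    with open(sample_path, "wb") as handle:
        handle.write(b"not a real checkpoint")
    known_hash = hashlib.sha256(b"not a real checkpoint").hexdigest()
    demo_ok = is_valid_checkpoint(sample_path, known_hash)
    demo_bad = is_valid_checkpoint(sample_path, "0" * 64)

print(demo_ok, demo_bad)  # True False
```

For the real file, pass os.path.expanduser("~/.cache/clip/ViT-B-32.pt") together with the 64-character hex segment of the official ViT-B/32 URL.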

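🧩 Option 3: Use the Chinese-optimized `cn_clip` (for Chinese-language users)

pip install cn_clip

import torch
import cn_clip.clip as clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load_from_name("ViT-B-16", device=device, download_root="./models")

Project: https://github.com/OFA-Sys/Chinese-CLIP
It supports Chinese text, and according to the author the weights download from a host inside China by default.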

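📝 Step 2: Text encoding, two ways explained

There are two ways to get text embeddings from CLIP, and beginners often mix them up. One thing to know first: the openai/CLIP package has no model.get_text_features method. That name belongs to CLIPModel in Hugging Face transformers, where it takes token IDs rather than raw strings and does not normalize its output.

✅ Way 1: a small `get_text_features(model, texts, device)` helper ← recommended

Wrap the three steps in one function of your own:

tokenize
encode
normalize

This suits everyday use, image-text matching, and quick experiments.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def get_text_features(model, texts, device):
    """Tokenize, encode, and L2-normalize a list of strings."""
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    return features / features.norm(dim=-1, keepdim=True)

texts = ["a dog wearing sunglasses", "an astronaut in space", "a street on a rainy day"]
text_features = get_text_features(model, texts, device)  # ⭐ one call

print(text_features.shape)  # torch.Size([3, 512])
print(text_features.norm(dim=-1))  # every norm is 1 (up to float precision)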

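✅ Way 2: `clip.tokenize + model.encode_text` ← for fine-grained control

Use this when you need to:

Customize tokenization (for example, pass truncate=True for text longer than the 77-token context)
Inspect the intermediate tokens
Skip normalization

text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)  # not normalized

# Normalize manually if needed:
text_features = text_features / text_features.norm(dim=-1, keepdim=True)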

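🆚 Comparison

Method	Input	Tokenizes for you	Normalizes for you	When to use
`get_text_features` helper	list of strings	✅ Yes	✅ Yes	Recommended for everyday inference and matching
`encode_text`	token tensor	❌ No	❌ No	When you need control over the intermediate steps

🧠 Remember: normalize both vectors before treating their dot product as cosine similarity. A helper that normalizes at the end takes care of the text side.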
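Why normalization matters is easy to check by hand. The sketch below uses plain Python lists with made-up 2-d vectors instead of real embeddings: the raw dot product changes when a vector is merely rescaled, while the dot product of the normalized vectors stays equal to the cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

a = [3.0, 4.0]
b = [4.0, 3.0]
a_scaled = [10.0 * x for x in a]  # same direction as a, ten times longer

raw = dot(a, b)                       # 24.0
raw_scaled = dot(a_scaled, b)         # 240.0: length leaks into the score
cos = dot(l2_normalize(a), l2_normalize(b))
cos_scaled = dot(l2_normalize(a_scaled), l2_normalize(b))

print(raw, raw_scaled)
print(round(cos, 6), round(cos_scaled, 6))  # both 0.96
```

With tensors the same step is features / features.norm(dim=-1, keepdim=True), as in Way 2 above.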

🖼️ Step 3: Image encoding and image-text matching in practice

For images, run preprocess on the PIL image and then call encode_image.

✅ Complete image-text matching example

import clip
import torch
from PIL import Image
import requests

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 1. Encode the text (tokenize → encode → normalize)
texts = ["a dog wearing sunglasses", "a little girl eating ice cream", "an astronaut in space", "two cats on a sofa"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# 2. Encode the image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)  # normalize manually

# 3. Similarity: the dot product of normalized vectors is the cosine similarity.
#    Scale by 100 before softmax, as the official README does; without the
#    scale the probabilities come out almost uniform.
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(3)  # top 3

# 4. Print the results
print("🔍 Top 3 image-text matches:")
for score, idx in zip(values, indices):
    print(f'"{texts[idx]}" → match {score.item():.2%}')

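✅ Output format (the ranking and percentages depend on the model and the prompts):

🔍 Top 3 image-text matches:
"<best-matching text>" → match <score>%
"<second-best text>" → match <score>%
"<third-best text>" → match <score>%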

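🇨🇳 Bonus: Chinese-language support

The original CLIP was trained on English text and handles Chinese poorly.

✅ Solution 1: Use Chinese-CLIP (recommended)

pip install cn_clip

import torch
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device)
model.eval()

# Chinese input works directly
texts = ["戴着墨镜的狗", "吃冰淇淋的小女孩", "沙发上的两只猫"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

Supported models include ViT-B-16, ViT-L-14, and RN50
GitHub: https://github.com/OFA-Sys/Chinese-CLIP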

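✅ Solution 2: Translate to English first

Use a translation API (Baidu, Google, DeepL, and so on) to turn the Chinese into English, then feed the English text to CLIP.

# Pseudocode: translate_to_english is a placeholder for whichever translation API you use
chinese_texts = ["戴墨镜的狗", "吃冰淇淋的女孩"]
english_texts = translate_to_english(chinese_texts)  # ['dog wearing sunglasses', ...]
tokens = clip.tokenize(english_texts).to(device)
features = model.encode_text(tokens)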

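🛠️ Common questions (Q&A)

❓ 1. Which models are supported?

Model	Type	Speed	Accuracy	File size
`RN50`	ResNet	Fast	Fair	~240MB
`ViT-B/32`	Vision Transformer	Fairly fast	High	~340MB
`ViT-B/16`	Vision Transformer	Medium	Higher	~335MB
`ViT-L/14`	Vision Transformer	Slow	Highest	~890MB

Call clip.available_models() for the full list. For beginners, ViT-B/32 is a good balance of speed and accuracy.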

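❓ 2. Can I process batches?

Yes. tokenize, encode_text, and encode_image all work on batches:

texts = ["a cat"] * 1000  # 1000 texts
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    features = model.encode_text(text_tokens)  # shape: [1000, 512]

    images = torch.randn(100, 3, 224, 224).to(device)
    image_features = model.encode_image(images)  # shape: [100, 512]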
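One batch of 1000 items may not fit in GPU memory, so in practice you often encode in chunks. Here is a minimal sketch of that pattern; batched and encode_in_batches are hypothetical helper names, and fake_encode is a stand-in for the real encoder so the block runs without a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_in_batches(items, encode_fn, batch_size=256):
    """Apply encode_fn to each chunk and collect the per-item results in order."""
    outputs = []
    for batch in batched(items, batch_size):
        outputs.extend(encode_fn(batch))
    return outputs

def fake_encode(batch):
    """Stand-in encoder: one 1-d 'feature' per text (its length)."""
    return [[float(len(text))] for text in batch]

texts = ["a cat"] * 1000
features = encode_in_batches(texts, fake_encode, batch_size=256)
print(len(features), features[0])
```

With the real model, encode_fn would tokenize the chunk, call model.encode_text under torch.no_grad(), and you would join the per-chunk tensors with torch.cat instead of extending a list.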

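❓ 3. Why apply softmax to the similarities?

image_features @ text_features.T → raw cosine similarities, each between -1 and 1
Multiply by 100 (the scale used in the official README) → logits
.softmax(dim=-1) → a probability distribution over the texts that sums to 1, which is easier to read
If you only need a ranking, skip the softmax and compare the similarities directly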
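The effect of the scale factor is easy to see with a few numbers. The sketch below applies softmax to the same made-up cosine similarities with and without the factor of 100; the similarity values are illustrative, not model output.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and four texts.
cosine_sims = [0.30, 0.22, 0.18, 0.15]

unscaled = softmax(cosine_sims)                     # nearly uniform
scaled = softmax([100.0 * s for s in cosine_sims])  # sharply peaked

print([round(p, 3) for p in unscaled])
print([round(p, 3) for p in scaled])
```

Both versions rank the texts in the same order; the scale only changes how confident the resulting probabilities look.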

🎁 Appendix: complete script (copy, paste, run)

"""
📌 Author: 大写-凌祁
🎯 Explain the most complex ideas in the simplest words
🚀 Complete CLIP image-text matching example (tokenize + encode_text, with mirror tips)
"""

import clip
import torch
from PIL import Image
import requests

# ========================
# 💡 Tip: if downloads are slow, place the model in ~/.cache/clip/ beforehand
# Mirror: https://clip-as-service-hub.s3.timeweb.com/ViT-B-32.pt
# For Chinese text, consider cn_clip: pip install cn_clip
# ========================

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🚀 Using device: {device}")

# Load the model
model, preprocess = clip.load("ViT-B/32", device=device)
print("✅ Model loaded")

# Text prompts
texts = [
    "a dog wearing sunglasses while surfing",
    "a little girl eating strawberry ice cream",
    "an astronaut outside the International Space Station",
    "two cats by the window on a rainy day"
]

print("\n🧠 Encoding text...")
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Image
print("🖼️ Loading image...")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Similarity matching (scale by 100 before softmax, as in the official README)
print("\n🔍 Computing image-text similarity...")
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]
values, indices = similarity.topk(len(texts))  # full ranking

print("\n🏆 Final ranking:")
for i, (score, idx) in enumerate(zip(values, indices)):
    rank = i + 1
    print(f'#{rank}: "{texts[idx]}" → {score.item():.2%}')

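📚 What can you use CLIP for?

Use case	Description
Image-text search	Search images by text, or text by image
Prompt tuning for AI image generation	Score how well a prompt matches the generated image
Zero-shot image classification	Classify images with text labels, no training required
Content moderation	Flag images that do not match their text (clickbait)
Multimodal recommendation	Combine image and text semantics for personalized recommendations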