[LLM in Practice] Wrapping Open-Source Vision LLMs into a Multimodal Information Extraction Tool

Large models, multimodal, vision models, qwen-vl-2.5, glm-4v, openai

源泉的小广场 · Published 2025-08-26 20:22:05

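1. Functional Modules

This tool wraps several large vision models (Qwen2.5-VL, GLM4V, GPT-4V) behind a single calling interface for analyzing image content.

The main modules are: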

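Configuration management (DEFAULT_CONFIG)
- Stores the call parameters for each model (API URL, model name, API key, max tokens, temperature, and so on).
- Each model has its own configuration dictionary.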
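Vision model service class (VisionModelService)
Encapsulates all core functionality:
- _init_clients(): initializes a client instance for each model (depends on openai.OpenAI).
- encode_image_to_base64(): converts a local image into a base64 string for the model API.
- analyze_image_content(): builds the request, calls the model API to analyze the image, and returns the result, parsed as JSON when possible.
- _get_default_prompt(): provides a default Chinese prompt that asks the model to return the image information as JSON.
- get_supported_models(): returns the list of supported models.
- test_connection(): tests whether the specified model can be called successfully.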
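Global singleton service instance (_vision_service plus a factory function)
- get_vision_service(): ensures that only one VisionModelService is initialized per process, which avoids creating the clients repeatedly. Because the instance is cached, a config passed on any later call has no effect.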
Convenience functions (public interface)
- analyze_image_with_vision_model(): the public wrapper for analyzing an image.
- test_vision_model_connection(): the public wrapper for testing a connection.
Main entry point (if __name__ == "__main__":)
- First tests the connection to qwen2.5_vl.
- If data/test.jpg exists, calls the model to analyze it; otherwise reports that the image is missing.
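
The lazy singleton behind get_vision_service() has one behavior that is easy to miss: the configuration is only read on the first call. The sketch below illustrates this with a stand-in class. StubVisionService and get_service are hypothetical names used for illustration only, and no API client is created.

```python
from typing import Any, Dict, Optional


class StubVisionService:
    """Hypothetical stand-in for VisionModelService; it only stores its config."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {"qwen2.5_vl": {"model_name": "placeholder"}}


_service: Optional[StubVisionService] = None


def get_service(config: Optional[Dict[str, Any]] = None) -> StubVisionService:
    """Create the service on first use, then keep returning the same object."""
    global _service
    if _service is None:
        _service = StubVisionService(config)
    return _service


first = get_service({"glm4v": {"model_name": "placeholder"}})
# The second config is silently ignored because the instance already exists.
second = get_service({"gpt4v": {"model_name": "placeholder"}})

print(first is second)       # both names refer to the same instance
print(list(second.config))   # still the config from the first call
```

If different configurations are needed in one process, one option is to construct VisionModelService directly instead of going through the factory.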

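2. Call Flow

At run time the code follows this flow:

Initialization
- get_vision_service() creates the VisionModelService instance.
- _init_clients() initializes an OpenAI client for each model in the configuration.
Connection test (optional)
- Call test_vision_model_connection("qwen2.5_vl").
- It sends one simple message, "请回复'连接测试成功'" (Please reply "connection test succeeded").
- It inspects the response to decide whether the connection works.
Image analysis (core logic)
- Call analyze_image_with_vision_model(image_path, model_type, prompt).
- encode_image_to_base64() converts the image to base64.
- The messages request is built from the text prompt and the base64 image URL.
- client.chat.completions.create() generates the result.
- The result is parsed as JSON first; if parsing fails, the raw text is wrapped in a JSON-style dictionary and returned.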
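
The request-building step can be sketched in isolation, without calling any API. This is a minimal sketch, not the article's implementation: build_vision_messages is a hypothetical helper, and guessing the MIME type from the file name with mimetypes is an assumption added here (the original code always uses image/jpeg).

```python
import base64
import mimetypes
from typing import Any, Dict, List


def build_vision_messages(prompt: str, image_bytes: bytes, filename: str) -> List[Dict[str, Any]]:
    """Build an OpenAI-style chat message holding a text part and an image part."""
    # Guess the MIME type from the file extension; fall back to JPEG as the original does.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"

    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
            ],
        }
    ]


# The bytes below are only the 8-byte PNG signature, used as dummy image data.
messages = build_vision_messages("Describe this image as JSON.", b"\x89PNG\r\n\x1a\n", "label.png")
print(messages[0]["content"][1]["image_url"]["url"])
```

The returned list has the same shape as the messages argument passed to client.chat.completions.create() in the listing.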

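3. Key Implementation Details

Unified multi-model calls
The different models (Qwen, GLM, OpenAI) are all called through VisionModelService using the OpenAI-compatible client, which reduces duplicated code.
Image handling
- The image is read as binary, base64-encoded, wrapped into a data:image/jpeg;base64,xxx URL, and passed to the model. Note that the MIME type is hard-coded to image/jpeg regardless of the actual file format.
- This avoids the compatibility problems of passing a local file path, which a remote API cannot read.
Robustness
- A failed API call is caught as an exception and returned as status: error.
- An empty model response is returned as status: error, and a response that is not valid JSON is still returned, wrapped under the "分析结果" (analysis result) key.
- The image-analysis request sets a 60-second timeout so that a request cannot hang indefinitely.
Default prompt (prompt engineering)
- Instructs the model to output JSON.
- Requires every key to be in Chinese (for example "产品名称":"xxx", where the key means "product name").
- Puts particular emphasis on extracting text on packaging, ingredient lists, and similar details.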
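
Even with a prompt that asks for JSON only, models sometimes wrap the JSON in a Markdown code fence, in which case a plain json.loads() fails and the whole reply ends up under the fallback key. The sketch below shows one way to tolerate that. parse_model_json is a hypothetical helper, not part of the article's code, and wrapping non-object JSON under the fallback key is an assumption made here to keep the return type a dictionary.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # a Markdown code fence marker
# Matches a reply that consists of one fenced block, optionally tagged "json".
_FENCED_BLOCK = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL | re.IGNORECASE)


def parse_model_json(content: str) -> Dict[str, Any]:
    """Parse a model reply as JSON, unwrapping a surrounding code fence if present."""
    text = content.strip()
    match = _FENCED_BLOCK.match(text)
    if match:
        text = match.group(1)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Same fallback as the article: keep the raw text under "分析结果" (analysis result).
        return {"分析结果": content}
    if isinstance(parsed, dict):
        return parsed
    # Assumption: wrap lists and scalars so callers always receive a dictionary.
    return {"分析结果": parsed}


fenced_reply = FENCE + "json\n" + '{"产品名称": "绿茶"}' + "\n" + FENCE
print(parse_model_json(fenced_reply))
print(parse_model_json("这是一张绿茶包装的图片"))
```

A helper like this could replace the direct json.loads(content) call inside analyze_image_content() if fenced replies turn out to be common with a given model.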

"""
Vision model invocation tool.
Supports multiple vision models, including Qwen2.5-VL.
"""
import base64
import json
import os
from typing import Dict, Any, Optional, List

try:
    from openai import OpenAI
except ImportError:
    print("Warning: the openai library is not installed. Run: pip install openai")
    OpenAI = None

# Pillow is optional: nothing below calls it, so a missing install only warns.
try:
    from PIL import Image
except ImportError:
    print("Warning: the PIL library is not installed. Run: pip install pillow")
    Image = None

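# Default configuration. The "model" values below look like placeholders:
# set each one to the model ID your provider expects.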
DEFAULT_CONFIG = {
    "qwen2.5_vl": {
        "api_base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "model_name": "model",
        "api_key": os.getenv("QWEN_API_KEY"),
        "max_tokens": 4096,
        "temperature": 0.1
    },
    "glm4v": {
        "api_base_url": "https://open.bigmodel.cn/api/paas/v4",
        "model_name": "model",
        "api_key": os.getenv("GLM_API_KEY"),
        "max_tokens": 4096,
        "temperature": 0.1
    },
    "gpt4v": {
        "api_base_url": "https://api.openai.com/v1",
        "model_name": "model",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "max_tokens": 4096,
        "temperature": 0.1
    }
}

class VisionModelService:
    """Vision model service."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """
        Initialize the vision model service.

        Args:
            config: Configuration dictionary keyed by model type
        """
        self.config = config or DEFAULT_CONFIG
        self.clients = {}
        self._init_clients()

    def _init_clients(self):
        """Initialize a client for each configured model."""
        if OpenAI is None:
            print("Warning: the OpenAI library is not installed; cannot initialize clients")
            return

        for model_type, model_config in self.config.items():
            try:
                client = OpenAI(
                    # os.getenv() yields None for an unset variable, and .get() only
                    # applies its default to a missing key, so fall back with `or`.
                    api_key=model_config.get("api_key") or "dummy-key",
                    base_url=model_config.get("api_base_url"),
                )
                self.clients[model_type] = {
                    "client": client,
                    "config": model_config
                }
                print(f"✅ {model_type} client initialized")
            except Exception as e:
                print(f"❌ {model_type} client initialization failed: {e}")
    
    def encode_image_to_base64(self, image_path: str) -> str:
        """
        Encode an image file as a base64 string.

        Args:
            image_path: Path to the image file

        Returns:
            The base64-encoded image, or an empty string if reading fails
        """
        try:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode('utf-8')
        except Exception as e:
            print(f"Failed to encode image: {e}")
            return ""
    
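    def analyze_image_content(self,
                            image_path: str,
                            model_type: str = "qwen2.5_vl",
                            prompt: Optional[str] = None,
                            image_info: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        """
        Analyze the content of an image.

        Args:
            image_path: Path to the image file
            model_type: Model type (qwen2.5_vl, glm4v, gpt4v)
            prompt: Custom prompt; the default prompt is used when omitted
            image_info: Basic image metadata, echoed back in the result

        Returns:
            The image analysis result
        """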
        if model_type not in self.clients:
            return {
                "status": "error",
                "error": f"Unsupported model type: {model_type}",
                "supported_models": list(self.clients.keys())
            }
        
        client_info = self.clients[model_type]
        client = client_info["client"]
        config = client_info["config"]
        
        try:
            # Encode the image as base64
            base64_image = self.encode_image_to_base64(image_path)

            if not base64_image:
                return {
                    "status": "error",
                    "error": "Failed to encode image",
                    "image_path": image_path
                }

            # Fall back to the default prompt when no custom prompt is given
            if prompt is None:
                prompt = self._get_default_prompt()

            # Build the request messages
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            }
                        }
                    ]
                }
            ]
            
            # Call the model API with a 60-second timeout
            response = client.chat.completions.create(
                model=config["model_name"],
                messages=messages,
                max_tokens=config.get("max_tokens", 4096),
                temperature=config.get("temperature", 0.1),
                timeout=60  # seconds
            )
            
            # Parse the response
            if response.choices:
                content = response.choices[0].message.content

                # Try to parse the response as JSON
                try:
                    parsed_content = json.loads(content)
                    return {
                        "status": "success",
                        "model_type": model_type,
                        "content": parsed_content,
                        "raw_content": content,
                        "image_info": image_info or {}
                    }
                except json.JSONDecodeError:
                    # Not valid JSON: return the raw text under "分析结果" (analysis result)
                    return {
                        "status": "success",
                        "model_type": model_type,
                        "content": {"分析结果": content},
                        "raw_content": content,
                        "image_info": image_info or {}
                    }
            else:
                return {
                    "status": "error",
                    "error": "Model response is empty",
                    "model_type": model_type
                }

        except Exception as e:
            return {
                "status": "error",
                "error": f"Model call failed: {str(e)}",
                "model_type": model_type,
                "image_path": image_path
            }
    
    def _get_default_prompt(self) -> str:
        """Return the default image-analysis prompt (kept in Chinese because it requires Chinese JSON keys)."""
        return """你是一个专业的图片内容分析AI。请分析这张图片并提取其中的所有信息，以标准的JSON键值对格式返回，并且所有键名必须使用中文。

要求：
1. 提取图片中的所有可见文字信息，并尽可能组织成有意义的键值对
2. 识别产品信息：名称、规格、价格、功效、成分、使用方法等
3. 如果能识别出明确的标题和对应内容，请组织成"标题":"内容"的格式
4. 对于无法形成键值对的信息，可以用描述性的键名
5. 特别注意包装上的文字信息，包括品牌、产品名称、成分表、营养信息等

不要输出解释性文字，直接返回标准的JSON键值对格式。"""
    
    def get_supported_models(self) -> List[str]:
        """Return the list of supported models."""
        return list(self.clients.keys())
    
    def test_connection(self, model_type: str = "qwen2.5_vl") -> Dict[str, Any]:
        """
        Test the connection to a model.

        Args:
            model_type: Model type

        Returns:
            The connection test result
        """
        if model_type not in self.clients:
            return {
                "status": "error",
                "error": f"Unsupported model type: {model_type}"
            }

        try:
            client_info = self.clients[model_type]
            client = client_info["client"]
            config = client_info["config"]

            # Send a simple test request: 'Please reply "connection test succeeded"'
            messages = [
                {
                    "role": "user",
                    "content": "请回复'连接测试成功'"
                }
            ]

            response = client.chat.completions.create(
                model=config["model_name"],
                messages=messages,
                max_tokens=10,
                temperature=0
            )

            if response.choices:
                return {
                    "status": "success",
                    "model_type": model_type,
                    "message": "Connection test succeeded",
                    "response": response.choices[0].message.content
                }
            else:
                return {
                    "status": "error",
                    "error": "Model response is empty"
                }

        except Exception as e:
            return {
                "status": "error",
                "error": f"Connection test failed: {str(e)}",
                "model_type": model_type
            }

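# Global instance
_vision_service = None

def get_vision_service(config: Optional[Dict[str, Any]] = None) -> VisionModelService:
    """
    Return the shared vision model service instance.

    Args:
        config: Configuration dictionary; only used the first time the
            instance is created and ignored on later calls

    Returns:
        The VisionModelService instance
    """
    global _vision_service
    if _vision_service is None:
        _vision_service = VisionModelService(config)
    return _vision_service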

def analyze_image_with_vision_model(image_path: str,
                                  model_type: str = "qwen2.5_vl",
                                  prompt: Optional[str] = None,
                                  config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """
    Convenience function for analyzing an image with a vision model.

    Args:
        image_path: Path to the image file
        model_type: Model type
        prompt: Custom prompt
        config: Configuration dictionary

    Returns:
        The analysis result
    """
    service = get_vision_service(config)
    return service.analyze_image_content(image_path, model_type, prompt)

def test_vision_model_connection(model_type: str = "qwen2.5_vl",
                               config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """
    Convenience function for testing the connection to a vision model.

    Args:
        model_type: Model type
        config: Configuration dictionary

    Returns:
        The test result
    """
    service = get_vision_service(config)
    return service.test_connection(model_type)

# Example usage
if __name__ == "__main__":
    # Test the connection
    print("Testing the vision model connection...")
    result = test_vision_model_connection("qwen2.5_vl")
    print(f"Connection test result: {result}")

    # Analyze the test image if it exists
    test_image = "data/test.jpg"
    if os.path.exists(test_image):
        print(f"\nAnalyzing image: {test_image}")
        analysis_result = analyze_image_with_vision_model(test_image, "qwen2.5_vl")
        print(f"Analysis result: {json.dumps(analysis_result, ensure_ascii=False, indent=2)}")
    else:
        print(f"\nTest image not found: {test_image}")
