Evaluating Image Captioning with pycocoevalcap
1. Scenario
Image captioning is a fundamental task in both CV and MLLM research. In this post, we walk through how to use the pycocoevalcap package to evaluate a model's captioning ability.
2. Evaluation Metrics
For image captioning, we usually report BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. CIDEr and SPICE were designed specifically for caption evaluation, while BLEU, METEOR, and ROUGE-L are also widely used in other NLG tasks such as machine translation and summarization. We won't go into the internals of each metric here; interested readers can consult the original papers.
3. Installing and Configuring pycocoevalcap
pycocoevalcap is a Python 3 package of the official caption evaluation code for the MS COCO dataset, and it bundles nearly all of the common caption evaluation metrics. However, because the underlying code is fairly old, computing SPICE still depends on a Java 1.8 runtime, so installation is not entirely smooth. But let's take it step by step~
3.1 Installing pycocoevalcap
Inside your conda environment, a plain pip install is enough. Users in mainland China can install from a domestic mirror; here we use the Tsinghua source.
pip install pycocoevalcap -i https://pypi.tuna.tsinghua.edu.cn/simple
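If the install succeeded, the package should be importable right away; a quick sanity check (pycocotools is pulled in as a dependency):

from pycocoevalcap.eval import COCOEvalCap  # main evaluator
from pycocotools.coco import COCO           # COCO annotation loader
print('pycocoevalcap is ready')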
3.2 Installing Java 8
First, visit the official Java download site. I'm using an M1 MacBook, so I downloaded the macOS ARM64 installer from there.
If you are on a different device, pick the installer that matches your platform from the same page; Ubuntu users can follow the Linux instructions there. The download is a Java 8 Update 431 installation wizard; just double-click it and click Install.
Once the progress bar finishes, Java 8 is installed, and we can check it in the terminal.
# check the java version
(MLBD) xxx@xxx-MacBook-Pro ~ % java -version
java version "1.8.0_431"
Java(TM) SE Runtime Environment (build 1.8.0_431-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.431-b10, mixed mode)
# check the java home path
(MLBD) xxx@xxx-MacBook-Pro ~ % /usr/libexec/java_home -V
Matching Java Virtual Machines (1):
1.8.431.10 (arm64) "Oracle Corporation" - "Java" /Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home # this is the Java home path
/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home
With that, Java 8 is installed.
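Since pycocoevalcap ends up invoking java as a subprocess from Python, it is also worth confirming that the environment your evaluation runs in can see it. A minimal sketch:

import subprocess

# note: `java -version` writes to stderr, not stdout
result = subprocess.run(['java', '-version'], capture_output=True, text=True)
print(result.stderr)  # expect something like: java version "1.8.0_431"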
3.3 Commenting Out the Cache Code in spice.py
If you run a SPICE evaluation directly at this point, you may hit the following error:
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
Cell In[1], line 25
22 coco_eval.evaluate()
24 return coco_eval.eval
---> 25 compute_cider(result_path, annotation_path)
Cell In[1], line 22
20 coco_eval = COCOEvalCap(coco, coco_result)
21 coco_eval.params["image_id"] = coco_result.getImgIds()
---> 22 coco_eval.evaluate()
24 return coco_eval.eval
File /opt/miniconda3/envs/MLBD/lib/python3.10/site-packages/pycocoevalcap/eval.py:53, in COCOEvalCap.evaluate(self)
51 for scorer, method in scorers:
52 print('computing %s score...'%(scorer.method()))
---> 53 score, scores = scorer.compute_score(gts, res)
54 if type(method) == list:
55 for sc, scs, m in zip(score, scores, method):
File /opt/miniconda3/envs/MLBD/lib/python3.10/site-packages/pycocoevalcap/spice/spice.py:76, in Spice.compute_score(self, gts, res)
69 os.makedirs(cache_dir)
70 spice_cmd = ['java', '-jar', '-Xmx8G', SPICE_JAR, in_file.name,
71 '-cache', cache_dir,
72 '-out', out_file.name,
...
368 cmd = popenargs[0]
--> 369 raise CalledProcessError(retcode, cmd)
370 return 0
CalledProcessError: Command '['java', '-jar', '-Xmx8G', 'spice-1.0.jar', '/opt/miniconda3/envs/MLBD/lib/python3.10/site-packages/pycocoevalcap/spice/tmp/tmpq2cks2fv', '-cache', '/opt/miniconda3/envs/MLBD/lib/python3.10/site-packages/pycocoevalcap/spice/cache', '-out', '/opt/miniconda3/envs/MLBD/lib/python3.10/site-packages/pycocoevalcap/spice/tmp/tmp8ywvtify', '-subset', '-silent']' returned non-zero exit status 1.
This seems to be caused by the cache-directory handling in pycocoevalcap's SPICE wrapper (see the post "pycocoevalcap安装踩坑和填坑" for details). The fix is to locate spice.py inside pycocoevalcap and comment out the cache-related code. First, find where pycocoevalcap is installed:
(MLBD) xxx@xxx-MacBook-Pro mmtrain % pip show pycocoevalcap
Name: pycocoevalcap
Version: 1.2
Summary: MS-COCO Caption Evaluation for Python 3
Home-page: https://github.com/salaniz/pycocoevalcap
Author:
Author-email:
License: UNKNOWN
# this is the installation location
Location: /opt/miniconda3/envs/MLBD/lib/python3.10/site-packages
Requires: pycocotools
Required-by:
Next, open /opt/miniconda3/envs/MLBD/lib/python3.10/site-packages/pycocoevalcap/spice/spice.py, comment out the cache-related code as sketched below, and save.
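As a sketch (based on the spice.py that ships with pycocoevalcap 1.2; exact lines may differ across versions), the block inside Spice.compute_score should end up looking roughly like this, with the cache setup and the '-cache' argument commented out:

# cache_dir = os.path.join(cwd, CACHE_DIR)
# if not os.path.exists(cache_dir):
#     os.makedirs(cache_dir)
spice_cmd = ['java', '-jar', '-Xmx8G', SPICE_JAR, in_file.name,
             # '-cache', cache_dir,
             '-out', out_file.name,
             '-subset',
             '-silent'
             ]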
With that, the bug is resolved. If you still run into errors, the "pycocoevalcap安装踩坑和填坑" post referenced above is a good place to look.
4. Running the Evaluation
With installation and configuration in place, we can start evaluating. There are two main approaches:
4.1 Full evaluation via pycocotools
We first consolidate the annotations and the predictions into the required JSON formats, then compute all caption metrics in one pass.
In this example we evaluate the captions of two images; image 0 has five reference annotations and image 1 has four.
annotations.json (a dict) looks like this:
- annotations['images'][i]['id'] is the image id, used to distinguish images; each annotations['annotations'][j]['image_id'] refers back to one of these image ids.
- annotations['annotations'][j]['id'] is the id of the annotation itself, and annotations['annotations'][j]['caption'] is the reference caption.
{
"images": [
{
"id": 0
},
{
"id": 1
}
],
"annotations":[
{
"image_id": 0,
"id": 0,
"caption": "Two young guys with shaggy hair look at their hands while hanging out in the yard."
},
{
"image_id": 0,
"id": 1,
"caption": "Two young, White males are outside near many bushes."
},
{
"image_id": 0,
"id": 2,
"caption": "Two men in green shirts are standing in a yard."
},
{
"image_id": 0,
"id": 3,
"caption": "A man in a blue shirt standing in a garden."
},
{
"image_id": 0,
"id": 4,
"caption": "Two friends enjoy time spent together."
},
{
"image_id": 1,
"id": 5,
"caption": "Several men in hard hats are operating a giant pulley system."
},
{
"image_id": 1,
"id": 6,
"caption": "Workers look down from up above on a piece of equipment."
},
{
"image_id": 1,
"id": 7,
"caption": "Two men working on a machine wearing hard hats."
},
{
"image_id": 1,
"id": 8,
"caption": "Four men on top of a tall structure."
}
]
}
The model predictions are consolidated into preds.json (a list), as follows:
- "image_id" is the image id, and "caption" is the caption predicted by the model.
[
{
"image_id": 0,
"caption": "Two men standing in front of a door in a garden.Two men standing in front of a door in a garden."
},
{
"image_id": 1,
"caption": "a man standing on the top of a metal tower a man standing on the top of a metal tower(36"
}
]
from pycocoevalcap.eval import COCOEvalCap
from pycocotools.coco import COCO

result_path = 'preds.json'
annotation_path = 'annotations.json'

def compute_cider(result_path, annotation_path):
    # create the coco (ground truth) and coco_result (prediction) objects
    coco = COCO(annotation_path)
    coco_result = coco.loadRes(result_path)

    # build the evaluator from the two objects and restrict it
    # to the image ids that actually have predictions
    coco_eval = COCOEvalCap(coco, coco_result)
    coco_eval.params["image_id"] = coco_result.getImgIds()
    coco_eval.evaluate()

    # coco_eval.eval maps metric names to their scores
    return coco_eval.eval
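compute_cider returns coco_eval.eval, a plain dict mapping metric names to scores, so you can capture the result and post-process it however you like:

scores = compute_cider(result_path, annotation_path)
for metric, value in scores.items():
    print('%s: %.3f' % (metric, value))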
4.2 Evaluating only the specific metric(s) we care about
For this, we can follow the approach from "pycocoevalcap安装踩坑和填坑":
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

class Scorer():
    def __init__(self, ref, gt):
        # ref: {image_id: [predicted caption]}; gt: {image_id: [reference captions]}
        self.ref = ref
        self.gt = gt
        print('setting up scorers...')
        # keep only the (scorer, metric name) pairs you care about
        self.scorers = [
            (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
            (Meteor(), "METEOR"),
            (Rouge(), "ROUGE_L"),
            (Cider(), "CIDEr"),
            (Spice(), "SPICE"),
        ]

    def compute_scores(self):
        total_scores = {}
        for scorer, method in self.scorers:
            print('computing %s score...' % (scorer.method()))
            score, scores = scorer.compute_score(self.gt, self.ref)
            if type(method) == list:
                # Bleu returns a list of four scores (Bleu_1..Bleu_4)
                for sc, scs, m in zip(score, scores, method):
                    print("%s: %0.3f" % (m, sc))
                total_scores["Bleu"] = score
            else:
                print("%s: %0.3f" % (method, score))
                total_scores[method] = score

        print('*****DONE*****')
        for key, value in total_scores.items():
            print('{}:{}'.format(key, value))
        return total_scores

if __name__ == '__main__':
    ref = {
        '1': ['go down the stairs and stop at the bottom .'],
        '2': ['this is a cat.']
    }
    gt = {
        '1': ['Walk down the steps and stop at the bottom. ',
              'Go down the stairs and wait at the bottom.',
              'Once at the top of the stairway, walk down the spiral staircase all the way to the bottom floor. Once you have left the stairs you are in a foyer and that indicates you are at your destination.'],
        '2': ['It is a cat.', 'There is a cat over there.', 'cat over there.']
    }
    scorer = Scorer(ref, gt)
    scorer.compute_scores()
The output looks like this:
setting up scorers...
computing Bleu score...
{'testlen': 14, 'reflen': 13, 'guess': [14, 12, 10, 8], 'correct': [11, 9, 5, 2]}
ratio: 1.076923076840237
Bleu_1: 0.786
Bleu_2: 0.768
Bleu_3: 0.665
Bleu_4: 0.521
computing METEOR score...
METEOR: 0.904
computing Rouge score...
ROUGE_L: 0.694
computing CIDEr score...
CIDEr: 2.274
computing SPICE score...
Parsing reference captions
Initiating Stanford parsing pipeline
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
done [0.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.4 sec].
Threads( StanfordCoreNLP ) [0.440 seconds]
Parsing test captions
Threads( StanfordCoreNLP )
SPICE evaluation took: 3.809 s
SPICE: 0.383
*****DONE*****
Bleu:[0.7857142856581634, 0.7676494735193371, 0.6654242733719051, 0.5209655210466142]
METEOR:0.9037037037037037
ROUGE_L:0.6938153310104529
CIDEr:2.273565655101005
SPICE:0.38333333333333336
Tips:
Because the CIDEr computation relies on TF-IDF statistics over the evaluation corpus, we need to feed in predictions and annotations for at least two images; otherwise CIDEr comes out as 0.
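To see this concretely, here is a minimal sketch: with a single image, every n-gram's document frequency equals the corpus size, so all IDF weights vanish and CIDEr collapses to 0 no matter how good the caption is.

from pycocoevalcap.cider.cider import Cider

# one image only: the IDF weights all become zero, so the score is 0
gts = {'1': ['a cat sits on a mat.']}
res = {'1': ['a cat on a mat.']}
score, _ = Cider().compute_score(gts, res)
print(score)  # 0.0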