mAP: The Evaluation Metric for Object Detection

Python

Published: 2020-10-28

Updated: 2022-07-31

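Introduction:

To learn how AP and mAP are computed in object detection I read several references. A Zhihu article, https://zhuanlan.zhihu.com/p/43068926, explains it in an easy-to-follow way, so I am reposting it here to study it and to help my own understanding.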

1. Recall & Precision

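mAP stands for mean Average Precision. The Average Precision here is computed over different recall levels, so to understand mAP we first need to understand recall and precision.
Recall and precision are common evaluation metrics for binary classification. The class of interest is usually treated as the positive class and everything else as the negative class, so a classifier's predictions on test data fall into four cases: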

	Actual 1	Actual 0
Predicted 1	TP (True Positive)	FP (False Positive)
Predicted 0	FN (False Negative)	TN (True Negative)

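Precision (P), recall (R) and accuracy (acc) are computed as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad acc = \frac{TP + TN}{TP + TN + FP + FN}$$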

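A concrete example:
Suppose we trained a cat-recognition model on a dataset. The test set contains 100 samples: 60 cat images and 40 dog images. The model labels 52 images as cats, of which 50 really are cats. In other words, 10 cats were missed by the model and 2 of its detections are false alarms. Because cats are cuter, we care more about how well cats are detected, so we treat cat as the positive class: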

So TP=50, TN=38, FN=10, FP=2, P=50/52≈0.962, R=50/60≈0.833, acc=(50+38)/(50+38+10+2)=0.88
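The counts in this example can be checked with a short script. This is only a sketch: the label lists `y_true` and `y_pred` and the helper `confusion_counts` are made up here to mirror the numbers in the text (1 = cat, 0 = dog) and are not part of the original article.

```python
# Hypothetical reconstruction of the cat/dog example: 1 = cat (positive), 0 = dog.
y_true = [1] * 60 + [0] * 40
# 50 cats found, 10 cats missed, 2 dogs mistaken for cats, 38 dogs correctly rejected.
y_pred = [1] * 50 + [0] * 10 + [1] * 2 + [0] * 38


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)  # 50 2 10 38
print(round(precision, 3), round(recall, 3), round(accuracy, 3))  # 0.962 0.833 0.88
```

Note that accuracy (0.88) hides the fact that one in six cats is missed, which is exactly what recall exposes.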

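Why introduce recall and precision?

Recall and precision measure two different dimensions of model performance:

In image classification, accuracy is often the metric of interest, as in the ImageNet benchmark. For a single class, however, the two metrics can diverge. If recall is high but precision is low (most cars are recognized, but many trucks are also misclassified as cars), the model is too permissive for that class and produces many false positives. If recall is low but precision is high (the airplanes that are detected are almost all correct, but many airplanes go unrecognized), the model is too conservative and produces many false negatives. The two situations have different causes.

Recall measures completeness: were all positive samples found? In tumor prediction, for example, the model needs high recall so that no tumor is missed.

Precision measures correctness: of all the samples predicted positive, how many really are positive? In spam filtering, for example, high precision is required so that everything moved to the trash really is spam.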

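2. mAP (mean Average Precision)

While researching this, I found that mAP is easier to understand from the perspective of information retrieval.

In information retrieval, suppose we search for an item. The database holds 5 relevant entries in total, and the search returns 10 results, 4 of which are relevant. Precision = number of relevant entries returned / total number of entries returned, which is 4/10 here. Recall = number of relevant entries returned / total number of relevant entries, which is 4/5 here. For a search system, though, the order of the relevant entries in the results strongly affects user experience: we want relevant results as close to the top as possible. In this example, query 1, where the 4 relevant entries appear at positions (1, 2, 4, 7), is better than query 2, where they appear at (3, 5, 6, 8), yet both have the same precision. Precision alone therefore cannot tell a good system from a bad one, which is why AP (Average Precision), the mean of the precision values at different recall levels, is introduced. For the two queries above: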

Query 1:

rank | correct |  P   |  R
----------------------------
  1  |  right  | 1/1  | 1/5
  2  |  right  | 2/2  | 2/5
  3  |  wrong  | 2/3  | 2/5
  4  |  right  | 3/4  | 3/5
  5  |  wrong  | 3/5  | 3/5
  6  |  wrong  | 3/6  | 3/5
  7  |  right  | 4/7  | 4/5
  8  |  wrong  | 4/8  | 4/5
  9  |  wrong  | 4/9  | 4/5
 10  |  wrong  | 4/10 | 4/5
----------------------------

Query 2:

rank | correct |  P   |  R
----------------------------
  1  |  wrong  |  0   |  0
  2  |  wrong  |  0   |  0
  3  |  right  | 1/3  | 1/5
  4  |  wrong  | 1/4  | 1/5
  5  |  right  | 2/5  | 2/5
  6  |  right  | 3/6  | 3/5
  7  |  wrong  | 3/7  | 3/5
  8  |  right  | 4/8  | 4/5
  9  |  wrong  | 4/9  | 4/5
 10  |  wrong  | 4/10 | 4/5
----------------------------

AP (query 1) = (1 + 1 + 3/4 + 4/7 + 0)/5 = 0.664

AP (query 2) = (1/3 + 2/5 + 3/6 + 4/8 + 0)/5 = 0.347

The mAP is then (0.664 + 0.347)/2 = 0.51
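The AP values for the two queries can be reproduced with a few lines of Python. This is a minimal sketch, not code from the original article: the helper names `average_precision` and `ranks_to_relevance` are made up for illustration, and it assumes that relevant entries which are never retrieved contribute a precision of 0.

```python
def average_precision(relevance, num_relevant):
    """AP of one ranked result list.

    relevance: list of booleans, True where the result at that rank is relevant.
    num_relevant: number of relevant entries in the whole collection; entries
                  that are never retrieved contribute a precision of 0.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank where this relevant entry first appears.
            precision_sum += hits / rank
    return precision_sum / num_relevant


def ranks_to_relevance(relevant_ranks, length=10):
    """Turn a set of 1-based ranks into a per-rank relevance list."""
    return [rank in relevant_ranks for rank in range(1, length + 1)]


ap_1 = average_precision(ranks_to_relevance({1, 2, 4, 7}), num_relevant=5)
ap_2 = average_precision(ranks_to_relevance({3, 5, 6, 8}), num_relevant=5)
mean_ap = (ap_1 + ap_2) / 2

print(round(ap_1, 3), round(ap_2, 3), round(mean_ap, 2))  # 0.664 0.347 0.51
```

Both queries return the same 4 relevant entries, so the difference in AP comes entirely from their ranking.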

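Analysis: in the example above, the best possible outcome is that all 5 entries are retrieved and ranked at positions 1, 2, 3, 4, 5, which gives AP = 1. So even when every entry is retrieved, the order of the results still determines how good the system is. This conclusion carries over to object detection.

Note: when computing AP, take the maximum precision at each recall level, because the maximum precision at a given recall corresponds to the rank at which that relevant entry first appears. In both queries the fifth relevant entry is never retrieved, so it contributes a precision of 0.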

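3. mAP in Object Detection

Image classification usually uses accuracy to measure a model. Object detection is different. Suppose the images in the test set contain 1000 objects in total (objects, not images, since one image may contain several objects), and two models each correctly detect 900 of them (IoU above the chosen threshold). Unlike classification, detection can produce duplicate detections, so predictions and objects do not map one-to-one. In this example classification accuracy therefore cannot be used directly: model A may have needed 2000 predictions to get 900 hits, while model B needed only 1200. Model B is clearly better (precision 900/1200 = 0.75 versus 900/2000 = 0.45), because model A casts a wide net and produces far more false detections. Imagine model A in a self-driving car: frequent false detections would seriously affect both safety and comfort.

So how should a detection model be evaluated? One criterion follows information retrieval: measure not only how many objects are correctly detected, but also whether the model detects them with high precision. In other words, for a given class, how many mistakes does the model make before it finds the correct objects? The higher the AP, the fewer such mistakes. Averaging AP over all classes gives mAP.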

4. Computation Methods and Code

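VOC2007 method:
To compute AP, first sort the detections by confidence. AP is then the mean of the interpolated precision at 11 equally spaced recall levels:

$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\text{interp}}(r), \qquad p_{\text{interp}}(r) = \max_{r' \ge r} p(r')$$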

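VOC2010 method:
Compared with 2007, the method used from 2010 onward takes every recall value that actually occurs. After obtaining all the recall/precision data points as in the 2007 method, it computes the area under the recall/precision curve: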

Compute a version of the measured precision/recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for any recall r′ ≥ r.
Compute the AP as the area under this curve by numerical integration. No approximation is involved since the curve is piecewise constant.

A concrete example:
For the Aeroplane class, suppose we have the following output (BB is the bounding box index; GT=1 when IoU>0.5):

BB  | confidence | GT
----------------------
BB1 |  0.9       | 1
BB2 |  0.9       | 1
BB1 |  0.8       | 1
BB3 |  0.7       | 0
BB4 |  0.7       | 0
BB5 |  0.7       | 1
BB6 |  0.7       | 0
BB7 |  0.7       | 0
BB8 |  0.7       | 1
BB9 |  0.7       | 1
----------------------

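So we have TP=5 (BB1, BB2, BB5, BB8, BB9) and FP=5 (the duplicate detection of BB1 also counts as an FP). Besides the 5 ground truths detected in the table, 2 more ground truths were not detected, so FN=2. We can now list the precision/recall values in order of confidence: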

rank=1  precision=1.00 and recall=0.14
rank=2  precision=1.00 and recall=0.29
rank=3  precision=0.67 and recall=0.29
rank=4  precision=0.50 and recall=0.29
rank=5  precision=0.40 and recall=0.29
rank=6  precision=0.50 and recall=0.43
rank=7  precision=0.43 and recall=0.43
rank=8  precision=0.38 and recall=0.43
rank=9  precision=0.44 and recall=0.57
rank=10 precision=0.50 and recall=0.71
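The per-rank precision/recall values can be derived from the detection table with a short script. This is a sketch under stated assumptions: the tuple layout and the function name `pr_points` are made up here, and the BB label is assumed to identify which ground-truth box a detection overlaps, so the second BB1 is treated as a duplicate and counted as an FP.

```python
# (matched ground-truth label, confidence, IoU > 0.5?) in the order of the table.
detections = [
    ("BB1", 0.9, True), ("BB2", 0.9, True), ("BB1", 0.8, True),
    ("BB3", 0.7, False), ("BB4", 0.7, False), ("BB5", 0.7, True),
    ("BB6", 0.7, False), ("BB7", 0.7, False), ("BB8", 0.7, True),
    ("BB9", 0.7, True),
]
num_gt = 7  # 5 detected ground truths + 2 missed ones


def pr_points(detections, num_gt):
    """Return a list of (precision, recall) pairs, one per rank."""
    # sorted() is stable, so detections with equal confidence keep table order.
    ordered = sorted(detections, key=lambda d: -d[1])
    claimed = set()  # ground truths that already have a true positive
    tp = fp = 0
    points = []
    for gt_label, _confidence, overlaps in ordered:
        if overlaps and gt_label not in claimed:
            claimed.add(gt_label)
            tp += 1
        else:
            # Either the IoU is too low or this ground truth was already detected.
            fp += 1
        points.append((tp / (tp + fp), tp / num_gt))
    return points


points = pr_points(detections, num_gt)
for rank, (p, r) in enumerate(points, start=1):
    print(f"rank={rank} precision={p:.2f} recall={r:.2f}")
```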

VOC2007 method:
For Recall >= {0, 0.1, ..., 1} we take the maximum precision at each of the 11 points: 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0, 0. AP = 5.5 / 11 = 0.5

VOC2010 and later:
For Recall >= {0, 0.14, 0.29, 0.43, 0.57, 0.71, 1} we take the maximum precision at each point: 1, 1, 1, 0.5, 0.5, 0.5, 0. The area under the recall/precision curve is then: AP = (0.14-0)x1 + (0.29-0.14)x1 + (0.43-0.29)x0.5 + (0.57-0.43)x0.5 + (0.71-0.57)x0.5 + (1-0.71)x0 = 0.5

After computing the AP of each class, averaging over all classes gives mAP.
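Both calculations can be checked numerically. The following is a pure-Python sketch, not the reference implementation: the names `tp_flags`, `precision_recall`, `ap_11_point` and `ap_area` are made up for illustration, and the TP flags are read off the Aeroplane example (the duplicate BB1 at rank 3 counts as an FP).

```python
# 1 = true positive, 0 = false positive, in confidence order.
tp_flags = [1, 1, 0, 0, 0, 1, 0, 0, 1, 1]
num_gt = 7


def precision_recall(tp_flags, num_gt):
    """Cumulative precision and recall at each rank."""
    precisions, recalls = [], []
    tp = 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    return precisions, recalls


def ap_11_point(precisions, recalls):
    """VOC2007 style: mean of the max precision at recall >= 0, 0.1, ..., 1."""
    total = 0.0
    for i in range(11):
        threshold = i / 10
        candidates = [p for p, r in zip(precisions, recalls) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11


def ap_area(precisions, recalls):
    """VOC2010 style: area under the monotonically decreasing PR envelope."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Sweep right to left so each precision becomes the max of everything after it.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Only steps where recall changes contribute a non-zero width.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


precisions, recalls = precision_recall(tp_flags, num_gt)
ap_07 = ap_11_point(precisions, recalls)
ap_10 = ap_area(precisions, recalls)
print(round(ap_07, 3), round(ap_10, 3))  # 0.5 0.5
```

The two methods happen to agree on this example; in general the 11-point value is a coarser approximation of the area.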

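Code:

import numpy as np

# Compute recall, precision and AP for one class.
# Excerpt from the body of voc_eval(): imagenames, recs, classname, detpath,
# ovthresh and use_07_metric are defined earlier in the function (not shown).
    class_recs = {}
    npos = 0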
    for imagename in imagenames:
        R = [obj for obj in recs[imagename] if obj['name'] == classname] 
        bbox = np.array([x['bbox'] for x in R])
        difficult = np.array([x['difficult'] for x in R]).astype(bool)  # np.bool was removed in NumPy 1.24
        det = [False] * len(R)  # one flag per ground truth, used to detect duplicate detections
        npos = npos + sum(~difficult)
        class_recs[imagename] = {'bbox': bbox,
                                 'difficult': difficult,
                                 'det': det}

    # read dets
    detfile = detpath.format(classname)
    with open(detfile, 'r') as f:
        lines = f.readlines()

    splitlines = [x.strip().split(' ') for x in lines]
    image_ids = [x[0] for x in splitlines]
    confidence = np.array([float(x[1]) for x in splitlines])
    BB = np.array([[float(z) for z in x[2:]] for x in splitlines])

    # sort by confidence
    sorted_ind = np.argsort(-confidence)
    BB = BB[sorted_ind, :]
    image_ids = [image_ids[x] for x in sorted_ind]

    # go down dets and mark TPs and FPs
    nd = len(image_ids)
    tp = np.zeros(nd)
    fp = np.zeros(nd)
    for d in range(nd):
        R = class_recs[image_ids[d]]
        bb = BB[d, :].astype(float)
        ovmax = -np.inf
        BBGT = R['bbox'].astype(float)

        if BBGT.size > 0:
            # compute overlaps
            # intersection
            ixmin = np.maximum(BBGT[:, 0], bb[0])
            iymin = np.maximum(BBGT[:, 1], bb[1])
            ixmax = np.minimum(BBGT[:, 2], bb[2])
            iymax = np.minimum(BBGT[:, 3], bb[3])
            iw = np.maximum(ixmax - ixmin + 1., 0.)
            ih = np.maximum(iymax - iymin + 1., 0.)
            inters = iw * ih

            # union
            uni = ((bb[2] - bb[0] + 1.) * (bb[3] - bb[1] + 1.) +
                   (BBGT[:, 2] - BBGT[:, 0] + 1.) *
                   (BBGT[:, 3] - BBGT[:, 1] + 1.) - inters)

            overlaps = inters / uni
            ovmax = np.max(overlaps)
            jmax = np.argmax(overlaps)

        if ovmax > ovthresh:
            if not R['difficult'][jmax]:
                if not R['det'][jmax]:
                    tp[d] = 1.
                    R['det'][jmax] = True  # mark this ground truth as detected; any later match is a duplicate (FP)
                else:
                    fp[d] = 1.
        else:
            fp[d] = 1.

    # compute precision recall
    fp = np.cumsum(fp)
    tp = np.cumsum(tp)
    rec = tp / float(npos)
    # avoid divide by zero in case the first detection matches a difficult
    # ground truth
    prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
    ap = voc_ap(rec, prec, use_07_metric)

    return rec, prec, ap
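The overlap computation in the excerpt adds 1 to every width and height, which treats box coordinates as inclusive pixel indices. A standalone sketch of the same IoU formula for a single pair of boxes may make this easier to verify by hand; the function name `iou_inclusive` and the sample boxes are made up for illustration.

```python
def iou_inclusive(box_a, box_b):
    """IoU of two [xmin, ymin, xmax, ymax] boxes with inclusive pixel coordinates."""
    ixmin = max(box_a[0], box_b[0])
    iymin = max(box_a[1], box_b[1])
    ixmax = min(box_a[2], box_b[2])
    iymax = min(box_a[3], box_b[3])
    # Clamp at 0 so disjoint boxes get an empty intersection.
    iw = max(ixmax - ixmin + 1.0, 0.0)
    ih = max(iymax - iymin + 1.0, 0.0)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0] + 1.0) * (box_a[3] - box_a[1] + 1.0)
    area_b = (box_b[2] - box_b[0] + 1.0) * (box_b[3] - box_b[1] + 1.0)
    return inter / (area_a + area_b - inter)


same = iou_inclusive([0, 0, 9, 9], [0, 0, 9, 9])          # identical 10x10 boxes
half = iou_inclusive([0, 0, 9, 9], [5, 0, 14, 9])         # shifted right by 5 pixels
apart = iou_inclusive([0, 0, 9, 9], [20, 20, 29, 29])     # no overlap
print(same, round(half, 3), apart)  # 1.0 0.333 0.0
```

With a threshold of 0.5, only the first pair would count as a match.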

Computing AP:

def voc_ap(rec, prec, use_07_metric=False):
    """Compute VOC AP given precision and recall. If use_07_metric is true, uses
    the VOC 07 11-point method (default:False).
    """
    if use_07_metric:
        # 11 point metric
        ap = 0.
        for t in np.arange(0., 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                p = np.max(prec[rec >= t])
            ap = ap + p / 11.
    else:
        # correct AP calculation
        # first append sentinel values at the end
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))

        # compute the precision envelope
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
        i = np.where(mrec[1:] != mrec[:-1])[0]

        # and sum (\Delta recall) * prec
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])  # area under the PR envelope
    return ap

Computing mAP:

def mAP():
    # get_dir and get_classlist are the author's own helpers (not shown here)
    detpath, annopath, imagesetfile, cachedir, class_path = get_dir('kitti')
    ovthresh = 0.5  # IoU threshold for counting a detection as a true positive
    use_07_metric = False

    ap_sum = 0.0
    class_list = get_classlist(class_path)
    for classname in class_list:
        rec, prec, ap = voc_eval(detpath,
                                 annopath,
                                 imagesetfile,
                                 classname,
                                 cachedir,
                                 ovthresh=ovthresh,
                                 use_07_metric=use_07_metric,
                                 kitti=True)
        print('on {}, the ap is {}, recall is {}, precision is {}'.format(classname, ap, rec[-1], prec[-1]))
        ap_sum += ap

    # mAP is the mean of the per-class AP values
    return ap_sum / len(class_list)