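Author: Tyan
Blog: noahsnail.com | CSDN | Jianshu
Disclaimer: The author translated this paper for study purposes only. If it infringes any rights, please contact the author to have the post removed. Thank you!
Collection of translated papers: https://github.com/SnailTyan/deep-learning-papers-translation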
YOLO9000: Better, Faster, Stronger
Abstract
We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster R-CNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don’t have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.
1. Introduction
General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects.
Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags [3] [10] [2]. Classification datasets have millions of images with tens or hundreds of thousands of categories [20] [2].
We would like detection to scale to the level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free). Thus we are unlikely to see detection datasets on the same scale as classification datasets in the near future.
We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together.
We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness.
Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO.

Figure 1: YOLO9000. YOLO9000 can detect a wide variety of object classes in real-time.
All of our code and pre-trained models are available online at http://pjreddie.com/yolo9000/.
2. Better
YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy.
Computer vision generally trends towards larger, deeper networks [6] [18] [17]. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a more accurate detector that is still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We pool a variety of ideas from past work with our own novel concepts to improve YOLO’s performance. A summary of results can be found in Table 2.

Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. Two exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the anchor box style approach increased recall without changing mAP, while using the new network cut computation by $33\%$.
Batch Normalization. Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization on all of the convolutional layers in YOLO we get more than $2\%$ improvement in mAP. Batch normalization also helps regularize the model. With batch normalization we can remove dropout from the model without overfitting.
High Resolution Classifier. All state-of-the-art detection methods use classifiers pre-trained on ImageNet [16]. Starting with AlexNet, most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 × 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.
For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost $4\%$ mAP.
Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly Faster R-CNN predicts bounding boxes using hand-picked priors [15]. Using only convolutional layers the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in a feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.
We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First we eliminate one pooling layer to make the output of the network’s convolutional layers higher resolution. We also shrink the network to operate on 416 × 416 input images instead of 448 × 448. We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image, so it’s good to have a single location right at the center to predict these objects instead of four locations that are all nearby. YOLO’s convolutional layers downsample the image by a factor of 32, so by using an input image of 416 × 416 we get an output feature map of 13 × 13.
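The arithmetic behind the choice of 416 can be checked with a minimal sketch. The helper name `feature_map_size` is hypothetical, and the sketch assumes a square input and the total downsampling factor of 32 stated above.

```python
def feature_map_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output grid for a square input of the given size."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


for size in (448, 416):
    grid = feature_map_size(size)
    # An odd grid has exactly one cell at the center of the image.
    layout = "single center cell" if grid % 2 == 1 else "four cells meet at the center"
    print(f"{size} x {size} input -> {grid} x {grid} grid ({layout})")
```

With a factor of 32, a 448 × 448 input would give an even 14 × 14 grid, whereas 416 × 416 gives the odd 13 × 13 grid described in the text.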
When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box and the class predictions predict the conditional probability of that class given that there is an object.
Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of $81\%$. With anchor boxes our model gets 69.2 mAP with a recall of $88\%$. Even though the mAP decreases, the increase in recall means that our model has more room to improve.
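The box counts can be reproduced with a small sketch. Two values here are assumptions not stated in this section: the original YOLO's 7 × 7 grid with 2 boxes per cell (which gives the quoted 98), and 9 anchors per cell for the intermediate anchor-box model (the usual number of hand-picked anchors).

```python
def boxes_per_image(grid: int, boxes_per_cell: int) -> int:
    """Total number of predicted boxes for a square grid."""
    return grid * grid * boxes_per_cell


# Assumed original YOLO layout: 7 x 7 grid, 2 boxes per cell.
print(boxes_per_image(7, 2))
# Assumed intermediate model: 13 x 13 grid, 9 anchors per cell.
print(boxes_per_image(13, 9))
```

Under these assumptions the anchor-box model predicts 1521 boxes, consistent with "more than a thousand".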
Dimension Clusters. We encounter two issues with anchor boxes when using them with YOLO. The first is that the box dimensions are hand picked. The network can learn to adjust the boxes appropriately but if we pick better priors for the network to start with we can make it easier for the network to learn to predict good detections.
Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus for our distance metric we use:
$$
d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})
$$
We run k-means for various values of $k$ and plot the average IOU with closest centroid, see Figure 2. We choose $k=5$ as a good tradeoff between model complexity and high recall. The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes.

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for $k$. We find that $k = 5$ gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.
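The clustering procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: boxes are (width, height) pairs compared as if they shared a center, centroids are updated with the per-cluster mean, and the names `iou_wh`, `kmeans_iou`, and `average_iou` as well as the toy data are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (width, height) boxes, assuming they share the same center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_iou(boxes, k, iterations=100, seed=0):
    """k-means over box dimensions using d = 1 - IOU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IOU is the same as largest IOU.
            nearest = min(range(k), key=lambda i: 1.0 - iou_wh(box, centroids[i]))
            clusters[nearest].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:
            break  # assignments no longer change the centroids
        centroids = new_centroids
    return centroids


def average_iou(boxes, centroids):
    """Mean IOU between each box and its closest centroid."""
    return sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)


# Toy data: three tall boxes and three wide boxes.
boxes = [(1.0, 2.0), (1.1, 2.2), (0.9, 1.9), (3.0, 1.0), (3.2, 1.1), (2.9, 0.9)]
priors = kmeans_iou(boxes, k=2)
print(priors, average_iou(boxes, priors))
```

Running `average_iou` for several values of `k` gives the kind of curve shown in the left panel of Figure 2.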
We compare the average IOU to the closest prior for our clustering strategy and for the hand-picked anchor boxes in Table 1. At only 5 priors the centroids perform similarly to 9 anchor boxes, with an average IOU of 61.0 compared to 60.9. If we use 9 centroids we see a much higher average IOU. This indicates that using k-means to generate our bounding box priors starts the model off with a better representation and makes the task easier to learn.

Table 1: Average IOU of boxes to closest priors on VOC 2007. The average IOU of objects on VOC 2007 to their closest, unmodified prior using different generation methods. Clustering gives much better results than using hand-picked priors.
Direct location prediction. When using anchor boxes with YOLO we encounter a second issue: model instability, especially during early iterations. Most of the instability comes from predicting the $(x,y)$ locations for the box. In region proposal networks the network predicts values $t_x$ and $t_y$ and the $(x,y)$ center coordinates are calculated as:
$$
x = (t_x * w_a) + x_a\\
y = (t_y * h_a) + y_a
$$
For example, a prediction of $t_x = 1$ would shift the box to the right by the width of the anchor box; a prediction of $t_x = -1$ would shift it to the left by the same amount.
This formulation is unconstrained so any anchor box can end up at any point in the image, regardless of what location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets.
Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between $0$ and $1$. We use a logistic activation to constrain the network's predictions to fall in this range.
The network predicts 5 bounding boxes at each cell in the output feature map. The network predicts 5 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$, and $t_o$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:
$$
b_x = \sigma(t_x) + c_x \\
b_y = \sigma(t_y) + c_y\\
b_w = p_w e^{t_w}\\
b_h = p_h e^{t_h}\\
Pr(\text{object}) * IOU(b, \text{object}) = \sigma(t_o)
$$

Figure 3: Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function.
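The prediction equations above translate directly into code. The following is a minimal sketch, assuming all quantities are expressed in grid-cell units; the function name `decode_box` and its argument layout are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs (t_x, t_y, t_w, t_h, t_o) to a box and a confidence.

    cell  = (c_x, c_y): offset of the grid cell from the top-left corner
    prior = (p_w, p_h): width and height of the bounding box prior
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x   # center stays inside the predicting cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # prior scaled by a positive factor
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)  # Pr(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, confidence


# All-zero outputs place the box at the center of its cell with the prior's size.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 4.0)))
```

Because the sigmoid keeps the center offsets between 0 and 1, even extreme values of $t_x$ and $t_y$ cannot move the center out of the cell that made the prediction.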
Since we constrain the location prediction the parametrization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost $5\%$ over the version with anchor boxes.
Fine-Grained Features. This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.
The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine grained features. This gives a modest $1\%$ performance increase.
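The reshaping performed by the passthrough layer can be illustrated with a space-to-depth sketch on nested lists. This is an assumption-laden illustration: the function name `space_to_depth` is hypothetical, the layout is [height][width][channels], and the channel ordering inside each stacked cell is one possible choice that may differ from the actual Darknet layer.

```python
def space_to_depth(feature_map, block=2):
    """Stack each block x block spatial neighborhood into the channel dimension.

    feature_map is indexed as [height][width][channels]; the result has shape
    [height / block][width / block][channels * block * block].
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % block or width % block:
        raise ValueError("spatial size must be divisible by the block size")
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            stacked = []
            for di in range(block):
                for dj in range(block):
                    stacked.extend(feature_map[i + di][j + dj])
            row.append(stacked)
        out.append(row)
    return out


# A 4 x 4 map with one channel becomes a 2 x 2 map with four channels.
small = [[[r * 4 + c] for c in range(4)] for r in range(4)]
print(space_to_depth(small))
```

Applied to a 26 × 26 × 512 map with a block size of 2, the same function yields 13 × 13 × 2048, matching the shapes in the text.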
Multi-Scale Training. The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416×416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model.
Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
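The size schedule can be sketched as a generator. This is a minimal illustration, assuming the first size is also drawn at random and that sizes are sampled uniformly; the name `input_size_schedule` and the seeding are hypothetical.

```python
import random

# Multiples of 32 from 320 to 608 inclusive.
SIZES = list(range(320, 608 + 1, 32))


def input_size_schedule(num_batches, every=10, seed=0):
    """Yield the square input size to use for each training batch."""
    rng = random.Random(seed)
    size = rng.choice(SIZES)
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = rng.choice(SIZES)  # pick a new resolution every `every` batches
        yield size


for batch, size in enumerate(input_size_schedule(30)):
    if batch % 10 == 0:
        print(f"batch {batch}: train at {size} x {size}")
```

In a real training loop the network and the input pipeline would both be resized to the yielded dimension before the next batch is processed.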
This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions. The network runs faster at smaller sizes so YOLOv2 offers an easy tradeoff between speed and accuracy.
At low resolutions YOLOv2 operates as a cheap, fairly accurate detector. At 288 × 288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high framerate video, or multiple video streams.
At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. See Table 3 and Figure 4 for a comparison of YOLOv2 with other frameworks on VOC 2007.

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is on a GeForce GTX Titan X (original, not Pascal model).

Figure 4: Accuracy and speed on VOC 2007.
Further Experiments. We train YOLOv2 for detection on VOC 2012. Table 4 shows the comparative performance of YOLOv2 versus other state-of-the-art detection systems. YOLOv2 achieves 73.4 mAP while running far faster than competing methods. We also train on COCO and compare to other methods in Table 5. On the VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP, comparable to SSD and Faster R-CNN.

Table 4: PASCAL VOC2012 test detection results. YOLOv2 performs on par with state-of-the-art detectors like Faster R-CNN with ResNet and SSD512 and is 2-10× faster.

Table 5: Results on COCO test-dev2015. Table adapted from [11]
3. Faster
We want detection to be accurate but we also want it to be fast. Most applications for detection, like robotics or self-driving cars, rely on low latency predictions. In order to maximize performance we design YOLOv2 to be fast from the ground up.
Most detection frameworks rely on VGG-16 as the base feature extractor [17]. VGG-16 is a powerful, accurate classification network but it is needlessly complex. The convolutional layers of VGG-16 require 30.69 billion floating point operations for a single pass over a single image at 224 × 224 resolution.
The YOLO framework uses a custom network based on the GoogLeNet architecture [19]. This network is faster than VGG-16, only using 8.52 billion operations for a forward pass. However, its accuracy is slightly worse than VGG-16. For single-crop, top-5 accuracy at 224 × 224, YOLO’s custom model gets $88.0\%$ on ImageNet compared to $90.0\%$ for VGG-16.
Darknet-19. We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7].
Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. For a full description see Table 6. Darknet-19 only requires 5.58 billion operations to process an image yet achieves $72.9\%$ top-1 accuracy and $91.2\%$ top-5 accuracy on ImageNet.

Table 6: Darknet-19.
Training for classification. We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework [13]. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
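The text gives only the starting rate and the power of the polynomial decay. A common form of such a schedule, assumed here rather than taken from the source, is $\text{lr}(t) = \text{lr}_0 \, (1 - t/T)^{p}$, where $t$ is the current iteration and $T$ the total number of iterations. The function name `polynomial_lr` is hypothetical.

```python
def polynomial_lr(iteration, max_iterations, base_lr=0.1, power=4):
    """Polynomial decay from base_lr at iteration 0 down to 0 at max_iterations."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


for step in (0, 25, 50, 75, 100):
    print(step, polynomial_lr(step, max_iterations=100))
```

With a power of 4 the rate falls quickly at first: halfway through training it is already one sixteenth of its starting value.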
As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of $10^{-3}$. At this higher resolution our network achieves a top-1 accuracy of $76.5\%$ and a top-5 accuracy of $93.3\%$.
Training for detection. We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
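The filter count of the final 1 × 1 layer follows from simple arithmetic, sketched below. The helper name `detection_filters` is hypothetical, and the COCO class count of 80 is a figure not stated in this section.

```python
def detection_filters(num_boxes: int, num_classes: int, coords: int = 5) -> int:
    """Output filters needed: each box carries its coordinates plus class scores."""
    return num_boxes * (coords + num_classes)


print(detection_filters(num_boxes=5, num_classes=20))  # VOC
print(detection_filters(num_boxes=5, num_classes=80))  # COCO, assuming 80 classes
```

For VOC this gives 5 × (5 + 20) = 125, the number quoted in the text.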
We train the network for 160 epochs with a starting learning rate of $10^{-3}$, dividing it by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.
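The step schedule for detection training can be written as a short function. This sketch assumes the rate drops at the start of epochs 60 and 90 (zero-indexed), a detail the text does not specify; the name `detection_lr` is hypothetical.

```python
def detection_lr(epoch, base_lr=1e-3, milestones=(60, 90), factor=10.0):
    """Learning rate for a given epoch, divided by `factor` at each milestone passed."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor
    return lr


for epoch in (0, 59, 60, 89, 90, 159):
    print(epoch, detection_lr(epoch))
```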
4. Stronger
We propose a mechanism for jointly training on classification and detection data. Our method uses images labelled for detection to learn detection-specific information like bounding box coordinate prediction and objectness as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect.
During training we mix images from both detection and classification datasets. When our network sees an image labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image we only backpropagate loss from the classification-specific parts of the architecture.
This approach presents a few challenges. Detection datasets have only common objects and general labels, like dog or boat. Classification datasets have a much wider and deeper range of labels. ImageNet has more than a hundred breeds of dog, including Norfolk terrier, Yorkshire terrier, and Bedlington terrier. If we want to train on both datasets we need a coherent way to merge these labels.
Most approaches to classification use a softmax layer across all the possible categories to compute the final probability distribution. Using a softmax assumes the classes are mutually exclusive. This presents problems for combining datasets: for example, you would not want to combine ImageNet and COCO using this model, because the classes “Norfolk terrier” and “dog” are not mutually exclusive.
We could instead combine the datasets using a multi-label model, which does not assume mutual exclusion. However, this approach ignores all the structure we do know about the data, for example that all of the COCO classes are mutually exclusive.
Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate [12]. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier”, which is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc. Most approaches to classification assume a flat structure to the labels; for combining datasets, however, structure is exactly what we need.
WordNet is structured as a directed graph, not a tree, because language is complex. For example a dog is both a type of canine and a type of domestic animal which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet.
To build this tree we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case “physical object”. Many synsets only have one path through the graph so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible. So if a concept has two paths to the root and one path would add three edges to our tree and the other would only add one edge, we choose the shorter path.
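The two-pass construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name, the `paths_by_concept` input format (each concept mapped to its candidate root-to-concept paths), and the toy paths are all assumptions, and paths that would give an existing node a second parent are simply skipped.

```python
def build_word_tree(paths_by_concept):
    """Greedily build a tree (child -> parent map) from candidate paths."""
    parent = {}  # the root never appears as a key

    def new_edges(path):
        # Edges this path would add, or None if it conflicts with the tree.
        edges = []
        for p, c in zip(path, path[1:]):
            if c in parent:
                if parent[c] != p:
                    return None
            else:
                edges.append((p, c))
        return edges

    # Pass 1: concepts with a single path go in unconditionally.
    remaining = {}
    for concept, paths in paths_by_concept.items():
        if len(paths) == 1:
            for p, c in new_edges(paths[0]):
                parent[c] = p
        else:
            remaining[concept] = paths

    # Pass 2: repeatedly add the path that grows the tree the least.
    while remaining:
        best = None
        for concept, paths in remaining.items():
            for path in paths:
                edges = new_edges(path)
                if edges is not None and (best is None or len(edges) < len(best[1])):
                    best = (concept, edges)
        if best is None:
            raise ValueError("no remaining path fits the tree")
        concept, edges = best
        for p, c in edges:
            parent[c] = p
        del remaining[concept]
    return parent


# Toy paths (not real WordNet data): "dog" has two routes to the root.
paths = {
    "cat": [["physical object", "animal", "feline", "cat"]],
    "wolf": [["physical object", "animal", "canine", "wolf"]],
    "dog": [
        ["physical object", "animal", "canine", "dog"],        # adds 1 edge
        ["physical object", "domestic animal", "pet", "dog"],  # adds 3 edges
    ],
}
tree = build_word_tree(paths)
print(tree["dog"])  # canine
```

In the toy input the first route for “dog” adds one edge and the second adds three, so the shorter addition wins, mirroring the example in the text.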
The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree we predict conditional probabilities at every node for the probability of each hyponym of that synset given that synset. For example, at the terrier node we predict:
$$
\begin{gathered}
Pr(\text{Norfolk terrier} \mid \text{terrier}) \\
Pr(\text{Yorkshire terrier} \mid \text{terrier}) \\
Pr(\text{Bedlington terrier} \mid \text{terrier}) \\
\ldots
\end{gathered}
$$
If we want to compute the absolute probability for a particular node we simply follow the path through the tree to the root node and multiply the conditional probabilities along it. So if we want to know if a picture is of a Norfolk terrier we compute:
$$
\begin{aligned}
Pr(\text{Norfolk terrier}) = {} & Pr(\text{Norfolk terrier} \mid \text{terrier}) \\
& * Pr(\text{terrier} \mid \text{hunting dog}) \\
& * \ldots \\
& * Pr(\text{mammal} \mid \text{animal}) \\
& * Pr(\text{animal} \mid \text{physical object})
\end{aligned}
$$
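As a small illustration of this product, the sketch below walks from a node up to the root and multiplies the conditional probabilities it meets. The hierarchy is shortened and the probability values are made up for the example; `absolute_probability`, `parent` and `conditional` are hypothetical names.

```python
def absolute_probability(node, parent, conditional, root="physical object"):
    """Multiply Pr(node | parent) along the path from `node` to the root."""
    prob = 1.0  # Pr(root) is taken as 1
    while node != root:
        prob *= conditional[node]  # Pr(node | parent[node])
        node = parent[node]
    return prob


# Shortened toy hierarchy with invented probabilities.
parent = {
    "Norfolk terrier": "terrier",
    "terrier": "hunting dog",
    "hunting dog": "dog",
    "dog": "animal",
    "animal": "physical object",
}
conditional = {
    "Norfolk terrier": 0.5,
    "terrier": 0.8,
    "hunting dog": 0.5,
    "dog": 0.5,
    "animal": 1.0,
}
p = absolute_probability("Norfolk terrier", parent, conditional)
print(round(p, 6))  # 0.1
```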
For classification purposes we assume that the image contains an object: $Pr(\text{physical object}) = 1$.
To validate this approach we train the Darknet-19 model on WordTree built using the 1000 class ImageNet. To build WordTree1k we add in all of the intermediate nodes, which expands the label space from 1000 to 1369. During training we propagate ground truth labels up the tree so that if an image is labelled as a “Norfolk terrier” it also gets labelled as a “dog” and a “mammal”, etc. To compute the conditional probabilities our model predicts a vector of 1369 values and we compute the softmax over all synsets that are hyponyms of the same concept, see Figure 5.

Figure 5: Prediction on ImageNet vs WordTree. Most ImageNet models use one large softmax to predict a probability distribution. Using WordTree we perform multiple softmax operations over co-hyponyms.
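The two ingredients described above, a separate softmax per group of co-hyponyms and propagation of a ground truth label to its ancestors, might look roughly like this. It is a pure-Python sketch on a six-node toy tree; the function names, the index layout and the logits are assumptions made for illustration.

```python
import math


def grouped_softmax(logits, groups):
    """Apply an independent softmax within each group of sibling indices."""
    probs = [0.0] * len(logits)
    for group in groups:
        peak = max(logits[i] for i in group)  # subtract max for stability
        exps = {i: math.exp(logits[i] - peak) for i in group}
        total = sum(exps.values())
        for i in group:
            probs[i] = exps[i] / total
    return probs


def propagate_label(label, parent):
    """Return the label followed by its ancestors, excluding the root."""
    labels = []
    node = label
    while node in parent:  # the root has no parent entry
        labels.append(node)
        node = parent[node]
    return labels


# Toy tree: root -> {dog, cat}, dog -> {terrier, poodle}, cat -> {tabby, siamese}
names = ["dog", "cat", "terrier", "poodle", "tabby", "siamese"]
groups = [[0, 1], [2, 3], [4, 5]]  # one group per set of siblings
parent = {"dog": "physical object", "cat": "physical object",
          "terrier": "dog", "poodle": "dog", "tabby": "cat", "siamese": "cat"}

probs = grouped_softmax([2.0, 0.0, 1.0, 1.0, 0.0, 3.0], groups)
print(propagate_label("terrier", parent))  # ['terrier', 'dog']
```

Each group of siblings sums to one on its own, so the output vector holds conditional probabilities rather than a single distribution over all labels.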
Using the same training parameters as before, our hierarchical Darknet-19 achieves $71.9\%$ top-1 accuracy and $90.4\%$ top-5 accuracy. Despite adding 369 additional concepts and having our network predict a tree structure our accuracy only drops marginally. Performing classification in this manner also has some benefits. Performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain what type of dog it is, it will still predict “dog” with high confidence but have lower confidences spread out among the hyponyms.
This formulation also works for detection. Now, instead of assuming every image has an object, we use YOLOv2's objectness predictor to give us the value of $Pr(\text{physical object})$. The detector predicts a bounding box and the tree of probabilities. We traverse the tree down, taking the highest confidence path at every split until we reach some threshold and we predict that object class.
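One way to read this traversal is sketched below. The stopping rule is an assumption, since the text only says "until we reach some threshold": here the descent stops as soon as the absolute probability of the best child would fall below the threshold. The tree, the probabilities and the name `predict_class` are all invented for the example.

```python
def predict_class(children, conditional, objectness, threshold,
                  root="physical object"):
    """Descend the tree along the most confident child at every split."""
    node, prob = root, objectness  # objectness plays the role of Pr(root)
    while children.get(node):
        best = max(children[node], key=lambda c: conditional[c])
        if prob * conditional[best] < threshold:  # assumed stopping rule
            break
        node, prob = best, prob * conditional[best]
    return node, prob


# Toy tree with made-up conditional probabilities.
children = {
    "physical object": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["terrier", "poodle"],
}
conditional = {"animal": 0.9, "vehicle": 0.1, "dog": 0.8, "cat": 0.2,
               "terrier": 0.5, "poodle": 0.5}

label, confidence = predict_class(children, conditional,
                                  objectness=0.9, threshold=0.5)
print(label)  # dog
```

With these numbers the detector is confident about “dog” but not about the breed, so it stops at the “dog” node instead of guessing a hyponym.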
Dataset combination with WordTree. We can use WordTree to combine multiple datasets together in a sensible fashion. We simply map the categories in the datasets to synsets in the tree. Figure 6 shows an example of using WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse so we can use this technique with most datasets.
Joint classification and detection. Now that we can combine datasets using WordTree we can train our joint model on classification and detection. We want to train an extremely large scale detector so we create our combined dataset using the COCO detection dataset and the top 9000 classes from the full ImageNet release. We also need to evaluate our method so we add in any classes from the ImageNet detection challenge that were not already included. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset so we balance the dataset by oversampling COCO so that ImageNet is only larger by a factor of 4:1.
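The balancing step reduces to choosing how often to repeat the detection set. The sketch below uses hypothetical dataset sizes; the paper does not state the image counts involved, and `oversample_factor` is an illustrative name.

```python
def oversample_factor(num_classification: int, num_detection: int,
                      target_ratio: float = 4.0) -> float:
    """How many times to repeat the detection set so that classification
    images outnumber detection images by only `target_ratio` to 1."""
    return num_classification / (target_ratio * num_detection)


# Hypothetical sizes chosen only to make the arithmetic easy to follow.
factor = oversample_factor(num_classification=8_000_000, num_detection=100_000)
print(factor)  # 20.0
```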
Using this dataset we train YOLO9000. We use the base YOLOv2 architecture but only 3 priors instead of 5 to limit the output size. When our network sees a detection image we backpropagate loss as normal. For classification loss, we only backpropagate loss at or above the corresponding level of the label. For example, if the label is “dog” we do not assign any error to predictions further down in the tree, “German Shepherd” versus “Golden Retriever”, because we do not have that information.
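Restricting the loss to the label's level and above can be illustrated with a negative log-likelihood that only visits the label and its ancestors. This is a sketch under the assumption that the per-node loss is a log loss on the conditional probability; the tree, the values and the name `hierarchical_loss` are made up.

```python
import math


def hierarchical_loss(label, parent, conditional):
    """Negative log-likelihood over the label and its ancestors only.

    Nodes below the label never enter the sum, so they receive no error.
    """
    loss = 0.0
    node = label
    while node in parent:  # the root has no parent entry
        loss -= math.log(conditional[node])
        node = parent[node]
    return loss


parent = {"animal": "physical object", "dog": "animal",
          "German Shepherd": "dog", "Golden Retriever": "dog"}
conditional = {"animal": 0.9, "dog": 0.8,
               "German Shepherd": 0.3, "Golden Retriever": 0.7}

loss = hierarchical_loss("dog", parent, conditional)
print(loss)
```

Changing the breed probabilities leaves the loss for a “dog” label unchanged, which is the behaviour the paragraph describes.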
When it sees a classification image we only backpropagate classification loss. To do this we simply find the bounding box that predicts the highest probability for that class and we compute the loss on just its predicted tree. We also assume that the predicted box overlaps what would be the ground truth label by at least 0.3 IOU and we backpropagate objectness loss based on this assumption.
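Selecting the box that receives the classification loss is an argmax over the predicted boxes. The sketch below assumes each box already carries a probability for the labelled class; the data layout and the name `select_box_for_label` are hypothetical, and the objectness term based on the 0.3 IOU assumption is not modelled.

```python
def select_box_for_label(box_class_probs, label_index):
    """Index of the predicted box with the highest probability for the label."""
    return max(range(len(box_class_probs)),
               key=lambda b: box_class_probs[b][label_index])


# Three predicted boxes, two classes (made-up probabilities).
box_class_probs = [
    [0.10, 0.20],
    [0.05, 0.70],
    [0.30, 0.40],
]
chosen = select_box_for_label(box_class_probs, label_index=1)
print(chosen)  # 1
```

Only the chosen box's predicted tree would then contribute to the classification loss.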
Using this joint training, YOLO9000 learns to find objects in images using the detection data in COCO and it learns to classify a wide variety of these objects using data from ImageNet.
We evaluate YOLO9000 on the ImageNet detection task. The detection task for ImageNet shares only 44 object categories with COCO, which means that YOLO9000 has only seen classification data for the majority of the test images, not detection data. YOLO9000 gets 19.7 mAP overall, with 16.0 mAP on the disjoint 156 object classes that it has never seen any labelled detection data for. This mAP is higher than results achieved by DPM, but YOLO9000 is trained on different datasets with only partial supervision [4]. It is also simultaneously detecting 9000 other object categories, all in real-time.
When we analyze YOLO9000’s performance on ImageNet we see it learns new species of animals well but struggles with learning categories like clothing and equipment. New animals are easier to learn because the objectness predictions generalize well from the animals in COCO. Conversely, COCO does not have bounding box labels for any type of clothing, only for person, so YOLO9000 struggles to model categories like “sunglasses” or “swimming trunks”.
5. Conclusion
We introduce YOLOv2 and YOLO9000, real-time detection systems. YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection datasets. Furthermore, it can be run at a variety of image sizes to provide a smooth tradeoff between speed and accuracy.
YOLO9000 is a real-time framework for detecting more than 9000 object categories by jointly optimizing detection and classification. We use WordTree to combine data from various sources and our joint optimization technique to train simultaneously on ImageNet and COCO. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification.
Many of our techniques generalize outside of object detection. Our WordTree representation of ImageNet offers a richer, more detailed output space for image classification. Dataset combination using hierarchical classification would be useful in the classification and segmentation domains. Training techniques like multi-scale training could provide benefit across a variety of visual tasks.
For future work we hope to use similar techniques for weakly supervised image segmentation. We also plan to improve our detection results using more powerful matching strategies for assigning weak labels to classification data during training. Computer vision is blessed with an enormous amount of labelled data. We will continue looking for ways to bring different sources and structures of data together to make stronger models of the visual world.
References
[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143, 2015.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[3] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[4] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/pff/latent-release4/.
[5] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[9] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
[12] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4):235–244, 1990.
[13] J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[18] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[20] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.