R-CNN, Fast R-CNN, Faster R-CNN, YOLO: A Summary of Object Detection Algorithms

References

The original papers for the algorithms covered in this article:

  1. https://arxiv.org/pdf/1311.2524.pdf
  2. https://arxiv.org/pdf/1504.08083.pdf
  3. https://arxiv.org/pdf/1506.01497.pdf
  4. https://arxiv.org/pdf/1506.02640v5.pdf

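I. Introduction

  1. Detection algorithms differ from classification algorithms: an object detection algorithm draws a bounding box around each object of interest to locate it within the image, while a classification algorithm predicts a single label for the whole image.

  2. We cannot solve object detection with a standard convolutional network whose last layer is fully connected, because the length of the output layer is variable rather than constant: the number of objects in an image is not known in advance.

The most direct workaround is to take different regions of interest from the image and use a CNN to classify whether an object is present in each region.
However, the objects to detect may appear at different locations in the image and with different aspect ratios, so this approach has to consider a very large number of regions and needs a very large amount of computation.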
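To get a feel for how quickly the number of regions grows, here is a rough sketch that counts the crops a brute-force sliding-window detector would have to classify. The function name, window sizes, aspect ratios, and stride are illustrative assumptions, not values from any of the papers.

```python
def count_windows(img_w, img_h, sizes, aspect_ratios, stride):
    """Count the sliding-window crops a brute-force detector would classify."""
    total = 0
    for size in sizes:
        for ratio in aspect_ratios:
            # ratio is width / height; the window area stays close to size * size.
            win_w = round(size * ratio ** 0.5)
            win_h = round(size / ratio ** 0.5)
            if win_w > img_w or win_h > img_h:
                continue  # this window does not fit inside the image
            steps_x = (img_w - win_w) // stride + 1
            steps_y = (img_h - win_h) // stride + 1
            total += steps_x * steps_y
    return total


# Hypothetical setting: a 640x480 image, three sizes, three aspect ratios.
n_windows = count_windows(640, 480, sizes=[64, 128, 256],
                          aspect_ratios=[0.5, 1.0, 2.0], stride=8)
print(n_windows)
```

Even this coarse setting yields far more than 2000 crops, and each crop would need its own CNN forward pass.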

R-CNN, Fast R-CNN, Faster R-CNN, and YOLO were therefore developed to find objects both quickly and accurately.

II. R-CNN

To avoid selecting a huge number of regions, Ross Girshick et al. proposed using selective search to extract just 2000 regions from the image, which are called "region proposals".


With this approach we no longer need to classify a huge number of regions; we only need to process 2000 of them. These 2000 regions are found with the selective search algorithm, as follows:

Selective search:

  1. Generate an initial sub-segmentation, which yields many candidate regions.
  2. Use a greedy algorithm to recursively combine similar regions into larger ones.
  3. Use the generated regions to produce the final candidate region proposals.
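The greedy merging in step 2 can be sketched with a toy example. This is a simplification under stated assumptions: regions are axis-aligned boxes with a mean intensity, and similarity is just the difference in mean intensity, whereas the real algorithm combines several cues. The names `union_box`, `touches`, and `greedy_merge` are hypothetical.

```python
from itertools import combinations


def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def touches(a, b):
    """True if two boxes overlap or share an edge or corner."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def greedy_merge(regions):
    """regions: list of (box, mean_intensity, pixel_count).

    Repeatedly merges the most similar pair of adjacent regions and
    returns every box seen along the way as a candidate proposal.
    """
    regions = list(regions)
    proposals = [box for box, _, _ in regions]
    while len(regions) > 1:
        pairs = [(i, j) for i, j in combinations(range(len(regions)), 2)
                 if touches(regions[i][0], regions[j][0])]
        if not pairs:
            break  # no adjacent regions left to merge
        # Toy similarity: the smallest difference in mean intensity wins.
        i, j = min(pairs, key=lambda p: abs(regions[p[0]][1] - regions[p[1]][1]))
        (box_a, mean_a, n_a), (box_b, mean_b, n_b) = regions[i], regions[j]
        merged = (union_box(box_a, box_b),
                  (mean_a * n_a + mean_b * n_b) / (n_a + n_b),
                  n_a + n_b)
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged[0])
    return proposals


# Two dark neighbours and one bright region below them (made-up values).
initial = [((0, 0, 4, 4), 10.0, 16), ((4, 0, 8, 4), 12.0, 16), ((0, 4, 8, 8), 200.0, 32)]
proposals = greedy_merge(initial)
print(proposals)
```

The two dark regions merge first because they are the most similar, so their union becomes a proposal before the whole image does.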

The linked article covers the selective search algorithm in more detail.

R-CNN steps:

  1. The 2000 candidate region proposals are each warped into a square and fed into a convolutional neural network (CNN), which outputs a 4096-dimensional feature vector for each one.
    The CNN acts as a feature extractor: its dense output layer holds the features extracted from the image region.
  2. The extracted features are fed into an SVM, which classifies whether an object is present within that candidate region proposal.
  3. Besides predicting whether an object is present in the region proposal, the algorithm also predicts four offset values that make the bounding box more precise.
    For example, the algorithm may predict that a region proposal contains a person, but the proposal's box could cut that person's face in half. The offset values are used to adjust the bounding box of the region proposal.
Figure: R-CNN
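As a sketch of how the four offset values in step 3 might adjust a proposal, the snippet below uses the center-and-size parametrization described in the R-CNN paper: the center shifts by a fraction of the box size, and the width and height scale exponentially. The function name `apply_offsets` and the sample numbers are assumptions for illustration.

```python
import math


def apply_offsets(box, offsets):
    """Adjust a proposal box (x1, y1, x2, y2) with offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = offsets
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center by a fraction of the proposal's width and height.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Scale the width and height; exp keeps both strictly positive.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Hypothetical proposal that cuts off the top of a person: grow the box by 20% in height.
adjusted = apply_offsets((100, 50, 200, 250), (0.0, 0.0, 0.0, math.log(1.2)))
print(adjusted)
```

With all offsets at zero the proposal is returned unchanged, which is why offsets are a convenient regression target: the regressor only has to learn small corrections.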

Problems with R-CNN:

  • Training the network still takes a large amount of time, because 2000 region proposals have to be classified for every image.
  • It cannot produce results in real time at test time either: it takes about 47 seconds per test image.
  • Selective search is a fixed algorithm, so no learning happens at that stage. This can lead to bad candidate region proposals.

III. Fast R-CNN

Improvements in Fast R-CNN:
The author of the R-CNN paper addressed some of its drawbacks to build a faster object detection algorithm, called Fast R-CNN. The approach is similar to R-CNN.

  1. Instead of feeding the region proposals to the CNN, feed the whole input image to the CNN to generate a convolutional feature map.
  2. Identify the region proposals on the convolutional feature map. The proposals themselves still come from selective search and are mapped onto the feature map.
  3. Use a RoI pooling layer to reshape each region proposal to a fixed size so that it can be fed into a fully connected layer.
  4. From the resulting RoI feature vector, predict (1) the class of the proposed region with a softmax layer and (2) the offset values for the bounding box with a parallel regression output.
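The RoI pooling in step 3 can be illustrated with a small pure-Python sketch: it divides a region of the feature map into a fixed grid of bins and keeps the maximum of each bin, so regions of any size produce the same output shape. It assumes a single-channel feature map stored as a nested list and a RoI already expressed in feature-map cells; `ceil_div` and `roi_max_pool` are hypothetical names.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the RoI (x1, y1, x2, y2), end-exclusive, into out_h x out_w bins."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Floor for the start and ceiling for the end keeps every bin non-empty.
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4x4 toy feature map whose cell values are 0..15 in row-major order.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # 4x4 region -> 2x2 output
print(roi_max_pool(fmap, (1, 0, 4, 3), 2, 2))  # 3x3 region -> 2x2 output
```

Both calls return a 2x2 grid even though the regions differ in size, which is what lets a fixed-size fully connected layer follow.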

Why Fast R-CNN is faster:

  • The feature map is computed only once per image, rather than 2000 times as before. You do not have to feed 2000 region proposals through the convolutional neural network for every image; the convolution operation runs once per image and produces a single feature map.

Fast R-CNN is faster:
In the reported timings, Fast R-CNN is significantly faster than R-CNN in both training and testing. At test time, however, Fast R-CNN is much slower when the time to generate region proposals is included than when it is excluded. Region proposal generation therefore becomes the bottleneck of the Fast R-CNN algorithm.

IV. Faster R-CNN

Drawback of the two algorithms above:
Selective search is time-consuming. Both R-CNN and Fast R-CNN use selective search to find the region proposals, and this slow process limits the performance of the whole network.

Improvement in Faster R-CNN:
Region proposals are no longer found with selective search; a network finds them instead. Shaoqing Ren et al. came up with an object detection algorithm that eliminates selective search and lets the network learn the region proposals.

Faster R-CNN steps:

  1. Feed the input image into a CNN to produce a feature map: as in Fast R-CNN, a convolutional network takes the image as input and outputs a convolutional feature map.

  2. Predict the region proposals with a separate network: instead of running selective search to identify the region proposals, a separate network predicts them from the feature map.

  3. Proceed as in Fast R-CNN: the predicted region proposals are reshaped with a RoI pooling layer, and the result is used to classify the content of each proposed region and to predict the offset values for its bounding box.
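The summary above does not say how the separate network in step 2 represents its candidates. In the Faster R-CNN paper, the region proposal network scores a fixed set of reference boxes, called anchors, at every feature-map position, using 3 scales and 3 aspect ratios by default. The sketch below only generates those reference boxes for one position; the function name and the height-over-width ratio convention are assumptions.

```python
def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return one (x1, y1, x2, y2) anchor per scale/ratio pair, centered on a point."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # ratio is height / width; the area stays equal to scale * scale.
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))
for box in anchors[:3]:
    print(box)
```

In this scheme the network would predict, for each anchor, an objectness score and four offsets, so proposals are learned rather than produced by a fixed algorithm.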


Speed comparison:
Faster R-CNN is the fastest of the three R-CNN variants and can be used for near real-time object detection.


V. YOLO: You Only Look Once

Drawback of the previous algorithms:
They never look at the whole image when producing regions. All of the previous object detection algorithms use regions to localize the object within the image: the network does not look at the complete image, only at the parts of the image that have a high probability of containing an object.

The idea behind YOLO:
YOLO (You Only Look Once) is an object detection algorithm that differs considerably from the region-based algorithms above. In YOLO, a single convolutional network predicts (1) the bounding boxes and (2) the class probabilities for these boxes.

Figure: YOLO diagram

YOLO steps:

  1. Split the image into an SxS grid and take m bounding boxes within each grid cell.
  2. For each bounding box, the network outputs a class probability and offset values for the box.
  3. The bounding boxes whose class probability is above a threshold value are selected and used to locate the objects within the image.
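The three steps above can be sketched as a small decoding routine. This follows the simplified description in this article, with one class probability per box, rather than the exact output layout of the YOLO network. The tuple layout, the function name `select_boxes`, and the sample values are assumptions; the box center is taken relative to its grid cell and the box size relative to the whole image.

```python
def select_boxes(predictions, S, img_w, img_h, threshold):
    """Keep the boxes whose class probability is above the threshold.

    predictions[row][col] is a list of m tuples
    (class_prob, class_id, x, y, w, h), where x, y locate the box center
    inside the cell (0..1) and w, h are fractions of the image size.
    """
    cell_w, cell_h = img_w / S, img_h / S
    kept = []
    for row in range(S):
        for col in range(S):
            for prob, class_id, x, y, w, h in predictions[row][col]:
                if prob <= threshold:
                    continue  # discard low-probability boxes
                # Convert cell-relative offsets to image coordinates.
                cx = (col + x) * cell_w
                cy = (row + y) * cell_h
                bw, bh = w * img_w, h * img_h
                kept.append((class_id, prob,
                             (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)))
    return kept


# Hypothetical 2x2 grid with m = 1 box per cell on a 100x100 image.
preds = [
    [[(0.10, 0, 0.5, 0.5, 0.3, 0.3)], [(0.90, 3, 0.5, 0.5, 0.2, 0.4)]],
    [[(0.05, 1, 0.5, 0.5, 0.1, 0.1)], [(0.20, 2, 0.5, 0.5, 0.5, 0.5)]],
]
detections = select_boxes(preds, S=2, img_w=100, img_h=100, threshold=0.5)
print(detections)
```

Because every cell's boxes come out of a single forward pass, the whole image is processed once, which is where the speed advantage comes from.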

Strengths and weaknesses of YOLO:

  • Strength: speed. At 45 frames per second, YOLO is much faster than the other object detection algorithms discussed here.
  • Weakness: small objects are hard to detect. YOLO struggles with small objects within the image; for example, it may have difficulty detecting a flock of birds. This is due to the spatial constraints of the algorithm.