
[Boostcamp Lv2][P stage] Week9-Day38. 2 stage detector

by seowit 2021. 9. 30.

📜 Lecture Notes


2 Stage Detector

 

๊ฐ์ฒด๊ฐ€ ์žˆ์„ ๋ฒ•ํ•œ ์œ„์น˜๋ฅผ ํŠน์ •์ง“๊ณ , ํ•ด๋‹น ๊ฐ์ฒด๊ฐ€ ๋ฌด์—‡์ธ์ง€ ๋ถ„๋ฅ˜ํ•˜๋Š” 2๊ฐ€์ง€ ๋‹จ๊ณ„๋ฅผ ๊ฑฐ์น˜๋Š” ๋ชจ๋ธ์„ 2 stage detector๋ผ๊ณ  ํ•œ๋‹ค.

 

1. R-CNN

✔ Pipeline

  1. Input Image
  2. Extract region proposals: extract about 2,000 RoIs with Selective Search
  3. Warping: resize every RoI to the same fixed size
    • Why warp? The FC layers take a fixed input size, so all regions must be resized to match it.
  4. Compute CNN features: extract a 4096-dim (64×64) feature vector per region (2000×4096); it carries semantic information, and a pretrained AlexNet is used as the backbone (an FC layer is added at the end and fine-tuned as needed)
  5. Classify regions (see the sketch below)
    1. Feed the feature vectors into SVMs for classification: Input: 2000×4096, Output: class (C+1) + confidence score
    2. Predict the bounding box by regression on the feature vector: it learns the center coordinates and the width/height.
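A minimal sketch of steps 3–4, assuming torchvision's pretrained AlexNet as the backbone and hypothetical roi_crops (H×W×3 uint8 arrays) already produced by Selective Search; ImageNet normalization is omitted for brevity, and the SVM / regressor stages are only indicated in the comments.

```python
import torch
import torchvision.transforms as T
from torchvision.models import alexnet

# Pretrained AlexNet with its last FC layer removed -> 4096-dim feature vectors.
backbone = alexnet(pretrained=True)
backbone.classifier = backbone.classifier[:-1]
backbone.eval()

# Warp every RoI crop (assumed H x W x 3 uint8 numpy arrays) to a fixed 227x227 input.
warp = T.Compose([T.ToPILImage(), T.Resize((227, 227)), T.ToTensor()])

def extract_region_features(roi_crops):
    """Return one 4096-dim feature vector per warped region: shape (N, 4096)."""
    batch = torch.stack([warp(crop) for crop in roi_crops])
    with torch.no_grad():
        return backbone(batch)

# The resulting ~2000 x 4096 matrix is what R-CNN feeds to the per-class SVMs
# and the bounding-box regressor, both of which are trained separately.
```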

 

✔ Selective Search

The method used to extract candidate regions; it yields roughly 2,000 RoIs per image.
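A quick way to try Selective Search, assuming the opencv-contrib-python package is installed; "sample.jpg" is just a placeholder path.

```python
import cv2

img = cv2.imread("sample.jpg")                     # placeholder image path

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()                   # or switchToSelectiveSearchQuality()

rects = ss.process()                               # (x, y, w, h) region proposals
proposals = rects[:2000]                           # keep about 2,000 RoIs, as in R-CNN
print(len(rects), "proposals found,", len(proposals), "kept")
```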

 

✔ Training

  • AlexNet
    • Domain-specific fine-tuning
    • Dataset construction (for fine-tuning; see the sampling sketch after this list)
      • IoU > 0.5: an RoI whose IoU with a GT box is above 0.5 is a positive sample (so one GT can match more than one RoI)
      • IoU < 0.5: an RoI whose IoU is below 0.5 is a negative sample
      • 32 positive samples, 96 negative samples per batch
  • Linear SVM
    • Dataset construction
      • Ground-truth boxes: positive samples
      • IoU < 0.3: negative samples
      • 32 positive samples, 96 negative samples per batch
  • Hard negative mining
    • Hard negative = false positive
    • Samples that are hard to distinguish from the background are forced into the next batch as negative samples
  • BBox Regressor
    • Dataset construction
      • IoU > 0.6: positive samples
      • No negative samples (background has no bbox target)
    • Loss function
      • MSE loss: learns how much to shift the center point and rescale the width and height
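A minimal NumPy sketch of the ideas above, assuming hypothetical rois and gts arrays in (x1, y1, x2, y2) format: RoIs are labeled positive/negative by IoU, 32/96 are sampled per batch, and the regression targets encode how to move the center and rescale the width/height.

```python
import numpy as np

def iou_matrix(rois, gts):
    """Pairwise IoU between (N, 4) RoIs and (M, 4) GT boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(rois[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(rois[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(rois[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(rois[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_r = (rois[:, 2] - rois[:, 0]) * (rois[:, 3] - rois[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_r[:, None] + area_g[None, :] - inter)

def sample_finetune_batch(rois, gts, n_pos=32, n_neg=96, thr=0.5):
    """IoU above thr with any GT -> positive, otherwise negative; sample 32 / 96 per batch."""
    best_iou = iou_matrix(rois, gts).max(axis=1)
    pos = np.where(best_iou > thr)[0]
    neg = np.where(best_iou <= thr)[0]
    pos = pos[np.random.permutation(len(pos))[:n_pos]]
    neg = neg[np.random.permutation(len(neg))[:n_neg]]
    return pos, neg

def bbox_regression_targets(rois, matched_gts):
    """R-CNN style targets: shift of the center and log-scale change of width/height."""
    pw, ph = rois[:, 2] - rois[:, 0], rois[:, 3] - rois[:, 1]
    px, py = rois[:, 0] + 0.5 * pw, rois[:, 1] + 0.5 * ph
    gw, gh = matched_gts[:, 2] - matched_gts[:, 0], matched_gts[:, 3] - matched_gts[:, 1]
    gx, gy = matched_gts[:, 0] + 0.5 * gw, matched_gts[:, 1] + 0.5 * gh
    return np.stack([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)], axis=1)
```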

✔ Shortcomings

  • Each of the 2,000 regions is passed through the CNN separately → 2,000 CNN forward passes, which is very slow
  • Forced warping can degrade performance
  • The CNN, SVM classifier, and bounding box regressor are trained separately
  • Not end-to-end

 

2. SPPNet

✔ Overall Architecture & Comparison with R-CNN

[Figure: R-CNN vs. SPPNet architecture]

  • R-CNN: performs 2,000 ConvNet passes, one per RoI → SPPNet: runs the ConvNet once and extracts the 2,000 RoIs from the resulting feature map
  • R-CNN: warps every RoI to a fixed size → SPPNet: matches the size with a Spatial Pyramid Pooling layer

 

✔ Spatial Pyramid Pooling

The process of converting RoIs of various sizes into a fixed-length feature vector (a minimal PyTorch sketch follows the steps).

  1. Choose the target grid sizes for the pooled feature map
  2. Pool each RoI down to those sizes
  3. Flatten the differently-sized pooled outputs and concatenate them
  4. Feed the result into the FC layers
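A minimal sketch of an SPP layer, assuming PyTorch and pyramid levels 1×1, 2×2, 4×4: adaptive max pooling maps any input size to each fixed grid, and the flattened results are concatenated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pools a variable-size feature map to several fixed grids and concatenates them."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                                     # x: (N, C, H, W), H/W can vary
        outs = []
        for k in self.levels:
            pooled = F.adaptive_max_pool2d(x, output_size=k)  # (N, C, k, k)
            outs.append(pooled.flatten(start_dim=1))          # (N, C*k*k)
        return torch.cat(outs, dim=1)                         # fixed length: C * sum(k*k)

# Two RoIs of different spatial size map to the same vector length.
spp = SpatialPyramidPooling()
print(spp(torch.randn(1, 256, 13, 9)).shape)    # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 7, 20)).shape)    # torch.Size([1, 5376]) = 256 * (1 + 4 + 16)
```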

 

✔ Shortcomings

  • The 2,000 CNN passes and the forced warping of R-CNN are solved by the single ConvNet pass and the SPP layer, but:
  • The CNN, SVM classifier, and bounding box regressor are still trained separately
  • Still not end-to-end

 

3. Fast R-CNN

Unlike R-CNN, everything except Selective Search is end-to-end: the conv layers, the softmax classifier, and the bbox regressor are all part of a single neural network.

✔ Pipeline

  1. Run the image through a CNN (once only) to extract features: VGG16 is used
  2. Compute the RoIs on the feature map via RoI Projection
  3. Extract fixed-size features via RoI Pooling: the step that produces a fixed-length vector
  4. Fully connected layers, followed by a softmax classifier and a bounding box regressor

 

✔ RoI Projection

Selective Search is still run on the original image, and the resulting 2,000 RoIs are projected onto the feature map.

If the original image and the conv feature map have the same size, the RoIs are used as-is; otherwise each RoI is rescaled by the ratio between the original image and the feature map. For example, if a 400×400 image becomes a 40×40 feature map (a factor of 10), a 300×300 RoI is divided by 10 and projected as 30×30 (see the sketch below).
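A tiny sketch of that rescaling, with hypothetical image and feature-map sizes matching the example above.

```python
def project_roi(roi, img_size, fmap_size):
    """Scale an (x1, y1, x2, y2) RoI from image coordinates to feature-map coordinates."""
    sy = fmap_size[0] / img_size[0]     # height ratio
    sx = fmap_size[1] / img_size[1]     # width ratio
    x1, y1, x2, y2 = roi
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# 400x400 image -> 40x40 feature map: a 300x300 RoI projects to 30x30.
print(project_roi((0, 0, 300, 300), img_size=(400, 400), fmap_size=(40, 40)))
# (0.0, 0.0, 30.0, 30.0)
```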

 

✔ RoI Pooling

Almost the same as Spatial Pyramid Pooling, except that SPP uses several pyramid sizes such as 1×1, 2×2, 4×4, 8×8, whereas RoI pooling uses a single target size of 7×7.
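torchvision ships an RoI pooling op; a short sketch with made-up numbers, where spatial_scale performs the RoI projection (VGG16's stride of 16) and output_size=7 gives the single 7×7 target grid.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 40, 40)       # e.g. VGG16 feature map of a 640x640 image (stride 16)

# RoIs in original-image coordinates, given as (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0., 32., 48., 320., 480.],
                     [0., 100., 100., 400., 300.]])

pooled = roi_pool(feat, rois, output_size=7, spatial_scale=1 / 16)
print(pooled.shape)                      # torch.Size([2, 512, 7, 7])
```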

✔ Training

  • Uses a multi-task loss (see the loss sketch after this list)
    • (classification loss + bounding box regression loss)
  • Loss function
    • Classification: cross entropy
    • BB regressor: smooth L1 (less sensitive to outliers)
  • Dataset construction
    • IoU > 0.5: positive samples
    • 0.1 < IoU < 0.5: negative samples
    • 25% positive samples, 75% negative samples per batch
  • Hierarchical sampling
    • R-CNN stores and uses all RoIs from every image,
    • so one batch contains RoIs from many different images
    • Fast R-CNN fills a batch with RoIs from a single image,
    • so computation and memory can be shared within the batch
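A minimal sketch of the multi-task loss, assuming hypothetical tensors for the sampled RoIs and simplifying to class-agnostic box deltas (the paper regresses one box per class); cross entropy handles classification and smooth L1 handles the box offsets, applied only to positive samples.

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_logits, bbox_deltas, labels, reg_targets, lam=1.0):
    """Multi-task loss = cross entropy (classification) + smooth L1 (box regression).

    cls_logits:  (N, C+1) class scores, index 0 = background
    bbox_deltas: (N, 4)   predicted (dx, dy, dw, dh) per sampled RoI
    labels:      (N,)     class index per RoI, 0 for negative samples
    reg_targets: (N, 4)   regression targets (only meaningful for positives)
    """
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels > 0                          # box loss only for positive (foreground) RoIs
    if pos.any():
        reg_loss = F.smooth_l1_loss(bbox_deltas[pos], reg_targets[pos])
    else:
        reg_loss = bbox_deltas.sum() * 0.0    # keeps the graph valid when a batch has no positives
    return cls_loss + lam * reg_loss
```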

✔ Shortcomings

  • Region proposals still come from Selective Search, which runs on the CPU and is not learnable
  • So the model is still not fully end-to-end

 

✔ Must read

Fast R-CNN paper review blog post

 

4. Faster R-CNN

 

✔ Fast R-CNN vs. Faster R-CNN

Faster R-CNN removes the step that extracted regions with Selective Search and instead introduces a deep-learning-based network called the RPN, giving an end-to-end model structure. (Selective Search runs on the CPU and is not a learnable algorithm.)

✔ Pipeline

  1. Run the image through a CNN (used only once) to extract feature maps
  2. Compute RoIs with the RPN
  3. NMS (Non-Maximum Suppression)

 

✔ RPN - Region Proposal Network

Anchor boxes are used as the replacement for Selective Search; every cell of the feature map is assigned anchor boxes of various scales.

With 9 anchor boxes per cell and a 64×64 feature map, there are 64×64×9 (about 36K) candidate RoIs in total.

Since 36K is far larger than the ~2K proposals of Selective Search, the network has to predict whether each anchor box contains an object and fine-tune the position and size of the box.

So for every cell, the RPN decides whether each of its N anchor boxes contains an object and, if it does, shifts the center and adjusts the size (an anchor-generation sketch follows).

 

  1. Takes the feature map from the CNN as input (H: height, W: width, C: channels)
  2. Applies a 3×3 conv to produce an intermediate layer (see the RPN head sketch after this list)
  3. Applies a 1×1 conv for binary classification: for each pixel, the channels store whether each of the 9 anchor boxes contains an object
    • 2 (object or not) × 9 (number of anchors) channels → 18 channels
    • Why use 18 channels (two logits per anchor for object / not object) instead of 9 channels with a sigmoid: with a single sigmoid score per anchor, you would have to pick a threshold to decide whether an object is present and tune training to that threshold; with two scores per anchor you simply compare the "object" and "not object" scores and take the larger one, so no threshold needs to be chosen.
  4. Applies a 1×1 conv for bounding box regression: if step 3 says an anchor contains an object, step 4 says how to fine-tune its box
    • 4 (bounding box offsets) × 9 (number of anchors) channels → 36 channels
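A minimal sketch of that head in PyTorch, assuming a 512-channel backbone feature map and 9 anchors per cell: one shared 3×3 conv, then two 1×1 convs producing the 18 objectness channels and the 36 box-delta channels.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv intermediate layer, then 1x1 convs for objectness and box deltas."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)   # object / not object per anchor
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)   # dx, dy, dw, dh per anchor

    def forward(self, feat):                      # feat: (N, C, H, W)
        t = torch.relu(self.conv(feat))
        return self.cls(t), self.reg(t)

head = RPNHead()
cls_logits, bbox_deltas = head(torch.randn(1, 512, 64, 64))
print(cls_logits.shape, bbox_deltas.shape)        # (1, 18, 64, 64) and (1, 36, 64, 64)
```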

 

In the NMS step of the Faster R-CNN RPN, suppression is performed using the score that says whether an object is present or not (a short torchvision example follows).
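A short example with torchvision's built-in NMS op and made-up boxes and scores; boxes are (x1, y1, x2, y2) and the scores stand in for the RPN objectness scores.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 105., 105.],     # heavily overlaps the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])            # objectness scores from the RPN

keep = nms(boxes, scores, iou_threshold=0.7)      # indices of the boxes that survive
print(keep)                                       # tensor([0, 2])
```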
