
[Convolutional Neural Networks] Object Detection :: seoftware

by seowit 2021. 9. 20.

📜 Lecture Notes

 

* These are my notes from studying Professor Andrew Ng's Convolutional Neural Networks course on Coursera.

* Image source: DeepLearning.AI


Detection Algorithms

 

Object Localization

In an example like the one above, the target label y is made up of [P_c, b_x, b_y, b_h, b_w, c1, c2, c3]. P_c is 1 if the image contains an object and 0 otherwise. The entries starting with b_ describe the bounding box (center coordinates, height, and width), and the entries starting with c indicate which class the object in the image belongs to.

If P_c (= y_1) is 1, the loss is computed between the prediction and the target for each component at the same position, and the results are summed. If P_c is 0, only the loss between y_1 and y_1_hat is computed, because the remaining components carry no meaning when there is no object.
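The rule above can be sketched in a few lines. This is only an illustration: the notes say "loss" without naming one, so squared error is assumed here for every component, and the function name `localization_loss` is made up for this example.

```python
def localization_loss(y_true, y_pred):
    """Loss for y = [p_c, b_x, b_y, b_h, b_w, c1, c2, c3].

    Squared error is an assumption made for this sketch.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction must have the same length")
    if y_true[0] == 1:
        # Object present: every component contributes to the loss.
        return sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    # No object: only p_c is compared; the other entries are "don't care".
    return (y_pred[0] - y_true[0]) ** 2


prediction = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.8, 0.0]
with_object = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
background = [0, 0, 0, 0, 0, 0, 0, 0]

print(localization_loss(with_object, prediction))  # sums all eight terms
print(localization_loss(background, prediction))   # uses only the first term
```

In practice a different loss could be used per component (for example a log loss for P_c), but the branching on P_c stays the same.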

 

Landmark Detection

 

Landmark detection is similar to predicting b_x, b_y, b_h, b_w for a bounding box: the network outputs the coordinates of points placed at specific locations to capture features. On a face, landmarks can be placed at the corners of the eyes, along the jawline, and so on. On a body, they can be placed at the neck, shoulders, elbows, knees, and so on.

The important point is that when labeling the training dataset, landmark1, landmark2, ..., landmark_n must be consistent across images. For example, if landmark1 is the tip of the nose on one face, then landmark1 must be the tip of the nose in every image.

Object Detection

The sliding-window approach crops the image with a window, classifies each crop with a ConvNet, and repeats the pass with progressively larger windows. Its drawback is a very high computational cost.
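To get a feel for why the cost grows so quickly, the sketch below just enumerates the crops. The image size, window sizes, and stride are hypothetical values chosen for illustration, and each yielded window would need its own forward pass through the classifier.

```python
def sliding_windows(image_h, image_w, window_sizes, stride):
    """Yield (top, left, size) for every square window that fits in the image."""
    for size in window_sizes:
        for top in range(0, image_h - size + 1, stride):
            for left in range(0, image_w - size + 1, stride):
                yield top, left, size


# Hypothetical 64x64 image, three window sizes, stride of 8 pixels.
windows = list(sliding_windows(64, 64, window_sizes=[16, 32, 64], stride=8))
print(len(windows))  # number of separate classifier runs needed
```

A smaller stride or more window sizes multiplies the number of crops, which is the cost problem described above.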

Convolutional Implementation of Sliding Windows

The FC layers are converted into convolutional layers, so the whole image can be processed in a single forward pass and overlapping windows share their computation.

The convolutional implementation finds the region containing the car more efficiently, but the problem remains that the bounding boxes are not accurate.
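A small shape calculation shows what the conversion buys. The layer stack below is a hypothetical example (a 5x5 conv, a 2x2 max pool, then the former FC layers written as a 5x5 conv followed by two 1x1 convs), not something stated in these notes. With a network trained on 14x14 crops, a larger input produces a grid of outputs, one per window position.

```python
def conv_output_size(n, kernel, stride=1):
    """Spatial size after a convolution or pooling layer without padding."""
    return (n - kernel) // stride + 1


# Hypothetical stack as (kernel, stride) pairs. The third entry is the
# former FC layer rewritten as a 5x5 convolution over the 5x5 feature map.
LAYERS = [(5, 1), (2, 2), (5, 1), (1, 1), (1, 1)]


def output_grid(input_size):
    """Return the side length of the output grid for a square input."""
    size = input_size
    for kernel, stride in LAYERS:
        size = conv_output_size(size, kernel, stride)
    return size


for side in (14, 16, 28):
    grid = output_grid(side)
    print(f"{side}x{side} input gives a {grid}x{grid} grid of predictions")
```

Each cell of the output grid corresponds to one 14x14 window, and because of the 2x2 pooling the windows are effectively 2 pixels apart.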

Bounding Box Predictions

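✔ YOLO algorithm

The image is divided into a grid, such as 3×3 or 19×19, and object localization is performed for each grid cell.

In more detail: each grid cell gets its own label vector y = [P_c, b_x, b_y, b_h, b_w, c1, c2, c3], and an object is assigned to the grid cell that contains its center point.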
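As a rough sketch of how one object could be turned into such a label, the function below finds the grid cell holding the box center and re-expresses the box relative to that cell. The function name `encode_box` and the convention used (center in 0..1 within the cell, height and width in units of one cell) are assumptions for this example.

```python
def encode_box(center_x, center_y, box_h, box_w, grid_size):
    """Map a box given in normalized image coordinates (0..1) to its grid cell.

    Returns (row, col, [p_c, b_x, b_y, b_h, b_w]).
    """
    col = min(int(center_x * grid_size), grid_size - 1)
    row = min(int(center_y * grid_size), grid_size - 1)
    # Center is measured from the cell's top-left corner, so it lies in 0..1.
    b_x = center_x * grid_size - col
    b_y = center_y * grid_size - row
    # Height and width are in cell units, so they may be larger than 1.
    b_h = box_h * grid_size
    b_w = box_w * grid_size
    return row, col, [1, b_x, b_y, b_h, b_w]


# Hypothetical object centered at (0.5, 0.7) on a 3x3 grid.
row, col, label = encode_box(0.5, 0.7, box_h=0.2, box_w=0.4, grid_size=3)
print(row, col, label)
```

The class entries c1, c2, c3 would be appended to the returned list, and every other cell would get P_c = 0.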

Intersection Over Union

Abbreviated as IoU, this is the area of the intersection of two boxes divided by the area of their union. It is used as a metric for checking how well an object was detected.
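A minimal IoU sketch follows. The corner format (x1, y1, x2, y2) is an assumption made for this example; the label vector earlier in these notes uses center, height, and width instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

A value of 1 means the boxes coincide and 0 means they do not overlap at all.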

 

Non-max Suppression

One problem with object detection is that the same object can be detected multiple times. Non-max suppression is a method for solving this problem: among the overlapping detections of one object, only the box with the highest P_c is kept, and the boxes that overlap it heavily (high IoU) are suppressed.
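The procedure can be sketched as a greedy loop. The two thresholds and the (x1, y1, x2, y2) box format are assumptions chosen for illustration, and this version handles a single class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections is a list of (p_c, box). Returns the detections that survive."""
    # Discard low-confidence boxes, then visit the rest from highest p_c down.
    candidates = sorted(
        (d for d in detections if d[0] >= score_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        # Suppress every remaining box that overlaps the chosen one too much.
        candidates = [d for d in candidates if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 52, 52)),      # near-duplicate of the first box
    (0.7, (100, 100, 140, 140)),  # a different object
    (0.3, (11, 11, 51, 51)),      # below the score threshold
]
print(non_max_suppression(detections))
```

With several classes, the same loop would typically be run once per class.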

 

Anchor boxes

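The problem with the approach above is that only one object can be detected per grid cell. Anchor boxes were introduced to solve this problem: each grid cell predicts one label vector per anchor box, and an object is assigned to the anchor box whose shape is most similar to it.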
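One way to picture the assignment is to compare shapes only, as in the sketch below. The two anchor shapes and the helper names are hypothetical, and the comparison uses the IoU of two boxes placed on the same center.

```python
def shape_iou(h1, w1, h2, w2):
    """IoU of two boxes that share a center, so only height and width matter."""
    intersection = min(h1, h2) * min(w1, w2)
    union = h1 * w1 + h2 * w2 - intersection
    return intersection / union


def best_anchor(box_h, box_w, anchors):
    """Return the index of the anchor whose shape best matches the box."""
    return max(
        range(len(anchors)),
        key=lambda i: shape_iou(box_h, box_w, *anchors[i]),
    )


# Hypothetical anchors as (height, width): one tall, one wide.
ANCHORS = [(2.0, 1.0), (1.0, 2.0)]

print(best_anchor(1.8, 0.9, ANCHORS))  # tall box, e.g. a pedestrian
print(best_anchor(0.8, 1.9, ANCHORS))  # wide box, e.g. a car
```

With two anchors and the 8-value label from earlier, each grid cell would carry 2 × 8 = 16 values, so two objects with different shapes can share a cell.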

 


Region Proposal

A segmentation algorithm is used to propose candidate regions, and the classifier then checks only those regions for objects instead of every window position.

 

Semantic Segmentation with U-Net

This classifies objects at a finer level than object detection: it determines which object class each individual pixel belongs to.
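As a toy illustration of "one class per pixel", the sketch below takes per-class score maps and picks the highest-scoring class at each pixel. The score values and the function name `segment` are made up for this example; a real network such as U-Net would produce the score maps.

```python
def segment(class_scores):
    """class_scores[c][row][col] holds the score of class c at one pixel.

    Returns a 2D map with the index of the best class for every pixel.
    """
    n_classes = len(class_scores)
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    return [
        [
            max(range(n_classes), key=lambda c: class_scores[c][r][col])
            for col in range(width)
        ]
        for r in range(height)
    ]


# Hypothetical scores for a 2x3 image: class 0 is background, class 1 is car.
background = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]]
car = [[0.1, 0.8, 0.9], [0.2, 0.3, 0.7]]

mask = segment([background, car])
print(mask)
```

The output has the same height and width as the input, which is why the network has to bring the feature maps back up to the original resolution.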

 

Transpose Convolutions

The transpose convolution is a key building block of the U-Net architecture.

A regular convolution usually reduces the spatial size, whereas a transpose convolution increases it (restoring it toward the original size).

Transpose convolution, step by step
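The process can be sketched directly: every input value scales the whole filter, the scaled filter is placed on the output at a position given by the stride, and overlapping positions are added together. This is a minimal single-channel version with no padding, which is a simplification chosen for this example.

```python
def transpose_conv2d(x, kernel, stride=2):
    """Transpose convolution of a square input with a square kernel, no padding."""
    n, f = len(x), len(kernel)
    out_size = (n - 1) * stride + f
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(n):
        for j in range(n):
            for a in range(f):
                for b in range(f):
                    # Place the kernel, scaled by x[i][j], and sum any overlap.
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [[1, 2], [3, 4]]
ones_3x3 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

result = transpose_conv2d(x, ones_3x3, stride=2)
for row in result:
    print(row)
```

Here a 2x2 input grows to 5x5, and the middle row and column hold sums because neighboring filter placements overlap there.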

 

U-Net architecture intuition

The first half of the network applies regular convolutions and the second half applies transpose convolutions. Copying the activations of an earlier layer over to a later layer (a skip connection) works well. Why is that?

Because the later layer then has access to both the low-level detail and the high-level semantic features.
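In terms of shapes, the copy amounts to stacking the two sets of feature maps along the channel axis. The sketch below shows only that step, with plain nested lists and a made-up function name `concat_channels`; the values are placeholders.

```python
def concat_channels(skip, upsampled):
    """Stack two feature maps of layout [channel][row][col] along channels."""
    skip_hw = (len(skip[0]), len(skip[0][0]))
    up_hw = (len(upsampled[0]), len(upsampled[0][0]))
    if skip_hw != up_hw:
        raise ValueError("spatial dimensions must match before concatenation")
    return skip + upsampled


# Hypothetical 2x2 feature maps: 2 channels from an early layer (fine detail)
# and 1 channel from the upsampling path (semantic context).
early = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
late = [[[5, 5], [5, 5]]]

merged = concat_channels(early, late)
print(len(merged), "channels")
```

The convolutions that follow the concatenation can then combine both kinds of information at every pixel.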

U-Net architecture
