
[Improving Deep Neural Network] Week 2. Optimization algorithms

by seowit 2021. 9. 5.

📜 Lecture Notes

 

* These are my notes from studying Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization.

* Image source: Deeplearning.AI


Optimization Algorithms

1. Mini-batch Gradient Descent

    • (figure omitted; source: Deeplearning.AI)
      A mini-batch that happens to contain mislabeled or otherwise harder examples can produce a large loss, and because the loss differs from batch to batch, the training curve jitters instead of decreasing smoothly.
    • (figure omitted; source: Deeplearning.AI)
      If the mini-batch size is the whole training set (batch gradient descent), each iteration takes too long; if the mini-batch size is 1 (stochastic gradient descent), training is slow because you lose the speedup from vectorization. So an appropriate mini-batch size in between has to be chosen (a minimal training-loop sketch follows the tips below).
    • Tip 1. If the training set is small (m <= 2000): use batch gradient descent.
    • Tip 2. Otherwise, use a typical mini-batch size that is a power of 2 (64, 128, 256, 512) and make sure a single mini-batch fits in CPU/GPU memory.
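As referenced above, here is a minimal sketch of a mini-batch training loop in numpy. The callables `forward_backward` and `update` are hypothetical placeholders for the model-specific code, not anything from the lecture:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the (X, Y) columns together and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of training examples (one per column)
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

def train(X, Y, params, forward_backward, update, epochs=10, lr=0.01, batch_size=64):
    """One gradient step per mini-batch; forward_backward returns (loss, grads)."""
    for epoch in range(epochs):
        for X_batch, Y_batch in random_mini_batches(X, Y, batch_size, seed=epoch):
            loss, grads = forward_backward(params, X_batch, Y_batch)  # per-batch loss jitters
            params = update(params, grads, lr)
    return params
```

Because every mini-batch sees different examples, the per-batch loss is exactly the noisy, jittering curve described above.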
 

Related post (Korean): "Deep learning terminology — the difference between MGD (mini-batch gradient descent) and SGD (stochastic gradient descent)", light-tree.tistory.com

 

(figure omitted; source: Deeplearning.AI)

When β = 0.9, the average reflects the previous 9 values (an average of 10 values including the current one); when β = 0.98, it is an average of about 50 values including the current one; when β = 0.5, about 2 values — roughly the last 1/(1 − β) values.

 

  • Bias correction: v_t^(corrected) = v_t / (1 − β^t) = (β·v_(t−1) + (1 − β)·θ_t) / (1 − β^t)
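A small sketch of the formula above, assuming the observations arrive as a plain Python list `thetas`:

```python
def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t."""
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)  # v_t / (1 - beta^t)
    return out

# beta = 0.9 averages over roughly the last 1/(1-0.9) = 10 values;
# bias correction matters mostly for the first few t, where v_t is still close to 0.
print(ewa([1.0, 1.0, 1.0], beta=0.9))   # corrected values stay at 1.0
```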

5. Gradient Descent with momentum

Momentum adds the effects of velocity and acceleration: applying momentum pushes each update further along the direction it was already moving, like stepping on the accelerator. In the example above, the updates move a long way to the right while moving only a little in the vertical direction.

v plays the role of velocity, and dW plays the role of acceleration.
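A minimal sketch of one momentum update, using the lecture's vdW notation for the velocity term; the gradient dW itself is assumed to come from backprop:

```python
import numpy as np

def momentum_step(W, dW, vdW, lr=0.01, beta=0.9):
    """Gradient descent with momentum: vdW is the velocity, dW the acceleration."""
    vdW = beta * vdW + (1 - beta) * dW   # vdW = beta*vdW + (1-beta)*dW
    W = W - lr * vdW                     # W = W - alpha*vdW
    return W, vdW

# Toy usage: a constant gradient keeps adding to the velocity,
# so successive steps in that direction get larger (up to a limit).
W, vdW = np.zeros(2), np.zeros(2)
for _ in range(3):
    W, vdW = momentum_step(W, dW=np.array([1.0, 0.0]), vdW=vdW)
print(W)
```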

 

6. RMSProp (Root Mean Squared Propagation)

Taking the vertical axis as b and the horizontal axis as w: dW is relatively small, so S_dW stays small and the update α·dW/√(S_dW) stays large, which keeps the horizontal movement fast; db is relatively large, so S_db becomes large and the update α·db/√(S_db) is damped, which keeps the vertical oscillation small.
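A sketch of one RMSProp step for the (W, b) picture above; ε is only there to avoid division by zero, and the per-parameter layout is my own, not code from the lecture:

```python
import numpy as np

def rmsprop_step(W, b, dW, db, SdW, Sdb, lr=0.01, beta2=0.999, eps=1e-8):
    """RMSProp: keep a running average of squared gradients and divide by its root."""
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2   # small dW -> small SdW
    Sdb = beta2 * Sdb + (1 - beta2) * db ** 2   # large db -> large Sdb
    W = W - lr * dW / (np.sqrt(SdW) + eps)      # dividing by a small sqrt(SdW) keeps W moving fast
    b = b - lr * db / (np.sqrt(Sdb) + eps)      # dividing by a large sqrt(Sdb) damps b's oscillation
    return W, b, SdW, Sdb
```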

 

7. Adam Optimization Algorithm

: RMSProp + Momentum

  • Hyperparameters
    • α : needs to be tuned
    • β1 : 0.9
    • β2 : 0.999
    • ε : 1e-8
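Combining the two, a sketch of one Adam step with the hyperparameter defaults listed above, applied to a single parameter W; t is the 1-based iteration count used for bias correction, and the learning-rate default of 0.001 is just a placeholder since α still needs tuning:

```python
import numpy as np

def adam_step(W, dW, vdW, SdW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment) + RMSProp (second moment), both bias-corrected."""
    vdW = beta1 * vdW + (1 - beta1) * dW            # momentum term
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2       # RMSProp term
    vdW_hat = vdW / (1 - beta1 ** t)                # bias correction
    SdW_hat = SdW / (1 - beta2 ** t)
    W = W - lr * vdW_hat / (np.sqrt(SdW_hat) + eps)
    return W, vdW, SdW
```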

8. Learning Rate Decay

Learning rate decay is the process of gradually reducing the learning rate α during training. If α is too large, learning is fast but it is hard to settle close to the optimum; if α is too small, learning is too slow.
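A sketch of one common schedule from the lecture, α = α₀ / (1 + decay_rate · epoch_num):

```python
def decayed_lr(alpha0, epoch_num, decay_rate=1.0):
    """Shrink the learning rate as training progresses."""
    return alpha0 / (1 + decay_rate * epoch_num)

# alpha0 = 0.2 -> 0.2, 0.1, 0.067, 0.05, ... over epochs 0, 1, 2, 3
print([round(decayed_lr(0.2, e), 3) for e in range(4)])
```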
