Lecture Notes
* These are my notes from studying Professor Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization.
* I am writing parts of these notes in English to practice my English. If you spot anything wrong or odd, I would appreciate it if you let me know in a comment or simply overlooked it.
Setting Up your Machine Learning Application
1. Train/dev/test
- ratio of each part
- (Small dataset) Training : Validation : Testing = 60 : 20 : 20
- (Large dataset) Training : Validation : Testing = 98 : 1 : 1 or 95 : 2.5 : 2.5
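As a rough illustration of the ratios above, the sketch below shuffles the example indices and cuts them into train/dev/test parts. The function name `split_indices` and the fixed seed are my own assumptions for illustration, not something from the lecture.

```python
import numpy as np

def split_indices(m, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle range(m) and cut it into train/dev/test index arrays."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_train = int(round(m * ratios[0]))
    n_dev = int(round(m * ratios[1]))
    train = idx[:n_train]
    dev = idx[n_train:n_train + n_dev]
    test = idx[n_train + n_dev:]   # the test set takes whatever is left
    return train, dev, test

# Small dataset: 60 / 20 / 20
small = split_indices(1_000, (0.6, 0.2, 0.2))
# Large dataset: 98 / 1 / 1
large = split_indices(1_000_000, (0.98, 0.01, 0.01))
```

With a million examples, 1% is still 10,000 examples, which is why the dev and test shares can shrink so much on large datasets.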
Regularizing your Neural Network
1. Regularization
There are two ways to address a high variance problem. The first thing to try is regularization, and the other is to get more training data.
- How does regularization work?
- A regularization term is added to the loss function. There are L1 regularization and L2 regularization; L2 regularization is used much more often.
- For a weight matrix, the L2 penalty is the squared "Frobenius norm", i.e. the sum of the squares of the elements of the matrix.
- Neural Network
- (λ/m) * w_l is added to dw_l. This is the derivative of the L2 regularization term, (λ/2m) * ||w_l||^2 (squared Frobenius norm), with respect to w_l.
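To make the L2 term and its derivative concrete, here is a minimal NumPy sketch. It assumes the penalty has the form (λ/2m) times the sum of squared Frobenius norms over all layers, which is what gives the (λ/m) * w_l derivative; the helper names `l2_cost_term` and `l2_grad_term` and the toy matrices are hypothetical.

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """(lambda / (2m)) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad_term(W, lambd, m):
    """Derivative of the term above w.r.t. one weight matrix: (lambda / m) * W."""
    return (lambd / m) * W

# Toy weight matrices for a two-layer network (made-up values)
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])

penalty = l2_cost_term([W1, W2], lambd=0.7, m=10)   # added to the cost J
dW1_extra = l2_grad_term(W1, lambd=0.7, m=10)       # added to the backprop dW1
```

Because the extra gradient is proportional to W itself, every update shrinks the weights a little, which is why L2 regularization is also known as weight decay.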
2. Why Regularization Reduces Overfitting?
The regularization term (the purple part) penalizes the weight matrices for being too large.
For example, suppose we use tanh as the activation function. Because training minimizes the cost function J, a large lambda pushes W toward zero. In z = wx + b, when w gets close to zero, z also gets close to zero. z is the input to tanh (the activation function), and tanh is nearly linear around zero, while it flattens out away from zero. So each unit behaves almost linearly, the network as a whole becomes relatively linear, and the model gets simpler.
One thing I wonder here: is there no regularization effect if the activation is not tanh? Does this also apply to ReLU?
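The claim that tanh is nearly linear around zero can be checked numerically. The sketch below compares tanh(z) with the straight line y = z on a small and a large range of z; the two ranges are arbitrary values I picked for illustration.

```python
import numpy as np

small_z = np.linspace(-0.1, 0.1, 5)   # what z looks like when W is shrunk
large_z = np.linspace(-3.0, 3.0, 5)   # what z can look like with large W

# Largest gap between tanh(z) and the straight line y = z on each range
gap_small = np.max(np.abs(np.tanh(small_z) - small_z))
gap_large = np.max(np.abs(np.tanh(large_z) - large_z))

print(f"max |tanh(z) - z| for |z| <= 0.1: {gap_small:.6f}")
print(f"max |tanh(z) - z| for |z| <= 3.0: {gap_large:.6f}")
```

The gap is tiny on the small range and large on the wide one, so a network whose z values stay small is using tanh almost as a linear function.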
3. Dropout Regularization
Dropout gives each node in the neural network some probability (here 0.5) of being eliminated. Each node is then either kept or removed with a probability of 0.5.
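Below is a minimal sketch of one common way to implement this, inverted dropout, applied to the activations of a single layer. I am assuming the inverted variant (dividing by `keep_prob` after masking); the function name `dropout_forward` and the toy activations are hypothetical.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Apply inverted dropout to activations `a` (training time only)."""
    mask = rng.random(a.shape) < keep_prob   # True = keep the node
    a = a * mask                             # zero out the dropped nodes
    a = a / keep_prob                        # rescale so the expected value is unchanged
    return a, mask

rng = np.random.default_rng(0)
a3 = np.ones((4, 5))                         # toy activations: 4 units, 5 examples
a3_dropped, d3 = dropout_forward(a3, keep_prob=0.5, rng=rng)
```

The same mask has to be reused in backpropagation for that layer, and at test time dropout is turned off. Passing `keep_prob=1.0` keeps every node, which is the same as not using dropout for that layer.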
4. Understanding Dropout
Why does dropout work?
Intuition: a unit can't rely on any one feature, so it has to spread out its weights. This has the effect of shrinking the weights.
Note that keep_prob can be set differently for different layers, and that you should avoid using dropout at the input layer and the classifier (output) layer (i.e. keep_prob = 1.0 there).
In computer vision, the input is so large (all those pixels) that you almost never have enough data. So dropout is used very frequently in computer vision, and some vision researchers use it pretty much always as a default.
5. Other Regularization Methods
- Data augmentation
- Early stopping
Setting Up your Optimization Problem
1. Normalizing Inputs
2. Vanishing/Exploding Gradients
3. Weight Initialization for Deep Networks
The way you initialize the weights depends on the activation function.
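As a hedged sketch of what that dependence can look like, the code below uses two widely used rules: He initialization (variance 2/n_in) for ReLU and Xavier initialization (variance 1/n_in) for tanh. The notes above do not name specific rules, so the choice of these two, the function name `init_weights`, and the layer sizes are my assumptions.

```python
import numpy as np

def init_weights(n_in, n_out, activation, rng):
    """Draw W with a variance scaled to the activation function."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)   # He initialization
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_in)   # Xavier initialization
    else:
        raise ValueError(f"no rule sketched for activation {activation!r}")
    return rng.standard_normal((n_out, n_in)) * scale

rng = np.random.default_rng(0)
W_relu = init_weights(n_in=500, n_out=400, activation="relu", rng=rng)
W_tanh = init_weights(n_in=500, n_out=400, activation="tanh", rng=rng)
```

Scaling by the number of inputs keeps z from growing or shrinking too much as layers are stacked, which is how initialization helps with the vanishing/exploding gradient problem from the previous item.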
4. Numerical Approximations of Gradients