Lecture Notes
* These are my notes from studying Professor Andrew Ng's Deep Learning Specialization course on Coursera.
* I am writing the notes in English to practice my English. If anything is wrong or reads oddly, I would appreciate it if you let me know in the comments or kindly overlooked it.
Logistic Regression as a Neural Network
Binary classification takes an input image x and produces an output label y in {0, 1}; for example, y = 1 if the image contains the object we are looking for (e.g., a cat) and y = 0 otherwise. The prediction $\hat{y}$ is the estimated probability that y = 1:
$$
\hat{y}=\sigma\left(w^{T} x+b\right) \text { , where } \sigma(z)=\frac{1}{1+e^{-z}}
$$
$$
\hat{y}^{(i)}=\sigma\left(w^{T} x^{(i)}+b\right) \text { , where } \sigma(z^{(i)})=\frac{1}{1+e^{-z^{(i)}}}
$$
$$z^{(i)}=w^{T} x^{(i)}+b$$
$^{(i)}$ denotes the i-th training example in the dataset X.
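As a quick sketch (my own example, not from the lecture), the prediction $\hat{y}^{(i)}$ for a single example could be computed with NumPy like this:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example with nx = 3 features
w   = np.array([[0.1], [-0.2], [0.3]])   # weights, shape (nx, 1)
b   = 0.05                               # bias (scalar)
x_i = np.array([[1.0], [2.0], [0.5]])    # one training example x^(i), shape (nx, 1)

z_i     = np.dot(w.T, x_i) + b           # z^(i) = w^T x^(i) + b
y_hat_i = sigmoid(z_i)                   # y_hat^(i) = sigma(z^(i)), a value in (0, 1)
print(y_hat_i)                           # a (1, 1) array holding the predicted probability
```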
- Loss function
- Loss func1. $\mathscr{L}=(y-\hat{y})^{2}$
- Loss func2. $\mathscr{L}=-(y \log (\hat{y})+(1-y) \log (1-\hat{y}))$
- prefer func2 to func1 for optimization reasons: func2 gives a convex cost with a single global optimum, while func1 makes the cost non-convex with many local optima
- Cost function
- The cost function J, which is applied to your parameters w and b, is the average of the loss function applied to each of the training examples (see the code sketch below)
- $J(w, b)=\frac{1}{m} \sum_{i=1}^{m} \mathscr{L}(\hat{y}^{(i)}, y^{(i)})=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log (\hat{y}^{(i)})+(1-y^{(i)}) \log (1-\hat{y}^{(i)})\right]$
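A minimal sketch of the cost computation, assuming the predictions Y_hat and labels Y are stored as (1, m) NumPy arrays (these names are mine, not from the lecture):

```python
import numpy as np

def cost(Y_hat, Y):
    """Cross-entropy cost J: the average loss over the m training examples."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

# Hypothetical labels and predictions for m = 3 examples
Y     = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.2, 0.7]])
print(cost(Y_hat, Y))   # average cross-entropy loss
```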
3. Gradient Descent
We want to find w, b to minimize J(w, b).
With the loss $\mathscr{L}=-(y \log (\hat{y})+(1-y) \log (1-\hat{y}))$, J is a convex function and has one global optimum.
On each iteration, w and b take a step in the steepest downhill direction toward the optimum, where α is the learning rate (a code sketch follows below):
- w := w - α*dw
- b := b - α*db
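A minimal sketch of one update step, where dw and db stand for the derivatives dJ/dw and dJ/db (how to compute them is covered in the sections below; the numbers are hypothetical):

```python
import numpy as np

# Hypothetical current parameters, gradients, and learning rate
w  = np.array([[0.1], [-0.2]])
b  = 0.0
dw = np.array([[0.05], [0.01]])   # dJ/dw
db = 0.02                         # dJ/db
alpha = 0.1                       # learning rate

# One gradient descent step: move in the steepest downhill direction
w = w - alpha * dw
b = b - alpha * db
print(w, b)
```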
4. Derivatives
The slope (gradient, derivative) of a linear function is the same at every point on the graph.
5. More Derivative Examples
We can calculate a derivative in two ways: analytically, by finding the derivative function with calculus, or numerically, by evaluating the function at x and at a nearby point such as x + 0.001 and taking the ratio of the changes.
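For instance, for f(x) = x^2 the derivative function is 2x; a small sketch comparing it with the nudge-by-0.001 approximation (my own example):

```python
def f(x):
    return x ** 2

x   = 2.0
eps = 0.001

analytic  = 2 * x                          # derivative of x^2 is 2x
numerical = (f(x + eps) - f(x)) / eps      # slope between x and a nearby point

print(analytic, numerical)                 # 4.0 vs. approximately 4.001
```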
6. Computation Graph
The computations of a neural network are organized as a forward pass, in which we compute the output of the network, followed by a backward pass (backpropagation), in which we compute the gradients.
The computation graph organizes these computations with blue arrows for the left-to-right (forward) computation and red arrows for the right-to-left (backward) computation.
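A sketch of a tiny computation graph like the one in the lecture, J = 3(a + b*c): the forward pass computes J from left to right, and the backward pass applies the chain rule from right to left.

```python
# Forward pass (blue arrows, left to right): compute the output J
a, b, c = 5.0, 3.0, 2.0
u = b * c        # u = bc
v = a + u        # v = a + u
J = 3 * v        # J = 3v

# Backward pass (red arrows, right to left): chain rule gives each derivative
dJ_dv = 3.0               # dJ/dv
dJ_da = dJ_dv * 1.0       # dv/da = 1
dJ_du = dJ_dv * 1.0       # dv/du = 1
dJ_db = dJ_du * c         # du/db = c
dJ_dc = dJ_du * b         # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0
```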
8. Logistic Regression Gradient Descent
For a single training example, backpropagation through the computation graph gives dz = a - y (where a = $\hat{y}$), dw_1 = x_1 * dz, dw_2 = x_2 * dz, and db = dz.
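A minimal sketch for one example with two features, using the dz/dw/db names above (the numbers are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example with two features
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.01, -0.02, 0.0

# Forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)            # a = y_hat

# Backward pass: derivatives of the loss for this single example
dz  = a - y               # dL/dz
dw1 = x1 * dz             # dL/dw1
dw2 = x2 * dz             # dL/dw2
db  = dz                  # dL/db
print(dz, dw1, dw2, db)
```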
9. Gradient Descent on m examples
The value of dw in the code is cumulative (it accumulates over the training examples), so there is only one dw variable rather than one per example.
But this method needs two for loops, so it takes too much time. The first loop runs over the m training examples ("for i = 1 to m"), and the second loop runs over the features to compute dw_1, dw_2, ..., and db. If we vectorize w_1, ..., w_n into one vector, we can get by with a single for loop (see the sketch below).
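A sketch of the two-for-loop computation described above, assuming X has shape (n, m), Y has shape (1, m), and w is a length-n vector (all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_with_two_loops(X, Y, w, b):
    """Accumulate the cost and gradients over all m examples with explicit loops."""
    n, m = X.shape
    J, db = 0.0, 0.0
    dw = np.zeros(n)                       # dw_1, ..., dw_n
    for i in range(m):                     # first loop: over the m examples
        z_i = np.dot(w, X[:, i]) + b
        a_i = sigmoid(z_i)
        J  += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
        dz_i = a_i - Y[0, i]
        for j in range(n):                 # second loop: over the n features
            dw[j] += X[j, i] * dz_i
        db += dz_i
    return J / m, dw / m, db / m
```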
Python and Vectorization
1. Vectorization
- The NumPy library provides lots of built-in functions, so use them; they make the code simpler and faster.
- How to get rid of the for loop in Figure 1 (in Gradient Descent on m examples)
- Computing dw_1, dw_2, ..., dw_n takes n repetitions. If dw_1, ..., dw_n are vectorized into one vector dw, the inner for loop can be replaced with dw += x_i * dz_i (see the sketch below).
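Continuing the same sketch as above, vectorizing dw_1, ..., dw_n into a single NumPy vector removes the inner loop; only the loop over the m examples remains:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_with_one_loop(X, Y, w, b):
    """Same computation, but dw is one vector, so the feature loop disappears."""
    n, m = X.shape
    dw, db = np.zeros(n), 0.0
    for i in range(m):                 # only the loop over examples remains
        z_i  = np.dot(w, X[:, i]) + b
        dz_i = sigmoid(z_i) - Y[0, i]
        dw  += X[:, i] * dz_i          # replaces dw_1 += x_1*dz_i, ..., dw_n += x_n*dz_i
        db  += dz_i
    return dw / m, db / m
```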
3. Vectorizing Logistic Regression
Z = np.dot(w.T, X) + b
X.shape = (nx, m), w.shape = (nx, 1), and b is a scalar that Python broadcasts to shape (1, m)
4. Vectorizing Logistic Regression's Gradient Output
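In vectorized form the gradients are dZ = A - Y, dw = (1/m) * X * dZ^T, and db = (1/m) * sum(dZ), so a whole iteration needs no explicit for loop. A sketch of one fully vectorized iteration, assuming X is (nx, m), Y is (1, m), and w is (nx, 1) (the function name is mine):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def one_iteration(X, Y, w, b, alpha):
    """One vectorized iteration: forward pass, gradients, parameter update."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b        # (1, m)
    A = sigmoid(Z)                # (1, m) predictions
    dZ = A - Y                    # (1, m)
    dw = np.dot(X, dZ.T) / m      # (nx, 1)
    db = np.sum(dZ) / m           # scalar
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
```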
5. Broadcasting in Python
* axis=0 means to sum vertically (down the columns), and axis=1 means to sum horizontally (across the rows)
How can you divide a three by four matrix by a one by four matrix? By broadcasting!
- General principle: if you apply an operation between an (m, n) matrix and an (m, 1) matrix (or a (1, n) matrix), then Python will
  - 1. copy the smaller matrix n times (or m times) into an (m, n) matrix,
  - 2. and then apply the addition, subtraction, multiplication, or division element-wise (see the sketch below).
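A small sketch of broadcasting and the axis argument, in the spirit of the lecture's calorie example (the numbers are illustrative):

```python
import numpy as np

# A hypothetical (3, 4) matrix: 3 nutrients x 4 foods
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

cal = A.sum(axis=0)                        # axis=0 sums vertically, shape (4,)
percentage = 100 * A / cal.reshape(1, 4)   # (3, 4) divided by (1, 4) via broadcasting
print(percentage)                          # each column now sums to 100
```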