Deep Learning from Scratch

Reading notes on Deep Learning from Scratch and Advanced Deep Learning with Python.

Math Basics

Derivatives

Chain rule

Example: f(x) = sin x, g(x) = x^2^ + 1

[f(g(x))]' = [sin(x^2^ + 1)]'

= f'(g(x)) · g'(x) = cos(x^2^ + 1) · 2x

= 2x cos(x^2^ + 1)
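A quick way to double-check a chain rule result is to compare it against a finite-difference estimate. The sketch below does that for the example above; the helper names and the test point x = 1.5 are arbitrary choices for illustration.

```python
import math

def g(x):
    return x ** 2 + 1

def f(u):
    return math.sin(u)

def analytic_derivative(x):
    # Chain rule: f'(g(x)) * g'(x) = cos(x^2 + 1) * 2x
    return math.cos(g(x)) * 2 * x

def numeric_derivative(x, eps=1e-6):
    # Central difference approximation of d/dx f(g(x))
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x = 1.5
print(analytic_derivative(x), numeric_derivative(x))
```

The two printed numbers should agree to several decimal places.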

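Here is a table of derivative formulas. I learned these in my first year of university and have forgotten all of them.

Inverse trigonometric functions take a ratio and return the corresponding angle, while trigonometric functions take an angle and return a ratio.

tan is opposite over adjacent,

cot is adjacent over opposite,

sin is opposite over hypotenuse,

cos is adjacent over hypotenuse,

sec is 1/cos, that is, hypotenuse over adjacent,

csc is 1/sin, that is, hypotenuse over opposite.

Sigh, I have completely forgotten my high school math. I never need it at work, but it is still worth reviewing.

The book's diagram of the chain rule is also very intuitive: two deltas multiplied together.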

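Why review derivatives? Because forward propagation is just function nesting, f~3~(f~2~(f~1~(x))), and backpropagation is differentiating that nested function recursively with the chain rule.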
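To make the nesting concrete, here is a minimal sketch of a forward and backward pass through three nested functions. The three functions (square, sine, an affine map) are arbitrary picks for illustration, not taken from the book.

```python
import math

# Each entry is (function, derivative). The choice of functions is an assumption.
chain = [
    (lambda x: x ** 2, lambda x: 2 * x),       # f1
    (math.sin, math.cos),                      # f2
    (lambda x: 3 * x + 1, lambda x: 3.0),      # f3
]

def forward(x):
    """Compute f3(f2(f1(x))) and remember the input seen by each function."""
    inputs = []
    for fn, _ in chain:
        inputs.append(x)
        x = fn(x)
    return x, inputs

def backward(inputs):
    """Multiply the local derivatives from the last function back to the first."""
    grad = 1.0
    for (_, dfn), x_in in zip(reversed(chain), reversed(inputs)):
        grad *= dfn(x_in)
    return grad

output, saved_inputs = forward(0.5)
print(output, backward(saved_inputs))
```

The backward pass reuses the values saved during the forward pass, which is why the forward pass has to keep track of them.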

Vector multiplication

Probability

Bayes' formula
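For reference, the standard statement of Bayes' formula for two events A and B, assuming P(B) > 0:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$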

Variance

Normal distribution function

Information formula

Shannon's information entropy:
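As a small illustration, the sketch below uses the standard definitions: the information of an outcome with probability p is -log2(p) bits, and the entropy is the expected information over all outcomes. The function names are my own.

```python
import math

def information(p):
    # Self-information of an outcome with probability p, in bits
    return -math.log2(p)

def entropy(probs):
    # Expected information; outcomes with probability 0 contribute nothing
    return sum(p * information(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1 bit
print(entropy([0.9, 0.1]))  # biased coin: less than 1 bit
```

The more predictable the distribution, the lower its entropy.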

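Partial derivatives

A multi-input function: the forward and backward pass of $\sigma$(f(x,y))

(How to type Greek letters in Markdown: $\sigma$)

Partial derivative of a multi-input function with respect to x:

This rule is important and also simple: it determines the derivative with respect to each individual variable at the last step of backpropagation (the input layer).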
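A minimal sketch of this forward and backward pass follows. The notes do not say what f is, so the code assumes f(x, y) = x * y purely for illustration; $\sigma$ is taken to be the sigmoid function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, y):
    a = x * y          # f(x, y): assumed to be a product for this example
    s = sigmoid(a)     # sigma(f(x, y))
    return a, s

def backward(x, y, a):
    # Derivative of sigmoid at a, then the partial derivatives of f
    ds_da = sigmoid(a) * (1.0 - sigmoid(a))
    da_dx = y          # d(x * y)/dx
    da_dy = x          # d(x * y)/dy
    return ds_da * da_dx, ds_da * da_dy

a, s = forward(0.5, -1.5)
print(s, backward(0.5, -1.5, a))
```

Each input gets its own gradient: the shared upstream factor `ds_da` multiplied by that input's partial derivative.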

Derivative with respect to a vector:

This amounts to taking the derivative with respect to each element separately.

Deep Learning:

“Repeatedly feed observations through the model, keeping track of the quantities computed along the way during this ‘forward pass.’

Calculate a loss representing how far off our model’s predictions were from the desired outputs or target.

Using the quantities computed on the forward pass and the chain rule math worked out in Chapter 1, compute how much each of the input parameters ultimately affects this loss.

Update the values of the parameters so that the loss will hopefully be reduced when the next set of observations is passed through the model.”

Every layer has a Forward and a Backward method; the basic example is the Dense layer.
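The four steps above can be sketched with a tiny one-input, one-output dense layer. This is a simplified illustration, not the book's implementation: the class name `Dense1D`, the toy data (y = 2x + 1), the mean squared error loss, and the hyperparameters are all assumptions.

```python
class Dense1D:
    """One-input, one-output dense layer: y = w * x + b."""

    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def forward(self, xs):
        self.xs = xs  # remembered for the backward pass
        return [self.w * x + self.b for x in xs]

    def backward(self, grad_out):
        # grad_out[i] is dLoss/dPrediction[i]
        self.grad_w = sum(g * x for g, x in zip(grad_out, self.xs))
        self.grad_b = sum(grad_out)

    def update(self, learning_rate):
        self.w -= learning_rate * self.grad_w
        self.b -= learning_rate * self.grad_b


def train(layer, xs, ys, learning_rate=0.05, epochs=2000):
    n = len(xs)
    for _ in range(epochs):
        preds = layer.forward(xs)                                # 1. forward pass
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n  # 2. loss (MSE)
        grad_out = [2 * (p - y) / n for p, y in zip(preds, ys)]  # 3. backward pass
        layer.backward(grad_out)
        layer.update(learning_rate)                              # 4. update
    return loss


layer = Dense1D()
final_loss = train(layer, [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(layer.w, layer.b, final_loss)
```

With this data the layer should end up close to w = 2 and b = 1.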

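Normalize

This is easy to understand: probabilities always lie between 0 and 1, so normalizing a vector (of non-negative values) means dividing each element by the sum of all elements, which makes the results add up to 1.

Softmax function

Softmax makes the largest value stand out even more than plain normalization does: it exponentiates each element before normalizing, which widens the gap between the maximum and the rest.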
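The difference is easy to see side by side. In this sketch the score vector is made up, and subtracting the maximum inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import math

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def softmax(values):
    m = max(values)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 5.0]
print(normalize(scores))  # [0.125, 0.25, 0.625]
print(softmax(scores))    # the largest score takes most of the probability
```

Both outputs sum to 1, but softmax gives the largest score a much bigger share.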

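Cross Entropy

Combined with the Softmax above, the loss for a target vector y (one-hot, or otherwise summing to 1) is $L = -\sum_i y_i \log(\mathrm{softmax}(x)_i)$.

When this is differentiated with respect to x, the log cancels, leaving $\mathrm{softmax}(x)_i - y_i$ for each element.

Pretty magical, isn't it!

So the SCE (softmax cross entropy) loss_grad is simply softmax(x) - y.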
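The result can be checked numerically. The sketch below compares softmax(x) - y with a finite-difference gradient of the loss; the logits and the one-hot target are made-up values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sce_loss(xs, ys):
    # Softmax cross entropy: -sum(y_i * log(softmax(x)_i))
    return -sum(y * math.log(p) for y, p in zip(ys, softmax(xs)))

def sce_grad(xs, ys):
    # Closed form: softmax(x) - y (valid when ys sums to 1)
    return [p - y for p, y in zip(softmax(xs), ys)]

def numeric_grad(xs, ys, eps=1e-6):
    grads = []
    for i in range(len(xs)):
        up = xs[:i] + [xs[i] + eps] + xs[i + 1:]
        down = xs[:i] + [xs[i] - eps] + xs[i + 1:]
        grads.append((sce_loss(up, ys) - sce_loss(down, ys)) / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5]
target = [0.0, 0.0, 1.0]
print(sce_grad(logits, target))
print(numeric_grad(logits, target))
```

The two gradient vectors should match to within the finite-difference error.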

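Using ReLU instead of Sigmoid to speed up training

Sigmoid's slope is at most 0.25 (at x = 0) and shrinks quickly away from it: about 0.1 at x = ±2 and below 0.01 at x = ±5. Gradients that small make parameter updates very slow.

ReLU is the other extreme: f(x) = x for x > 0, and 0 for x <= 0.

It still fits the definition of an activation function: it is monotonic and nonlinear.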
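A short sketch that prints the slopes makes the comparison concrete. The sample points are arbitrary, and the ReLU slope at exactly 0 is set to 0 here by convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, round(sigmoid_grad(x), 5), relu_grad(x))
```

The Sigmoid slope collapses toward 0 as x grows, while the ReLU slope stays at 1 for every positive input.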

Leaky ReLU and Tanh

Tanh has an S-shaped curve with a range of (-1, 1).

Leaky ReLU differs from ReLU only for x < 0, where Leaky ReLU is y = kx with 0 < k < 1.

There are other variants too, such as ReLU6.
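These variants are one-liners. In the sketch below the slope k = 0.1 is an arbitrary choice, and ReLU6 is written in its usual form, ReLU clipped at 6.

```python
import math

def leaky_relu(x, k=0.1):
    # Same as ReLU for x > 0, a small slope k for x <= 0
    return x if x > 0 else k * x

def relu6(x):
    # ReLU with the output capped at 6
    return min(max(0.0, x), 6.0)

for x in [-3.0, 0.0, 3.0, 8.0]:
    print(x, leaky_relu(x), relu6(x), math.tanh(x))
```

Unlike ReLU, Leaky ReLU keeps a nonzero gradient for negative inputs.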

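Momentum

The speed at which x is updated at each step is no longer fixed by the current gradient alone; it is computed from the speed of the previous steps as well.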
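Here is a minimal sketch of one common formulation of momentum, in which a velocity accumulates past gradients. The book's exact update may differ in where the learning rate is applied; the objective (x - 3)^2 and the hyperparameters are made up for illustration.

```python
def sgd_momentum(grad_fn, x, learning_rate=0.1, momentum=0.9, steps=500):
    velocity = 0.0
    for _ in range(steps):
        # New velocity = decayed old velocity + current gradient
        velocity = momentum * velocity + grad_fn(x)
        x -= learning_rate * velocity
    return x

# Minimize (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = sgd_momentum(lambda x: 2 * (x - 3), x=0.0)
print(x_min)
```

With `momentum=0.0` this reduces to plain gradient descent.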

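Learning Rate Decay

As the momentum (the size of the updates) gradually shrinks, the model has probably reached the neighborhood of a local or global optimum, and further training changes very little. Learning rate decay therefore lowers the learning rate step by step from its initial value toward a final learning rate, so that the late updates fine-tune rather than overshoot. Once the learning rate has dropped to the final value, or below a chosen threshold, we can stop training.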
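Two simple schedules are sketched below: linear decay, which subtracts the same amount every epoch, and exponential decay, which multiplies by the same factor every epoch. The function names and the example rates are assumptions, and both functions expect `max_epochs` greater than 1.

```python
def linear_decay(initial_lr, final_lr, epoch, max_epochs):
    # Drops by the same amount every epoch
    step = (initial_lr - final_lr) / (max_epochs - 1)
    return initial_lr - step * epoch

def exponential_decay(initial_lr, final_lr, epoch, max_epochs):
    # Multiplies by the same factor every epoch
    factor = (final_lr / initial_lr) ** (1.0 / (max_epochs - 1))
    return initial_lr * factor ** epoch

for epoch in range(10):
    print(epoch,
          round(linear_decay(0.1, 0.01, epoch, 10), 4),
          round(exponential_decay(0.1, 0.01, epoch, 10), 4))
```

Both schedules start at the initial rate and reach the final rate in the last epoch; the exponential one drops faster early on.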

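Weight Initialization

Why is weight initialization needed? Activation functions such as Tanh and Sigmoid are steepest around input = 0 and flatten out as the input moves away from 0. If the initial weights are too large, the weighted sums fed into these functions land in the flat regions and the gradients become tiny. Setting every weight to exactly 0 is not a good choice either, because all neurons in a layer would then compute the same output and receive the same update. Choosing the scale of the random initial weights carefully mitigates this.

Glorot initialization:

The weights of each layer are drawn at a scale determined by the layer's number of inputs and number of outputs.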
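A minimal sketch of the normal-distribution variant of Glorot initialization, which draws weights with variance 2 / (n_in + n_out). The function name and the layer sizes are illustrative.

```python
import math
import random

def glorot_normal(n_in, n_out, rng=random):
    # Standard deviation chosen so the variance is 2 / (n_in + n_out)
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

weights = glorot_normal(200, 100, random.Random(0))
flat = [w for row in weights for w in row]
print(sum(w * w for w in flat) / len(flat), 2.0 / 300)
```

The empirical variance of the sampled weights should land close to the target value.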

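Dropout

Why can't we just keep stacking layers in a neural network? Because a bigger network overfits more easily. So instead of stacking blindly, dropout switches off a random subset of neurons on each training pass: each neuron is dropped with probability p by setting its output to 0, so that neither the forward nor the backward pass goes through it.

Correspondingly, at inference time, when no neurons are dropped, the outputs are scaled to magnitude * (1 - p) so that their expected size matches what the next layer saw during training.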
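A minimal sketch of this scheme, assuming p is the drop probability and that scaling happens at inference time as described above. Many frameworks use the equivalent "inverted" form instead, scaling the kept outputs by 1 / (1 - p) during training.

```python
import random

def dropout(activations, p, training, rng=random):
    if training:
        # Zero out each activation independently with probability p
        return [0.0 if rng.random() < p else a for a in activations]
    # Inference: keep everything, scale by the keep probability (1 - p)
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=True, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=False))  # [0.5, 1.0, 1.5, 2.0]
```

The training output changes from call to call; the inference output is deterministic.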

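Convolutional

I won't write down the details of convolution kernels and matrices here; it is the same element-wise multiply-and-sum as before, applied to each patch of the input.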
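For completeness, here is a small sketch of that multiply-and-sum for a single-channel image with stride 1 and no padding. Like most deep learning code, it does not flip the kernel (strictly speaking a cross-correlation); the image and kernel values are made up.

```python
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch at (i, j) and sum the products
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(image, kernel))  # [[-4.0, -4.0], [-4.0, -4.0]]
```

A 3x3 input with a 2x2 kernel gives a 2x2 output, one value per kernel position.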

“The interpretation of each neuron of a fully connected layer is that it detects whether or not a particular combination of the features learned by the prior layer is present in the current observation.

The interpretation of a neuron of a convolutional layer is that it detects whether or not a particular combination of visual patterns learned by the prior layer is present at the given location of the input image.”

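1x1 convolution

It can change the depth (number of channels) of the output; when used to reduce the depth, it is also called a bottleneck layer.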
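A 1x1 convolution is a weighted sum across channels at each pixel. The sketch below assumes a channels × height × width layout and one weight per (output channel, input channel) pair; the values are made up.

```python
def conv1x1(feature_maps, weights):
    """feature_maps: [C_in][H][W]; weights: [C_out][C_in]; returns [C_out][H][W]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [[sum(w * fm[i][j] for w, fm in zip(out_weights, feature_maps))
          for j in range(width)]
         for i in range(height)]
        for out_weights in weights
    ]

feature_maps = [[[1, 2], [3, 4]],
                [[10, 20], [30, 40]],
                [[100, 200], [300, 400]]]   # depth 3
weights = [[1, 1, 1],
           [1, 0, -1]]                      # 2 output channels
print(conv1x1(feature_maps, weights))
```

The spatial size stays 2x2 while the depth goes from 3 to 2.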

Dilated Convolution

Guided back propagation

Flatten Layer

Turns an m × height × width matrix into a 1 × (m · height · width) vector.

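Pooling Layers

Pooling layers downsample the input and reduce its resolution.

Since ResNet they have been used much less, because they throw information away.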
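As a concrete case, here is a sketch of max pooling with non-overlapping windows; the 2x2 window size and the input values are arbitrary.

```python
def max_pool2d(image, size=2):
    # Slide a size x size window in steps of `size` and keep the maximum
    return [
        [max(image[i + di][j + dj] for di in range(size) for dj in range(size))
         for j in range(0, len(image[0]) - size + 1, size)]
        for i in range(0, len(image) - size + 1, size)
    ]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(max_pool2d(image))  # [[6, 8], [14, 16]]
```

The 4x4 input shrinks to 2x2, and three of every four values are discarded, which is the information loss mentioned above.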

Padding

When processing values at the edges, pad the input with zeros; otherwise the output will be smaller than the input.
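The size arithmetic can be sketched as follows, using the standard output-size formula (n + 2 * pad - kernel) // stride + 1. The helper names are my own.

```python
def zero_pad(image, pad):
    # Surround a 2D image with `pad` rows and columns of zeros
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded

def conv_output_size(n, kernel, pad=0, stride=1):
    return (n + 2 * pad - kernel) // stride + 1

print(zero_pad([[1, 2], [3, 4]], 1))
print(conv_output_size(5, 3))         # 3: the output shrinks without padding
print(conv_output_size(5, 3, pad=1))  # 5: same size as the input
```

For a 3x3 kernel with stride 1, one ring of zeros is enough to keep the output the same size as the input.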

AlexNet

VGG

ResNet

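The ResNet authors found that a 56-layer network had a higher error than a 20-layer one. In theory, a deeper network should at least match the performance of the shallower one, even if many of its extra neurons are never activated.

This led to the residual network. Each of the four units in the figure is called a Residual Block, and the extra path added on the right is called an Identity Shortcut Connection (skip connection). During forward propagation the block passes on not only the learned features but also the original input signal, so the network can effectively decide to skip some layers. Padding is also used so that the output dimensions match the input and the two can be added.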
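The idea of a residual block can be sketched in a few lines. This is a simplification: real ResNet blocks use convolutions and batch normalization, while the sketch uses two small dense layers with square weight matrices so that the output has the same size as the input. All names and values are illustrative.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def dense(vector, weights):
    # weights is a list of rows; output[i] = sum_j weights[i][j] * vector[j]
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, w1, w2):
    learned = dense(relu(dense(x, w1)), w2)               # F(x): learned features
    return relu([f + xi for f, xi in zip(learned, x)])    # F(x) + x via the shortcut

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

With all weights at zero the block simply passes its (non-negative) input through, which is how a deeper network can fall back to behaving like a shallower one.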