This article introduces the Momentum, RMSProp, and Adam algorithms used to optimize neural networks. The material comes from the fastai course and Andrew Ng's deep learning course.
1. Momentum algorithm
The Momentum algorithm, also known as the momentum method, directly applies the idea of the EWMA, the exponentially weighted moving average. In each update, the size and direction of the gradients from previous updates strongly influence the size and direction of the current update vector, because the previous updates reflect the recent trend of the optimization well. This accelerates model fitting and smooths the optimization route.
Expressed as a formula: if the update vector is called the "step", then the step of this update = the step of the last update · β + the gradient of this update · (1 − β), and the weight being optimized is then moved by the step times the learning rate:

    v_t = β · v_{t-1} + (1 − β) · g_t
    θ_t = θ_{t-1} − α · v_t

where β is the EWMA coefficient, g_t is the gradient of this update, and α is the learning rate.
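A minimal sketch of this update rule in Python (the function name, the toy objective, and the values of lr and beta are illustrative assumptions, not part of the course material):

    import numpy as np

    def momentum_step(w, v, grad, lr=0.1, beta=0.9):
        """One momentum update: v is the EWMA of past gradients (the "step")."""
        v = beta * v + (1 - beta) * grad   # blend the previous step with the current gradient
        w = w - lr * v                     # move the weight along the smoothed direction
        return w, v

    # Toy usage: minimize f(w) = w^2, whose gradient is 2w.
    w, v = np.array([5.0]), np.zeros(1)
    for t in range(200):
        w, v = momentum_step(w, v, 2 * w)
    print(w)  # close to 0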
However, there are two problems here:
Question one:
From the formula above, it follows that during the first few iterations the accumulated step is very small (it starts from zero), so fitting is very slow at the beginning.
Therefore, the formula can be rewritten with a bias correction:

    v̂_t = v_t / (1 − β^t)

In this case, when the iteration count t is small, the original step is scaled up; when t is large, β^t approaches 0, the denominator approaches 1, and the correction no longer affects the original step.
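A sketch of the bias-corrected update, reusing the assumed names from the previous snippet (t is taken to be the 1-indexed iteration count):

    def momentum_step_corrected(w, v, grad, t, lr=0.1, beta=0.9):
        """Momentum update with bias correction for the early iterations."""
        v = beta * v + (1 - beta) * grad
        v_hat = v / (1 - beta ** t)   # enlarges the step while t is small; ~v once t is large
        w = w - lr * v_hat
        return w, v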
Question two:
If the moving average keeps increasing, it can cause a gradient-explosion-like problem, as shown in the figure below, and the problem of the fitted route fluctuating is still not solved.
To solve this problem, the RMSProp algorithm was born.
2. RMSProp algorithm
Adaptive learning rate adjustment. The RMSProp algorithm improves on the momentum algorithm: it also keeps an EWMA, but of the squared gradients, and uses this accumulated value together with the learning rate to restrict the size of the current step.
The formula is:

    s_t = β · s_{t-1} + (1 − β) · g_t²
    θ_t = θ_{t-1} − α · g_t / (√s_t + ε)

where ε is a small constant that avoids division by zero.
In this way, on top of the gradient and the learning rate, the current step is also restricted by the previously accumulated moving average, which is equivalent to combining the gradient descent method with the momentum idea: the optimization route fluctuates less and converges faster.
As shown in the figure, blue is the route optimized by the mom (momentum) algorithm and green is the route optimized by the RMSProp algorithm.
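A minimal sketch of the RMSProp update under the same assumptions as the earlier snippets (eps and the default hyperparameter values are illustrative):

    import numpy as np

    def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
        """One RMSProp update: s is the EWMA of squared gradients."""
        s = beta * s + (1 - beta) * grad ** 2
        w = w - lr * grad / (np.sqrt(s) + eps)   # large recent gradients shrink the step
        return w, s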
3. Adam method
The Adam algorithm essentially superimposes the mom method and the RMS method once more in order to get better results: on top of the RMS method, the influence of the previously accumulated steps on the current step is further enlarged.
The formula is:

    v_t = β₁ · v_{t-1} + (1 − β₁) · g_t
    s_t = β₂ · s_{t-1} + (1 − β₂) · g_t²
    θ_t = θ_{t-1} − α · v_t / (√s_t + ε)
The Adam algorithm is therefore a better fusion of the momentum method and RMSProp, but the issue raised in question one of the momentum method still exists, so the same bias correction can be applied to v_t and s_t in its formula.
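A minimal sketch of a single Adam step with the bias correction applied to both moving averages (the default values of beta1, beta2, eps, and lr follow common practice and are assumptions here, not taken from the course):

    import numpy as np

    def adam_step(w, v, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update: momentum EWMA (v) plus squared-gradient EWMA (s), both bias-corrected."""
        v = beta1 * v + (1 - beta1) * grad          # momentum term
        s = beta2 * s + (1 - beta2) * grad ** 2     # RMSProp term
        v_hat = v / (1 - beta1 ** t)                # bias corrections, t is the 1-indexed step
        s_hat = s / (1 - beta2 ** t)
        w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
        return w, v, s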