Summary of my understanding from this set of lectures:

- Plain Gradient Descent (GD) takes a lot of epochs to converge to a minimum, especially on gentle slopes where the gradients are small, which increases computation cost. This is why Momentum-based GD (MGD) came into the picture
- MGD uses historical information (an exponentially decaying sum of past gradients) to boost the update and travel faster to the minimum. However, because it takes longer strides, it overshoots and ends up making a lot of large U-turns before reaching the minimum
- Nesterov Accelerated Gradient (NAG) uses the historical information but with caution, i.e. it first makes a temporary (look-ahead) move based on just the history term, and then makes a correction using the gradient computed at that look-ahead point rather than at the current point
- Plain GD does not work well in practice for sparse features with low learning rates: their gradients are zero most of the time, so their weights take forever to converge. Thus the need for Adaptive GD (Adagrad), where each parameter gets its own effective learning rate. The learning rate is divided by the square root of a history factor (the accumulated squared gradients), which is usually lower for sparse features due to their sparseness, so their effective learning rate ends up higher
- The problem with Adagrad is that the history factor only ever grows, so the effective learning rate decays so fast that it becomes practically zero (the history factor in the denominator gets very large) before the weights have converged to their ideal values. This gives rise to RMSProp
- RMSProp slows down this decay of the learning rate by replacing the running sum with an exponentially decaying average of squared gradients (controlled by beta, any value between 0 and 1) as the history factor. This still ensures that the effective learning rates for sparse and dense features are different
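- Adam Gradient Descent (or just Adam) combines the best of both worlds from Momentum GD and RMSProp. It keeps two separate history factors: an exponentially decaying average of past gradients (as in Momentum GD) for gaining momentum on flat surfaces, and an exponentially decaying average of squared gradients (as in RMSProp) to keep an efficient per-parameter learning rate until it converges to the minimum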
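
The update rules summarized above can be sketched in plain Python roughly as follows. This is a minimal sketch and not the lecture's code: the toy quadratic loss, the hyperparameter values, and all function names (`grad`, `new_state`, `run`, and the `*_step` functions) are assumptions made up for illustration.

```python
import math

def grad(w):
    # Gradient of an assumed toy loss f(w) = 0.5 * (10 * w[0]**2 + w[1]**2).
    return [10.0 * w[0], 1.0 * w[1]]

def new_state(n):
    # u: momentum history, v: squared-gradient history, m: Adam first moment.
    return {"u": [0.0] * n, "v": [0.0] * n, "m": [0.0] * n, "t": 0}

def momentum_step(w, s, lr=0.01, beta=0.9):
    g = grad(w)
    # History is an exponentially decaying sum of past gradients.
    s["u"] = [beta * u + lr * gi for u, gi in zip(s["u"], g)]
    return [wi - ui for wi, ui in zip(w, s["u"])]

def nag_step(w, s, lr=0.01, beta=0.9):
    # Look ahead using only the history term, then take the gradient there.
    look_ahead = [wi - beta * ui for wi, ui in zip(w, s["u"])]
    g = grad(look_ahead)
    s["u"] = [beta * u + lr * gi for u, gi in zip(s["u"], g)]
    return [wi - ui for wi, ui in zip(w, s["u"])]

def adagrad_step(w, s, lr=0.1, eps=1e-8):
    g = grad(w)
    # History is a running sum of squared gradients, so it only grows.
    s["v"] = [v + gi * gi for v, gi in zip(s["v"], g)]
    return [wi - lr * gi / (math.sqrt(vi) + eps)
            for wi, gi, vi in zip(w, g, s["v"])]

def rmsprop_step(w, s, lr=0.01, beta=0.9, eps=1e-8):
    g = grad(w)
    # History is an exponentially decaying average of squared gradients.
    s["v"] = [beta * v + (1 - beta) * gi * gi for v, gi in zip(s["v"], g)]
    return [wi - lr * gi / (math.sqrt(vi) + eps)
            for wi, gi, vi in zip(w, g, s["v"])]

def adam_step(w, s, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(w)
    s["t"] += 1
    # Momentum-style history of gradients and RMSProp-style history of squares.
    s["m"] = [beta1 * m + (1 - beta1) * gi for m, gi in zip(s["m"], g)]
    s["v"] = [beta2 * v + (1 - beta2) * gi * gi for v, gi in zip(s["v"], g)]
    # Bias correction compensates for both histories starting at zero.
    m_hat = [m / (1 - beta1 ** s["t"]) for m in s["m"]]
    v_hat = [v / (1 - beta2 ** s["t"]) for v in s["v"]]
    return [wi - lr * mh / (math.sqrt(vh) + eps)
            for wi, mh, vh in zip(w, m_hat, v_hat)]

def run(step, steps=500):
    # Start every optimizer from the same assumed point w = [1, 1].
    w = [1.0, 1.0]
    state = new_state(len(w))
    for _ in range(steps):
        w = step(w, state)
    return w

for step in (momentum_step, nag_step, adagrad_step, rmsprop_step, adam_step):
    print(step.__name__, run(step))
```

In this sketch, `adam_step` keeps both the `m` history (the Momentum GD idea) and the `v` history (the RMSProp idea), plus a bias-correction step that neither of the other two has.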

In the lecture, we are told that a combination of Momentum GD and RMSProp is what is mostly used in practice. Isn’t that same combination equal to Adam Gradient Descent, which uses the best of both worlds?