Although I followed the links above, I still don't quite get why the gradient points in the direction of steepest ascent. A simple intuition for this would be really helpful. Thanks in advance.
So it’s basically like asking:
Why is it that we minimize (perform gradient “descent”) only when we subtract the gradient from the parameters, that is, when we move in the direction opposite to the gradient?
Or in simple words,
Why does the update rule “subtract” the gradient to minimize the cost function?
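One way to build intuition: the rate of change of a function f along a unit direction u is the dot product ∇f · u = |∇f| cos θ, where θ is the angle between u and the gradient. That is largest when u points along the gradient (θ = 0) and most negative when u points the opposite way (θ = 180°), which is why subtracting the gradient lowers the cost. Here is a small numerical sketch of that idea. The bowl-shaped `cost` function, the starting point, and the learning rate `lr` are all made up for illustration; they are not taken from any course material.

```python
import math

def cost(x, y):
    # Hypothetical bowl-shaped cost, chosen only for illustration.
    return x**2 + 3 * y**2

def grad(x, y):
    # Analytic gradient of the cost above: (d/dx, d/dy).
    return (2 * x, 6 * y)

def steepest_direction(x, y, n_angles=3600, eps=1e-6):
    # Try many unit directions and measure how fast the cost rises
    # along each one, using a small finite-difference step.
    best_angle, best_rate = 0.0, -math.inf
    for k in range(n_angles):
        a = 2 * math.pi * k / n_angles
        ux, uy = math.cos(a), math.sin(a)
        rate = (cost(x + eps * ux, y + eps * uy) - cost(x, y)) / eps
        if rate > best_rate:
            best_angle, best_rate = a, rate
    return best_angle, best_rate

x, y = 1.0, 1.0
gx, gy = grad(x, y)
grad_angle = math.atan2(gy, gx)   # direction the gradient points in
grad_norm = math.hypot(gx, gy)    # length of the gradient

best_angle, best_rate = steepest_direction(x, y)
print("angle of fastest increase (brute force):", best_angle)
print("angle of the gradient:                  ", grad_angle)
print("fastest rate of increase:", best_rate, "vs |gradient|:", grad_norm)

lr = 0.1  # assumed learning rate
down = cost(x - lr * gx, y - lr * gy)  # step against the gradient
up = cost(x + lr * gx, y + lr * gy)    # step along the gradient
print("cost now:", cost(x, y), "after subtracting:", down, "after adding:", up)
```

The brute-force search should land on (almost) the same angle as the gradient, and the step that subtracts the gradient should give a lower cost than the starting point, while the step that adds it gives a higher one. Keep in mind this only holds for a small enough step; with a very large `lr` even the "descent" step can overshoot and increase the cost.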
Please check whether this video helps in understanding the math behind that: