Doubt Regarding AdaGrad Function!

Can someone explain why, specifically, the AdaGrad update formula was chosen to be:


Our aim was to decay the learning rate in proportion to the accumulated history, right?

So, why couldn’t we just use some other decay function, such as e^(-v(t))*n, as our learning rate instead of this specific one?

Yes, any other decay function could have been used in principle, but the authors arrived at this one through their own analysis and experiments, and they found it to work well.
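
To get a feel for the difference, here is a rough sketch (my own, not taken from the paper) that compares the two effective learning rates as the history grows. It assumes the usual form of the AdaGrad rule, v(t) = v(t-1) + g(t)^2 with step size eta / sqrt(v(t) + eps), where eta is the base learning rate (written n in the question). The function names, the constant gradient of 2.0, and eta = 0.1 are made up purely for illustration.

```python
import math


def adagrad_lr(eta, v, eps=1e-8):
    """Effective learning rate under the usual AdaGrad rule: eta / sqrt(v + eps)."""
    return eta / math.sqrt(v + eps)


def exp_decay_lr(eta, v):
    """Effective learning rate under the decay proposed in the question: eta * e^(-v)."""
    return eta * math.exp(-v)


def accumulate(grads):
    """Running sum of squared gradients: v(t) = v(t-1) + g(t)^2."""
    v, history = 0.0, []
    for g in grads:
        v += g * g
        history.append(v)
    return history


if __name__ == "__main__":
    eta = 0.1
    grads = [2.0] * 10  # hypothetical constant gradient, for illustration only
    for t, v in enumerate(accumulate(grads), start=1):
        print(
            f"t={t:2d}  v={v:5.1f}  "
            f"adagrad={adagrad_lr(eta, v):.4e}  exp={exp_decay_lr(eta, v):.4e}"
        )
```

With these made-up numbers, after 10 steps v is 40: the AdaGrad rate is still about 0.016, while the exponential one has dropped to roughly 4e-19, so learning has effectively stopped. Another thing you may notice is that eta * g / sqrt(v) stays the same if every gradient is multiplied by the same constant (ignoring eps), which is not true for eta * e^(-v) * g. That is only one possible intuition, though, not the paper's full argument.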

I would recommend reading the AdaGrad paper to understand the authors' reasoning and how they reached their conclusion.

Thanks a lot, sir!

I’ll read this for sure!