Question on Math behind LSTM Vanishing gradient

While explaining the math behind the LSTM vanishing gradient, the prof said the following about the equation below:

ds(t)/ds(t-1) = d/ds(t-1) [ f(t)*s(t-1) + i(t)*s(tcand) ]

Let us consider i(t)*s(tcand) = 0 and make our lives tough. We will then prove that the
derivative of f(t)*s(t-1) is > 0, and if so we can prove the gradients do not vanish, following the principle that if a > 0, and assuming b = 0,
then a + b > 0.
My question is about the term b, which is i(t)*s(tcand). Can’t it be negative? If so, can’t it cancel a, so that
even if a > 0 it would not mean a + b > 0?
If we further take the derivative of i(t)*s(tcand) wrt s(t-1), we get
= sigmoid’(i(t))*o(t-1)*W(i)*sigmoid’(s(cand(t)))Wo(t-1)
which would mean W(i), W or o(t-1) would also control whether the gradient vanishes or not, along with
the first term a, which was already explained in the video. Is this understanding correct?

Hi @Ishvinder: Can you please help?

\frac{\partial s(t)}{\partial s(t-1)} = \frac{\partial [f(t) * s(t-1)]}{\partial s(t-1)} + \frac{\partial [i(t) * \tilde{s}(t)]}{\partial s(t-1)}

The discussion is about the second term:

\frac{\partial [i(t) * \tilde{s}(t)]}{\partial s(t-1)}

Does it equate to


Hi @karrtikiyer,
Really sorry, I missed this thread somehow. I will check your doubt after some revision and get back to you.

I’m not completely sure this is the reason behind it, but from what I can tell, he is talking about the 2nd term because S~(t) is also a function of S(t-1), whereas in the paper it was treated as a constant. (That is what I read in an article.)
I haven’t found a need to read that paper myself, but you can give it a go.
The explanation and pointer to that paper: Why LSTMs Stop Your Gradients From Vanishing: A View from the Backwards Pass
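
To see numerically what it means for S~(t) to be a function of S(t-1), here is a rough scalar sketch that differentiates the two summands of s(t) separately with finite differences. Everything in it is an assumption for illustration, not the setup from the video or the paper: the coupling h(t-1) = o(t-1)*tanh(s(t-1)), the tanh candidate, the helper names `cell_terms` and `split_gradient`, and the made-up weights in `params`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cell_terms(s_prev, o_prev, x, p):
    """Return the two summands of s(t): (f(t)*s(t-1), i(t)*s_cand(t))."""
    # Assumed coupling: the gates see s(t-1) only through h(t-1).
    h_prev = o_prev * math.tanh(s_prev)
    f = sigmoid(p["wf"] * h_prev + p["uf"] * x)  # forget gate f(t)
    i = sigmoid(p["wi"] * h_prev + p["ui"] * x)  # input gate i(t)
    s_cand = math.tanh(p["wc"] * h_prev + p["uc"] * x)  # candidate state
    return f * s_prev, i * s_cand


def split_gradient(s_prev, o_prev, x, p, eps=1e-6):
    """Central-difference derivative of each summand wrt s(t-1)."""
    a_hi, b_hi = cell_terms(s_prev + eps, o_prev, x, p)
    a_lo, b_lo = cell_terms(s_prev - eps, o_prev, x, p)
    return (a_hi - a_lo) / (2 * eps), (b_hi - b_lo) / (2 * eps)


# Made-up weights: only the candidate depends on h(t-1), with a negative weight.
params = {"wf": 0.0, "uf": 0.0, "wi": 0.0, "ui": 0.0, "wc": -1.0, "uc": 0.0}
da, db = split_gradient(s_prev=0.0, o_prev=1.0, x=0.0, p=params)
print("d[f*s(t-1)]/ds(t-1) =", da)
print("d[i*s_cand]/ds(t-1) =", db)
print("sum                 =", da + db)
```

With these made-up values the first derivative comes out near 0.5 and the second near -0.5, so the two roughly cancel. That is only a toy scalar case, but it suggests that once S~(t) is allowed to depend on S(t-1), the second term can be negative and a > 0 by itself does not settle the sign or size of a + b.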


Thanks @Ishvinder. Appreciate it.