LSTM exploding gradient question

While explaining this topic in the video, the professor took a path t1 which had the term dh(t)/do(t),
and this was further expanded to Diag(sigmoid(s(t))*o’(t)).

Given h(t) = o(t) * sigmoid(s(t)),
wouldn’t dh(t)/do(t) = Diag(sigmoid(s(t)))?

Why did the professor add the additional o’(t) in his slides?

Since o(t) is itself a function of h(t-1), we can’t directly say that dh(t)/do(t) is just Diag(sigmoid(s(t))).

Consider the following example for simplicity:
x(t) = 2*z(t-1)
y(t) = 3*x(t)

If we want to find dy(t)/dx(t), we can’t just write it as 3, because x(t) is a function of z(t-1) and the derivative has to be expanded further through the chain rule.
So, just for the sake of simplicity, we write it as dy(t)/dx(t) = 3*x’(t).
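
To make the chaining concrete, here is a minimal numerical sketch. It assumes the example is meant as x(t) = 2*z(t-1) and y(t) = 3*x(t); the helper names (x_of_z, y_of_x, numerical_derivative) and the value of z_prev are made up for illustration. It checks that the local factor is 3, while the derivative all the way back to z(t-1) is 3 times the derivative of x.

```python
def x_of_z(z):
    # Assumed reading of the toy example: x(t) = 2 * z(t-1)
    return 2.0 * z


def y_of_x(x):
    # Assumed reading of the toy example: y(t) = 3 * x(t)
    return 3.0 * x


def numerical_derivative(f, point, eps=1e-6):
    # Central finite difference approximation of f'(point)
    return (f(point + eps) - f(point - eps)) / (2.0 * eps)


z_prev = 0.7  # arbitrary illustrative value for z(t-1)

# Local factor: derivative of y with respect to x, holding nothing else in mind
dy_dx = numerical_derivative(y_of_x, x_of_z(z_prev))

# Derivative of the inner function x with respect to z(t-1)
dx_dz = numerical_derivative(x_of_z, z_prev)

# Full derivative through the chain z(t-1) -> x(t) -> y(t)
dy_dz = numerical_derivative(lambda z: y_of_x(x_of_z(z)), z_prev)

print(dy_dx, dx_dz, dy_dz)
```

The full derivative dy_dz matches dy_dx * dx_dz, which is the "3 times x’" shorthand used above.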

Thanks @Ishvinder, I understand the simple example and the chaining part of it. However, what I am finding difficult in this case is that I only considered h(t) -> o(t); I am unable to see why o(t) -> h(t-1).
Is it because s(t) * o(t) = h(t) -> o(t) = h(t)/s(t), and s(t) -> h(t-1), where s(t) = Wh(t-1) + Ux(t) + b?

Hope you are able to follow my question/response…

No, I think you missed the formulation for o(t) from the first point.
[Image: SmartSelect_20200524-125104_Guvi]
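
For anyone reading later, a small sketch of why o(t) leads back to h(t-1). It assumes the standard LSTM output gate o(t) = sigmoid(W_o h(t-1) + U_o x(t) + b_o), which may be written differently on the slides; all weights and inputs below are made-up values. Holding s(t) fixed to isolate the single path h(t-1) -> o(t) -> h(t), it compares a finite-difference Jacobian against Diag(sigmoid(s(t)) * o’(t)) W_o, where o’(t) = o(t) * (1 - o(t)).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def output_gate(h_prev, x, W_o, U_o, b_o):
    # Assumed standard form: o(t) = sigmoid(W_o h(t-1) + U_o x(t) + b_o)
    gate = []
    for i in range(len(b_o)):
        pre = b_o[i]
        pre += sum(W_o[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(U_o[i][j] * x[j] for j in range(len(x)))
        gate.append(sigmoid(pre))
    return gate


def hidden_state(h_prev, x, s, W_o, U_o, b_o):
    # h(t) = o(t) * sigmoid(s(t)), elementwise; s(t) is held fixed here
    o = output_gate(h_prev, x, W_o, U_o, b_o)
    return [o[i] * sigmoid(s[i]) for i in range(len(s))]


# Hypothetical illustrative values (not from the course material)
W_o = [[0.5, -0.3], [0.8, 0.1]]
U_o = [[0.2, 0.4], [-0.6, 0.9]]
b_o = [0.1, -0.2]
h_prev = [0.3, -0.5]
x = [1.0, 0.5]
s = [0.7, -1.2]

o = output_gate(h_prev, x, W_o, U_o, b_o)

# Analytic Jacobian along the path: Diag(sigmoid(s) * o') W_o, with o' = o * (1 - o)
analytic = [
    [sigmoid(s[i]) * o[i] * (1.0 - o[i]) * W_o[i][j] for j in range(2)]
    for i in range(2)
]

# Central finite-difference Jacobian of h(t) with respect to h(t-1)
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for j in range(2):
    plus = list(h_prev)
    minus = list(h_prev)
    plus[j] += eps
    minus[j] -= eps
    h_plus = hidden_state(plus, x, s, W_o, U_o, b_o)
    h_minus = hidden_state(minus, x, s, W_o, U_o, b_o)
    for i in range(2):
        numeric[i][j] = (h_plus[i] - h_minus[i]) / (2.0 * eps)

max_error = max(
    abs(analytic[i][j] - numeric[i][j]) for i in range(2) for j in range(2)
)
print(max_error)
```

In a full LSTM s(t) also depends on h(t-1), so this covers only the one path through the output gate. With s(t) held fixed, Diag(sigmoid(s(t))) is the local factor from h(t) = o(t) * sigmoid(s(t)), and the o’(t) term shows up once the chain continues from o(t) back to h(t-1).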

Thanks @Ishvinder, I definitely missed this.