Error in vectorized GD algorithms notebooks

In the notebook 0407_VectorizedGDAlgorithms, the fit function has two logical errors in the NAG algorithm.

The code in the notebook is:

for i in range(1,self.num_layers+1):
    self.update_params["v_w"+str(i)]=gamma*self.prev_update_params["v_w"+str(i)]
    self.update_params["v_b"+str(i)]=gamma*self.prev_update_params["v_b"+str(i)]
    temp_params["W"+str(i)]=self.params["W"+str(i)]-self.update_params["v_w"+str(i)]
    temp_params["B"+str(i)]=self.params["B"+str(i)]-self.update_params["v_b"+str(i)]
self.grad(X,Y,temp_params)
for i in range(1,self.num_layers+1):
    self.update_params["v_w"+str(i)] = gamma *self.update_params["v_w"+str(i)] + eta * (self.gradients["dW"+str(i)]/m)
    self.update_params["v_b"+str(i)] = gamma *self.update_params["v_b"+str(i)] + eta * (self.gradients["dB"+str(i)]/m)
    self.params["W"+str(i)] -= eta * (self.update_params["v_w"+str(i)])
    self.params["B"+str(i)] -= eta * (self.update_params["v_b"+str(i)])

Note that v_w and v_b have already been multiplied by gamma before the look-ahead, so they should not be multiplied by gamma again in the actual weight update after the gradients are computed at W_temp, b_temp.

Secondly, eta has already been applied once when setting self.update_params, so it should not be applied again when updating self.params.
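
A minimal sketch of what the update could look like with both points applied, assuming the same dictionary keys as the notebook ("W1", "B1", "v_w1", "v_b1", "dW1", "dB1"). The function name nag_step and the grad_fn callback are hypothetical stand-ins for the class method and self.grad, not code from the notebook.

```python
def nag_step(params, prev_v, grad_fn, num_layers, m, gamma=0.9, eta=0.1):
    """One NAG update over all layers (illustrative sketch).

    grad_fn is a hypothetical stand-in for self.grad: it receives the
    look-ahead parameters and returns a dict with keys "dW1", "dB1", ...
    Values may be plain floats or NumPy arrays.
    """
    v, temp_params, new_params = {}, {}, {}

    # Decay the history once and use it to form the look-ahead point.
    for i in range(1, num_layers + 1):
        v["v_w" + str(i)] = gamma * prev_v["v_w" + str(i)]
        v["v_b" + str(i)] = gamma * prev_v["v_b" + str(i)]
        temp_params["W" + str(i)] = params["W" + str(i)] - v["v_w" + str(i)]
        temp_params["B" + str(i)] = params["B" + str(i)] - v["v_b" + str(i)]

    # Gradients are evaluated at the look-ahead point.
    grads = grad_fn(temp_params)

    # Reuse the already decayed history: no second gamma, no second eta.
    for i in range(1, num_layers + 1):
        v["v_w" + str(i)] = v["v_w" + str(i)] + eta * (grads["dW" + str(i)] / m)
        v["v_b" + str(i)] = v["v_b" + str(i)] + eta * (grads["dB" + str(i)] / m)
        new_params["W" + str(i)] = params["W" + str(i)] - v["v_w" + str(i)]
        new_params["B" + str(i)] = params["B" + str(i)] - v["v_b" + str(i)]

    return new_params, v


# Toy usage: one "layer" with loss 0.5*W^2 + 0.5*B^2, so the gradient
# at any point is just the parameter value itself.
def toy_grads(temp_params):
    return {"dW1": temp_params["W1"], "dB1": temp_params["B1"]}

new_params, new_v = nag_step(
    params={"W1": 1.0, "B1": 2.0},
    prev_v={"v_w1": 0.5, "v_b1": 0.0},
    grad_fn=toy_grads,
    num_layers=1,
    m=1,
)
```

The returned new_v would play the role of self.prev_update_params in the next iteration.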

Hi @databaaz,
As per the theory lecture, Intuition behind NAG, the update rules are correct. I would suggest going through it once again. If there is still some confusion on this, please let me know.
[Screenshot: SmartSelect_20200510-114052_Guvi]

Alright, got your point.
Actually, I was referring to the code in the notebook for the single neuron (0407_GDAlgorithms), and there the update rule does not match the theory in the lecture.
Check this:

elif self.algo == 'NAG':
      v_w, v_b = 0, 0
      for i in range(epochs):
        dw, db = 0, 0
        v_w = gamma * v_w
        v_b = gamma * v_b
        for x, y in zip(X, Y):
          dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)
          db += self.grad_b(x, y, self.w - v_w, self.b - v_b)
        v_w = v_w + eta * dw
        v_b = v_b + eta * db
        self.w = self.w - v_w
        self.b = self.b - v_b
        self.append_log()

Here, v_w and v_b are not multiplied by gamma again in the final update step.

Also, could you please comment on the learning rate eta being multiplied twice: once with the gradients dW and dB, and then again in the final update with v_w and v_b?

Dear @Ishvinder

I’d suggest you look at the code once again. The second multiplication by gamma does not even achieve the theory explained in the video.
Our new v_w is supposed to be gamma*prev_v_w + eta*dW.
We already compute gamma*prev_v_w in the first step, so we just have to reuse this modified v_w and add eta*dW to it to get the final v_w, which is then equal to gamma*prev_v_w + eta*dW.
What we are currently doing is equivalent to v_t = gamma*gamma*v_(t-1) + eta*dW.
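
To make the difference concrete, here is a small scalar sketch comparing the two recurrences. The function names are made up for illustration, and the gradient is held fixed as a simplifying assumption (in actual NAG it is evaluated at the look-ahead point).

```python
def velocity_intended(prev_v, grad, gamma, eta):
    # Decay the history once, then add the scaled gradient.
    v = gamma * prev_v
    return v + eta * grad


def velocity_double_gamma(prev_v, grad, gamma, eta):
    # Decay the history before the look-ahead...
    v = gamma * prev_v
    # ...and decay it again in the update, as in the quoted vectorized loop.
    return gamma * v + eta * grad


gamma, eta = 0.9, 0.1
prev_v, grad = 1.0, 2.0

v_ok = velocity_intended(prev_v, grad, gamma, eta)       # about 1.1
v_bad = velocity_double_gamma(prev_v, grad, gamma, eta)  # about 1.01
print(v_ok, v_bad)
```

With gamma = 0.9 the history term shrinks by 0.81 instead of 0.9 per step, so the momentum contribution is weaker than the lecture's rule would give.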

Ok, I’ll go through it, and let you know.

Hi,
The multiplication by gamma and eta in the 2nd loop doesn’t make sense. We already multiply by gamma in the first loop while initializing the history (v_w), and we then use that history in the second loop, so multiplying by gamma a second time means gamma is applied twice. Similarly, the second multiplication by eta doesn’t make sense either.
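
The effect of the second eta can be sketched the same way. This is a scalar illustration with hypothetical function names and zero history assumed, so that only the gradient term is visible.

```python
def step_single_eta(grad, eta, decayed_v=0.0):
    # eta is applied once, when the gradient is folded into the history.
    v = decayed_v + eta * grad
    return v  # amount subtracted from the parameter


def step_double_eta(grad, eta, decayed_v=0.0):
    v = decayed_v + eta * grad
    # eta is applied again in the parameter update, as in the quoted loop.
    return eta * v


eta, grad = 0.1, 2.0
step_once = step_single_eta(grad, eta)   # eta * grad, about 0.2
step_twice = step_double_eta(grad, eta)  # eta**2 * grad, about 0.02
print(step_once, step_twice)
```

With eta = 0.1, the gradient is effectively scaled by 0.01, so each parameter step is ten times smaller than intended.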
