Here I trained three identical models with the same set of hyperparameters, the only difference being vectorization.

The non-vectorized model achieves the highest accuracy of the three.

Also, during training, the other two models flatten out at a slightly higher loss.

I don’t have a specific question, but what could be the reason behind this variation?

If vectorisation is the **only difference**, the loss plots should be the same, and the only difference should be in the time taken to run the training.
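To illustrate the point, here is a minimal sketch (not the code from the notebook) that computes the same loss once with explicit loops and once with NumPy. The single sigmoid neuron, the mean squared error loss, and all names (`X`, `y`, `w`, `b`) are assumptions made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 4 samples, 2 features, binary targets.
rng = np.random.RandomState(0)
X = rng.randn(4, 2)
y = np.array([0.0, 1.0, 1.0, 0.0])
w = rng.randn(2)   # same weights used by both versions
b = 0.1

# Non-vectorized: loop over samples and over features.
loss_loop = 0.0
for n in range(X.shape[0]):
    z = 0.0
    for i in range(X.shape[1]):
        z += w[i] * X[n, i]
    pred = sigmoid(z + b)
    loss_loop += (pred - y[n]) ** 2
loss_loop /= X.shape[0]

# Vectorized: the same computation as one matrix-vector product.
loss_vec = np.mean((sigmoid(X @ w + b) - y) ** 2)

# The two values should agree up to floating-point rounding.
losses_agree = bool(np.isclose(loss_loop, loss_vec))
print(loss_loop, loss_vec, losses_agree)
```

If a check like this fails on the real network, the two implementations are not doing the same maths somewhere.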

Apart from hyperparameters, you can check how the w’s and b’s are initialised. If they are initialised randomly, are the random numbers assigned the same in every run (controlled through np.random.seed)?
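A quick way to check this, as a sketch only: the `init_params` helper below is hypothetical and stands in for whatever initialisation the networks actually use. It assumes a 2x2 weight matrix drawn with `np.random.randn` after calling `np.random.seed`.

```python
import numpy as np

def init_params(seed, n_in=2, n_hidden=2):
    """Hypothetical initialiser: seeds NumPy's global RNG, then draws W1."""
    np.random.seed(seed)
    W1 = np.random.randn(n_in, n_hidden)
    b1 = np.zeros((1, n_hidden))
    return W1, b1

W1_a, b1_a = init_params(seed=0)
W1_b, b1_b = init_params(seed=0)   # same seed as the first call
W1_c, b1_c = init_params(seed=1)   # different seed

# Same seed should reproduce the weights; a different seed should not.
same_seed_match = bool(np.array_equal(W1_a, W1_b))
different_seed_match = bool(np.array_equal(W1_a, W1_c))
print(same_seed_match, different_seed_match)
```

Note that the seed has to be set right before each initialisation; seeding once and then initialising several networks in a row gives each of them different numbers.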

Yes, but I used the same weight and bias matrices to initialize all the networks. Yet I still see the same issue.

Can you share a link to the notebook?

I observed that the **weight updates between the input and hidden layer were different even after 1 epoch** (self.W1 in the vectorised version and the corresponding weights in the non-vectorised one).

Having said that, I couldn’t spot the difference in the code in a quick comparison between the vectorised and non-vectorised implementations (both looked like they were doing the same steps). Since the vectorised implementation gives a poor result (~70%), I think it might have an issue.

I will look at the forward_pass and grad functions more closely later when time permits.

@Siddhant_Jain

I ran the code with epoch=1 and compared the values of `dW1` with `dw1, dw2, dw3 and dw4`, and how they update the corresponding weights. This led me to the observation below:

dW1, a 2x2 matrix used in the following calculation, might be the transpose of what it actually should be:

` self.W1 = self.W1 - learning_rate * (dW1/m)`
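One way to test the transpose suspicion is to build the gradient weight by weight and compare it with the vectorised matrix. This is a sketch under assumptions, not the notebook's code: it assumes the forward pass is `A1 = X @ W1 + b1` with `W1` of shape (inputs, hidden), and `dA1` is random stand-in data for the gradient at the hidden pre-activations.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 2)      # 5 samples, 2 input features (made-up data)
dA1 = rng.randn(5, 2)    # stand-in for dLoss/dA1, one row per sample

# Non-vectorized reference: one scalar gradient per weight, summed over samples.
dW1_loop = np.zeros((2, 2))
for n in range(X.shape[0]):
    for i in range(2):        # input index
        for j in range(2):    # hidden index
            dW1_loop[i, j] += X[n, i] * dA1[n, j]

dW1_vec = X.T @ dA1        # consistent with A1 = X @ W1 + b1
dW1_swapped = dA1.T @ X    # also 2x2, so no shape error, but transposed

matches_reference = bool(np.allclose(dW1_vec, dW1_loop))
swapped_is_transpose = bool(np.allclose(dW1_swapped, dW1_loop.T))
swapped_matches_reference = bool(np.allclose(dW1_swapped, dW1_loop))
print(matches_reference, swapped_is_transpose, swapped_matches_reference)
```

Because the layer is 2x2, a transposed gradient has the right shape and raises no error, so the only symptom would be the kind of slightly worse training described above.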

I didn’t check any further. It will be easier for you to review the code (I found it time-consuming :P)