I am using the simple FFNetworkW1 class to train the network on just the weight w1. Contrary to what the video lecture shows, and using the same code, my weight increases as the MSE decreases. I am enclosing a screenshot of the plots. Is there an explanation for why this might be happening? I can also see that my w1 ends up at a much lower value than the one shown in the video lecture. Could this be caused by choosing a learning_rate of 1 (I was just running the code with its default hyper-parameter configuration)?
I have now used a lower learning rate (0.5). In this new plot (enclosed again), w1 seems to start from a higher value and decrease. It still seems to settle asymptotically at a higher value than when I used a learning rate of 1. My question now is: why doesn't the weight keep decreasing, as the loss does, instead of increasing for some learning_rate values?
The objective is to reduce the loss over some number of epochs, not to reduce the parameters (weights) associated with the features. A weight only expresses how a feature contributes: a negative weight conveys negative importance for that feature, a positive weight conveys positive importance, and a zero weight conveys no importance.
Therefore, we can say that a better initialization of the parameters is essential for obtaining a minimal loss. An analogy for choosing the best initialization of the weights is looking out from the highest peak of a hill (no other peak on that hill is higher), from where we can see the valley, which is the lowest point (the lowest loss). The weights at that lowest point can take any value.
Adding to what @4deep.prk has already mentioned.
Comparing the two plots you shared, the loss seems to reach its minimum value when w1 is close to 1.5.
So, in your case, the loss approaches its minimum as w1 approaches ~1.5.
If your initial value of w1 is higher than ~1.5, it will decrease as the loss reduces. If it is lower than ~1.5, it will increase over the iterations so that the loss gets smaller.
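In symbols, assuming the class trains with plain gradient descent (an assumption about the course code, with $\eta$ standing for the learning rate and $L$ for the loss), each update moves w1 against the sign of the gradient:

$$ w_1 \leftarrow w_1 - \eta \, \frac{\partial L}{\partial w_1} $$

Where the loss slopes upward in w1 (positive gradient) the update lowers w1, and where it slopes downward (negative gradient) the update raises it. The loss goes down in both cases.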
Basically, the loss is minimal for a certain value of w1, and you want to get there. Depending on your initial value of w1, you might have to increase or decrease it to reach that optimum. The learning rate has little say in this; it only decides how soon you reach that value, although the step size may also change the path taken.
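Here is a minimal sketch of this behaviour. It uses a toy one-weight linear model, not the course's FFNetworkW1 class. The data, the function name train_w1 and the optimum of 1.5 are all assumptions chosen for illustration.

```python
def train_w1(w_init, learning_rate, epochs=100):
    """Gradient descent on a single weight w for the toy model y_hat = w * x."""
    # Hypothetical data generated by y = 1.5 * x, so the MSE is minimal at w = 1.5.
    xs = [0.0, 0.5, 1.0, 1.5]
    ys = [1.5 * x for x in xs]

    w = w_init
    history = []  # (weight, loss) recorded before each update
    for _ in range(epochs):
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        history.append((w, loss))
        w -= learning_rate * grad  # step against the gradient
    return w, history


# Same optimum, different starting points and learning rates.
for w_init in (0.5, 3.0):
    for lr in (0.1, 0.5):
        w_final, history = train_w1(w_init, lr)
        direction = "increased" if w_final > w_init else "decreased"
        print(f"start={w_init}, lr={lr}: w1 {direction} to {w_final:.4f}, "
              f"loss {history[0][1]:.4f} -> {history[-1][1]:.2e}")
```

Starting below the optimum, the weight rises while the loss falls. Starting above it, both fall. Changing the learning rate in this sketch changes how quickly w1 settles, not where it settles.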