In the squared error loss function, say each term comes to about 0.1 or more.

Then 100 such records would give a loss of around 10. How is it ever made 0?

The actual formula for the squared error loss is ( Σ_i (y_i − ŷ_i)² ) / N, where ŷ_i is the predicted value for record i.
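For concreteness, here is a minimal sketch in Python of the two versions of the loss, with and without the division by N. It assumes NumPy, and the arrays are made-up values chosen so that every squared term is about 0.1; they are not data from the course.

```python
import numpy as np

N = 100

# Hypothetical data: each prediction misses its target by sqrt(0.1),
# so every squared-error term is approximately 0.1.
y_true = np.zeros(N)
y_pred = np.full(N, np.sqrt(0.1))

# Squared error summed over all records (no division by N).
sum_squared_error = np.sum((y_pred - y_true) ** 2)

# The same quantity divided by N (the mean squared error).
mean_squared_error = sum_squared_error / N

print("without /N:", sum_squared_error)
print("with /N:   ", mean_squared_error)
```

The summed version comes out near 10 and the averaged version near 0.1, so the two differ only by the factor N.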

Why do we ignore the division by N all the time?

Won't this affect the final loss value, or grad_w and grad_b (which we calculate based on the loss)?

Omitting N does not really make much of a difference, for the following reasons:

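- N is a constant for a given dataset, so dividing by it only scales the loss by 1/N. It does not change the shape of the loss function or the values of w and b at which the loss is smallest.
- grad_w and grad_b are both scaled by the same factor 1/N, so the direction of each update is unchanged and only the step size differs, which the learning rate can absorb. Including or excluding N should therefore not make a considerable difference to the final w and b. If you prefer, keeping N in the equation is no problem, since it is the complete form of the expression.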
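The point about grad_w and grad_b can be checked with a small sketch. This one assumes a simple linear model y_hat = w*x + b rather than whatever model the course uses, and the data, learning rates, and function names are all made up for illustration. It trains once with the division by N and once without, giving the second run a learning rate that is N times smaller.

```python
import numpy as np

def gradients(w, b, x, y, divide_by_n):
    """Squared-error gradients for the hypothetical model y_hat = w*x + b."""
    error = (w * x + b) - y
    grad_w = np.sum(2 * error * x)
    grad_b = np.sum(2 * error)
    if divide_by_n:
        grad_w /= len(x)
        grad_b /= len(x)
    return grad_w, grad_b

def train(x, y, lr, divide_by_n, epochs=200):
    """Plain gradient descent starting from w = 0, b = 0."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w, grad_b = gradients(w, b, x, y, divide_by_n)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5
N = len(x)

# Run 1: loss divided by N, learning rate 0.1.
w_mean, b_mean = train(x, y, lr=0.1, divide_by_n=True)

# Run 2: loss not divided by N, learning rate scaled down by N.
w_sum, b_sum = train(x, y, lr=0.1 / N, divide_by_n=False)

print("with /N:   ", w_mean, b_mean)
print("without /N:", w_sum, b_sum)
```

Both runs should land on essentially the same w and b, because dropping N multiplies every gradient by N and the smaller learning rate cancels that out. If the learning rate were kept the same in both runs, the version without N would take steps N times larger, so in practice the learning rate may need to be lowered when N is omitted.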

This is as far as my understanding goes.

@Ishvinder sir, kindly correct me if I am wrong.

Thanks.

