In Deep Learning/Machine Learning, we are trying to find the function which will correctly (approximately) maps the given data. In this process, we are using the Gradient Descent algorithm to minimize the Loss function by updating the values of W and B.

From Calculus, we know that the first-order derivative of any function equating to 0 gives the values of the parameters of that function such that the function is minimum at that values.

Why we are using the Gradient Descent algorithm? Any problems using the first-order derivation?

After finding dw and equating it to 0, we can get the w value such that the function is optimal,.But instead, we are trying to update w = w - alpha*dw to find the value of w why?

Good question. So basically, as for using partial derivatives (not total derivative), the reasons are the following:

Though theoretically one can compute the total derivatives and solve for weights (given that train data is fixed), practically it’s intractable.

For a basic neural network, say a network to calculate XOR (2 layers, 2 neurons per layer, data is small and fixed), one can easily compute total derivatives and solve for the parameters with the data as base-cases.

For even a slightly more complicated NNs, we see that solving for weights given large amount of data and parameters easily becomes very messy when you use total derivatives.

There can also be many solutions for a given data or no solution (if the data is large or noisy respectively)

We see that it is computationally intractable as number of parameters increase and many not also get a perfect solution as data increases/becomes noisy. This is where gradient descent gradient comes to the rescue.

We use partial derivatives in GD because computing total derivative is extremely tangled even for small NNs, and approximating it with partial derivatives should be enough to solve for a convex function even to obtain a global optimal solution.

Gradient descent is a faster approach for complicated networks and it also prevents over fitting as it is a non deterministic solution. Having said that there has been a study of deterministic methods to solve weight and biases. Please refer to the paper below.

Edit There is a correction gradient descent is deterministic algorithm but what i meant is that you can stop the algorithm before it reaches the global minima to get multiple solutions based on your error on the validation set. The same cant be done when you are solving a set of first order equations to determine the weight and biases. Also you would need an equation for each of the parameters making the system of equations a huge task to solve. And you may reach the global minima but the solution may not be useful and in most cases an overfit solution.

I have doubt on what non deterministic solution means in Gradient Descent. Isn’t Gradient Descent deterministic? As in the case of gradient descent algorithm, don’t we arrive at the same optimum solution every-time as it iterates over entire data-set to find the minimum loss?

i am sorry gradient descent is deterministic but but you may stop it before it converges to the minimum loss depending upon the error on your validation set. In a way you have multiple options for the weights and biases to work with which is not true when you solve a set of first derivative equations to arrive at the solutions.