Clarification regarding Mini Batch in Optimization Algorithms

Hi, I just finished watching the video where mini-batch is tested performance-wise against its predecessors (GD, Momentum, etc.) in code for a given set of points. However, when I implemented GD on a pre-provided set of points, it performed just as well as the mini-batch version with batch size = 6. So where is the real difference in their performance?

Can you share how you measured performance?

In regular (or full-batch) gradient descent, weights are updated after calculating the loss over the entire training set.
In mini-batch, weights are updated more frequently, because only part of the training set is used in each iteration.
This means there is less computation per weight update. Since each update requires fewer calculations, updates are faster, so you may move toward the optimum more quickly.

Note that since mini-batch uses only part of the training data per update, you may need more steps to reach the optimum, since individual steps are not necessarily in the most optimal direction. Basically, each individual step becomes faster, but the total number of steps grows. So it's a tradeoff, but as I understand it, mini-batch generally wins.
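To make the tradeoff concrete, here is a minimal sketch (all names and numbers are illustrative, not from the lecture code): fitting y = w*x + b with squared-error loss, full batch does one expensive update per epoch while mini-batch does many cheap ones.

```python
import numpy as np

# Illustrative toy data: y = 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 60)
Y = 3.0 * X + 1.0 + rng.normal(0.0, 0.1, 60)

def gradients(x, y, w, b):
    # dL/dw and dL/db for mean squared error over one batch
    err = (w * x + b) - y
    return 2 * np.mean(err * x), 2 * np.mean(err)

def train(batch_size, lr=0.1, epochs=50):
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            gw, gb = gradients(X[i:i+batch_size], Y[i:i+batch_size], w, b)
            w, b = w - lr * gw, b - lr * gb
            updates += 1
    return w, b, updates

# Full batch: one update per epoch  -> 50 updates total.
# Mini-batch (size 6): ten updates per epoch -> 500 updates total.
w_full, b_full, n_full = train(batch_size=60)
w_mini, b_mini, n_mini = train(batch_size=6)
print(n_full, n_mini)  # 50 500
```

Each mini-batch update touches only 6 points instead of 60, so an update is cheaper, but each step is computed from noisier gradient information.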


Well, I had a total of six data points for X and Y. For those points, I first applied mini-batch with batch sizes in increasing order, and at batch size = 6 its performance was best. Then, when I moved to GD, its performance was equal to that of mini-batch with batch size = 6. That confused me: if the best performance of mini-batch equals that of GD, where does mini-batch stand out? In the lectures it is said that mini-batch is just another version of GD, known as "Stochastic GD". By performance, I mean the behavior shown when plotting both algorithms on a 2D (contour plot only) or 3D (both contour and surface plot) surface.

I don’t have a clear answer, and I’m not sure how performance is being measured in that code.
I’m listing my understanding of the general concept; hopefully it helps.

Mini-batch does the w, b updates faster, so better performance for mini-batch means faster computation for each individual update of w and b.

Gradient descent (regular/full-batch) and SGD are just special cases of mini-batch GD. If the batch size consists of all the examples, it becomes regular GD; if the batch size is 1, it becomes SGD.
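A tiny sketch of that point (the function name is illustrative): list which example indices each update uses, for a given batch size. With six points, batch size 6 uses every point in a single update, which is exactly regular GD and explains why the two curves matched.

```python
def batches(n_examples, batch_size):
    # Indices of the examples used by each successive weight update.
    return [list(range(i, min(i + batch_size, n_examples)))
            for i in range(0, n_examples, batch_size)]

# With the six points from the question:
print(batches(6, 6))  # [[0, 1, 2, 3, 4, 5]]            -> one full-batch update: regular GD
print(batches(6, 1))  # [[0], [1], [2], [3], [4], [5]]  -> six single-example updates: SGD
print(batches(6, 2))  # [[0, 1], [2, 3], [4, 5]]        -> three mini-batch updates
```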

If by performance you mean the number of epochs (which I think you are looking at), that is not a good measure for comparing GD vs. mini-batch GD.
If by performance you mean the amount of time taken (not epochs) to get close to the optimum w and b, mini-batch GD will be faster.
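Here is a hedged sketch of that measurement (all names and numbers are illustrative): time how long each method takes to push the loss below a threshold, rather than counting epochs. On a problem this small the gap may be negligible; it shows up with larger datasets.

```python
import time
import numpy as np

# Illustrative toy data, larger than the six-point example.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 10_000)
Y = 3.0 * X + 1.0 + rng.normal(0.0, 0.1, 10_000)

def loss(w, b):
    # Mean squared error over the whole dataset.
    return float(np.mean(((w * X + b) - Y) ** 2))

def time_to_threshold(batch_size, lr=0.1, threshold=0.02, max_epochs=200):
    # Wall-clock time until the full-dataset loss drops below the threshold.
    w, b = 0.0, 0.0
    start = time.perf_counter()
    for _ in range(max_epochs):
        for i in range(0, len(X), batch_size):
            xb, yb = X[i:i+batch_size], Y[i:i+batch_size]
            err = (w * xb + b) - yb
            w -= lr * 2 * np.mean(err * xb)
            b -= lr * 2 * np.mean(err)
        if loss(w, b) < threshold:
            break
    return time.perf_counter() - start, loss(w, b)

print("full batch :", time_to_threshold(batch_size=len(X)))
print("mini batch :", time_to_threshold(batch_size=32))
```

The idea is that both reach the same neighborhood of the optimum; the interesting comparison is the clock, not the epoch counter.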
