## Batch Gradient Descent –

In Batch Gradient Descent we use the whole training set to compute the gradient at every step, which makes it slow when the training set is large.
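As a rough sketch, a Batch Gradient Descent loop for linear regression might look like the following (NumPy, with toy data and illustrative hyperparameter values that are assumptions, not from the text):

```python
import numpy as np

# Toy linear-regression data: y = 4 + 3x plus Gaussian noise (illustrative).
rng = np.random.default_rng(42)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))

X_b = np.c_[np.ones((m, 1)), X]   # prepend a bias column of ones

eta = 0.1          # learning rate (assumed value)
n_epochs = 1000

theta = rng.normal(size=(2, 1))   # random initialization
for epoch in range(n_epochs):
    # The MSE gradient is computed over ALL m training instances at
    # every single step -- this full pass is what makes it slow on
    # large training sets.
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

print(theta.ravel())   # should land near the true parameters (4, 3)
```

Note that each step is one vectorized pass over the entire dataset, so the cost per step grows linearly with the number of training instances.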

## Stochastic Gradient Descent –

The opposite of Batch Gradient Descent is Stochastic Gradient Descent. SGD picks a random instance from the training set at every step and computes the gradient based only on that single instance.
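A minimal SGD sketch on the same kind of toy linear-regression problem (NumPy; the fixed learning rate and epoch count are assumptions for illustration):

```python
import numpy as np

# Toy data: y = 4 + 3x plus Gaussian noise (illustrative).
rng = np.random.default_rng(0)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]

eta = 0.01         # small fixed learning rate (assumed value)
n_epochs = 100

theta = rng.normal(size=(2, 1))
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)               # pick ONE random instance
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        # Gradient estimated from that single instance only, so each
        # update is cheap but noisy.
        gradients = 2 * xi.T @ (xi @ theta - yi)
        theta -= eta * gradients

print(theta.ravel())   # close to (4, 3), but not exactly at the minimum
```

Because each update uses one noisy gradient estimate, the parameters keep jittering around the minimum rather than settling exactly on it, which matches the behavior described below.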

The benefit is that SGD can train on very large datasets quickly. The downside is that, due to its stochastic (i.e. random) nature, the algorithm is much less regular than Batch Gradient Descent. Instead of gently decreasing until it reaches the minimum, the cost function bounces up and down, decreasing only on average. Over time it ends up very close to the minimum, but once it gets there it continues to bounce around and never settles down. So once the algorithm stops, the final parameter values are good, but not optimal.

But when the cost function is very irregular, this bouncing can actually help the algorithm jump out of local minima. So Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent.

## Mini-Batch Gradient Descent –

At each step, instead of computing the gradient on the full training set (as in Batch Gradient Descent) or on just one instance (as in Stochastic Gradient Descent), Mini-Batch Gradient Descent computes the gradient on small random sets of instances called mini-batches. The main advantage of mini-batch over Stochastic Gradient Descent is that you get a performance boost from hardware-optimized matrix operations, especially when using GPUs.
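A Mini-Batch Gradient Descent sketch on the same toy setup (NumPy; the batch size, learning rate, and epoch count are assumed values chosen for illustration):

```python
import numpy as np

# Toy data: y = 4 + 3x plus Gaussian noise (illustrative).
rng = np.random.default_rng(7)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]

batch_size = 32    # assumed mini-batch size
eta = 0.05         # assumed learning rate
n_epochs = 100

theta = rng.normal(size=(2, 1))
for epoch in range(n_epochs):
    perm = rng.permutation(m)             # shuffle instances once per epoch
    for start in range(0, m, batch_size):
        batch = perm[start:start + batch_size]
        xb, yb = X_b[batch], y[batch]
        # One vectorized matrix product per mini-batch: this is where
        # hardware-optimized matrix operations (BLAS libraries, GPUs)
        # give the performance boost over one-instance SGD.
        gradients = 2 / len(batch) * xb.T @ (xb @ theta - yb)
        theta -= eta * gradients

print(theta.ravel())   # near the true parameters (4, 3)
```

Averaging the gradient over a mini-batch also makes each update less noisy than pure SGD, so the walk toward the minimum is steadier while still being much cheaper per step than a full-batch pass.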