Standardization (When?)

Hi,

This blog post says that we should use standardization only when the data is normally distributed:
https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c
I think that is wrong, since standardization actually changes the data distribution to normal. And because standardization changes the data distribution (to normal), unlike normalization, we should use standardization when the data distribution isn’t important for training. Which one is correct?

Pls help! Thanks in advance.

Hi,

As far as I understand, standardization is usually performed to scale the data. According to the central limit theorem, whatever the distribution of your data, if you repeatedly draw samples of size greater than about 30, the distribution of the sample means will be close to normal. So we only need to worry about when to standardize, irrespective of the distribution of the data.
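
To make the central limit theorem point concrete, here is a minimal sketch. It assumes exponentially distributed data, a sample size of 30, and 5,000 repetitions; all of these are picked only for illustration. It compares the skewness of the raw values with the skewness of the sample means. The sample means come out much more symmetric, while the raw values stay skewed.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in values])


random.seed(0)
SAMPLE_SIZE = 30   # assumed sample size, per the "greater than 30" rule of thumb
N_SAMPLES = 5_000  # assumed number of repeated experiments

# Raw draws from a strongly right-skewed distribution.
raw = [random.expovariate(1.0) for _ in range(N_SAMPLES)]

# The mean of each repeated sample of size SAMPLE_SIZE.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

print("skewness of raw values:  ", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
```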

Standardization should be done when you are working with models whose math depends on the scale of the features, such as linear regression, logistic regression, and ANNs. Decision trees and random forests do not require standardization, because they work as a series of if/else conditions on the feature values.
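
A rough sketch of why threshold-based splits do not care about scaling. The `best_split` helper and the toy data are made up for illustration. This is a hand-rolled single split, not any library's tree implementation. Standardizing keeps the order of the values, so the best split separates exactly the same rows before and after.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def best_split(xs, ys):
    """Return the row indices sent to the left by the best threshold split.

    "Best" means the lowest summed squared error when each side
    predicts the mean of its own targets.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_err, best_left = float("inf"), None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        if xs[left[-1]] == xs[right[0]]:
            continue  # no threshold fits between two equal values
        err = 0.0
        for side in (left, right):
            side_mean = statistics.fmean([ys[i] for i in side])
            err += sum((ys[i] - side_mean) ** 2 for i in side)
        if err < best_err:
            best_err, best_left = err, frozenset(left)
    return best_left


# Hypothetical toy data: one feature on a large scale, one binary target.
xs = [1200.0, 300.0, 4500.0, 800.0, 5200.0, 250.0]
ys = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]

rows_raw = best_split(xs, ys)
rows_scaled = best_split(standardize(xs), ys)
print(sorted(rows_raw), sorted(rows_scaled))
```

Both calls select the same rows, which is the sense in which an if/else split is unaffected by standardization.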

Standardization does not change the data distribution to normal. It only shifts and rescales the values, so that each value expresses how far the data point is from the mean, i.e. how many standard deviations away it lies.
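
A quick way to check this is the minimal sketch below. It assumes exponentially distributed data, chosen only because it is clearly not normal. After standardization the mean is 0 and the standard deviation is 1, but the skewness is the same as before, so the shape of the distribution has not moved toward normal.

```python
import random
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in values])


random.seed(0)
# Right-skewed, clearly non-normal data (an assumption for the demo).
data = [random.expovariate(1.0) for _ in range(10_000)]
z = standardize(data)

print("mean and std after: ", round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
print("skewness before:    ", round(skewness(data), 3))
print("skewness after:     ", round(skewness(z), 3))
```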

When standardization is applied to normally distributed data, the result follows the standard normal distribution. The original data followed a normal distribution with mean (mu) and standard deviation (sigma), regardless of whether standardization is applied later.

Hence, there is no basis for a rule such as "apply standardization when the data distribution is not important".