To understand the data better and improve the accuracy of the Kaggle MP neuron model (mobile like/dislike dataset), I am going through each feature of the training dataset individually to check whether it follows a Gaussian distribution. I am using box plots to spot outliers and histograms to check for a Gaussian shape. Is this the right way, or is there a better way to approach the problem?

My question is: do I need to go through each attribute in the dataset and apply standardization only where the attribute follows a Gaussian distribution, or can I ignore the distribution and simply rely on normalization, since we use a linear function (the MP neuron model)?
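For reference, the two options being weighed can be sketched as follows. This is a minimal illustration using only the Python standard library, not the contest's reference code; the `battery` column and its values are made up for the example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one large outlier (assumed values).
battery = [1500, 2000, 2500, 3000, 10000]

z = standardize(battery)
mm = min_max_normalize(battery)

print([round(v, 3) for v in z])
print([round(v, 3) for v in mm])
```

Note that with min-max scaling the single outlier pushes the other four values into the lower part of the [0, 1] range, which is one reason to look at outliers before choosing a scaling method.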

Feature scaling matters for most datasets, and your approach is better than blindly normalizing the features. Here is a suggestion from my side: try comparing the performance of models that use normalization against models that use standardization for some of the attributes.

The point I wanted to share is that histograms are not a foolproof way to check whether a distribution is normal. Often a distribution that is only approximately normal still looks normal in a histogram.

You could read about the normal Q-Q plot, or about more formal normality tests.
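As a rough illustration of the Q-Q idea, the sketch below pairs the sorted sample with the quantiles of a standard normal and reports how close the pairs are to a straight line. It uses only the standard library; the function names `qq_points` and `qq_correlation` and the synthetic samples are assumptions made for this example. Libraries such as SciPy offer ready-made versions (`scipy.stats.probplot` for the plot data, `scipy.stats.shapiro` for a formal test).

```python
import random
from statistics import NormalDist, mean


def qq_points(values):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    n = len(values)
    observed = sorted(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed


def qq_correlation(values):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    theo, obs = qq_points(values)
    mt, mo = mean(theo), mean(obs)
    cov = sum((t - mt) * (o - mo) for t, o in zip(theo, obs))
    var_t = sum((t - mt) ** 2 for t in theo)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / (var_t * var_o) ** 0.5


# Synthetic samples (assumed data): one roughly Gaussian, one right-skewed.
rng = random.Random(0)
gaussian_like = [rng.gauss(50, 10) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

r_gauss = qq_correlation(gaussian_like)
r_skew = qq_correlation(skewed)

print(f"Gaussian-like sample: r = {r_gauss:.4f}")
print(f"Skewed sample:        r = {r_skew:.4f}")
```

Plotting `theoretical` against `observed` gives the usual Q-Q plot. Points that bend away from a straight line, especially in the tails, indicate departures from normality that a histogram can hide.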

I came across these while reading some wonderful articles on PsychWiki and Towards Data Science.

Thanks for the approach.