Why should we apply standardization of test data?

Please refer to Module 3, the practical classes. The mentor says that the train and test data need to be standardized separately, because the test data isn't available at training time for the manipulations we apply to the train data. Agreed.
My question is: if the above holds true, why do we have to apply the same standardization to the test data? If we are applying the same standardization anyway, wouldn't it make more sense to do it before splitting the data into train/test sets?

In the ideal case, the training and test data should come from the same distribution, so the split itself should not introduce unnecessary bias. Standardizing on the train data and applying those same statistics to the test data is justified, in my view, because the model sees only the train data and learns its statistics. Standardizing the entire dataset before splitting would let the test data influence those statistics. Hope this helps!
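To make "fit on train, apply to test" concrete, here is a minimal sketch using NumPy. The data, shapes, and 80/20 split are illustrative assumptions, not from the course material; scikit-learn's `StandardScaler` does the same thing via `fit` on train and `transform` on both sets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical single-feature dataset of 100 rows
X = rng.normal(loc=10.0, scale=3.0, size=(100, 1))

# Illustrative split: first 80 rows train, last 20 test
X_train, X_test = X[:80], X[80:]

# Fit the standardization statistics on the TRAIN data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the SAME train statistics to both sets
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

# The train set is exactly standardized (mean 0, std 1);
# the test set is only approximately so, because it did not
# contribute to mu and sigma -- and that is the point.
```

Note that the test set ends up with a mean near, but not exactly, zero; that small mismatch is expected and mirrors how the model will see genuinely new data in production.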

Thanks. Here is what I understand.
Standardization takes the distribution statistics (mean, standard deviation) of a particular feature column to standardize its values. If you standardize after the split, the test data contributes nothing to those statistics. This keeps the train data unbiased by the test data, which I believe mirrors the real-world situation when developing a model: data the model will face later simply doesn't exist yet. Including the test data in the standardization is a form of data leakage, which can produce an overfitting-like situation where evaluation looks better than it should. I hope this is what you meant.
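The leakage point above can be demonstrated numerically: statistics fitted on the combined data differ from statistics fitted on the train data alone, so pre-split standardization silently encodes test information into the training features. The distributions below are hypothetical, chosen so the shift is visible.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical feature where the test data is slightly shifted,
# as often happens with real-world drift
X_train = rng.normal(loc=5.0, scale=2.0, size=200)
X_test = rng.normal(loc=6.0, scale=2.0, size=50)

# Correct: statistics computed from the train data only
mu_train = X_train.mean()

# Leaky: statistics computed from the combined data
# (equivalent to standardizing before the train/test split)
X_all = np.concatenate([X_train, X_test])
mu_all = X_all.mean()

# mu_all is pulled toward the test distribution, so standardizing
# the train set with it leaks test information into training.
print("train-only mean:", mu_train)
print("combined mean:  ", mu_all)
```

The gap between the two means is exactly the information that leaks; in a model pipeline it would bias every standardized training value.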