There are many possibilities. Let’s take a couple of examples using linear regression:
- Assume the data follows a non-linear pattern (red dots in the figure below), but you are trying to fit a linear model (blue line). The linear model will clearly perform badly, because it’s just not a good representation of the data.
In that case you first need to fix the model (e.g. consider a polynomial one) and then retrain it on the same training dataset.
- Another case: let’s assume only the data shown as red dots was available during training, though the green dots are part of the population as well. Here, a linear model is a good fit for the training data (red dots), but the training data is not representative of the population.
In this case, it would be better to resample or recollect the data to improve the quality of the training set. Garbage (data) in, garbage (model) out.
- It’s also possible that the data is good and the model assumption is good, but you haven’t trained enough (e.g. too few iterations of loss minimization).
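To make the three cases concrete, here is a rough NumPy sketch on synthetic data. Everything in it is an assumption made for illustration and not taken from the figure: the quadratic ground truth, the noise levels, the `x < 1` cutoff for the "seen" data, and the helper names `mse` and `gradient_descent`.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

# Case 1: non-linear data, linear model. Fix the model, keep the data.
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(0, 0.3, x.size)          # assumed quadratic pattern
linear_fit = np.polyfit(x, y, deg=1)
quadratic_fit = np.polyfit(x, y, deg=2)
mse_linear = mse(y, np.polyval(linear_fit, x))
mse_quadratic = mse(y, np.polyval(quadratic_fit, x))

# Case 2: the training sample covers only part of the population.
x_pop = np.linspace(0, 3, 300)
y_pop = x_pop**2 + rng.normal(0, 0.05, x_pop.size)
seen = x_pop < 1                               # the "red dots" only
narrow_fit = np.polyfit(x_pop[seen], y_pop[seen], deg=1)
mse_seen = mse(y_pop[seen], np.polyval(narrow_fit, x_pop[seen]))
mse_population = mse(y_pop, np.polyval(narrow_fit, x_pop))

# Case 3: good data, good model, too few training iterations.
def gradient_descent(x, y, steps, lr=0.05):
    """Fit y ~ w * x + b by plain gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y
        w -= lr * 2 * np.mean(err * x)
        b -= lr * 2 * np.mean(err)
    return w, b

x_lin = np.linspace(0, 2, 100)
y_lin = 3 * x_lin + 1 + rng.normal(0, 0.1, x_lin.size)
w_short, b_short = gradient_descent(x_lin, y_lin, steps=5)
w_long, b_long = gradient_descent(x_lin, y_lin, steps=2000)
mse_short = mse(y_lin, w_short * x_lin + b_short)
mse_long = mse(y_lin, w_long * x_lin + b_long)

print("case 1: linear vs quadratic MSE:", mse_linear, mse_quadratic)
print("case 2: seen vs population MSE:", mse_seen, mse_population)
print("case 3: 5 vs 2000 steps MSE:", mse_short, mse_long)
```

In this toy setup, each case has a different remedy: a different model on the same data for case 1, better data for case 2, and simply more training for case 3. Note that in case 2 the error on the seen data alone gives no hint of the problem; it only shows up once you evaluate on data from the rest of the population.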
You can think of other cases. So retraining may require new data, or just fixing the model and retraining on the same data, and so on.
I would think that while learning ML we can trust the available data more often than not. But in real-life scenarios, we may have to give equal thought to the quality of the data.