Machine Learning doubt

I haven’t started ML yet, but I was going through an overview of it. One thing that comes up is retraining the model if it is not performing well.

So do we retrain it on the same dataset we previously used to train it? Do we collect new data? Or do we split our initial dataset into parts, use one part to train it and another part of the full dataset to retrain it? How does this work?

There are many possibilities. Let’s take a couple of examples using linear regression:

  • Assume the data follows a non-linear pattern (red dots in the figure below), but you are trying to fit a linear model (blue line). A linear model will clearly perform badly; it’s just not a good representation of the data.

So you first need to fix your model (e.g., switch to a polynomial one) and then retrain it on the same training dataset.
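A minimal sketch of this case, using made-up quadratic data: fitting a straight line underfits, while refitting a polynomial on the *same* data does much better. The data and degrees here are illustrative assumptions, not from the original post.

```python
import numpy as np

# Hypothetical data with a non-linear (quadratic) pattern.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.3, size=x.size)

# Fit a straight line (degree 1) and a polynomial (degree 2) to the SAME data.
linear_fit = np.polyfit(x, y, deg=1)
poly_fit = np.polyfit(x, y, deg=2)

# Mean squared error of each model on the training data.
mse_linear = np.mean((np.polyval(linear_fit, x) - y) ** 2)
mse_poly = np.mean((np.polyval(poly_fit, x) - y) ** 2)

print(f"linear MSE: {mse_linear:.3f}, polynomial MSE: {mse_poly:.3f}")
```

The point is that nothing about the data changed between the two fits; only the model assumption did.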

  • Another case: let’s assume only the data represented by the red dots was available during training, even though the green dots are part of the population as well. Here, a linear model is a good fit to the training data (red dots), but the training data is not a good representative of the population.
    In this case, it would be better to resample or recollect the data to improve its quality for training. Garbage (data) in, garbage (model) out.
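This case can also be sketched with toy data (my own assumed setup, not from the post): the population is quadratic, but only a narrow slice of it was sampled for training. A line fits that slice well, yet fails badly on the rest of the population.

```python
import numpy as np

rng = np.random.default_rng(1)
# Full population follows y = x**2, but only x < 1 (the "red dots") was sampled.
x_full = np.linspace(0, 4, 200)
y_full = x_full**2 + rng.normal(0, 0.05, size=x_full.size)

mask = x_full < 1
line = np.polyfit(x_full[mask], y_full[mask], deg=1)  # fit on the biased sample

# The line looks fine on the sampled region but generalizes poorly.
mse_sample = np.mean((np.polyval(line, x_full[mask]) - y_full[mask]) ** 2)
mse_population = np.mean((np.polyval(line, x_full) - y_full) ** 2)

print(f"MSE on sampled region: {mse_sample:.4f}")
print(f"MSE on full population: {mse_population:.4f}")
```

No amount of retraining on the red dots alone fixes this; the sample itself has to be improved.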


  • It’s also possible that the data is good and the model assumption is good, but you haven’t performed sufficient training (e.g., too few iterations of loss minimization).
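To make the under-training case concrete, here is a sketch using plain gradient descent on a simple linear model (the data, learning rate, and iteration counts are my own illustrative assumptions): stopping after a handful of steps leaves a high loss, while running to convergence on the same data and model fixes it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
y = 3 * x + 1 + rng.normal(0, 0.1, size=x.size)

def train(n_iters, lr=0.1):
    """Gradient descent on MSE for the model y = w*x + b; returns final MSE."""
    w, b = 0.0, 0.0
    for _ in range(n_iters):
        pred = w * x + b
        grad_w = 2 * np.mean((pred - y) * x)  # dMSE/dw
        grad_b = 2 * np.mean(pred - y)        # dMSE/db
        w -= lr * grad_w
        b -= lr * grad_b
    return np.mean((w * x + b - y) ** 2)

mse_short = train(n_iters=5)    # under-trained: stopped too early
mse_long = train(n_iters=2000)  # trained to (near) convergence

print(f"MSE after 5 iterations: {mse_short:.3f}")
print(f"MSE after 2000 iterations: {mse_long:.4f}")
```

Here nothing changes except the amount of optimization; "retraining" just means letting the same procedure run long enough.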

You can think of other cases. So retraining may require new data, or just fixing the model and retraining on the same data, and so on.
I would say that while learning ML we can trust the available data more often than not. But in real-life scenarios, we have to give equal thought to the quality of the data.