Overfitting issue- Pretrained model- Medical imaging


I am working on an image classification problem with a medical imaging dataset, but the caveat is that the total number of samples is very low (<1000). I am using transfer learning (a pretrained DenseNet-121) and have applied some image augmentation techniques to the training data (random rotation, flipping, etc.). After training for around 60 epochs with a learning rate of 0.001 and a weight decay of 1e-5, the model overfits. (I know that with so few samples this is to be expected.) But even with augmentation and weight decay, the maximum validation accuracy I get is ~77%. Increasing the batch size from 16 to 32 did not give any noticeable improvement either.

How can I solve this?

Some suggested things in addition to these would be:

  1. Using Dropout
  2. L1/L2 Regularization

If that doesn’t improve things, then DenseNet-121 may be too complex an architecture for your data, and trying a simpler one might be the best thing to do.
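
A minimal sketch of both suggestions, assuming a small fully connected classifier head. The layer sizes, the `l1_penalty` helper, and the `l1_lambda` value are made up for illustration and are not taken from the original poster's setup. Dropout is added as a layer, `weight_decay` on the optimizer gives the L2 penalty, and an L1 penalty can be added to the loss by hand:

```python
import torch
import torch.nn as nn

# Hypothetical classifier head with dropout; sizes are illustrative only.
head = nn.Sequential(
    nn.Linear(1024, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training
    nn.Linear(256, 2),
)

def l1_penalty(model):
    """Sum of absolute values of all trainable parameters (hypothetical helper)."""
    return sum(p.abs().sum() for p in model.parameters() if p.requires_grad)

criterion = nn.CrossEntropyLoss()
# weight_decay adds an L2 penalty on the parameters inside the optimizer step.
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, weight_decay=1e-5)

l1_lambda = 1e-5  # assumed strength; would need tuning on the validation set

# Random stand-in batch: 8 feature vectors with binary labels.
x = torch.randn(8, 1024)
y = torch.randint(0, 2, (8,))

head.train()
optimizer.zero_grad()
loss = criterion(head(x), y) + l1_lambda * l1_penalty(head)
loss.backward()
optimizer.step()
```

Both penalty strengths are hyperparameters, so treat the numbers above as placeholders rather than recommendations.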

Thanks for your reply.

I already added weight_decay in the optimizer. Doesn’t that act as L2 regularization? How do I add L1 regularization?
Should I use ResNet-18 or SqueezeNet?


Yes, then probably a lighter/less complex architecture should be able to help.
The very first thing would be to try out with a custom architecture, a basic one.
Don’t use a pretrained one, and check the performance.

Hi @Ishvinder,

So I tried a basic custom architecture like this one:

import torch.nn as nn

class NN(nn.Module):

    def __init__(self):
        super(NN, self).__init__()

        # Convolutional feature extractor (3x3 kernels, no padding)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3),          # (N, 3, 224, 224) -> (N, 32, 222, 222)
            nn.Conv2d(32, 64, 3),         # (N, 32, 222, 222) -> (N, 64, 220, 220)
            nn.MaxPool2d(2, stride=2),    # (N, 64, 220, 220) -> (N, 64, 110, 110)
            nn.Conv2d(64, 64, 3),         # (N, 64, 110, 110) -> (N, 64, 108, 108)
            nn.MaxPool2d(2, stride=2),    # (N, 64, 108, 108) -> (N, 64, 54, 54)
            nn.Conv2d(64, 128, 3),        # (N, 64, 54, 54) -> (N, 128, 52, 52)
            nn.MaxPool2d(2, stride=2),    # (N, 128, 52, 52) -> (N, 128, 26, 26)
            nn.Conv2d(128, 256, 3),       # (N, 128, 26, 26) -> (N, 256, 24, 24)
            nn.MaxPool2d(2, stride=2),    # (N, 256, 24, 24) -> (N, 256, 12, 12)
        )

        # Fully connected classifier on the flattened 256 * 12 * 12 features
        self.classifier = nn.Sequential(
            nn.Linear(36864, 256),        # (N, 36864) -> (N, 256)
            nn.Linear(256, 256),          # (N, 256) -> (N, 256)
            nn.Linear(256, 2),            # (N, 256) -> (N, 2)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)         # flatten to (N, 36864)
        x = self.classifier(x)
        return x

But there seems to be no improvement in either accuracy or loss

Now you can gradually go up, and try to train some complex architectures like ResNet, Inception, ResNext…
But a better way would be to avoid using pretrained models, as they might be the cause of overfitting.

Thanks .

So, what I understood is that I should load a model like ResNet-18 (but without pretrained weights) and then train it from scratch!
Like this?
model_1 = models.resnet18(pretrained=False)

Okay, so after trying many combinations of model choices, when I trained ResNeXt with pretrained = True and increased the learning rate from 0.001 to 0.01, I got a much higher accuracy of 90%.
But the validation loss is well below the training loss, and the validation accuracy rises above the training accuracy over the epochs.
(Image: the loss and accuracy plot)


Have you used a large dropout rate?
Because during training, if a sizeable fraction of the units is turned off, the accuracy will be lower than in the eval phase.
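
A small sketch of that train/eval difference, using a standalone `nn.Dropout` layer on a tensor of ones (the input and the 0.5 rate are arbitrary choices for illustration). In training mode roughly half of the values are zeroed and the survivors are scaled by 1 / (1 - p); in eval mode the layer passes its input through unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the remaining ones are scaled by 1 / (1 - p) = 2.
drop.train()
train_out = drop(x)

# Eval mode: dropout is disabled and the input is returned unchanged.
drop.eval()
eval_out = drop(x)

zero_frac = (train_out == 0).float().mean().item()
print(f"fraction zeroed in train mode: {zero_frac:.2f}")
print(f"eval output equals input: {torch.equal(eval_out, x)}")
```

This is why training metrics computed with dropout active can look worse than validation metrics computed under `model.eval()`.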

No, I didn’t use a dropout layer in the fc part of ResNeXt. I just loaded the pretrained model as it is and only changed the output size of the final layer from 1000 classes to 2.

Can I go ahead with these results?

Yes, in the end what matters is the validation loss, as long as it’s not much, much smaller than the training loss.
It seems an interesting case to study more, can you share the hyperparameters that you used?

May I know the reasoning behind overfitting due to the use of pretrained models?

Isn’t it the choice of architecture that determines whether the model will overfit or not?
The harm a pretrained model can do is cause underfitting, since its weights may not be well suited to our dataset, in which case we can unfreeze some of the layers and fine-tune the weights.

PS: I am referring to this reply of yours.

Hi @databaaz,
No, there can be instances where a pretrained model leads to overfitting.
Please refer to this discussion for more.
Freezing too few layers could be the underlying reason, given how complex the network is.

The guy (in that discussion forum) was not freezing the earlier layers, which would destroy the original weights of the network, and would defeat the overall purpose of ‘transfer learning’. It is a standard practice to first freeze the whole network and optimize the last linear layer, following which we can gradually unfreeze the penultimate, pre-penultimate and some more layers before them if required.
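
A rough sketch of that freeze-then-unfreeze routine. To keep it self-contained, a tiny `nn.Sequential` stands in for the pretrained backbone plus new head; the block layout, the `set_trainable` and `trainable_params` helpers, and the learning rates are all assumptions for illustration, not part of any library:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained network: two "backbone" blocks and a new head.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()),    # early block
    nn.Sequential(nn.Conv2d(8, 16, 3), nn.ReLU()),   # late block
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2)),  # head
)

def set_trainable(module, flag):
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def trainable_params(module):
    """Return only the parameters that currently require gradients."""
    return [p for p in module.parameters() if p.requires_grad]

# Stage 1: freeze everything, then train only the final linear layer.
set_trainable(model, False)
set_trainable(model[2], True)
optimizer = torch.optim.SGD(trainable_params(model), lr=0.01, momentum=0.9)
n_stage1 = len(trainable_params(model))

# ... train for a few epochs here ...

# Stage 2: also unfreeze the last backbone block, using a smaller learning rate.
set_trainable(model[1], True)
optimizer = torch.optim.SGD(trainable_params(model), lr=0.001, momentum=0.9)
n_stage2 = len(trainable_params(model))

print(n_stage1, n_stage2)  # number of trainable tensors in each stage
```

With a real torchvision model the same idea applies, with the stand-in blocks replaced by the model's own layer groups and its final classification layer.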

Hi ,

Even after selecting the best model based on validation accuracy (~90%), when I load the model and evaluate it on the test data, it reaches only 73% accuracy, with many false negatives, low recall for the positive class (0.60), and low precision for the negative class (0.70).
This is hard to understand: the model performs well on the validation dataset and underperforms on the test dataset (both use the same dataloader settings, the same batch size, and no shuffling).
I have run this particular “resnext50_32x4d” several times, and each time, after about 20 epochs, the validation loss drops below the training loss and, similarly, the validation accuracy rises above the training accuracy (for instance, see the image below).

There is no imbalance in the classes, which is why I didn’t modify the loss function or use weighted random sampling!

I only fine-tuned the pretrained “resnext50_32x4d” without freezing any weights and didn’t use any dropout layer.


import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_1 = optim.SGD(model_1.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_1, step_size=7, gamma=0.1)
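
To make the effect of this schedule concrete, here is a small sketch that records the learning rate over 21 epochs with the same optimizer and scheduler settings. A single dummy parameter stands in for the model (an assumption made only so the snippet runs on its own):

```python
import torch
from torch.optim import lr_scheduler

# Dummy parameter in place of model_1.parameters(), for illustration only.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.01, momentum=0.9, weight_decay=1e-5)
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

lrs = []
for epoch in range(21):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()    # in real training this follows loss.backward()
    scheduler.step()    # one scheduler step per epoch

# The lr stays at 0.01 for epochs 0-6, then is multiplied by 0.1
# at epoch 7 and again at epoch 14.
for epoch in (0, 6, 7, 13, 14, 20):
    print(epoch, lrs[epoch])
```

So from epoch 14 onward the model is being updated with a learning rate 100 times smaller than the starting value.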

Hi @sn07,
It sometimes also depends on dataset’s complexity.
Can you share some details about the data you’re using?