RuntimeError: CUDA out of memory - Need help with GPU memory allocation issue in PyTorch

RuntimeError: CUDA out of memory. Tried to allocate 374.00 MiB (GPU 0; 15.90 GiB total capacity; 15.09 GiB already allocated; 49.88 MiB free; 15.20 GiB reserved in total by PyTorch)

I am facing the above error while training my encoder-decoder model. Does anyone have a solution?
My process looks like this (a sketch of the loop is given after the list):

  • The data loader converts a batch of the dataset and returns tensors
  • The tensors are moved to the GPU with tensor_variable_name.to(device) and fed to the forward pass of the net, which returns the outputs
  • The forward-pass outputs are collected inside a train_batch function, in which net.forward(input_tensors, ...) is called, followed by the loss computation
  • opt.zero_grad() is called
  • loss.backward() is called
  • and finally opt.step()
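
For reference, a minimal sketch of that loop; net, opt, criterion and train_loader are placeholder names for this illustration, not my actual code:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def train_batch(net, criterion, inputs, targets):
        # Forward pass and loss computation (calling net(...) is equivalent to net.forward(...))
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        return loss

    # net, opt, criterion and train_loader are assumed to be defined as in the steps above
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)  # move the batch to the GPU

        loss = train_batch(net, criterion, inputs, targets)

        opt.zero_grad()
        loss.backward()
        opt.step()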

After the entire training process is completed, tensor_variable_name no longer exists because the implementation is functionalized (the tensors are local to the training functions), so I can't use
del tensor_variable_name to clear GPU memory,
and torch.cuda.empty_cache() is not clearing the allocated memory.
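
As far as I understand, torch.cuda.empty_cache() only returns memory from PyTorch's caching allocator back to the driver; it cannot free memory that is still referenced by a live tensor. This is roughly how I am checking it (a generic sketch, nothing specific to my model):

    import gc
    import torch

    # Run the garbage collector first so that unreachable tensors are actually destroyed,
    # then ask PyTorch to release its cached blocks back to the driver.
    gc.collect()
    torch.cuda.empty_cache()

    # memory_allocated() counts memory held by live tensors;
    # memory_reserved() counts the cache PyTorch keeps around for reuse.
    print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
    print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")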

I am assuming, but not sure -> (according to me, the computation graph created when the last batch was trained is still stored on the CUDA device. Is there any way to clear that graph?)
I can't find any variable still stored in memory using dir().
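
In case it helps: if the last graph really is the problem, my understanding is that it can only stay alive while something still references the loss or output tensors. This is the kind of change I am considering, with a hypothetical train_batch (not my exact code):

    def train_batch(net, criterion, opt, inputs, targets):
        outputs = net(inputs)
        loss = criterion(outputs, targets)

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Return a Python float instead of the loss tensor: keeping a reference to the
        # loss keeps its grad_fn graph reachable, which can hold on to GPU memory.
        return loss.item()

If only loss.item() escapes the function, the last batch's graph should become unreachable once training finishes, and gc.collect() followed by torch.cuda.empty_cache() should then be able to release the memory.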

Please help; the only solution I have found is restarting the session.

If you’re facing the memory error on the first attempt, then you need to reduce memory usage. (Refer this.)

Another possible situation for this error is when we have started a run once and then stopped it in order to make a modification to the code. In this case, memory stays allocated to each of the dynamic graphs, which you can release by restarting the runtime.
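
For completeness, the usual way to reduce memory usage is a smaller batch size, optionally with gradient accumulation to keep the effective batch size; a sketch with made-up numbers and the same placeholder names as above:

    from torch.utils.data import DataLoader

    # Hypothetical sizes: physical batches of 8, accumulated to an effective batch of 32
    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    accum_steps = 4

    opt.zero_grad()
    for step, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = criterion(net(inputs), targets) / accum_steps  # scale so gradients average correctly
        loss.backward()                                       # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()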

It happens after the entire training is done. Yup, that’s what I am doing now, restarting the runtime. Let me know if there is any other way, or whether it may be a limitation of PyTorch.