Transformers for seq2seq transliteration in capstone project

I’m trying to use Transformers instead of a GRU with attention for the transliteration part of the capstone project, but I’m facing some issues deciding which loss function to use. I’m using character-level embeddings in both the encoder and the decoder to create the source and target sequences, each of dimension 64. I was trying BCEWithLogitsLoss from PyTorch, but it’s not working.
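Roughly, my setup looks like the sketch below (vocab sizes, layer counts, and head counts are placeholders, and I’ve left out positional encoding and masking for brevity):

```python
import torch
import torch.nn as nn

class CharTransliterator(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=64):
        super().__init__()
        # character-level embeddings for source and target, dimension 64
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        # projects decoder states to scores over the target character set
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt):
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(h)  # (batch, tgt_len, tgt_vocab)
```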

I’m still stuck on this. Any help would be much appreciated.

Binary cross entropy works only for binary classification. I think what you’re trying to solve is multi-class classification (one class per character in the target vocabulary), so try CrossEntropyLoss instead.
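Something like this should work, treating each output position as a multi-class prediction over your character vocabulary (the shapes and the padding index here are just illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)  # assuming padding index 0

# decoder output: (batch, tgt_len, tgt_vocab), raw scores with no softmax applied
logits = torch.randn(32, 20, 100, requires_grad=True)
# gold target characters (shifted by one position): (batch, tgt_len)
targets = torch.randint(0, 100, (32, 20))

# CrossEntropyLoss expects (N, C) vs (N,), so flatten the batch and time dimensions
loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
```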

Also check whether the Transformer implementation you’re using has a softmax layer at the end (the architecture proposed in the paper applies a softmax at the end). PyTorch’s CrossEntropyLoss only works with raw logits, not probabilities.
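If your implementation does end in a softmax, either drop it and feed the raw linear outputs to CrossEntropyLoss, or keep log-probabilities and use NLLLoss; a rough comparison:

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 100
hidden = torch.randn(32, 20, d_model)          # decoder hidden states
proj = nn.Linear(d_model, vocab)               # final projection, no softmax
targets = torch.randint(0, vocab, (32 * 20,))

# Option 1: raw logits + CrossEntropyLoss (softmax is applied internally)
logits = proj(hidden)
loss_a = nn.CrossEntropyLoss()(logits.reshape(-1, vocab), targets)

# Option 2: log-probabilities + NLLLoss (same result as option 1)
log_probs = torch.log_softmax(logits, dim=-1)
loss_b = nn.NLLLoss()(log_probs.reshape(-1, vocab), targets)

# loss_a and loss_b agree; passing softmax *probabilities* to CrossEntropyLoss does not
```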

Check this link to see how to implement cross entropy with soft targets.
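If you do need soft targets (e.g. for label smoothing), a manual version looks roughly like this (the smoothing value and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    # logits: (N, C) raw scores; soft_targets: (N, C) rows summing to 1
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# example: smoothed one-hot targets over a 100-character vocabulary
logits = torch.randn(640, 100, requires_grad=True)
hard = torch.randint(0, 100, (640,))
eps = 0.1
soft = torch.full((640, 100), eps / 100)
soft.scatter_(1, hard.unsqueeze(1), 1 - eps + eps / 100)

loss = soft_cross_entropy(logits, soft)
loss.backward()
```

If I remember correctly, newer PyTorch versions also let nn.CrossEntropyLoss take class probabilities as the target directly, so check your version before rolling your own.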