I have a few doubts about the Transformer decoder:
1. It is said that the output of the decoder is fed back as input to the decoder at the next time step. Since the output of the decoder is a one-hot encoded vector the size of the output vocabulary, is that same one-hot encoded vector fed as the input at the next time step? (The first sketch after this list shows roughly how I picture this loop.)
2. During training they implement teacher forcing on the decoder. So, for example, if my true output sentence is "I am Fine", is the entire sentence fed as input to the decoder?
3. My last doubt is about masked self-attention, where they mention that the model should not have access to output words that have not yet been produced, and the masking is implemented with a lower-triangular matrix applied before the softmax (for this to work, I assume the entire output sentence "I am Fine" is given as input to the decoder). In this case, is the output of the decoder a single word, or a probability distribution for every position in the output sentence (e.g. just 'I', or all of "I am Fine")? If it generated only one output word per time step, that wouldn't make sense to me, because you would be feeding "I am Fine" at every time step and getting a different output word each time. How is this possible? (The second sketch below shows the masking I am referring to.)
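
For doubt 1, this is roughly how I picture the feedback loop at inference time. It is only a sketch of my understanding (the `decoder`, `memory`, `sos_id`, and `eos_id` names are placeholders I made up, not from any real library); my question is whether the raw one-hot vector, rather than the chosen token, is what actually gets fed back:

```python
import torch

# Sketch of greedy decoding as I picture it (placeholder names, nothing official).
# decoder(tgt_ids, memory) is assumed to return logits of shape [1, seq_len, vocab_size].
def greedy_decode(decoder, memory, sos_id, eos_id, max_len=20):
    generated = torch.tensor([[sos_id]])            # start-of-sentence token
    for _ in range(max_len):
        logits = decoder(generated, memory)         # scores over the vocab at each position
        next_id = logits[:, -1].argmax(dim=-1)      # most likely word at the last position
        generated = torch.cat([generated, next_id.unsqueeze(0)], dim=1)  # fed back next step
        if next_id.item() == eos_id:
            break
    return generated
```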
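And for doubt 3, this is the lower-triangular masking I mean: the whole target sentence goes in at once, and the entries above the diagonal are set to -inf before the softmax so each position can only attend to earlier positions. A minimal sketch, assuming PyTorch (the shapes and variable names are just mine, for illustration):

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 3, 4                           # e.g. the 3 words of "I am Fine"
q = torch.randn(seq_len, d_k)                 # queries, one per target position
k = torch.randn(seq_len, d_k)                 # keys, one per target position
v = torch.randn(seq_len, d_k)                 # values, one per target position

scores = q @ k.T / d_k ** 0.5                         # [seq_len, seq_len] attention scores
mask = torch.tril(torch.ones(seq_len, seq_len))       # lower-triangular: 1 = allowed to attend
scores = scores.masked_fill(mask == 0, float('-inf')) # hide future positions before the softmax
attn = F.softmax(scores, dim=-1)              # row i attends only to positions <= i
output = attn @ v                             # one output vector per target position, in parallel
print(attn)                                   # upper triangle is exactly 0
```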