Doubts about the Transformer Decoder

I have a couple of doubts about the Transformer decoder:

  1. It is said that the output of the decoder is fed back as input to the decoder for the next time step. Since the output of the decoder is a one-hot encoded vector the size of the output vocabulary, is that same one-hot encoded vector fed as the input for the next time step?

  2. During training, they implement teacher forcing on the decoder. So, for example, if my true output sentence is “I am Fine”, is the entire sentence fed as the input to the decoder?

  3. My last doubt is about masked self-attention, where they mention that the model should not have access to output words that have not yet been produced. The masking is implemented using a lower-triangular matrix before the softmax (for this to happen, I assume the entire output sentence “I am Fine” is given as input to the decoder). So in this case, is the output of the decoder a single word, or the probability distribution for every word in the output sentence (e.g. “I” or “I am Fine”)? If it generated only one output word per time step, that wouldn’t make sense, because you would be inputting “I am Fine” at every time step and getting different output words. How is this possible?
  1. The softmax at the top of the decoder outputs probabilities over the words in the target vocabulary, so the actual word can be decoded at this stage. Let’s call the current decoding timestep t. The words decoded up to timestep t are embedded and fed as the input to the decoder at timestep t+1. If teacher forcing is used, the original corresponding words from the target sentence are used instead of the decoded words (see the sketch after this list).

  2. The decoder outputs a single word at a time. So, for the sentence "I am fine":
    At t=0, the input will be "<sos>" and the output will be "I"
    At t=1, the input will be "I" and the output will be "am"
    At t=2, the input will be "I am" and the output will be "fine"
    In this way, the decoder takes in all the words up to the current timestep and outputs the next word. This continues until the end of the sentence.

  3. Masked self-attention is for making each word in the decoder attend only to the words before it. So, for timestep t=2 as above, the word "I" will only attend to the word "I" and not to "am".
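
For points 1 and 2, here is a minimal sketch of that loop in PyTorch-style Python. The `decoder` callable, `encoder_out`, and the `sos_id`/`eos_id` token ids are assumptions standing in for the real model and special tokens:

```python
import torch

# A minimal sketch of the autoregressive decoding loop described above.
# `decoder` is a stand-in for the real Transformer decoder: it takes the
# target tokens generated so far plus the encoder output and returns
# logits over the target vocabulary for every position.
def greedy_decode(decoder, encoder_out, sos_id, eos_id, max_len=20):
    generated = [sos_id]                                  # at t=0 the input is just <sos>
    for _ in range(max_len):
        inp = torch.tensor(generated).unsqueeze(0)        # (1, t) tokens produced so far
        logits = decoder(inp, encoder_out)                # (1, t, vocab_size)
        probs = torch.softmax(logits[0, -1], dim=-1)      # distribution over the next word
        next_id = int(torch.argmax(probs))                # pick the most probable word
        generated.append(next_id)                         # feed it back at the next step
        if next_id == eos_id:
            break
    return generated


def teacher_forced_input(target_ids, sos_id):
    # During training, the decoder input is the ground-truth sentence shifted
    # right by one position: [<sos>, I, am] is fed in, [I, am, fine] is predicted.
    return [sos_id] + list(target_ids[:-1])
```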

Please post a question if anything is unclear :slightly_smiling_face:


For example, as you mentioned, at t=1 the input will be "I" only, so how will "I" attend to "am", which has not yet been given as input?

In the 3rd point, I mentioned t=2. For t=1, the input will be "I" and it will only be attending to "I".
For t=2, the input will be "I am". Here, "I" will attend to "I" alone, and "am" will attend to both "I" and "am".
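
That attention pattern is exactly what the lower-triangular mask encodes. A small illustration (the sequence length and dummy scores below are made up):

```python
import torch

# Row i of the lower-triangular mask says which positions word i may attend to.
seq_len = 3                                   # e.g. ["I", "am", "fine"]
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)
# tensor([[1., 0., 0.],    "I"    attends to: "I"
#         [1., 1., 0.],    "am"   attends to: "I", "am"
#         [1., 1., 1.]])   "fine" attends to: "I", "am", "fine"

# Before the softmax, the masked-out positions of the attention scores are
# set to -inf so that they receive zero probability.
scores = torch.randn(seq_len, seq_len)        # dummy attention scores
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = torch.softmax(scores, dim=-1)          # each row sums to 1 over the allowed positions
```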


Oh, got it!!! And suppose there are only 5 words in my output vocabulary. Let's say I am using teacher forcing and the actual output was the 2nd word from the vocabulary, so its one-hot representation will be [0 1 0 0 0]. When giving this word as input to my decoder, do I need to convert this one-hot representation into some embedding, or do I pass [0 1 0 0 0] directly?

The output from the decoder will be probabilities, not one-hot vectors. From the probability vector, we use argmax to find the word. This word will be the input to the decoder for the next timestep if teacher forcing is not used.
For example, at timestep t=0, suppose the output is [0.6, 0.2, 0.2]; the actual word is obtained by taking the word at the max index. The embedding for this word is then obtained, the positional embedding is added, and the result is used as input for the next timestep.
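
As a rough sketch of that step (the vocabulary size, model dimension, and `pos_encoding` tensor below are made-up placeholders):

```python
import torch
import torch.nn as nn

# Toy sizes: a 3-word vocabulary and an 8-dimensional model.
vocab_size, d_model = 3, 8
embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = torch.zeros(100, d_model)           # stand-in positional encodings

probs = torch.tensor([0.6, 0.2, 0.2])              # decoder output at timestep t=0
word_id = torch.argmax(probs)                      # index of the most probable word
next_input = embedding(word_id) + pos_encoding[1]  # embed it and add the t+1 position
```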

So by embedding you mean something like word2vec to represent that input word, right?

The Transformer architecture doesn’t use any pretrained embeddings. Instead, the embeddings are learned during training.
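
In PyTorch, for instance, this is just an `nn.Embedding` layer trained jointly with the rest of the network (the vocabulary size and embedding dimension below are arbitrary):

```python
import torch.nn as nn

# The embedding table is just another trainable layer: its weights start out
# random and are updated by backprop together with the rest of the Transformer,
# rather than being loaded from word2vec/GloVe.
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=512)
print(embedding.weight.requires_grad)   # True -> learned during training
```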


And talking about batching… the encoder can batch many sentences at a time, but the decoder cannot be batched, right? Because it outputs one word at a time.

Thank you very much for clearing my doubts. You literally came like a god. I was searching the internet for like 2 days to resolve my doubts, but everywhere they only talk about the encoder and do not discuss the decoder in much detail. Again, thanks a lot :slight_smile:

The decoder can also be batched. Suppose you want to translate N sentences from English to Hindi (a batched decoding sketch follows this list):

  1. Create a list of length N to hold the decoder's output predictions for each input sentence, filled initially with <sos> tokens.
  2. Pass this batched input to the decoder, and it will return the next word for each of the sentences.
  3. Append the prediction for each sentence at the corresponding position in the list; the list will now have 2 tokens per output sentence, and this will be fed to the decoder at the next timestep.
  4. This continues until the maximum length is reached.
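
A minimal sketch of those steps, again with `decoder` and `encoder_out` as stand-ins for the trained model and the encoded source batch, and `sos_id`/`max_len` as assumed values:

```python
import torch

def batched_greedy_decode(decoder, encoder_out, N, sos_id, max_len=50):
    # Step 1: every output sentence starts with just <sos>.
    generated = torch.full((N, 1), sos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder(generated, encoder_out)               # (N, t, vocab_size)
        next_ids = logits[:, -1].argmax(dim=-1, keepdim=True)  # (N, 1) next word per sentence
        # Steps 2-3: append each sentence's prediction and feed everything back.
        generated = torch.cat([generated, next_ids], dim=1)
    # Step 4: stop at max_len (in practice, also stop early once every row has produced <eos>).
    return generated
```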