Question about self-attention: how are the query, key and value matrices learnt?

I have a question about the self-attention mechanism in the "Attention Is All You Need" paper.
In self-attention, for each word embedding we compute three vectors (query, key, value). I searched on Google for what these three are and found that they are the result of matrix multiplications, e.g. query = word_embedding * W_q. But where does this W_q come from? Are these matrices learned during backpropagation?

paper link: https://arxiv.org/abs/1706.03762

Yes. The weight matrices for the query (W_q), key (W_k) and value (W_v) projections are all learned through backpropagation, just like any other parameters in the network.
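As a minimal sketch (the class and dimension names here are illustrative, not from the paper), this is how the projections typically look in PyTorch. The `nn.Linear` weights play the role of W_q, W_k and W_v; they are registered as parameters, so the optimizer updates them from the gradients of whatever loss the full model is trained on:

```python
import torch
import torch.nn as nn

class SelfAttentionProjections(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        # W_q, W_k, W_v are ordinary learnable weight matrices.
        # nn.Linear registers them as parameters, so backprop
        # updates them along with the rest of the network.
        self.w_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.w_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.w_v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim) word embeddings
        q = self.w_q(x)  # query = word_embedding @ W_q
        k = self.w_k(x)  # key   = word_embedding @ W_k
        v = self.w_v(x)  # value = word_embedding @ W_v
        return q, k, v

proj = SelfAttentionProjections(embed_dim=512)
print(all(p.requires_grad for p in proj.parameters()))  # True: learned via backprop
```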


Are they learned from the loss of the Transformer, or are they learned separately for some other task?

The whole Transformer network is trained end-to-end using the cross-entropy loss between the target tokens and the predicted output distributions. The paper does not use any auxiliary tasks to train the encoder and decoder separately.
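To make "end-to-end" concrete, here is a hedged sketch of one training step (the model sizes and random tensors below are toy placeholders, not the paper's setup). A single cross-entropy loss at the output drives gradients through the decoder, the encoder, and every W_q, W_k, W_v matrix at once:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64  # toy sizes for illustration
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)

params = (list(model.parameters()) + list(embed.parameters())
          + list(to_logits.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Dummy source/target token ids standing in for a real batch.
src = torch.randint(0, vocab_size, (8, 10))
tgt = torch.randint(0, vocab_size, (8, 10))
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]  # teacher forcing: shift by one

logits = to_logits(model(embed(src), embed(tgt_in)))  # (batch, tgt_len, vocab)
loss = criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()   # one loss; gradients reach every attention projection
optimizer.step()
```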
