I have a question about the self-attention mechanism in the "Attention Is All You Need" paper.
In self-attention, for each word embedding we compute three vectors (query, key, value). I searched online for what these three are, and I found that each is the result of a matrix multiplication, e.g. query = word_embedding * Wq. But where does this Wq come from? Are these matrices learned during backpropagation?
paper link: https://arxiv.org/abs/1706.03762
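To make my understanding concrete, here is a minimal NumPy sketch of how I think the projections work (the sizes are toy values I picked; the paper uses d_model=512, d_k=64, and the weights here are just randomly initialized stand-ins for whatever training would produce):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # toy sizes; the paper uses d_model=512, d_k=64
seq_len = 3           # three word embeddings in the sequence

X = rng.standard_normal((seq_len, d_model))  # word embeddings

# Wq, Wk, Wv: the projection matrices in question. My understanding is
# that they start as random values and are updated by backpropagation.
Wq = rng.standard_normal((d_model, d_k))
Wk = rng.standard_normal((d_model, d_k))
Wv = rng.standard_normal((d_model, d_k))

# query/key/value for every word, via matrix multiplication
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# scaled dot-product attention from the paper: softmax(QK^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
out = weights @ V

print(out.shape)  # one d_k-dimensional output vector per word
```

So my question is whether Wq, Wk, Wv above are ordinary trainable parameters, no different from the weights of a dense layer.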