Positional Encodings and Positional Embeddings for Self-Attention Explained
Vanilla Transformers are permutation-invariant models: by default, the output of the model will not change if you permute all the words in the input sentence. This is a serious problem for language modeling and image recognition, because sentences and images have a specific structure, and the order of words and pixels does change the semantic meaning.
Consequently, for successful learning, we need to incorporate the order of the words/pixels in the input sequence into our self-attention model. This can be done by explicitly attaching information about the order to every element of the sequence before feeding it to the model. The two most widely used approaches are precomputed Sinusoidal Positional Encodings and learnable Positional Embeddings.
🟡 In the case of Sinusoidal Positional Encodings, position i is encoded by a series of K sine-cosine pairs (sin(w_k i), cos(w_k i)) with decreasing frequencies w_k, k = 1, …, K (see the first sketch below).
🟢 In the case of Positional Embeddings, for every possible position i we randomly initialize a learnable d-dimensional embedding p_i and add it to (or concatenate it with) the element at position i in the input sequence (see the second sketch below).
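Here is a minimal PyTorch sketch of the sinusoidal variant, following the standard "Attention Is All You Need" formulation with frequencies w_k = 1 / 10000^(2k / d_model); the function name and shapes are my own choices, not taken from the linked posts, and d_model is assumed to be even:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[i, 2k] = sin(i * w_k), PE[i, 2k+1] = cos(i * w_k), w_k = 1 / 10000^(2k / d_model)."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    k = torch.arange(0, d_model, 2, dtype=torch.float32)                 # even dimensions
    freqs = 1.0 / (10000.0 ** (k / d_model))                             # decreasing w_k
    angles = positions * freqs                                           # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                                      # sine in even dims
    pe[:, 1::2] = torch.cos(angles)                                      # cosine in odd dims
    return pe                                                            # (seq_len, d_model)

# Usage: add the precomputed table to the token embeddings before the first attention layer.
x = torch.randn(1, 16, 64)                       # (batch, seq_len, d_model) token embeddings
x = x + sinusoidal_positional_encoding(16, 64)   # broadcasts over the batch dimension
```

And a sketch of the learnable variant, shown here in its additive form (as used e.g. in BERT and ViT); again, the class and parameter names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LearnablePositionalEmbedding(nn.Module):
    """One randomly initialized, trainable d-dimensional vector p_i per position i."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # (1, max_len, d_model): the leading 1 lets the table broadcast over the batch.
        self.pos_embed = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings; attach p_i to the element at position i.
        seq_len = x.size(1)
        return x + self.pos_embed[:, :seq_len, :]

# Usage: the table is a regular model parameter, updated by backprop like any other weight.
emb = LearnablePositionalEmbedding(max_len=512, d_model=64)
x = torch.randn(1, 16, 64)
out = emb(x)   # same shape as x, now position-aware
```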
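The design trade-off between the two: sinusoidal encodings require no extra parameters and extrapolate to sequence lengths not seen during training, while learnable embeddings are limited to max_len positions but can adapt their geometry to the task.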
To learn more about Positional Encodings and Embeddings and how to implement them, refer to the following blog posts:
📜 Positional Encodings
📃 Positional Embeddings
