Linear Relationships in the Transformer’s Positional Encoding
In June 2017, Vaswani et al. published the paper “Attention Is All You Need”, which describes the Transformer architecture, a purely attention-based sequence-to-sequence model. It can be applied to many tasks, such as language translation and text summarization. Since its publication, the paper has been cited more than one thousand times, and several excellent blog posts have been written on the topic; I recommend this one.
Vaswani et al. use positional encoding to inject information about a token’s position within a sentence into the model. The exact definition is given in Section 3.5 of the paper (it is only a tiny aspect of the Transformer, as the red circle in the cover picture of this post indicates). After the definition, the authors state:
“We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.”
But why is that? In this post I prove this linear relationship between relative positions in the Transformer’s positional encoding.
Problem Statement
Let $\boldsymbol{E}\in\mathbb{R}^{n\times d_\text{model}}$ be a matrix whose rows $\boldsymbol{E}_{t,:}$, treated as $d_\text{model}$-dimensional column vectors, encode the position $t$ in an input sequence of length $n$. The function $e:\left\{1,\dots,n\right\}\rightarrow\mathbb{R}^{d_\text{model}}$ induces this matrix and is defined as
$$
e(t) = \boldsymbol{E}_{t,:} := \begin{bmatrix}
\sin\left(\frac{t}{f_1}\right)\\
\cos\left(\frac{t}{f_1}\right)\\
\sin\left(\frac{t}{f_2}\right)\\
\cos\left(\frac{t}{f_2}\right)\\
\vdots\\
\sin\left(\frac{t}{f_{\frac{d_\text{model}}{2}}}\right)\\
\cos\left(\frac{t}{f_{\frac{d_\text{model}}{2}}}\right)
\end{bmatrix}\,,\tag{1}
$$where the angular frequencies $\lambda_m$ and their reciprocals $f_m$ are defined by
$$
f_m = \frac{1}{\lambda_m} := 10000^{\frac{2m}{d_\text{model}}}\,.\tag{2}
$$
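To make the definition concrete, here is a small NumPy sketch (the helper name and indexing are my own, following the 1-based $t$ and $m$ used above) that builds the encoding matrix $\boldsymbol{E}$ row by row according to Eq. 1 and 2:

```python
import numpy as np

def positional_encoding(n, d_model):
    """Build E (n x d_model) according to Eq. 1 and 2, using the
    1-based indices t = 1..n and m = 1..d_model/2 from the text."""
    E = np.zeros((n, d_model))
    for t in range(1, n + 1):
        for m in range(1, d_model // 2 + 1):
            f_m = 10000 ** (2 * m / d_model)             # Eq. 2
            E[t - 1, 2 * (m - 1)] = np.sin(t / f_m)      # even dimension: sine
            E[t - 1, 2 * (m - 1) + 1] = np.cos(t / f_m)  # odd dimension: cosine
    return E

E = positional_encoding(n=50, d_model=16)
print(E.shape)  # (50, 16)
```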
The paper states that a linear transformation $\boldsymbol{T}^{(k)}\in\mathbb{R}^{d_\text{model}\times d_\text{model}}$ exists, for which
$$
\boldsymbol{T}^{(k)}\boldsymbol{E}_{t,:}=\boldsymbol{E}_{t+k,:}\tag{3}
$$holds for any positional offset $k\in\left\{1,\dots,n\right\}$ at any valid position $t\in\left\{1,\dots,n-k\right\}$ in the sequence.
Derivation
This holds because a definition of $\boldsymbol{T}^{(k)}$ can be found that does not depend on $t$:
$$
\boldsymbol{T}^{(k)} = \begin{bmatrix}
\boldsymbol{\Phi}^{(k)}_1 & \boldsymbol{0} & \cdots & \boldsymbol{0} \\
\boldsymbol{0} & \boldsymbol{\Phi}^{(k)}_2 & \cdots & \boldsymbol{0} \\
\boldsymbol{0} & \boldsymbol{0} & \ddots & \boldsymbol{0} \\
\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{\Phi}^{(k)}_{\frac{d_\text{model}}{2}}
\end{bmatrix}\,,\tag{4}
$$where $\boldsymbol{0}$ denotes the $2\times2$ all-zero matrix and the $\frac{d_\text{model}}{2}$ transposed rotation matrices $\boldsymbol{\Phi}^{(k)}_m$ on the main diagonal are defined by
$$
\boldsymbol{\Phi}^{(k)}_m = \begin{bmatrix}
\cos(r_mk) & -\sin(r_mk) \\
\sin(r_mk) & \cos(r_mk)
\end{bmatrix}^\intercal\,,\tag{5}
$$with a yet-to-be-determined angular frequency $r_m$ (not to be confused with the encoding’s angular frequency $\lambda_m$ from Eq. 2).
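To illustrate the structure of Eq. 4 and 5, here is a companion sketch (again with my own helper name) that assembles $\boldsymbol{T}^{(k)}$ from the transposed $2\times2$ rotation blocks; the values $r_m$ are passed in as a parameter, since they are still to be determined:

```python
import numpy as np

def transform_matrix(k, r, d_model):
    """Assemble T^(k) (Eq. 4) from the transposed 2x2 rotation
    blocks Phi_m^(k) (Eq. 5); r[m - 1] holds the value r_m."""
    T = np.zeros((d_model, d_model))
    for m in range(1, d_model // 2 + 1):
        rotation = np.array([[np.cos(r[m - 1] * k), -np.sin(r[m - 1] * k)],
                             [np.sin(r[m - 1] * k),  np.cos(r[m - 1] * k)]])
        i = 2 * (m - 1)
        T[i:i + 2, i:i + 2] = rotation.T  # transposed block on the main diagonal
    return T
```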
Now we can reformulate Eq. 3, which we want to prove, as
$$
\underbrace{\begin{bmatrix}
\cos(r_mk) & \sin(r_mk) \\
-\sin(r_mk) & \cos(r_mk)
\end{bmatrix}}_{\displaystyle\boldsymbol{\Phi}^{(k)}_m}
\begin{bmatrix}
\sin\left(\lambda_mt\right)\\
\cos\left(\lambda_mt\right)
\end{bmatrix} =
\begin{bmatrix}
\sin\left(\lambda_m\left(t+k\right)\right)\\
\cos\left(\lambda_m\left(t+k\right)\right)
\end{bmatrix}\,.\tag{6}
$$Expanded, and dropping the index $m$ for brevity, this reads
$$\begin{align}
\sin(\lambda k+\lambda t) &= \sin(rk)\cos(\lambda t)+\cos(rk)\sin(\lambda t)\tag{6a}\\
\cos(\lambda k+\lambda t) &= \cos(rk)\cos(\lambda t)-\sin(rk)\sin(\lambda t)\,,\tag{6b}
\end{align}$$and we seek to determine $r$ as a function of $\lambda$ and $k$, while eliminating $t$.
The addition theorems$$\begin{align}
\sin \left( {\alpha + \beta } \right) &= \sin \alpha \cos \beta + \cos \alpha \sin \beta\tag{7a}\\
\cos \left( {\alpha + \beta } \right) &= \cos \alpha \cos \beta - \sin \alpha \sin \beta\tag{7b}
\end{align}$$help solve the problem at hand. Matching the expanded form (Eq. 6a, 6b) against these identities term by term, it follows that$$\begin{align}
\alpha&=\lambda k=rk\tag{8a}\\
\beta&=\lambda t=\lambda t\,,\tag{8b}
\end{align}$$and consequently $r=\lambda$. Applying that finding and the definition of $\lambda_m$ (Eq. 2) to Eq. 5, we get
$$
\boldsymbol{\Phi}^{(k)}_m = \begin{bmatrix}
\cos\left(\lambda_mk\right) & \sin\left(\lambda_mk\right) \\
-\sin\left(\lambda_mk\right) & \cos\left(\lambda_mk\right)
\end{bmatrix}\,,\tag{9}
$$where $\displaystyle\lambda_m=10000^{\frac{-2m}{d_\text{model}}}$. With that, $\boldsymbol{T}^{(k)}$ is fully specified and depends solely on $d_\text{model}$ and $k$ (each block $\boldsymbol{\Phi}^{(k)}_m$ additionally on its index $m$); the position within the sequence, $t$, is not a parameter, q.e.d.
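Reusing the two sketches from above, the result can also be checked numerically: with $r_m=\lambda_m$, the transformation maps the encoding of position $t$ onto the encoding of position $t+k$ for every valid $t$, up to floating-point error.

```python
import numpy as np

n, d_model, k = 50, 16, 3
E = positional_encoding(n, d_model)                               # Eq. 1 and 2
lam = 10000.0 ** (-2 * np.arange(1, d_model // 2 + 1) / d_model)  # lambda_m (Eq. 2)
T_k = transform_matrix(k, lam, d_model)                           # r_m = lambda_m (Eq. 9)

for t in range(1, n - k + 1):                                     # every valid position
    assert np.allclose(T_k @ E[t - 1], E[t + k - 1])              # Eq. 3, independent of t
```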
Special thanks go to my colleagues Stefan Baur and Lennart Van der Goten for their friendly support and readiness to share their knowledge.
Thanks for sharing. It helps a lot.
I have another question about the positional encoding:
why are sin and cos both used? why not just use one of them?
Thanks for your help~
I think it would not be possible to attend, e.g., to position 0 of a pure sine encoding, because that vector would be all zeros and any dot product with it would be 0 too. With the juxtaposition of sine and cosine, the magnitude is sort of “balanced”.
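For example, with sines only (a quick NumPy check, using the $f_m$ from Eq. 2):

```python
import numpy as np

d_model = 16
f = 10000.0 ** (2 * np.arange(1, d_model // 2 + 1) / d_model)  # f_m from Eq. 2
print(np.sin(0 / f))                        # sine-only encoding of position 0: all zeros
print(np.sin(0 / f)**2 + np.cos(0 / f)**2)  # with cosines added, each pair keeps unit norm
```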
Amazing proof! This problem had confused me for weeks, and your insight solved it.
Thanks for sharing, it helps a lot. However, I still have some questions.
I have the same question.
Thanks for your proof.
This also confused me for weeks, and your proof helped me get a better understanding of it. Hoping that more people can learn from your insight, I am wondering if I could translate it into Chinese and share it with the Chinese community?
You’re most welcome. Feel free to do so and please be sure to reference this post from your translation then.
I am still confused by the following expression. Where does it come from?
$$
f_m = \frac{1}{\lambda_m} := 10000^{\frac{2m}{d_\text{model}}}\,.\tag{2}
$$
This is directly derived from the equation in Section 3.5 of the Transformer paper.
Thanks Timo. I have read the paper several times, but I still don’t understand why the parameter $\frac{1}{\lambda_m}$ takes the form $10000^{\frac{2m}{d_\text{model}}}$. Is there a theoretical reason for it, or is it just a hyperparameter?
$\lambda_m$ determines the wavelength, which is $2\pi/\lambda_m=2\pi f_m$. The paper says: “The wavelengths form a geometric progression from $2\pi$ to $10000\cdot2\pi$.”
For example, with $d_\text{model}=512$ and $x=\text{position}$, the wavelength corresponding to the first two dimensions is $2\pi$, since $\sin\left(x/10000^{2\times\frac{0}{512}}\right)=\sin(x)$ and $\cos\left(x/10000^{2\times\frac{0}{512}}\right)=\cos(x)$. The wavelength for the last two dimensions is $10000\times2\pi$, since $\sin\left(x/10000^{2\times\frac{256}{512}}\right)=\sin\left(\frac{x}{10000}\right)$ and $\cos\left(x/10000^{2\times\frac{256}{512}}\right)=\cos\left(\frac{x}{10000}\right)$. I hope this helps!
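A quick numerical check of these two wavelengths (a small NumPy sketch using the same exponents as in the example above):

```python
import numpy as np

d_model = 512
for i in (0, 256):  # exponents from the example above
    wavelength = 2 * np.pi * 10000 ** (2 * i / d_model)
    print(i, wavelength)  # ~6.28 (= 2*pi) for i = 0, 10000 * 2*pi for i = 256
```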
Can you provide any references to this?
Hi Timo, I have one question.
Will the positional embeddings cycle, after the least common multiple of the periods of the constituent sinusoidal functions? Does the resulting embedding have a period?
Thanks.
Hey Jesse, the positional embeddings would cycle, yes. However, in most practical applications one would artificially limit the maximum sequence length to something like 512 tokens. That is done in BERT, for example (even though it is worth mentioning that BERT uses learned position embeddings, but that is another story). If one did not do that, the model would indeed not be able to distinguish the first and, say, the 513th symbol in the sequence.
Hi Timo, thanks for your reply,
I guess the period of the positional embeddings would be longer than common sequence lengths (say, the 512 limit). Maybe the period has something to do with the chosen number 10000.
I wish the authors of the paper had clarified the theoretical limit, though.
Hey Jesse, you are right – and yeah, that is also not clear to me.
Did you figure this out yet? I don’t think the cycle would be as small as 10k or 10k*2pi, but when would it cycle?
Is the frequency of $\sin\left(\frac{t}{f_m}\right)$ equal to $\frac{1}{2\pi f_m}$? I can’t understand Equation (2).
Thanks for this great article! Just to improve it: there is a little confusion between frequency ($\omega=2\pi f$) and wavelength ($\lambda$). In canonical form, frequencies are written as follows: $\sin(\omega t)=\sin(2\pi f t)=\sin(2\pi t\,v/\lambda)$ (where $v$ is the celerity and can be set to 1 in our case). More details here: https://en.wikipedia.org/wiki/Frequency#Related_types_of_frequency
just like almost anything involving trigonometric identities, this is an example of something that’s easier to understand and faster to prove with a bit of complex analysis and/or electrical engineering background
words are considered impulses (dirac deltas) in the time domain (what time they occur in the sentence).
the fourier transform of an impulse is a complex exponential ($e^{it/f} = \cos(t/f) + i\sin(t/f)$).
each dimension of the positional encoding is a fourier coefficient of the word’s position
the positional encoding is a truncated fourier series of the word’s position. the position shift operator T is the same as time shifting, so the time-shifting property of the fourier transform applies (multiplying 2 complex exponentials together is the same as adding their exponents).
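in code, the time-shifting property reads roughly like this (a tiny sketch; the phase factor $e^{i\lambda_m k}$ is the complex counterpart of the rotation block $\boldsymbol{\Phi}^{(k)}_m$ from the post):

```python
import numpy as np

# time shifting in complex form: e^{i*lam*(t+k)} = e^{i*lam*k} * e^{i*lam*t};
# multiplying by the fixed phase e^{i*lam*k} shifts the position by k.
lam, t, k = 0.05, 7.0, 3.0
shifted = np.exp(1j * lam * (t + k))
rotated = np.exp(1j * lam * k) * np.exp(1j * lam * t)
assert np.isclose(shifted, rotated)
```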
as for why the transformer architecture works with this, i suppose at some point it learns how to do an inverse fourier transform to get the final output and/or the guts of the model think of semantics in the frequency domain (each frequency / color / pitch of the sentence is a component of its meaning)