Linear Relationships in the Transformer’s Positional Encoding

In June 2017, Vaswani et al. published the paper “Attention Is All You Need”, which describes the Transformer, a purely attention-based sequence-to-sequence architecture. It can be applied to many tasks, such as language translation and text summarization. Since its publication, the paper has been cited more than one thousand times, and several excellent blog posts have been written on the topic; I recommend this one.

Vaswani et al. use positional encoding to inject information about a token’s position within a sentence into the model. The exact definition is given in section 3.5 of the paper (it is only a tiny aspect of the Transformer, as the red circle in the cover picture of this post indicates). After the definition, the authors state:

“We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.”
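For reference, the definition in section 3.5 assigns, to position $pos$ and dimension index $i$ of a $d_\text{model}$-dimensional embedding, $$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)\,.$$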

But why does this hold? In this post I prove this linear relationship between relative positions in the Transformer’s positional encoding. Continue reading Linear Relationships in the Transformer’s Positional Encoding
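Before working through the proof, the claim can be sanity-checked numerically. The following is a minimal NumPy sketch (my own toy example, assuming the sinusoidal definition above): it builds a block-diagonal matrix of $2\times 2$ rotations that depends only on the offset $k$ and verifies that this linear map sends $PE_{pos}$ to $PE_{pos+k}$.

```python
import numpy as np

d_model = 64    # embedding dimension (assumed even)
pos, k = 10, 7  # an arbitrary position and a fixed offset


def positional_encoding(position, d_model):
    """Sinusoidal encoding: PE(pos, 2i) = sin(pos * w_i), PE(pos, 2i+1) = cos(pos * w_i)."""
    w = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(position * w)
    pe[1::2] = np.cos(position * w)
    return pe


def offset_matrix(k, d_model):
    """Block-diagonal matrix of 2x2 rotations; it depends only on the offset k, not on pos."""
    w = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)
    M = np.zeros((d_model, d_model))
    M[0::2, 0::2] = np.diag(np.cos(k * w))
    M[0::2, 1::2] = np.diag(np.sin(k * w))
    M[1::2, 0::2] = np.diag(-np.sin(k * w))
    M[1::2, 1::2] = np.diag(np.cos(k * w))
    return M


# PE_{pos+k} equals a linear map, independent of pos, applied to PE_{pos}.
assert np.allclose(offset_matrix(k, d_model) @ positional_encoding(pos, d_model),
                   positional_encoding(pos + k, d_model))
```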

Least Squares Derivation

The least squares optimization problem searches for the vector that minimizes the Euclidean norm in the following expression: $$x_\text{opt}=\arg\min_x\frac{1}{2}\left\lVert Ax-y\right\rVert^2_2\,.$$ This article explains how the solution to this problem, $x_\text{opt}=(A^\top A)^{-1}A^\top y$, can be derived and how it can be used for regression problems. Continue reading Least Squares Derivation
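As a quick illustration of the closed-form solution, here is a minimal NumPy sketch on a toy line-fitting problem of my own choosing: it solves the normal equations directly and cross-checks the result against np.linalg.lstsq.

```python
import numpy as np

# Toy regression problem: fit y = w0 + w1 * t to noisy samples of a line.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * t + 0.1 * rng.standard_normal(t.shape)

# Design matrix A: a column of ones (intercept) and the inputs t.
A = np.column_stack([np.ones_like(t), t])

# Closed-form solution via the normal equations, x_opt = (A^T A)^{-1} A^T y.
x_opt = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check against NumPy's built-in least squares solver.
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x_opt, x_ref)
print(x_opt)  # roughly [2, 3]
```

Forming $A^\top A$ explicitly is fine for a small sketch like this; for ill-conditioned problems a QR- or SVD-based solver such as np.linalg.lstsq is numerically more robust.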