Linear Relationships in the Transformer’s Positional Encoding

In June 2017, Vaswani et al. published the paper “Attention Is All You Need”, describing the “Transformer” architecture, a purely attention-based sequence-to-sequence model. It can be applied to many tasks, such as language translation and text summarization. Since its publication, the paper has been cited more than one thousand times, and several excellent blog posts have been written on the topic; I recommend this one.

Vaswani et al. use positional encoding to inject information about a token’s position within a sentence into the model. The exact definition is given in section 3.5 of the paper (it is only a tiny aspect of the Transformer, as the red circle in the cover picture of this post indicates). After the definition, the authors state:

“We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.”

But why is that? In this post I prove this linear relationship between relative positions in the Transformer’s positional encoding.
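
The claim can be checked numerically before proving it. The following is a minimal Python sketch, not the proof itself. It assumes the sinusoidal definition from section 3.5 of the paper (sine on even dimensions, cosine on odd dimensions, frequencies $1/10000^{2i/d_\text{model}}$). The function names `positional_encoding` and `shift` and the small `d_model=8` are illustrative choices, not taken from the paper. The point is that `shift` depends only on the offset `k`, never on `pos`.

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(omega * pos), math.cos(omega * pos)])
    return pe

def shift(pe, k, d_model=8):
    """Map PE(pos) to PE(pos + k) with a linear map that depends only on k.

    Each (sin, cos) pair is rotated by the angle omega * k, using the
    angle addition theorems for sine and cosine.
    """
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        cos_k, sin_k = math.cos(omega * k), math.sin(omega * k)
        out.append(s * cos_k + c * sin_k)  # sin(omega * (pos + k))
        out.append(c * cos_k - s * sin_k)  # cos(omega * (pos + k))
    return out

def max_deviation(pos, k):
    """Largest absolute difference between shift(PE(pos), k) and PE(pos + k)."""
    shifted = shift(positional_encoding(pos), k)
    target = positional_encoding(pos + k)
    return max(abs(a - b) for a, b in zip(shifted, target))

print(max_deviation(5, 3))
```

The deviation should only be floating point noise, for any choice of `pos` and `k`.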

Notes on “Haskell Programming – from first principles”

From November 13th, 2017 to June 9th, 2018, a friend and I worked our way through the 1285 pages of “Haskell Programming – from first principles” by Christopher Allen and Julie Moronuki. That’s more than six pages per day! While reading and discussing, I took a few notes here and there, which I want to publish in this post. Some of the sentences are taken directly from the book, which, by the way, I highly recommend to anyone who wants to learn Haskell.

Least Squares Derivation

The least squares optimization problem searches for a vector $x$ that minimizes the squared Euclidean norm of the residual $Ax-y$: $$x_\text{opt}=\arg\min_x\frac{1}{2}\left\lVert Ax-y\right\rVert^2_2\,.$$This article explains how the solution to the problem, $x_\text{opt}=(A^\top A)^{-1}A^\top y$ (which requires $A^\top A$ to be invertible), can be derived and how it can be used for regression problems.
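
As a small illustration of the regression use case, here is a hedged Python sketch that fits a straight line $y \approx m t + b$ by solving the normal equations $A^\top A\,x = A^\top y$ directly. The rows of $A$ are assumed to be $(t_i, 1)$, so $A^\top A$ is a 2×2 matrix that can be inverted by hand. The function name `least_squares_line` and the sample data are made up for this example.

```python
def least_squares_line(ts, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals.

    Solves (A^T A) x = A^T y for A = [[t_0, 1], [t_1, 1], ...].
    """
    n = len(ts)
    # Entries of the symmetric 2x2 matrix A^T A = [[s_tt, s_t], [s_t, n]].
    s_tt = sum(t * t for t in ts)
    s_t = sum(ts)
    # Entries of the vector A^T y = [s_ty, s_y].
    s_ty = sum(t * y for t, y in zip(ts, ys))
    s_y = sum(ys)
    det = s_tt * n - s_t * s_t
    if det == 0:
        # Happens when all t_i are equal: A^T A is not invertible.
        raise ValueError("A^T A is singular")
    slope = (n * s_ty - s_t * s_y) / det
    intercept = (s_tt * s_y - s_t * s_ty) / det
    return slope, intercept

# Points that lie exactly on y = 2t + 1.
slope, intercept = least_squares_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)  # → 2.0 1.0
```

For larger problems, a numerical library routine is usually preferable to forming $A^\top A$ explicitly, since the explicit inverse can be poorly conditioned.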

Cubic Spline Interpolation

Cubic spline interpolation is a mathematical method commonly used to construct new points within the range of a set of known points. These new points are function values of an interpolation function (referred to as a spline), which is itself defined piecewise by multiple cubic polynomials. This article explains how the computation works mathematically.

After an introduction, it defines the properties of a cubic spline, lists different boundary conditions (including visualizations), and provides a sample calculation. Furthermore, it acts as a reference for the mathematical background of the cubic spline interpolation tool, which is introduced at the end of the article.
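
To make the idea concrete, the following Python sketch computes a cubic spline under one specific assumption: the natural boundary condition, where the second derivative is zero at both end points. Other boundary conditions lead to a different linear system. The function names `second_derivatives` and `evaluate` are hypothetical and not taken from the article or its tool. The sketch solves the standard tridiagonal system for the second derivatives $M_i$ at the known points and then evaluates the cubic piece that contains $x$.

```python
import bisect

def second_derivatives(xs, ys):
    """Second derivatives M_i of the natural cubic spline at the known points."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # natural boundary condition: M_0 = M_n = 0
    if n < 2:
        return m
    # Tridiagonal system for the interior values M_1 .. M_{n-1}.
    diag = [2.0 * (h[j] + h[j + 1]) for j in range(n - 1)]
    rhs = [6.0 * ((ys[j + 2] - ys[j + 1]) / h[j + 1] - (ys[j + 1] - ys[j]) / h[j])
           for j in range(n - 1)]
    # Forward elimination (Thomas algorithm).
    for j in range(1, n - 1):
        w = h[j] / diag[j - 1]
        diag[j] -= w * h[j]
        rhs[j] -= w * rhs[j - 1]
    # Back substitution.
    m[n - 1] = rhs[n - 2] / diag[n - 2]
    for j in range(n - 3, -1, -1):
        m[j + 1] = (rhs[j] - h[j + 1] * m[j + 2]) / diag[j]
    return m

def evaluate(xs, ys, m, x):
    """Evaluate the spline at x, using the cubic piece whose interval contains x."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    h = xs[i + 1] - xs[i]
    left, right = xs[i + 1] - x, x - xs[i]
    return ((m[i] * left ** 3 + m[i + 1] * right ** 3) / (6.0 * h)
            + (ys[i] / h - m[i] * h / 6.0) * left
            + (ys[i + 1] / h - m[i + 1] * h / 6.0) * right)

xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
m = second_derivatives(xs, ys)
print(evaluate(xs, ys, m, 0.5))  # → 0.6875
```

The known points must be sorted by strictly increasing $x$; the sketch does not check this.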

Guess Solutions of Polynomials

For a given polynomial of $n$th degree

$$P_n(x)=\sum_{i=0}^n a_ix^i = a_nx^n+a_{n-1}x^{n-1}+\dots+a_1x+a_0$$

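you can narrow down the candidates for rational solutions $x$ of the corresponding problem $P_n(x)=0$ by applying the following two rules (the first one requires integer coefficients $a_i$ with $a_0\neq 0$ and $a_n\neq 0$):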

  1. $$x=\frac{p}{q}\text{, with } p \in \mathbb{Z} \land q \in \mathbb{N}\land p\mid a_0 \land q\mid a_n$$
  2. $$\lvert x\rvert\le2\cdot \max\left\lbrace \sqrt[k]{\frac{\lvert a_{n-k}\rvert}{\lvert a_n\rvert}}, k=1, \dots, n\right\rbrace$$
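
The two rules can be turned into a small search. The Python sketch below is one possible way to do it, assuming integer coefficients with $a_0\neq 0$ and $a_n\neq 0$. It enumerates all fractions $\pm p/q$ allowed by rule 1, discards those outside the bound from rule 2, and tests the rest with exact rational arithmetic. The names `divisors` and `rational_roots` are made up for this example.

```python
from fractions import Fraction

def divisors(n):
    """All positive divisors of the non-zero integer n."""
    n = abs(n)
    return [d for d in range(1, n + 1) if n % d == 0]

def rational_roots(coeffs):
    """Rational roots of a_0 + a_1 x + ... + a_n x^n.

    coeffs = [a_0, a_1, ..., a_n], integers with a_0 != 0 and a_n != 0.
    """
    a0, an = coeffs[0], coeffs[-1]
    n = len(coeffs) - 1
    # Rule 2: upper bound on |x|; the tiny factor guards against float rounding.
    bound = 2 * max((abs(coeffs[n - k]) / abs(an)) ** (1 / k)
                    for k in range(1, n + 1)) * (1 + 1e-9)
    # Rule 1: x = p / q with p dividing a_0 and q dividing a_n.
    candidates = set()
    for p in divisors(a0):
        for q in divisors(an):
            for sign in (1, -1):
                x = Fraction(sign * p, q)
                if abs(x) <= bound:
                    candidates.add(x)
    # Keep only the candidates that really are roots (exact arithmetic).
    return sorted(x for x in candidates
                  if sum(a * x ** i for i, a in enumerate(coeffs)) == 0)

# 2x^3 - 3x^2 - 3x + 2 = (x + 1)(2x - 1)(x - 2)
roots = rational_roots([2, -3, -3, 2])
print(roots)  # → [Fraction(-1, 1), Fraction(1, 2), Fraction(2, 1)]
```

Both rules only restrict where to look; a polynomial may have no rational roots at all, in which case the function returns an empty list.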


Trigonometric Functions Formulary

This formulary was created during the online onboarding process at Baden-Wuerttemberg Cooperative State University (DHBW). It is suitable for the related online tests and might be helpful for other people looking for formulas in this field of mathematics.


$$\begin{array}{l} \tan x = \frac{\sin x}{\cos x}\\ \cot x = (\tan x)^{-1} = \frac{\cos x}{\sin x} \end{array}$$