Linear Relationships in the Transformer’s Positional Encoding

In June 2017, Vaswani et al. published the paper “Attention Is All You Need”, which describes the “Transformer” architecture: a purely attention-based sequence-to-sequence model. It can be applied to many tasks, such as language translation and text summarization. Since its publication, the paper has been cited more than one thousand times, and several excellent blog posts have been written on the topic; I recommend this one.

Vaswani et al. use positional encoding to inject information about a token’s position within a sentence into the model. The exact definition is written down in section 3.5 of the paper (it is only a tiny aspect of the Transformer, as the red circle in the cover picture of this post indicates). After the definition, the authors state:

“We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.”

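The quoted claim can be checked numerically. Below is a minimal sketch (not code from the paper; it assumes an even model dimension `d_model` and the interleaved sin/cos layout from section 3.5): for a fixed offset $k$, a block-diagonal matrix of 2×2 rotation-like blocks maps $PE_{pos}$ to $PE_{pos+k}$, independently of $pos$.

```python
import numpy as np

def positional_encoding(pos, d_model=8):
    """Sinusoidal positional encoding from section 3.5 of the paper."""
    i = np.arange(d_model // 2)
    freqs = 1.0 / 10000 ** (2 * i / d_model)  # one frequency per sin/cos pair
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)  # even dimensions: sine
    pe[1::2] = np.cos(pos * freqs)  # odd dimensions: cosine
    return pe

def offset_matrix(k, d_model=8):
    """Matrix M_k with PE(pos + k) = M_k @ PE(pos) for every pos.

    Each 2x2 block implements the angle-addition identities
    sin(a + b) = sin(a)cos(b) + cos(a)sin(b) and
    cos(a + b) = cos(a)cos(b) - sin(a)sin(b),
    with a = pos * freq and b = k * freq. The entries depend only
    on k and the frequency, not on pos -- hence the linearity.
    """
    i = np.arange(d_model // 2)
    freqs = 1.0 / 10000 ** (2 * i / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        M[2 * j,     2 * j]     = np.cos(k * w)
        M[2 * j,     2 * j + 1] = np.sin(k * w)
        M[2 * j + 1, 2 * j]     = -np.sin(k * w)
        M[2 * j + 1, 2 * j + 1] = np.cos(k * w)
    return M

# The same M_k works for any position:
assert np.allclose(offset_matrix(3) @ positional_encoding(5), positional_encoding(8))
assert np.allclose(offset_matrix(3) @ positional_encoding(40), positional_encoding(43))
```

The key observation is that `offset_matrix(k)` is built without ever looking at `pos`, which is exactly what "a linear function of $PE_{pos}$ for any fixed offset $k$" means.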
But why is that? In this post I prove this linear relationship between relative positions in the Transformer’s positional encoding. Continue reading Linear Relationships in the Transformer’s Positional Encoding

[Paper Recap] Multiple Hypotheses Prediction

The paper Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses was published by Christian Rupprecht et al. in late 2016. The authors propose a training technique for machine learning models that makes them predict multiple distinct hypotheses. This is an advantage for many prediction tasks in which uncertainty is part of the problem. In this article I am going to summarize the paper and share some further thoughts. Continue reading [Paper Recap] Multiple Hypotheses Prediction