Linear Relationships in the Transformer’s Positional Encoding

In June 2017, Vaswani et al. published the paper “Attention Is All You Need”, describing the “Transformer” architecture, a purely attention-based sequence-to-sequence model. It can be applied to many tasks, such as machine translation and text summarization. Since its publication, the paper has been cited more than one thousand times, and several excellent blog posts have been written on the topic; I recommend this one.

Vaswani et al. use positional encoding to inject information about a token’s position within a sentence into the model. The exact definition is given in section 3.5 of the paper (it is only a tiny aspect of the Transformer, as the red circle in the cover picture of this post indicates). Following the definition, the authors state:

“We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.”

But why is that? In this post I prove this linear relationship between relative positions in the Transformer’s positional encoding. Continue reading Linear Relationships in the Transformer’s Positional Encoding
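As a preview of the proof idea, here is a brief sketch based on the definitions in section 3.5, writing $\omega_i = 10000^{-2i/d_{\text{model}}}$ for the frequency of the $i$-th sine/cosine pair (the notation is mine). The angle-addition identities give

$$
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}\,.
$$

The matrix on the right depends only on the offset $k$ and the frequency $\omega_i$, not on $pos$, so stacking these $2 \times 2$ blocks expresses $PE_{pos+k}$ as a linear function of $PE_{pos}$.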

Using TensorFlow’s Batch Normalization Correctly

The TensorFlow library’s layers API contains a function for batch normalization: tf.layers.batch_normalization. It is supposedly as easy to use as all the other tf.layers functions; however, it has some pitfalls. This post explains how to use tf.layers.batch_normalization correctly. It does not delve into what batch normalization is, which can be looked up in the paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” by Ioffe and Szegedy (2015). Continue reading Using TensorFlow’s Batch Normalization Correctly
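For reference, a minimal sketch of the commonly recommended usage pattern (TensorFlow 1.x API; the placeholder names, layer sizes, and loss are illustrative, not taken from the post):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 64])
is_training = tf.placeholder(tf.bool)  # True while training, False at inference

h = tf.layers.dense(x, 32)
# `training` switches between batch statistics (training) and the
# learned moving averages (inference).
h = tf.layers.batch_normalization(h, training=is_training)
h = tf.nn.relu(h)

loss = tf.reduce_mean(tf.square(h))

# Pitfall: the ops that update the moving mean/variance are collected in
# UPDATE_OPS and are not run automatically; attach them to the train op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer().minimize(loss)
```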

[Paper Recap] Multiple Hypotheses Prediction

The paper Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses was published by Christian Rupprecht et al. in late 2016. The authors propose a training technique for machine learning models that makes them predict multiple distinct hypotheses. This is an advantage for many prediction tasks in which uncertainty is part of the problem. In this article I summarize the paper and share some further thoughts. Continue reading [Paper Recap] Multiple Hypotheses Prediction
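To make the idea concrete, here is a minimal sketch of a winner-takes-all style loss over $M$ hypotheses (a simplification for illustration only; the paper relaxes the hard minimum in its meta loss, and the function name and tensor shapes below are my own):

```python
import tensorflow as tf

def multi_hypothesis_loss(hypotheses, target):
    """Winner-takes-all loss sketch.

    hypotheses: [batch, M, dim] -- M predictions per sample.
    target:     [batch, dim]    -- single ground truth per sample.
    """
    # Squared error of every hypothesis against the ground truth.
    errors = tf.reduce_sum(tf.square(hypotheses - target[:, None, :]), axis=-1)
    # Only the best-matching hypothesis receives gradient, which lets the
    # M outputs specialize on different modes of the target distribution.
    return tf.reduce_mean(tf.reduce_min(errors, axis=1))
```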

Timœ Goes By: Verifying the Age Estimator How-Old.net

how-old.net is a project from Microsoft that guesses the ages of people in a photo based on their faces. I have tested it with an image consisting of 21 photos. Each photo was taken around the time of my n-th birthday, so the ideal classification would label them as 0, 1, 2, …, 20. Let’s see what how-old.net comes up with. Continue reading Timœ Goes By: Verifying the Age Estimator How-Old.net