# My Favorite Papers at NeurIPS 2019

In December 2019 I had the pleasure of attending the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) in Vancouver, Canada, to present my work on document understanding. During the three main conference days, a total of 1,428 papers were on display in the poster sessions. I tried to read at least every title and to take a closer look at the posters that seemed interesting and relevant to my work. For the posters that made me want to read the paper, I noted down the title. This post lists these papers, grouped by category, with some additional comments. I am certain that plenty more papers would deserve a place on the list, and the process of picking a “favorite” paper is surely very noisy. The list might still be useful to some readers, which is why I publish it here.

### Best

1. Superposition of many models into one Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, Bruno Olshausen
It is common knowledge that deep neural networks are over-parameterized, i.e. the number of parameters is much larger than theoretically needed to approximate the target function. Many papers therefore deal with model compression, distillation, pruning, or quantization, all of which aim to reduce a model’s size after training. In multi-task learning, on the other hand, the goal is to train a single model on multiple tasks, often sequentially. Cheung et al. take an entirely different approach: they train the same network architecture on multiple tasks with distinct sets of weights and merge the weight matrices in a subsequent post-training step. The surprising insight is that the merged weight matrices still perform comparably well on the tasks they were previously trained on. The authors have published their code.
2. Guided Similarity Separation for Image Retrieval Chundi Liu, Guangwei Yu, Maksims Volkovs, Cheng Chang, Himanshu Rai, Junwei Ma, Satya Krishna Gorti
The authors from layer6 propose a novel method for “image retrieval”, the task of searching for related images given a query that is itself an image. Given a large set of images, the authors compute a vector representation for every image using a pre-trained image classifier (e.g. one trained on ImageNet) and build a $k$-nearest-neighbor graph. They iteratively update the graph to obtain new image representation vectors that better reflect the semantic similarity of the images. The intuition is that the $k$-NN graph allows different images of the same thing to be connected through several intermediate nodes. The method achieves strong empirical results. On top of that, the source code is open.
3. Zero-shot Knowledge Transfer via Adversarial Belief Matching Paul Micaelli, Amos J. Storkey
It is an established idea to train a student network to approximate the function of a teacher network with more weights. Commonly, one trains a large model on a given dataset $\mathcal{D}$ and subsequently trains a smaller network to mimic the larger one by feeding samples from $\mathcal{D}$ into both and making the outputs match. The paper proposes a method where $\mathcal{D}$ is not required for training the student: instead of sampling from the distribution of images that the teacher was trained on, the authors take a third, generative model and make it generate images on which the student is asked to match the teacher. Specifically, the generator is trained to maximize the KL divergence between the student and teacher outputs, while the student tries to minimize it, and the teacher’s weights stay frozen. I find it intriguing that the generated images (the authors call them “pseudo data”) do not look like any image from $\mathcal{D}$ at all, and yet the student later performs very well on images similar to the ones in $\mathcal{D}$. The code is publicly available.
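
The quantity that generator and student fight over in the third paper can be sketched in a few lines of NumPy. This is only a minimal sketch under stated assumptions, not the authors' implementation: the helper names `softmax` and `kl_teacher_student`, the toy logits, and the direction of the divergence (teacher relative to student) are all chosen here for illustration. In a real training loop the generator would take gradient steps that increase this value, while the student would take steps that decrease it.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_teacher_student(teacher_logits, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over a batch of pseudo data.

    Assumption: both networks output class logits of shape (batch, classes).
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    per_sample = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(per_sample.mean())


# Toy logits standing in for network outputs on generated images (hypothetical values).
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
student_far = np.array([[-1.0, 0.5, 2.0], [3.0, 0.2, 0.1]])

# The generator is rewarded when the two networks disagree ...
disagreement = kl_teacher_student(teacher, student_far)
# ... while the student is rewarded for driving the divergence towards zero.
agreement = kl_teacher_student(teacher, teacher)
```

The teacher only appears as a fixed source of target distributions, which mirrors the fact that its weights are frozen.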

### NLP

Transformer-based

• Visualizing and Measuring the Geometry of BERT Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, Been Kim
The authors examine the geometry of BERT’s internal representations. BERT is everywhere in NLP, so understanding it a bit better seems relevant.
• Novel positional encodings to enable tree-based transformers Vighnesh Shiv, Chris Quirk
The positional encodings of Transformer models are designed for 1D sequences. The authors come up with a novel positional encoding scheme to represent node positions within trees. I see this as one contribution to the general line of work on conditioning Transformers on information other than the set of inputs themselves.
• Are Sixteen Heads Really Better than One? Paul Michel, Omer Levy, Graham Neubig
At test time, a large proportion of the attention heads in Transformer models can be removed, in some cases all but one. This is a surprising way of reducing the parameter count of Transformer models, somewhat comparable to removing underutilized CNN filters.
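
To make the idea of removing heads at test time concrete, here is a minimal NumPy sketch of self-attention with a per-head mask. It is an illustration under assumptions, not the code from the paper: the function name `masked_multi_head_attention`, the random weights, and the bias-free projections are all hypothetical simplifications.

```python
import numpy as np


def softmax(x):
    """Softmax over the last axis with max-subtraction for stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def masked_multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask):
    """Self-attention where head_mask[h] == 0 removes head h.

    x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model);
    head_mask: (num_heads,) with entries in {0, 1}.
    """
    seq_len, d_model = x.shape
    num_heads = head_mask.shape[0]
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    per_head = softmax(scores) @ v  # (num_heads, seq_len, d_head)

    # Zero out the pruned heads before concatenating and projecting.
    per_head = per_head * head_mask[:, None, None]
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o


# Toy dimensions and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

full = masked_multi_head_attention(x, w_q, w_k, w_v, w_o, np.ones(num_heads))
one_head = masked_multi_head_attention(
    x, w_q, w_k, w_v, w_o, np.array([1.0, 0.0, 0.0, 0.0])
)
no_heads = masked_multi_head_attention(x, w_q, w_k, w_v, w_o, np.zeros(num_heads))
```

Masking alone only simulates removal. To actually shrink the parameter count, one would also have to drop the corresponding column blocks of the query, key, and value projections and the matching row block of the output projection.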

### Other

Applied Scientist with Zalando and Founder of Denk Development. Interested in data science, software engineering, math, microcontrollers, and sports.