# My Favorite Papers at NeurIPS 2019

In December 2019 I had the pleasure of attending the **33rd Conference on Neural Information Processing Systems** (NeurIPS 2019) in Vancouver, Canada, to present my work on document understanding. During the three main conference days, a total of 1,428 papers were exhibited in poster sessions. I tried to read at least every title and to look more carefully at the posters that seemed interesting and relevant to my work. Whenever a poster made me want to read the paper, I noted down its title. This post lists these papers, grouped by category, with some additional comments. I am certain that plenty more papers deserve to be on the list, and the process of picking a “favorite” paper is surely very noisy. The list might still be useful to some readers, which is why I publish it here.

### Best

**Superposition of many models into one**
Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, Bruno Olshausen

*It is common knowledge that deep neural networks are over-parameterized, i.e. the number of parameters is much larger than theoretically needed to approximate the target function. Many papers therefore deal with model compression, distillation, pruning, or quantization, all of which aim to reduce a model’s size after training. In multi-task learning, on the other hand, the goal is to train a single model on multiple tasks, often sequentially. Cheung et al. take an entirely different approach: they train the same network architecture on multiple tasks with distinct sets of weights and merge the weight matrices in a subsequent post-training step. The surprising insight is that the merged weight matrices still perform comparably well on the tasks they were previously trained on.*

**Guided Similarity Separation for Image Retrieval**
Chundi Liu, Guangwei Yu, Maksims Volkovs, Cheng Chang, Himanshu Rai, Junwei Ma, Satya Krishna Gorti

*The authors from layer6 propose a novel method for “image retrieval”, the task of searching for related images given a query that is itself an image. Given a large set of images, the authors compute a vector representation for every image using a pre-trained image classifier (e.g. one trained on ImageNet) and build a $k$-nearest-neighbor graph. They iteratively update the graph to obtain new image representation vectors that better reflect the semantic similarity of the images. The intuition is that the $k$-NN graph allows different images of the same thing to be connected through several nodes. The method achieves strong empirical results. On top of that, the source code is open.*

**Zero-shot Knowledge Transfer via Adversarial Belief Matching**
Paul Micaelli, Amos J. Storkey

*It is an established idea to train student networks to approximate the function of a teacher network with more weights. Commonly, one would train a large model on a given dataset $\mathcal{D}$ and subsequently train a smaller network to mimic the behavior of the larger network by feeding samples from $\mathcal{D}$ into both and making the outputs match. The paper proposes a method where $\mathcal{D}$ is not required for training the student: instead of sampling from the distribution of images that the teacher was trained on, they take a third, generative model and make it generate images on which the student is asked to match the teacher. Specifically, the generator is trained to maximize the KL divergence between student and teacher, while the student tries to minimize it, and the teacher’s weights are frozen. I find it intriguing that the generated images (the authors call them “pseudo data”) do not look like any image from $\mathcal{D}$ at all, and yet the student will later perform very well on images similar to the ones in $\mathcal{D}$. The code is publicly available.*
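The adversarial game described above can be written down as a single objective. This is only a sketch in my own notation, not copied from the paper: $T$ is the frozen teacher, $S_\theta$ the student, $G_\phi$ the generator, and $z$ a noise vector. The direction of the KL divergence and the Gaussian noise prior are assumptions on my part.

$$
\max_{\phi} \; \min_{\theta} \;\; \mathbb{E}_{z \sim \mathcal{N}(0, I)} \Big[ D_{\mathrm{KL}}\big( T(G_\phi(z)) \,\|\, S_\theta(G_\phi(z)) \big) \Big]
$$

In practice such a game would be optimized with alternating gradient steps on $\phi$ and $\theta$, while the teacher's weights receive no updates.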

### NLP

**Cross-lingual Language Model Pretraining**
Alexis Conneau, Guillaume Lample

*This work is from the same Facebook lab as “Word Translation Without Parallel Data”, one of the most intriguing papers I have ever come across. In their new work they introduce a Translation Language Modeling (TLM) objective in which a BERT model is given the same sentence in two languages with masking and is trained to fill the gaps.*

**Training Language GANs from Scratch**
Cyprien de Masson d’Autume, Shakir Mohamed, Mihaela Rosca, Jack Rae

*GANs are very common in image generation, so I find it interesting to see them applied successfully to text too, here without maximum likelihood pre-training (the standard approach to learning language models).*

**Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics**
Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, David Sussillo

*The authors look into RNNs and find that the dynamics are low-dimensional. This seems related to the dynamics of BERT models, see “Are Sixteen Heads Really Better than One?”, which is listed below.*

**On the Downstream Performance of Compressed Word Embeddings**
Avner May, Jian Zhang, Tri Dao, Christopher Ré

*There was very interesting, theoretically grounded work on the optimal dimensionality of word embeddings at NeurIPS last year. May et al. propose a measure of compression quality for word embeddings.*

**Transformer-based**

**Visualizing and Measuring the Geometry of BERT**
Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, Been Kim

*The authors look at the internal dynamics of BERT models; BERT is everywhere in NLP so understanding it a bit better seems relevant.*

**Novel positional encodings to enable tree-based transformers**
Vighnesh Shiv, Chris Quirk

*The positional encodings of Transformer models are made for 1D sequences. The authors come up with a novel positional encoding scheme to represent node positions within trees. I see this as one work in the general framework of conditioning Transformers on information other than the set of inputs themselves.*

**Are Sixteen Heads Really Better than One?**
Paul Michel, Omer Levy, Graham Neubig

*At test time, a large proportion of the attention heads of Transformer models can be removed, in some cases all but one. This is a surprising way of reducing the parameter count of Transformer models, somewhat comparable to removing CNN filters that are underutilized.*
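To make the idea of removing heads concrete, here is a minimal NumPy sketch of multi-head self-attention with a binary mask per head. It is not the authors' code: the function name `multi_head_attention`, the `head_mask` argument, the shapes, and the random weights are all illustrative assumptions. Deciding which heads can safely be removed is a separate step that the sketch does not show.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, wq, wk, wv, wo, head_mask):
    """Self-attention over x of shape (seq, d_model).

    wq, wk, wv: (heads, d_model, d_head); wo: (heads, d_head, d_model).
    Heads whose mask entry is 0 are skipped entirely.
    """
    out = np.zeros((x.shape[0], wo.shape[2]))
    for h in range(wq.shape[0]):
        if head_mask[h] == 0:
            continue  # a removed head costs no compute and adds nothing
        q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out += (attn @ v) @ wo[h]
    return out


# Hypothetical toy setup with random (untrained) weights.
rng = np.random.default_rng(0)
heads, d_model, d_head, seq = 4, 8, 2, 5
x = rng.normal(size=(seq, d_model))
wq, wk, wv = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
wo = rng.normal(size=(heads, d_head, d_model))

full = multi_head_attention(x, wq, wk, wv, wo, np.ones(heads))
# Keep only head 0, i.e. all but one head removed.
pruned = multi_head_attention(x, wq, wk, wv, wo, np.array([1, 0, 0, 0]))
```

With a trained model one would compare the task accuracy obtained from `full` and `pruned`; with random weights, as here, the two outputs simply differ. Because the output projection is a sum over heads, the output is the sum of the per-head contributions, which is what makes masking individual heads well defined.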

### Adversarial Robustness

**Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks**
Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, George Pappas

*Finding a tight bound for the estimation of a neural network’s Lipschitz constant is one way of certifying a neural network’s security with respect to adversarial attacks. There has been a lot of work on exactly this topic and the authors make the bound tighter.*

**Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers**
Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, Greg Yang

**Adversarial Robustness through Local Linearization**
Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli

**Unlabeled Data Improves Adversarial Robustness**
Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C. Duchi, Percy S. Liang

*The authors use semi-supervised training to make a model more robust while retaining its classification accuracy.*

**On the Hardness of Robust Classification**
Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, James Worrell

### Graph Networks

**On the equivalence between graph isomorphism testing and function approximation with GNNs**
Zhengdao Chen, Soledad Villar, Lei Chen, Joan Bruna

*Zhengdao stood in line with me on the first conference day. Naturally, I got a private presentation of the paper and found it very interesting.*

**Deep Set Prediction Networks**
Yan Zhang, Jonathon Hare, Adam Prugel-Bennett

*Converting from set to vector and from vector to set is relevant in many applications. The authors take a set-to-vector model and, for a given vector, use the model’s gradient with respect to its input to find a set that corresponds to that vector.*

**Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels**
Simon S. Du, Kangcheng Hou, Russ R. Salakhutdinov, Barnabas Poczos, Ruosong Wang, Keyulu Xu

**Multi-relational Poincaré Graph Embeddings**
Ivana Balazevic, Carl Allen, Timothy Hospedales

### Model Architectures

**Putting An End to End-to-End: Gradient-Isolated Learning of Representations**
Sindy Löwe, Peter O’Connor, Bastiaan Veeling

**FastSpeech: Fast, Robust and Controllable Text to Speech**
Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

### Deep Learning Understanding

**The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies**
Basri Ronen, David Jacobs, Yoni Kasten, Shira Kritchman

**Random deep neural networks are biased towards simple functions**
Giacomo De Palma, Bobak Kiani, Seth Lloyd

**Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent**
Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

**Large Scale Structure of Neural Network Loss Landscapes**
Stanislav Fort, Stanislaw Jastrzebski

*Loss landscapes are a much studied and still poorly understood subject of research. The authors push the boundary and present new findings in the field.*

### Computer Vision

**High-Quality Self-Supervised Deep Image Denoising**
Samuli Laine, Tero Karras, Jaakko Lehtinen, Timo Aila

**HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models**
Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F. Fei-Fei, Michael Bernstein

### Causality

**Perceiving the arrow of time in autoregressive motion**
Kristof Meding, Dominik Janzing, Bernhard Schölkopf, Felix A. Wichmann

**Causal Confusion in Imitation Learning**
Pim de Haan, Dinesh Jayaraman, Sergey Levine

### Other

**Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration**
Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach

*We used temperature scaling with Chargrid at SAP, so I find it interesting to look into new research that improves upon it.*
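For readers who have not seen temperature scaling, the baseline that the paper improves upon, here is a minimal NumPy sketch. It is not the Dirichlet calibration method from the paper and not the code we used at SAP. The function names, the toy data, and the grid search are my own assumptions; a gradient-based optimizer could be used for the fit instead of the grid. The idea is to divide the logits by a single scalar temperature chosen to minimize the negative log-likelihood on held-out data.

```python
import numpy as np


def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def nll(logits, labels, temperature):
    # Mean negative log-likelihood of the true labels at a given temperature.
    logp = log_softmax(logits / temperature)
    return -logp[np.arange(len(labels)), labels].mean()


def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    # Pick the temperature with the lowest validation NLL (simple grid search).
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Toy validation set: labels are drawn from softmax(z), but the "model"
# reports logits 3 * z, so it is overconfident by construction.
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 5))
probs = np.exp(log_softmax(z))
labels = np.array([rng.choice(5, p=p) for p in probs])
logits = 3.0 * z

temperature = fit_temperature(logits, labels)
calibrated = np.exp(log_softmax(logits / temperature))
```

Dividing by a positive temperature does not change which class has the highest score, so accuracy is unaffected; only the confidence of the predictions changes. A single scalar is a very restricted calibration map, which is the limitation that richer methods aim to address.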