Since its launch in 1987, the annual Conference on Neural Information Processing Systems (NIPS) has brought together researchers working on neural networks and related fields. It later diversified to become one of the largest conferences in machine learning, and in recent years the trend towards deep learning has brought the conference closer to its roots. The 2016 program spanned six days (Dec 5 to 10) and included tutorials, oral and poster presentations, workshops, and invited talks on a broad range of research topics.

Following their previous *Insights* post on ICML 2016, Two Sigma researchers Vinod Valsalam and Firdaus Janoos discuss below the notable advances in deep learning, optimization algorithms, Bayesian techniques, and time-series analysis presented at NIPS 2016.

## Overview

The conference featured 550+ accepted papers and 50+ workshops, and the number of attendees more than doubled in the past two years (from more than 2500 in 2014 to over 5000 in 2016), demonstrating rapidly growing interest in machine learning and artificial intelligence. Industry participation was strong (Two Sigma was among the more than 60 sponsors), both for recruiting talent and for presenting advances in the field.

Several interesting invited talks were given by researchers who are established in both academia and industry. Yann LeCun (NYU & Facebook) made a case for why unsupervised learning is the future, Drew Purves (DeepMind) talked about why AI and nature need each other, and Marc Raibert (Boston Dynamics) showed how it is possible to engineer sophisticated legged robots without using learning. Although not yet mainstream, the ultimate goal of achieving Artificial General Intelligence (AGI) is also gathering industry momentum. The major players in this area are attracting the research community to their software platforms with OpenAI announcing their Universe platform, DeepMind open-sourcing their Lab platform, and GoodAI releasing an update to their Brain Simulator.

The dominating theme at the conference was deep learning, sometimes combined with other machine learning methods such as reinforcement learning and Bayesian techniques. The core area of deep learning seems to be maturing with topics shifting from architectures and layers to better learning algorithms and analysis. A notable area that is still outside the strong influence of deep learning is time-series analysis, which classical and Bayesian approaches still dominate. There was also an interesting track on optimization. Below is a highly pared-down list of the myriad presentations that Vinod and Firdaus thought were promising, relevant, or generally interesting.

## Tutorials

The tutorials this year were very good, touching upon many of the exciting and salient technologies in machine learning.

### Deep Reinforcement Learning Through Policy Optimization

Reinforcement Learning (RL) has been grabbing headlines recently by applying deep learning techniques. One way to build RL agents is to optimize their policies directly, which is often more convenient than working with state value functions, and which was the focus of this tutorial by Pieter Abbeel (Berkeley and OpenAI) and John Schulman (OpenAI) [slides]. The key to building such agents using deep learning is computing policy gradients effectively. However, gradient evaluations are often noisy, requiring variance reduction techniques that add complexity to the algorithms. For example, the vanilla policy gradient algorithm (Reinforce) uses an advantage estimate that subtracts a fitted baseline from returns. More advanced algorithms, such as Asynchronous Advantage Actor Critic (A3C) and Generalized Advantage Estimation (GAE), use value functions for variance reduction. Other important considerations for RL algorithms are step size and sample complexity, which Trust Region Policy Optimization (TRPO) and Guided Policy Search (GPS) are designed to address.
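To make the baseline idea concrete, here is a minimal sketch of a Reinforce-style update on a toy two-armed bandit. The bandit, the softmax policy, the running-average baseline, and all hyperparameters are illustrative assumptions rather than anything shown in the tutorial.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reinforce(arm_means, steps=3000, lr=0.1, seed=0):
    """Policy-gradient training on a hypothetical Gaussian bandit."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0  # running estimate of the average reward
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 0.5)
        # Advantage estimate: return minus baseline (reduces variance).
        advantage = reward - baseline
        for i in range(len(logits)):
            # Gradient of log pi(action) with respect to logit i.
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


final_probs = train_reinforce([0.0, 1.0])
print(final_probs)  # the second arm should receive most of the probability
```

Subtracting the baseline leaves the expected gradient unchanged but shrinks its variance, which is the role the fitted baseline plays in the vanilla algorithm described above.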

They concluded the tutorial with an announcement of the OpenAI Universe, a platform for training and measuring AI agents. It is based on the previous OpenAI Gym and includes a lot of games and other applications to train agents with the goal of artificial general intelligence (AGI). With the same goal, DeepMind announced in another talk that they are open-sourcing their Lab platform, a 3D learning environment based on Quake. It seems like both companies are placing their bets on deep RL to make inroads into AGI.

### Theory and Algorithms for Forecasting Non-Stationary Time Series

In this interesting but theory-heavy tutorial, Vitaly Kuznetsov (Google) and Mehryar Mohri (NYU) presented a relatively new way of analyzing time-series data based on a learning-focused concept called discrepancy [slides]. They started with a clear introduction to the classical autoregressive family of models but pointed out that the model and data assumptions often don’t hold in practice. The new approach, in contrast, is based on a discrepancy measure that is estimated from data, taking into account the loss function and hypothesis set used for learning. Essentially, discrepancy measures the degree of non-stationarity and can be used to guide the design of new learning algorithms. They showed results of such an algorithm that outperformed ARIMA on financial and weather data in most cases. A further extension of their method, which combines batch and online learning for time-series prediction, also seems promising.

### Large-Scale Optimization: Beyond Stochastic Gradient Descent and Convexity

This useful tutorial by Suvrit Sra (MIT) and Francis Bach (INRIA) summarized recent progress in optimization research with a focus on algorithms and their convergence guarantees [slides1, slides2]. The optimization problem in supervised machine learning is to find the parameters of a prediction function such that the average loss on a given set of data is minimized. Since computing the average loss entails summing over the data points, the tutorial focused on methods for finite sums with an optional regularization term. Stochastic gradient descent (SGD) and its many variants are often used as the optimization algorithm to solve such problems. But designing such algorithms and analyzing their convergence behavior is hard for non-convex problems that arise in such methods as deep learning and matrix factorization. Nevertheless, recent work has extended the convergence results for convex finite sums to non-convex sums as well. Moreover, similar results are now available for the new breed of variance reduction methods, such as stochastic variance reduced gradient (SVRG), that converge much faster [Reddi et al., 2016a]. The large-scale aspect of the tutorial was relegated to the end with a brief discussion of asynchronous and distributed versions of the aforementioned stochastic algorithms.
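As a rough illustration of variance reduction on a finite sum, the sketch below applies an SVRG-style update to a one-dimensional least-squares problem. The data, step size, and loop lengths are made-up assumptions chosen only so that the example runs quickly; it is not code from the tutorial.

```python
import random


def grad_i(w, x, y):
    """Gradient of the i-th term 0.5 * (x * w - y)^2 with respect to w."""
    return (x * w - y) * x


def svrg(xs, ys, epochs=30, inner_steps=8, lr=0.05, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (x_i * w - y_i)^2 with SVRG-style steps."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    for _ in range(epochs):
        # Snapshot of the parameter and its full (exact) gradient.
        snapshot = w
        full_grad = sum(grad_i(snapshot, x, y) for x, y in zip(xs, ys)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Stochastic gradient corrected by the snapshot gradient:
            # still unbiased, but its variance shrinks near the optimum.
            g = grad_i(w, xs[i], ys[i]) - grad_i(snapshot, xs[i], ys[i]) + full_grad
            w -= lr * g
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
w_svrg = svrg(xs, ys)
# Closed-form least-squares solution for comparison.
w_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w_svrg, w_exact)
```

Unlike plain SGD, this update can use a constant step size and still converge to the exact minimizer, because the correction term vanishes as the iterate and the snapshot both approach the optimum.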

### Generative Adversarial Networks (GANs)

In this tutorial, Ian Goodfellow (OpenAI) presented GANs, a really neat idea of using neural networks to generate samples from a distribution learned from data, with the twist that the sampling happens by playing an adversarial game rather than by running MCMC. The learning process consists of a game between two adversaries: a generator network that attempts to produce realistic samples, and a discriminator network that attempts to identify whether samples originated from the training data or from the generative model. At the Nash equilibrium of this game, the generator network reproduces the data distribution exactly, and the discriminator network cannot distinguish generated samples from the training data. Both networks can be trained using stochastic gradient descent with exact gradients computed by backpropagation. While most of the use-cases presented here involved generating samples from images, with a little more thought and experience it should be possible to find many more relevant applications of this cool technology.
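For reference, the two-player game is usually written as the minimax objective from the original GAN paper [Goodfellow et al., 2014]. The notation below follows that paper and is not defined in this post: $D(x)$ is the discriminator's estimated probability that $x$ is real, $G(z)$ is the generator's output for noise $z$, and $p_z$ is the noise prior.

$$
\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
$$

At the equilibrium described above, the generator's distribution matches $p_{\text{data}}$ and the optimal discriminator outputs $D(x) = 1/2$ everywhere.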

### Variational Inference: Foundations and Modern Methods

This tutorial by David Blei (Columbia), Shakir Mohamed (DeepMind), and Rajesh Ranganath (Princeton) covered variational inference (VI) methods for approximating probability distributions through optimization. These methods tend to be faster than alternatives such as Monte Carlo sampling, and are making inroads beyond their stronghold of Bayesian networks into neural networks. Towards the end of this tutorial, they described some of the newer advances in VI such as Monte Carlo gradient estimation, black box variational inference, stochastic approximation, and variational auto-encoders.
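The optimization problem behind VI can be summarized by the standard evidence lower bound (ELBO). The symbols here are the conventional ones rather than notation taken from the tutorial: $x$ is the observed data, $z$ the latent variables, and $q_\phi(z)$ the approximating distribution with parameters $\phi$.

$$
\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(z)}\big[\log p(x, z) - \log q_\phi(z)\big] = \log p(x) - \mathrm{KL}\big(q_\phi(z) \,\|\, p(z \mid x)\big) \;\le\; \log p(x)
$$

Because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO is equivalent to minimizing the KL divergence between the approximation and the true posterior, which is how inference becomes optimization.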

## Deep Learning Papers

The following are a few selected papers on deep learning, covering topics in reinforcement learning, training techniques, generative modeling, and recurrent networks.

### Value Iteration Networks for Deep Reinforcement Learning

This was an award talk [Tamar et al., 2016], based on a key observation that deep reinforcement learning networks are very similar to image recognition networks, i.e., they have convolution layers for feature extraction followed by fully connected layers that map features to action probabilities. Specifically, the expectation in the value iteration RL algorithm corresponds to convolution, max to max pool, and the number of iterations to the number of layers. This means that value iteration can be implemented as a convolutional network, which can then be trained to plan actions for new tasks such as for navigating a new map. Learning to plan in this way leads to better generalization across similar tasks, but the network does require some careful engineering.
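The correspondence is easier to see next to plain value iteration. The sketch below runs classical value iteration on a hypothetical five-state corridor; the environment, rewards, and slip probability are assumptions made for illustration, and this is ordinary tabular planning, not the paper's network.

```python
def step_distribution(state, action, n_states, slip):
    """Return [(probability, next_state)] for action -1 (left) or +1 (right)."""
    intended = min(max(state + action, 0), n_states - 1)
    slipped = min(max(state - action, 0), n_states - 1)
    return [(1.0 - slip, intended), (slip, slipped)]


def value_iteration(n_states=5, gamma=0.9, slip=0.1, iterations=50):
    """Tabular value iteration; the rightmost state is a terminal goal."""
    goal = n_states - 1
    values = [0.0] * n_states
    for _ in range(iterations):  # analogous to stacking layers
        new_values = values[:]
        for s in range(n_states):
            if s == goal:
                continue  # terminal state keeps value 0
            q_values = []
            for action in (-1, +1):
                # Expectation over next states: a local weighted sum,
                # which is the part that maps onto a convolution.
                q = sum(
                    p * ((1.0 if s2 == goal else 0.0) + gamma * values[s2])
                    for p, s2 in step_distribution(s, action, n_states, slip)
                )
                q_values.append(q)
            # Max over actions: the part that maps onto max pooling.
            new_values[s] = max(q_values)
        values = new_values
    return values


values = value_iteration()
print(values)  # values increase toward the goal
```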

### Deep Learning Without Poor Local Minima

Although it was one of the few purely theoretical talks at the conference, this talk tackled the important problem of characterizing the nature of local minima when optimizing deep neural networks [Kawaguchi, 2016]. In particular, the author proved the following conjecture about deep linear networks that was published in the late 1980s [Baldi and Hornik, 1989]: every local minimum is a global minimum, and every critical point that is not a global minimum is a saddle point. Note that this result is not obvious since these networks have non-convex and non-concave loss functions despite having linear activation functions. Furthermore, under two unrealistic assumptions, he also extended this result to deep nonlinear networks. Although this work is not directly applicable to practical deep nonlinear models because of the unrealistic assumptions, it does show that any bad local minima that arise are due to their nonlinear activations.

### Alternatives to Batch Normalization

Two promising alternatives to the commonly used batch normalization (BN) technique [Ioffe and Szegedy, 2015] for speeding up the training of deep networks were presented at the conference. The first one, called weight normalization (WN), reparameterizes each weight vector by decoupling its length and direction and then performs SGD on the length and direction parameters directly [Salimans and Kingma, 2016]. Unlike BN, WN doesn’t use noisy estimates of mini-batch statistics; in fact, it avoids dependence on the mini-batch altogether. Therefore, it works well with noise-sensitive applications such as recurrent neural networks, deep reinforcement learning networks, and generative networks. WN is also simpler and less computationally expensive while providing much of the speed-up. However, it does require more careful parameter initialization.
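The reparameterization itself is short enough to sketch. Below, a weight vector is expressed as a scalar length `g` times the unit vector in the direction of `v`; the function name and the example values are assumptions for illustration.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| equals g regardless of ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


v = [3.0, 4.0]            # direction parameters (only the direction matters)
w = weight_norm(v, g=2.0)  # length parameter g sets the norm of w
w_rescaled = weight_norm([10.0 * x for x in v], g=2.0)
print(w, w_rescaled)  # both are approximately [1.2, 1.6]
```

In training, gradients would be taken with respect to `g` and `v` rather than `w`; no mini-batch statistics appear anywhere in the computation.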

The second alternative is called layer normalization (LN) [Ba et al., 2016b]. For normalizing the input to each neuron, BN uses the summed activations to that neuron for all the examples in a mini-batch, while LN uses the summed activations to all the same-layer neurons for a single training example. In this way, LN avoids the undesirable dependence of BN on mini-batches. Moreover, applying LN to RNNs becomes straightforward, since each layer is normalized separately at each time step. The paper also contains a nice table showing whether or not BN, WN, and LN are invariant under various input and parameter transformations. For example, they showed that LN is robust to both input and weight matrix scaling, just like BN. Experimental results on a number of benchmark tasks showed that LN outperforms both BN and WN significantly, especially in RNN models. Assuming that these results hold more broadly, LN may work as a good default replacement for BN.
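The difference between the two schemes comes down to the axis over which statistics are computed. The following sketch, which omits the learned gain and bias and uses made-up inputs, normalizes a small batch both ways.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch):
    """batch[i][j] is the summed input to neuron j for example i.

    Statistics are computed per neuron, across the examples in the batch.
    """
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Statistics are computed per example, across the neurons in the layer."""
    return [normalize(row) for row in batch]


batch = [[1.0, 2.0, 3.0], [2.0, 0.0, 10.0]]
ln_full = layer_norm(batch)
ln_single = layer_norm([batch[0]])
bn_full = batch_norm(batch)
bn_single = batch_norm([batch[0]])
print(ln_full[0] == ln_single[0])  # True: LN ignores the other examples
print(bn_full[0] == bn_single[0])  # False: BN output depends on the batch
```

Because the LN output for one example never depends on the others, the same computation works for a batch of size one and at every time step of an RNN.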

### Learning Interpretable and Disentangled Representations using InfoGANs

The feature representations learned by regular GANs [Goodfellow et al., 2014] are not interpretable since the noise inputs to the generator do not correspond to any semantic features of the data. InfoGANs were designed to remedy this issue by decomposing the generator noise vector into an incompressible noise part and a latent code part that targets salient semantic features of the data [Chen et al., 2016]. During training, they use an additional objective to maximize the mutual information between the latent code and the generator distribution.

The authors showed some pretty impressive results for digits and faces datasets in which they obtained interpretable and disentangled representations that are competitive with those produced by supervised methods. For example, the latent code for MNIST digit classification can be designed to comprise one discrete random variable with ten categories to represent the digit and two continuous variables to represent the angle and stroke thickness of the digit. Varying the value of these variables on a trained network varies the digit, its angle, and its stroke thickness respectively in the generated output. Besides being simply cool, these networks are likely to have applications in many unsupervised learning tasks.

### Stochastic Depth for Training Very Deep Networks

Recently, very deep networks with hundreds of layers have become possible using techniques such as Highway Network [Srivastava et al., 2015] and Residual Network (ResNet) [He et al., 2016]. These techniques introduce skip connections between layers to improve the flow of activations and gradients that would otherwise diminish across a large number of layers. However, training gets slower with depth. Stochastic depth is a neat idea that makes it possible to train with shallow networks and test with deep networks, resulting in substantial reductions in training time [Huang et al., 2016]. For each training mini-batch, a random subset of layers is bypassed with the identity function. During testing, all layers are used with weights based on how often they were used during training. This method acts as a regularizer similar to dropout [Srivastava et al., 2014], but doesn’t lose effectiveness when used with batch normalization [Ioffe and Szegedy, 2015] on ResNets. On the CIFAR-10 image dataset, applying stochastic depth to a ResNet with more than 1200 layers successfully avoided previously observed overfitting problems and improved test error to produce a new record!
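A minimal sketch of the train/test asymmetry follows. It uses scalar activations and omits the nonlinearity that follows each residual block; the linearly decaying survival schedule is our recollection of the rule in Huang et al. and should be treated as an assumption, as should all function names.

```python
import random


def linear_decay_survival(layer_index, num_layers, last_survival=0.5):
    """Survival probability falling linearly with depth (assumed schedule)."""
    return 1.0 - (layer_index / num_layers) * (1.0 - last_survival)


def residual_block(x, transform, survival_prob, training, rng):
    """Residual block with stochastic depth on a scalar activation."""
    if training:
        if rng.random() < survival_prob:
            return x + transform(x)  # block is active for this mini-batch
        return x                     # block is bypassed by the identity
    # At test time every block is used, scaled by how often it was active.
    return x + survival_prob * transform(x)


rng = random.Random(0)
transform = lambda x: 2.0 * x  # stand-in for the block's learned layers
p = 0.8

test_output = residual_block(1.0, transform, p, training=False, rng=rng)
train_outputs = [
    residual_block(1.0, transform, p, training=True, rng=rng)
    for _ in range(20000)
]
train_mean = sum(train_outputs) / len(train_outputs)
print(test_output, train_mean)  # the two values should be close
```

The test-time scaling makes the deterministic output match the expected training-time output of each block, much like the weight scaling used with dropout.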

### Using Fast Weights to Attend to the Recent Past

This paper augments the standard neural network model, which has only weights and activations, with fast variables that change more slowly than activations but much faster than the standard weights [Ba et al., 2016a], based on the observation that biological synapses have dynamics at many different time-scales. These “fast weights” can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights, the authors claim that networks can avoid the need to store copies of neural activity patterns, as LSTMs and other memory networks do.

### Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences

This thought-provoking paper extends the LSTM unit by adding a new time gate, which is controlled by a parametrized clock with a frequency range that produces updates of the memory cell only during a small percentage of the duty cycle [Neil et al., 2016]. The advantage of such gating is that it can allow LSTMs to work with multi-frequency and asynchronously sampled data. Furthermore, it makes it possible to train with very long roll-outs without experiencing the vanishing/exploding gradients problem.
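The time gate can be sketched as a simple piecewise function of the phase within each clock period. The formula below is our reading of the gate in Neil et al., with parameter names (`period`, `shift`, `r_on`, `leak`) chosen here for readability; treat it as an illustrative assumption rather than the authors' reference implementation.

```python
def time_gate(t, period, shift, r_on, leak=0.001):
    """Openness of the time gate at timestamp t (between 0 and 1).

    The gate opens linearly during the first half of the "on" fraction
    r_on of each period, closes linearly during the second half, and
    is almost closed (small leak) for the rest of the period.
    """
    phase = ((t - shift) % period) / period
    if phase < 0.5 * r_on:
        return 2.0 * phase / r_on
    if phase < r_on:
        return 2.0 - 2.0 * phase / r_on
    return leak * phase


def gated_update(previous_cell, candidate_cell, k):
    """Blend the proposed cell state with the old one using gate openness k."""
    return k * candidate_cell + (1.0 - k) * previous_cell


# With a period of 10 and r_on = 0.2, the gate is open only for 0 <= t < 2.
k_half_open = time_gate(0.5, period=10.0, shift=0.0, r_on=0.2)
k_closed = time_gate(5.0, period=10.0, shift=0.0, r_on=0.2)
print(k_half_open, k_closed)
```

Because `t` can be any real-valued timestamp, irregularly sampled inputs need no resampling: each event simply finds the gate more or less open.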

### Sequential Neural Models with Stochastic Layers

This paper glues together a deterministic recurrent neural network and a Bayesian state space model to form a stochastic and sequential neural generative model, which enables tracking the posterior distribution on the model’s states [Fraccaro et al., 2016]. They then build in another neural network to do structured variational inference on the Bayesian state-space model, the parameters of which they learn (you guessed it) through deep learning. By retaining the nonlinear recursive structure of a recurrent neural network while averaging over the uncertainty of the state space, they improved on competing results on the Blizzard and TIMIT speech modeling data sets by a large margin, while achieving comparable performance on polyphonic music modeling.

## Optimization Papers

This year, in addition to faster and large-scale convex optimization, there was a lot of work on large-scale and distributed non-convex and non-smooth optimization. Below is a very small subset of the many interesting papers presented.

### A Multi-Batch L-BFGS Method for Machine Learning

This paper presents a batch method that uses a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employs second-order information [Berahas et al., 2016]. In order to improve the learning process, they follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

### Proximal Stochastic Methods for Nonsmooth, Nonconvex Finite-Sum Optimization

This paper deals with stochastic algorithms for optimizing nonconvex, non-smooth finite-sum problems, where the non-smooth part is convex [Reddi et al., 2016b]. Unlike the smooth case, it is not known whether proximal SGD with constant minibatch converges to a stationary point. The paper develops stochastic algorithms that provably converge to a stationary point for constant minibatches and converge faster than batch proximal gradient descent. This paper is highly recommended: it brings together concepts from variance reduction for convex optimization with a novel way of handling non-convex and non-smooth functions.
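For readers unfamiliar with proximal methods, the sketch below shows a single proximal gradient step for the common special case where the non-smooth convex part is an L1 penalty, whose proximal operator is soft-thresholding. The L1 choice, the function names, and the numbers are assumptions for illustration; the paper treats a general convex non-smooth term and adds variance reduction on top of this basic step.

```python
def soft_threshold(x, threshold):
    """Proximal operator of threshold * |x|: shrink x toward zero."""
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0


def proximal_step(params, grads, lr, l1_strength):
    """One step: gradient descent on the smooth part, then the prox of the L1 part."""
    return [
        soft_threshold(p - lr * g, lr * l1_strength)
        for p, g in zip(params, grads)
    ]


params = [1.0, -0.05, 0.3]
grads = [0.0, 0.0, 1.0]  # (stochastic) gradients of the smooth part only
updated = proximal_step(params, grads, lr=0.1, l1_strength=1.0)
print(updated)  # approximately [0.9, 0.0, 0.1]
```

The non-smooth term is never differentiated; it enters only through the shrinkage, which is why small coordinates are set exactly to zero.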

### A Simple Practical Accelerated Method for Finite Sums

This paper presents a simple fast incremental gradient (FIG) method for optimizing finite sums (such as empirical risk), building on the recently introduced SAGA method [Defazio, 2016]. The method exhibits accelerated convergence rate on strongly convex smooth problems while having one tuning parameter (a step size) and is much simpler than other acceleration techniques (such as Nesterov’s acceleration). Although they did not analyze the case, they claimed that it exhibited good empirical speedup when applied to non-smooth problems. They did not provide any results for non-convex problems but claimed that it should work in that case, too.

### NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization

This paper presents a primal-dual algorithm with augmented Lagrangian-based splitting for solving large nonconvex problems whose objective consists of a sum of non-convex smooth functions plus a non-smooth regularizer, in a distributed fashion [Hajinezhad et al., 2016]. To the best of our knowledge, nobody had previously tackled the problem of solving non-convex with non-smooth optimization in the distributed setting, and their results seem quite good. Furthermore, they exposed a nice connection between their method and primal-only methods such as IAG, SAG, and SAGA.

### Regularized Nonlinear Acceleration

This is a very enjoyable paper in which they introduce some cool concepts from control theory (minimal polynomial extrapolation) to not only accelerate but also stabilize generic optimization problems [Scieur et al., 2016]. The scheme computes estimates of the optimum via a nonlinear average of the iterates, where the weights are obtained from a simple linear system in an online fashion. It can plug-and-play into any standard optimization algorithm and—if their results are to be believed—it provides much better convergence than most other acceleration schemes. Most promising though is its potential to reduce the instabilities caused by improperly selected learning rates and other hyper-parameters on the convergence of the underlying algorithm.

## Workshops

Below are highlights from the time series and Bayesian deep learning workshops, which were just two of the more than 50 all-day workshops at the conference.

### Time-Series Workshop

Although deep learning was the prevailing theme at the conference, the time-series workshop stood out for its lack of emphasis on deep learning. This discord was also at the center of the first question to the discussion panel consisting of Yan Liu (USC), Andrew Nobel (UNC), Stephen Roberts (Oxford), and Mehryar Mohri (NYU). They seemed to agree that while deep learning is great for classification, other techniques like regressions, curves, and probabilities are better suited for time-series analysis. For example, we know how to encode invariances in time series using Bayesian non-parametrics, which we don’t know how to do with deep networks. Another relevant question to the panel was on strategies to deal with noisy time-series data. Again, they suggested using techniques that are simple and dumb, instead of deep and fancy.

However, the panel seemed unbalanced without any deep learning experts to argue the case for using deep learning. While deep learning doesn’t have any killer applications yet for time-series analysis, a few successful results were still presented at the conference, especially using recurrent neural networks. For example, Shengdong Zhang (Bosch Research) presented deep learning approaches for predicting rare events from multi-variate heterogeneous time-series data [Zhang et al., 2016]. His architecture based on LSTMs worked well on real-world applications, such as hard disk failure prediction, without using the hand-engineered features typically required by other state-of-the-art methods. Another poster from Kaspersky Lab showed how they used an LSTM network to detect faults in industrial multivariate time-series data [Filonov et al., 2016].

Most of the work presented at the workshop used other methods. For instance, Inderjit Dhillon (UT Austin and Voleon) presented an interesting matrix factorization approach using temporal regularization to predict high dimensional time series with a lot of missing and correlated data [Yu et al., 2016]. The temporal regularizer used an autoregressive time-series model to incorporate the structure of temporal dependencies. This approach was two orders of magnitude faster and generated better forecasts than traditional approaches on a Walmart e-commerce dataset. In another interesting work, Muhammad Amjad (MIT) learned classifiers for trading bitcoins using a simple buy-sell-hold strategy based on price deltas and hand-designed features. Although the first difference of prices was stationary and mixing, his classification algorithms outperformed the classical ARIMA models.

### Bayesian Deep Learning Workshop

Although Bayesian and probabilistic methods have been used with neural networks since the 1990s, they are now more heavily influencing the way deep learning is evolving. As Zoubin Ghahramani (Cambridge) noted in his informative history talk, Bayesian neural networks (BNNs) handle both parameter and structure uncertainties effectively, addressing a fundamental limitation of traditional deep learning. Being able to represent uncertainty is particularly useful in applications such as forecasting and decision-making. The disadvantage of BNNs is often their higher computational cost.

Finale Doshi-Velez (Harvard) presented model-based reinforcement learning as a new application for BNNs because they can represent both parameter and environment uncertainties [Depeweg et al., 2016]. The BNNs were trained by minimizing α-divergence with α = 0.5, which according to another nice presentation at the workshop by Jose Miguel Hernandez-Lobato (Cambridge) [Hernandez-Lobato et al., 2016], performs better than variational Bayes (VB) and expectation propagation (EP) in regression and classification problems. He also pointed out that α-divergence minimization can be implemented more efficiently with better convergence guarantees than EP and it allows one to explore the value of α that is best suited for the problem.

The workshop also included two independent papers that arrived at the same idea for back-propagating gradients through stochastic categorical variables in neural networks [Jang et al., 2016, Maddison et al., 2016]. This was previously possible for stochastic continuous variables through the reparameterization trick, which expresses each stochastic variable as a differentiable deterministic function of its parameters and independent random noise. In order to apply the reparameterization trick to categorical latent variables, the authors introduced the Concrete distribution, a.k.a. the Gumbel-Softmax distribution, which is continuous over the simplex but can approximate samples from a categorical distribution. Using this technique, they reported an impressive 2x speedup when training on MNIST without compromising performance compared to previous methods.
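A relaxed categorical sample can be drawn in a few lines, as the sketch below shows: Gumbel noise is added to the log-probabilities and the result is passed through a temperature-controlled softmax. The function names, the probabilities, and the temperature are illustrative assumptions, and a real implementation would run inside an automatic differentiation framework so that gradients flow through the sample.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel noise via the inverse CDF: -log(-log(u))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    return -math.log(-math.log(u))


def gumbel_softmax_sample(probs, temperature, rng):
    """Return a point on the simplex that approximates a one-hot sample."""
    logits = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = [0.1, 0.2, 0.7]
sample = gumbel_softmax_sample(probs, temperature=0.5, rng=rng)
print(sample)  # a soft one-hot vector that sums to 1

# The largest coordinate follows the original categorical distribution.
counts = [0, 0, 0]
for _ in range(5000):
    s = gumbel_softmax_sample(probs, temperature=0.5, rng=rng)
    counts[s.index(max(s))] += 1
frequencies = [c / 5000 for c in counts]
print(frequencies)  # roughly [0.1, 0.2, 0.7]
```

Lower temperatures push samples closer to exact one-hot vectors at the cost of higher-variance gradients, so the temperature is a tuning knob rather than a fixed constant.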

## Conclusions

The conference reaffirmed the current focus on deep learning, which is driving research in a number of related fields and has created active inter-disciplinary areas such as deep reinforcement learning and Bayesian deep learning. Improvements are also being made to core deep learning architectures, algorithms, and techniques, resulting in incremental advances to the state of the art in a number of applications. At the same time, advances in optimization methods, such as those based on variance reduction, are making it possible to apply sophisticated machine learning algorithms to more complex problems and at a larger scale. Progress in these areas is not only making a wider range of applications possible, but it is also generating interest in tackling the grand challenge of Artificial General Intelligence.