My takeaways from NIPS 2015

2015.12.12

My takeaways from NIPS 2015

I’ve just wrapped up my trip to NIPS 2015 in Montreal and thought I’d jot down a few things that struck me this year:

Saddle Points vs Local Minima

I heard this point repeated in a talk almost every day. In low-dimensional spaces (i.e. the ones we can visualize) local minima are the major impediment to optimizers reaching the global minimum. But this doesn’t generalize. In high-dimensional spaces, local minima are almost non-existent. Instead, there are saddle points: points which are a minimum in some directions but a maximum in others. Intuitively, this makes sense: in N dimensions, the odds of the curvatures all going the same way at a point is (1/2)^N. As Yoshua Bengio said, “it’s hard to build an n-dimensional wall.” This gives an intuition for why procedures like gradient descent are effective at optimizing the thousands of weights in a neural net: they won’t get stuck in a local optimum. And it gives an intuition for why momentum is helpful: it helps gradient descent escape from saddle points.
Model Compression

The tutorial on Hardware for Deep Learning was less about new hardware and more about how to make your software get the most out of existing hardware. Due to the high cost of uncached, off-chip memory reads, reducing the memory footprint of your models can be a huge performance win. Bill Dally presented a result on model pruning that I found interesting: by iteratively removing small weights from a model and retraining, they were able to remove 90+% of the weights with zero loss of precision. This parallels an observation from transfer learning, that small networks are most effectively trained using the output of larger networks. It would be nice if we could train these smaller networks directly. See the Deep Compression paper.
The importance of canonical data sets / problems

Over and over, talks and posters referenced the same canonical data sets: the MNIST set of handwritten digits, the CIFAR and ImageNet images, the TIMIT speech corpus and the Atari/Arcade Learning Environment (ALE). These have given researchers in their fields a shared problem on which to experiment, compete, collaborate and measure their progress. If you want to push a field forward, built a good challenge problem.
One-shot Learning

There was much high-level talk about how the human brain is very good at learning to perform new tasks quickly. Contrast this with neural nets, which require thousands or millions of training examples to reach human performance. This comparison is somewhat unfair because adult humans have years of experience interacting with the real world from which to draw on. There seems to be a great deal of interest in getting machines to do a better job of transferring general knowledge to specific tasks.
This conference has gotten huge!

I’ve read that registrations have gone up significantly over the last few years. This was palpable at the conference. Many of the workshop rooms were packed to the gills and I watched most of the larger talks in overflow rooms. This is probably good thing for the ML community. I’m not sure if it’s a good thing for NIPS.

A few smaller bits that struck me:

Highway Networks are cool. They let you train very deep networks, where the depth has to be learned. They found that depth > 20 was not helpful for MNIST, but was for CIFAR.
AlexNet seems to have become a canonical neural net for experimentation. I saw it referenced repeatedly, e.g. on the Deep Compression poster.
As an example of the above, I really enjoyed the Pixels to Voxels talk. Pulpit Agarwal & co showed that the activations of individual layers of AlexNet correlate to activations in regions of the brain. They were able to use this correspondence to learn what some of the mid-level regions of the visual cortex are doing.
We heard several speakers say that “backprop is not biologically plausible.” I assume this is because we don’t consume nearly enough labels for it to be practical at the scale of a human brain?
I asked someone at the NVIDIA booth whether the ML industry is large enough to drive GPU design & sales (as opposed to the game industry). It is. The GPUs designed for ML tend to be more robust than those used in games. They have error correcting codes built-in. In a game, if an arithmetic unit makes a mistake, it’ll be fixed on the next frame. When you’re training a neural net, that mistake can propagate.

I do relatively little machine learning in my day-to-day. NIPS is always a bit over my head, but it’s a good way to rekindle my interest in the field.