Nvidia Introduces cuDNN, a CUDA-based library for Deep Neural Networks

Earlier this month, Nvidia released cuDNN, a library of optimized low-level primitives that speed up the processing of deep neural networks (DNN) on CUDA-compatible GPUs. The company aims to help developers harness the power of graphics processing units for deep learning applications.

The library is described as "very straightforward to use" by its authors, but it is expected to be used more often indirectly, through higher-level toolkits such as Caffe or Torch, with minimal programming effort. Nvidia's official benchmarks show a 10x speed-up for Caffe with cuDNN on a GPU compared with Caffe running on a CPU.

Deep learning is one of the hottest topics in artificial intelligence (AI) at the moment. The term designates a class of machine learning algorithms that learn abstract concepts such as "human face" or "cat" in a more autonomous, generic and accurate way than previous techniques, by combining two neurophysiological concepts: multi-layer artificial neural networks and a hierarchical representation of the perceptual world.

Inspired by the 1962 work of neurophysiologists and Nobel laureates Hubel and Wiesel on the cat's visual cortex, and first proposed as an artificial system in 1980 by Kunihiko Fukushima in his Neocognitron architecture, deep learning systems are now widely used by leading applied-research government agencies and by commercial companies such as Google, Facebook, IBM, Microsoft, Baidu and NEC for handwriting recognition, speech recognition, face detection, video surveillance and similar tasks.

The term "deep" as used in deep learning and deep neural network (DNN) is more recent, however, as Yann LeCun, a leading expert in deep learning and director of the AI Research Lab at Facebook, explained in an interview with KDnuggets:

Deep learning has come to designate any learning method that can train the system with more than 2 or 3 non-linear hidden layers. Around 2003, Geoff Hinton, Yoshua Bengio and myself initiated a kind of "conspiracy" to revive the interest of the machine learning community in the problem of learning representations (as opposed to just learning simple classifiers). It took until 2006-2007 to get some traction, primarily through new results on unsupervised training (or unsupervised pre-training, followed by supervised fine-tuning), with work by Geoff Hinton, Yoshua Bengio, Andrew Ng and myself. But much of the recent practical applications of deep learning use purely supervised learning based on back-propagation, altogether not very different from the neural nets of the late '80s and early '90s.

One of the most popular and successful deep learning architectures uses a particular form of DNN called convolutional networks, or ConvNets. They are particularly good at building internal representations of the world that are robust to irrelevant variations (e.g. background noise for speech recognition, illumination changes for image recognition) while preserving the information essential to describing the perceptual object (e.g. a spoken word, a human face). Because of their remarkable performance, relative simplicity and wide variety of applications, they are also a perfect candidate for hardware implementation.
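As a rough illustration of why ConvNets tolerate irrelevant variations, the sketch below slides one small filter across an image and then keeps only the strongest response. It is a minimal pure-Python example, not cuDNN's API or Caffe's implementation; the function names, the 2x2 pattern and the filter values are all hypothetical and chosen only for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Non-linearity: keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


def global_max_pool(feature_map):
    """Keep only the strongest response, discarding where it occurred."""
    return max(max(row) for row in feature_map)


def place_pattern(size, pattern, top, left):
    """Build a size x size image of zeros with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


# Hypothetical 2x2 diagonal pattern and a hand-picked filter that matches it.
pattern = [[1, 0], [0, 1]]
kernel = [[1, -1], [-1, 1]]

# The same pattern at two different positions in a 6x6 image.
image_a = place_pattern(6, pattern, 0, 0)
image_b = place_pattern(6, pattern, 3, 2)

response_a = global_max_pool(relu(conv2d_valid(image_a, kernel)))
response_b = global_max_pool(relu(conv2d_valid(image_b, kernel)))
print(response_a, response_b)
```

Both images produce the same pooled response even though the pattern has moved, which is the kind of robustness to position described above. In a real ConvNet the filter values are learned rather than hand-picked, and many filters and layers are stacked; the nested multiply-and-sum loops are the part that GPU libraries are designed to accelerate.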

It is in this context that Nvidia introduced cuDNN, drawing mixed reactions from the community.

LeCun welcomed the news with an enthusiastic post on his personal Facebook page:

An increasingly large number of Nvidia GPU cycles are being spent training and running convolutional nets. It is an awesome move on Nvidia's part to be offering direct support for convolutional nets. Some of us have been talking with Nvidia's management and engineering teams about ConvNet and such for some time, and I'm really delighted to see their work come to fruition. [...] Kudos to the CUDAs and happy convolving!

However, Yangqing Jia, Caffe's creator and a research scientist at Google, was more reserved in a Google Plus post:

As someone involved in the discussion along the Caffe path, I should say that cuDNN is probably overstating the speedup on the GPU side a little bit... They used the standard Caffe CPU code, which used multi-core blas with either MKL or OpenBLAS. It seems that multi-core GEMM in these packages do not fully use the CPU power. [...] These benchmarks are some times too tricky to make right... I think using GPUs is definitely faster and more cost-effective, but CPU is not that bad.

More critically, Eugenio Culurciello, founder and lead of TeraDeep, claimed in another post that cuDNN does not offer the best performance:

Cool, but note that FFT based solution in Theano are now the best in performance for large networks, also cuda-convnet2 is better than cuDNN.

Jonathan Tompson, a Ph.D. candidate at NYU, also appeared skeptical of Nvidia's benchmarks in a post on the torch7 forum:

When training my network big ConvNet, where convolution of very large images takes up the vast majority of fprop and brprop time [...] So the new cudnn module came up quite a bit slower for me (I'm using a K40).
