Today we begin our quest for the perfect NVIDIA DGX-1 benchmark and we invite you to join us.
In the next few days we hope to gather comments and suggestions from the Deep Learning community on how to best benchmark NVIDIA’s DGX-1 for training DL architectures. We will soon have this supercomputer at our disposal for testing and would like your opinion on how to best evaluate its performance.
We start by presenting you with the two benchmarks we ran on more commonly found GPUs, here at Addfor. The first benchmark compares the efficiency of various GPUs when training different networks using different frameworks. This is a common approach that allows us to decide on which framework to use for a given GPU and a given architecture. Results from this benchmark can also be used for deciding on which GPU better suits your DL training needs.
The second benchmark, on the other hand, compares the minibatch efficiency of each GPU when training different architectures. As a rule of thumb, a larger minibatch size means more efficient training; however, be prepared for a few surprises. Finally, at the end of this post you will find a brief analysis of the results and an overview of the GPUs and networks used during these benchmarks.
Again, we emphasise that our goal is to get your comments and suggestions on how these benchmarks could be tailored to the NVIDIA DGX-1 architecture. So if there are any aspects of the NVIDIA DGX-1 Supercomputer that you would like to measure, please speak up and contact us either by leaving a comment below or by sending us an email.
GPU vs Framework vs Network
We begin our series of benchmarks by evaluating the efficiency of seven different architectures for both inference (forward pass) and training (forward and backward passes) using four different GPUs. Some very good guides for building personal DL computers and benchmarks comparing different DL frameworks are available online [2,3]; however, we find them either a bit outdated in terms of software, hardware or state-of-the-art architectures, or lacking a detailed analysis of how these three factors combine.
Therefore, in this first benchmark we consider four different GPUs (Tesla K40, Titan-X Maxwell, GTX 1080, and Titan-X Pascal) while training seven different networks (AlexNet, Overfeat, Oxford VGG, GoogLeNet, ResNet-50, ResNet-101 and ResNet-152) using four different DL frameworks (Torch, Caffe, TensorFlow and Neon), all relying on NVIDIA’s cuDNN 5.1 except for Neon. During these experiments we use 64 samples per minibatch and report forward times and forward+backward times averaged over 100 runs. Missing data in a graph indicates that the particular combination of DL framework and GPU ran out of memory.
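To make the measurement procedure concrete, here is a minimal sketch of a timing loop that averages a step over 100 runs. It is not the code we ran: `forward` and `forward_backward` are hypothetical CPU stand-ins for a framework's calls on one 64-sample minibatch, and the warm-up phase is our own assumption rather than something described above.

```python
import time
from statistics import mean


def average_time_ms(step, runs=100, warmup=10):
    """Return the mean wall-clock time of step() in milliseconds."""
    # Warm-up calls (an assumption of this sketch) are excluded so that
    # one-off setup costs do not skew the average.
    for _ in range(warmup):
        step()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


BATCH_SIZE = 64  # samples per minibatch, as in the first benchmark


def forward():
    # Hypothetical stand-in for a forward pass over one minibatch.
    return sum(i * i for i in range(BATCH_SIZE * 100))


def forward_backward():
    # Hypothetical stand-in for a forward pass followed by a backward pass.
    forward()
    return sum(i * 2 for i in range(BATCH_SIZE * 100))


forward_ms = average_time_ms(forward)
forward_backward_ms = average_time_ms(forward_backward)
print(f"forward: {forward_ms:.3f} ms")
print(f"forward+backward: {forward_backward_ms:.3f} ms")
```

When timing real GPU work, the step would also need to wait for the device to finish before the clock is stopped, since GPU kernels are typically launched asynchronously.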
[Figures: Forward and Forward + Backward times for each framework and network, one pair of panels per GPU, including Titan X Maxwell and Titan X Pascal.]
Minibatch Efficiency for TensorFlow
A useful piece of information when training DL architectures is the number of samples per minibatch that leads to the fastest training. In our second benchmark we analyze training efficiency as a function of the minibatch size. We restrict our analysis to TensorFlow 1.0.0, since it was the framework with the fewest occurrences of out-of-memory errors. In this experiment we again estimate the average forward-pass time and forward+backward-pass time over 100 runs.
Regarding the first benchmark, we notice that Neon almost always offered the best results for both Titans and the GTX 1080, while being the worst for the K40. This is because Neon is optimized for the Maxwell and Pascal architectures. The Tesla K40, being a Kepler GPU, lacks such low-level optimizations. Torch consistently gave good results over all architectures, except when used on modern GPUs and deeper models. Again, this is where Neon really shines. Finally, we point out that TensorFlow was the only framework capable of training all networks without running out of memory, which led us to choose it as the framework for our second benchmark.
Regarding our second benchmark, as a rule of thumb larger minibatches result in less processing time per sample and thus less time to train each epoch. However, this is not always true. As we can see from the plots above, the GTX 1080 takes 420.28 ms to perform a forward and backward pass for a 64-sample minibatch when using a VGG network. The same configuration took 899.86 ms for 128 samples, i.e. almost 60 ms more than twice the previous value. Also, we notice that the Tesla K40 has a concave-down curvature for all networks at minibatch size 8 and that the Titan X Pascal shows a concave-up curvature for shallower architectures such as AlexNet and Overfeat at the same batch size. A concave-down curvature indicates that the “efficiency-rate” is decreasing, while a concave-up curvature means the opposite. Interestingly enough, this particular minibatch size is where the effect is most evident. Profiling both GPUs should give us some answers as to why this happens.
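The VGG example on the GTX 1080 can be checked with a few lines of arithmetic. The sketch below uses only the two timings quoted above; the helper name `per_sample_ms` is ours and purely illustrative.

```python
def per_sample_ms(batch_time_ms, batch_size):
    """Time spent per sample for one forward+backward pass."""
    return batch_time_ms / batch_size


# Forward+backward times (ms) for VGG on the GTX 1080, as quoted in the text.
vgg_gtx1080 = {64: 420.28, 128: 899.86}

per_sample = {size: per_sample_ms(t, size) for size, t in vgg_gtx1080.items()}

# If doubling the minibatch were "free", 128 samples would cost exactly
# twice the 64-sample time; the excess is what we pay beyond that.
excess_ms = vgg_gtx1080[128] - 2 * vgg_gtx1080[64]

for size in sorted(per_sample):
    print(f"minibatch {size}: {per_sample[size]:.3f} ms per sample")
print(f"excess over doubling: {excess_ms:.2f} ms")
```

The per-sample cost rises from roughly 6.57 ms at 64 samples to roughly 7.03 ms at 128 samples, so in this configuration the larger minibatch is the less efficient one.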
These are our observations so far and again, we are very interested in your comments and suggestions on how to modify/improve these benchmarks for the DGX-1. Please leave us a comment below or send us an email with your ideas. We will share all our results in this blog.
Here we briefly describe each GPU used during these benchmarks, along with the architectures and the framework versions used.
Tesla K40:
The K40 has 2880 CUDA cores, a base clock of 745MHz and 12GB of GDDR5 RAM achieving 288GB/s of memory bandwidth. This is a server GPU based on the Kepler architecture, with a compute capability of 3.5. The K40 is no longer in production; however, it is still widely available in many data centers, and knowing its performance is critical when considering whether or not to buy new hardware.
Titan X Maxwell:
The Titan X is the flagship consumer-grade GPU of the Maxwell architecture, with a compute capability of 5.2. It has 3072 CUDA cores, a base clock of 1000MHz, and also 12GB of GDDR5 capable of transferring 336.5GB/s. Given its hardware specifications, and given that most DL applications rely only on single-precision floating-point operations, the Titan X Maxwell was considered the most cost-effective alternative to server-based GPUs, with an original price tag of US$1000.00.
GTX 1080:
The GTX 1080 is currently the top gaming GPU produced by NVIDIA, costing less than US$800.00. It offers 2560 CUDA cores, a base clock of 1607MHz and 8GB of GDDR5X, which provides 320GB/s of bandwidth. Its modern Pascal architecture gives it a compute capability of 6.1.
Titan X Pascal:
The Titan X Pascal maintains the Titan tradition of being the best consumer-grade GPU for DL. It has 3584 CUDA cores operating at 1417MHz, and 12GB of GDDR5X, which provides 480GB/s of memory bandwidth. It has the same compute capability as the GTX 1080 and is currently priced at US$1,200.00. Although it is a consumer-grade GPU, this card is sold directly by NVIDIA with a current limit of 2 GPUs per customer.
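The core counts and base clocks listed for the four GPUs can be turned into a rough upper bound on single-precision throughput. The sketch below assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per cycle at the base clock; this is our simplifying assumption, it ignores boost clocks, and it says nothing about the throughput actually achieved in the benchmarks.

```python
# (CUDA cores, base clock in MHz, memory bandwidth in GB/s), as listed above.
GPUS = {
    "Tesla K40": (2880, 745, 288.0),
    "Titan X Maxwell": (3072, 1000, 336.5),
    "GTX 1080": (2560, 1607, 320.0),
    "Titan X Pascal": (3584, 1417, 480.0),
}


def peak_fp32_tflops(cuda_cores, base_clock_mhz):
    """Theoretical peak single-precision throughput in TFLOPS."""
    # Assumption: one fused multiply-add (2 FLOPs) per core per clock cycle.
    return 2 * cuda_cores * base_clock_mhz * 1e6 / 1e12


peaks = {name: peak_fp32_tflops(cores, clock)
         for name, (cores, clock, _bandwidth) in GPUS.items()}

for name, tflops in peaks.items():
    bandwidth = GPUS[name][2]
    print(f"{name}: {tflops:.2f} TFLOPS peak, {bandwidth:.1f} GB/s")
```

Under this assumption the K40 comes out at about 4.3 TFLOPS and the Titan X Pascal at about 10.2 TFLOPS at base clock.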
AlexNet:
In 2012, Alex Krizhevsky won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) using a CNN with five convolutional and three fully-connected layers. The network proved the effectiveness of CNNs on classification problems by achieving a 15.3% top-5 classification error, while the second-best entry obtained 26.2% and the winner of the previous edition achieved a 25% top-5 error. This is considered the first milestone architecture for computer vision using Deep Learning.
Overfeat:
In 2013, the Overfeat network improved on AlexNet’s architecture by lowering the strides in the first layers, yielding a 14.2% top-5 classification error. It also showed that training a convolutional network to simultaneously classify, localize and detect objects in images can improve the accuracy of all these tasks.
VGG Net:
In a comprehensive study, the authors of VGG Net showed the importance of depth for classification accuracy by training and evaluating CNNs with 11 to 19 layers. Their work showed that two consecutive convolutional layers with small 3×3 spatial kernels gave better accuracy than a single 5×5 convolutional layer, and attributed this gain to the extra non-linearity between layers. Moreover, the authors empirically verified that a 19-layer CNN produced similar accuracy to a 16-layer one, exposing the difficulty of training deep CNNs with the techniques available at the time. Finally, VGG Net further reduced the top-5 classification error in the ILSVRC-2014 classification task to 7.3%.
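The trade-off between two stacked 3×3 layers and one 5×5 layer can be illustrated by counting weights. This is a minimal sketch under our own assumptions: stride 1, no dilation, biases ignored, and a hypothetical channel width of 64 for both input and output.

```python
def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * channels_in * channels_out


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of stacked convolutions with stride 1 and no dilation."""
    return num_layers * (kernel_size - 1) + 1


CHANNELS = 64  # hypothetical channel width, chosen only for illustration

two_3x3 = 2 * conv_weights(3, CHANNELS, CHANNELS)
one_5x5 = conv_weights(5, CHANNELS, CHANNELS)

print(f"receptive field of two 3x3 layers: {stacked_receptive_field(3, 2)}")
print(f"weights in two 3x3 layers: {two_3x3}")
print(f"weights in one 5x5 layer:  {one_5x5}")
```

Both options cover the same 5×5 input region, but the stacked version uses 18C² weights instead of 25C² for C channels and adds one more non-linearity in between.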
GoogLeNet:
In 2014, researchers at Google presented GoogLeNet, a 22-layer convolutional network based on a composite module known as Inception. The name of the block derives from the fact that the module itself can be considered a network, since it is a concatenation of parallel and serial convolutional layers. The network achieved a top-5 classification error of 6.67%.
ResNet:
In 2015, a new CNN architecture called Residual Network (ResNet) was proposed, based on the concept of skip connections. The goal of a residual block in ResNet is to learn the difference in representation between consecutive outputs. This approach allowed the authors to achieve a 3.57% top-5 error on the ImageNet test set using an ensemble of residual networks up to 152 layers deep.
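The idea of a skip connection can be sketched in a few lines. This is a toy illustration, not the ResNet implementation: it uses small fully-connected layers on plain Python lists instead of convolutions, omits the activation that normally follows the addition, and all names (`linear`, `relu`, `residual_block`) are ours.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Return F(x) + x, where F(x) = w2 * relu(w1 * x)."""
    residual = linear(w2, relu(linear(w1, x)))
    # Skip connection: the input is added back to the learned residual.
    return [r + xi for r, xi in zip(residual, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights the block reduces to the identity mapping.
print(residual_block(x, zeros, zeros))
print(residual_block(x, identity, identity))
```

With zero weights the block passes its input through unchanged, which shows that the layers only have to learn the difference between the block's input and its output.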
–Caffe: commit 746a77e6d55cf16d9b2d4ccd71e49774604e86f6
–Torch7: commit d03a42834bb1b674495b0c42de1716b66cc388f1
–Nervana Neon: 1.8.1
 ImageNet Classification with Deep Convolutional Neural Networks
 Going Deeper with Convolutions
 Deep Residual Learning for Image Recognition