ROI Calculator Explained
“Few months (~3) payback costs wrt AWS for these kinds of configs”, Andrej Karpathy (Director of AI, Tesla)
The DeepLearning11 (DL11) server has 10 Nvidia GTX 1080 Ti GPUs, typically 256GB of RAM, and 2 Intel Xeon CPUs, along with both a 10GbE Ethernet adapter and an Infiniband network adapter (56-112 Gb/s). The closest comparable cloud GPU server is an 8-GPU Nvidia P100 machine, such as the one used in Google's TensorFlow benchmarks (note that AWS's p2.8xlarge provides 8 of the older K80 GPUs). The P100 is somewhat faster per GPU (732 GB/s vs 484 GB/s memory bandwidth), but DL11 has 10 GPUs compared with the P100 server's 8. The P100 server also has NVLink (80 GB/s) vs 16 GB/s PCIe in the DL11 server.
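Karpathy's "~3 months payback" claim is easy to sanity-check with simple arithmetic. The sketch below is illustrative only: the server price and cloud hourly rate are assumed example figures, not numbers from this article.

```python
# Hypothetical payback-period sketch. The $15,000 server cost and
# $7.20/hour cloud rate are illustrative assumptions, not quoted prices.
def payback_months(server_cost_usd: float,
                   cloud_rate_usd_per_hour: float,
                   utilization: float = 1.0,
                   hours_per_month: float = 730.0) -> float:
    """Months until buying a server beats renting equivalent cloud GPUs."""
    monthly_cloud_cost = cloud_rate_usd_per_hour * hours_per_month * utilization
    return server_cost_usd / monthly_cloud_cost

# A ~$15,000 server vs an assumed $7.20/hour cloud instance,
# kept busy around the clock:
print(round(payback_months(15_000, 7.20), 1))  # → 2.9
```

At full utilization the assumed numbers land at roughly three months, consistent with the quote; lower utilization stretches the payback proportionally.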
There are very few published comparisons of 1080 Ti and P100/V100 GPUs for training neural networks. The best we could find pairs a Resnet-101 benchmark with Resnet-50/152 benchmarks. The Resnet-101 benchmark is by us, Logical Clocks AB, published in an O'Reilly blog; the Resnet-50/152 benchmarks are by Google. From the results below, we can see that the Resnet-101 convolutional neural network on DL11 can process roughly 1000 images/second. The GPU server with 8 P100 GPUs has results for Resnet-50 (1734 images/sec) and Resnet-152 (716 images/sec). Resnet-101 should lie somewhere between the two, so we estimate the P100 server's throughput at about 15-30% higher than the DL11's. This is roughly what we would expect when memory bandwidth is the bottleneck: the P100 server's aggregate bandwidth is 8 × 732 = 5856 GB/s against DL11's 10 × 484 = 4840 GB/s, about 21% more.
[Image from https://www.tensorflow.org/performance/benchmarks, Retrieved March 10, 2018]
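The bandwidth argument above reduces to one ratio, which the snippet below computes. The per-GPU bandwidth figures are vendor spec values; treating aggregate memory bandwidth as a throughput proxy is the assumption the comparison itself rests on.

```python
# Back-of-the-envelope check: when memory bandwidth is the bottleneck,
# the expected throughput gap tracks the aggregate-bandwidth ratio.
dl11_bw = 10 * 484   # 10 x GTX 1080 Ti at 484 GB/s each -> 4840 GB/s
p100_bw = 8 * 732    # 8 x P100 (SXM2) at 732 GB/s each  -> 5856 GB/s

print(round(p100_bw / dl11_bw, 2))  # → 1.21
```

A ~21% aggregate-bandwidth advantage is consistent with the 15-30% throughput estimate interpolated from the Resnet-50/152 results.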
Note that these comparisons assume single-precision (FP32) training. Mixed-precision training is still very problematic.
Primary Use Case for DL11
If you have a base GPU load, running it on your own hardware makes a lot of economic sense, while variable GPU loads can be cost-effectively offloaded to the cloud, as shown below:
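The base-load vs burst split can be sketched as a simple cost model: owned hardware covers steady demand at a fixed monthly cost, and only the overflow is rented. All rates, capacities, and demand figures below are illustrative assumptions, not figures from the article.

```python
# Hypothetical hybrid-cost sketch: steady base load on owned hardware,
# overflow (burst) demand rented from the cloud. All numbers are
# illustrative assumptions.
def monthly_cost(demand_gpu_hours: float,
                 base_capacity_gpu_hours: float,
                 owned_cost_per_month: float,
                 cloud_rate_per_gpu_hour: float) -> float:
    """Owned-hardware cost plus cloud cost for demand above base capacity."""
    burst = max(0.0, demand_gpu_hours - base_capacity_gpu_hours)
    return owned_cost_per_month + burst * cloud_rate_per_gpu_hour

# 10 owned GPUs give ~7300 GPU-hours/month; a demand of 8000 GPU-hours
# sends 700 hours of burst to the cloud at an assumed $0.90/GPU-hour.
print(monthly_cost(8_000, 7_300, 500.0, 0.90))  # → 1130.0
```

The point of the model is the asymmetry: the owned-hardware cost is flat regardless of utilization, so keeping it saturated with base load and paying cloud rates only for spikes minimizes total spend.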