Charged up on Volta
Over a period of four years and four different hardware generations, ECHELON has quadrupled its speed. Moore's law performance scaling on NVIDIA hardware continues with Volta.
It was only a year ago, last November, that I published SRT’s first results running ECHELON on the NVIDIA P100. I presented performance numbers for three different reservoir models running on three different generations of NVIDIA GPUs: K40, Titan X and P100. The numbers showed that performance scaled beautifully with the memory bandwidth of the hardware. For example, the P100 boasts 2.5x the memory bandwidth of the K40, and on average ECHELON ran the three models 2.54x faster on the P100 than on the K40.
V100 benchmarking results
The newest NVIDIA GPU, the V100, was released a few months ago, and we were able to get access to it through NVIDIA’s internal cluster. We used the same three benchmark models to collect the V100 data: Model A (4.4M cells), Model B (9M cells) and Model C (2M cells). Table 1 lists all the chips used along with their corresponding technical specifications. The K40 and Titan X numbers were collected on internal SRT servers: standard Intel Xeon based nodes with the GPU boards installed via PCIe. The P100 results were collected on a cluster of IBM OpenPower nodes, each with two Power8 processors (10 cores each) and four NVIDIA Tesla P100 boards. The V100s were accessed at NVIDIA. By the numbers, the V100 offers 25% more memory bandwidth and nearly 50% more single and double precision FLOPS than its predecessor, the P100.
Table 1 NVIDIA hardware used in this benchmarking study, with memory bandwidth and FLOPS for each chip.
Figure 1 below shows the results of running each of the three models on all four hardware platforms. The performance results are normalized to the K40 to make it easier to compare between models, so K40 performance is 1.0 by definition. On average ECHELON ran 3.9x faster on V100 than on the K40.
Figure 2 plots ECHELON’s speed vs memory bandwidth, with speed again normalized to the K40. The data points for the V100 sit above the ideal linear scaling line: while the V100 has 25% more memory bandwidth than the P100, it runs the three models 47% to 60% faster than the P100. We attribute the super-linear scaling to efficiency improvements that NVIDIA has introduced in the Volta memory controller.
Most of ECHELON is memory bound, as Figure 2 clearly shows. One exception is the compositional flash calculation, which is bound by floating point performance. Figure 3 shows the relative performance of compositional flash on four generations of GPUs and the Haswell CPU vs hardware FLOP capacity, with Haswell performance normalized to 1.0. As with the memory bound results, performance follows the linear trend line very closely.
Over a period of four years and four different hardware generations, ECHELON has quadrupled its speed. Moore's law performance scaling on NVIDIA hardware continues with Volta.
Figure 2 ECHELON speed vs hardware memory bandwidth for three different models. K40 performance is normalized to 1.0x.
Figure 3 ECHELON’s flash calculation speed vs hardware FLOPS for four different GPUs.
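As a rough check on the bandwidth scaling argument, the short Python sketch below divides each average speedup quoted in this article by the speedup that ideal linear scaling with memory bandwidth would predict. It uses only figures stated in the text. The V100 bandwidth of 3.125x the K40 is an assumption derived by combining "2.5x the K40" for the P100 with "25% more" for the V100, not a value read from Table 1, and the names `scaling_ratio`, `observed_speedup` and `relative_bandwidth` are purely illustrative.

```python
# Figures quoted in the article, all expressed relative to the K40 (= 1.0).
K40_BANDWIDTH = 1.0
P100_BANDWIDTH = 2.5 * K40_BANDWIDTH    # "2.5x the memory bandwidth of the K40"
V100_BANDWIDTH = 1.25 * P100_BANDWIDTH  # assumption: "25% more" than the P100

# Average speedup over the three benchmark models, relative to the K40.
observed_speedup = {"P100": 2.54, "V100": 3.9}
relative_bandwidth = {"P100": P100_BANDWIDTH, "V100": V100_BANDWIDTH}


def scaling_ratio(speedup: float, bandwidth: float) -> float:
    """Observed speedup divided by the ideal linear-scaling prediction.

    Under ideal scaling a memory-bound code speeds up by exactly the
    bandwidth ratio, so a result above 1.0 means super-linear scaling.
    """
    return speedup / bandwidth


for gpu in ("P100", "V100"):
    ratio = scaling_ratio(observed_speedup[gpu], relative_bandwidth[gpu])
    print(f"{gpu}: bandwidth {relative_bandwidth[gpu]:.3f}x, "
          f"speedup {observed_speedup[gpu]:.2f}x, observed/ideal {ratio:.2f}")
```

Under these assumptions the P100 lands almost exactly on the ideal line (a ratio of about 1.02), while the V100 comes out at about 1.25, which is the super-linear gap visible in Figure 2.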