The ECHELON Advantage
The use of GPUs for numerical simulation continues to gain momentum in academic research, the national labs, and industry. In our own domain, reservoir simulation, one of our competitors announced a hybrid (CPU/GPU) version of their product last year, and rumors abound about similar efforts in the development of other commercial and in-house reservoir simulators. We at SRT welcome these developments as an endorsement of our strategic vision for high-performance technical computing and of the foundational development work that we began a decade ago. The exceptional performance results published by SRT and its customers have demonstrated to an initially skeptical industry the capabilities that GPUs offer and have served as a wake-up call to investigate alternative computational hardware strategies.
While we recognize the achievements and ongoing development efforts elsewhere, we think it is appropriate and fair to highlight some important differences between our approach and that of the more recent entrants into the world of GPU computing. The principal difference is simply stated: from the very inception of ECHELON, SRT’s goal was to create the fastest simulator in the world, using a holistic code base that is executed in its entirety on NVIDIA GPUs. Our numerical algorithms exploit both the overt and the nuanced advantages of the hardware in an assiduous effort to maximize performance and minimize memory footprint. Our passion for speed led us to avoid off-the-shelf solvers, high-level frameworks, abstractions, and automatic compilers, which are preferred for quick results. Instead, we have taken a ground-up approach, meticulously evaluating and innovating at each step, choosing the most numerically optimized algorithms, crafting every computational kernel, and requiring it all to yield results that match the standard industry simulators. By implementing directly in NVIDIA’s ‘down-to-the-metal’ CUDA language, we exercise complete control over the hardware’s performance. We did all this from a fresh slate, unencumbered by a legacy code base optimized for multi-core CPUs.
Hybrid solutions, such as those pursued by some of our competitors, begin from a mature CPU code base and necessarily split execution between the CPU and GPU as numerically intense code sections are offloaded to the GPU. In our opinion, such strategies suffer from three important deficiencies. First, data transfer between CPU and GPU creates bottlenecks and adds unnecessary runtime, e.g., when Jacobian construction is executed on the CPU while the solver is resident on the GPU. Second, Amdahl’s law fundamentally limits overall application performance because of the unaccelerated code remaining for execution on the CPU. In reservoir simulation, the first target for acceleration is usually the linear solver, and accelerating it alone will typically yield perhaps a 2x gain in performance. To make a significant difference, over 90% of the computational burden of the simulation must be transferred to the GPU; in a typical black-oil run of ECHELON, this requires porting the top 50 computational kernels. An additional 100 or so kernels beyond that are needed to cover 99% of the runtime. Finally, hybrid approaches do not benefit from the reduced hardware footprint of pure GPU solutions, since the full CPU hardware machinery must be retained for the efficient execution of those parts of the code remaining on the CPU.
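The Amdahl's law argument above can be sketched numerically. The fractions and per-kernel speedups below are illustrative assumptions chosen to match the figures quoted in the text (a roughly 2x overall gain from accelerating the solver alone), not measured ECHELON data:

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when a fraction of runtime is accelerated by a factor.

    Amdahl's law: S = 1 / ((1 - p) + p / s), where p is the fraction of the
    original runtime that is accelerated and s is its speedup.
    """
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

# Assumption: the linear solver is ~60% of runtime and the GPU runs it 10x
# faster. Overall gain stays near 2x, because 40% of the work is untouched.
print(round(amdahl_speedup(0.6, 10.0), 2))   # ~2.17

# Moving 90% of the computational burden to the GPU at the same 10x:
print(round(amdahl_speedup(0.9, 10.0), 2))   # ~5.26
```

The exercise shows why the unaccelerated remainder, not the quality of the ported kernels, dominates hybrid performance: even an infinitely fast GPU section leaves the overall speedup capped at 1/(1-p).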
Porting a mature, optimized legacy CPU code to the GPU, while at the same time maintaining or even expanding the existing code base, is an extremely resource-intensive undertaking with many non-trivial challenges. First, users will expect to obtain the same results whether using the CPU or the hybrid version of the code. This is difficult to achieve, especially with complex well and field controls that trigger discrete events (e.g. wells closing or opening) that are sensitive to small differences in numerical precision. Second, hybrid codes impose a requirement to maintain two branches, one for CPU and one for GPU. While certainly not impossible, this makes for more challenging code maintenance, with bug fixes and feature additions needed in both branches. Finally, and perhaps most damaging, is the strong incentive to port mature CPU algorithms to the GPU “as is”, ignoring the vastly different eccentricities of the hardware and forgoing the opportunity for optimization. One is compelled in this direction primarily to produce quick results, but also to retain the same algorithmic approach on both platforms and thereby minimize the differences between the CPU and GPU versions.
There is a significant difference between setting course to create the fastest simulator in the world on the GPU, as we did at SRT, and grafting GPU code as an afterthought onto an already mature and well-optimized CPU code. We believe that difference will become evident in overall application performance, scaling, and memory footprint in the coming years.