Credit: Google News
History and economics – as if you could separate the two – are bursting with examples of products being developed for one task and then being used, perhaps after some tweaking, for an entirely new and usually unexpected task. History is also full of stories of technologies aimed squarely at a task that, for one reason or another, missed the mark even though it looked like they were right on target.
Product substitution as a means of lowering costs and thereby making a technology more prevalent is one of the primary reasons that economies exist. Some people make money in the transformation, and others lose out, but the overall economy improves from the efficiency engendered in that change. So it is a net good, and if done right, there is some money left over to invest in something else entirely.
Every once in a while, you get a product substitution working from two different angles, and you can get a whole bunch of different things converging on a technology. There is no question that there is now a harmonic convergence between the worlds of HPC and AI, which have been drifting closer and closer to each other in the past decade. Each field was pursuing its own needs, adapting the GPU for very different kinds of massively parallel processing, and now they have arrived at a happy place where what is good for HPC simulation and modeling is also good for machine learning training.
The adoption of Nvidia GPU accelerators for HPC simulation and modeling workloads – performed by academics and further tweaked by major HPC centers and their vendor partners – got the ball rolling back in 2006. The initial success in speeding up parallel applications by offloading some routines from CPUs (with fast threads, low memory bandwidth, and relatively low concurrency) to GPUs (with slow threads, high memory bandwidth and relatively high concurrency) compelled Nvidia to create the CUDA programming environment, which has become very rich over the years and which has made it relatively easy for programmers to parallelize and offload their code.
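The canonical example of such an offloadable routine is SAXPY (y = a·x + y), where every element of the result can be computed independently. The sketch below is a plain Python stand-in, not actual CUDA, meant only to show the property – no loop-carried dependencies – that lets a GPU assign one of its many slow threads to each element:

```python
# A pure-Python stand-in for SAXPY (y = a*x + y), the kind of data-parallel
# routine that offloads well to a GPU. Every iteration is independent of
# every other, which is exactly the property CUDA exploits when it maps
# one GPU thread to each element of the arrays.

def saxpy(a, x, y):
    # No loop-carried dependency: element i never reads element i-1,
    # so all elements can be computed at once by thousands of threads.
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(saxpy(2.0, x, y))  # [12.0, 24.0, 36.0]
```

In a CUDA kernel, the loop body would become the work of a single thread, indexed by its thread and block IDs; the host code's job is mostly moving the arrays to and from the GPU's frame buffer memory.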
At this point, if organizations are going to accelerate HPC simulation or modeling workloads, then they are going to use a GPU, and generally that means a Tesla accelerator from Nvidia. Other possible accelerators – all of them funky kinds of vector engines – remain niche products. But IBM's Cell certainly demonstrated the viability of the concept with the petaflops-busting "Roadrunner" supercomputer at Los Alamos National Laboratory, announced in 2006 at about the same time that researchers at Stanford University and GPU chip maker Nvidia were first porting and offloading the parallel portions of C++ and Fortran code to the shader units on GPUs, which could do single precision floating point math.
This fortunate innovation was driven by the need to pack more flops into fewer watts and to get more performance per dollar, but the drive to make better GPUs was really about rendering high resolution games; HPC was a secondary consideration far out on the horizon. HPC was doing a product substitution for massive and expensive CPUs.
Once good GPUs with enterprise-class error scrubbing on their frame buffer memories were available, everybody started to look around to see if these GPU motors might be good for something else. This is where machine learning, a branch of artificial intelligence with great but unfulfilled promise, picked up the GPU ball with the CUDA logo on it and ran. The neural networks behind machine learning did not really work all that well until enough threads, enough memory bandwidth, and a decent programming environment came together. The hyperscalers could ride on the HPC coattails and then create something different once they fully understood the correlation between big data and machine learning.
As Christopher Nguyen, a former Googler, pointed out to us four years ago, big data is precisely as much data as it takes for machine learning training to work. The GPU is, at least thus far, the engine of choice for creating neural networks because it has the right mix of threads and bandwidth – metrics that keep going up with each Moore's Law jump – to handle more data and ever deeper neural networks. This is why GPU-accelerated systems are, with a few exceptions, the default platform on which machine learning training runs today. If some other device comes along that can do it better, you can bet that the hyperscalers will port their machine learning frameworks to it in a heartbeat, and they have the technical chops to do it fast.
The GPU does not have the same hegemony with HPC applications as it enjoys in machine learning training. Not every HPC application has been accelerated by GPUs, but every year more of them are, so the use of GPU compute for HPC is on the rise for capability-class systems, where scaling a handful of applications across hundreds of teraflops to tens of petaflops and now up to exaflops of compute is paramount. There are still plenty of users with applications that only scale across dozens or hundreds of cores and that are not as pressed for time to answer, and for them, a CPU-based segment of a capacity-class supercomputer, which might be juggling the jobs of hundreds or thousands of users at the same time, is more than enough iron. We would argue that if all HPC workloads were accelerated by GPUs, then you could push more variables through the simulations and do more simulations per hour, day, week, month, and year to create better ensembles that as a group have a chance to create better overall simulations – meaning ones that better reflect reality, which is the whole point, after all.
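The arithmetic behind the ensemble argument is simple: the error of an ensemble mean shrinks roughly as one over the square root of the number of members, so more simulations per hour means tighter estimates. A toy sketch of that effect (the "simulation" here is purely an illustrative assumption – a true value plus random model error, not any real HPC code):

```python
# Toy illustration of why ensemble size matters: each "member" is one
# perturbed simulation run, modeled here as the true value plus Gaussian
# model error. The ensemble mean's expected error shrinks as 1/sqrt(N),
# so a machine that can run more simulations per hour yields better
# ensembles.
import random
import statistics

def run_ensemble(n_members, true_value=5.0, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    members = [true_value + rng.gauss(0.0, 1.0) for _ in range(n_members)]
    return statistics.mean(members)

# A larger ensemble generally lands much closer to the true value of 5.0.
print(abs(run_ensemble(10) - 5.0))      # error of a 10-member ensemble
print(abs(run_ensemble(10000) - 5.0))   # error of a 10,000-member ensemble
```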
This harmonic convergence between HPC and AI has been a very good thing for these two aspects of the upper echelon of computing that are driving architectures today. It absolutely was not planned, and it was serendipitous that the hardware converged in such a way that both workloads can be supported and run on similar hardware. It has been fortunate for Nvidia, which had aspirations for the use of GPU compute in the HPC arena and then saw a much larger opportunity in AI explode when machine learning training started to yield models that could, on some tasks, infer as well as or better than humans, starting in 2012 with images and moving into video, text, audio, and other kinds of data since then.
But our question – and for right now, it is just a question, which we will be working on in the coming months to answer – is: Can this harmonic convergence of hardware for HPC and AI last?
There is an old adage in journalism that for any headline that asks a question, the answer is always, "No." We don't hold to this philosophy, of course, and hence we sometimes – very rarely, really – ask questions in our headlines. What can honestly be said about questioning headlines is that the answer depends on when in the timeline of a phenomenon you ask the question – ask too early and you can't really know, and ask too late and the answer is definitive. We think that if you ask the question at the right time, you are at an inflection point where the phenomenon could go either way and you are basically calling the coin toss, or maybe even bumping the arm of the ref a little.
Now is the time to ask this question. So we did. And frankly, we think it could go either way.
According to one school of thought, HPC centers will be embedding machine learning techniques in their applications, perhaps using simulated data as well as real data (where possible) to feed neural networks that can then make better predictions. Google has proven with its research and its business that more data trumps a better algorithm, something a hyperscaler with the data from billions of people can test. We would amend this even further, given the advances in AI and the changes coming to HPC applications: more data at a lower resolution can trump less data at a higher resolution to get a better result. Machine learning training uses a mix of floating point and integer data types at ever-decreasing resolutions to try to get better answers, and there are many who are thinking the unthinkable: that HPC applications can back off from the double precision or single precision formats they have traditionally used. It remains to be seen. But if this happens, then just perhaps a common hardware base for HPC and AI can be maintained.
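The precision tradeoff is easy to demonstrate. Below is a minimal sketch, emulating single precision with 32-bit round-trips from the Python standard library, showing how a long accumulation drifts in reduced precision while the double precision version stays essentially true – which is exactly why backing off from FP64 is not a free lunch for HPC codes that accumulate over long runs:

```python
# Emulate 32-bit floats by round-tripping through struct's 'f' format,
# then sum 0.1 one million times in both precisions. The float64
# accumulator stays very close to 100,000; the emulated float32
# accumulator drifts because rounding error piles up as the running
# sum grows large relative to the addend.
import struct

def to_f32(x):
    # Round a Python float (64-bit) to the nearest 32-bit float.
    return struct.unpack('f', struct.pack('f', x))[0]

n = 1_000_000
acc64 = 0.0
acc32 = 0.0
for _ in range(n):
    acc64 += 0.1
    acc32 = to_f32(acc32 + to_f32(0.1))

print(acc64)  # very close to 100000
print(acc32)  # noticeably further off
```

Techniques like compensated (Kahan) summation or mixed-precision accumulation can claw back much of this loss, which is part of why some think lower-resolution HPC is plausible at all.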
This, we think, would be a good thing for the computer industry in general and for HPC and AI in particular. The common hardware base – compute, memory, networking, and storage – that can be deployed for modern HPC and AI systems means that the research and development costs can be spread over one product rather than two, and the manufacturing costs of devices can be lowered because of the higher volumes, which in turn should mean significantly lower prices for units of compute, storage, and networking in these converged HPC/AI systems. We said lower costs, not inexpensive systems. At an estimated $205 million for the 200 petaflops “Summit” CPU-GPU hybrid at Oak Ridge National Laboratory, and at an expected cost of around $600 million for its “Frontier” successor in the 2021 timeframe (that’s five times the performance at three times the cost, roughly speaking), no one would call such systems inexpensive. But the beautiful thing is this: Anyone who wants to do HPC and AI, either separately or together, can do so on the same “Newell” Power AC922 system from IBM that Oak Ridge chose. IBM, its partners Nvidia and Mellanox Technologies, and the US Department of Energy labs did the development work, and now the world can benefit.
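The back-of-the-envelope arithmetic on those figures works out as stated, assuming Frontier lands at roughly an exaflops – which is what five times Summit's 200 petaflops implies:

```python
# Back-of-the-envelope check on the Summit/Frontier figures cited above.
# Frontier's 1,000 PF is an assumption derived from "five times the
# performance" of Summit's 200 PF.
summit_cost, summit_pf = 205e6, 200        # ~$205M, 200 petaflops
frontier_cost, frontier_pf = 600e6, 1000   # ~$600M, ~1,000 petaflops

print(round(frontier_cost / summit_cost, 2))  # ~2.93, roughly 3x the cost
print(round(frontier_pf / summit_pf, 1))      # 5.0, five times the performance

# Cost per unit of compute still falls, even as the sticker price rises:
print(round(summit_cost / summit_pf))      # dollars per petaflops, Summit
print(round(frontier_cost / frontier_pf))  # dollars per petaflops, Frontier
```

So the absolute price goes up while the price per petaflops drops by roughly 40 percent – lower costs, not inexpensive systems, just as the text says.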
This has not always been the case with HPC iron, which has often been exotic in terms of architecture and purpose-built for very narrowly defined workloads. The pity has been that the same machines that ran HPC workloads in days gone by did not become the platforms that financial services firms used to run risk analysis or do trading, although for a while there certainly was some commonality between high end NUMA machines for running database applications and clusters of the same iron for running parallel HPC workloads. But generic X86 clusters running Message Passing Interface (MPI) applications on a Linux operating system quickly became the norm, with differentiation largely in how loosely or tightly the compute elements were coupled together. (This is a very broad generalization.)
But today, the field is wide open. There are growing options for compute, networking, and storage, and HPC workloads could be run on any number of different hybrid architectures that mix CPUs and various kinds of accelerators, including but not limited to GPUs and FPGAs as well as specialized vector units either embedded in the CPUs or riding on the PCI-Express or other peripheral buses. As for machine learning, there are efforts underway to create specialized ASICs that can do both training and inference, and the potential for these chips is to unite the two workloads of modern AI. But in doing so, they may split machine learning training from HPC simulation and modeling.
This may not be a good thing for the industry, which has benefitted from having a single substrate evolve over the past decade. The Intel Xeon processor running Linux made building HPC clusters easy, and certainly less costly than the RISC/Unix and vector alternatives that preceded it at the heart of HPC. The addition of GPU accelerators and the CUDA environment changed things, but this is the new normal, where cramming the most flops into the least space and scaling it out is what matters most.
The future depends in large measure on the economics of delivering alternative chips and how they might be integrated into a hybrid compute complex. If CPUs of many kinds compete for the dollars, that's good, and it is better still if they are all given plenty of bandwidth and have open protocols, like CCIX, OpenCAPI, and Gen-Z, to link to accelerators and various kinds of memory class storage (or storage class memory). In that future world, what we think of as a server will be busted into pieces, and system architects will be able to mix and match capabilities and capacities on nearly a whim, across nodes, across racks, and across the datacenter. Dynamic composability is therefore key. All of these compute components might be more expensive at relatively lower volumes than would be the case with more homogeneity – as has been the case with the Xeon processor in the past decade. But then again, we can see the effect of having most of our compute in one Intel basket: Intel has been gradually increasing the cost of compute in recent years. It might be less risky to create highly tuned clusters for specific workloads, using the best parts for portions of the workflow, than to try to do everything on one device from one vendor; the resulting price/performance could be better, too, across this hodgepodge.
The point is, the economics will determine the future as much as the feeds and speeds of any particular device, and organizations will be counting risk as one of the costs they need to calculate. They cannot afford to build systems that cannot do HPC and AI workloads. If bandwidth comes faster and cheaper, then having two distinct systems linked to share data becomes a possibility and composability becomes a relative snap; if bandwidth can’t keep up with compute or memory, then a tightly coupled hybrid system that scales using MPI will probably be the way people will have to go.
All we know for sure is that no one can afford to get this wrong. Not at these prices.