So you’re planning to launch an AI project or startup, or maybe adding an AI-based team to an existing organization. Better late than never! If you want to run a machine learning, deep learning, computer vision, or other AI-driven research project, you can’t just buy an off-the-rack computer from an office superstore; you need hardware that can handle your workload. This leaves you with an important decision: build, buy, or rent.
In this context, “renting” would generally refer to using cloud compute resources, which tend to be more expensive in the long run, but may be a good choice in some cases (great for startups or when you’re not planning on scaling in a big way). This article, however, is concerned with balancing hardware and computational requirements and is based on the assumption that you will be spec’ing custom AI hardware or building an AI computer yourself.
It may not be obvious at first, but the best AI hardware depends on the type of operations you plan on running, which, in turn, depends on the size and type of dataset you will primarily be working with. It’s easy to over-engineer your system and end up paying for more capability in one area than you really need. With a little diligent consideration at the outset, though, you can avoid unnecessary costs and find a solution that’s optimized for both your needs and budget.
We’ll consider the main points of determining specifications for a deep learning system, including CPU for general compute, GPU (and GPU compute) for those neural network primitives, and system memory for handling large datasets.
To make things more concrete, we’ll compare two hypothetical case studies: two startups with very similar prime directives, developing new drugs for neglected diseases, but very different technical approaches to getting there.
Our first hypothetical startup wants to find new therapeutics by analyzing vast datasets of biomedical images from a complementary high-throughput screening program, while the second is more interested in a computational chemistry and virtual screening strategy.
Both startups want to make the world a better place by treating pathologies that have been ignored or have stymied traditional drug development, but plan to go about it in very different ways. Their respective strategies will lead them to weigh the importance of different components differently, and we’ll focus on those in this article.
When building an AI machine from absolute scratch, it’s also important to choose an efficient power supply unit (80 Plus Gold or Platinum rated) that meets the combined thermal design power (TDP) of all components plus a ~25% margin, an SSD for storage (plus a large hard drive if desired), and a motherboard and case that fit everything.
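The PSU sizing rule above is simple enough to sketch in a few lines. The wattages below are illustrative placeholders, not recommendations for any particular part:

```python
# Rough PSU sizing sketch: sum the component TDPs, then add ~25% headroom.
# All numbers here are hypothetical examples, not real product specs.
component_tdp_watts = {
    "cpu": 280,                            # e.g. a high-end desktop CPU
    "gpu_1": 320,                          # e.g. a high-end consumer GPU
    "gpu_2": 320,
    "motherboard_ram_storage_fans": 100,   # rough catch-all estimate
}

total_tdp = sum(component_tdp_watts.values())
recommended_psu = total_tdp * 1.25  # ~25% margin over combined TDP

print(f"Combined TDP: {total_tdp} W")
print(f"Suggested PSU rating: at least {recommended_psu:.0f} W")
```

For the example numbers above, the combined TDP is 1020 W, suggesting a PSU of roughly 1275 W or more.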
Go easy on the RGB LEDs…
Overview of two different strategies for bringing deep learning to bear for novel drug development.
There will be similarities between the two approaches. For example, both may want to install systems at different scales for different aspects of the research and development process: individuals and teams may want access to powerful workstations for experimentation, while training production models on big datasets might be relegated to a dedicated server or on-site cluster.
In some cases inference needs may even be met on cloud resources, once models are finalized and inference workloads are fully predictable (to avoid surprise charges).
What CPUs are good to use in AI?
It may surprise some readers that this article starts out by considering CPUs first. Isn’t deep learning all about the GPU? After all, the “ImageNet Moment” of deep learning was based on the demonstration by Alex Krizhevsky et al. that training a convolutional neural network like AlexNet could be much more efficient on GPUs than plain old CPUs.
But a good CPU can make a big difference, especially in training situations that benefit from good multi-threading, like running multiple physics simulations for reinforcement learning with PyBullet, or parallelizing molecular docking simulations with open source tools like Smina.
Some molecular docking simulation programs do have GPU support (we’re looking at you, Autodock-GPU), but it’s much more common that they take advantage of CPU multi-threading.
The speed and efficiency of docking simulations depends in part on what’s running under the hood. Many of these programs use some sort of evolutionary algorithm, which tend to be embarrassingly parallel (i.e. they have very few serial bottlenecks that limit the utility of running multiple operations in parallel).
Autodock-GPU, for example, uses a Lamarckian genetic algorithm (pdf), while Smina uses the BFGS algorithm, and can readily parallelize optimization when exhaustiveness is set sufficiently high. Perhaps surprisingly, Smina’s multithreading can actually be faster than GPU acceleration with Autodock-GPU in some cases.
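Because each docking run is independent, spreading runs across CPU cores is straightforward. The sketch below uses Python's standard `concurrent.futures`; the `dock_ligand` function and filenames are hypothetical stand-ins for what would, in practice, shell out to a real docking tool via `subprocess`:

```python
from concurrent.futures import ProcessPoolExecutor
import os

def dock_ligand(ligand_path):
    # Placeholder for one independent docking run. A real version might
    # invoke a docking program (e.g. via subprocess) and parse its score.
    # Each run is independent, so the workload is embarrassingly parallel.
    score = sum(map(ord, ligand_path)) % 100  # deterministic stand-in "score"
    return (ligand_path, score)

# Hypothetical ligand files to screen.
ligands = [f"ligand_{i}.pdbqt" for i in range(8)]

if __name__ == "__main__":
    # One worker per core keeps all CPU threads busy.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = list(pool.map(dock_ligand, ligands))
    print(results)
```

This is the same pattern (many independent optimizations running at once) that makes high core-count CPUs so attractive for virtual screening.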
AMD Is the Leader for AI Computer CPUs
When considering the best option for a high-performance CPU, a clear winner has emerged in the past few years. Although Intel has dominated the bulk of the CPU market for years, the latest generation of AMD chips, with their innovative chiplet architecture, now offers substantially more bang for your buck: more cache memory, more cores, and plenty of PCIe 4.0 lanes across AMD’s range of consumer and high-end desktop offerings.
It can be hard to find reasonable CPUs from Intel to compare with AMD’s higher-end Threadripper CPUs (think the 64-core 3990X and 3995WX Threadripper Pro) at a similar price and performance point. Take this comparison from 2020 of a 64-core 3990X Threadripper against dual Xeon Platinum 8280 processors as an example. The former is an impressive processor that sells for around $4,000 yet remains a reasonable component in an individual workstation, whereas the dual Xeons represented around $20,000 USD of silicon at the time and were designed for HPC clusters and data centers.
Of course, if your intention is to build an HPC cluster for training at production scale, a more sensible comparison would be the AMD EPYC line of CPUs, which still favors AMD (although Intel claims superiority for their upcoming Ice Lake chips in one very specific benchmark). The balance may change as Intel tries to catch up with new offerings, but for the immediate future AMD will continue to dominate high-end desktop performance and performance per dollar.
For some image- and video-heavy computational loads, however, CPU performance becomes less important. In those domains, performance is dominated by state-of-the-art GPUs; indeed, this is one of the most common and visible application areas of deep learning and AI.
There are several worthwhile recipes in blog write-ups for personal deep learning machines that skimp decidedly on the CPU end of things, and maintain a very budget-friendly bill of materials as a result. However, be careful when sorting CPUs by cost, as this component remains important for shuttling data around even when the majority of computation is conducted on the GPU.
Ensuring adequate PCIe lanes for your GPUs (and future upgrades) is important, but in many cases 4 PCIe lanes per GPU is workable on desktops/workstations and 8 to 16 lanes are plenty.
So how does CPU choice shake out for our hypothetical startups?
Our virtual screening/computational chemistry drug development group will want to weight CPU performance heavily when choosing components, while our image-based high-throughput screening project may be able to choose budget friendly options in order to free up more funds for high-end GPUs.
For the first case, workstations with AMD Threadripper CPUs with 24 to 64 cores will enable rapid prototyping by individual engineers. For training production models, a shared server or cluster built around AMD EPYC 7002 or 7003 processors will expedite training for CPU-heavy problems.
The high-throughput biomedical image analysis startup can spend decidedly less on CPUs and still get good performance. CPU considerations for GPU-intensive deep learning applications include ensuring at least 4 cores and 8 to 16 PCIe lanes per GPU, although PCIe lanes are not so important for systems with 4 GPUs or fewer.
In the previous section, we discussed the importance of a powerful CPU for deep learning projects that depend heavily on simulation, significant amounts of pre-processing, or other general computing needs that do not yet have optimized support for accelerators like GPUs.
These include, for example, reinforcement learning projects, which often use physics simulators or significant non-neural computation for tasks like molecular docking in our hypothetical computational chemistry drug development startup.
Sometimes, engineering time can be used to fill these needs instead. If you can develop your general computing code so that it can run on the GPU, you can reap additional efficiency and speed without buying the top of the line, latest generation CPUs.
A simple example is the speedup that can be achieved by re-writing cellular automata algorithms to run on the GPU. This can easily achieve more than a 1,000x speedup over a naive serial implementation, just by re-writing the algorithm in PyTorch to take advantage of GPU support (with vectorization and GPU acceleration, the speedup can be as much as ~70,000x).
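To make the re-writing concrete, here is a minimal sketch in NumPy contrasting a naive per-cell loop with a vectorized update for Conway's Game of Life. The vectorized neighbor count is equivalent to a 2D convolution with a 3x3 kernel, which is exactly the form that ports directly to a GPU via PyTorch's `conv2d`; the NumPy version is used here only to keep the sketch dependency-light:

```python
import numpy as np

def life_step_naive(grid):
    # Naive serial update: explicit Python loops over every cell.
    h, w = grid.shape
    out = np.zeros_like(grid)
    for i in range(h):
        for j in range(w):
            # Count live neighbors in the clipped 3x3 window (zero boundary).
            n = grid[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].sum() - grid[i, j]
            out[i, j] = 1 if (n == 3) or (grid[i, j] == 1 and n == 2) else 0
    return out

def life_step_vectorized(grid):
    # Vectorized update: neighbor counts from shifted views of a zero-padded
    # grid. This sum-of-shifts is a 3x3 convolution in disguise, so the same
    # logic maps directly onto conv2d on a GPU.
    h, w = grid.shape
    padded = np.pad(grid, 1)
    neighbors = sum(
        padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
        for di in (-1, 0, 1) for dj in (-1, 0, 1)
    ) - grid
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)
```

The two functions produce identical results; the vectorized form replaces the Python-level double loop with a handful of whole-array operations, which is where the large speedups come from.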
Modern PyTorch and TensorFlow 2.x are actually highly flexible libraries that can quite readily be adapted for generalized differentiable programming and computational physics, so a little extra development time can put the emphasis back on the GPU for many bespoke needs. Careful management of parallelization and bottlenecks during training can enable impressive training capabilities for challenging combinations of reinforcement learning simulator roll-outs and large model updates.
Using this approach, Uber AI Labs (before disbanding in 2020) demonstrated the efficacy of clever engineering by reducing the compute budget for training agents to play Atari games with genetic algorithms from a 720-CPU cluster to a single 48-core desktop; they did so by efficiently managing the loads on GPU and CPU so that both components were utilized concurrently.
What GPU is best to use for AI?
Even though simulation-based deep learning projects may call for considering processor performance more strongly, GPU(s) remain the flagship component of most deep learning systems. We’ve come a long way since 2012, when AlexNet showed us the way by training on dual GTX 580 GPUs sporting 3GB of memory apiece.
As deep learning became a modern market for GPUs, manufacturers, especially NVIDIA, have invested substantial development effort in catering to AI engineers and researchers, with features specifically designed to improve efficiency, speed, and scale for large neural network models. AMD has been making modest efforts to make their GPUs more viable for deep learning, and the latest PyTorch 1.8 does support AMD’s ROCm platform, but with NVIDIA’s community support and head start with tensor cores, they remain the GPU manufacturer of choice for deep learning and can be expected to stay that way for the next few years.
Modern NVIDIA Ampere GPUs offer tensor cores and reduced-precision computation. The addition of tensor cores since the NVIDIA Volta architecture in particular make matrix multiplies so much more efficient that moving data around becomes the main bottleneck in deep learning. You’ll find tensor cores in the RTX 20xx and 30xx line of cards, that is, if you can physically find the GPU you want in stock, as they’ve been in relatively short supply recently due to a combination of high demand and pandemic related semiconductor shortages.
RTX cards also allow for reduced-precision training with 16-bit floating point numbers instead of 32-bit, effectively doubling the size of the models (in terms of number of parameters) that can fit in GPU memory and be trained.
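The memory math behind that doubling is easy to verify. The sketch below uses NumPy to compare the footprint of the same (hypothetical) parameter count stored at 32-bit versus 16-bit precision; in a real training loop this is typically handled for you by a mixed-precision facility such as PyTorch's automatic mixed precision:

```python
import numpy as np

n_params = 10_000_000  # hypothetical model size, for illustration only

# Same parameter count at two precisions.
weights_fp32 = np.zeros(n_params, dtype=np.float32)  # 4 bytes per parameter
weights_fp16 = weights_fp32.astype(np.float16)       # 2 bytes per parameter

print(f"fp32: {weights_fp32.nbytes / 2**20:.1f} MiB")
print(f"fp16: {weights_fp16.nbytes / 2**20:.1f} MiB")
```

Halving the bytes per parameter halves the memory footprint, which is why 16-bit training lets the same card hold roughly twice the model (activations and optimizer state complicate the exact accounting, but the principle holds).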
GPU memory is another factor to consider, and your choice will depend on the type of work being done on a given machine.
For individual engineers or small teams tinkering with interesting architectures, 8 to 10 gigabytes is probably enough.
Training production transformer models (and it seems like there’s a transformer for everything these days) on a local server or cluster, however, warrants choosing a card with more memory.
For personal use or training, even smaller GPUs (~4 to 6GB) may do. In our opinion, the best resource for in-depth analysis of GPUs for deep learning is Tim Dettmers’ blog, where he maintains and regularly updates a post detailing new architectures and the performance-per-cost of the most viable GPUs.
Coming back to our hypothetical case studies, the high-throughput biomedical imaging and deep learning startup will be interested in emphasizing the GPU specifications in their deep learning systems. They might want to equip personal workstations with one or two RTX 2080 or 3080 cards with about 10GB of memory, while a deep learning server or cluster might be built around A100 GPUs (although the cost-effectiveness is questionable) or similar.
The drug discovery startup depending more on computational chemistry workflows can probably get by with lower-end GPUs, freeing up budget for premium AMD CPUs to manage physics simulations and molecular docking programs.
How much RAM should be used in an AI computer?
Compared to CPUs and GPUs, RAM is likely to account for a far smaller proportion of your overall system budget. Still, having plenty of RAM can significantly improve your experience, and prototyping on a machine with enough RAM can have a significant impact on conserving your most valuable resource of all: developer time.
Being able to load large datasets into memory without worrying about running out obviates the need to program clever workarounds like chunking, and frees up brain cycles for concentrating on the more interesting (and fun) problems you’re actually trying to solve. This is even more important for projects requiring significant tinkering with data pre-processing.
A good rule of thumb is to buy at least as much RAM as the GPU memory in a system, then buy a little more (25% to 50% or so) for quality-of-life.
That’s solid advice for image-processing workflows with big GPUs, but for workflows that might weight the GPUs as slightly less important (such as our hypothetical computational chemistry/virtual screening startup), you may opt for buying twice as much RAM as GPU memory in a system (or just buy enough RAM for the datasets you’ll be prototyping with).
However, if a project is not dependent on large datasets (i.e. training is mostly done in simulations rather than on datasets) you don’t need as much memory. That may apply to projects such as our theoretical virtual screening/docking startup or other reinforcement learning type projects.
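The rules of thumb above can be collapsed into a tiny helper. This is a sketch of the heuristics described in this section, not a hard requirement, and the workload labels are made up for illustration:

```python
def suggested_ram_gb(total_gpu_memory_gb, workload="gpu_heavy"):
    """Rule-of-thumb RAM sizing sketch (hypothetical helper).

    gpu_heavy: match total GPU memory plus ~25-50% headroom (we use 50%).
    cpu_heavy: roughly twice total GPU memory, for CPU-bound workflows.
    """
    factor = {"gpu_heavy": 1.5, "cpu_heavy": 2.0}[workload]
    return total_gpu_memory_gb * factor

# e.g. a workstation with two hypothetical 10 GB cards:
print(suggested_ram_gb(20))               # GPU-heavy sizing
print(suggested_ram_gb(20, "cpu_heavy"))  # CPU-heavy sizing
```

For simulation-dominated projects where training data barely touches system memory, even these figures can be relaxed, as noted above.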
In any case, it’s always a good idea to leave a few memory slots open for future upgrades in case typical training runs change, and RAM is unlikely to be cost-prohibitive relative to the other components of an AI/deep learning system.
One final note is to not be swayed by sky-high RAM clock rate specs. According to deep learning hardware guru Tim Dettmers, faster RAM speeds only enable performance gains of a few percent, and even modestly faster RAM clock speeds (e.g. 3200 MHz versus 3000 MHz) can cost somewhere between 10% and 33% more.
In other words, buy more RAM rather than faster RAM.
We’ve outlined the major components of deep learning systems from the perspective of a couple of different use cases. For projects that involve mostly “neural compute” (i.e. matrix multiplication), the choice of GPUs is the main concern, and buying expensive CPUs is probably not worth it. These include NLP and image-heavy workflows involving big convolutional neural networks and transformer models.
For a project that requires significant data pre-processing, physics simulations, and other general computational flexibility, it does make sense to invest in highly capable AMD CPUs such as those in the Threadripper or EPYC line. These types of projects can include reinforcement learning type projects that rely on physics simulators.
However, even projects with significant computational needs that lack GPU support may benefit from some development effort to take advantage of GPU compute. Clever management of the interplay between neural network passes on the GPU and RL environment roll-outs on the CPU can yield substantial improvements in a training pipeline, leading to better iteration and prototyping throughput.