In this special guest feature, Bill Wagner from Bright Computing writes that the convergence of HPC & AI presents new challenges for containers, job scheduling, and system management.
The convergence of HPC & A.I. is a hot topic right now, and for good reasons. In practice, there is a symbiotic relationship between (for example) HPC simulations and machine learning. HPC simulations generate tons of data, and as luck would have it, machine learning requires tons of data to train models. If we turn the value proposition around, the quality and effectiveness of HPC simulations can be improved if we use machine learning to identify the parameters in simulation data that have a significant effect on the outcome and focus subsequent iterations of simulation on those parameters.
Beyond the work itself, HPC and A.I. share another characteristic: the need for expensive, high-performance infrastructure to run these workloads, including lots of computing power and storage, accelerators, and high-speed interconnects. And although A.I. is only getting started relative to the maturity of its HPC cousin, the infrastructure needs the two share lead many organizations to see the convergence of HPC and A.I. at an infrastructure level as inevitable. But here’s the rub: traditional HPC applications run under the jurisdiction of an HPC workload manager like Slurm or PBS Pro, whereas machine learning applications are primarily run in containers under the jurisdiction of a container orchestration system, such as Kubernetes.
So if you want to run (for example) HPC apps managed by Slurm and machine learning containers managed by Kubernetes on the same Linux cluster, you need to hardwire one subset of your cluster nodes to Slurm for the HPC workload and another subset to Kubernetes for the machine learning containers. That sounds more like coexistence than convergence. What if there are times when you temporarily need most of your cluster (or the whole thing) to run an important HPC simulation or to train a model?
At Bright, we see the convergence of HPC and A.I. as an opportunity to exploit a great auto-scaling feature we’ve had in our product for years that our developers refer to as “cm-scale.” Bright Cluster Manager’s auto-scaling feature acts as a sort of uber-scheduler that has the ability to look at the resource demand across workload engines (e.g., Slurm and Kubernetes) and auto-scale the resources assigned to each by repurposing cluster nodes on the fly to serve either Slurm (HPC) workloads or Kubernetes (machine learning) workloads as demand and policy dictate. You can also use this feature to trigger the provisioning of resources from a public cloud for additional capacity when needed, and then automatically terminate those resources when demand drops below a certain level. Powerful stuff.
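To make the idea of demand- and policy-driven repurposing concrete, here is a minimal Python sketch. It is not how cm-scale is implemented and it uses no Bright Cluster Manager API; the names `Engine`, `plan_allocation`, and `cloud_burst`, and the notion of "demand" as a simple node count, are assumptions made purely for illustration. Each workload engine gets a policy floor, the remaining nodes are split in proportion to unmet demand, and any demand the cluster cannot cover is reported as a candidate for cloud capacity.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """A workload engine competing for cluster nodes (hypothetical model)."""
    name: str
    demand: int         # nodes the engine could use right now (assumed metric)
    min_nodes: int = 0  # policy floor: nodes always reserved for this engine


def plan_allocation(total_nodes, engines):
    """Return {engine name: node count} for a cluster of total_nodes nodes."""
    floors = sum(e.min_nodes for e in engines)
    if floors > total_nodes:
        raise ValueError("policy floors exceed cluster size")

    # Every engine starts with its policy floor.
    alloc = {e.name: e.min_nodes for e in engines}
    free = total_nodes - floors

    # Demand that the floor alone does not satisfy.
    unmet = {e.name: max(e.demand - e.min_nodes, 0) for e in engines}
    total_unmet = sum(unmet.values())

    if total_unmet <= free:
        # Everything fits: give each engine what it asks for.
        for name, extra in unmet.items():
            alloc[name] += extra
        return alloc

    # Oversubscribed: split the free nodes in proportion to unmet demand,
    # handing out rounding leftovers by largest fractional share.
    shares = {name: free * extra / total_unmet for name, extra in unmet.items()}
    for name, share in shares.items():
        alloc[name] += int(share)
    leftover = free - sum(int(share) for share in shares.values())
    by_fraction = sorted(shares, key=lambda n: shares[n] - int(shares[n]),
                         reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


def cloud_burst(total_nodes, engines):
    """Nodes of demand the cluster cannot cover (candidates for cloud)."""
    wanted = sum(max(e.demand, e.min_nodes) for e in engines)
    return max(wanted - total_nodes, 0)


if __name__ == "__main__":
    engines = [Engine("slurm", demand=12, min_nodes=2),
               Engine("kubernetes", demand=4, min_nodes=2)]
    print(plan_allocation(10, engines))
    print(cloud_burst(10, engines))
```

A real scheduler would also have to drain running jobs before reassigning a node and would apply richer policies (priorities, time windows, cost limits); the sketch only shows the allocation arithmetic.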
We refer to this notion of having a single Linux cluster capable of sharing its resources dynamically across different types of applications and workloads as a “dynamic data center,” and we believe it represents the future-state infrastructure that most organizations aspire to in an increasingly compute/data-intensive world.
Bill Wagner is CEO of Bright Computing.