If you want to see what the future of iron to support machine learning looks like, then perhaps the best place to look is at what the hyperscalers and cloud builders, who account for the vast majority of processing and applications in this field, are deploying. Or, more precisely, look at the iron that their ODM partners are trying to peddle to other companies, inspired by what the hyperscalers and cloud builders are using.
Inspur, one of the upstart infrastructure makers that is based in China but expanding outward to North America and Europe, is a good case in point.
The company has very good insight into what the Big Four in China – Alibaba, Baidu, Tencent, and either China Mobile or JD.com, depending on how you want to rank numbers four and five – are doing with their vast infrastructure, and it dominates some of these accounts. As we reported back in October 2018, when Inspur was making a push into Open Compute, Inspur has about half of the plain vanilla server shipments and about 80 percent of the GPU accelerated machine learning shipments to the hyperscalers and cloud builders in China. Inspur also works with Microsoft, one of the Big Four in the United States, on its current generation “Project Olympus” servers, the designs of which have been open sourced through the Open Compute Project championed by Facebook, alongside some other hyperscale iron that was inspired by Inspur’s manufacturing deals with Alibaba and Tencent. The company was on track at the time to have $6 billion in server bookings in 2018, which is not too shabby in this cut-throat server market.
It takes a stack to do this, and a lot of the innovation in the machine learning space has been done by the hyperscalers and cloud builders that need to monetize all that data about us, by the academics they collaborate with or directly support with research, and by the traditional and upstart IT vendors.
“The hyperscalers are definitely leveraging this technology as their driving force for business growth, and they are leveraging it really well,” Dolly Wu, vice president and general manager of the Datacenter/Cloud Division at Inspur, specifically for the markets in the United States and Canada, tells The Next Platform. “Companies like Microsoft, Google, Alibaba, Tencent – all the hyperscalers – are growing very, very fast. And part of that growth is attributed, we believe, to AI. If HPC is basically making machines to do tasks very quickly, what AI is doing is injecting intelligence into machines to make it speed up even more in an effort to help people deal with business issues.”
This, in a nutshell, is why various AI techniques, be they based on machine learning or other statistical techniques, will eventually be embedded in all sorts of enterprise applications.
Based on what Inspur is selling into the hyperscaler and cloud builder accounts, which is a representative portion of that sector of the IT market, machine learning training is dominated by GPU accelerated servers, while machine learning inference – making use of the models that are trained and reduced – is still largely done on CPUs, though some companies are starting to use a mix of CPUs and GPUs or CPUs and FPGAs. Wu notes that Intel is seeking to protect its sales of CPUs for inference workloads by adding the Vector Neural Network Instructions (VNNI) – also known as Deep Learning Boost or DLBoost – to the future “Cascade Lake” Xeon SPs that are expected to be launched formally in a couple of months. There are custom ASICs that are trying to break into the inference space, too, not limited to Google’s Tensor Processing Unit (TPU) chips but exemplified by them.
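To see what that instruction buys, consider what int8 inference actually computes. Here is a minimal NumPy sketch of the quantize, multiply-accumulate, dequantize cycle; the fused int8 multiply with int32 accumulation in the middle is the operation VNNI collapses into a single instruction. The tensor shapes and scaling scheme are illustrative assumptions on our part, not Intel code.

```python
import numpy as np

# Hypothetical fp32 activations and weights for one layer.
activations = np.random.randn(256).astype(np.float32)
weights = np.random.randn(256).astype(np.float32)

# Symmetric quantization to int8: scale each tensor by its max magnitude.
a_scale = np.abs(activations).max() / 127.0
w_scale = np.abs(weights).max() / 127.0
a_int8 = np.clip(np.round(activations / a_scale), -127, 127).astype(np.int8)
w_int8 = np.clip(np.round(weights / w_scale), -127, 127).astype(np.int8)

# The core of int8 inference: multiply int8 pairs, accumulate into int32.
# This multiply-accumulate chain is what a VNNI instruction fuses.
acc = np.dot(a_int8.astype(np.int32), w_int8.astype(np.int32))

# Dequantize the accumulator back to fp32 for the next layer.
result = acc * (a_scale * w_scale)
print(result, np.dot(activations, weights))  # close, within quantization error
```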
“CPU is still the bulk of the inference market, and Intel claims that 80 percent of all the inference still is being done on their standard CPUs,” says Wu. “FPGAs are being adopted and widely used across the hyperscale landscape, but they haven’t really filtered down to the Tier 2 cloud service providers or more average enterprises yet.”
Alibaba, which like Amazon is an online retailer as well as a public cloud provider, gives some insight into the challenges that the hyperscalers face and how they are using machine learning to boost sales through massive amounts of automation.
Alibaba has hundreds of millions of customers and millions of merchants across its various sites – Taobao, Alibaba.com, and Tmall are its main ones – and the former AliPay payment service (analogous to PayPal and now called Ant Financial) is also a behemoth, and probably the most valuable payment platform, with a valuation of around $150 billion, rivaling the $350 billion market capitalization of Alibaba Group itself. Alibaba is a huge business, and one where automation is an absolute necessity. Which we all know is the mother of invention. And that business has generated a huge volume of customer information, which we all know is gold waiting to be mined by machine learning.
The biggest shopping day in the world belongs to Alibaba, and it is called Singles Day, which occurs on November 11 each year. This holiday started before the e-commerce boom, back in 1993 at Nanjing University in China, as a way to celebrate being single, but Alibaba has transformed it into the biggest shopping day in the world – bigger than Black Friday, Cyber Monday, and Amazon Prime Day put together. The gross merchandise value (GMV) of Singles Day at Alibaba has soared since the event really took off in 2013.
The sales figures for Singles Day in 2018 were widely reported, with $30.8 billion in sales (as loosely expressed by GMV, not actual Alibaba revenue) across a stunning 60,000 brands under the Alibaba umbrella. Alibaba hit $1 billion in sales in 85 seconds as Singles Day 2018 got started, and had booked $10 billion in under an hour. That $30.8 billion in total sales for Singles Day 2018 was up 27 percent compared to 2017, which is a lot. But here is the neat bit. Three months before Singles Day, Alibaba ordered 50,000 servers from Inspur, which supplies the bulk of its infrastructure (but not all of it), to cover the incremental increase in sales that Alibaba anticipated in 2018 and to drive the new kinds of applications that made it possible for Alibaba to book that much business in a 24-hour period in the first place.
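Those figures hang together arithmetically, as a quick back-of-the-envelope check – our arithmetic, using only the numbers above – shows:

```python
# Sanity-check the reported Singles Day 2018 figures.
opening_rate = 1e9 / 85           # $1 billion in 85 seconds is about $11.8 million per second
print(10e9 / opening_rate / 60)   # at that pace, $10 billion takes ~14 minutes; the pace
                                  # slows after the open, hence "under an hour"
print(30.8e9 / 1.27 / 1e9)        # 27 percent growth implies roughly $24.3 billion GMV in
                                  # 2017 (dollar figures wobble with yuan exchange rates)
```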
For this past year’s Singles Day, the recommendation engines for the Alibaba sites were driven by machine learning inference, and more specifically, according to Wu, Alibaba chose FPGAs from Xilinx to drive the inference and GPUs from Nvidia to drive the training. (Inspur has its own F10A FPGA card and the TF2 inference engine programmed for the FPGAs, which runs TensorFlow inference models that use a mix of 8-bit and 4-bit integer as well as 32-bit floating point math.)
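That mixed 8-bit and 4-bit integer math is, at heart, a packing and scaling exercise. As a hedged sketch of the idea – ours, not Inspur’s TF2 code – here is how two 4-bit weights can be packed into each byte, halving the weight memory the FPGA has to hold:

```python
import numpy as np

def quantize_int4(weights_fp32):
    """Quantize fp32 weights to unsigned 4-bit levels plus a scale and offset."""
    lo, hi = weights_fp32.min(), weights_fp32.max()
    scale = (hi - lo) / 15.0                                     # 4 bits -> 16 levels
    q = np.round((weights_fp32 - lo) / scale).astype(np.uint8)   # values 0..15
    return q, scale, lo

def pack_nibbles(q):
    """Pack two 4-bit values per byte, halving the memory footprint."""
    assert q.size % 2 == 0
    return (q[0::2] << 4) | q[1::2]

def unpack_nibbles(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

weights = np.random.randn(1024).astype(np.float32)
q, scale, lo = quantize_int4(weights)
packed = pack_nibbles(q)
restored = unpack_nibbles(packed) * scale + lo   # dequantize
print(packed.nbytes, weights.nbytes)             # 512 bytes vs 4,096 bytes
print(np.abs(restored - weights).max())          # worst-case quantization error
```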
Machine learning is also being used to program the robots and drones that automatically shift inventory in Alibaba’s warehouses (the product kind, not the data kind), and there are chatbots that handle 95 percent of customer service calls, because handling that many calls with people doesn’t scale well. These chatbots are capable of interpreting human emotions to a certain degree, and know to shift a customer to a real human being in the event that the customer is very upset. Which helps calm them down. Finally, machine learning was used to drive the facial recognition software and payment system that Alibaba created for over 100,000 affiliated retail outlets, which allowed customers to walk around and automatically be charged for the stuff they take out. (Amazon Go stores are promising the same kind of functionality.)
All of this machine learning takes iron.
At the moment, the FPGA cards are just put into the infrastructure servers where the applications run to handle the inference, and specifically, these are racks of servers created by the “Project Scorpio” open source hardware effort from Alibaba, Baidu, and Tencent – and backed by Inspur – that is the analog to Intel’s Rack Scale Architecture and, to a certain extent, the Open Compute standard and Open Rack designs. The interesting bit is that Inspur makes its own cards based on Xilinx FPGAs rather than buying them from Xilinx, and there is a good chance that someday it will sell cards of its own based on Intel Altera FPGAs, too. At the volumes that Inspur ships, it gets to be the ODM, not Intel or Xilinx.
While there are many startups that are creating custom chips to do inference, at least among the hyperscalers – excepting Google, of course – Wu does not think there is much of a chance for these to take off. “I don’t see custom ASICs flying,” says Wu. “It’s too expensive and it’s too purpose built.”
For machine learning training, Alibaba is using the AGX-4 system, which is a 2U, two-socket Intel Xeon SP server with four Nvidia V100 GPU accelerators (which, through PCI-Express switches and GPU sidecars, called the GX4, can be expanded to up to 16 GPUs in a system). These machines cost in the neighborhood of $80,000 fully loaded with CPUs, memory, flash, and eight GPUs – the base four plus one GX4 sidecar’s worth, presumably. According to Wu, Alibaba is already planning to adopt Inspur’s forthcoming AGX-5 system next year, which is a derivative of the Nvidia HGX-2 system that links together 16 GPUs into a single memory space using the NVSwitch interconnect.
And here’s the thing: Inspur will sell the machines for a lot less than what Nvidia was charging for similar DGX-1 and DGX-2 machines – although Wu cannot be specific with such comparisons because of the confidential nature of its hyperscaler and cloud builder customers. Our guess, based on the limited pricing information we have seen in the past out of Inspur, is somewhere between 40 percent and 50 percent off the list price of DGX-1 and DGX-2 gear.
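For what it is worth, the $80,000 AGX-4 figure above lines up with that guess. Nvidia’s launch list price for the eight-GPU DGX-1 was widely reported at $149,000 – our number, not Wu’s – which makes the comparison easy:

```python
# Cross-check the 40 to 50 percent discount guess against the article's
# $80,000 AGX-4 figure; the DGX-1 list price is our assumption.
dgx1_list = 149_000       # widely reported DGX-1 launch list price
agx4_loaded = 80_000      # AGX-4 with eight V100s, per the article
print(f"{1 - agx4_loaded / dgx1_list:.0%}")   # about 46 percent off
```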
Further down the road, what Alibaba wants to have – and what Inspur is working towards building now – is composable infrastructure, which will allow for pools of FPGA and GPU accelerators that are disaggregated from the CPU compute, so that the accelerators can be dynamically attached to the CPUs to accommodate different, and changing, workloads.
The idea here is to only buy the CPUs, GPUs, FPGAs, and flash that you need and to drive up the utilization of these different components across the network, rather than trying to do it with different physical – and static – configurations of servers tailored at build time to specific workloads. The average datacenter utilization, says Wu, is still only around 12 percent (measuring just CPU utilization), with Google leading the pack at around 30 percent. But with composable infrastructure, the utilization of these components could be pushed up to somewhere between 60 percent and 90 percent.
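A toy model shows why pooling moves those numbers so much: when accelerators are statically bolted to servers, every idle GPU is stranded with its host, but when they are composed out of a shared pool, the same aggregate demand can be served by far fewer, busier devices. A minimal sketch, with made-up demand figures that happen to echo Wu’s percentages:

```python
# Toy model: 10 workloads that each burst to 4 GPUs but average 0.5 GPUs of demand.
workloads = 10
peak_gpus_each = 4
avg_gpus_each = 0.5

# Static provisioning: every workload gets its peak allocation permanently.
static_gpus = workloads * peak_gpus_each
static_util = (workloads * avg_gpus_each) / static_gpus
print(f"static: {static_gpus} GPUs at {static_util:.0%} utilization")   # 40 GPUs, ~12%

# Composable provisioning: size the shared pool for aggregate demand plus headroom.
pool_gpus = 8
pool_util = (workloads * avg_gpus_each) / pool_gpus
print(f"pooled: {pool_gpus} GPUs at {pool_util:.0%} utilization")       # 8 GPUs, ~62%
```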
At least that is the dream.