Cloudera has been through an eventful, some might say turbulent, year. While it's been a while since Cloudera branded itself as Hadoop, its fortunes have nonetheless been tied to the open source platform first codified by Doug Cutting and Mike Cafarella well over a decade ago. In our recent post about Strata, we referred to a midlife crisis, and that’s a metaphor that could also apply to Cloudera, the prime sponsor of the event. In a recent post on Medium, Cloudera chief product officer Arun Murthy summed up the whole 2019 saga in one elegant title: “Hadoop is Dead. Long Live Hadoop.” According to Murthy, the post evidently struck a chord, getting roughly 10x the number of reads of most Cloudera blogs.
Last week, Big on Data bro Andrew Brust provided an exhaustive account of Cloudera’s new generational product change, Cloudera Data Platform (CDP). To its credit, Cloudera didn’t simply do a cut and paste job with the new offering, which was the long-awaited converged platform coming out of its merger with Hortonworks. It was a complete rethinking from the ground up, starting with refactoring of storage and compute that severed the ironclad connection between Cloudera’s platform and HDFS.
In an era where the cloud in Cloudera is finally emerging as the future path for Hadoop, the company made the shrewd choice to go all in on cloud-native architecture. On the storage side, it makes S3-style cloud object storage a first-class citizen on par with HDFS, and on the compute side, it clears the path for Kubernetes to supplant YARN for commandeering resources in the cloud.
What does that mean? Beyond elasticity, it means on-the-fly deployment. Cloudera’s first attempt at a cloud offering, Altus, was based on deployment via virtual machines (VMs), a process that typically took about 8 minutes to spin up clusters. With Docker and Kubernetes on CDP, that goes down to 30 seconds. ‘Nuff said?
In his piece, Andrew also recounted the trajectory of the company this year. After a disastrous Q1, and with MapR on the ropes, the conventional wisdom was that Hadoop was dead, and that Cloudera and MapR were becoming roadkill along with it. Enter Carl Icahn.
Now let’s put this saga in perspective. First off, at Strata we saw Ted Dunning, who still lists himself as MapR’s chief technology officer, and he assured us that the product engineering team made the move to HPE and is still largely intact. MapR’s flavor of Hadoop may not be quite so dead.
Back to our originally scheduled program: Cloudera was going through the familiar story of a company on the cusp of platform change (as Andrew termed it, the Osborne effect); of course, customers are going to hold back until they know what’s coming. As noted, Cloudera didn’t adequately warn Wall Street. The good news is that it had a better-than-expected Q2, which was enough to bring Icahn’s forces to a standstill, for now.
On the product side, not only did Cloudera re-architect the heck out of the combined CDH and HDP assets, it finally tamed the zoo animals. For instance, Cloudera’s Shared Data Experience (SDX), which was vaporware when it was first introduced 18 months ago, is now real. And more importantly, it is more than the sum of its zoo animals: it is a coherent offering that, under the hood, incorporates policy management functionality from Apache Ranger; metadata tagging from Apache Atlas; and single sign-on capability from Apache Knox. This is a single package with a single install; you will not see separate Ranger, Atlas, or Knox modules exposed. The constituent parts may be open source, but the integration and packaging are Cloudera’s unique (and proprietary) IP.
It’s a fact that in the cloud, Cloudera starts as the challenger to cloud provider Hadoop offerings including AWS EMR; Azure HDInsight (although rooted in Hortonworks, it’s very much a Microsoft product now); and Google Cloud Dataproc. They all offer most of the open source components that CDP does. But, aside from perimeter security and identity and access management, they lack the more granular data-specific governance, access control, and tracking/auditing capabilities of SDX. By the way, the same is true for point services like Databricks or any of the machine learning or AutoML services that are offered in the cloud; there’s no real governance of them aside from what the cloud provider offers.
We don’t expect this situation to last for long; for instance, AWS’s Glue ETL offering could form the basis for an expanded data governance capability by leveraging its metadata. We expect that Azure and GCP won’t lag far behind either. But for now, aside from third-party data governance offerings that attack pieces of the problem, Cloudera is the only heterogeneous data platform that has this capability all to itself.
But that’s not all. Because SDX is tied to other open source projects that are used with Hadoop, Cloudera could conceivably package this separately and have something to sell to EMR, HDInsight, or Cloud Dataproc customers that might otherwise be beyond its reach. Couple that with Cloudera’s positioning as being cloud-agnostic, and we believe that SDX is the crown jewel of the Cloudera Data Platform.
So, the good news on the technology front is that Cloudera is on the right track. The job is not done, but it’s finally dealt with the zoo animal distractions. The conventional wisdom is that Cloudera’s challenge now is to execute. The conventional wisdom is true; if you have a product, you need to effectively connect with customers and sell it. With its focus on the Global 2000, Cloudera already has a rich base for pursuing the expand part of its land-and-expand strategy. The installed base includes nearly a thousand customers with engagements exceeding six figures, and there remains plenty of headroom for growing its footprint with existing customers. Cloudera is planning to get the field sales and support engineering force up to speed on the new platform within the next quarter or two.
But the question is, what is Cloudera selling? So far, it has gotten to the point of rationalizing the platform and doing some simplification. But by its nature, the Cloudera Data Platform is all about heterogeneity: a heterogeneous mix of workloads against a heterogeneous mix of storage, compute, data, and data types. That’s not only the toughest nut to crack, but also the toughest to define. And by the way, Cloudera is not the only one tackling heterogeneity, as we’re seeing many of the household names in data warehousing pitch visions that use familiar SQL as the starting point. They’ve got a strong value proposition given the large SQL skills pool out there.
Although Cloudera will be offering fit-for-purpose packaging of its platform for data warehousing, data engineering, and machine learning, it still has its work cut out making the business case for why you need a Swiss Army knife platform for analyzing large hoards of data.
That’s where the elevator pitch to audiences outside of Cloudera’s traditional constituencies of CIOs and architects becomes essential. A key pillar of this, of course, is multi-cloud and hybrid cloud – but that is a pitch for every incumbent that’s not AWS, Azure, or GCP. Cloudera needs to define a message that goes beyond “we conquered the complexity of Hadoop” to the stories that can be told because the platform is multi-cloud, governed, and multi-workload. For now, Cloudera has the message that will address the folks its sales teams typically call on. But ultimately, CIOs and architects are not interested in technically perfect solutions, but in solutions that address the business needs of the lines of business that are, directly or indirectly, funding them.
Cloudera has done, and is still doing, the homework for putting its next-generation product together. Yes, now that it has the new product, Cloudera needs to execute. But beyond that, Cloudera needs a more cogent higher-level message that tells the stories of the types of business problems that its platform is best positioned to solve.