Going cloud for your storage needs comes with some baggage. On the one hand, it’s cheap, elastic, and convenient – it just works. On the other hand, it’s messy, especially if you are used to working with data management systems like databases and data warehouses.
Unlike those systems, cloud storage was not designed with things such as transactional support or metadata in mind. If you work with data at scale, these are pretty important features. This is why Databricks introduced Delta Lake to add those features on top of cloud storage back in 2017.
Earlier in 2019, Delta Lake was open sourced. Today, the Linux Foundation, the non-profit organisation enabling mass innovation through open source, announced that it will host Delta Lake. ZDNet spoke with Matei Zaharia, Databricks Chief Technologist and co-founder, about this development and how it fits into the overall data and analytics landscape.
Delta Lake keeping in touch with the open source community, getting adoption
Zaharia and Databricks CEO and co-founder Ali Ghodsi are the original creators of the open source Apache Spark project, the unified analytics engine that has become a de facto standard for large-scale data processing.
Early on in Databricks’ course, the decision was made to focus on offering a data management platform based on a hardened version of Apache Spark, with additional proprietary elements, as a managed cloud-based service. At the same time, Databricks remains the driving force behind the evolution of Spark.
This open core strategy is typical for many companies that act both as the stewards of open source projects, and as commercial for-profit entities. It’s a way to balance the benefits of open source, with the need to be commercially sustainable. It can, however, lead to unintended side-effects.
Competition from cloud vendors has forced some companies offering open source products to react. What they did was to change the licenses of their open source components to prohibit cloud vendors from taking their open source core and offering it as a service themselves. This, in turn, has caused controversy in the open source community, and beyond.
Databricks is aware of this, and decided to take a different approach for Delta Lake. As Zaharia explained, they want Delta Lake to be as widely adopted as possible, which is why they open sourced it. At the same time, they want it to take on a life of its own, independent of Databricks, which is why they are handing it over to the Linux Foundation.
Databricks wants to send a clear message to the community, said Zaharia, which is why they chose the Linux Foundation, an umbrella foundation for open source projects, as the steward for Delta Lake. Although it’s been only 6 months since Delta Lake was open sourced, data shared by Databricks suggest strong uptake.
Since its launch in October 2017, Delta Lake has been adopted by over 4,000 organisations and processes over two exabytes of data each month. Adopters include the likes of Alibaba, Booz Allen Hamilton, Intel, and Starburst. Coupled with an open governance model that encourages participation and technical contribution, this may mean that Delta Lake does indeed become a standard for storing big data.
Beyond Delta Lake: replicating a strategy based on open source
Delta Lake aims at nothing short of unifying cloud storage and data warehouses, and this theme is reflected across the board for Databricks. Take the recently announced partnership with Tableau, for example. As Zaharia explained, the partnership itself is not new. What is new is the Databricks connector for Tableau, which is faster and easier to use than the generic Spark connector previously available.
For Databricks, having strong business intelligence and visualization partners like Tableau makes sense. It enables Databricks to go the last mile to business users, which is something Zaharia said they did not have in mind at the beginning of their journey. This is an interesting point, as it sheds light on the focus on Delta Lake.
Zaharia said their original aim was to serve data scientists and their workloads, with a focus on machine learning. But as they were met with increasing demand for “vanilla” data access, the kind of workload typically served by data warehouses, unifying the two became a priority. This is how Delta Lake came to be.
For a vendor like Tableau, on the other hand, being able to access and integrate data that lives in cloud storage vastly expands the reach of its users. It’s a win for everyone. So even though Tableau may have a more direct way to access data in the Databricks platform via the partnership, the idea is that any tool should be able to access data on any cloud via Delta Lake.
This sounds great, but it’s a bit more complicated than that. Delta Lake may enable access on the data layer, but what happens on the metadata and management layer? Having access to vast amounts of data without proper data governance in place would be like navigating uncharted territory: good luck finding your way, and keeping track of where you’ve been.
In a way, enabling access via Delta Lake could mean becoming the victim of your own success. This has not happened to Databricks, Zaharia said, due to its approach of letting users integrate their data management solution of choice. He did concur, however, that entering the turf of data warehouses comes with the responsibility of appropriate data management.
This is something Databricks itself may decide to tackle in the future, in much the same way it decided to tackle cloud storage with Delta Lake. It’s not as if the Databricks team is short of things to do, though. MLflow, another open source project started at Databricks, is taking steps to address the management of machine learning models by introducing a model registry. Zaharia hinted we may see MLflow taking the Delta Lake path soon, too.
MLflow was also the core on which AutoML features for Databricks were built. This goes to show the philosophy on which Databricks operates: tackle problems faced by customers, open source solutions to foster adoption and innovation, integrate and build proprietary extensions, offer as a service.
Judging from the numbers Zaharia shared, this seems to be working well for Databricks: a three-fold increase in customer base, over $100 million in revenue, and an event which has grown in attendance from 1500 to 2100 people in one year. While Apache Spark is not the only data platform around, Databricks’ strategy seems to be making the difference.