MLflow, the open source machine learning operations (MLOps) platform created by Databricks, is becoming a Linux Foundation project. It’s also getting some new features. The move was announced by Matei Zaharia, co-founder of Databricks, and creator of both MLflow and Apache Spark, at the company’s Spark + AI Summit virtual event today.
In a pre-briefing with ZDNet earlier in the week, Zaharia provided an update on MLflow’s momentum, details on the new features and reasoning for moving management of the open source project from Databricks to the Linux Foundation.
Momentum-wise, Zaharia said MLflow has been experiencing 4x year-over-year growth. On the Databricks platform alone (including both the Amazon Web Services and Microsoft Azure offerings of the service), Zaharia said more than 1 million experiment runs are executed on MLflow, and more than 100,000 ML models are added to its model registry, *each week*. Some customers are, apparently, creating tens of thousands of models, which helps drive that immense volume.
With that kind of growth, Zaharia explained, it's important for customers to see the project managed by a vendor-neutral organization. This protects customer investments in MLflow and eliminates any unease that the project might be dependent on Databricks' corporate direction. I still find it odd that Spark itself — on which the Databricks platform is based — is an Apache Software Foundation project, whereas associated projects Delta Lake and now MLflow sit under the Linux Foundation. Zaharia explained that the two foundations operate similarly enough that the difference shouldn't affect users. He also pointed out that Kubernetes and the Cloud Native Computing Foundation sit under the Linux Foundation umbrella, creating useful synergies for the Databricks-launched projects that have moved there.
With all this momentum, the MLflow project has added new application programming interfaces (APIs) that allow it to integrate with CI/CD (continuous integration/continuous deployment) frameworks like Jenkins and GitLab.
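To make the CI/CD connection concrete, here is a minimal sketch of the kind of quality gate a Jenkins or GitLab pipeline stage might apply before promoting a model. The function names, the `"accuracy"` metric key, and the threshold are illustrative assumptions for this example — they are not MLflow's actual API.

```python
# Hypothetical CI/CD quality gate for a model promotion step.
# Names and thresholds here are invented for illustration; a real pipeline
# would pull the metrics from an experiment-tracking service.

def passes_quality_gate(metrics: dict, min_accuracy: float = 0.9) -> bool:
    """Return True if the candidate model's logged metrics clear the bar."""
    return metrics.get("accuracy", 0.0) >= min_accuracy

def promote(model_name: str, metrics: dict) -> str:
    """Decide whether a CI job should promote a model to staging."""
    if passes_quality_gate(metrics):
        return f"{model_name}: promoted to staging"
    return f"{model_name}: promotion blocked"

print(promote("churn-model", {"accuracy": 0.93}))  # promoted to staging
print(promote("churn-model", {"accuracy": 0.71}))  # promotion blocked
```

The point is that once experiment metrics are exposed through an API, model promotion becomes just another automated, gated pipeline stage — the same pattern CI/CD tools already apply to conventional software builds.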
MLflow now has automatic logging and versioning, too. For logging, details like model metrics and parameters can be recorded automatically and, for experiments run on the Databricks platform, cluster and notebook details are logged as well. Model versioning is enabled by the use of Delta Lake — which innately provides for versioning and efficient storage of just what’s changed between versions — as the storage medium.
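The autologging idea can be sketched in plain Python. The toy `Run` class and `autolog` decorator below are invented for this illustration — they model the pattern (hyperparameters and metrics captured without explicit logging calls in the training code), not MLflow's actual implementation.

```python
class Run:
    """Toy experiment run that holds automatically captured data."""
    def __init__(self):
        self.params = {}
        self.metrics = {}

def autolog(train_fn):
    """Wrap a training function so its hyperparameters and final score
    are recorded without any logging calls inside the function itself."""
    def wrapper(**hyperparams):
        run = Run()
        run.params.update(hyperparams)                   # parameters logged automatically
        run.metrics["score"] = train_fn(**hyperparams)   # metric logged automatically
        return run
    return wrapper

@autolog
def train(lr=0.1, epochs=10):
    # Stand-in for real model training; returns a "score".
    return 1.0 - lr / epochs

run = train(lr=0.5, epochs=5)
print(run.params)   # {'lr': 0.5, 'epochs': 5}
print(run.metrics)  # {'score': 0.9}
```

The training function stays free of bookkeeping code, which is exactly the appeal: the tracking layer, not the data scientist, is responsible for capturing the experiment's details.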
Both of these new MLflow features acknowledge that machine learning model development is a type of software development, where such logging and CI/CD practices are standard operating procedure. They also correspond well with features announced last month by Cloudera for its own MLOps platform, which acknowledge the same reality.
Also read: Cloudera’s MLOps platform brings governance and management to data science pipelines
Speaking of Delta Lake, Databricks announced just yesterday a new proprietary layer on top of it, called Delta Engine. This product is essentially a rewrite of the Spark SQL/DataFrame engine, implemented in a combination of C++ and assembly language, to provide much improved performance over the Java-based implementation in Apache Spark.
Delta Engine also makes use of so-called SIMD (single instruction, multiple data) CPU instructions, to process multiple data items at once, rather than one at a time. This technique, also known as vector processing or vectorization, is a standard feature in most data warehouse platforms, as is columnar storage, which Delta Lake utilizes. Another staple of data warehousing is massively parallel processing, or MPP. And while Delta Engine doesn't implement MPP per se, both Spark SQL and Delta Engine parallelize work across the nodes in a Spark cluster, to much the same effect.
Lake and warehouse converge
Together, these features support the “data lakehouse” concept — Databricks’ name for its fusion of data lake and data warehouse architectures and features into a single platform. Along with the acquisition of query/dashboard platform purveyor Redash, covered yesterday by my ZDNet Big on Data colleague George Anadiotis, this positions Databricks to provide a platform with big data/data lake, data warehouse and data visualization capabilities, along with the data engineering and machine learning capabilities that Databricks and Apache Spark have been known for.
Also read: Data Lakehouse, meet fast queries and visualization: Databricks unveils Delta Engine, acquires Redash