Apache Spark is a hugely popular execution framework for running data engineering and machine learning workloads. It powers the Databricks platform and is available in both on-premises and cloud-based Hadoop services, like Azure HDInsight, Amazon EMR and Google Cloud Dataproc. It can run on Mesos clusters too.
But what if you just want to run your Spark workloads on a Kubernetes (k8s) cluster sans Mesos, and without the Hadoop YARN strings attached? While Spark first added Kubernetes-specific features in its 2.3 release, and improved them in 2.4, getting Spark to run natively on k8s, in a fully integrated fashion, can still be a challenge.
Today, Google, which created Kubernetes in the first place, is announcing the beta release of the Kubernetes Operator for Apache Spark, or “Spark Operator” for short. Spark Operator allows Spark to run natively on k8s clusters, so Spark applications (be they for analytics, data engineering or machine learning) can deploy to these clusters as they would to any Spark instance.
According to Google, Spark Operator is a Kubernetes custom controller that uses custom resources for declarative specification of Spark applications; it also supports automatic restart and cron-based, scheduled applications. Further, developers, data engineers and data scientists can create declarative specifications that describe their Spark applications and use native Kubernetes tooling (e.g. kubectl) to manage their applications.
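To make the idea of a declarative specification concrete, here is a minimal sketch in Python that builds a Spark application description as a custom resource and prints it as JSON. The field names and the `sparkoperator.k8s.io/v1beta2` API version follow the open source project's published examples as best recalled, so treat them as assumptions and check them against the version of the operator you install. The container image, main class and jar path are hypothetical placeholders.

```python
import json


def spark_application(name, image, main_class, jar, executors=2):
    """Build a SparkApplication custom resource as a plain dict.

    Field names are assumptions based on the project's examples;
    verify them against the CRD version installed in your cluster.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",  # assumed API version
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": jar,
            # Automatic restart, as described above
            "restartPolicy": {"type": "OnFailure", "onFailureRetries": 3},
            "driver": {"cores": 1, "memory": "512m"},
            "executor": {"cores": 1, "instances": executors, "memory": "512m"},
        },
    }


def scheduled(app, schedule):
    """Wrap an application spec in a cron-style scheduled resource."""
    return {
        "apiVersion": app["apiVersion"],
        "kind": "ScheduledSparkApplication",
        "metadata": dict(app["metadata"]),
        "spec": {"schedule": schedule, "template": app["spec"]},
    }


# Hypothetical image, class and jar path, for illustration only
app = spark_application(
    name="spark-pi",
    image="example.com/spark:latest",
    main_class="org.apache.spark.examples.SparkPi",
    jar="local:///opt/spark/examples/jars/spark-examples.jar",
)
nightly = scheduled(app, "0 2 * * *")  # every day at 02:00

print(json.dumps(app, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and submitted with `kubectl apply -f`, after which the operator (not the developer) is responsible for launching the driver and executors.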
Get yours today
Spark Operator is available on the Google Cloud Platform (GCP) Marketplace for Kubernetes, in the form of Google Click to Deploy containers, for easy deployment to Google Kubernetes Engine (GKE). But Spark Operator is an open source project and can be deployed to any Kubernetes environment, and the project’s GitHub site provides Helm chart-based command line installation instructions.
It will be interesting to see if the likes of Amazon and Microsoft will endorse and offer simple deployment of the Spark Operator for their own Kubernetes services (Elastic Container Service for Kubernetes/EKS and Azure Kubernetes Service/AKS, respectively). Doing so would be a great service to their customers who do not wish to have the overhead of an EMR, HDInsight or Databricks workspace and cluster.
Et tu, Hadoop?
Since many non-Databricks Spark clusters have in fact run on Hadoop, the release of Spark Operator raises the question of whether Hadoop’s influence is waning. But the Hadoop team isn’t resting either. For example, the Open Hybrid Architecture Initiative is focused on the containerization of Hadoop. Furthermore, Hadoop 3.2 was released just last week and, among other features, includes native support for TensorFlow, new connectivity to Azure Data Lake Storage Gen2 and enhanced connectivity to Amazon S3 storage.
As usual, such robust competition benefits customers.