There’s a lot of surface area in the typical data science workflow for the purveyors of automation to attack. What moves the needle for the folks at the startup Kaskada is the feature engineering and deployment stage, which it’s seeking to streamline with a new automated feature store.
The typical data science workflow is fraught with inefficiency, according to Kaskada CEO and co-founder Davor Bonaci, a former senior engineer at Google who worked on Apache Beam.
For example, data scientists often will do much of the work in Jupyter, the popular data science notebook, where they will explore the data and identify the key data features that they will use as inputs for their predictive models. They will typically develop these features in Python within the Jupyter environment.
“That’s a really good way to do science,” Bonaci says. “You can visualize things. You can do things with a few lines of code.”
The problem with this approach, however, is that the output of Jupyter notebooks isn’t production-ready code. That leads many organizations to employ teams of engineers whose job is to rewrite the resulting Python into something more scalable and production-ready, such as Scala, that can be deployed within an Apache Spark framework.
This approach is proven and it works, but it’s slower and more expensive than it needs to be, according to Bonaci, who hopes to accelerate the workflow with the software he’s developing at Kaskada, which yesterday announced an $8 million Series A round of funding.
Kaskada accelerates the data science workflow in several ways. First, it provides a studio where data scientists can explore the data and define the data features they plan to use in their production machine learning models. It also creates a feature store that houses the pre-defined features until they’re called into use as feature vectors.
The feature store is a critical component of the Kaskada offering, as it simplifies the roll-out of feature vectors into production machine learning models. Instead of fumbling around with code, Kaskada allows developers to call feature vectors from the feature store via an API.
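To make the idea concrete, here is a minimal sketch of what calling a feature vector from a feature store can look like. It is not Kaskada’s API, which the article does not describe; the class `InMemoryFeatureStore`, its methods `put` and `get_vector`, and the feature names are all hypothetical, and the store is just a Python dictionary held in memory.

```python
from typing import Dict, List


class InMemoryFeatureStore:
    """Toy stand-in for a feature store (hypothetical, not Kaskada's API)."""

    def __init__(self) -> None:
        # entity ID -> {feature name -> value}
        self._features: Dict[str, Dict[str, float]] = {}

    def put(self, entity_id: str, values: Dict[str, float]) -> None:
        # Merge new values into whatever is already stored for this entity.
        self._features.setdefault(entity_id, {}).update(values)

    def get_vector(
        self, entity_id: str, names: List[str], default: float = 0.0
    ) -> List[float]:
        # Return values in the order the caller asks for, so the vector
        # lines up with the inputs the model expects. Missing features
        # fall back to the default.
        stored = self._features.get(entity_id, {})
        return [stored.get(name, default) for name in names]


# The data scientist populates the store once...
store = InMemoryFeatureStore()
store.put("user-42", {"purchases_30d": 3.0, "avg_basket_usd": 27.5})

# ...and the serving code only asks for a vector by entity ID.
vector = store.get_vector(
    "user-42", ["purchases_30d", "avg_basket_usd", "days_since_login"]
)
print(vector)
```

The point of the pattern is that the serving side never sees how a feature was computed; it only names the features it wants and receives them in a fixed order.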
Lastly, the company automatically compiles the layer of code needed to instantiate the vectors from the feature store. It does this using Scala, which it automatically deploys in containers that run in Kubernetes cloud environments.
This approach not only reduces the odds of something going wrong with a machine learning deployment, but also makes the process more reliable and repeatable.
“Instead of having the output of a data scientist being the notebook, we make the data scientist responsible for populating the feature store,” Bonaci says. “So the output is no longer the notebook. It’s actually the computed value, and we give data scientists the experience to work together, to collaborate to populate the feature store. And once they populate the feature store, data scientists can simply query it without rewriting any pipelines.”
This process allows data scientists to deploy features with the “click of the button,” which cuts weeks off the typical model deployment scenario, Bonaci says.
“The need to rewrite notebooks in Spark has gone away,” he says. “They just come and query a simple API from the feature store to get those values out to drive the model and get to prediction…That’s the unique innovation that we are bringing to market.”
This approach does not come without costs (not to mention the actual money that users must pay Kaskada to use its service). Instead of using familiar tools like Jupyter and frameworks like Pandas or scikit-learn, the data scientist, for the most part, must work within the confines of the Kaskada environment. And you’re not going to use the Kaskada environment for arbitrary data science work; Bonaci says the system, which uses Apache Cassandra and Akka under the covers, is geared primarily to building recommendation engines and real-time predictions for websites and mobile apps.
“Technically, we’re kind of a compiler between the studio and the feature store,” he says. “We are compiling code from whatever the data scientist defines [and] automatically generating a real-time distributed system. That’s where the rewriting goes away. We generate automatically a distributed system from what you define in our software.”
But data scientists get other benefits once they select Kaskada. For starters, once the data scientist has used her data science skill to select the features to use in the model, Kaskada will automatically keep the resulting feature vectors (the series of numbers that go into the ML inference model) up to date based on incoming data. Hooks to pub-sub systems like Apache Kafka and AWS Kinesis keep machine learning models fresh with the latest streaming data.
The typical Kaskada customer will have hundreds or thousands of features for each user or business object that it wants to create predictions or recommendations for, Bonaci says. Visualize “a matrix with as many rows as you have users, and it’s computed in real-time based on the stream coming in,” he says.
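As a rough illustration of that matrix, the sketch below folds a stream of events into per-user features and then lays them out with one row per user. It is an assumption-laden toy, not how Kaskada computes features: the events are a plain Python list standing in for messages from a system like Kafka or Kinesis, and the feature names and the helpers `update_features` and `as_matrix` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical feature names; a real deployment might have hundreds.
FEATURE_NAMES = ["event_count", "total_spend", "last_spend"]


def update_features(
    features: Dict[str, Dict[str, float]],
    events: Iterable[Tuple[str, float]],
) -> None:
    """Fold (user_id, amount) events into per-user running features."""
    for user_id, amount in events:
        row = features[user_id]
        row["event_count"] += 1
        row["total_spend"] += amount
        row["last_spend"] = amount


def as_matrix(
    features: Dict[str, Dict[str, float]]
) -> Tuple[List[str], List[List[float]]]:
    """Return sorted user IDs and a matrix with one row per user."""
    users = sorted(features)
    return users, [[features[u][n] for n in FEATURE_NAMES] for u in users]


# New users start with every feature at 0.0.
features = defaultdict(lambda: dict.fromkeys(FEATURE_NAMES, 0.0))

# A small batch of events standing in for the incoming stream.
update_features(features, [("alice", 10.0), ("bob", 4.0), ("alice", 2.5)])

users, matrix = as_matrix(features)
for user, row in zip(users, matrix):
    print(user, row)
```

Each new event touches only the row of the user it belongs to, which is what lets the matrix stay current without recomputing every feature from scratch.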
Another benefit is the reduced need for data scientists to be experts in deploying distributed systems. Because Kaskada handles the packaging, deployment, and management of the production features into the cloud ML environment (or possibly on-prem environments, if the customer is large enough, Bonaci says), that’s one less hard-to-find machine learning engineer that the company needs to hire.
Kaskada is ideal for organizations that just want to let their data scientists be data scientists and not engineers, Bonaci says. “It’s a fully managed service where data scientists come and apply their domain experience, and the things just work,” he says.
Defining the features in Kaskada makes the subsequent model development much easier, he says. It’s all about letting data scientists focus on what they’re best at, which is pushing the art of machine learning, Bonaci says.
“We’ve raised the level of abstraction and enabled these expert data scientists to be able to do it themselves, without having to depend on data engineers to rewrite things,” he says. “The data engineer will still have a role, but the data scientist [no longer] has to wait for the data engineer for something to happen to see the result in production. That is kind of the conceptual, next-generation system that we are trying to bring to market.”
Kaskada is based in Seattle, Washington, and its service is still in beta. For more info, see www.kaskada.com.