What is a Feature Store?
Machine learning is still a young field, and it has not yet developed the mature, industry-wide operational practices that software development has built up over the past 20 or more years. An ML practitioner who moves from one company to another will find large differences in how each organization brings AI projects to production, if it does so at all.
The feature store is an element of data infrastructure that has emerged in the ML community over the past year as a centerpiece of ML pipelines. Adopting a feature store can be a force multiplier for companies trying to transform with data science.
A feature store is about more than storing features. It is not simply a repository: it is a system that runs scalable, high-performance data pipelines to transform raw data into features. With this system, ML teams can define a feature once and deploy it to production without rewriting it.
And yes, a feature store also:
- Catalogs and stores features for everyone on the team to discover and share, reducing duplicative work.
- Serves the same features for both training and inference, saving time and keeping the features consistent between the two.
- Analyzes and monitors features for drift.
- Maintains a register of features with all their metadata and statistics, so that the whole team can work from a single source of truth.
- Manages data for security and compliance purposes.
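The cataloging, metadata, and drift-monitoring roles listed above can be illustrated with a small sketch. Everything here is hypothetical and in-memory: the `FeatureRegistry` class, its method names, and the "three standard deviations" drift rule are assumptions chosen for illustration, not the API of any particular feature store.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class FeatureRecord:
    """Metadata and training-time statistics for one registered feature."""
    name: str
    owner: str
    description: str
    train_mean: float
    train_std: float


class FeatureRegistry:
    """A toy in-memory catalog; a real feature store would persist this."""

    def __init__(self):
        self._records = {}

    def register(self, name, owner, description, training_values):
        # Store summary statistics next to the metadata.
        record = FeatureRecord(
            name=name,
            owner=owner,
            description=description,
            train_mean=mean(training_values),
            train_std=stdev(training_values),
        )
        self._records[name] = record
        return record

    def search(self, keyword):
        # Discovery: find features whose name or description mentions the keyword.
        keyword = keyword.lower()
        return [
            r.name for r in self._records.values()
            if keyword in r.name.lower() or keyword in r.description.lower()
        ]

    def has_drifted(self, name, live_values, threshold=3.0):
        # Assumed rule: flag drift when the live mean is more than `threshold`
        # training standard deviations away from the training mean.
        record = self._records[name]
        shift = abs(mean(live_values) - record.train_mean)
        return shift > threshold * record.train_std


registry = FeatureRegistry()
registry.register(
    "avg_basket_value", "growth-team", "Average basket value per user",
    training_values=[20.0, 22.0, 19.0, 21.0, 18.0],
)
print(registry.search("basket"))
print(registry.has_drifted("avg_basket_value", [20.5, 19.5, 21.0]))
print(registry.has_drifted("avg_basket_value", [40.0, 42.0, 39.0]))
```

Production systems typically use richer drift statistics than a mean shift, but the idea is the same: the statistics recorded at training time become the baseline that live data is compared against.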
What are Features?
A feature is an input variable to a machine learning model. In other words, it’s a piece of data that will be consumed by a machine learning model. There are two types of ML features: online and offline.
Offline features are static features that don’t change often. This can be data like user language, location, or education level. These features are processed in batch. Typically, offline features are calculated via frameworks such as Spark, or by simply running SQL queries on a database and then using a batch inference process.
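As a rough illustration of the "run a SQL query in batch" approach, the sketch below computes per-user aggregates with Python's built-in `sqlite3` module. The `purchases` table, its columns, and the feature names are invented for the example; a real batch job would read from a warehouse or a Spark cluster instead of an in-memory database.

```python
import sqlite3

# Hypothetical raw data: one row per purchase.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("u1", 10.0), ("u1", 30.0), ("u2", 5.0)],
)

# Batch job: one aggregate row per user.
rows = conn.execute(
    """
    SELECT user_id, COUNT(*) AS purchase_count, AVG(amount) AS avg_amount
    FROM purchases
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

# Key the results by user so they can be joined to training examples later.
offline_features = {
    user_id: {"purchase_count": count, "avg_amount": avg}
    for user_id, count, avg in rows
}
print(offline_features)
conn.close()
```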
Online features, also called real-time features, are dynamic and require a processing engine to calculate, sometimes in real time. The number of ad impressions is a good example of a feature that changes very rapidly and needs to be calculated in real time. Online features often need to be served with ultra-low latency as well. For this reason, these calculations are much more challenging and require both fast computation and fast data access, so the data is stored in memory or in a very fast key-value database. The processing itself can run on various cloud services or on a dedicated MLOps platform.
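To make the ad-impression example concrete, here is a minimal sketch of a sliding-window counter held in memory. The `ImpressionCounter` class and the 60-second window are assumptions for illustration, and the sketch assumes events arrive in timestamp order; a production system would normally use a stream processor and a dedicated key-value store.

```python
from collections import defaultdict, deque


class ImpressionCounter:
    """Counts ad impressions per user over a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # In-memory key-value layout: user_id -> timestamps of recent events.
        self._events = defaultdict(deque)

    def record(self, user_id, timestamp):
        # Assumes timestamps for a given user arrive in increasing order.
        self._events[user_id].append(timestamp)

    def count(self, user_id, now):
        events = self._events[user_id]
        # Evict events that have fallen out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events)


counter = ImpressionCounter(window_seconds=60)
for ts in (0, 10, 50, 65):
    counter.record("u1", ts)

print(counter.count("u1", now=65))   # impressions in the last 60 seconds
print(counter.count("u1", now=200))  # the window has emptied by now
```

Because each lookup touches only one user's recent events, both the update and the read stay cheap, which is the property that low-latency serving depends on.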
Why You Might Need a Feature Store
Data scientists are strongest at addressing business problems by understanding data and building complex algorithms. They are not data engineers, and they don’t need to be. In a typical workflow, data scientists search for and create features as part of their job, and the features they create are usually meant for training models in a development-only environment. Once the model is ready to be deployed, data engineers must therefore take over and rewrite the feature code to make it production-ready. This handoff is part of the MLOps process (machine learning operationalization). The siloed process lengthens development cycles and introduces the risk of training-serving skew: because the code changes, features computed in production can differ from those used in training, which can make the model less accurate in production.
Real-time pipelines also require an extremely fast event processing mechanism while running complex algorithms to calculate features in real time. For many use cases in industries like Finance or AdTech, the application requires a response time in the range of milliseconds.
Meeting that requirement demands a suitable data architecture and a set of tools that supports real-time event processing with low-latency responses. The batch-oriented tools that ML teams use for training (e.g. Spark) are usually not a good fit for this kind of real-time processing.
The key benefit of the feature store architecture is a robust, fast data transformation service that powers machine learning workloads and addresses the challenges of data management, especially for real-time data. A feature store solves the complex problem of real-time feature engineering and maintains a single piece of logic for generating features for both training and serving. ML teams can build a feature once and then use it for both offline training and online serving, which ensures that the feature is calculated the same way in both layers. This consistency is especially critical in low-latency, real-time use cases.
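The "build it once, use it for both" idea can be sketched in a few lines. In this hypothetical example, a single function (`spend_features`) holds the feature logic, and both the offline path and the online path call it; the function, class, and feature names are invented for illustration and do not correspond to a specific product.

```python
def spend_features(amounts):
    """Single feature definition shared by training and serving."""
    total = sum(amounts)
    count = len(amounts)
    return {
        "purchase_count": count,
        "total_spend": total,
        "avg_spend": total / count if count else 0.0,
    }


def build_training_rows(history_by_user):
    # Offline path: apply the definition to every user's full history.
    return {
        user: spend_features(amounts)
        for user, amounts in history_by_user.items()
    }


class OnlineFeatureServer:
    """Online path: keep amounts in memory and apply the same definition."""

    def __init__(self):
        self._amounts = {}

    def ingest(self, user, amount):
        self._amounts.setdefault(user, []).append(amount)

    def get_features(self, user):
        return spend_features(self._amounts.get(user, []))


history = {"u1": [10.0, 30.0], "u2": [5.0]}
training_rows = build_training_rows(history)

server = OnlineFeatureServer()
for user, amounts in history.items():
    for amount in amounts:
        server.ingest(user, amount)

# Same inputs and same logic, so the two paths agree.
assert server.get_features("u1") == training_rows["u1"]
print(training_rows["u1"])
```

Because neither path has its own copy of the calculation, there is no second implementation that can drift away from the first, which is how this design avoids training-serving skew.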
Integrated or Stand-alone?
The feature store market is very active, with many new entrants over the past year and undoubtedly more to come. One of the most important characteristics of a feature store is how seamlessly it integrates with the other components of the ML workflow. An integrated feature store makes life simpler for everyone on the ML team: monitoring, pipeline automation, and multiple deployment options are already available, without the need for extensive glue logic and maintenance.