2. Feature Store Architecture
Minh Pham / April 04, 2024
Feature Store Architecture: a deep dive into advanced concepts and best practices for building a feature store.
1. Data Infrastructure Layer
This is the backbone of the feature store (FS), covering data ingestion, processing, and storage. Key components include the batch and stream processing engines, as well as the offline and online stores.
1.1. Batch Processing Engine
The batch processing engine serves as the computational hub where raw data is transformed into features. It handles large datasets that don't require real-time processing and loads the results into the offline feature store.
- Considerations:
  - Data Consistency: Ensure that generated features are consistent across different runs
  - Versioning: Keep track of different versions of features (if a feature is updated, the change should be captured)
  - Concurrency: Ensure multiple batch jobs can run simultaneously without conflicts
- Option:
  - Apache Spark
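To make the consistency and versioning considerations a bit more concrete, here is a minimal sketch in plain Python, used as a stand-in for a Spark job. The function name `compute_batch_features`, the event fields, and the `FEATURE_VERSION` tag are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical version tag: bump it whenever the transformation logic changes.
FEATURE_VERSION = "v1"


def compute_batch_features(events):
    """Aggregate raw purchase events into per-user features.

    `events` is a list of dicts with hypothetical keys `user_id` and `amount`.
    The output is sorted by user so repeated runs over the same input
    produce identical rows.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        totals[event["user_id"]] += event["amount"]
        counts[event["user_id"]] += 1

    return [
        {
            "user_id": user_id,
            "purchase_count": counts[user_id],
            "avg_purchase_amount": totals[user_id] / counts[user_id],
            "feature_version": FEATURE_VERSION,
        }
        for user_id in sorted(totals)
    ]


raw_events = [
    {"user_id": "u2", "amount": 30.0},
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u1", "amount": 20.0},
]
features = compute_batch_features(raw_events)
```

Because the function is pure and its output order is fixed, rerunning it on the same input yields the same rows, and the version tag travels with every row so downstream consumers can tell which logic produced it.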
1.2. Stream Processing Engine
It is designed to handle real-time data processing needs. It processes data as it arrives, making it ideal for applications that require real-time analytics and monitoring.
- Considerations:
  - Latency: A critical factor. The system should be capable of processing data with minimal delay
  - Scalability: The system should be able to scale up or down quickly
  - Data Integrity: When incorrect data gets streamed, the system should be able to correct the errors either in real time or through subsequent batch recalculations
- Option:
  - Apache Spark Structured Streaming
  - Apache Flink
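The data integrity point can be sketched with a feature that is updated one event at a time and supports undoing a bad event. This is plain Python rather than Spark or Flink code, and the class `RunningMeanFeature` with its `apply` and `retract` methods is a hypothetical design used only to illustrate the idea.

```python
from collections import defaultdict


class RunningMeanFeature:
    """Keeps a per-key running mean that is updated one event at a time."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def apply(self, key, value):
        """Fold a newly arrived event into the feature and return the new mean."""
        self._sum[key] += value
        self._count[key] += 1
        return self.value(key)

    def retract(self, key, value):
        """Undo a previously applied event that turned out to be incorrect."""
        if self._count[key] == 0:
            raise ValueError(f"nothing to retract for key {key!r}")
        self._sum[key] -= value
        self._count[key] -= 1
        return self.value(key)

    def value(self, key):
        """Current mean for the key, or None if no events remain."""
        if self._count[key] == 0:
            return None
        return self._sum[key] / self._count[key]


feature = RunningMeanFeature()
feature.apply("u1", 10.0)
feature.apply("u1", 500.0)    # later found to be a bad reading
feature.retract("u1", 500.0)  # real-time correction
corrected = feature.apply("u1", 20.0)
```

Retraction only works when the bad event can be identified. Otherwise the usual fallback is the batch recalculation mentioned above, which overwrites the streamed value.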
1.3. Offline Store
The offline store acts as a “data warehouse” that holds feature data after it has been processed. It is designed to handle large volumes of data and is optimized for batch analytics.
- Considerations:
  - Data Retention: Decide how long data should be stored, weighing storage cost against data utility
  - Accessibility: Keep the data secure while making it accessible for batch analytics
  - Data Schema: Maintain a consistent schema so the data stays usable
- Option:
  - S3 with Delta Lake or Apache Iceberg tables: these table formats offer ACID transactions, scalable metadata handling, and support for both streaming and batch processing
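The schema and retention considerations can be illustrated with a small, storage-agnostic sketch. It does not use the Delta Lake or Iceberg APIs. The schema `FEATURE_SCHEMA`, the helper names, and the 365-day retention period are assumptions made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema: column name -> expected Python type.
FEATURE_SCHEMA = {"user_id": str, "purchase_count": int, "event_time": datetime}


def validate_row(row, schema=FEATURE_SCHEMA):
    """Raise ValueError if the row's columns or types deviate from the schema."""
    if set(row) != set(schema):
        raise ValueError(f"columns {sorted(row)} do not match schema {sorted(schema)}")
    for column, expected_type in schema.items():
        if not isinstance(row[column], expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")


def apply_retention(rows, now, retention):
    """Keep only rows whose event_time falls within the retention period."""
    cutoff = now - retention
    return [row for row in rows if row["event_time"] >= cutoff]


now = datetime(2024, 4, 4, tzinfo=timezone.utc)
rows = [
    {"user_id": "u1", "purchase_count": 2, "event_time": now - timedelta(days=10)},
    {"user_id": "u2", "purchase_count": 5, "event_time": now - timedelta(days=400)},
]
for row in rows:
    validate_row(row)  # reject malformed rows before they reach the store
kept = apply_retention(rows, now, retention=timedelta(days=365))
```

In a real deployment the table format would typically enforce the schema on write, and retention would run as a scheduled cleanup job rather than as an in-memory filter.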
1.4. Online Store
The online store is designed for low-latency access to feature data. It's optimized for quick reads.
- Considerations:
  - Latency: Data should be retrievable in milliseconds
  - High Availability: The store should be highly available to meet the demands of real-time applications
  - Scalability: As the number of features or the request rate grows, the system should scale accordingly
- Option:
  - Redis
  - Cassandra
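The access pattern of an online store is essentially a key-value lookup of the latest feature values per entity. The sketch below uses a plain dictionary as a stand-in for Redis or Cassandra. The class `InMemoryOnlineStore` and the `entity_type:entity_id` key layout are assumptions for illustration, not a prescribed design.

```python
class InMemoryOnlineStore:
    """Dictionary-backed stand-in for a key-value store such as Redis."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _key(entity_type, entity_id):
        # Hypothetical key layout: one entry per entity holds all its features.
        return f"{entity_type}:{entity_id}"

    def write(self, entity_type, entity_id, features):
        """Upsert the latest feature values for one entity."""
        self._data.setdefault(self._key(entity_type, entity_id), {}).update(features)

    def read(self, entity_type, entity_id, feature_names):
        """Return the requested features; missing ones come back as None."""
        stored = self._data.get(self._key(entity_type, entity_id), {})
        return {name: stored.get(name) for name in feature_names}


store = InMemoryOnlineStore()
store.write("user", "u1", {"purchase_count": 2, "avg_purchase_amount": 15.0})
store.write("user", "u1", {"purchase_count": 3})  # newer value overwrites the old one
result = store.read("user", "u1", ["purchase_count", "avg_purchase_amount", "unknown"])
```

Grouping all features of an entity under one key keeps a read to a single lookup, which is one common way to stay within a millisecond-level latency budget.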
2. Serving Layer
This is the interface where external applications and services request and receive feature data. It's optimized for high availability and low latency, ensuring that features can be served quickly and reliably.
- Considerations:
  - API Design: The APIs should be designed for ease of use, with documentation and versioning
  - Load Balancing: Distribute requests across multiple servers to ensure high availability and low latency
  - Security: Use authentication and authorization mechanisms to control access to the feature store
- Option:
  - Kubernetes
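As a rough sketch of the API design and security considerations, the handler below checks the API version, then authentication, then authorization, before reading features. It is framework-free Python. The names `handle_request`, `API_TOKENS`, and `ONLINE_FEATURES` and the token layout are hypothetical, and a real service would sit behind a web framework and a load balancer.

```python
# Hypothetical in-memory stand-ins for the online store and the token registry.
ONLINE_FEATURES = {"user:u1": {"purchase_count": 3, "avg_purchase_amount": 15.0}}
API_TOKENS = {"secret-token": {"allowed_entities": {"user"}}}
SUPPORTED_API_VERSIONS = {"v1"}


def handle_request(api_version, token, entity_type, entity_id, feature_names):
    """Return an (HTTP-style status code, body) pair for a feature lookup."""
    if api_version not in SUPPORTED_API_VERSIONS:
        return 404, {"error": f"unsupported API version {api_version!r}"}
    client = API_TOKENS.get(token)
    if client is None:
        return 401, {"error": "unknown token"}  # authentication failed
    if entity_type not in client["allowed_entities"]:
        return 403, {"error": "access denied"}  # authorization failed
    stored = ONLINE_FEATURES.get(f"{entity_type}:{entity_id}", {})
    return 200, {"features": {name: stored.get(name) for name in feature_names}}


ok = handle_request("v1", "secret-token", "user", "u1", ["purchase_count"])
unauthenticated = handle_request("v1", "wrong-token", "user", "u1", ["purchase_count"])
forbidden = handle_request("v1", "secret-token", "merchant", "m1", ["risk_score"])
```

Putting the version in the request path or header lets the response shape evolve without breaking existing clients.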
3. Application Layer
This layer serves as the orchestrator for the whole feature store. It manages data pipelines, keeps track of feature data and metadata, and monitors pipeline health.
3.1. Job Orchestrator
The job orchestrator coordinates the data pipelines, especially the batch and stream transformation jobs. It ensures that tasks are executed in the correct order and that their dependencies are respected.
- Considerations:
  - Workflow Design: Define clear Directed Acyclic Graphs (DAGs) or workflows that outline the sequence and dependencies of tasks.
- Option:
  - Airflow
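The DAG idea can be shown without Airflow by using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names below are hypothetical, and the sketch only derives a valid execution order. It does not schedule or run anything.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_raw_events": set(),
    "compute_batch_features": {"ingest_raw_events"},
    "write_offline_store": {"compute_batch_features"},
    "sync_online_store": {"write_offline_store"},
    "update_registry": {"write_offline_store"},
}

# static_order() yields every task after all of its dependencies.
execution_order = list(TopologicalSorter(pipeline).static_order())
```

If the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the workflow has to be acyclic. Tasks with no dependency between them, such as `sync_online_store` and `update_registry` here, may appear in either order and could run in parallel.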
3.2. Feature Registry
The feature registry serves as the “library catalog” for the feature store. It maintains feature metadata, supports CRUD operations on that metadata, and offers feature lineage tracking.
- Considerations:
  - Metadata Schema: Define a schema for metadata, including feature names, types, and lineage information.
  - Searchability: Ensure that features can be easily searched and retrieved based on their metadata.
  - Versioning: Implement versioning for features to track changes over time.
- Option:
  - PostgreSQL with Feast
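A minimal in-memory sketch of these considerations is shown below. It is not Feast's API and has no database behind it. The `FeatureMetadata` fields and the `FeatureRegistry` methods are hypothetical and only illustrate a metadata schema, versioning, search, and lineage in one place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMetadata:
    name: str
    dtype: str
    version: int
    description: str
    upstream_sources: tuple = ()  # lineage: where the feature is derived from


class FeatureRegistry:
    """In-memory registry that keeps every version of every feature."""

    def __init__(self):
        self._versions = {}  # feature name -> list of FeatureMetadata

    def register(self, name, dtype, description, upstream_sources=()):
        """Create the feature, or add a new version if it already exists."""
        history = self._versions.setdefault(name, [])
        meta = FeatureMetadata(name, dtype, len(history) + 1, description,
                               tuple(upstream_sources))
        history.append(meta)
        return meta

    def get(self, name, version=None):
        """Return a specific version, or the latest one when version is None."""
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]

    def search(self, keyword):
        """Find latest versions whose name or description contains the keyword."""
        keyword = keyword.lower()
        return [h[-1] for h in self._versions.values()
                if keyword in h[-1].name.lower()
                or keyword in h[-1].description.lower()]

    def delete(self, name):
        """Remove a feature and its whole version history."""
        del self._versions[name]


registry = FeatureRegistry()
registry.register("purchase_count", "int", "Number of purchases per user",
                  ["raw.purchases"])
registry.register("purchase_count", "int", "Purchases per user, refunds excluded",
                  ["raw.purchases", "raw.refunds"])
latest = registry.get("purchase_count")
```

Keeping old versions instead of overwriting them means a model trained on version 1 can still look up exactly what that feature meant and where it came from.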
3.3. The Control Plane
The control plane oversees all operations in the feature store and ensures they run smoothly. It serves as the UI for data monitoring, access control, and other management features.
- Option:
  - Kubernetes, as used in the Serving Layer