Featurebyte
Python Library for FeatureOps
FeatureByte SDK is the core engine of FeatureByte's Self-Service Feature Platform. It is a free, source-available feature platform designed to:
- Create state-of-the-art features, not data pipelines: Create features for Machine Learning with just a few lines of code. Leave the plumbing and pipelining to FeatureByte. We take care of orchestrating the data operations, whether it’s time-window aggregations or backfilling, so you can deliver more value from data.
- Improve Accuracy through data: Use the intuitive feature declaration framework to transform creative ideas into training data in minutes. Ditch the limitations of ad-hoc pipelines for features with much more scale, complexity and freshness.
- Streamline machine learning data pipelines: Get more value from AI, faster. Deploy and serve features in minutes instead of weeks or months. Declare features in Python and automatically generate optimized data pipelines, all using tools you love, such as Jupyter Notebooks.
Take charge of the entire ML feature lifecycle
Feature engineering and management don’t have to be complicated. With FeatureByte, you can create, experiment with, serve, and manage your features in one tool.
Create
- Create and share state-of-the-art ML features effortlessly
- Search and reuse features to create feature lists tailored to your use case
# Get view from catalog
invoice_view = catalog.get_view("GROCERYINVOICE")
# Declare features of total spent by customer in the past 7 and 28 days
customer_purchases = invoice_view.groupby("GroceryCustomerGuid").aggregate_over(
    "Amount",
    method="sum",
    feature_names=["CustomerTotalSpent_7d", "CustomerTotalSpent_28d"],
    fill_value=0,
    windows=["7d", "28d"],
)
customer_purchases.save()
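To make the declaration above concrete, here is a minimal plain-Python sketch of what a windowed sum means for one customer at one point in time. It does not use FeatureByte; the sample rows, the `total_spent` helper, and the choice of a window that excludes its start and includes its end are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical invoice rows: (customer id, invoice timestamp, amount).
invoices = [
    ("c1", datetime(2022, 1, 1), 10.0),
    ("c1", datetime(2022, 1, 20), 5.0),
    ("c1", datetime(2022, 1, 27), 2.5),
    ("c2", datetime(2022, 1, 26), 7.0),
]


def total_spent(rows, customer, as_of, window_days):
    """Sum amounts for one customer in the window (as_of - window, as_of].

    The boundary convention is an assumption of this sketch.
    """
    start = as_of - timedelta(days=window_days)
    return sum(
        amount
        for cust, ts, amount in rows
        if cust == customer and start < ts <= as_of
    )


as_of = datetime(2022, 1, 28)
features = {
    "CustomerTotalSpent_7d": total_spent(invoices, "c1", as_of, 7),
    "CustomerTotalSpent_28d": total_spent(invoices, "c1", as_of, 28),
}
print(features)  # {'CustomerTotalSpent_7d': 2.5, 'CustomerTotalSpent_28d': 17.5}
```

A customer with no invoices in the window gets a sum of 0, which mirrors the role of `fill_value=0` in the declaration.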
Experiment
- Immediately access historical features through automated backfilling - let FeatureByte handle the complexity of time-aware SQL
- Experiment on live data at scale, innovating faster
- Iterate rapidly with different feature lists to create more accurate models
# Get feature list from the catalog
feature_list = catalog.get_feature_list(
    "200 Features on Active Customers"
)
# Get an observation set from the catalog
observation_set = catalog.get_observation_table(
    "5M rows of active Customers in 2021-2022"
)
# Compute training data and
# store it in the feature store for reuse and audit
training = feature_list.compute_historical_feature_table(
    observation_set,
    name="Training set to predict purchases next 2w",
)
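The idea behind a historical feature table can be sketched without FeatureByte: each observation row carries its own point in time, and the feature value for that row may only use events at or before that time. The data, the `build_training_rows` helper, and the `total_28d` column name below are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timedelta

# Hypothetical event data: (customer id, event timestamp, amount).
events = [
    ("c1", datetime(2021, 3, 1), 20.0),
    ("c1", datetime(2021, 3, 10), 30.0),
    ("c1", datetime(2021, 4, 2), 40.0),
]

# Hypothetical observation set: one row per (customer, point in time).
observations = [
    ("c1", datetime(2021, 3, 5)),
    ("c1", datetime(2021, 3, 15)),
    ("c1", datetime(2021, 4, 5)),
]


def build_training_rows(observations, events, window_days=28):
    """Compute a 28-day total per observation, using only past events."""
    rows = []
    for customer, point_in_time in observations:
        start = point_in_time - timedelta(days=window_days)
        total = sum(
            amount
            for cust, ts, amount in events
            if cust == customer and start < ts <= point_in_time
        )
        rows.append(
            {"customer": customer, "point_in_time": point_in_time, "total_28d": total}
        )
    return rows


training_rows = build_training_rows(observations, events)
print([row["total_28d"] for row in training_rows])  # [20.0, 50.0, 70.0]
```

Because every row is computed as of its own timestamp, later events never leak into earlier observations, which is the property that keeps training data consistent with what would have been known at prediction time.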
Serve
- Deploy AI data pipelines and serve features in minutes
- Access features with low latency
- Reduce costs and security risk by performing computations in your existing data platform
- Ensure data consistency between model training and inferencing
# Get feature list from the catalog
feature_list = catalog.get_feature_list(
    "200 Features on Active Customers"
)
# Create deployment
deployment = feature_list.deploy(
    name="Features for customer purchases next 2w",
)
# Activate deployment
deployment.enable()
# Get shell script template for online serving
deployment.get_online_serving_code(language="sh")
Manage
- Organize feature engineering assets with domain-specific catalogs
- Centralize cleaning operations and feature job configurations
- Differentiate features that are prototype versus production ready
- Create new versions of your features to handle changes in data
- Keep full lineage of your training data and features in production
- Monitor the health of feature pipelines centrally
import featurebyte as fb

# Get table from catalog
items_table = catalog.get_table("GROCERYITEMS")
# Discount must not be negative: impute missing values and
# values below 0 with 0
items_table.Discount.update_critical_data_info(
    cleaning_operations=[
        fb.MissingValueImputation(
            imputed_value=0
        ),
        fb.ValueBeyondEndpointImputation(
            type="less_than",
            end_point=0,
            imputed_value=0
        ),
    ]
)
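The effect of the two cleaning operations above can be sketched in plain Python. This is only an illustration of the intended semantics, not FeatureByte's implementation; the `clean_discount` helper and the assumption that the missing-value rule is applied before the endpoint rule are hypothetical.

```python
def clean_discount(value, imputed_missing=0, end_point=0, imputed_beyond=0):
    """Apply the two rules in order (the order is an assumption of this sketch)."""
    # Rule 1: replace missing values.
    if value is None:
        return imputed_missing
    # Rule 2: replace values below the end point.
    if value < end_point:
        return imputed_beyond
    return value


raw_discounts = [None, -1.5, 0, 2.0]
cleaned = [clean_discount(v) for v in raw_discounts]
print(cleaned)  # [0, 0, 0, 2.0]
```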
Get an overview of the typical workflow in FeatureByte.
Get started with Quick-Start and Deep-Dive Tutorials
Discover FeatureByte via its tutorials. All you need is to install the FeatureByte SDK.
Install FeatureByte SDK with pip:
pip install featurebyte
Note: To avoid potential conflicts with other packages we strongly recommend using a virtual environment or a conda environment.
Sign up for access to the Hosted Tutorial server here and register your credentials with FeatureByte SDK:
import featurebyte as fb
# replace <api_token> with the API token you received after registering
fb.register_tutorial_api_token("<api_token>")
This will create a "tutorial" profile that uses the hosted tutorial server. You can now download and run notebooks from the tutorials section.
Leverage your data warehouse
FeatureByte integrates with your Snowflake, Databricks, or Spark data warehouse, improving security and efficiency by avoiding large-scale outbound data transfers. Feature calculations are performed within the data warehouse, taking advantage of its scalability, stability, and efficiency.
<div align="center"> <img src="https://github.com/featurebyte/featurebyte/blob/main/assets/images/Data%20Warehouse.png" width="600" alt="Warehouse Diagram"> </div>
FeatureByte utilizes your data warehouse as a:
- data source.
- compute engine to leverage its scalability, stability, and efficiency.
- storage of partial aggregates (tiles) and precomputed feature values to support feature serving.
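The role of partial aggregates can be illustrated with a generic sketch: events are pre-aggregated into small buckets once, and a window feature is then assembled by combining the buckets that fall inside the window. The text above describes tiles only as stored partial aggregates, so the daily bucket size, the data, and both helper functions below are assumptions, not FeatureByte's actual tile design.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical raw events: (customer id, event date, amount).
events = [
    ("c1", date(2022, 1, 1), 10.0),
    ("c1", date(2022, 1, 1), 4.0),
    ("c1", date(2022, 1, 20), 5.0),
    ("c1", date(2022, 1, 27), 2.5),
]


def build_daily_tiles(events):
    """Pre-aggregate events into one partial sum per (customer, day)."""
    tiles = defaultdict(float)
    for customer, day, amount in events:
        tiles[(customer, day)] += amount
    return dict(tiles)


def window_sum_from_tiles(tiles, customer, as_of, window_days):
    """Combine the daily tiles covering the last `window_days` days, as_of included."""
    return sum(
        tiles.get((customer, as_of - timedelta(days=offset)), 0.0)
        for offset in range(window_days)
    )


tiles = build_daily_tiles(events)
as_of = date(2022, 1, 28)
print(window_sum_from_tiles(tiles, "c1", as_of, 7))   # 2.5
print(window_sum_from_tiles(tiles, "c1", as_of, 28))  # 21.5
```

The same tiles serve both the 7-day and the 28-day window, so the raw events are scanned once rather than once per feature.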
Architecture
The FeatureByte platform comprises the following components:
- FeatureByte SDK (Python Package): Connects to the API service to provide feature authoring and management functionality through Python classes and functions.
- FeatureByte Service (Docker Containers):
  - API Service: REST API service that validates and executes requests, queries data warehouses, and stores data.
  - Worker: Executes asynchronous or scheduled tasks.
  - MongoDB: Stores metadata for created assets.
  - Redis: Broker and queue for workers, and messenger service for publishing progress updates.
- Query Graph Transpiler (Python Package): Constructs data transformation steps as a query graph, which can be transpiled to platform-specific SQL.
- Source Tables (Data Warehouse): Tables used as data sources for feature engineering.
- Feature Store (Data Warehouse): Database that stores data used to support feature serving.
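The general idea of a query graph that is transpiled to SQL can be sketched as follows. This is a toy illustration and not FeatureByte's Query Graph Transpiler: the `Node` class, the three operation names, and the `to_sql` function are hypothetical, and the output is generic SQL rather than a platform-specific dialect.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One transformation step in a hypothetical query graph."""
    op: str  # "source", "filter", or "project"
    params: dict
    inputs: list = field(default_factory=list)


def to_sql(node):
    """Recursively turn a node and its upstream nodes into a SQL string."""
    if node.op == "source":
        return f"SELECT * FROM {node.params['table']}"
    upstream = to_sql(node.inputs[0])
    if node.op == "filter":
        return f"SELECT * FROM ({upstream}) AS t WHERE {node.params['condition']}"
    if node.op == "project":
        columns = ", ".join(node.params["columns"])
        return f"SELECT {columns} FROM ({upstream}) AS t"
    raise ValueError(f"unsupported op: {node.op}")


source = Node("source", {"table": "GROCERYINVOICE"})
filtered = Node("filter", {"condition": "Amount > 0"}, [source])
projected = Node("project", {"columns": ["GroceryCustomerGuid", "Amount"]}, [filtered])

sql = to_sql(projected)
print(sql)
```

Keeping the transformation steps as a graph, separate from any SQL text, is what allows the same declaration to be rendered for different warehouses.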
FeatureByte Service Deployment Options
The FeatureByte Service can be installed in three different modes:
- Local installation: The easiest way to get started with the FeatureByte SDK. It is a single-user installation that can be used to prototype features locally with your data warehouse.
- Hosted on a single server: A light-weight option to support collaboration and job scheduling with limited scalability and availability. Multiple users can connect to the service using the FeatureByte SDK, and deploy features for production.
- High availability installation (coming soon): The recommended way to run the service in production. Scale to a large number of users and deployed features, and provide highly available services.
The FeatureByte Service runs on Docker for the first two installation modes, and is deployed on a Kubernetes Cluster for the high availability installation mode.
Refer to the installation section of the documentation for more details.
FeatureByte SDK
The FeatureByte Python SDK offers a comprehensive set of objects for feature engineering, simplifying the management and manipulation of tables, entities, views, features, feature lists and other necessary objects for feature serving.
- Catalog objects help you organize your feature engineering assets per domain and maintain clarity and easy access to these assets.
- [Entity](https://docs.featurebyte.com/la
