In [None]:
dbutils.widgets.text("RESOURCE_PREFIX", "")
dbutils.widgets.text("REDIS_KEY", "")

# Feathr Feature Store on Databricks Demo Notebook

This notebook illustrates how to use the Feathr Feature Store to create a model that predicts NYC taxi fares. The dataset comes from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

This notebook is written specifically for Databricks and relies on Databricks packages such as `dbutils`. The intention is to provide a "one-click run" example with minimal configuration. For example:
- This notebook skips feature registry which requires running Azure Purview. 
- To make the online feature query work, you will need to configure the Redis endpoint. 

The full-fledged notebook can be found [here](https://github.com/feathr-ai/feathr/blob/main/docs/samples/nyc_taxi_demo.ipynb).

## Prerequisite

To use Feathr materialization for online scoring with a Redis cache, deploy a Redis cluster and set `RESOURCE_PREFIX` and `REDIS_KEY` via the Databricks widgets. Note that the deployed Redis host address should be `{RESOURCE_PREFIX}redis.redis.cache.windows.net`. More details about how to deploy the Redis cluster can be found [here](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-cli.html#configurure-redis-cluster).
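
As a quick check of the naming convention above, the following sketch builds the expected Redis host name from a resource prefix. The helper name `redis_host_from_prefix` and the prefix `myfeathr` are hypothetical and used only for illustration.

```python
def redis_host_from_prefix(resource_prefix: str) -> str:
    """Return the Redis host name this notebook expects for a given prefix."""
    # Hypothetical helper that mirrors the pattern
    # `{RESOURCE_PREFIX}redis.redis.cache.windows.net` described above.
    return f"{resource_prefix}redis.redis.cache.windows.net"


# "myfeathr" is a made-up prefix, not a real deployment.
example_host = redis_host_from_prefix("myfeathr")
print(example_host)
```

If the printed host does not match the host name of your deployed Redis instance, the online feature query at the end of this notebook will likely fail to connect.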

To run this notebook, you'll need to install the `feathr` pip package. Here, we install it as a notebook-scoped library. For details, please see the [Azure Databricks dependency management document](https://learn.microsoft.com/en-us/azure/databricks/libraries/).

In [None]:
# Install feathr from the latest code in the repo. You may use `pip install "feathr[notebook]"` as well.
%pip install "git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project&egg=feathr[notebook]"

## Notebook Steps

This tutorial demonstrates the key capabilities of Feathr, including:

1. Install Feathr and necessary dependencies.
1. Create shareable features with Feathr feature definition configs.
1. Create training data using point-in-time correct feature join.
1. Train and evaluate a prediction model.
1. Materialize feature values for online scoring.

The overall data flow is as follows:

<img src="https://raw.githubusercontent.com/feathr-ai/feathr/main/docs/images/feature_flow.png" width="800">

In [None]:
from datetime import timedelta
import os
from pathlib import Path

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

import feathr
from feathr import (
    FeathrClient,
    # Feature data types
    BOOLEAN,
    FLOAT,
    INT32,
    ValueType,
    # Feature data sources
    INPUT_CONTEXT,
    HdfsSource,
    # Feature aggregations
    TypedKey,
    WindowAggTransformation,
    # Feature types and anchor
    DerivedFeature,
    Feature,
    FeatureAnchor,
    # Materialization
    BackfillTime,
    MaterializationSettings,
    RedisSink,
    # Offline feature computation
    FeatureQuery,
    ObservationSettings,
)
from feathr.datasets import nyc_taxi
from feathr.spark_provider.feathr_configurations import SparkExecutionConfiguration
from feathr.utils.config import generate_config
from feathr.utils.job_utils import get_result_df


print(
    f"""Feathr version: {feathr.__version__}
Databricks runtime version: {spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")}"""
)

## 2. Create Shareable Features with Feathr Feature Definition Configs

In this notebook, we define all the necessary resource key values for authentication. We use the values passed by the Databricks widgets at the top of this notebook. Instead of manually entering the values into the widgets, we can also use [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) to retrieve them.
Please refer to [how-to guide documents for granting key-vault access](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html#3-grant-key-vault-and-synapse-access-to-selected-users-optional) and [Databricks' Azure Key Vault-backed scopes](https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes) for more details.

In [None]:
RESOURCE_PREFIX = dbutils.widgets.get("RESOURCE_PREFIX")
PROJECT_NAME = "feathr_getting_started"

REDIS_KEY = dbutils.widgets.get("REDIS_KEY")

# Use a databricks cluster
SPARK_CLUSTER = "databricks"

# Databricks file system path
DATA_STORE_PATH = f"dbfs:/{PROJECT_NAME}"

In [None]:
# Redis credential
os.environ["REDIS_PASSWORD"] = REDIS_KEY

### Configurations

Feathr uses a YAML file to define configurations. Please refer to [feathr_config.yaml](https://github.com/feathr-ai/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) for the meaning of each field.

In the following cells, we set the required Databricks credentials automatically by using a Databricks notebook context object, and we configure Feathr to run its jobs on a new job cluster.

In [None]:
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()

In [None]:
config_path = generate_config(
    resource_prefix=RESOURCE_PREFIX,
    project_name=PROJECT_NAME,
    spark_config__spark_cluster=SPARK_CLUSTER,
    # You may set an existing cluster id here, but Databricks recommends using new clusters for greater reliability.
    databricks_cluster_id=None,  # Set None to create a new job cluster
    databricks_workspace_token_value=ctx.apiToken().get(),
    spark_config__databricks__workspace_instance_url=f"https://{ctx.tags().get('browserHostName').get()}",
)

with open(config_path, "r") as f:
    print(f.read())

All the configurations can be overridden by environment variables whose names join the keys of each layer of the config file with `__`. For example, `feathr_runtime_location` for the Databricks config can be overridden by setting the `spark_config__databricks__feathr_runtime_location` environment variable.
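
To make the naming rule concrete, here is a minimal sketch that flattens a nested config dictionary into the corresponding environment variable names. The helper `flatten_config_keys` and the sample dictionary are hypothetical and are not part of the Feathr API. The sample mirrors only the layers mentioned above.

```python
def flatten_config_keys(config: dict, parent: str = "") -> list:
    """Join nested config keys with `__`, one name per leaf value."""
    names = []
    for key, value in config.items():
        full_name = f"{parent}__{key}" if parent else key
        if isinstance(value, dict):
            # Descend one layer and keep extending the joined name.
            names.extend(flatten_config_keys(value, full_name))
        else:
            names.append(full_name)
    return names


# Illustrative subset of a config file; the values are placeholders.
sample_config = {
    "spark_config": {
        "spark_cluster": "databricks",
        "databricks": {"feathr_runtime_location": "<placeholder>"},
    }
}

env_var_names = flatten_config_keys(sample_config)
print(env_var_names)
```

Setting any of the printed names with `os.environ[...]` before creating the client is the override mechanism described above.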

### Initialize Feathr Client

In [None]:
client = FeathrClient(config_path=config_path)

### View the NYC taxi fare dataset

In [None]:
DATA_FILE_PATH = str(Path(DATA_STORE_PATH, "nyc_taxi.csv"))

# Download the data file
df_raw = nyc_taxi.get_spark_df(spark=spark, local_cache_path=DATA_FILE_PATH)
df_raw.limit(5).show()

### Defining features with Feathr

In Feathr, a feature is viewed as a function, mapping a key and timestamp to a feature value. For more details, please see [Feathr Feature Definition Guide](https://github.com/feathr-ai/feathr/blob/main/docs/concepts/feature-definition.md).

* The feature key (a.k.a. entity id) identifies the subject of the feature, e.g. a user_id or location_id.
* The feature name is the aspect of the entity that the feature describes, e.g. the age of the user.
* The feature value is the actual value of that aspect at a particular time, e.g. the value is 30 in the year 2022.

Note that, in some cases, a feature could be just a transformation function that has no entity key or timestamp involved, e.g. *the day of week of the request timestamp*.
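
The "feature as a function" view can be sketched in plain Python. This is only a conceptual illustration, not how Feathr stores or computes features. The table contents and the function names `f_user_age` and `f_request_weekday` are invented for this example.

```python
from datetime import datetime
from typing import Dict, Tuple

# Toy in-memory feature table: (entity key, year) -> value. Data is invented.
AGE_TABLE: Dict[Tuple[str, int], int] = {
    ("user_1", 2022): 30,
    ("user_1", 2023): 31,
}


def f_user_age(user_id: str, timestamp: datetime) -> int:
    """Keyed feature: the value depends on the entity key and the timestamp."""
    return AGE_TABLE[(user_id, timestamp.year)]


def f_request_weekday(timestamp: datetime) -> int:
    """Keyless feature: a pure transformation of the request timestamp.

    Uses ISO numbering (Monday=1 ... Sunday=7), which differs from
    Spark SQL's `dayofweek` (Sunday=1 ... Saturday=7).
    """
    return timestamp.isoweekday()


request_time = datetime(2022, 6, 1)
print(f_user_age("user_1", request_time))
print(f_request_weekday(request_time))
```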

There are two types of features -- anchored features and derived features:

* **Anchored features**: Features that are directly extracted from sources, with or without aggregation.
* **Derived features**: Features that are computed on top of other features.

#### Define anchored features

Anchored features need a feature source that describes the raw data from which the feature values are computed. A source should be either `INPUT_CONTEXT` (for features that are extracted directly from the observation data) or a `feathr.source.Source` object.

In [None]:
TIMESTAMP_COL = "lpep_dropoff_datetime"
TIMESTAMP_FORMAT = "yyyy-MM-dd HH:mm:ss"

In [None]:
# We define f_trip_distance and f_trip_time_duration features separately
# so that we can reuse them later for the derived features.
f_trip_distance = Feature(
    name="f_trip_distance",
    feature_type=FLOAT,
    transform="trip_distance",
)
f_trip_time_duration = Feature(
    name="f_trip_time_duration",
    feature_type=FLOAT,
    transform="cast_float((to_unix_timestamp(lpep_dropoff_datetime) - to_unix_timestamp(lpep_pickup_datetime)) / 60)",
)

features = [
    f_trip_distance,
    f_trip_time_duration,
    Feature(
        name="f_is_long_trip_distance",
        feature_type=BOOLEAN,
        transform="trip_distance > 30.0",
    ),
    Feature(
        name="f_day_of_week",
        feature_type=INT32,
        transform="dayofweek(lpep_dropoff_datetime)",
    ),
    Feature(
        name="f_day_of_month",
        feature_type=INT32,
        transform="dayofmonth(lpep_dropoff_datetime)",
    ),
    Feature(
        name="f_hour_of_day",
        feature_type=INT32,
        transform="hour(lpep_dropoff_datetime)",
    ),
]

# After you have defined features, bring them together to build the anchor to the source.
feature_anchor = FeatureAnchor(
    name="feature_anchor",
    source=INPUT_CONTEXT,  # Pass through source, i.e. observation data.
    features=features,
)

We can define the source with a preprocessing python function.

In [None]:
def preprocessing(df: DataFrame) -> DataFrame:
    import pyspark.sql.functions as F

    df = df.withColumn(
        "fare_amount_cents", (F.col("fare_amount") * 100.0).cast("float")
    )
    return df


batch_source = HdfsSource(
    name="nycTaxiBatchSource",
    path=DATA_FILE_PATH,
    event_timestamp_column=TIMESTAMP_COL,
    preprocessing=preprocessing,
    timestamp_format=TIMESTAMP_FORMAT,
)

For the features with aggregation, the supported functions are as follows:

| Aggregation Function | Input Type | Description |
| --- | --- | --- |
| SUM, COUNT, MAX, MIN, AVG | Numeric | Applies the numerical operation to the numeric inputs. |
| MAX_POOLING, MIN_POOLING, AVG_POOLING | Numeric Vector | Applies the max/min/avg operation on a per-entry basis for a given collection of numbers. |
| LATEST | Any | Returns the latest non-null value within the defined time window. |
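
To illustrate what these aggregations compute, the following plain-Python sketch applies AVG, MAX, LATEST, and MAX_POOLING to a few invented events inside a 90-day window. It is not Feathr's implementation, and the handling of the window boundary (exclusive start, inclusive end) is an assumption made for this example.

```python
from datetime import datetime, timedelta

# Invented events for one location: (event time, fare in cents, numeric vector).
events = [
    (datetime(2020, 1, 1), 1000.0, [1.0, 5.0]),
    (datetime(2020, 3, 1), 3000.0, [4.0, 2.0]),
    (datetime(2020, 3, 20), 2000.0, [3.0, 3.0]),
]


def in_window(rows, as_of, window):
    """Keep rows with as_of - window < event time <= as_of (assumed boundaries)."""
    return [row for row in rows if as_of - window < row[0] <= as_of]


as_of = datetime(2020, 4, 1)
rows = in_window(events, as_of, timedelta(days=90))  # drops the 2020-01-01 event

fares = [fare for _, fare, _ in rows]
avg_fare = sum(fares) / len(fares)  # AVG
max_fare = max(fares)  # MAX
latest_fare = max(rows, key=lambda row: row[0])[1]  # LATEST

# MAX_POOLING: take the maximum of each vector position across the rows.
vectors = [vector for _, _, vector in rows]
max_pooled = [max(position) for position in zip(*vectors)]

print(avg_fare, max_fare, latest_fare, max_pooled)
```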

In [None]:
agg_key = TypedKey(
    key_column="DOLocationID",
    key_column_type=ValueType.INT32,
    description="location id in NYC",
    full_name="nyc_taxi.location_id",
)

agg_window = "90d"

# Anchored features with aggregations
agg_features = [
    Feature(
        name="f_location_avg_fare",
        key=agg_key,
        feature_type=FLOAT,
        transform=WindowAggTransformation(
            agg_expr="fare_amount_cents",
            agg_func="AVG",
            window=agg_window,
        ),
    ),
    Feature(
        name="f_location_max_fare",
        key=agg_key,
        feature_type=FLOAT,
        transform=WindowAggTransformation(
            agg_expr="fare_amount_cents",
            agg_func="MAX",
            window=agg_window,
        ),
    ),
]

agg_feature_anchor = FeatureAnchor(
    name="agg_feature_anchor",
    source=batch_source,  # External data source for feature. Typically a data table.
    features=agg_features,
)

#### Define derived features

We also define a derived feature, `f_trip_time_distance`, from the anchored features `f_trip_distance` and `f_trip_time_duration` as follows:

In [None]:
derived_features = [
    DerivedFeature(
        name="f_trip_time_distance",
        feature_type=FLOAT,
        input_features=[
            f_trip_distance,
            f_trip_time_duration,
        ],
        transform="f_trip_distance / f_trip_time_duration",
    )
]

### Build features

Finally, we build the features.

In [None]:
client.build_features(
    anchor_list=[feature_anchor, agg_feature_anchor],
    derived_feature_list=derived_features,
)

## 3. Create Training Data Using Point-in-Time Correct Feature Join

After the feature producers have defined the features (as described in the Feature Definition part), the feature consumers may want to consume those features. Feature consumers will use observation data to query from different feature tables using Feature Query.

To create a training dataset using Feathr, one needs to provide a feature join configuration that specifies
which features should be joined to the observation data and how.

To learn more on this topic, please refer to [Point-in-time Correctness](https://github.com/feathr-ai/feathr/blob/main/docs/concepts/point-in-time-join.md).
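
The idea behind a point-in-time correct join can be sketched in plain Python: for each observation, use only feature values that were available at or before the observation timestamp, so that no future information leaks into the training data. This is a simplified illustration with invented data and a hypothetical helper, `point_in_time_lookup`. Feathr performs the equivalent logic in Spark, including window aggregations.

```python
from datetime import datetime

# Invented feature history per location id: (time the value became available, value).
feature_history = {
    "239": [
        (datetime(2020, 4, 1), 10.0),
        (datetime(2020, 4, 10), 12.0),
        (datetime(2020, 4, 20), 15.0),
    ],
}


def point_in_time_lookup(key, observation_time):
    """Return the latest value at or before observation_time, or None if there is none."""
    candidates = [
        (timestamp, value)
        for timestamp, value in feature_history.get(key, [])
        if timestamp <= observation_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Observation rows: (key, event timestamp).
observations = [
    ("239", datetime(2020, 4, 15)),
    ("239", datetime(2020, 3, 1)),
]
joined = [(key, ts, point_in_time_lookup(key, ts)) for key, ts in observations]
for row in joined:
    print(row)
```

The first observation receives the value from 2020-04-10 rather than the later value from 2020-04-20, and the second observation receives no value because nothing was known yet at that time.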

In [None]:
feature_names = [feature.name for feature in features + agg_features + derived_features]
feature_names

In [None]:
DATA_FORMAT = "parquet"
offline_features_path = str(
    Path(DATA_STORE_PATH, "feathr_output", f"features.{DATA_FORMAT}")
)

In [None]:
# Features that we want to request. Can use a subset of features
query = FeatureQuery(
    feature_list=feature_names,
    key=agg_key,
)
settings = ObservationSettings(
    observation_path=DATA_FILE_PATH,
    event_timestamp_column=TIMESTAMP_COL,
    timestamp_format=TIMESTAMP_FORMAT,
)
client.get_offline_features(
    observation_settings=settings,
    feature_query=query,
    # Note, execution_configurations argument only works when using a new job cluster
    # For more details, see https://feathr-ai.github.io/feathr/how-to-guides/feathr-job-configuration.html
    execution_configurations=SparkExecutionConfiguration(
        {
            "spark.feathr.outputFormat": DATA_FORMAT,
        }
    ),
    output_path=offline_features_path,
)

client.wait_job_to_finish(timeout_sec=5000)

In [None]:
# Show feature results
df = get_result_df(
    spark=spark,
    client=client,
    data_format=DATA_FORMAT,
    res_url=offline_features_path,
)
df.select(feature_names).limit(5).toPandas()

## 4. Train and Evaluate a Prediction Model

After generating all the features, we train and evaluate a machine learning model to predict NYC taxi fares. In this example, we use Spark MLlib's [GBTRegressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression).

Note that designing features, training prediction models, and evaluating them form an iterative process in which the models' performance may be used to modify the features.

### Load Train and Test Data from the Offline Feature Values

In [None]:
# Train / test split
train_df, test_df = (
    df.withColumn(  # Dataframe that we generated from get_offline_features call.
        "label", F.col("fare_amount").cast("double")
    )
    .where(F.col("f_trip_time_duration") > 0)
    .fillna(0)
    .randomSplit([0.8, 0.2])
)

print(f"Num train samples: {train_df.count()}")
print(f"Num test samples: {test_df.count()}")

### Build an ML Pipeline

Here, we use a Spark ML Pipeline to assemble the feature columns into a feature vector and feed it to the model.

In [None]:
# Generate a feature vector column for SparkML
vector_assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x in feature_names],
    outputCol="features",
)

# Define a model
gbt = GBTRegressor(
    featuresCol="features",
    maxIter=100,
    maxDepth=5,
    maxBins=16,
)

# Create an ML pipeline
ml_pipeline = Pipeline(
    stages=[
        vector_assembler,
        gbt,
    ]
)

### Train and Evaluate the Model

In [None]:
# Train a model
model = ml_pipeline.fit(train_df)

# Make predictions
predictions = model.transform(test_df)

In [None]:
# Evaluate
evaluator = RegressionEvaluator(
    labelCol="label",
    predictionCol="prediction",
)

rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})
print(f"RMSE: {rmse}\nMAE: {mae}")
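
For reference, the two metrics can be written out by hand. The sketch below computes RMSE and MAE on a few invented label/prediction pairs. The function names `rmse_by_hand` and `mae_by_hand` are hypothetical, and the numbers are not results from this notebook.

```python
import math


def rmse_by_hand(labels, predictions):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae_by_hand(labels, predictions):
    """Mean absolute error: mean of the absolute differences."""
    absolute_errors = [abs(y - p) for y, p in zip(labels, predictions)]
    return sum(absolute_errors) / len(absolute_errors)


# Invented values for illustration only.
toy_labels = [10.0, 20.0, 30.0]
toy_predictions = [12.0, 18.0, 33.0]

print(f"RMSE: {rmse_by_hand(toy_labels, toy_predictions)}")
print(f"MAE: {mae_by_hand(toy_labels, toy_predictions)}")
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two usually points to a few badly mispredicted trips.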

In [None]:
# Plot predicted fare vs. actual fare
predictions_pdf = predictions.select(["label", "prediction"]).toPandas().reset_index()

predictions_pdf.plot(
    x="index",
    y=["label", "prediction"],
    style=["-", ":"],
    figsize=(20, 10),
)

In [None]:
predictions_pdf.plot.scatter(
    x="label",
    y="prediction",
    xlim=(0, 100),
    ylim=(0, 100),
    figsize=(10, 10),
)

## 5. Materialize Feature Values for Online Scoring

While we computed feature values on the fly at request time via Feathr in the previous steps, we can also pre-compute the feature values and materialize them to offline or online storage such as Redis.

Note that only features anchored to an offline data source can be materialized.

In [None]:
materialized_feature_names = [feature.name for feature in agg_features]
materialized_feature_names

In [None]:
if REDIS_KEY and RESOURCE_PREFIX:
    FEATURE_TABLE_NAME = "nycTaxiDemoFeature"

    # Get the last date from the dataset
    backfill_timestamp = (
        df_raw.select(
            F.to_timestamp(F.col(TIMESTAMP_COL), TIMESTAMP_FORMAT).alias(TIMESTAMP_COL)
        )
        .agg({TIMESTAMP_COL: "max"})
        .collect()[0][0]
    )

    # Time range to materialize
    backfill_time = BackfillTime(
        start=backfill_timestamp,
        end=backfill_timestamp,
        step=timedelta(days=1),
    )

    # Destinations:
    # For online store,
    redis_sink = RedisSink(table_name=FEATURE_TABLE_NAME)

    # For offline store,
    # adls_sink = HdfsSink(output_path=)

    settings = MaterializationSettings(
        name=FEATURE_TABLE_NAME + ".job",  # job name
        backfill_time=backfill_time,
        sinks=[redis_sink],  # or adls_sink
        feature_names=materialized_feature_names,
    )

    client.materialize_features(
        settings=settings,
        # Note, execution_configurations argument only works when using a new job cluster
        execution_configurations={"spark.feathr.outputFormat": "parquet"},
    )

    client.wait_job_to_finish(timeout_sec=5000)
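
The `start`, `end`, and `step` arguments of `BackfillTime` describe a series of cutoff times. The sketch below enumerates such a series in plain Python, assuming both ends are inclusive. That inclusivity is an assumption of this example, not something taken from the Feathr source, and the helper `backfill_cutoffs` and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def backfill_cutoffs(start, end, step):
    """List cutoff times from start to end, assuming both ends are inclusive."""
    cutoffs = []
    current = start
    while current <= end:
        cutoffs.append(current)
        current += step
    return cutoffs


# Same start and end, as in the cell above: a single cutoff.
single_cutoff = backfill_cutoffs(
    datetime(2020, 4, 30), datetime(2020, 4, 30), timedelta(days=1)
)

# A three-day range with a daily step: three cutoffs.
daily_cutoffs = backfill_cutoffs(
    datetime(2020, 4, 28), datetime(2020, 4, 30), timedelta(days=1)
)

print(single_cutoff)
print(daily_cutoffs)
```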

Now, you can retrieve features for online scoring as follows:

In [None]:
if REDIS_KEY and RESOURCE_PREFIX:
    # Note, to get a single key, you may use client.get_online_features instead
    materialized_feature_values = client.multi_get_online_features(
        feature_table=FEATURE_TABLE_NAME,
        keys=["239", "265"],
        feature_names=materialized_feature_names,
    )
    # A bare expression inside an `if` block is not displayed by the notebook, so print it.
    print(materialized_feature_values)

## Cleanup

In [None]:
# Remove temporary files
dbutils.fs.rm("dbfs:/tmp/", recurse=True)