Feathr Job Configuration during Run Time
Since Feathr uses Spark as the underlying execution engine, you can override Spark configurations through `FeathrClient.get_offline_features()` with the `execution_configurations` parameter. The complete list of available Spark configurations is located in Spark Configuration (though not all of them are honored by cloud-hosted Spark platforms such as Databricks), and there are a few Feathr-specific ones that are documented here:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.feathr.inputFormat | None | Specify the input format if it cannot be detected automatically. By default, Feathr reads files by parsing the file extension; however, if the file/folder name has no extension, this configuration can be set to tell Feathr which format to use when reading the data. Currently it can only be set to one of the Spark built-in short names: `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`. For more details, see "Manually Specifying Options". Additionally, `delta` is also supported if users want to read Delta Lake. | 0.2.1 |
spark.feathr.outputFormat | None | Specify the output format. `avro` is the default behavior if this value is not set. Currently it can only be set to one of the Spark built-in short names: `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`. For more details, see "Manually Specifying Options". Additionally, `delta` is also supported if users want to write Delta Lake. | 0.2.1 |
spark.feathr.inputFormat.csvOptions.sep | None | Specify the delimiter. For example, "," for commas or "\t" for tabs (supports both CSV and TSV). | 0.6.0 |
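For example, to read tab-separated input, the delimiter option from the table can be combined with the CSV input format. This is a minimal sketch; it assumes a `FeathrClient` named `client`, that `SparkExecutionConfiguration` is imported from the `feathr` package, and the same `settings`, `feature_query`, and `output_path` objects used in the examples below:

from feathr import SparkExecutionConfiguration

client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    # Read TSV input: treat the files as CSV and use a tab as the delimiter.
    execution_configurations=SparkExecutionConfiguration({
        "spark.feathr.inputFormat": "csv",
        "spark.feathr.inputFormat.csvOptions.sep": "\t"
    }),
    verbose=True
)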
Examples of using job configurations
Examples of using the above job configurations when getting offline features:
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration({"spark.feathr.inputFormat": "parquet", "spark.feathr.outputFormat": "parquet"}),
    verbose=True
)
Example of using the above job configurations when materializing features:
client.materialize_features(settings, execution_configurations=SparkExecutionConfiguration({"spark.feathr.inputFormat": "parquet", "spark.feathr.outputFormat": "parquet"}))
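The same approach applies to the other formats listed in the table above. For instance, writing the offline feature output as Delta Lake only requires setting the output format to `delta`. This is a sketch under the same assumptions as the examples above:

client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    # Write the result as Delta Lake instead of the default avro output.
    execution_configurations=SparkExecutionConfiguration({"spark.feathr.outputFormat": "delta"}),
    verbose=True
)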
Config not applied issue
Please note that the `execution_configurations` argument only works when using a new job cluster in Databricks: Cluster spark config not applied
If you are using an existing cluster, please add the configurations manually to the cluster's Spark configuration. This can be done in the Databricks cluster UI: Edit a cluster
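For reference, the equivalent settings can be entered in the Spark config text box on the cluster edit page as space-separated key/value pairs, one per line. This is a sketch using the configurations from the examples above:

spark.feathr.inputFormat parquet
spark.feathr.outputFormat parquet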