Cloud Integration Test/CI Pipeline
We use GitHub Actions for cloud integration testing. Currently the integration test has 5 jobs:
- running `./gradlew test` to verify that the Scala/Spark related code passes all of its tests
- running `flake8` to lint the Python scripts and make sure there are no obvious syntax errors
- running the built jar in a Databricks environment to make sure it passes the end-to-end tests
- running the built jar in an Azure Synapse environment to make sure it passes the end-to-end tests
- running the end-to-end test cases for the registry server to make sure the related code passes all of its tests

The above 5 jobs run in parallel, and if any one of them fails, the integration test fails.
Cloud Testing Pipelines
Since many cloud integration testing jobs can run in parallel, the current workflow is as follows:
- For each Spark runtime (Databricks or Azure Synapse), the pipeline first compiles the Feathr jar file.
- The CI pipeline uploads the jar to a location specific to this CI workflow run. The subsequent Spark jobs in the same run reuse the jar from that shared cloud location (so they don't have to upload it again), while different CI runs use different locations (for example, a new push to a PR uploads the jar to a fresh location).
- Each Spark job uses a different "workspace folder" so that the required configurations and cloud resources don't conflict:
```python
# 'now' is a timestamp (e.g. datetime.now()) captured at the start of the test session.
os.environ['SPARK_CONFIG__DATABRICKS__WORK_DIR'] = ''.join(['dbfs:/feathrazure_cijob','_', str(now.minute), '_', str(now.second), '_', str(now.microsecond)])
os.environ['SPARK_CONFIG__AZURE_SYNAPSE__WORKSPACE_DIR'] = ''.join(['abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_github_ci','_', str(now.minute), '_', str(now.second), '_', str(now.microsecond)])
```
- They will also use different output paths to make sure there's no write conflict on the same output file:
```python
if client.spark_runtime == 'databricks':
    output_path = ''.join(['dbfs:/feathrazure_cijob_snowflake','_', str(now.minute), '_', str(now.second), ".avro"])
else:
    output_path = ''.join(['abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/snowflake_output','_', str(now.minute), '_', str(now.second), ".avro"])
```
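To see the naming scheme end to end, the snippets above can be folded into a single helper that derives one per-run suffix and reuses it for every cloud path. This is only an illustrative sketch; the `per_run_suffix` helper and the variable names below are hypothetical and not part of the Feathr test code.

```python
from datetime import datetime

def per_run_suffix() -> str:
    # Build a per-CI-run suffix from the current time, mirroring the
    # minute/second/microsecond scheme used in the snippets above.
    now = datetime.now()
    return '_'.join([str(now.minute), str(now.second), str(now.microsecond)])

suffix = per_run_suffix()

# Namespace the Spark workspace folders and the job output with the same suffix,
# so concurrent CI runs never read from or write to the same cloud location.
databricks_work_dir = f"dbfs:/feathrazure_cijob_{suffix}"
synapse_workspace_dir = (
    "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/"
    f"feathr_github_ci_{suffix}"
)
output_path = f"dbfs:/feathrazure_cijob_snowflake_{suffix}.avro"
```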
Optimizing Parallel Runs
Since Feathr uses cloud resources for CI testing, the following optimizations are in place:
- run the tests with `pytest -n 4` (the `-n` option comes from the pytest-xdist plugin) so that 4 tests execute in parallel
- use pre-existing Spark pools to reduce setup time. All the Spark jobs run on "instance pools" that keep a certain number of idle compute instances available, so cluster setup time stays short. For example, for Databricks:

```
"instance_pool_id":"0403-214809-inlet434-pool-l9dj3kwz"
```
More on GitHub Actions
The integration test is triggered on pushes and for new pull requests. It also skips files in the `docs/` folder and files ending with `.md`, as configured below. For more info on GitHub Actions, refer to the documentation here.
```yaml
push:
  branches: [main]
  paths-ignore:
    - "docs/**"
    - "**/README.md"
pull_request:
  branches: [main]
  paths-ignore:
    - "docs/**"
    - "**/README.md"
```