Orchestration

In the data engineering world, the orchestration layer is what coordinates your overall workflow.

Example

  1. Every time a CSV file, or any new piece of data, drops into this Google Cloud Storage bucket, I want you to automatically pass it to our data pipeline for processing.
  2. Once it's done processing, I want the data pipeline to stream it into our data warehouse.
  3. Once it's in the data warehouse, I will notify the machine learning model that new, cleaned training data is available for training and retraining.
  4. Then I can direct it to start training a new model version.

What if one of these steps fails, or what if you want to run this every day, every hour, or whenever an event triggers it?
You're beginning to see the need for an orchestrator, which in our solution will be Apache Airflow running on a Cloud Composer environment.
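
To make the failure question concrete, here is a minimal sketch in plain Python of what an orchestrator does with the four steps above: run them in order, retry a step that fails, and stop the downstream steps if it keeps failing. This is not Airflow code. The step functions, the `run_workflow` helper, and the `max_retries` parameter are all hypothetical names made up for illustration, and a real orchestrator such as Airflow adds scheduling, event triggers, and monitoring on top of this idea.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], None]]


# Hypothetical stand-ins for the four stages of the example workflow.
def process_csv(state: Dict) -> None:
    state["processed"] = True


def load_to_warehouse(state: Dict) -> None:
    state["loaded"] = True


def notify_model(state: Dict) -> None:
    state["notified"] = True


def train_model(state: Dict) -> None:
    state["model_version"] = state.get("model_version", 0) + 1


# The workflow is an ordered chain: each step depends on the one before it.
STEPS: List[Step] = [
    ("process_csv", process_csv),
    ("load_to_warehouse", load_to_warehouse),
    ("notify_model", notify_model),
    ("train_model", train_model),
]


def run_workflow(steps: List[Step], state: Dict, max_retries: int = 2) -> List[str]:
    """Run steps in order and return the names of the steps that completed.

    A failing step is retried up to max_retries times. If it still fails,
    the remaining downstream steps are skipped.
    """
    completed: List[str] = []
    for name, step in steps:
        for attempt in range(1, max_retries + 2):
            try:
                step(state)
                completed.append(name)
                break
            except Exception as exc:
                print(f"{name} failed on attempt {attempt}: {exc}")
        else:
            # Every attempt failed, so do not run anything downstream.
            print(f"{name} gave up; skipping downstream steps")
            return completed
    return completed
```

Calling `run_workflow(STEPS, {})` runs all four stages in order. Swapping in a step that always raises shows the other half of the behavior: the steps before it complete, and the ones after it never start.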

Which tool is the right choice for orchestration?

Cloud Composer