Dataflow

What is Dataflow?

Dataflow is a serverless service for carrying out data processing and analysis. In this lab, you set up a streaming data pipeline that reads sensor data from Pub/Sub, computes the maximum temperature within a time window, and writes the results to BigQuery. Dataflow supports both batch and streaming pipelines and is built on Apache Beam, so familiarity with the Beam programming model is helpful.
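
To make this concrete, here is a minimal Apache Beam sketch (Python SDK) of such a pipeline. The topic name, output table, schema, and the assumption that each Pub/Sub message is a JSON object with a temperature field are illustrative placeholders, not the lab's exact resources:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    # Placeholder resources -- substitute your own topic and table.
    INPUT_TOPIC = "projects/<myprojectid>/topics/sensor-readings"
    OUTPUT_TABLE = "<myprojectid>:sensors.max_temperature"


    def run():
        # Runner, project, region, and temp_location options are omitted here;
        # pass them on the command line when submitting the job to Dataflow.
        options = PipelineOptions(streaming=True)

        with beam.Pipeline(options=options) as p:
            (
                p
                # Extract: read sensor messages from Pub/Sub.
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
                # Transform: parse each JSON message and pull out the temperature.
                | "ParseJson" >> beam.Map(json.loads)
                | "GetTemperature" >> beam.Map(lambda msg: float(msg["temperature"]))
                # Group readings into fixed two-minute windows.
                | "Window" >> beam.WindowInto(FixedWindows(120))
                # Compute the maximum temperature in each window.
                | "MaxPerWindow" >> beam.CombineGlobally(max).without_defaults()
                | "ToRow" >> beam.Map(lambda max_temp: {"max_temperature": max_temp})
                # Load: stream the per-window results into BigQuery.
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    OUTPUT_TABLE,
                    schema="max_temperature:FLOAT",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )


    if __name__ == "__main__":
        run()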

Dataflow vs Other GCP ETL Options

Dataflow isn't the only option for running ETL on GCP; alternatives include Dataproc (managed Spark and Hadoop) and Dataprep (a visual tool for exploring and preparing data). Dataflow is the serverless, fully managed choice for pipelines written with Apache Beam.

ETL Example with Dataflow

  1. Extract data from a source such as Pub/Sub, Google Cloud Storage, Cloud Spanner, or Cloud SQL.
  2. Transform the data using Cloud Dataflow.
  3. Have the Dataflow pipeline write to BigQuery.

Analogy to a Construction Site

On a construction site, how do you transform raw materials into useful pieces? That's the job of the worker. Fittingly, the individual unit that does this behind the scenes in Cloud Dataflow is literally called a worker: a virtual machine that takes a small piece of your data and transforms it for you.

How to Create and Set Up a Dataflow Pipeline

  1. Open the Google Cloud console.
  2. From the Navigation menu, select Dataflow.
  3. Click Create job from template.
  4. Enter streaming-name-pipeline as the Job name for your Dataflow job.
  5. Under Dataflow template, select the Pub/Sub Topic to BigQuery template.
  6. Under Input Pub/Sub topic, enter projects/pubsub-public-data/topics/tablename-realtime.
  7. Under BigQuery output table, enter <myprojectid>:tablename.realtime.
  8. Under Temporary location, enter gs://<mybucket>/tmp/.
  9. Click the Run Job button.
  10. A new streaming job will start! You can now see a visual representation of the data pipeline.
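
The same template launch can be scripted instead of clicked through. Below is a rough sketch using the google-api-python-client library and the Dataflow templates.launch method; the template path, parameter names, and region are assumptions based on the Google-provided Pub/Sub Topic to BigQuery classic template, so verify them against the template picker before relying on this:

    from googleapiclient.discovery import build

    # Placeholders -- substitute your own project, region, and bucket.
    PROJECT = "<myprojectid>"
    REGION = "us-central1"
    BUCKET = "<mybucket>"

    dataflow = build("dataflow", "v1b3")

    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        # Assumed path of the Google-provided "Pub/Sub Topic to BigQuery" template.
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "streaming-name-pipeline",
            "parameters": {
                "inputTopic": "projects/pubsub-public-data/topics/tablename-realtime",
                "outputTableSpec": PROJECT + ":tablename.realtime",
            },
            "environment": {"tempLocation": "gs://" + BUCKET + "/tmp/"},
        },
    ).execute()

    # The response includes the new job, whose ID can be used to monitor or stop it.
    print(response["job"]["id"])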

Steps of a Dataflow Job

The console displays a Dataflow job as a graph of named steps, one for each transform in the pipeline. For the Pub/Sub Topic to BigQuery template, the steps read messages from the input topic, convert them into BigQuery table rows, and write the rows to the output table.

Dataflow Job Topology and Metrics

System lag: The maximum time that an item of data has been awaiting processing.

Data watermark: Timestamp marking the estimated completion of data input for this step.

Wall time: Approximate time spent in this step on initializing, processing data, shuffling data, and terminating, across all threads in all workers. For composite steps, it is the sum of time spent in the component steps. This estimate helps identify slow steps.

Job region: The regional endpoint where metadata is stored and handled for this job. This may be distinct from where the job's workers are deployed.

Current workers: The number of workers your job is currently using. You can find more information about your job's workers on the JOB METRICS tab, under the Autoscaling section.

SDK version: The Apache Beam SDK used to build the job; for this lab's template it reports Apache Beam SDK for Java 2.23.0.

How to Stop a Dataflow Job

  1. Navigate back to Dataflow.
  2. Click the streaming-name-pipeline job.
  3. Click Stop, select Cancel, and then click Stop Job.
  4. This will free up resources for your project.
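
Jobs can also be stopped without the console. A minimal sketch, assuming the google-api-python-client library and using placeholder project, region, and job ID values, requests cancellation through the Dataflow API:

    from googleapiclient.discovery import build

    # Placeholders -- substitute your own project, region, and job ID.
    PROJECT = "<myprojectid>"
    REGION = "us-central1"
    JOB_ID = "<jobid>"

    dataflow = build("dataflow", "v1b3")

    # Request cancellation by setting the job's requested state.
    # (A drain, which finishes in-flight data first, can be requested instead.)
    dataflow.projects().locations().jobs().update(
        projectId=PROJECT,
        location=REGION,
        jobId=JOB_ID,
        body={"requestedState": "JOB_STATE_CANCELLED"},
    ).execute()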