Dataflow

What is Dataflow?

Dataflow is a serverless service for carrying out data processing and analysis. In this lab, you set up a streaming data pipeline that reads sensor data from Pub/Sub, computes the maximum temperature within a time window, and writes the results to BigQuery. Dataflow supports both batch and streaming pipelines and is built on Apache Beam, so familiarity with the Beam programming model is helpful.
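
To make this concrete, here is a minimal Apache Beam sketch (Python SDK) of such a pipeline. The topic name, output table, schema, and the assumption that each Pub/Sub message is a JSON object with a temperature field are illustrative placeholders, not the lab's exact resources:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    # Placeholder resources -- substitute your own topic and table.
    INPUT_TOPIC = "projects/<myprojectid>/topics/sensor-readings"
    OUTPUT_TABLE = "<myprojectid>:sensors.max_temperature"


    def run():
        # Runner, project, region, and temp_location options are omitted here;
        # pass them on the command line when submitting the job to Dataflow.
        options = PipelineOptions(streaming=True)

        with beam.Pipeline(options=options) as p:
            (
                p
                # Extract: read sensor messages from Pub/Sub.
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
                # Transform: parse each JSON message and pull out the temperature.
                | "ParseJson" >> beam.Map(json.loads)
                | "GetTemperature" >> beam.Map(lambda msg: float(msg["temperature"]))
                # Group readings into fixed two-minute windows.
                | "Window" >> beam.WindowInto(FixedWindows(120))
                # Compute the maximum temperature in each window.
                | "MaxPerWindow" >> beam.CombineGlobally(max).without_defaults()
                | "ToRow" >> beam.Map(lambda max_temp: {"max_temperature": max_temp})
                # Load: stream the per-window results into BigQuery.
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    OUTPUT_TABLE,
                    schema="max_temperature:FLOAT",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )


    if __name__ == "__main__":
        run()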

Dataflow vs Other GCP ETL Options

Dataflow isn't the only option for running ETL on GCP; alternatives include Dataproc (managed Spark and Hadoop) and Dataprep (a visual tool for exploring and preparing data). Dataflow is the serverless, fully managed choice for pipelines written with Apache Beam.

ETL Example with Dataflow

  1. Extract data from a source such as Pub/Sub, Google Cloud Storage, Cloud Spanner, or Cloud SQL.
  2. Transform the data using Cloud Dataflow.
  3. Have the Dataflow pipeline write to BigQuery.

Analogy to a Construction Site

On a construction site, how do you transform raw materials into useful pieces? That's the job of the worker. Fittingly, the individual unit that does this behind the scenes in Cloud Dataflow is literally called a worker: a virtual machine that takes a small piece of your data and transforms it for you.

How to Create and Set Up a Dataflow Pipeline

  1. Open the Google Cloud console.
  2. From the Navigation menu, select Dataflow.
  3. Click Create job from template.
  4. Enter streaming-name-pipeline as the Job name for your Dataflow job.
  5. Under Dataflow template, select the Pub/Sub Topic to BigQuery template.
  6. Under Input Pub/Sub topic, enter projects/pubsub-public-data/topics/tablename-realtime.
  7. Under BigQuery output table, enter <myprojectid>:tablename.realtime.
  8. Under Temporary location, enter gs://<mybucket>/tmp/.
  9. Click the Run Job button.
  10. A new streaming job will start! You can now see a visual representation of the data pipeline.
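
The same template launch can be scripted instead of clicked through. Below is a rough sketch using the google-api-python-client library and the Dataflow templates.launch method; the template path, parameter names, and region are assumptions based on the Google-provided Pub/Sub Topic to BigQuery classic template, so verify them against the template picker before relying on this:

    from googleapiclient.discovery import build

    # Placeholders -- substitute your own project, region, and bucket.
    PROJECT = "<myprojectid>"
    REGION = "us-central1"
    BUCKET = "<mybucket>"

    dataflow = build("dataflow", "v1b3")

    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        # Assumed path of the Google-provided "Pub/Sub Topic to BigQuery" template.
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "streaming-name-pipeline",
            "parameters": {
                "inputTopic": "projects/pubsub-public-data/topics/tablename-realtime",
                "outputTableSpec": PROJECT + ":tablename.realtime",
            },
            "environment": {"tempLocation": "gs://" + BUCKET + "/tmp/"},
        },
    ).execute()

    # The response includes the new job, whose ID can be used to monitor or stop it.
    print(response["job"]["id"])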

Steps of a Dataflow Job

The console displays a Dataflow job as a graph of named steps, one for each transform in the pipeline. For the Pub/Sub Topic to BigQuery template, the steps read messages from the input topic, convert them into BigQuery table rows, and write the rows to the output table.

Dataflow Job Topology and Metrics

System lag: The maximum time that an item of data has been awaiting processing.

Data watermark: Timestamp marking the estimated completion of data input for this step.

Wall time: Approximate time spent in this step on initializing, processing data, shuffling data, and terminating, across all threads in all workers. For composite steps, it is the sum of time spent in the component steps. This estimate helps identify slow steps.

Job region: The regional endpoint where metadata is stored and handled for this job. This may be distinct from where the job's workers are deployed.

Current workers: The number of workers your job is currently using. You can find more information about your job's workers on the JOB METRICS tab, under the Autoscaling section.

SDK version: The Apache Beam SDK used to build the job; for this lab's template it reports Apache Beam SDK for Java 2.23.0.

How to Stop a Dataflow Job

  1. Navigate back to Dataflow.
  2. Click the streaming-name-pipeline job.
  3. Click Stop, select Cancel, and then click Stop Job.
  4. This will free up resources for your project.
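
Jobs can also be stopped without the console. A minimal sketch, assuming the google-api-python-client library and using placeholder project, region, and job ID values, requests cancellation through the Dataflow API:

    from googleapiclient.discovery import build

    # Placeholders -- substitute your own project, region, and job ID.
    PROJECT = "<myprojectid>"
    REGION = "us-central1"
    JOB_ID = "<jobid>"

    dataflow = build("dataflow", "v1b3")

    # Request cancellation by setting the job's requested state.
    # (A drain, which finishes in-flight data first, can be requested instead.)
    dataflow.projects().locations().jobs().update(
        projectId=PROJECT,
        location=REGION,
        jobId=JOB_ID,
        body={"requestedState": "JOB_STATE_CANCELLED"},
    ).execute()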