Dataflow

Dataflow is a serverless way to carry out data analysis. In this lab, you set up a streaming data pipeline to read sensor data from Pub/Sub, compute the maximum temperature within a time window, and write the results out to BigQuery. Dataflow supports streaming pipelines like this one and is based on Apache Beam, so some knowledge of Beam is desirable.
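As a rough illustration of what such a pipeline looks like under the hood, here is a minimal Apache Beam (Java) sketch that reads sensor readings from a Pub/Sub topic, computes the maximum temperature per fixed window, and writes the result to BigQuery. The topic, table, field name, and two-minute window are illustrative assumptions, not the lab's actual configuration; the lab itself launches a pre-built template rather than custom code.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class MaxTemperaturePipeline {
  public static void main(String[] args) {
    // Run with e.g. --runner=DataflowRunner --project=... --region=... --streaming
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/MY_PROJECT/topics/sensor-temps"))
     // Assume each message is a plain-text reading such as "23.4".
     .apply("ParseTemperature",
            MapElements.into(TypeDescriptors.doubles())
                       .via((String msg) -> Double.parseDouble(msg.trim())))
     // Group readings into fixed two-minute windows.
     .apply("Window", Window.<Double>into(FixedWindows.of(Duration.standardMinutes(2))))
     // Maximum temperature per window.
     .apply("MaxTemperature", Max.doublesGlobally().withoutDefaults())
     .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                       .via((Double max) -> new TableRow().set("max_temp", max)))
     .setCoder(TableRowJsonCoder.of())
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                      .to("MY_PROJECT:sensors.max_temperature")
                      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```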

How does Dataflow fit into ETL on GCP?

Dataflow isn't the only option for running ETL on GCP. Here are the options.

ETL Example

  1. Extract data from Pub/Sub, Cloud Storage, Cloud Spanner, or Cloud SQL.
  2. Transform the data using Cloud Dataflow.
  3. Have the Dataflow pipeline write to BigQuery (a batch sketch follows this list).
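The same extract/transform/load shape also works for batch sources. Below is a minimal Beam (Java) sketch that extracts CSV lines from Cloud Storage, transforms them into BigQuery rows, and loads them into a table. The bucket path, the CSV layout (sensor_id,temperature), and the table name are assumptions for illustration only.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class GcsToBigQueryEtl {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Extract: read raw CSV files from Cloud Storage.
    p.apply("Extract", TextIO.read().from("gs://MY_BUCKET/raw/*.csv"))
     // Transform: parse each line into a BigQuery row (assumed layout: sensor_id,temperature).
     .apply("Transform",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                       .via((String line) -> {
                         String[] fields = line.split(",");
                         return new TableRow()
                             .set("sensor_id", fields[0])
                             .set("temperature", Double.parseDouble(fields[1]));
                       }))
     .setCoder(TableRowJsonCoder.of())
     // Load: append the rows into an existing BigQuery table.
     .apply("Load",
            BigQueryIO.writeTableRows()
                      .to("MY_PROJECT:sensors.readings")
                      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```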

Analogy to a construction site

How do you transform these raw materials into those useful pieces? That's the job of the worker. Funnily enough, as you'll see later when we talk about data pipelines, the individual unit behind the scenes in Cloud Dataflow is literally called a worker, and a worker is just a virtual machine that takes a small piece of data and transforms it for you.

Create a Dataflow pipeline

  1. Open console
  2. Navigation menu > Dataflow
  3. Create job from template.
  4. Enter streaming-name-pipeline as the Job name for your Dataflow job.
  5. Under Dataflow template, select the Pub/Sub Topic to BigQuery template.
  6. Under Input Pub/Sub topic, enter projects/pubsub-public-data/topics/tablename-realtime
  7. Under BigQuery output table, enter <myprojectid>:tablename.realtime
  8. Under Temporary location, enter gs://<mybucket>/tmp/
  9. Click the Run Job button.
  10. A new streaming job has started! You can now see a visual representation of the data pipeline, and you can confirm data is arriving by querying the output table (see the sketch below).
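Once the job is running, one way to check that the template is writing data is to query the output table. Here is a minimal sketch using the BigQuery client library for Java; it assumes Application Default Credentials are configured, and the project, dataset, and table names are placeholders that should match the output table entered above.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class CheckOutputTable {
  public static void main(String[] args) throws InterruptedException {
    // Uses Application Default Credentials and the default project.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder table reference; substitute your own project, dataset, and table.
    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
        "SELECT * FROM `MY_PROJECT.tablename.realtime` LIMIT 10").build();

    // Print a few rows to confirm the streaming pipeline is landing data.
    TableResult result = bigquery.query(query);
    for (FieldValueList row : result.iterateAll()) {
      System.out.println(row);
    }
  }
}
```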

Steps of a Dataflow job

Below are the step names and job metrics shown in the Dataflow console, with their definitions.

Topology of a Dataflow job

System lag

The maximum time that an item of data has been awaiting processing.

Data watermark

Timestamp marking the estimated completion of data input to this step.

Wall time

Approximate time spent in this step on initializing, processing data, shuffling data, and terminating, across all threads in all workers. For composite steps, the sum of time spent in the component steps. This estimate helps you identify slow steps.

Job region

The regional endpoint where metadata is stored and handled for this job. This may be distinct from the location where the job's workers are deployed.

Current workers

The number of workers your job is currently using. Find more information about your job's workers on the JOB METRICS tab, under the Autoscaling section.

SDK version

The version of the Apache Beam SDK the job was built with, for example Apache Beam SDK for Java 2.23.0.

Stop the Dataflow job

  1. Navigate back to Dataflow.
  2. Click the streaming-name-pipeline job.
  3. Click Stop and select Cancel > Stop Job.
  4. This will free up resources for your project.