Dataflow is a serverless way to carry out data analysis on GCP. In this lab, you set up a streaming data pipeline that reads sensor data from Pub/Sub, computes the maximum temperature within a time window, and writes the results to BigQuery. Dataflow runs streaming pipelines like this one and is based on Apache Beam, so some familiarity with Beam is helpful.
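The core of such a pipeline is fairly small. The sketch below, written against the Beam Java SDK, is a minimal illustration rather than the lab's actual code: the JSON message shape and "temperature" field name, the two-minute window size, and the output table schema (max_temperature and window_end columns) are all assumptions.

```java
// Minimal sketch of the streaming pipeline described above (Beam Java SDK).
// Message format, field names, window size, and output schema are assumptions.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class MaxTemperaturePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadFromPubSub", PubsubIO.readStrings()
            .fromTopic("projects/pubsub-public-data/topics/tablename-realtime"))
        // Pull the numeric temperature out of each message; malformed messages are dropped.
        .apply("ExtractTemperature", ParDo.of(new DoFn<String, Double>() {
          @ProcessElement
          public void processElement(@Element String json, OutputReceiver<Double> out) {
            try {
              // Assumes messages shaped like {"temperature": 21.5}; a real pipeline
              // would use a JSON parser instead of a regular expression.
              String value = json.replaceAll(
                  ".*\"temperature\"\\s*:\\s*(-?[0-9.]+).*", "$1");
              out.output(Double.parseDouble(value));
            } catch (NumberFormatException e) {
              // Skip messages without a parsable temperature.
            }
          }
        }))
        // Divide the unbounded stream into fixed two-minute windows.
        .apply("Window", Window.<Double>into(FixedWindows.of(Duration.standardMinutes(2))))
        // Compute the maximum temperature within each window.
        .apply("MaxPerWindow", Max.doublesGlobally().withoutDefaults())
        // Turn each per-window maximum into a BigQuery row.
        .apply("ToTableRow", ParDo.of(new DoFn<Double, TableRow>() {
          @ProcessElement
          public void processElement(@Element Double max, OutputReceiver<TableRow> out,
              IntervalWindow window) {
            out.output(new TableRow()
                .set("max_temperature", max)
                .set("window_end", window.end().toString()));
          }
        }))
        // Stream the rows into the existing output table.
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("<myprojectid>:tablename.realtime")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```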
Dataflow isn't the only option for running ETL on GCP; Dataproc (managed Spark and Hadoop) and Cloud Data Fusion or Dataprep (UI-driven pipeline building) are common alternatives.
How do you transform raw materials into useful pieces? That's the job of the worker. As you'll see later when we talk about data pipelines, the name is literal: the individual processing unit behind the scenes on Cloud Dataflow is actually called a worker. A worker is simply a virtual machine that takes a small piece of data and transforms that piece for you.
Pub/Sub input topic: projects/pubsub-public-data/topics/tablename-realtime
BigQuery output table: <myprojectid>:tablename.realtime
Temporary location: gs://<mybucket>/tmp/
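If you build the pipeline yourself rather than launching a Dataflow template, values like these are typically passed in as pipeline options. The interface below is a hypothetical sketch: the option names inputTopic and outputTable are illustrative, while tempLocation is a built-in Beam option that covers the gs://<mybucket>/tmp/ path.

```java
// Hypothetical custom options; inputTopic and outputTable are made-up names,
// while tempLocation (gs://<mybucket>/tmp/) is a built-in PipelineOptions field.
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

public interface MaxTemperatureOptions extends PipelineOptions {
  @Description("Pub/Sub topic to read from, e.g. projects/pubsub-public-data/topics/tablename-realtime")
  @Validation.Required
  String getInputTopic();
  void setInputTopic(String value);

  @Description("BigQuery output table in <project>:<dataset>.<table> form")
  @Validation.Required
  String getOutputTable();
  void setOutputTable(String value);
}
```

On the command line these surface as --inputTopic=..., --outputTable=..., and --tempLocation=gs://<mybucket>/tmp/, alongside --runner=DataflowRunner when the job is submitted to Dataflow.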
Below are some of the metrics and job details shown for a typical Dataflow job:
System lag: The maximum time that an item of data has been awaiting processing.
Data watermark: Timestamp marking the estimated completion of data input for this step.
Wall time: Approximate time spent in this step on initializing, processing data, shuffling data, and terminating, across all threads in all workers. For composite steps, it is the sum of time spent in the component steps. This estimate helps identify slow steps.
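The data watermark is what tells a streaming pipeline when a window's input can be considered complete. The fragment below is a small illustration, not part of the lab: the two-minute window and 30-second allowed lateness are arbitrary values chosen for the example.

```java
// Illustration of how the watermark drives window output; the window size and
// allowed lateness are arbitrary example values.
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WatermarkWindowing {
  // Windowing strategy for a stream of Double temperature readings.
  public static Window<Double> twoMinuteWindows() {
    return Window.<Double>into(FixedWindows.of(Duration.standardMinutes(2)))
        // Emit a window's result once the watermark (the estimate that all input
        // for that window has arrived) passes the end of the window.
        .triggering(AfterWatermark.pastEndOfWindow())
        // Keep the window's state so data arriving up to 30 seconds behind the
        // watermark is not dropped.
        .withAllowedLateness(Duration.standardSeconds(30))
        // Emit only new data in any later pane instead of re-emitting prior results.
        .discardingFiredPanes();
  }
}
```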
Job region: The regional endpoint where metadata is stored and handled for this job. This may be distinct from where the job's workers are deployed.
Current workers: The number of workers your job is currently using. Find more information about your job's workers in the "JOB METRICS" tab under the "Autoscaling" section.
SDK version: Apache Beam SDK for Java 2.23.0