Dataflow is a serverless way to carry out data analysis. In this lab, you set up a streaming data pipeline that reads sensor data from Pub/Sub, computes the maximum temperature within a time window, and writes the results to BigQuery. Dataflow supports streaming, and it is based on Apache Beam, so some familiarity with Beam is desirable.
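As a rough sketch of the shape such a pipeline can take (this is not the lab's actual code; the topic, table, field names, and two-minute window below are placeholder assumptions), a Beam Java pipeline along these lines reads from Pub/Sub, applies a fixed window, takes the per-window maximum, and writes it to BigQuery:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class MaxTempPipeline {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true); // Pub/Sub is an unbounded source, so run in streaming mode.
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/YOUR_PROJECT/topics/sensor-readings"))
        // Assume each message body is a plain-text temperature reading, e.g. "23.4".
        .apply("ParseTemperature",
            MapElements.into(TypeDescriptors.doubles()).via((String s) -> Double.parseDouble(s)))
        // Group readings into fixed two-minute windows.
        .apply("Window", Window.<Double>into(FixedWindows.of(Duration.standardMinutes(2))))
        // Compute the maximum temperature within each window.
        .apply("MaxTemperature", Max.doublesGlobally().withoutDefaults())
        // Turn each per-window maximum into a BigQuery row.
        .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((Double max) -> new TableRow().set("max_temperature", max)))
        // Write the rows to a BigQuery table, creating it if needed.
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("YOUR_PROJECT:sensors.max_temperatures")
                .withSchema(new TableSchema().setFields(Collections.singletonList(
                    new TableFieldSchema().setName("max_temperature").setType("FLOAT"))))
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Launched with the Dataflow runner, this kind of pipeline keeps running and emits one row per window as new sensor messages arrive.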
Dataflow isn't the only option for running ETL on GCP, but it is the one this lab uses.
How do you transform these raw materials into those useful pieces? That's the job of the worker. As you'll see later when we talk about data pipelines, the name is quite literal: the individual unit behind the scenes in Cloud Dataflow is called a worker, and a worker is actually just a virtual machine. It takes some small piece of data and transforms that piece for you.
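To make that concrete, here is a minimal sketch (my illustration, not code from the lab) of the kind of per-element transform a worker executes; the Celsius-to-Fahrenheit conversion is a hypothetical example:

```java
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical per-element transform: convert one temperature reading
// from Celsius to Fahrenheit. A worker VM runs many of these calls,
// each on one small piece of the data.
public class CelsiusToFahrenheitFn extends DoFn<Double, Double> {
  @ProcessElement
  public void processElement(@Element Double celsius, OutputReceiver<Double> out) {
    // One small piece of data in, one transformed piece out.
    out.output(celsius * 9.0 / 5.0 + 32.0);
  }
}
```

In a pipeline this would be attached with something like ParDo.of(new CelsiusToFahrenheitFn()), and Dataflow fans those calls out across however many worker VMs the job is using.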
System lag
The maximum time that an item of data has been awaiting processing.
Data watermark
Timestamp marking the estimated completion of data input to this step.
Wall time
Approximate time spent in this step on initializing, processing data, shuffling data, and terminating, across all threads in all workers. For composite steps, the sum of time spent in the component steps. This estimate helps you identify slow steps.
Job region
The regional endpoint where metadata is stored and handled for this job. This may be distinct from the location where the job's workers are deployed. (A configuration sketch follows these definitions.)
Current workers
The number of workers your job is currently using. Find more information about your job's workers in the "Job metrics" tab under the "Autoscaling" section.
SDK version
Apache Beam SDK for Java 2.23.0
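The job region and worker location are normally chosen through pipeline options when the job is launched. A minimal sketch, assuming the DataflowPipelineOptions setters for the --region and --workerRegion flags available in recent Beam releases (the region values are placeholders):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RegionOptionsExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    // Job region: the regional endpoint that stores and handles job metadata.
    options.setRegion("us-central1");
    // Worker location (assumed setter, corresponding to the --workerRegion flag):
    // where the worker VMs are deployed; this can differ from the job region.
    options.setWorkerRegion("us-west1");
    Pipeline p = Pipeline.create(options);
    // ... transforms would be added here before p.run().
  }
}
```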