Dataproc

What is Apache Beam?

Apache Beam is an open-source, portable programming model for data processing. It allows you to run pipelines in a highly distributed fashion. Beam offers a unified model, meaning the same pipeline code can support both batch and streaming data processing.

Steps in an Apache Beam Pipeline

Inputs → Data → Transforms → Data → Outputs

Apache Beam Pipelines

Pipelines are written using one of the Beam SDKs: Java, Python, or Go.

Four Key Concepts

PCollections

A PCollection represents a data set—either bounded (batch) or unbounded (streaming). There are no size limits. It is called a “parallel collection” because it is processed in parallel across many workers.

For streaming data, the PCollection is unbounded—it has no defined end. Each element in a PCollection can be individually accessed and processed, enabling scalable, distributed data processing.