Apache Beam

What is Apache Beam?

Apache Beam is a portable, open-source programming model for data processing that can run in a highly distributed fashion. It is unified: there is a single model, so the same pipeline code can handle both batch and streaming data.

Unified - Use a single programming model for both batch and streaming use cases

Portable - Execute pipelines on multiple execution environments

Extensible - Write and share new SDKs, IO connectors, and transformation libraries

Steps in Apache Beam

Inputs --> Data --> Transforms --> Data --> Outputs

Apache Beam Pipelines

Apache Beam pipelines are written in Java, Python, or Go.

4 Concepts

PTransforms
PCollections
Pipelines
Pipeline Runners
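
The four concepts above can be seen together in a small word-count sketch using the Beam Python SDK. The input lines, the step labels, and the helper names `split_words`, `format_count`, and `run` are assumptions made for illustration; printing stands in for a real output sink.

```python
def split_words(line):
    """Per-element logic: turn one line of text into a list of lowercase words."""
    return line.lower().split()


def format_count(word_count):
    """Per-element logic: render a (word, count) pair as a text line."""
    word, count = word_count
    return f"{word}: {count}"


def run(lines):
    """Build and run a small word-count pipeline over an in-memory list."""
    # Imported here so the plain helper functions above stay usable
    # (and testable) even where the Beam SDK is not installed.
    import apache_beam as beam

    # Pipeline: the whole graph of work. With no options given, the
    # Python SDK uses the local DirectRunner as the pipeline runner.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            # Input: an in-memory list becomes a bounded PCollection.
            | "CreateLines" >> beam.Create(lines)
            # PTransforms: each step takes a PCollection and yields a new one.
            | "SplitWords" >> beam.FlatMap(split_words)
            | "CountWords" >> beam.combiners.Count.PerElement()
            | "FormatCounts" >> beam.Map(format_count)
            # Output: printing stands in for a real sink such as a file.
            | "PrintCounts" >> beam.Map(print)
        )
```

Calling `run(["the quick brown fox", "the lazy dog"])` prints one line per distinct word. The order of the printed lines is not guaranteed, because elements are processed in parallel.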

PCollections

A PCollection represents a data set, which can be either streaming data or batch data.
There is no size limit on a PCollection, whether it is bounded or unbounded.
That's why it's called a PCollection, or parallel collection.
The more data there is, the more it is distributed in parallel across additional workers.
For streaming data, the PCollection is simply unbounded: it has no end.
Each element inside a PCollection can be individually accessed and processed.
This is how distributed processing of the PCollection is implemented.
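
The idea of independent elements spread across workers can be illustrated with a plain-Python sketch. It does not use Beam and is not how any Beam runner is implemented: a thread pool stands in for the workers, and `process_element`, `process_in_parallel`, and the sample inputs are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import count, islice


def process_element(element):
    """Hypothetical per-element step; any independent function would do."""
    return element * element


def process_in_parallel(elements, workers=4):
    """Apply process_element to every element using a pool of workers.

    Elements do not depend on each other, so each one can be handed to
    whichever worker is free. map() returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_element, elements))


# Bounded input: a finite collection, processed once to completion.
bounded_result = process_in_parallel(range(6))

# Unbounded input: count() never ends, so only a finite slice of it
# (here, the first five elements) can be processed at a time.
unbounded_source = count(start=10)
first_slice = process_in_parallel(islice(unbounded_source, 5))
```

Because no element depends on another, raising `workers` changes only how the work is spread out, not the result.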