What is Apache Beam?
Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the Beam SDKs (Java, Python, or Go), you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.
The Beam Model
The Beam programming model simplifies the mechanics of large-scale data processing. At its core, a Beam pipeline:
- Reads data from one or more sources
- Transforms the data through a series of operations
- Writes the results to one or more destinations
Core Components
Every Beam pipeline is built using these fundamental building blocks:
Pipeline
A Pipeline encapsulates your entire data processing task, from start to finish. It manages a directed acyclic graph (DAG) of PTransforms and the PCollections they consume and produce.
PCollection
A PCollection represents a distributed dataset that your Beam pipeline operates on. It can be bounded (finite) or unbounded (infinite/streaming). PCollections are immutable - transforms create new PCollections rather than modifying existing ones.
Key characteristics:
- Immutable: Once created, elements cannot be changed
- Distributed: Data is distributed across multiple workers
- Timestamped: Each element has an associated timestamp
- Windowed: Elements are organized into windows for processing
PTransform
A PTransform represents a data processing operation that takes one or more PCollections as input and produces one or more PCollections as output. Transforms can be:
- Root transforms: Read data from external sources (e.g., TextIO.Read, Create)
- Processing transforms: Transform data (e.g., ParDo, Map, Filter)
- Aggregating transforms: Combine data (e.g., GroupByKey, Combine)
- Output transforms: Write data to external sinks (e.g., TextIO.Write)
Data Processing Paradigms
Bounded vs Unbounded Data
Bounded data has a fixed, finite size and is handled as batch processing. Unbounded data is generated continuously and is handled as streaming processing. Beam handles both with the same programming model.
- Bounded (Batch): Traditional batch processing of finite datasets
- Unbounded (Streaming): Continuous processing of data streams
Windowing
Windowing divides unbounded data into logical, finite chunks for processing. Common windowing strategies include:
- Fixed (Tumbling) Windows: Non-overlapping, fixed-size time intervals
- Sliding Windows: Overlapping, fixed-size time intervals
- Session Windows: Dynamic windows based on data activity
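The assignment logic behind fixed and sliding windows can be sketched in plain Python (a conceptual illustration of the math only, not Beam's actual implementation):

```python
# Conceptual sketch of window assignment; not Beam's implementation.

def fixed_window(timestamp: float, size: float) -> tuple:
    """Assign a timestamp to its single non-overlapping (tumbling) window."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

def sliding_windows(timestamp: float, size: float, period: float) -> list:
    """Assign a timestamp to every overlapping window that contains it."""
    windows = []
    start = timestamp - (timestamp % period)
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

# A timestamp falls in exactly one fixed window...
print(fixed_window(65, 60))         # (60, 120)
# ...but in several sliding windows when size > period.
print(sliding_windows(65, 60, 30))  # [(30, 90), (60, 120)]
```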
Triggers
Triggers determine when to emit results for a window. They control:
- When to produce early (speculative) results
- When to produce final results
- How to handle late-arriving data
Why These Concepts Matter
Understanding these core concepts is essential because they form the vocabulary for expressing any data processing task in Beam, whether batch or streaming.
Batch pipelines typically:
- Work with bounded PCollections
- Use global windowing (implicit)
- Rely on default triggers
Streaming pipelines additionally:
- Handle unbounded PCollections
- Define explicit windowing strategies
- Configure triggers for timely results
- Manage late data
Next Steps
Now that you understand the overview, dive deeper into each concept:
- Programming Model: Learn how to structure Beam pipelines
- Pipelines: Create and configure pipelines
- PCollections: Work with distributed datasets
- Transforms: Apply data transformations
Common Patterns
Here’s a simple example showing how these concepts work together:
- Creating a Pipeline
- Reading data into a PCollection
- Applying PTransforms to process data
- The immutability of PCollections (each transform creates a new one)