What is Apache Beam?
Apache Beam evolved from several internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. The model provides a general approach to expressing embarrassingly parallel data processing pipelines that work seamlessly across both batch and streaming data sources.

Quickstart
Get started with Apache Beam in minutes with hands-on examples in Java, Python, and Go
Core concepts
Learn about PCollections, PTransforms, Pipelines, and PipelineRunners
SDKs
Explore language-specific SDKs for Java, Python, Go, and TypeScript
Runners
Execute pipelines on Flink, Spark, Dataflow, and other distributed backends
Examples
Browse comprehensive examples including WordCount, streaming, and ML pipelines
API Reference
Detailed API documentation for Java, Python, and Go SDKs
I/O Connectors
Connect to various data sources and sinks with built-in I/O transforms
Key features
Unified batch and streaming
Write your pipeline logic once and run it on both batch and streaming data sources. Beam’s unified model eliminates the need to maintain separate codebases for batch and streaming processing.

Multi-language SDKs
Beam currently provides SDKs for:
- Java: Full-featured SDK with extensive ecosystem support
- Python: Pythonic API with support for data science workflows
- Go: Idiomatic Go SDK for high-performance pipelines
- TypeScript: JavaScript/TypeScript SDK for web and Node.js environments
Portable pipelines
Run the same pipeline on multiple execution engines without code changes. Beam supports:
- DirectRunner: Execute locally for development and testing
- DataflowRunner: Run on Google Cloud Dataflow
- FlinkRunner: Execute on Apache Flink clusters
- SparkRunner: Run on Apache Spark clusters
- PrismRunner: Local execution using Beam Portability
Core programming model
Beam pipelines are built using four key concepts:

PCollection
Represents a distributed dataset that can be bounded (batch) or unbounded (streaming). PCollections are immutable and can contain elements of any type.
PTransform
A data processing operation that takes one or more PCollections as input and produces one or more PCollections as output. Common transforms include ParDo, GroupByKey, and Combine.

Pipeline
A directed acyclic graph (DAG) of PTransforms and PCollections that defines your entire data processing workflow. Pipelines are constructed programmatically using the SDK.

PipelineRunner
Translates the pipeline into the API of a specific execution backend, such as the DirectRunner for local execution or the FlinkRunner, SparkRunner, and DataflowRunner for distributed clusters.
Getting help
The Apache Beam community is active and helpful:
- Mailing lists: Subscribe to user@beam.apache.org for questions and dev@beam.apache.org for development discussions
- Slack: Join the #beam channel on ASF Slack
- GitHub: Report issues and browse source code at github.com/apache/beam
- Stack Overflow: Ask questions tagged with apache-beam
Next steps
Try the quickstart
Build and run your first Apache Beam pipeline in less than 5 minutes