What is Apache Beam?
Apache Beam evolved from several internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. The model provides a general approach to expressing embarrassingly parallel data processing pipelines that work seamlessly across both batch and streaming data sources.

Quickstart
Get started with Apache Beam in minutes with hands-on examples in Java, Python, and Go
Core concepts
Learn about PCollections, PTransforms, Pipelines, and PipelineRunners
SDKs
Explore language-specific SDKs for Java, Python, Go, and TypeScript
Runners
Execute pipelines on Flink, Spark, Dataflow, and other distributed backends
Examples
Browse comprehensive examples including WordCount, streaming, and ML pipelines
API Reference
Detailed API documentation for Java, Python, and Go SDKs
I/O Connectors
Connect to various data sources and sinks with built-in I/O transforms
Key features
Unified batch and streaming
Write your pipeline logic once and run it on both batch and streaming data sources. Beam’s unified model eliminates the need to maintain separate codebases for batch and streaming processing.

Multi-language SDKs
Beam currently provides SDKs for:
- Java: Full-featured SDK with extensive ecosystem support
- Python: Pythonic API with support for data science workflows
- Go: Idiomatic Go SDK for high-performance pipelines
- TypeScript: JavaScript/TypeScript SDK for web and Node.js environments
Portable pipelines
Run the same pipeline on multiple execution engines without code changes. Beam supports:
- DirectRunner: Execute locally for development and testing
- DataflowRunner: Run on Google Cloud Dataflow
- FlinkRunner: Execute on Apache Flink clusters
- SparkRunner: Run on Apache Spark clusters
- PrismRunner: Local execution using Beam Portability
Core programming model
Beam pipelines are built using four key concepts:

PCollection
Represents a distributed dataset that can be bounded (batch) or unbounded (streaming). PCollections are immutable and can contain elements of any type.
PTransform
A data processing operation that takes one or more PCollections as input and produces one or more PCollections as output. Common transforms include ParDo, GroupByKey, and Combine.

Pipeline
A directed acyclic graph (DAG) of PTransforms and PCollections that defines your entire data processing workflow. Pipelines are constructed programmatically using the SDK.

PipelineRunner
Translates the pipeline into the API of a specific execution backend, such as the DirectRunner for local execution or the FlinkRunner, SparkRunner, and DataflowRunner for distributed clusters.
Getting help
The Apache Beam community is active and helpful:
- Mailing lists: Subscribe to user@beam.apache.org for questions and dev@beam.apache.org for development discussions
- Slack: Join the #beam channel on ASF Slack
- GitHub: Report issues and browse source code at github.com/apache/beam
- Stack Overflow: Ask questions tagged with apache-beam
Next steps
Try the quickstart
Build and run your first Apache Beam pipeline in less than 5 minutes