Installation
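The core SDK is published to Maven Central as org.apache.beam:beam-sdks-java-core; a typical dependency declaration looks like the following sketch (the version property is a placeholder, so substitute the current Beam release):

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>${beam.version}</version>
</dependency>
```

A runner artifact, such as beam-runners-direct-java for local execution, is added alongside the core SDK.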
Quick Start
Here’s a simple word count example demonstrating core Beam concepts.

Core Concepts
Pipeline
A Pipeline encapsulates your entire data processing task. It manages the directed acyclic graph (DAG) of transforms and data:
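A minimal sketch of the word count pipeline mentioned above (the file paths and step names are illustrative; the transforms are from the Beam Java SDK):

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    // The Pipeline owns the DAG; transforms are added with apply().
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("input.txt"))
        .apply("SplitWords",
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.toLowerCase().split("\\W+"))))
        .apply("CountWords", Count.perElement())
        .apply("Format",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to("counts"));

    // Nothing executes until run() is called.
    p.run().waitUntilFinish();
  }
}
```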
PCollection
PCollection<T> represents a distributed dataset that your pipeline operates on. PCollections can be bounded (batch) or unbounded (streaming).
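For example, a bounded PCollection can be built from in-memory values with Create (the values here are illustrative):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// Bounded: a fixed, in-memory dataset. Sources such as KafkaIO
// produce unbounded PCollections instead.
PCollection<String> words = p.apply(Create.of("alpha", "beta", "gamma"));
```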
PTransform
Transforms define the data processing operations. Common transforms include:

- ParDo: Parallel processing for element-wise transformations
- GroupByKey: Groups elements by key
- Combine: Combines grouped values
- Flatten: Merges multiple PCollections
- Partition: Splits a PCollection into multiple collections
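For instance, grouping and combining per key might look like this sketch (assuming a pipeline `p`; the keys and values are illustrative):

```java
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<KV<String, Integer>> scores =
    p.apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)));

// GroupByKey collects all values for each key into an Iterable.
PCollection<KV<String, Iterable<Integer>>> grouped =
    scores.apply(GroupByKey.<String, Integer>create());

// Combine-style transforms reduce the values per key, here to a sum.
PCollection<KV<String, Integer>> totals = scores.apply(Sum.integersPerKey());
```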
DoFn
DoFn is the primary way to implement custom processing logic:
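A sketch of a custom DoFn applied with ParDo (assuming a PCollection&lt;String&gt; named `lines`; the splitting rule is illustrative):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Emits each non-empty word of a line.
static class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    for (String word : line.split("\\W+")) {
      if (!word.isEmpty()) {
        out.output(word);
      }
    }
  }
}

// Applied with ParDo:
PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
```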
SDK Features
Type Safety
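Because PCollections and transforms carry type parameters, a mismatched transform fails at compile time rather than at runtime. A sketch (assuming a PCollection&lt;String&gt; named `words`):

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// The output type is declared with a TypeDescriptor and checked
// against the lambda's return type.
PCollection<Integer> lengths =
    words.apply(MapElements.into(TypeDescriptors.integers())
        .via((String word) -> word.length()));

// A lambda taking the wrong input type would be rejected by the compiler.
```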
The Java SDK provides compile-time type checking for transforms and data types.

Windowing
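A sketch applying fixed one-minute windows (assuming a timestamped PCollection&lt;String&gt; named `events`; Duration is the Joda-Time class Beam uses):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Downstream aggregations (GroupByKey, Combine) then run per window.
PCollection<String> windowed =
    events.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
```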
Process unbounded data using time-based windows.

Side Inputs
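A sketch that filters a main input against a singleton side input (assuming a PCollection&lt;Integer&gt; named `numbers`):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the global mean as a side input...
final PCollectionView<Double> meanView =
    numbers.apply(Mean.<Integer>globally()).apply(View.asSingleton());

// ...and read it inside a DoFn while processing the main input.
PCollection<Integer> aboveMean =
    numbers.apply(ParDo.of(new DoFn<Integer, Integer>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element() > c.sideInput(meanView)) {
          c.output(c.element());
        }
      }
    }).withSideInputs(meanView));
```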
Access additional data during processing.

Metrics and Logging
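Counters, distributions, and gauges are created through the Metrics factory and reported by the runner. A counter sketch (the metric name is illustrative):

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

static class CountEmptyLinesFn extends DoFn<String, String> {
  // The namespace and name identify the metric in the runner's UI and APIs.
  private final Counter emptyLines =
      Metrics.counter(CountEmptyLinesFn.class, "emptyLines");

  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    if (line.trim().isEmpty()) {
      emptyLines.inc();
    }
    out.output(line);
  }
}
```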
Monitor pipeline execution with built-in metrics.

I/O Connectors
The Java SDK includes extensive I/O support:

- File-based: TextIO, AvroIO, ParquetIO, XML, JSON
- Google Cloud: BigQuery, Bigtable, Spanner, Pub/Sub, Cloud Storage
- Apache: Kafka, HBase, Cassandra, Solr, Hadoop
- Databases: JDBC, MongoDB, Redis, Elasticsearch
- Streaming: Kinesis, MQTT, AMQP
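For example, reading and writing text files with TextIO (assuming a pipeline `p`; the bucket and paths are illustrative, and the same read/write pattern applies to the other connectors):

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Read a file glob into a PCollection of lines...
PCollection<String> lines =
    p.apply(TextIO.read().from("gs://my-bucket/input-*.txt"));

// ...and write results back out as sharded text files.
lines.apply(TextIO.write().to("gs://my-bucket/output").withSuffix(".txt"));
```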
Running Pipelines
Direct Runner (Local)
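With beam-runners-direct-java on the classpath, the DirectRunner can be selected explicitly, or simply left as the default. A sketch:

```java
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectRunner.class);  // executes the pipeline in-process
Pipeline p = Pipeline.create(options);
```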
The Direct Runner is intended for testing and development.

Google Cloud Dataflow
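A sketch of Dataflow options (the project, region, and bucket values are illustrative; this requires the beam-runners-google-cloud-dataflow-java artifact):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-gcp-project");          // illustrative
options.setRegion("us-central1");              // illustrative
options.setTempLocation("gs://my-bucket/tmp"); // illustrative
Pipeline p = Pipeline.create(options);
```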
Use Dataflow for production workloads on Google Cloud.

Apache Flink
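A sketch of Flink runner options (the master address is illustrative; this requires the Flink runner artifact matching your Flink version):

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

FlinkPipelineOptions options =
    PipelineOptionsFactory.create().as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
options.setFlinkMaster("localhost:8081");  // illustrative
Pipeline p = Pipeline.create(options);
```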
The Flink runner executes your pipeline on an Apache Flink cluster.

Best Practices
Use Composite Transforms
Encapsulate reusable pipeline logic:
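A composite transform subclasses PTransform and builds its internal steps in expand(); the CountWords name here is illustrative:

```java
import java.util.Arrays;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public static class CountWords
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    return lines
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("\\W+"))))
        .apply(Count.perElement());
  }
}

// Callers then see a single logical step: lines.apply(new CountWords())
```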
Implement Efficient DoFns
Use @Setup and @Teardown for expensive initialization:
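A sketch, in which the HTTP client stands in for any expensive resource and is created once per DoFn instance rather than once per element:

```java
import java.net.http.HttpClient;
import org.apache.beam.sdk.transforms.DoFn;

static class EnrichFn extends DoFn<String, String> {
  private transient HttpClient client;  // hypothetical expensive resource

  @Setup
  public void setup() {
    // Runs once per DoFn instance, not once per element.
    client = HttpClient.newHttpClient();
  }

  @Teardown
  public void teardown() {
    // Release connections, files, or other resources here.
  }

  @ProcessElement
  public void processElement(@Element String id, OutputReceiver<String> out) {
    out.output(id);  // a real DoFn would use the client per element here
  }
}
```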
Handle Large State Efficiently
Use the State API for stateful processing:
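A sketch of per-key deduplication with a ValueState; state is scoped to key and window, so the input must be keyed:

```java
import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

static class DedupFn extends DoFn<KV<String, String>, String> {
  // Declares a per-key state cell holding a Boolean flag.
  @StateId("seen")
  private final StateSpec<ValueState<Boolean>> seenSpec =
      StateSpecs.value(BooleanCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element,
      @StateId("seen") ValueState<Boolean> seen,
      OutputReceiver<String> out) {
    if (seen.read() == null) {  // first time this key is observed
      seen.write(true);
      out.output(element.getValue());
    }
  }
}
```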
Resources
- API Reference: Complete JavaDoc documentation
- Code Examples: Sample pipelines and patterns
- I/O Transforms: Available connectors and I/O transforms
- Programming Guide: In-depth SDK concepts
Next Steps
- Explore Runners to deploy your pipeline
- Learn about windowing and triggers for streaming data
- Check out I/O connectors for data sources and sinks
- Browse examples for common patterns