Overview
WordCount is the “Hello World” of data processing pipelines. These examples show you how to:
- Read text data from files
- Transform and process data using Beam transforms
- Count and aggregate results
- Write output to files
- Configure and run pipelines on different runners
Start with MinimalWordCount
Begin with the simplest implementation to understand the basic pipeline structure.
MinimalWordCount
The simplest WordCount implementation focuses on the core pipeline structure without additional complexity.
- TextIO.read(): Reads text files line by line
- FlatMapElements: Splits each line into multiple words
- Filter: Removes empty strings
- Count.perElement(): Counts occurrences of each unique word
- MapElements: Formats output as “word: count”
- TextIO.write(): Writes results to files
WordCount with Best Practices
The full WordCount example demonstrates production-ready patterns:
- Custom DoFn classes: Define reusable processing logic
- Metrics: Track pipeline behavior (counters, distributions)
- Composite transforms: Bundle multiple transforms for reuse
- Pipeline options: Configure pipelines via command-line arguments
- Named transforms: Easier debugging and monitoring
Running WordCount
Local Execution
Run with the DirectRunner for local testing:
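For example, a local run of the Python SDK's bundled wordcount example might look like the following (the input and output paths are placeholders; the DirectRunner is the default when no runner is specified):

```shell
python -m apache_beam.examples.wordcount \
    --input /path/to/input.txt \
    --output /tmp/counts
```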
Running on Cloud Dataflow
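A hypothetical Dataflow submission using the Python SDK; the project, region, and bucket values are placeholders to replace with your own (the King Lear input is Google's public sample file):

```shell
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project YOUR_PROJECT \
    --region us-central1 \
    --temp_location gs://YOUR_BUCKET/tmp/ \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://YOUR_BUCKET/counts
```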
Running on Apache Flink
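A hypothetical run against a Flink cluster with the Python SDK; the Flink master address and file paths are placeholders for your environment:

```shell
python -m apache_beam.examples.wordcount \
    --runner FlinkRunner \
    --flink_master localhost:8081 \
    --input /path/to/input.txt \
    --output /tmp/counts
```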
WindowedWordCount
The windowed example shows how to process streaming data with time-based windows:
- Timestamps: Associate data with event times
- Windows: Group data into time-based buckets
- FixedWindows: Non-overlapping time intervals
- Per-window processing: Process each window independently
Key Takeaways
Start Simple
Begin with MinimalWordCount to understand core concepts before adding complexity.
Add Metrics
Use counters and distributions to monitor pipeline behavior in production.
Create Composite Transforms
Bundle related transforms for reusability and better testing.
Use Pipeline Options
Make pipelines configurable via command-line arguments instead of hardcoding values.
Next Steps
Cookbook Examples
Explore common patterns like filtering, joining, and combining data.
Transforms Guide
Learn about all available Beam transforms.
I/O Connectors
Read from and write to various data sources.
Runners
Execute pipelines on different distributed processing backends.