Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/apache/beam/llms.txt

Use this file to discover all available pages before exploring further.

This quickstart will guide you through installing Apache Beam and running your first WordCount pipeline. WordCount is a classic example that demonstrates core Beam concepts: reading data, applying transformations, and writing results.

Choose your language

Install Python SDK

Apache Beam requires Python 3.8 or newer. Install the Beam SDK using pip:
pip install apache-beam
For Google Cloud Dataflow support, install with the gcp extra:
pip install apache-beam[gcp]

Create your pipeline

Create a file named wordcount.py with the following code:
import re
import argparse
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

def main(argv=None):
parser = argparse.ArgumentParser()
parser.add_argument(
    '--input',
    dest='input',
    default='gs://dataflow-samples/shakespeare/kinglear.txt',
    help='Input file to process.')
parser.add_argument(
    '--output',
    dest='output',
    default='output.txt',
    help='Output file to write results to.')

known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)

with beam.Pipeline(options=pipeline_options) as p:
    # Read the text file into a PCollection
    lines = p | ReadFromText(known_args.input)
    
    # Count the occurrences of each word
    counts = (
        lines
        | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))
    
    # Format the counts into strings
    output = counts | 'Format' >> beam.Map(
        lambda word_count: '%s: %s' % (word_count[0], word_count[1]))
    
    # Write the output
    output | WriteToText(known_args.output)

if __name__ == '__main__':
main()

Run the pipeline

Execute the pipeline locally using the DirectRunner:
python wordcount.py --output output.txt
This will process Shakespeare’s King Lear and write word counts to output.txt-00000-of-00001.

What’s happening?

1

Read input

ReadFromText reads lines from the input file into a PCollection of strings.
2

Split into words

FlatMap applies a regex to extract individual words from each line.
3

Create key-value pairs

Map transforms each word into a tuple of (word, 1).
4

Count occurrences

CombinePerKey groups by word and sums the counts.
5

Format output

Another Map formats each word-count pair as a string.
6

Write results

WriteToText writes the formatted results to output files.

Running on other runners

By default, pipelines run locally using the DirectRunner. To run on a distributed backend:
python wordcount.py \
  --runner=DataflowRunner \
  --project=YOUR_PROJECT_ID \
  --region=us-central1 \
  --temp_location=gs://YOUR_BUCKET/temp \
  --output=gs://YOUR_BUCKET/output

Next steps

Programming guide

Learn about advanced transforms, windowing, and triggers

Pipeline I/O

Connect to databases, message queues, and cloud storage

Testing pipelines

Write unit and integration tests for your pipelines

Tour of Beam

Interactive tutorials covering all Beam concepts