The Apache Beam Java SDK provides a comprehensive framework for building batch and streaming data processing pipelines in Java. It offers strong typing, extensive I/O connectors, and excellent IDE support.

Installation

1. Set up Java Development Kit

Install JDK 8 or later. You can verify your Java installation:
java -version
2. Add Beam dependency to your project

Choose your build tool and add the Apache Beam core dependency. For Maven:
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.73.0</version>
</dependency>

<!-- Add a runner dependency -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>2.73.0</version>
  <scope>runtime</scope>
</dependency>
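The Maven coordinates above translate directly for Gradle users. A sketch of the equivalent build configuration (same artifact coordinates and version as the Maven snippet; adjust to match your project):

```groovy
dependencies {
    // Beam core SDK, needed at compile time
    implementation 'org.apache.beam:beam-sdks-java-core:2.73.0'
    // Direct runner, only needed at runtime for local execution
    runtimeOnly 'org.apache.beam:beam-runners-direct-java:2.73.0'
}
```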
3. Add I/O connectors (optional)

Include additional dependencies for specific data sources:
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
  <version>2.73.0</version>
</dependency>

Quick Start

Here’s a simple word count example demonstrating core Beam concepts:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class WordCount {
  
  // DoFn to extract words from each line
  static class ExtractWordsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      String[] words = c.element().split("[^\\p{L}]+");
      for (String word : words) {
        if (!word.isEmpty()) {
          c.output(word);
        }
      }
    }
  }
  
  // Format the word count results
  public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
    @Override
    public String apply(KV<String, Long> input) {
      return input.getKey() + ": " + input.getValue();
    }
  }
  
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    
    pipeline
      .apply("Read", TextIO.read().from("gs://dataflow-samples/shakespeare/*.txt"))
      .apply("ExtractWords", ParDo.of(new ExtractWordsFn()))
      .apply("CountWords", Count.perElement())
      .apply("FormatResults", MapElements.via(new FormatAsTextFn()))
      .apply("Write", TextIO.write().to("output/wordcount"));
    
    pipeline.run().waitUntilFinish();
  }
}
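The word-splitting logic in ExtractWordsFn is plain Java, so it can be sanity-checked outside of a pipeline. A small sketch (the ExtractWordsCheck class and extractWords helper are hypothetical names, mirroring the DoFn's regex and empty-token filter):

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractWordsCheck {
  // Same logic as ExtractWordsFn: split on runs of non-letter
  // characters and drop empty tokens.
  static List<String> extractWords(String line) {
    List<String> out = new ArrayList<>();
    for (String word : line.split("[^\\p{L}]+")) {
      if (!word.isEmpty()) {
        out.add(word);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [To, be, or, not, to, be]
    System.out.println(extractWords("To be, or not to be!"));
  }
}
```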

Core Concepts

Pipeline

A Pipeline encapsulates your entire data processing task. It manages the directed acyclic graph (DAG) of transforms and data:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);

PCollection

PCollection<T> represents a distributed dataset that your pipeline operates on. PCollections can be bounded (batch) or unbounded (streaming).
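For tests and small examples, a bounded PCollection can be built from in-memory data with the Create transform. A minimal sketch, assuming the Direct runner from the installation step is on the classpath:

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CreateExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

    // A bounded, in-memory PCollection -- handy for tests and demos.
    PCollection<String> words =
        pipeline.apply(Create.of(Arrays.asList("hello", "beam")));

    pipeline.run().waitUntilFinish();
  }
}
```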

PTransform

Transforms define the data processing operations. Common transforms include:
  • ParDo: Parallel processing for element-wise transformations
  • GroupByKey: Groups elements by key
  • Combine: Combines grouped values
  • Flatten: Merges multiple PCollections
  • Partition: Splits a PCollection into multiple collections
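The grouping and combining transforms above are usually applied through convenience wrappers. A minimal sketch using Sum.integersPerKey(), which groups by key and combines the values per key (assumes the Direct runner from the installation step):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CombineExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

    // Groups by key, then sums the Integer values for each key:
    // ("a", 1), ("a", 2), ("b", 5)  ->  ("a", 3), ("b", 5)
    PCollection<KV<String, Integer>> totals =
        pipeline
            .apply(Create.of(KV.of("a", 1), KV.of("a", 2), KV.of("b", 5)))
            .apply(Sum.integersPerKey());

    pipeline.run().waitUntilFinish();
  }
}
```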

DoFn

DoFn is the primary way to implement custom processing logic:
static class MyDoFn extends DoFn<InputType, OutputType> {
  @ProcessElement
  public void processElement(@Element InputType element, OutputReceiver<OutputType> out) {
    // Process element and emit output
    out.output(transform(element));
  }
}

SDK Features

Type Safety

Java SDK provides compile-time type checking for transforms and data types:
PCollection<String> lines = ...;
PCollection<KV<String, Long>> wordCounts = lines
  .apply(ParDo.of(new ExtractWordsFn()))
  .apply(Count.perElement());

Windowing

Process unbounded data using time-based windows:
import org.apache.beam.sdk.transforms.windowing.*;
import org.joda.time.Duration;

PCollection<String> windowedData = input
  .apply(Window.<String>into(
    FixedWindows.of(Duration.standardMinutes(1))))
  .apply(/* your transforms */);

Side Inputs

Access additional data during processing:
PCollectionView<Map<String, Integer>> sideInputView = 
  sideData.apply(View.asMap());

PCollection<String> results = mainData.apply(
  ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
      Map<String, Integer> map = c.sideInput(sideInputView);
      // Use side input data
    }
  }).withSideInputs(sideInputView));

Metrics and Logging

Monitor pipeline execution with built-in metrics:
import org.apache.beam.sdk.metrics.*;

public class MyDoFn extends DoFn<String, String> {
  private final Counter processedElements = Metrics.counter(MyDoFn.class, "processed");
  
  @ProcessElement
  public void processElement(ProcessContext c) {
    processedElements.inc();
    // Process element
  }
}

I/O Connectors

The Java SDK includes extensive I/O support:
  • File-based: TextIO, AvroIO, ParquetIO, XML, JSON
  • Google Cloud: BigQuery, Bigtable, Spanner, Pub/Sub, Cloud Storage
  • Apache: Kafka, HBase, Cassandra, Solr, Hadoop
  • Databases: JDBC, MongoDB, Redis, Elasticsearch
  • Streaming: Kinesis, MQTT, AMQP
// Reading from BigQuery
PCollection<TableRow> rows = pipeline.apply(
  BigQueryIO.readTableRows().from("project:dataset.table"));

// Writing to Kafka (events is a PCollection<String>;
// .values() writes values only, with no keys)
events.apply(KafkaIO.<Void, String>write()
  .withBootstrapServers("localhost:9092")
  .withTopic("my-topic")
  .withValueSerializer(StringSerializer.class)
  .values());

Running Pipelines

Direct Runner (Local)

For testing and development:
mvn compile exec:java -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--runner=DirectRunner"

Google Cloud Dataflow

For production workloads on Google Cloud:
mvn compile exec:java -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--runner=DataflowRunner \
    --project=YOUR_PROJECT_ID \
    --region=us-central1 \
    --tempLocation=gs://YOUR_BUCKET/temp"
Apache Flink

Run on an Apache Flink cluster:
mvn package -Pflink-runner
flink run target/your-pipeline-bundled.jar \
  --runner=FlinkRunner \
  --flinkMaster=localhost:8081

Best Practices

Use composite PTransforms to encapsulate reusable pipeline logic:
public static class CountWords extends 
    PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    return lines
      .apply(ParDo.of(new ExtractWordsFn()))
      .apply(Count.perElement());
  }
}
Use @Setup and @Teardown for expensive initialization:
class MyDoFn extends DoFn<String, String> {
  private transient ExpensiveResource resource;
  
  @Setup
  public void setup() {
    resource = new ExpensiveResource();
  }
  
  @Teardown
  public void teardown() {
    resource.close();
  }
  
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Use resource
  }
}
Use State API for stateful processing:
class StatefulDoFn extends DoFn<KV<String, Integer>, String> {
  @StateId("sum")
  private final StateSpec<ValueState<Integer>> sumSpec = 
    StateSpecs.value();
  
  @ProcessElement
  public void process(
      @Element KV<String, Integer> element,
      @StateId("sum") ValueState<Integer> sum,
      OutputReceiver<String> out) {
    Integer current = sum.read();
    int updated = (current == null ? 0 : current) + element.getValue();
    sum.write(updated);
    out.output(element.getKey() + ": " + updated);
  }
}

Resources

  • API Reference: Complete JavaDoc documentation
  • Code Examples: Sample pipelines and patterns
  • I/O Transforms: Available connectors and I/O transforms
  • Programming Guide: In-depth SDK concepts
