Pipeline - Apache Beam

The Pipeline API provides the foundation for building Apache Beam data processing workflows. A Pipeline manages a directed acyclic graph of PTransforms and the PCollections they consume and produce.

Pipeline Class

The Pipeline class represents a data processing workflow.

Creating a Pipeline

create()

Creates a pipeline from default PipelineOptions.

public static Pipeline create()

Returns: A new Pipeline instance Example:

Pipeline pipeline = Pipeline.create();

create(PipelineOptions)

Creates a pipeline from the provided PipelineOptions.

public static Pipeline create(PipelineOptions options)

options

PipelineOptions

required

Configuration options for the pipeline, including runner specification

Returns: A new Pipeline instance Example:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);

Applying Transforms

apply(PTransform)

Adds a root PTransform to the pipeline.

public <OutputT extends POutput> OutputT apply(
    PTransform<? super PBegin, OutputT> root)

root

PTransform<? super PBegin, OutputT>

required

The root transform to apply (e.g., Read or Create)

output

OutputT

The output PCollection or POutput from the transform

Example:

PCollection<String> lines = pipeline.apply(
    TextIO.read().from("gs://bucket/dir/file*.txt")
);

apply(String, PTransform)

Adds a named root PTransform to the pipeline.

public <OutputT extends POutput> OutputT apply(
    String name,
    PTransform<? super PBegin, OutputT> root)

name

String

required

The name for this transform node in the pipeline graph

root

PTransform<? super PBegin, OutputT>

required

The root transform to apply

output

OutputT

The output PCollection or POutput from the transform

Example:

PCollection<String> lines = pipeline.apply(
    "ReadInputFiles",
    TextIO.read().from("gs://bucket/dir/file*.txt")
);

Running the Pipeline

run()

Runs the pipeline using the PipelineOptions specified during creation.

public PipelineResult run()

result

PipelineResult

The result of the pipeline execution

Example:

Pipeline p = Pipeline.create();
// ... build pipeline ...
PipelineResult result = p.run();

run(PipelineOptions)

Runs the pipeline using the specified PipelineOptions.

public PipelineResult run(PipelineOptions options)

options

PipelineOptions

required

The pipeline options to use for execution

result

PipelineResult

The result of the pipeline execution

Other Methods

begin()

Returns a PBegin for this pipeline, used as input for root transforms.

public PBegin begin()

pbegin

PBegin

A PBegin owned by this pipeline

getOptions()

Returns the PipelineOptions for this pipeline.

public PipelineOptions getOptions()

options

PipelineOptions

The pipeline options

getCoderRegistry()

Returns the CoderRegistry for this pipeline.

public CoderRegistry getCoderRegistry()

registry

CoderRegistry

The coder registry for encoding/decoding data

getSchemaRegistry()

Returns the SchemaRegistry for this pipeline.

public SchemaRegistry getSchemaRegistry()

registry

SchemaRegistry

The schema registry for schema-aware transforms

PipelineResult Interface

Represents the result of a pipeline execution.

Methods

getState()

Retrieves the current state of the pipeline execution.

State getState()

state

State

The current state (UNKNOWN, STOPPED, RUNNING, DONE, FAILED, CANCELLED, UPDATED, UNRECOGNIZED)

cancel()

Cancels the pipeline execution.

State cancel() throws IOException

state

State

The state after cancellation attempt

waitUntilFinish()

Waits until the pipeline finishes and returns the final status.

State waitUntilFinish()

state

State

The final state of the pipeline

Example:

PipelineResult result = pipeline.run();
State finalState = result.waitUntilFinish();
if (finalState == State.DONE) {
    System.out.println("Pipeline completed successfully!");
}

waitUntilFinish(Duration)

Waits until the pipeline finishes with a timeout.

State waitUntilFinish(Duration duration)

duration

Duration

required

The maximum time to wait (less than 1ms means infinite wait)

state

State

The final state of the pipeline, or null on timeout

metrics()

Returns metrics from the pipeline execution.

MetricResults metrics()

metrics

MetricResults

The metrics from pipeline execution

PipelineOptions Interface

Configuration options for pipeline execution. Created using PipelineOptionsFactory. Example:

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DataflowRunner.class);
options.setTempLocation("gs://bucket/temp");

Pipeline pipeline = Pipeline.create(options);

Complete Example

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
    public static void main(String[] args) {
        // Create pipeline options
        PipelineOptions options = PipelineOptionsFactory
            .fromArgs(args)
            .create();
        
        // Create the pipeline
        Pipeline p = Pipeline.create(options);
        
        // Build the pipeline
        p.apply("ReadLines", TextIO.read().from("input.txt"))
         .apply("CountWords", Count.perElement())
         .apply("FormatResults", MapElements
             .into(TypeDescriptors.strings())
             .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply("WriteResults", TextIO.write().to("output"));
        
        // Run the pipeline
        p.run().waitUntilFinish();
    }
}

Documentation Index

​Pipeline Class

​Creating a Pipeline

​create()

​create(PipelineOptions)

​Applying Transforms

​apply(PTransform)

​apply(String, PTransform)

​Running the Pipeline

​run()

​run(PipelineOptions)

​Other Methods

​begin()

​getOptions()

​getCoderRegistry()

​getSchemaRegistry()

​PipelineResult Interface

​Methods

​getState()

​cancel()

​waitUntilFinish()

​waitUntilFinish(Duration)

​metrics()

​PipelineOptions Interface

​Complete Example

Pipeline Class

Creating a Pipeline

create()

create(PipelineOptions)

Applying Transforms

apply(PTransform)

apply(String, PTransform)

Running the Pipeline

run()

run(PipelineOptions)

Other Methods

begin()

getOptions()

getCoderRegistry()

getSchemaRegistry()

PipelineResult Interface

Methods

getState()

cancel()

waitUntilFinish()

waitUntilFinish(Duration)

metrics()

PipelineOptions Interface

Complete Example