A PCollection (Parallel Collection) represents a distributed dataset that your Beam pipeline operates on. It is the fundamental data abstraction in Apache Beam.
/** * A {@link PCollection PCollection<T>} is an immutable collection of values of type {@code T}. * A {@link PCollection} can contain either a bounded or unbounded number of elements. * Bounded and unbounded {@link PCollection PCollections} are produced as the output of * {@link PTransform PTransforms}. * * Each element in a {@link PCollection} has an associated timestamp. Readers assign * timestamps to elements when they create {@link PCollection PCollections}, and other * {@link PTransform PTransforms} propagate these timestamps from their input to their output. * * Additionally, a {@link PCollection} has an associated {@link WindowFn} and each element * is assigned to a set of windows. By default, the windowing function is {@link GlobalWindows} * and all elements are assigned into a single default window. */
PCollections are immutable. Once created, you cannot modify a PCollection’s contents. Transforms create new PCollections instead:
PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));// This creates a NEW PCollection, doesn't modify 'lines'PCollection<String> words = lines.apply(new ExtractWords());
Immutability enables parallel processing and fault tolerance. Beam can retry failed operations safely because data doesn’t change.
PCollections are distributed across multiple workers. You don’t control where elements are processed:
# Elements are automatically distributedlines = p | beam.io.ReadFromText('large_file.txt') # Millions of lines# Beam distributes these across available workers
Every element in a PCollection has an associated timestamp:
// Elements get timestamps from the sourcePCollection<String> lines = p.apply(TextIO.read()...);// Default timestamp for file elements is MIN_TIMESTAMP// You can assign custom timestampsPCollection<String> timestamped = lines.apply( ParDo.of(new DoFn<String, String>() { @ProcessElement public void processElement( @Element String element, OutputReceiver<String> out) { Instant timestamp = new Instant(); out.outputWithTimestamp(element, timestamp); } }));
Elements are organized into windows for processing:
// Default: all elements in a single global windowPCollection<String> data = ...;// Apply fixed-time windowsPCollection<String> windowed = data.apply( Window.into(FixedWindows.of(Duration.standardMinutes(5))));
Windowing is covered in detail in the Windowing section.