Version: 2.5.0

Distributed Story Pipeline

The Distributed Story Pipeline is the data model that drives the design of ChronoKeeper and ChronoGrapher. It describes how log events travel from client applications through a multi-tier collection pipeline — being progressively grouped, merged, and archived — until they reach persistent storage. At any moment in time, segments of a Story's pipeline can be distributed across multiple processes, nodes, and storage tiers.

Pipeline at a Glance

The diagram below shows a point-in-time snapshot of two stories (S1 and S2) as their data flows through the four tiers of the pipeline. Time flows from right (newest events being generated) to left (oldest data, already archived).

[Diagram: Distributed pipeline snapshot. Time runs from right (now) to left (older, t0). Clients 1 and 2 generate events for Stories 1 and 2 and stream them via log_event() RPC to ChronoKeepers 1 and 2; each Keeper holds partial chunks (e.g. S1 t25–30, S1 t30–35, S2 t27–33) containing only its own clients' events. The Keepers forward chunks via RDMA bulk transfer to the ChronoGrapher, which holds complete, all-clients-merged chunks (e.g. S1 t10–t15 through t20–t25, S2 t15–t21 and t21–t27). The Persistent Tier stores HDF5 archives of the oldest chunks (S1 t0–t10, S2 t3–t15).]

Reading right-to-left, each tier plays a distinct role:

  • Clients generate individual LogEvents and stream them to ChronoKeepers via RPC.
  • ChronoKeepers (compute nodes) group incoming events into time-windowed partial StoryChunks. Each Keeper only sees the events sent to it, so its chunks contain a subset of the full story.
  • ChronoGrapher (storage node) receives partial chunks from all Keepers and merges them into complete, globally-ordered StoryChunks.
  • Persistent Tier stores the final, archived StoryChunks in HDF5 format on the POSIX filesystem.

At any instant, the newest data lives on the right (still being generated), while progressively older, more complete data has migrated left through the pipeline toward persistent storage.

Building Blocks

Three core abstractions make up the pipeline data model.

Story

A Story is a time-series dataset composed of individual log events that client processes running on various HPC nodes generate over time. Every log event is identified by the StoryId of the story it belongs to, the ClientId of the application that generated it, and a timestamp assigned by the ChronoLog client library at the moment it takes ownership of the event. To handle the case where the same timestamp is generated on different threads of a client application, ChronoLog also augments the timestamp with a client-specific index.
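The event identity described above can be sketched as a small data structure. This is an illustrative model only (field names are hypothetical, not the ChronoLog API), showing how the StoryId, ClientId, timestamp, and tie-breaking index together identify an event:

```python
from dataclasses import dataclass

# Hypothetical sketch of a log event's identity, assuming the fields
# described above. Not the actual ChronoLog client library types.
@dataclass(frozen=True)
class LogEvent:
    story_id: int   # StoryId of the story the event belongs to
    client_id: int  # ClientId of the generating application
    timestamp: int  # assigned when the client library takes ownership
    index: int      # client-specific index to break timestamp ties
    payload: bytes = b""
```

Because two threads of the same client can draw identical timestamps, the `(timestamp, client_id, index)` triple is what makes every event's position unique.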

StoryChunk

A StoryChunk is a container for events belonging to the same story that fall into the time range [start_time, end_time) — including the start time and excluding the end time. Events within a StoryChunk are ordered by their EventSequence = (timestamp, ClientId, index).

StoryChunks are the fundamental unit of data movement through the pipeline. They travel from ChronoKeepers to ChronoGrapher as partial batches, get merged into complete batches, and are eventually archived to persistent storage.
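The two rules above, half-open range membership and EventSequence ordering, can be sketched in a few lines. Function names are illustrative, not ChronoLog API:

```python
# Chunk membership: the range [start_time, end_time) includes start_time
# and excludes end_time.
def in_chunk(timestamp, start_time, end_time):
    return start_time <= timestamp < end_time

# In-chunk ordering key: EventSequence = (timestamp, ClientId, index).
def event_sequence(event):
    timestamp, client_id, index = event
    return (timestamp, client_id, index)

# Events with equal timestamps are ordered by ClientId, then index.
events = [(12, 2, 0), (10, 1, 1), (10, 1, 0), (11, 2, 0)]
ordered = sorted(events, key=event_sequence)
# ordered -> [(10, 1, 0), (10, 1, 1), (11, 2, 0), (12, 2, 0)]
```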

StoryPipeline

A StoryPipeline is the ordered set of StoryChunks for a given story, sorted by their start_time values. For any story with active writers, segments of the StoryPipeline can be distributed over several processes, nodes, and storage tiers simultaneously — as illustrated in the snapshot above.
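A minimal sketch of this structure, assuming fixed-duration chunks keyed and sorted by `start_time` (class and method names are hypothetical):

```python
import bisect

# Illustrative StoryPipeline: chunks keyed by start_time, with the keys
# kept in sorted order. Not the actual ChronoLog implementation.
class StoryPipeline:
    def __init__(self, chunk_duration):
        self.chunk_duration = chunk_duration
        self.starts = []   # sorted list of chunk start_times
        self.chunks = {}   # start_time -> list of events

    def chunk_for(self, timestamp):
        # Locate (or create) the chunk whose half-open range
        # [start, start + chunk_duration) contains the timestamp.
        start = timestamp - (timestamp % self.chunk_duration)
        if start not in self.chunks:
            bisect.insort(self.starts, start)
            self.chunks[start] = []
        return self.chunks[start]

pipe = StoryPipeline(chunk_duration=30)
pipe.chunk_for(45).append(("e1",))  # lands in chunk [30, 60)
pipe.chunk_for(10).append(("e2",))  # lands in chunk [0, 30)
# pipe.starts -> [0, 30]
```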

Parallel Ingestion

A key design goal of ChronoLog is to distribute the ingestion workload across many compute nodes. Rather than funneling all events through a single process, client applications spread their events across multiple ChronoKeepers using round-robin selection.
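The round-robin selection on the client side can be sketched as follows; the keeper names and helper are illustrative, not the ChronoLog client API:

```python
import itertools

# Sketch of client-side round-robin Keeper selection: each successive
# event goes to the next Keeper in the rotation.
def round_robin(keepers):
    rotation = itertools.cycle(keepers)
    def pick(_event):
        return next(rotation)
    return pick

pick = round_robin(["keeper1", "keeper2"])
targets = [pick(e) for e in ["e1", "e2", "e3", "e4"]]
# targets -> ["keeper1", "keeper2", "keeper1", "keeper2"]
```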

[Diagram: Parallel ingestion and merging. Clients 1–3 distribute their events round-robin across ChronoKeepers 1 and 2. Each Keeper holds partial chunks for Story 1 (t0–30 and t30–60) containing only the events it received: Keeper 1 has events from Clients 1 and 3, Keeper 2 from Clients 2 and 3. The partial chunks travel via RDMA bulk transfer to the ChronoGrapher, which merges the Keepers' chunks for each time range into complete, globally-ordered StoryChunks and archives them to HDF5 persistent storage.]

Because each client can send events to any ChronoKeeper, each Keeper involved in recording the same story holds only a subset of the story's events for a given time range. These chunks are therefore called partial StoryChunks.

When the ChronoGrapher receives partial StoryChunks for the same story and overlapping time ranges from different Keepers, it merges them into complete StoryChunks that contain events from all clients, sorted by EventSequence. StoryChunks in the ChronoGrapher's pipeline typically use longer time ranges (coarser granularity) than those in the ChronoKeepers, since the Grapher operates at a slower, batch-oriented cadence.
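Since each Keeper's partial chunk is already sorted by EventSequence, the Grapher's merge amounts to a k-way merge of presorted streams. A minimal sketch, with events as `(timestamp, client_id, index)` tuples and illustrative names:

```python
import heapq

# Sketch of merging partial chunks from several Keepers into one
# complete chunk, ordered by EventSequence = (timestamp, client_id, index).
# Assumes each partial chunk is presorted by the same key.
def merge_partial_chunks(*partials):
    return list(heapq.merge(*partials))

k1 = [(10, 1, 0), (12, 1, 0)]  # Client 1's events, via Keeper 1
k2 = [(10, 2, 0), (11, 2, 0)]  # Client 2's events, via Keeper 2
complete = merge_partial_chunks(k1, k2)
# complete -> [(10, 1, 0), (10, 2, 0), (11, 2, 0), (12, 1, 0)]
```

Merging presorted inputs this way avoids re-sorting the whole chunk from scratch at the Grapher.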

Acceptance Windows

Each tier in the pipeline holds onto its StoryChunks for a configurable acceptance window before forwarding them downstream. This window provides tolerance for network delays and late-arriving events — an event that arrives slightly after its chunk's time range has ended can still be inserted into the correct chunk, as long as the acceptance window has not yet expired.
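The late-event acceptance rule sketched below assumes (as described above) that an event belongs in a chunk if its timestamp falls in the chunk's range, and may still be inserted after the range ends as long as the tier's window, measured from chunk start, is open. Names and times are illustrative:

```python
# Sketch of the acceptance-window check for a late-arriving event.
def accepts(event_ts, now, chunk_start, chunk_end, acceptance_window):
    in_range = chunk_start <= event_ts < chunk_end
    window_open = now < chunk_start + acceptance_window
    return in_range and window_open

# Chunk covers [0, 30); Keeper window is 60s from chunk start.
late_ok = accepts(25, now=40, chunk_start=0, chunk_end=30,
                  acceptance_window=60)   # arrives late but is accepted
too_late = accepts(25, now=70, chunk_start=0, chunk_end=30,
                   acceptance_window=60)  # window expired, rejected
```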

[Diagram: Acceptance window and chunk lifecycle, on a timeline with ticks at t=0s, 30s, 60s, 120s, and 240s. Events are grouped into a chunk during its 30s Chunk Duration; the Keeper Window accepts late events (network delay tolerance) until t=60s, when the chunk is extracted and transferred via RDMA; the Grapher Window then merges partial chunks from all Keepers until t=240s, when the complete chunk is archived to HDF5. Total time from event ingestion to persistent archive: ~240s, configurable per tier.]

The lifecycle of a single StoryChunk follows these phases:

  1. Chunk Duration — During the chunk's time window (default: 30 seconds), the ChronoKeeper groups arriving events into the chunk by StoryId and timestamp.
  2. Keeper Acceptance Window — After the chunk's time range ends, the Keeper continues accepting late events for a configurable window (default: 60 seconds from chunk start) to tolerate network delays.
  3. Extract and Transfer — Once the acceptance window expires, the Keeper serializes the partial StoryChunk and sends it to the ChronoGrapher via RDMA bulk transfer.
  4. Grapher Acceptance Window — The ChronoGrapher holds the chunk in its own pipeline (default: 180 seconds), during which it merges partial chunks from different Keepers into complete StoryChunks.
  5. Archive — When the Grapher's acceptance window expires, the complete StoryChunk is extracted and written to HDF5 persistent storage.

All timing parameters are configurable per deployment, trading latency (shorter windows) against completeness (longer windows that tolerate more delay).
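Putting the defaults from the lifecycle together (parameter names here are hypothetical, not ChronoLog configuration keys), the worst-case time from a chunk's start to its HDF5 archive is the sum of the two acceptance windows:

```python
# Illustrative timing parameters; names are hypothetical.
timing = {
    "chunk_duration_secs": 30,             # Keeper chunk time window
    "keeper_acceptance_window_secs": 60,   # measured from chunk start
    "grapher_acceptance_window_secs": 180, # measured from transfer
}

worst_case_latency = (timing["keeper_acceptance_window_secs"]
                      + timing["grapher_acceptance_window_secs"])
# worst_case_latency -> 240 seconds from chunk start to HDF5 archive
```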

Design Benefits

The distributed Story Pipeline model provides several key advantages for high-throughput event collection in HPC environments:

  • Parallelized ingestion — Using a group of ChronoKeepers running on different HPC nodes to record individual events for the same story distributes the immediate ingestion workload and parallelizes early sequencing of events into partial StoryChunks across compute nodes.
  • Batch data movement — Moving events in presorted batches (StoryChunks) between compute and storage nodes, rather than forwarding individual events, provides significantly higher overall throughput across the network.
  • Progressive ordering — Events are first locally sorted within each Keeper, then globally merged at the Grapher. This two-phase approach avoids the bottleneck of a centralized sequencer.
  • Network delay tolerance — Configurable acceptance windows at each tier ensure that late-arriving events are correctly placed, without sacrificing pipeline throughput.
  • Decoupled processing — Each tier operates independently: Keepers can continue ingesting while the Grapher merges and archives, preventing back-pressure from slowing down client applications.