Distributed Story Pipeline
The Distributed Story Pipeline is the data model that drives the design of ChronoKeeper and ChronoGrapher. It describes how log events travel from client applications through a multi-tier collection pipeline — being progressively grouped, merged, and archived — until they reach persistent storage. At any moment in time, segments of a Story's pipeline can be distributed across multiple processes, nodes, and storage tiers.
Pipeline at a Glance
The diagram below shows a point-in-time snapshot of two stories (S1 and S2) as their data flows through the four tiers of the pipeline. Time flows from right (newest events being generated) to left (oldest data, already archived).
Reading right-to-left, each tier plays a distinct role:
- Clients generate individual LogEvents and stream them to ChronoKeepers via RPC.
- ChronoKeepers (compute nodes) group incoming events into time-windowed partial StoryChunks. Each Keeper only sees the events sent to it, so its chunks contain a subset of the full story.
- ChronoGrapher (storage node) receives partial chunks from all Keepers and merges them into complete, globally-ordered StoryChunks.
- Persistent Tier stores the final, archived StoryChunks in HDF5 format on a POSIX filesystem.
At any instant, the newest data lives on the right (still being generated), while progressively older, more complete data has migrated left through the pipeline toward persistent storage.
Building Blocks
Three core abstractions make up the pipeline data model.
Story
A Story is a time-series dataset composed of individual log events that client processes running on various HPC nodes generate over time. Every log event is identified by the StoryId of the story it belongs to, the ClientId of the application that generated it, and a timestamp assigned by the ChronoLog client library at the moment it takes ownership of the event. Because two threads of the same client application can generate events with identical timestamps, ChronoLog also augments the timestamp with a client-specific index to keep the ordering unambiguous.
StoryChunk
A StoryChunk is a container for events belonging to the same story that fall into the time range [start_time, end_time) — including the start time and excluding the end time. Events within a StoryChunk are ordered by their EventSequence = (timestamp, ClientId, index).
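The EventSequence ordering can be sketched as a tuple sort key (illustrative Python; the LogEvent field names are assumptions for this sketch, not ChronoLog's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    # EventSequence components, in comparison order
    timestamp: int   # assigned by the client library
    client_id: int   # ClientId of the generating application
    index: int       # client-specific tie-breaker for equal timestamps
    payload: bytes = b""

def event_sequence(e: LogEvent):
    """Sort key: EventSequence = (timestamp, ClientId, index)."""
    return (e.timestamp, e.client_id, e.index)

# Two events with the same timestamp from different clients are still
# totally ordered: first by ClientId, then by the per-client index.
events = [
    LogEvent(100, client_id=2, index=0),
    LogEvent(100, client_id=1, index=1),
    LogEvent(99,  client_id=3, index=0),
]
events.sort(key=event_sequence)
# events now begin with the t=99 event, followed by the two t=100
# events ordered by ClientId.
```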
StoryChunks are the fundamental unit of data movement through the pipeline. They travel from ChronoKeepers to ChronoGrapher as partial batches, get merged into complete batches, and are eventually archived to persistent storage.
StoryPipeline
A StoryPipeline is the ordered set of StoryChunks for a given story, sorted by their start_time values. For any story with active writers, segments of the StoryPipeline can be distributed over several processes, nodes, and storage tiers simultaneously — as illustrated in the snapshot above.
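One way to picture a StoryPipeline segment is as a mapping from chunk start times to chunks, kept in start_time order. The class and method names below are invented for illustration; this is a sketch of the data model, not ChronoLog's implementation:

```python
import bisect

class StoryPipeline:
    """Ordered set of StoryChunks for one story, keyed by start_time."""

    def __init__(self, chunk_duration):
        self.chunk_duration = chunk_duration
        self.starts = []   # chunk start times, kept sorted
        self.chunks = {}   # start_time -> list of events

    def chunk_for(self, timestamp):
        """Return (creating if needed) the chunk whose [start, start +
        chunk_duration) range covers the timestamp."""
        start = timestamp - (timestamp % self.chunk_duration)
        if start not in self.chunks:
            bisect.insort(self.starts, start)
            self.chunks[start] = []
        return self.chunks[start]

pipeline = StoryPipeline(chunk_duration=30)
pipeline.chunk_for(95).append("event@95")   # lands in chunk [90, 120)
pipeline.chunk_for(31).append("event@31")   # lands in chunk [30, 60)
# pipeline.starts stays sorted by start_time: [30, 90]
```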
Parallel Ingestion
A key design goal of ChronoLog is to distribute the ingestion workload across many compute nodes. Rather than funneling all events through a single process, client applications spread their events across multiple ChronoKeepers using round-robin selection.
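Round-robin Keeper selection on the client side can be sketched as follows (hypothetical names; ChronoLog's client library handles this internally):

```python
import itertools

def round_robin(keepers):
    """Cycle through the available ChronoKeepers so that consecutive
    events from one client are spread evenly across all of them."""
    return itertools.cycle(keepers)

picker = round_robin(["keeper-0", "keeper-1", "keeper-2"])
targets = [next(picker) for _ in range(5)]
# targets: ['keeper-0', 'keeper-1', 'keeper-2', 'keeper-0', 'keeper-1']
```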
Because each client can send events to any ChronoKeeper, each Keeper involved in recording a story holds only a subset of that story's events for any given time range. These subsets are called partial StoryChunks.
When the ChronoGrapher receives partial StoryChunks for the same story and overlapping time ranges from different Keepers, it merges them into complete StoryChunks that contain events from all clients, sorted by EventSequence. StoryChunks in the ChronoGrapher's pipeline typically use longer time ranges (coarser granularity) than those in the ChronoKeepers, since the Grapher operates at a slower, batch-oriented cadence.
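Since each Keeper's partial chunk is already sorted by EventSequence, the Grapher's job reduces to an ordered merge. In the sketch below, plain tuples stand in for (timestamp, ClientId, index); this is illustrative, not ChronoLog's merge code:

```python
import heapq

def merge_partial_chunks(*partials):
    """Merge per-Keeper partial chunks, each already sorted by
    EventSequence = (timestamp, client_id, index), into one complete,
    globally ordered chunk without re-sorting from scratch."""
    return list(heapq.merge(*partials))

keeper_a = [(100, 1, 0), (130, 1, 1)]   # events seen by Keeper A
keeper_b = [(100, 2, 0), (110, 2, 1)]   # events seen by Keeper B
complete = merge_partial_chunks(keeper_a, keeper_b)
# complete: [(100, 1, 0), (100, 2, 0), (110, 2, 1), (130, 1, 1)]
```

An ordered merge of k sorted inputs is linear in the total number of events (up to a log k heap factor), which is why presorting in the Keepers pays off at the Grapher.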
Acceptance Windows
Each tier in the pipeline holds onto its StoryChunks for a configurable acceptance window before forwarding them downstream. This window provides tolerance for network delays and late-arriving events — an event that arrives slightly after its chunk's time range has ended can still be inserted into the correct chunk, as long as the acceptance window has not yet expired.
The lifecycle of a single StoryChunk follows these phases:
- Chunk Duration — During the chunk's time window (default: 30 seconds), the ChronoKeeper groups arriving events into the chunk by StoryId and timestamp.
- Keeper Acceptance Window — After the chunk's time range ends, the Keeper continues accepting late events for a configurable window (default: 60 seconds from chunk start) to tolerate network delays.
- Extract and Transfer — Once the acceptance window expires, the Keeper serializes the partial StoryChunk and sends it to the ChronoGrapher via RDMA bulk transfer.
- Grapher Acceptance Window — The ChronoGrapher holds the chunk in its own pipeline (default: 180 seconds), during which it merges partial chunks from different Keepers into complete StoryChunks.
- Archive — When the Grapher's acceptance window expires, the complete StoryChunk is extracted and written to HDF5 persistent storage.
All timing parameters are configurable per deployment, trading latency (shorter windows) against completeness (longer windows that tolerate more delay).
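The lifecycle phases above can be expressed as a simple placement check. The defaults come from the list above; the function itself is a sketch, not ChronoLog code:

```python
# Default timing parameters, all measured from the chunk's start_time.
CHUNK_DURATION = 30    # seconds the chunk's time window covers
KEEPER_WINDOW = 60     # seconds the Keeper keeps accepting late events
GRAPHER_WINDOW = 180   # seconds the Grapher holds the chunk for merging

def placement(chunk_start, now):
    """Report where an event for chunk [chunk_start, chunk_start + 30)
    can still land, given the current time."""
    age = now - chunk_start
    if age < CHUNK_DURATION:
        return "keeper: chunk open"
    if age < KEEPER_WINDOW:
        return "keeper: late event accepted"
    if age < GRAPHER_WINDOW:
        return "grapher: merging partial chunks"
    return "archived"

# For a chunk covering [0, 30): an event arriving at t=45 is 15 seconds
# late but still inside the Keeper's 60-second acceptance window.
assert placement(0, 45) == "keeper: late event accepted"
```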
Design Benefits
The distributed Story Pipeline model provides several key advantages for high-throughput event collection in HPC environments:
- Parallelized ingestion — Using a group of ChronoKeepers running on different HPC nodes to record individual events for the same story distributes the immediate ingestion workload and parallelizes early sequencing of events into partial StoryChunks across compute nodes.
- Batch data movement — Moving events in presorted batches (StoryChunks) between compute and storage nodes, rather than forwarding individual events, provides significantly higher overall throughput across the network.
- Progressive ordering — Events are first locally sorted within each Keeper, then globally merged at the Grapher. This two-phase approach avoids the bottleneck of a centralized sequencer.
- Network delay tolerance — Configurable acceptance windows at each tier ensure that late-arriving events are correctly placed, without sacrificing pipeline throughput.
- Decoupled processing — Each tier operates independently: Keepers can continue ingesting while the Grapher merges and archives, preventing back-pressure from slowing down client applications.