Version: 2.5.0

System Overview

ChronoLog is a distributed, tiered log storage system designed for High-Performance Computing (HPC) environments. It captures, sequences, and archives streams of timestamped Events — called Stories — without relying on a central sequencer. Each Event carries a physical timestamp assigned at the source, and ChronoLog's pipeline progressively merges and orders these Events as they move through storage tiers. The system is built on Thallium RPC with OFI transport and supports RDMA for high-throughput bulk data movement.

Component Architecture

ChronoLog is composed of five main components, each running as an independent process:

  • ChronoVisor: orchestrator — client portal, metadata directory, process registry, load balancing. Deployment: one per deployment, on a dedicated node.
  • ChronoKeeper: fast event ingestion — receives log events from clients and groups them into partial StoryChunks. Deployment: many per deployment, on compute nodes.
  • ChronoGrapher: merge and archive — merges partial StoryChunks into complete ones and writes them to persistent storage as HDF5 files. Deployment: one per Recording Group, on a storage node.
  • ChronoPlayer: read-back — serves story replay queries from both in-memory data and HDF5 archives. Deployment: one per Recording Group, on a storage node.
  • Client Library: application-facing API (chronolog_client.h) for connecting, creating Chronicles/Stories, recording events, and replaying data. Deployment: linked into client applications.

Recording Groups

A Recording Group is a logical grouping of recording processes that work together to handle a subset of the story recording workload:

  • Each group contains multiple ChronoKeepers, one ChronoGrapher, and one ChronoPlayer.
  • ChronoVisor assigns newly acquired Stories to a recording group using uniform random distribution for load balancing.
  • All processes in a group register with ChronoVisor and send periodic heartbeat/statistics messages so that ChronoVisor can monitor group health and composition.
  • A deployment can have multiple Recording Groups, allowing ChronoLog to scale horizontally by adding more groups.
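ChronoVisor's uniform random assignment can be sketched in a few lines of C++; the function and parameter names here are illustrative, not ChronoLog's API:

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Pick a Recording Group index uniformly at random, mirroring how
// ChronoVisor load-balances newly acquired Stories across groups.
// Names are illustrative, not ChronoLog's actual API.
std::size_t assignRecordingGroup(std::size_t numGroups, std::mt19937_64& rng)
{
    std::uniform_int_distribution<std::size_t> pick(0, numGroups - 1);
    return pick(rng);
}
```

With enough groups registered, this keeps the expected story count per group equal without any coordination beyond ChronoVisor's own registry.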

Data Flow

Write path

Write path flow: Client App → Client Library (timestamps the record) → ChronoVisor (AcquireStory: assigns the story to a Recording Group and notifies Keepers/Grapher via DataStoreAdmin RPCs) → ChronoKeeper (ingestion queue and in-memory StoryPipeline; events grouped into partial StoryChunks; retired chunks extracted) → ChronoGrapher (receives partial StoryChunks from all Keepers, merges them into complete chunks, archives to persistent storage) → HDF5 archives on the storage node's POSIX filesystem.
  1. Client app calls log_event() with payload → passes to Client library
  2. Client library timestamps the Event → sends to ChronoVisor
  3. ChronoVisor assigns the Story to a Recording Group → notifies all group processes
  4. ChronoKeeper ingests Events into in-memory Story Pipeline → groups into partial StoryChunks
  5. Retired chunks are drained via RDMA bulk transfer to ChronoGrapher
  6. ChronoGrapher merges partials from all Keepers → archives complete StoryChunks to HDF5 archive files
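Step 4, grouping timestamped events into time-range-bound chunks, can be sketched as follows; the fixed-window bucketing and all names here are simplifying assumptions, not ChronoLog's actual classes:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative sketch: bucket timestamped events into chunks covering
// [start, start + chunkDuration), the way a Keeper's StoryPipeline
// groups incoming events into partial StoryChunks.
struct Event { uint64_t eventTime; uint64_t payloadId; };

std::map<uint64_t, std::vector<Event>>
groupIntoChunks(const std::vector<Event>& events, uint64_t chunkDuration)
{
    std::map<uint64_t, std::vector<Event>> chunks;
    for (const Event& e : events)
    {
        // Align each event's timestamp down to its chunk's start time.
        uint64_t start = (e.eventTime / chunkDuration) * chunkDuration;
        chunks[start].push_back(e);
    }
    return chunks;
}
```

Because chunk boundaries are pure functions of the timestamp, every Keeper produces partials with identical time ranges, which is what lets the Grapher merge them deterministically.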

Read path

Read path flow:

  • The client application calls ReplayStory(storyId, startTime, endTime).
  • ChronoVisor routes the query to the ChronoPlayer in the appropriate Recording Group.
  • ChronoPlayer queries both data sources: the Player Data Store (in-memory, holding the most recent merged chunks) and the Archive Reading Agent (reading from HDF5 persistent storage).
  • The Playback Response Transfer Agent bulk-transfers the merged, time-ordered event stream back to the client.

The Player maintains an in-memory copy of the most recent story segments (the same chunks sent to ChronoGrapher), so recent events can be served before they are fully committed to the archive tier.
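A minimal sketch of this two-source replay merge, assuming both sources return events already sorted by timestamp (names and the timestamp-only representation are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Illustrative sketch: merge event timestamps from the in-memory
// Player Data Store (warm) and the HDF5 archive (cold) into one
// time-ordered stream, clipped to the requested [startTime, endTime].
// Both inputs are assumed to be sorted ascending.
std::vector<uint64_t> mergeReplay(const std::vector<uint64_t>& warm,
                                  const std::vector<uint64_t>& cold,
                                  uint64_t startTime, uint64_t endTime)
{
    std::vector<uint64_t> out;
    std::merge(warm.begin(), warm.end(), cold.begin(), cold.end(),
               std::back_inserter(out));
    // Drop events outside the requested replay window.
    out.erase(std::remove_if(out.begin(), out.end(),
                             [&](uint64_t t) { return t < startTime || t > endTime; }),
              out.end());
    return out;
}
```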

Communication Model

ChronoLog uses Thallium as its RPC framework, layered on top of Mercury and OFI (OpenFabrics Interfaces). The default transport protocol is ofi+sockets; for clusters with RDMA support, ofi+verbs enables native RDMA.

Default service ports and provider IDs

Service                        Port   Provider ID
Visor Client Portal            5555   55
Visor Keeper Registry          8888   88
Keeper Recording Service       6666   66
Keeper→Grapher Drain (RDMA)    9999   99
DataStore Admin Service        4444   44
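For illustration, a service endpoint in this stack is typically addressed by a string combining the transport protocol, host, and port; the exact address syntax ChronoLog's configuration expects is an assumption here, not taken from the source:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build a "<protocol>://<host>:<port>" endpoint
// string from a row of the table above. The format is an assumed
// convention, not ChronoLog's documented configuration syntax.
std::string serviceAddress(const std::string& protocol,
                           const std::string& host, int port)
{
    return protocol + "://" + host + ":" + std::to_string(port);
}
```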

Registration and heartbeat protocol

  1. Each Keeper, Grapher, and Player process starts by sending a Register RPC to ChronoVisor's Recording Process Registry Service (port 8888).
  2. After registration, processes send periodic Heartbeat/Statistics messages so ChronoVisor can track liveness and load.
  3. ChronoVisor maintains DataStoreAdminClient connections to every registered process and uses them to push StartStoryRecording / StopStoryRecording notifications when clients acquire or release stories.
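The liveness side of this protocol can be sketched as a last-seen table; this is a hypothetical simplification of ChronoVisor's registry, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative sketch: record the last heartbeat time per registered
// process and report any process silent longer than a timeout as dead.
class ProcessRegistry
{
public:
    void registerProcess(const std::string& id, uint64_t now) { lastSeen_[id] = now; }
    void heartbeat(const std::string& id, uint64_t now) { lastSeen_[id] = now; }

    bool isAlive(const std::string& id, uint64_t now, uint64_t timeout) const
    {
        auto it = lastSeen_.find(id);
        return it != lastSeen_.end() && now - it->second <= timeout;
    }

private:
    std::map<std::string, uint64_t> lastSeen_;  // process id -> last heartbeat time
};
```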

Key Concepts

  • Chronicle: a named collection of Stories. Carries metadata, indexing granularity, type (standard/priority), and a tiering policy.
  • Story: an individual, named log stream within a Chronicle. The unit of data acquisition — clients acquire and release stories.
  • StoryChunk: a time-range-bound container of log events for a single story. Defined by a start time and end time; events within are ordered by timestamp.
  • LogEvent: a single timestamped record: {storyId, eventTime, clientId, eventIndex, logRecord}.
  • StoryPipeline: the processing pipeline inside Keepers, Graphers, and Players that ingests events/chunks, orders them by time, groups them into StoryChunks, and retires completed chunks to the next tier.
  • Recording Group: a set of Keeper + Grapher + Player processes that collectively handle story recording for a subset of the workload.

For detailed data structure definitions, see the Data Model section.
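The LogEvent fields listed above suggest a natural ordering. This sketch orders events by timestamp and breaks ties by client and per-client index; the tie-break rule is an assumption, not necessarily ChronoLog's exact comparator:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Sketch of the LogEvent record with an assumed ordering:
// primarily by physical timestamp, then by (clientId, eventIndex)
// so concurrent events from different clients sort deterministically.
struct LogEvent
{
    uint64_t storyId;
    uint64_t eventTime;
    uint32_t clientId;
    uint32_t eventIndex;
    // logRecord payload omitted for brevity

    bool operator<(const LogEvent& other) const
    {
        return std::tie(eventTime, clientId, eventIndex)
             < std::tie(other.eventTime, other.clientId, other.eventIndex);
    }
};
```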

Tiered Storage Design

ChronoLog implements a three-tier storage hierarchy that progressively trades latency for capacity:

Tier   Location        Component                     Medium                          Purpose
Hot    Compute nodes   ChronoKeeper                  In-memory (Story Pipeline)      Fast event ingestion with sub-second latency
Warm   Storage node    ChronoGrapher / ChronoPlayer  In-memory (Story Pipeline)      Chunk merging, recent-data playback
Cold   Storage node    ChronoGrapher                 HDF5 files on POSIX filesystem  Long-term persistent archive

Data moves automatically from hot to cold:

  • Keepers retire partial StoryChunks once they exceed the configured chunk duration (default: 30 seconds) or the story stops recording.
  • Graphers merge partials from all Keepers into complete StoryChunks and archive them to HDF5 files.
  • Players maintain a warm copy of recent chunks for fast read-back while the archive catches up.
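The retirement condition above can be expressed as a small predicate; the function name is illustrative, and the 30-second default comes from the text:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a partial StoryChunk is retired once its time
// window has been open longer than the configured chunk duration
// (default 30 s per the text above) or the story stops recording.
bool shouldRetireChunk(uint64_t chunkStartSec, uint64_t nowSec,
                       bool storyStopped, uint64_t chunkDurationSec = 30)
{
    return storyStopped || (nowSec - chunkStartSec >= chunkDurationSec);
}
```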

Tiering policy can be set per-Chronicle (normal, hot, or cold) to bias toward performance or capacity.

Design Principles

  • Physical timestamps — Events carry timestamps assigned at the source. There is no global sequencer; ordering is resolved progressively through the pipeline.
  • Double-buffering — StoryPipelines use a two-deque pattern (active and passive queues) so that ingestion and extraction can proceed in parallel without blocking each other. The active deque receives new data while the passive deque is drained by sequencing/extraction threads; they swap atomically when conditions are met.
  • Parallelized ingestion — Multiple ChronoKeepers per Recording Group accept events concurrently, distributing ingestion load across compute nodes.
  • Batch data movement — Retired StoryChunks are transferred in bulk from Keepers to Graphers, amortizing RPC overhead.
  • RDMA-capable transport — The Keeper→Grapher drain path uses Thallium's tl::bulk for zero-copy RDMA transfers when the OFI provider supports it (ofi+verbs), falling back to ofi+sockets otherwise.
  • Single-writer per tier — Each Recording Group has exactly one Grapher and one Player, avoiding write conflicts at the merge and archive stages.
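The double-buffering principle above can be sketched as a two-deque swap; this is a minimal illustration of the pattern, not ChronoLog's StoryPipeline code:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>

// Illustrative two-deque double buffer: producers append to the active
// deque while the extraction thread swaps in an empty passive deque
// under a short lock, then drains the swapped-out batch without
// blocking further ingestion.
class DoubleBuffer
{
public:
    void ingest(uint64_t event)
    {
        std::lock_guard<std::mutex> lk(mtx_);
        active_.push_back(event);
    }

    // Called by the sequencing/extraction thread: takes everything
    // ingested so far with an O(1) swap.
    std::deque<uint64_t> drain()
    {
        std::deque<uint64_t> passive;
        {
            std::lock_guard<std::mutex> lk(mtx_);
            std::swap(active_, passive);
        }
        return passive;
    }

private:
    std::mutex mtx_;
    std::deque<uint64_t> active_;
};
```

The lock is held only for the append or the pointer swap, so ingestion stalls are bounded by a constant-time critical section regardless of batch size.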

Further Reading