Version: 2.5.0

System Overview

ChronoLog is a distributed, tiered log storage system designed for High-Performance Computing (HPC) environments. It captures, sequences, and archives streams of timestamped Events — called Stories — without relying on a central sequencer. Each Event carries a physical timestamp assigned at the source, and ChronoLog's pipeline progressively merges and orders these Events as they move through storage tiers. The system is built on Thallium RPC with OFI transport and supports RDMA for high-throughput bulk data movement.

Component Architecture

ChronoLog is composed of five main components, each running as an independent process:

  • ChronoVisor: orchestrator — client portal, metadata directory, process registry, load balancing. Deployment: one per deployment, on a dedicated node.
  • ChronoKeeper: fast event ingestion — receives log events from clients and groups them into partial StoryChunks. Deployment: many per deployment, on compute nodes.
  • ChronoGrapher: merge and archive — merges partial StoryChunks into complete ones and writes them to persistent storage as HDF5 files. Deployment: one per Recording Group, on a storage node.
  • ChronoPlayer: read-back — serves story replay queries from both in-memory data and HDF5 archives. Deployment: one per Recording Group, on a storage node.
  • Client Library: application-facing API (chronolog_client.h) for connecting, creating Chronicles/Stories, recording events, and replaying data. Deployment: linked into client applications.

Recording Groups

A Recording Group is a logical grouping of recording processes that work together to handle a subset of the story recording workload:

  • Each group contains multiple ChronoKeepers, one ChronoGrapher, and one ChronoPlayer.
  • ChronoVisor assigns newly acquired Stories to a recording group using uniform random distribution for load balancing.
  • All processes in a group register with ChronoVisor and send periodic heartbeat/statistics messages so that ChronoVisor can monitor group health and composition.
  • A deployment can have multiple Recording Groups, allowing ChronoLog to scale horizontally by adding more groups.
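ChronoVisor's uniform random assignment can be sketched in a few lines of C++; the function and parameter names here are illustrative, not ChronoLog's API:

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Pick a Recording Group index uniformly at random, mirroring how
// ChronoVisor load-balances newly acquired Stories across groups.
// Names are illustrative, not ChronoLog's actual API.
std::size_t assignRecordingGroup(std::size_t numGroups, std::mt19937_64& rng)
{
    std::uniform_int_distribution<std::size_t> pick(0, numGroups - 1);
    return pick(rng);
}
```

With enough groups registered, this keeps the expected story count per group equal without any coordination beyond ChronoVisor's own registry.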

Data Flow

Write path

Write path flow: Client App → Client Library (timestamps the record) → ChronoVisor (AcquireStory: assigns the story to a Recording Group and notifies Keepers/Grapher via DataStoreAdmin RPCs) → ChronoKeeper (ingestion queue and in-memory StoryPipeline; events grouped into partial StoryChunks; retired chunks extracted) → ChronoGrapher (receives partial StoryChunks from all Keepers, merges them into complete chunks, archives to persistent storage) → HDF5 archives on the storage node's POSIX filesystem.
  1. Client app calls log_event() with payload → passes to Client library
  2. Client library timestamps the Event → sends to ChronoVisor
  3. ChronoVisor assigns the Story to a Recording Group → notifies all group processes
  4. ChronoKeeper ingests Events into in-memory Story Pipeline → groups into partial StoryChunks
  5. Retired chunks are drained via RDMA bulk transfer to ChronoGrapher
  6. ChronoGrapher merges partials from all Keepers → archives complete StoryChunks to HDF5 archive files
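Step 4, grouping timestamped events into time-range-bound chunks, can be sketched as follows; the fixed-window bucketing and all names here are simplifying assumptions, not ChronoLog's actual classes:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative sketch: bucket timestamped events into chunks covering
// [start, start + chunkDuration), the way a Keeper's StoryPipeline
// groups incoming events into partial StoryChunks.
struct Event { uint64_t eventTime; uint64_t payloadId; };

std::map<uint64_t, std::vector<Event>>
groupIntoChunks(const std::vector<Event>& events, uint64_t chunkDuration)
{
    std::map<uint64_t, std::vector<Event>> chunks;
    for (const Event& e : events)
    {
        // Align each event's timestamp down to its chunk's start time.
        uint64_t start = (e.eventTime / chunkDuration) * chunkDuration;
        chunks[start].push_back(e);
    }
    return chunks;
}
```

Because chunk boundaries are pure functions of the timestamp, every Keeper produces partials with identical time ranges, which is what lets the Grapher merge them deterministically.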

Read path

Read path flow:

  • The client application calls ReplayStory(storyId, startTime, endTime).
  • ChronoVisor routes the query to the ChronoPlayer in the appropriate Recording Group.
  • ChronoPlayer queries both data sources: the Player Data Store (in-memory, holding the most recent merged chunks) and the Archive Reading Agent (reading from HDF5 persistent storage).
  • The Playback Response Transfer Agent bulk-transfers the merged, time-ordered event stream back to the client.

The Player maintains an in-memory copy of the most recent story segments (the same chunks sent to ChronoGrapher), so recent events can be served before they are fully committed to the archive tier.
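A minimal sketch of this two-source replay merge, assuming both sources return events already sorted by timestamp (names and the timestamp-only representation are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Illustrative sketch: merge event timestamps from the in-memory
// Player Data Store (warm) and the HDF5 archive (cold) into one
// time-ordered stream, clipped to the requested [startTime, endTime].
// Both inputs are assumed to be sorted ascending.
std::vector<uint64_t> mergeReplay(const std::vector<uint64_t>& warm,
                                  const std::vector<uint64_t>& cold,
                                  uint64_t startTime, uint64_t endTime)
{
    std::vector<uint64_t> out;
    std::merge(warm.begin(), warm.end(), cold.begin(), cold.end(),
               std::back_inserter(out));
    // Drop events outside the requested replay window.
    out.erase(std::remove_if(out.begin(), out.end(),
                             [&](uint64_t t) { return t < startTime || t > endTime; }),
              out.end());
    return out;
}
```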

Communication Model

ChronoLog uses Thallium as its RPC framework, layered on top of Mercury and OFI (OpenFabrics Interfaces). The default transport protocol is ofi+sockets; for clusters with RDMA support, ofi+verbs enables native RDMA.

Default service ports and provider IDs

Service                        Port   Provider ID
Visor Client Portal            5555   55
Visor Keeper Registry          8888   88
Keeper Recording Service       6666   66
Keeper→Grapher Drain (RDMA)    9999   99
DataStore Admin Service        4444   44
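For illustration, a service endpoint in this stack is typically addressed by a string combining the transport protocol, host, and port; the exact address syntax ChronoLog's configuration expects is an assumption here, not taken from the source:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build a "<protocol>://<host>:<port>" endpoint
// string from a row of the table above. The format is an assumed
// convention, not ChronoLog's documented configuration syntax.
std::string serviceAddress(const std::string& protocol,
                           const std::string& host, int port)
{
    return protocol + "://" + host + ":" + std::to_string(port);
}
```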

Registration and heartbeat protocol

  1. Each Keeper, Grapher, and Player process starts by sending a Register RPC to ChronoVisor's Recording Process Registry Service (port 8888).
  2. After registration, processes send periodic Heartbeat/Statistics messages so ChronoVisor can track liveness and load.
  3. ChronoVisor maintains DataStoreAdminClient connections to every registered process and uses them to push StartStoryRecording / StopStoryRecording notifications when clients acquire or release stories.
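The liveness side of this protocol can be sketched as a last-seen table; this is a hypothetical simplification of ChronoVisor's registry, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative sketch: record the last heartbeat time per registered
// process and report any process silent longer than a timeout as dead.
class ProcessRegistry
{
public:
    void registerProcess(const std::string& id, uint64_t now) { lastSeen_[id] = now; }
    void heartbeat(const std::string& id, uint64_t now) { lastSeen_[id] = now; }

    bool isAlive(const std::string& id, uint64_t now, uint64_t timeout) const
    {
        auto it = lastSeen_.find(id);
        return it != lastSeen_.end() && now - it->second <= timeout;
    }

private:
    std::map<std::string, uint64_t> lastSeen_;  // process id -> last heartbeat time
};
```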

Key Concepts

  • Chronicle: a named collection of Stories. Carries metadata, indexing granularity, type (standard/priority), and a tiering policy.
  • Story: an individual, named log stream within a Chronicle. The unit of data acquisition — clients acquire and release stories.
  • StoryChunk: a time-range-bound container of log events for a single story. Defined by a start time and end time; events within are ordered by timestamp.
  • LogEvent: a single timestamped record: {storyId, eventTime, clientId, eventIndex, logRecord}.
  • StoryPipeline: the processing pipeline inside Keepers, Graphers, and Players that ingests events/chunks, orders them by time, groups them into StoryChunks, and retires completed chunks to the next tier.
  • Recording Group: a set of Keeper + Grapher + Player processes that collectively handle story recording for a subset of the workload.

For detailed data structure definitions, see the Data Model section.
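The LogEvent fields listed above suggest a natural ordering. This sketch orders events by timestamp and breaks ties by client and per-client index; the tie-break rule is an assumption, not necessarily ChronoLog's exact comparator:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Sketch of the LogEvent record with an assumed ordering:
// primarily by physical timestamp, then by (clientId, eventIndex)
// so concurrent events from different clients sort deterministically.
struct LogEvent
{
    uint64_t storyId;
    uint64_t eventTime;
    uint32_t clientId;
    uint32_t eventIndex;
    // logRecord payload omitted for brevity

    bool operator<(const LogEvent& other) const
    {
        return std::tie(eventTime, clientId, eventIndex)
             < std::tie(other.eventTime, other.clientId, other.eventIndex);
    }
};
```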

Tiered Storage Design

ChronoLog implements a three-tier storage hierarchy that progressively trades latency for capacity:

Tier   Location        Component                     Medium                          Purpose
Hot    Compute nodes   ChronoKeeper                  In-memory (Story Pipeline)      Fast event ingestion with sub-second latency
Warm   Storage node    ChronoGrapher / ChronoPlayer  In-memory (Story Pipeline)      Chunk merging, recent-data playback
Cold   Storage node    ChronoGrapher                 HDF5 files on POSIX filesystem  Long-term persistent archive

Data moves automatically from hot to cold:

  • Keepers retire partial StoryChunks once they exceed the configured chunk duration (default: 30 seconds) or the story stops recording.
  • Graphers merge partials from all Keepers into complete StoryChunks and archive them to HDF5 files.
  • Players maintain a warm copy of recent chunks for fast read-back while the archive catches up.
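The retirement condition above can be expressed as a small predicate; the function name is illustrative, and the 30-second default comes from the text:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a partial StoryChunk is retired once its time
// window has been open longer than the configured chunk duration
// (default 30 s per the text above) or the story stops recording.
bool shouldRetireChunk(uint64_t chunkStartSec, uint64_t nowSec,
                       bool storyStopped, uint64_t chunkDurationSec = 30)
{
    return storyStopped || (nowSec - chunkStartSec >= chunkDurationSec);
}
```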

Tiering policy can be set per-Chronicle (normal, hot, or cold) to bias toward performance or capacity.

Design Principles

  • Physical timestamps — Events carry timestamps assigned at the source. There is no global sequencer; ordering is resolved progressively through the pipeline.
  • Double-buffering — StoryPipelines use a two-deque pattern (active and passive queues) so that ingestion and extraction can proceed in parallel without blocking each other. The active deque receives new data while the passive deque is drained by sequencing/extraction threads; they swap atomically when conditions are met.
  • Parallelized ingestion — Multiple ChronoKeepers per Recording Group accept events concurrently, distributing ingestion load across compute nodes.
  • Batch data movement — Retired StoryChunks are transferred in bulk from Keepers to Graphers, amortizing RPC overhead.
  • RDMA-capable transport — The Keeper→Grapher drain path uses Thallium's tl::bulk for zero-copy RDMA transfers when the OFI provider supports it (ofi+verbs), falling back to ofi+sockets otherwise.
  • Single-writer per tier — Each Recording Group has exactly one Grapher and one Player, avoiding write conflicts at the merge and archive stages.
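The double-buffering principle above can be sketched as a two-deque swap; this is a minimal illustration of the pattern, not ChronoLog's StoryPipeline code:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>

// Illustrative two-deque double buffer: producers append to the active
// deque while the extraction thread swaps in an empty passive deque
// under a short lock, then drains the swapped-out batch without
// blocking further ingestion.
class DoubleBuffer
{
public:
    void ingest(uint64_t event)
    {
        std::lock_guard<std::mutex> lk(mtx_);
        active_.push_back(event);
    }

    // Called by the sequencing/extraction thread: takes everything
    // ingested so far with an O(1) swap.
    std::deque<uint64_t> drain()
    {
        std::deque<uint64_t> passive;
        {
            std::lock_guard<std::mutex> lk(mtx_);
            std::swap(active_, passive);
        }
        return passive;
    }

private:
    std::mutex mtx_;
    std::deque<uint64_t> active_;
};
```

The lock is held only for the append or the pointer swap, so ingestion stalls are bounded by a constant-time critical section regardless of batch size.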

Further Reading