System Overview
ChronoLog is a distributed, tiered log storage system designed for High-Performance Computing (HPC) environments. It captures, sequences, and archives streams of timestamped Events (each such stream is called a Story) without relying on a central sequencer. Each Event carries a physical timestamp assigned at the source, and ChronoLog's pipeline progressively merges and orders these Events as they move through storage tiers. The system is built on Thallium RPC with OFI transport and supports RDMA for high-throughput bulk data movement.
Component Architecture
ChronoLog is composed of five main components, each running as an independent process:
| Component | Role | Typical deployment |
|---|---|---|
| ChronoVisor | Orchestrator — client portal, metadata directory, process registry, load balancing | One per deployment, on a dedicated node |
| ChronoKeeper | Fast event ingestion — receives log events from clients, groups them into partial StoryChunks | Many per deployment, on compute nodes |
| ChronoGrapher | Merge and archive — merges partial StoryChunks into complete ones, writes to persistent storage as HDF5 files | One per Recording Group, on a storage node |
| ChronoPlayer | Read-back — serves story replay queries from both in-memory data and HDF5 archives | One per Recording Group, on a storage node |
| Client Library | Application-facing API (chronolog_client.h) for connecting, creating Chronicles/Stories, recording events, and replaying data | Linked into client applications |
Recording Groups
A Recording Group is a logical grouping of recording processes that work together to handle a subset of the story recording workload:
- Each group contains multiple ChronoKeepers, one ChronoGrapher, and one ChronoPlayer.
- ChronoVisor assigns newly acquired Stories to a recording group using a uniform random distribution for load balancing.
- All processes in a group register with ChronoVisor and send periodic heartbeat/statistics messages so that ChronoVisor can monitor group health and composition.
- A deployment can have multiple Recording Groups, allowing ChronoLog to scale horizontally by adding more groups.
Data Flow
Write path
- Client app calls log_event() with payload → passes to Client library
- Client library timestamps the Event → sends to ChronoVisor
- ChronoVisor assigns the Story to a Recording Group → notifies all group processes
- ChronoKeeper ingests Events into in-memory Story Pipeline → groups into partial StoryChunks
- Retired chunks are drained via RDMA bulk transfer to ChronoGrapher
- ChronoGrapher merges partials from all Keepers → archives complete StoryChunks to HDF5 archive files
Read path
The Player maintains an in-memory copy of the most recent story segments (the same chunks sent to ChronoGrapher), so recent events can be served before they are fully committed to the archive tier.
Communication Model
ChronoLog uses Thallium as its RPC framework, layered on top of Mercury and OFI (OpenFabrics Interfaces). The default transport protocol is ofi+sockets; for clusters with RDMA support, ofi+verbs enables native RDMA.
Default service ports and provider IDs
| Service | Port | Provider ID |
|---|---|---|
| Visor Client Portal | 5555 | 55 |
| Visor Keeper Registry | 8888 | 88 |
| Keeper Recording Service | 6666 | 66 |
| Keeper→Grapher Drain (RDMA) | 9999 | 99 |
| DataStore Admin Service | 4444 | 44 |
Registration and heartbeat protocol
- Each Keeper, Grapher, and Player process starts by sending a Register RPC to ChronoVisor's Recording Process Registry Service (port 8888).
- After registration, processes send periodic Heartbeat/Statistics messages so ChronoVisor can track liveness and load.
- ChronoVisor maintains DataStoreAdminClient connections to every registered process and uses them to push StartStoryRecording / StopStoryRecording notifications when clients acquire or release stories.
Key Concepts
| Term | Definition |
|---|---|
| Chronicle | A named collection of Stories. Carries metadata, indexing granularity, type (standard/priority), and a tiering policy. |
| Story | An individual, named log stream within a Chronicle. The unit of data acquisition — clients acquire and release stories. |
| StoryChunk | A time-range-bound container of log events for a single story. Defined by a start time and end time; events within are ordered by timestamp. |
| LogEvent | A single timestamped record: {storyId, eventTime, clientId, eventIndex, logRecord}. |
| StoryPipeline | The processing pipeline inside Keepers, Graphers, and Players that ingests events/chunks, orders them by time, groups them into StoryChunks, and retires completed chunks to the next tier. |
| Recording Group | A set of Keeper + Grapher + Player processes that collectively handle story recording for a subset of the workload. |
For detailed data structure definitions, see the Data Model section.
Tiered Storage Design
ChronoLog implements a three-tier storage hierarchy that progressively trades latency for capacity:
| Tier | Location | Component | Medium | Purpose |
|---|---|---|---|---|
| Hot | Compute nodes | ChronoKeeper | In-memory (Story Pipeline) | Fast event ingestion with sub-second latency |
| Warm | Storage node | ChronoGrapher / ChronoPlayer | In-memory (Story Pipeline) | Chunk merging, recent-data playback |
| Cold | Storage node | ChronoGrapher | HDF5 files on POSIX filesystem | Long-term persistent archive |
Data moves automatically from hot to cold:
- Keepers retire partial StoryChunks once they exceed the configured chunk duration (default: 30 seconds) or the story stops recording.
- Graphers merge partials from all Keepers into complete StoryChunks and archive them to HDF5 files.
- Players maintain a warm copy of recent chunks for fast read-back while the archive catches up.
Tiering policy can be set per-Chronicle (normal, hot, or cold) to bias toward performance or capacity.
Design Principles
- Physical timestamps — Events carry timestamps assigned at the source. There is no global sequencer; ordering is resolved progressively through the pipeline.
- Double-buffering — StoryPipelines use a two-deque pattern (active and passive queues) so that ingestion and extraction can proceed in parallel without blocking each other. The active deque receives new data while the passive deque is drained by sequencing/extraction threads; they swap atomically when conditions are met.
- Parallelized ingestion — Multiple ChronoKeepers per Recording Group accept events concurrently, distributing ingestion load across compute nodes.
- Batch data movement — Retired StoryChunks are transferred in bulk from Keepers to Graphers, amortizing RPC overhead.
- RDMA-capable transport — The Keeper→Grapher drain path uses Thallium's tl::bulk for zero-copy RDMA transfers when the OFI provider supports it (ofi+verbs), falling back to ofi+sockets otherwise.
- Single-writer per tier — Each Recording Group has exactly one Grapher and one Player, avoiding write conflicts at the merge and archive stages.
Further Reading
- Component deep-dives: ChronoVisor | ChronoKeeper | ChronoGrapher | ChronoPlayer
- Data Model: Overview