Version: 2.5.0

Multi-Node (Cluster) Deployment

deploy_cluster.sh deploys ChronoLog across multiple nodes using parallel-ssh for remote process management. Like deploy_local.sh, it operates on an already-installed ChronoLog tree and supports starting, stopping, and cleaning only — no build or install steps. The script is located at tools/deploy/ChronoLog/deploy_cluster.sh in the repository.

Prerequisites

The following tools must be available on the launch node's PATH:

Tool           Purpose
jq             Parse and generate JSON configuration files
parallel-ssh   Launch/stop remote processes in parallel
ssh            Remote command execution and hostname resolution
ldd            Inspect shared library dependencies
nohup          Run processes independent of the shell session
pkill          Stop processes by name/path
readlink       Resolve symbolic links
realpath       Canonicalize file paths
chrpath        Adjust RPATH entries in binaries

Passwordless SSH must be configured from the launch node to all cluster nodes. The ChronoLog installation tree (<work-dir>) must be accessible at the same path on all nodes (e.g., via a shared filesystem).
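The prerequisites above can be verified before deploying. The sketch below is a hypothetical pre-flight helper, not part of deploy_cluster.sh; the `ssh -o BatchMode=yes` line (commented out) is the standard way to confirm passwordless SSH, since BatchMode forbids password prompts.

```shell
#!/bin/sh
# Sketch: pre-flight check for the required tools (hypothetical helper,
# not part of deploy_cluster.sh).
check_tools() {
  missing=0
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; missing=1; }
  done
  return "$missing"
}

# Verify the launch node's toolchain before deploying:
check_tools jq parallel-ssh ssh ldd nohup pkill readlink realpath chrpath \
  || echo "install the missing tools before running deploy_cluster.sh"

# Passwordless SSH check (run against each cluster node):
# ssh -o BatchMode=yes node01 true || echo "passwordless SSH to node01 not set up"
```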

Host Files

Without a Slurm job ID, the script reads host lists from plain-text files under <work-dir>/conf/, one hostname per line:

File            Role
hosts_visor     Node(s) running ChronoVisor — typically one entry
hosts_grapher   Nodes running ChronoGraphers — one per recording group
hosts_player    Nodes running ChronoPlayers — one per recording group
hosts_keeper    Nodes running ChronoKeepers — all keeper nodes

Example hosts_keeper:

node01
node02
node03
node04

Keepers are distributed evenly across recording groups. With 4 keepers and 2 recording groups, each group gets 2 keepers.

Slurm Integration

When --job-id <JOB_ID> is provided, the script derives the node list from the active Slurm job and overwrites the host files:

  • First node → ChronoVisor
  • Last N nodes → ChronoGraphers and ChronoPlayers (one per recording group)
  • All nodes → ChronoKeepers
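The role assignment above can be sketched as follows. A hard-coded node list stands in for the output of `scontrol show hostnames "$SLURM_JOB_NODELIST"`; the script's exact commands may differ.

```shell
#!/bin/sh
# Sketch of the Slurm role assignment, with a hard-coded node list standing
# in for `scontrol show hostnames "$SLURM_JOB_NODELIST"` (illustrative only).
RECORD_GROUPS=2
NODES='node01
node02
node03
node04'

VISOR=$(printf '%s\n' "$NODES" | head -n 1)                         # first node
GRAPHER_NODES=$(printf '%s\n' "$NODES" | tail -n "$RECORD_GROUPS")  # last N nodes
KEEPER_NODES=$NODES                                                 # all nodes

echo "visor: $VISOR"   # prints "visor: node01"
```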
important

ChronoLog must be launched from an interactive job shell, not via sbatch or srun. The script will exit with an error if it detects it is running inside an srun step.

To start an interactive session first:

salloc -N 8 --time=01:00:00
# then, from the interactive shell:
./deploy_cluster.sh --start --job-id $SLURM_JOB_ID --record-groups 2

Recording Groups

--record-groups controls how many ChronoGrapher + ChronoPlayer pairs are deployed. Each recording group handles a disjoint subset of ChronoKeepers. Increasing the number of recording groups can improve write throughput in large deployments.

The value of --record-groups must be ≤ the total number of keeper nodes.
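The constraint can be checked up front. This is a hypothetical validation sketch with hard-coded counts; the script's own error handling may differ, and in practice the keeper count would come from `wc -l < "$WORK_DIR/conf/hosts_keeper"`.

```shell
#!/bin/sh
# Sketch: validating --record-groups against the keeper node count
# (hypothetical check; deploy_cluster.sh's own handling may differ).
RECORD_GROUPS=3
KEEPER_COUNT=4   # in practice: wc -l < "$WORK_DIR/conf/hosts_keeper"

if [ "$RECORD_GROUPS" -gt "$KEEPER_COUNT" ]; then
  echo "error: --record-groups ($RECORD_GROUPS) exceeds keeper nodes ($KEEPER_COUNT)" >&2
  exit 1
fi
echo "ok: $RECORD_GROUPS groups over $KEEPER_COUNT keepers"
```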

Execution Modes

Exactly one mode must be specified per invocation:

Mode      Description
--start   Start all ChronoLog processes across the cluster
--stop    Stop all ChronoLog processes gracefully (force-kills after 5 minutes)
--clean   Remove generated config files, per-group host files, logs, and output. All processes must be stopped first.
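The graceful-stop-then-force-kill pattern behind --stop can be sketched as below, using a `sleep` process in place of a ChronoLog daemon and a 2-second grace period instead of the script's 5 minutes. Note the real script stops processes by name via pkill; killing by PID here is only for a self-contained demo.

```shell
#!/bin/sh
# Sketch: graceful stop with force-kill fallback (illustrative only;
# deploy_cluster.sh uses pkill by process name and a 5-minute grace period).
sleep 60 &                       # stand-in for a ChronoLog process
pid=$!

kill "$pid"                      # graceful: SIGTERM first
for _ in 1 2; do                 # short grace period
  kill -0 "$pid" 2>/dev/null || break
  sleep 1
done
if kill -0 "$pid" 2>/dev/null; then
  kill -9 "$pid"                 # force-kill if still running
fi
wait "$pid" 2>/dev/null || true
echo "stopped"
```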

Options

Option                          Default                                            Description
-w, --work-dir <path>           script location/..                                 Root of the installed ChronoLog tree
-r, --record-groups <n>         1 (or line count of hosts_grapher if it exists)    Number of recording groups
-j, --job-id <id>               (none)                                             Slurm job ID; overrides host files when set
-m, --monitor-dir <path>        <work-dir>/monitor                                 Directory for remote process launch logs
-u, --output-dir <path>         <work-dir>/output                                  Directory for story file output
-v, --visor-bin <path>          <work-dir>/bin/chrono-visor                        Path to the ChronoVisor binary
-g, --grapher-bin <path>        <work-dir>/bin/chrono-grapher                      Path to the ChronoGrapher binary
-p, --keeper-bin <path>         <work-dir>/bin/chrono-keeper                       Path to the ChronoKeeper binary
-a, --player-bin <path>         <work-dir>/bin/chrono-player                       Path to the ChronoPlayer binary
-f, --conf-file <path>          <work-dir>/conf/default-chrono-conf.json           Main configuration template
-n, --client-conf-file <path>   <work-dir>/conf/default-chrono-client-conf.json    Client configuration template
-q, --visor-hosts <path>        <work-dir>/conf/hosts_visor                        Override path to visor host file
-k, --grapher-hosts <path>      <work-dir>/conf/hosts_grapher                      Override path to grapher host file
-o, --keeper-hosts <path>       <work-dir>/conf/hosts_keeper                       Override path to keeper host file
-e, --verbose                   false                                              Enable verbose output

Examples

Start with pre-configured host files (default work-dir):

./deploy_cluster.sh --start

Start using a Slurm job with 2 recording groups:

./deploy_cluster.sh --start --job-id $SLURM_JOB_ID --record-groups 2

Start from a custom installation with 3 recording groups:

./deploy_cluster.sh --start --work-dir /shared/chronolog --record-groups 3

Stop the deployment using existing host files:

./deploy_cluster.sh --stop

Stop using a Slurm job (re-derives hosts from job):

./deploy_cluster.sh --stop --job-id $SLURM_JOB_ID

Clean up generated files (run after stopping):

./deploy_cluster.sh --clean