90 Seconds to 300 Milliseconds: Fixing CI Install Times

Ninety seconds. That’s how long yarn install takes on a fresh CI runner for a monorepo with 1,500 packages. Every run. Every PR. Every merge queue check. The developer who triggered the build has already moved on to another task. The CI runner is still downloading tarballs.

Teams know this hurts. The fix everyone reaches for: cache node_modules as a tarball. GitHub Actions provides actions/cache. Azure Pipelines provides Cache@2. The recipe is everywhere: hash the lockfile, tar up node_modules, upload it, restore on the next run. Lockfile unchanged? Skip install entirely. Ninety seconds becomes two seconds. Problem solved.

Until one package changes.

The enemy: the tarball model

When you cache node_modules as a single tarball keyed by the lockfile hash, you are making a bet: the lockfile changes rarely enough that re-uploading 1.4 GB is acceptable when it does. That bet loses in any actively maintained monorepo. A Dependabot PR bumps one transitive dependency. The lockfile hash changes. The entire tarball is invalidated. Re-download all 1,500 packages. Re-tar. Re-upload. Back to ninety seconds.

That’s the obvious failure. The non-obvious ones are worse.

Symlinks break cross-platform. pnpm’s node_modules is a symlink graph. node_modules/.pnpm/react@18.2.0/node_modules/react/ is the real location; node_modules/react/ is a symlink to it. Cross-platform tar implementations disagree on how to handle absolute vs relative symlink targets. Restore a pnpm node_modules tarball on a runner with a different path prefix and the symlinks point nowhere. The install “succeeded.” The packages are missing.

Update granularity is all-or-nothing. One package changed out of 1,500. The tarball model forces you to re-upload all 1,500. Per-package CAS uploads only the one that changed. The difference is 1.4 GB vs 2 MB.
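
The granularity difference can be sketched as a set difference. This is a minimal illustration, assuming the lockfile has already been parsed into a map from package id to integrity string and that the store can report which integrity strings it holds. The function name and both parameter shapes are hypothetical, not rage’s API, and the integrity values in any usage are placeholders.

```rust
use std::collections::{BTreeMap, HashSet};

/// Returns the ids of packages whose integrity string is not yet in the store.
/// `lockfile` maps a package id (e.g. "esbuild@0.21.5") to its integrity string;
/// `stored` is the set of integrity strings already present in the CAS.
/// Both shapes are assumptions made for this sketch.
fn packages_to_upload<'a>(
    lockfile: &'a BTreeMap<String, String>,
    stored: &HashSet<String>,
) -> Vec<&'a str> {
    lockfile
        .iter()
        .filter(|(_, integrity)| !stored.contains(*integrity))
        .map(|(id, _)| id.as_str())
        .collect()
}
```

With a tarball keyed by the whole lockfile, the equivalent question has only two answers: upload nothing or upload everything.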

ABI fragility. Native addons compiled on macOS do not load on Linux. A tarball does not encode what platform it was built for. Restore a tarball from a macOS CI runner onto a Linux runner and esbuild’s Mach-O binary fails the moment something tries to execute it. You discover this at runtime, not at restore time.

Debugging opacity. When a tarball restore fails, you get “tar: Unexpected EOF” or a silent corruption. There is no way to know which of the 1,500 packages is wrong. With per-package CAS, a failure points at the failing key, the expected hash, and the on-disk state.

Storage scaling. Every lockfile change adds another full tarball, so a monorepo’s lifetime worth of tarballs is many multiples of a single install. A CAS grows only by the package versions that are actually new, because most package versions are reused across builds.

This is not a new observation. BuildXL uses per-pip CAS. Bazel rules_js uses per-package CAS. pnpm’s local store is itself a per-package CAS. The consensus is clear. CI cache actions ignore it.

The insight: lockfiles already contain the keys

Every modern lockfile already contains a content hash for every external package:

# pnpm-lock.yaml
integrity: sha512-abc123...

# yarn.lock (classic)
integrity sha512-abc123...

# yarn.lock (berry)
checksum: 10c0/sha512hex...

# package-lock.json
"integrity": "sha512-abc123..."

These are computed from the package archive bytes: the registry tarball for npm, pnpm, and Yarn classic, and Yarn’s own zip archive of the package for Berry. They are deterministic and machine-independent. The package manager already verifies them on every install.
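
Pulling those strings out can be sketched with a line-oriented helper. This is only an illustration covering the four line shapes listed above; real lockfiles have more structure (nesting, inline maps, multiple entries per package), and `integrity_from_line` is a hypothetical name, not something from rage.

```rust
/// Pulls the integrity/checksum value out of one lockfile line, if it has one.
/// Handles only the four line shapes shown in the listing above.
fn integrity_from_line(line: &str) -> Option<&str> {
    let line = line.trim();
    for prefix in ["\"integrity\":", "integrity:", "integrity ", "checksum:"] {
        if let Some(rest) = line.strip_prefix(prefix) {
            // Drop surrounding whitespace, a trailing JSON comma, and quotes.
            let value = rest.trim().trim_end_matches(',').trim_matches('"');
            return if value.is_empty() { None } else { Some(value) };
        }
    }
    None
}
```

A production implementation would use a proper YAML or JSON parser per lockfile format; the point here is only that the key material is already sitting in the file.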

The CAS key is trivial:

cas_key = blake3(integrity_string)

The value is the tarball bytes, copied from the package manager’s local cache. No re-download from the registry. On restore, extract the tarball into node_modules/{name}/. That’s it.

This is what rage does. The CAS is monotonic: once a package version enters the store, it stays until garbage-collected. A monorepo’s lockfile churn over months builds a CAS that contains every version ever pinned. Restore is milliseconds of hardlink syscalls, not seconds of network I/O.
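
The monotonic property can be sketched as a write-once store. This is a minimal sketch under assumptions: the key is taken as an already-computed 32-byte value (the hashing step is omitted so the example needs no external crates), and the sharded directory layout, `cas_path`, and `put_if_absent` are hypothetical rather than rage’s actual on-disk format.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Maps a 32-byte key to `<root>/<first two hex chars>/<full hex>`.
/// The sharding scheme is an assumption made for this sketch.
fn cas_path(root: &Path, key: &[u8; 32]) -> PathBuf {
    let hex: String = key.iter().map(|b| format!("{b:02x}")).collect();
    root.join(&hex[..2]).join(&hex)
}

/// Stores `bytes` under `key` unless an entry already exists.
/// Returns true if this call wrote the entry.
fn put_if_absent(root: &Path, key: &[u8; 32], bytes: &[u8]) -> io::Result<bool> {
    let path = cas_path(root, key);
    if path.exists() {
        return Ok(false); // monotonic: existing entries are never rewritten
    }
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    // Write to a temporary name, then rename, so readers never see a partial entry.
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, bytes)?;
    fs::rename(&tmp, &path)?;
    Ok(true)
}
```

Because the key is derived from content, a second write for the same key has nothing new to say, which is why skipping it is safe.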

The bug that drove the v2 design

The first implementation cached yarn install as a skip marker. Here’s what it looked like in crates/scheduler/:

let marker = cache.dir().join(format!("root-{fp}.done"));
if marker.exists() {
    eprintln!("[rage] {}#{} ✓ (cached)", ..., ...);
    return Ok(());                              // restores nothing
}
// run the install
let _ = std::fs::write(&marker, b"");           // empty file as the "artifact"

The marker file was a zero-byte .done file. Its existence meant “install already ran.” The cache had cached nothing. It had cached the fact that an install happened, not the result of the install.

On a fresh CI runner with the same lockfile as a previous build: the marker file existed in the shared cache. The install was skipped. node_modules/ did not exist. The next task ran tsc and got command not found. The cache reported a hit. The build crashed with a missing binary.

This is the structural problem with marker-based caching. The marker is a statement about history (“an install ran”), not a statement about state (“these packages exist on disk”). On a machine that shares cache metadata but not filesystem state, the distinction is fatal.

The fix: cache the package contents themselves, at per-package granularity, and restore them on hit. The marker becomes a summary of what needs to be on disk, not a replacement for it.
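
One way to read “a summary of what needs to be on disk” is that a hit only counts after the summary has been checked against the filesystem. The sketch below assumes the summary is a list of package directory names and that a `package.json` inside each directory is a good enough presence check; `is_materialized` is a hypothetical helper, not code from rage.

```rust
use std::path::Path;

/// A cache hit only counts if every package the summary lists is on disk.
/// `expected` holds package directory names relative to `node_modules`.
/// Checking for package.json is an assumption made for this sketch.
fn is_materialized(node_modules: &Path, expected: &[&str]) -> bool {
    expected
        .iter()
        .all(|name| node_modules.join(name).join("package.json").is_file())
}
```

On the fresh runner from the bug report, this check returns false even though the marker exists, so the scheduler would restore or install instead of skipping.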

Postinstall: the three-axis cache key

Package tarballs are platform-independent. Postinstall scripts are not. A postinstall can compile a native addon with node-gyp, download a platform-specific binary, or write to ~/.cache/. The tarball cache restores source files faithfully. It cannot restore the side effects of running a postinstall script.

This is where the three-axis cache key comes in. From crates/scheduler/src/postinstall_cache.rs:

pub fn postinstall_cas_key(task: &PostinstallTask) -> [u8; 32] {
    let platform = format!("{}-{}", std::env::consts::OS, std::env::consts::ARCH);
    let node_version = read_node_version();
    let input = format!(
        "{}:{}:{}",
        task.tarball_integrity, platform, node_version
    );
    blake3::hash(input.as_bytes()).into()
}

Three axes. Each one earns its place.

tarball_integrity is the lockfile hash for this exact package version. Bump esbuild from 0.21.4 to 0.21.5 and this axis changes. The old postinstall output is dead.

platform is macos-aarch64, linux-x86_64, or whatever std::env::consts reports for the runner (Rust spells the OS macos, not darwin). A native binary compiled on macOS cannot run on Linux. Without this axis, a macOS CI run would cache a Mach-O binary and a Linux runner would try to execute it and fail.

node_version is the running Node.js major version. Native addons compiled against the Node 18 ABI fail to load on Node 20, because the native module ABI version (NODE_MODULE_VERSION, exposed as process.versions.modules) changes across major Node releases. Without this axis, upgrading Node breaks every such addon in the cache. Silently.

A change in any axis invalidates the cache for that package’s postinstall. There is no way to restore a macos-aarch64 postinstall on linux-x86_64 by accident.
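
The effect of each axis can be seen on the pre-hash input alone. The sketch below is an illustration under assumptions: `node_major` is a hypothetical stand-in for whatever `read_node_version` does (the source does not show it), and `key_input` stops before the blake3 step so the example needs no external crates.

```rust
/// Extracts the major version from `node --version` output such as "v20.11.1".
/// Hypothetical helper; the source does not show how read_node_version works.
fn node_major(version: &str) -> Option<u32> {
    version
        .trim()
        .trim_start_matches('v')
        .split('.')
        .next()?
        .parse()
        .ok()
}

/// Builds the pre-hash key input in the same "integrity:platform:node" shape
/// as the format! call above. The hashing step is deliberately left out.
fn key_input(integrity: &str, os: &str, arch: &str, node: u32) -> String {
    format!("{integrity}:{os}-{arch}:{node}")
}
```

Any two inputs that differ in one axis produce different strings, and therefore different keys once hashed.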

The manifest model

Restoring a postinstall is not “copy files back.” It’s a structured materialization from a manifest that records three things naive file copying ignores.

The manifest entry, from the rage source:

pub enum FileKind {
    Regular,
    Symlink(PathBuf),
}

pub struct ManifestEntry {
    pub rel_path: PathBuf,
    pub content_hash: [u8; 32],   // blake3; zeroed for symlinks
    pub mode: u32,                 // st_mode & 0o777; zero for symlinks
    pub kind: FileKind,
}

Executable bits. A native addon compiled by node-gyp is 0o755. The mode is stored in the manifest and restored via set_permissions. Without this, esbuild’s binary loses its execute bit on restore and every bundler invocation fails with EACCES.

Symlinks as symlinks. A postinstall that creates bin/foo -> ../target/release/foo is recorded as a Symlink(PathBuf) entry. On restore, it becomes a symlink, not a copy. This matters for tools like Playwright that build a symlink tree to platform-specific browser binaries.

Hardlinks from CAS. Each regular file is restored as a hardlink into ~/.rage/artifacts/content/. Zero-copy at the byte level. The kernel writes a directory entry, not file content. This is why restore is measured in milliseconds, not seconds.
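
The capture side of that manifest can be sketched as follows. This is a Unix-only illustration, assuming the caller walks files and symlinks but not directories; `FileKind` is redeclared so the sketch compiles on its own, and `capture` is a hypothetical name. Content hashing is omitted.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt; // provides Permissions::mode
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
enum FileKind {
    Regular,
    Symlink(PathBuf),
}

/// Records what a restore needs beyond the bytes: entry kind and permission bits.
/// Assumes `path` is a file or symlink; directories are not distinguished here.
fn capture(path: &Path) -> io::Result<(FileKind, u32)> {
    // symlink_metadata does not follow links, so a symlink is seen as a symlink.
    let meta = fs::symlink_metadata(path)?;
    if meta.file_type().is_symlink() {
        // Store the link target verbatim; mode is zero for symlinks.
        Ok((FileKind::Symlink(fs::read_link(path)?), 0))
    } else {
        Ok((FileKind::Regular, meta.permissions().mode() & 0o777))
    }
}
```

Using `fs::metadata` here instead would follow the link and silently record the target as a regular file, which is exactly the copy-not-symlink outcome described above.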

The restore is a walk over the manifest:

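use std::fs::Permissions;
use std::os::unix::fs::PermissionsExt; // provides Permissions::from_mode

for entry in &manifest {
    let dest = target_dir.join(&entry.rel_path);
    // Make sure the parent directory exists before creating the entry.
    if let Some(parent) = dest.parent() {
        std::fs::create_dir_all(parent)?;
    }
    match &entry.kind {
        FileKind::Regular => {
            let cas_path = store.cas_file_path(&entry.content_hash);
            // hard_link fails if dest already exists, so clear any stale file.
            let _ = std::fs::remove_file(&dest);
            if std::fs::hard_link(&cas_path, &dest).is_err() {
                std::fs::copy(&cas_path, &dest)?;  // cross-device fallback
            }
            // Reapply the recorded permission bits (e.g. 0o755 for binaries).
            std::fs::set_permissions(
                &dest,
                Permissions::from_mode(entry.mode),
            )?;
        }
        FileKind::Symlink(target) => {
            // Recreate the link itself; the target is stored verbatim.
            let _ = std::fs::remove_file(&dest);
            std::os::unix::fs::symlink(target, &dest)?;
        }
    }
}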

The hard_link call is the fast path. If the CAS and the target directory are on the same filesystem (they usually are), the link succeeds and no bytes are copied. If they’re on different filesystems (EXDEV), it falls back to copy. Either way, the mode and symlink structure are preserved exactly.
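
The fast path and its fallback can be checked in isolation. The sketch below is Unix-only and uses hypothetical helper names (`link_or_copy`, `same_inode`); it shows that a successful hard link leaves both paths pointing at one inode, which is why no bytes move.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // provides dev() and ino()
use std::path::Path;

/// Links `dest` to `src`, copying only if linking fails (e.g. EXDEV).
/// Returns true when the fast path (hard link) was taken.
fn link_or_copy(src: &Path, dest: &Path) -> io::Result<bool> {
    let _ = fs::remove_file(dest); // hard_link fails if dest already exists
    match fs::hard_link(src, dest) {
        Ok(()) => Ok(true),
        Err(_) => fs::copy(src, dest).map(|_| false),
    }
}

/// Two paths are the same file when device and inode numbers both match.
fn same_inode(a: &Path, b: &Path) -> io::Result<bool> {
    let (ma, mb) = (fs::metadata(a)?, fs::metadata(b)?);
    Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
}
```

One consequence worth keeping in mind: permission bits live on the inode, so changing the mode through a hardlinked path also changes it for the CAS entry. A store that hardlinks therefore relies on each content hash being associated with a single mode; how rage handles that case is not shown in the excerpts here.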

The numbers

A 200-package monorepo bumps esbuild from 0.21.4 to 0.21.5. One lockfile entry changes.

With the tarball model: lockfile hash changed, entire cache invalidated, re-download all 200 packages, re-tar, re-upload. Back to 90 seconds.

With per-package CAS: 1 tarball miss, 199 hardlink hits. 1 postinstall key changes (esbuild’s integrity differs), 199 postinstall keys are unchanged. The scheduler reruns esbuild’s postinstall, restores the other 199 from CAS via hardlinks. Total time: 300ms.

The code is at github.com/kenotron-ms/rage. The install caching design doc is at docs/architecture/INSTALL-CACHING.md.