How We Made the Cache Stop Lying

Post 1 showed the lie. Every JS monorepo cache hashes what you declared and trusts you got it right. You didn’t. The cache returns stale output, and nobody finds out for weeks.

This post shows the fix.

The fix is not “be more careful about declaring inputs.” The fix is a two-phase fingerprinting scheme that BuildXL has used at Microsoft scale for years, now ported to JS monorepos in rage. The algorithm separates what could matter from what actually mattered. The first phase is cheap. The second phase is correct. Together, they give you a cache that tells the truth without requiring perfect input declarations.

Here’s the full algorithm.

Phase 1: The Weak Fingerprint

The weak fingerprint (WF) captures everything statically known about a task before it runs. It’s the “what could matter” hash. Here’s the actual struct from crates/cache/src/weak_fp.rs in rage:

pub struct WeakFpInputs<'a> {
    pub command: &'a str,
    pub tool_path: &'a Path,
    pub package_path: &'a Path,
    pub declared_input_globs: &'a [String],
    pub tracked_env: &'a [(String, String)],
    pub dep_abi_fingerprints: &'a [(String, String)],
}

Six fields. Each one earns its place.

command is the shell command the task runs: tsc -b, jest --ci, whatever. Different command, different output. If you change from tsc -b to tsc -b --declaration, the WF changes and the old cache entry is dead.

tool_path resolves to the actual binary on disk. Rage hashes the bytes of the resolved tool (following symlinks through node_modules/.bin/tsc into typescript/bin/tsc). A TypeScript upgrade flips the WF for every TypeScript task in the monorepo. You don’t declare this. It happens automatically. This is what Turborepo misses: same tsc -b command, different compiler version, same cache key. Rage catches it because the tool binary hash changed.

package_path is the workspace-relative path. It differentiates packages/api#build from packages/web#build when both run the same command. Without it, two packages running tsc -b would collide in the cache.

declared_input_globs expand against the package directory. For TypeScript, the plugin defaults to **/*.ts and tsconfig*.json. Each matching file gets hashed with blake3. The plugin author defines these, not the developer. You don’t write input globs for the common case.

tracked_env covers environment variables that affect output: NODE_ENV=production vs NODE_ENV=development. These are opt-in from rage.json. They’re never auto-detected because auto-detecting which environment variables affect a program’s output is undecidable.

dep_abi_fingerprints is the most interesting field. For each direct dependency package, rage stores the ABI hash from that dependency’s last build. If packages/utils is a dependency of packages/api, then api’s WF includes the ABI fingerprint of utils. Change something in utils that affects its public API surface, and api’s WF changes. Change something that doesn’t (a comment, an internal refactor), and api’s WF stays the same. More on this in the ABI section below.

The WF is blake3 over all six components. It’s fast to compute: one pass over the declared globs (typically hundreds of kilobytes of source per package), plus a few string hashes. Sub-millisecond on warm filesystem caches.
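To make the shape of that hash concrete, here is a minimal sketch. It is not rage's code. It assumes a simplified `WeakInputs` struct (hypothetical) that takes file and tool contents as in-memory bytes, and it swaps blake3 for the standard library's `DefaultHasher` so it builds without dependencies. `DefaultHasher` is not cryptographic and not stable across Rust releases, so treat it strictly as a stand-in.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical, simplified mirror of the six WF inputs. Contents are passed
/// in as bytes so the sketch needs no filesystem access.
#[derive(Clone, Copy)]
pub struct WeakInputs<'a> {
    pub command: &'a str,
    pub tool_bytes: &'a [u8],
    pub package_path: &'a str,
    pub declared_files: &'a [(&'a str, &'a [u8])], // (path, content), globs already expanded
    pub tracked_env: &'a [(&'a str, &'a str)],
    pub dep_abi: &'a [(&'a str, &'a str)],
}

/// Feed one tagged, length-prefixed field so that ("ab", "c") and ("a", "bc")
/// can never produce the same byte stream.
fn field(h: &mut DefaultHasher, tag: &str, bytes: &[u8]) {
    h.write(tag.as_bytes());
    h.write(&(bytes.len() as u64).to_le_bytes());
    h.write(bytes);
}

/// Stand-in weak fingerprint: one hash over all six components.
pub fn weak_fingerprint(i: &WeakInputs<'_>) -> u64 {
    let mut h = DefaultHasher::new();
    field(&mut h, "cmd", i.command.as_bytes());
    field(&mut h, "tool", i.tool_bytes);
    field(&mut h, "pkg", i.package_path.as_bytes());

    // Sort each list so the result does not depend on iteration order.
    let mut files = i.declared_files.to_vec();
    files.sort();
    for (path, content) in files {
        field(&mut h, "file", path.as_bytes());
        field(&mut h, "data", content);
    }
    let mut env = i.tracked_env.to_vec();
    env.sort();
    for (key, value) in env {
        field(&mut h, "env", key.as_bytes());
        field(&mut h, "val", value.as_bytes());
    }
    let mut deps = i.dep_abi.to_vec();
    deps.sort();
    for (package, abi) in deps {
        field(&mut h, "dep", package.as_bytes());
        field(&mut h, "abi", abi.as_bytes());
    }
    h.finish()
}
```

The length prefix is the one design choice worth copying: without it, adjacent fields can bleed into each other and two different inputs can hash the same.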

A declared-inputs hash like this is the entire cache key in Turborepo. Turborepo stops here. Rage uses it as the first phase of a two-phase lookup.

The Bridge: PathSets

Between the two phases sits the pathset. This is the data the sandbox produces when a task runs. It’s the list of files the process actually read and wrote:

pub struct Pathset {
    pub reads: BTreeSet<PathBuf>,    // every file the task read
    pub writes: BTreeSet<PathBuf>,   // every file the task wrote
}

The pathset comes from the file-access sandbox. On macOS, rage injects a dylib via DYLD_INSERT_LIBRARIES that intercepts filesystem calls. On Linux, it uses eBPF tracepoints. Both record every open, read, stat, and write the task performs, then ship the log back to the cache layer.

The pathset is stored keyed by WF. A single WF can have multiple pathsets attached. The same tsc -b invocation might follow different code paths under different conditions and read different files. The strong fingerprint phase handles the disambiguation.

The pathset is the bridge between the two phases. Phase 1 asks “what could matter?” Phase 2 asks “given what the task actually read last time, has any of it changed?” The pathset is the answer to “what did the task actually read last time.”

Here’s why this matters: if your task reads ../../tsconfig.base.json two directories up, and you never declared it as an input, the sandbox catches the read. The pathset includes it. The next cache lookup hashes its content. If someone changes tsconfig.base.json, the cache misses and re-runs the task. No human had to remember to declare that file. The mechanism caught it.
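A small sketch of that gap between declared and observed. It assumes the sandbox log can be reduced to `(Access, path)` events with workspace-relative paths. The `Access` enum and both function names are hypothetical, not rage's API.

```rust
use std::collections::BTreeSet;
use std::path::PathBuf;

/// Hypothetical event kinds; a real sandbox log would carry more detail.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    Read,
    Write,
}

#[derive(Debug, Default)]
pub struct Pathset {
    pub reads: BTreeSet<PathBuf>,
    pub writes: BTreeSet<PathBuf>,
}

/// Fold an access log into a pathset. The sets deduplicate repeated
/// accesses and keep paths in sorted order.
pub fn pathset_from_log(log: &[(Access, &str)]) -> Pathset {
    let mut pathset = Pathset::default();
    for &(access, path) in log {
        match access {
            Access::Read => {
                pathset.reads.insert(PathBuf::from(path));
            }
            Access::Write => {
                pathset.writes.insert(PathBuf::from(path));
            }
        }
    }
    pathset
}

/// Reads the task performed that no declared input covered.
pub fn undeclared_reads(pathset: &Pathset, declared: &BTreeSet<PathBuf>) -> Vec<PathBuf> {
    pathset.reads.difference(declared).cloned().collect()
}
```

For a log that reads `packages/api/src/index.ts` and `tsconfig.base.json` with only the former declared, `undeclared_reads` returns the base config. That is the file a declaration-only cache would never hash.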

Phase 2: The Strong Fingerprint

The strong fingerprint (SF) is the WF plus the content hashes of every file in the pathset. This is the “what actually mattered” hash. From crates/cache/src/strong_fp.rs:

use std::ffi::OsStr;
use std::path::{Path, PathBuf};

pub fn compute_strong_fingerprint(weak_fp: &str, pathset_reads: &[PathBuf]) -> String {
    let mut hasher = blake3::Hasher::new();
    hasher.update(b"wf:");
    hasher.update(weak_fp.as_bytes());
    hasher.update(b"\n");

    let mut sorted: Vec<&Path> = pathset_reads.iter()
        .map(|p| p.as_path())
        .filter(|p| !p.components()
            .any(|c| c.as_os_str() == OsStr::new("node_modules")))
        .collect();
    sorted.sort();
    sorted.dedup();

    for p in sorted {
        hasher.update(b"read:");
        hasher.update(p.as_os_str().as_encoded_bytes());
        hasher.update(b":");
        let content = std::fs::read(p).unwrap_or_default();
        hasher.update(blake3::hash(&content).as_bytes());
        hasher.update(b"\n");
    }

    hasher.finalize().to_hex().to_string()
}

Three details worth calling out in this code.

First: node_modules files are excluded from the SF. They’re pinned by the lockfile, which is already covered by the root install task’s fingerprint. A lockfile change invalidates the install task and cascades to all downstream tasks. Excluding node_modules turns a TypeScript SF computation from O(thousands of files) into O(actual source files). Every .d.ts in the TypeScript stdlib closure would otherwise get hashed on every lookup. That’s the kind of cost that makes the cache slower than running the task.

Second: the path is included alongside the content hash. If two files swap paths, the SF changes. Content-identity alone is not enough. The same bytes at a different path can produce different output (think tsconfig.json path mappings).

Third: a missing file hashes as the empty buffer, but the path is still in the SF. A file that existed last run and is gone this run produces a different SF. Present-vs-absent is a change, and the cache catches it.
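Those three properties are easy to check in isolation. The sketch below is an assumption-laden stand-in, not the function above: it reads from an in-memory map (`FakeFs`, hypothetical) instead of disk, and it uses the standard library's `DefaultHasher` in place of blake3 so it builds without dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::ffi::OsStr;
use std::hash::Hasher;
use std::path::{Path, PathBuf};

/// In-memory stand-in for the filesystem: path -> content.
pub type FakeFs = BTreeMap<PathBuf, Vec<u8>>;

/// Stand-in strong fingerprint with the same structure as the real one:
/// WF first, then (path, content) for every non-node_modules read.
pub fn strong_fingerprint(weak_fp: &str, reads: &[PathBuf], fs: &FakeFs) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(b"wf:");
    h.write(weak_fp.as_bytes());
    h.write(b"\n");

    let mut sorted: Vec<&Path> = reads
        .iter()
        .map(|p| p.as_path())
        // Detail 1: node_modules reads are skipped entirely.
        .filter(|p| !p.components().any(|c| c.as_os_str() == OsStr::new("node_modules")))
        .collect();
    sorted.sort();
    sorted.dedup();

    for p in sorted {
        // Detail 3: a missing file contributes empty content, but its path
        // is still hashed.
        let content: &[u8] = fs.get(p).map(|v| v.as_slice()).unwrap_or(&[]);
        h.write(b"read:");
        // Detail 2: the path goes in next to the content.
        h.write(p.to_string_lossy().as_bytes());
        h.write(b":");
        h.write(&(content.len() as u64).to_le_bytes());
        h.write(content);
        h.write(b"\n");
    }
    h.finish()
}
```

One consequence of hashing a missing file as the empty buffer is that a deleted file and a file truncated to zero bytes look the same. Both still differ from any non-empty previous version, which is the case that matters for invalidation.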

The Lookup Algorithm

Here’s how the pieces fit together on a cache lookup. From crates/cache/src/lookup.rs:

fn lookup(task: &Task) -> Option<CachedOutput> {
    let wf = compute_weak_fingerprint(&WeakFpInputs { ... });
    let candidates = pathset_store.get_pathsets(&wf)?;          // possibly many
    for pathset in candidates {
        let sf = compute_strong_fingerprint(&wf, &pathset.reads);
        if let Some(output) = output_store.get(&sf) {
            return Some(output);                                 // hit
        }
    }
    None                                                         // miss
}

Step by step:

Step 1. Compute the weak fingerprint from the task’s static inputs. This is fast: hash the command, tool binary, package path, declared globs, env vars, and upstream ABI hashes.

Step 2. Look up the WF in the pathset store. This is a filesystem read of ~/.rage/cache/pathsets/{wf_prefix}/{wf_hex}.json. If no pathsets exist for this WF, it’s a cold cache miss. Run the task.

Step 3. For each candidate pathset stored under this WF, compute the strong fingerprint. Read every file in the pathset from disk, hash each one with blake3, combine with the WF. This is the expensive step of the lookup: you’re hashing the actual content of every file the task read last time. With node_modules excluded, that is typically hundreds of kilobytes of source, which blake3 still gets through in under a millisecond.

Step 4. Look up the SF in the output store. If a match exists, it’s a cache hit. Return the stored outputs, stdout, and stderr. Replay them to the terminal.

Step 5. If no candidate pathset produces an SF match, it’s a cache miss. Run the task inside the sandbox. The sandbox produces a new pathset. Compute the SF from the new pathset. Store the pathset under the WF. Store the outputs under the SF.

[Figure: The Two-Phase Lookup Flow. (1) Compute the weak fingerprint from command + tool + globs + env + ABI. (2) Look up the WF in pathset_store (WF -> [Pathset_A, Pathset_B, ...]); no pathsets means a cold miss. (3) For each candidate pathset, SF = blake3(WF + content_hashes(reads)). (4) Look up the SF in output_store; a match is a cache hit and the outputs are replayed, otherwise try the next candidate. (5) On a miss, run the task in the sandbox and store the new pathset and SF.]

The cost on a hit: one WF hash (sub-ms), one pathset JSON read, one SF hash over the pathset’s actual files (sub-ms for typical TypeScript packages), and one output store read. The slow step is hashing the contents of the pathset’s read files, and with node_modules excluded, that’s fast.

The cost on a miss: running the task plus one pathset write plus one output write. A one-time cost that sets up the next hit.
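The hit and miss paths above can be sketched end to end with in-memory stores. Everything here is a simplification under stated assumptions: `Cache`, `Files`, and `strong_fp` are hypothetical names, files live in a `HashMap` rather than on disk, and the standard library's `DefaultHasher` stands in for blake3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Stand-in for the disk: path -> current content.
pub type Files = HashMap<String, String>;

/// Toy strong fingerprint: WF plus (path, content) for each read.
/// A missing file hashes as empty content.
fn strong_fp(wf: &str, reads: &[String], files: &Files) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(wf.as_bytes());
    for path in reads {
        let content = files.get(path).map(String::as_str).unwrap_or("");
        h.write(&(path.len() as u64).to_le_bytes());
        h.write(path.as_bytes());
        h.write(&(content.len() as u64).to_le_bytes());
        h.write(content.as_bytes());
    }
    h.finish()
}

#[derive(Default)]
pub struct Cache {
    pathsets: HashMap<String, Vec<Vec<String>>>, // WF -> candidate read lists
    outputs: HashMap<u64, String>,               // SF -> stored output
}

impl Cache {
    /// Steps 2 to 4: find candidates by WF, then verify each by SF.
    pub fn lookup(&self, wf: &str, files: &Files) -> Option<&String> {
        let candidates = self.pathsets.get(wf)?; // no pathsets: cold miss
        candidates
            .iter()
            .find_map(|reads| self.outputs.get(&strong_fp(wf, reads, files)))
    }

    /// Step 5: after running the task, record what it read and produced.
    pub fn store(&mut self, wf: &str, mut reads: Vec<String>, files: &Files, output: String) {
        reads.sort();
        reads.dedup();
        let sf = strong_fp(wf, &reads, files);
        let entry = self.pathsets.entry(wf.to_string()).or_default();
        if !entry.contains(&reads) {
            entry.push(reads);
        }
        self.outputs.insert(sf, output);
    }
}
```

Note what happens when a file is edited and then reverted: the first output is still stored under its original SF, so the revert is a hit without re-running anything.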

Why two phases instead of one

The naive design hashes all declared inputs into a single cache key. That fails in two directions.

Over-declaration is the Bazel approach. Every input must be declared. Miss one and the build is wrong. Correctness costs constant human effort.

Under-declaration is the Turborepo approach. Hash only what the developer declared. Miss one and the cache silently lies. Correctness depends on human perfection.

The two-phase scheme escapes this dilemma. The WF is cheap and runs on every lookup. It narrows the candidate set. The SF is expensive (reads files from disk) and runs only when the WF finds candidates. The pathset is observed, not declared. The user declares the WF inputs (and even those are mostly plugin defaults). The sandbox supplies the rest.

The cost of a false WF match is one SF computation, then a miss. Fast. A false SF match should not happen at all: every file the sandbox saw the task read is in the SF input by construction, and the node_modules reads it skips are pinned by the lockfile.

Correctness is a property of the mechanism, not of declaration discipline.

The bonus: ABI early-cutoff

The two-phase scheme gives you correctness. The ABI fingerprint gives you speed.

When a rage plugin implements abi_fingerprint(outputs) -> Option<String>, rage records the ABI hash alongside the task’s outputs. For the TypeScript plugin, the ABI fingerprint is blake3 over all .d.ts files the task produced. The .d.ts is the public type surface of a TypeScript package. If the .d.ts didn’t change, nothing downstream needs to care.

Here’s how it plays out:

packages/utils#build
  outputs: dist/index.js, dist/index.d.ts
  abi_fingerprint = blake3(all .d.ts contents)

packages/api#build
  WF includes: ("packages/utils", utils.abi_fingerprint)

A developer changes a comment inside utils/src/foo.ts. The change doesn’t affect the public type surface. What happens:

  1. utils#build runs. Its WF changed because foo.ts changed.
  2. tsc produces the same .d.ts as before. The comment is internal. The types are identical.
  3. abi_fingerprint(outputs) returns the same hash as the previous run.
  4. api#build’s WF is unchanged. The dep_abi_fingerprints field still contains the same hash for utils.
  5. api#build is a cache hit. So is every other package that depends on utils.

Without ABI fingerprints, every change to utils invalidates every dependent. With them, only changes that affect the public surface propagate. In a monorepo with deep dependency chains, this is the difference between rebuilding 3 packages and rebuilding 80.
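Here is a minimal sketch of the cutoff check itself. It assumes task outputs are available as `(path, content)` pairs and uses the standard library's `DefaultHasher` as a stand-in for blake3. The signature is illustrative and does not match the plugin trait, and returning `None` when there are no `.d.ts` files is this sketch's choice.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hash only the `.d.ts` outputs, in sorted path order.
/// Returns None when the task produced no declaration files.
pub fn abi_fingerprint(outputs: &[(&str, &str)]) -> Option<u64> {
    let mut decls: Vec<(&str, &str)> = outputs
        .iter()
        .copied()
        .filter(|o| o.0.ends_with(".d.ts"))
        .collect();
    if decls.is_empty() {
        return None;
    }
    decls.sort();

    let mut h = DefaultHasher::new();
    for (path, content) in decls {
        // Length prefixes keep adjacent fields from blurring together.
        h.write(&(path.len() as u64).to_le_bytes());
        h.write(path.as_bytes());
        h.write(&(content.len() as u64).to_le_bytes());
        h.write(content.as_bytes());
    }
    Some(h.finish())
}
```

A comment-only edit changes `dist/index.js` but leaves `dist/index.d.ts` byte-identical, so the fingerprint is unchanged and the dependent's WF input is unchanged with it. Change an exported type and the fingerprint moves.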

[Figure: ABI Early-Cutoff. Without an ABI fingerprint, a comment change in utils#build rebuilds api#build, web#build, api#test, and web#test: 5 tasks rebuilt. With an ABI fingerprint, utils#build rebuilds, the .d.ts and ABI hash are unchanged, and api#build, web#build, api#test, and web#test are cache hits: 1 task rebuilt, 4 cache hits. The .d.ts file is the ABI surface of a TypeScript package. If it didn't change, downstream WF inputs are unchanged. The cascade stops at the boundary where the public surface is stable.]

This is the BuildXL early-cutoff mechanism. BuildXL has used it for years to avoid cascading rebuilds across thousands of pips in Microsoft’s internal monorepos. Rage ports the same idea to JS monorepos, using .d.ts content as the ABI surface for TypeScript packages.

ABI fingerprinting is plugin-defined and optional. Plugins that can’t cheaply expose an ABI return None. The strong fingerprint still carries correctness alone. You just lose the cutoff optimization. The cascade propagates further, and more tasks re-run than strictly necessary. Correctness is never compromised.

The two-phase scheme separates discrimination from verification.

The weak fingerprint discriminates: “which stored results could be relevant to this task?” It’s fast, it narrows the search space, and it’s wrong in only one direction (it might include candidates that turn out not to match, but it never excludes valid ones).

The strong fingerprint verifies: “given a specific set of files this task read last time, have any of them changed?” It’s exact. Every byte that could affect the output is in the hash.

The pathset is the bridge. It comes from the sandbox, not from human declarations. It makes the strong fingerprint possible without requiring the developer to know everything their tools read. And the ABI fingerprint sits on top, cutting off cascades at the boundary where the public surface is stable.

This is how you build a cache that can’t lie.