
The Architecture Mistake I Made in Lage


I built lage. It runs task orchestration across JS monorepos at Microsoft. Hundreds of developers use it daily. Teams depend on it for CI, for local builds, for the “run lage build and it works” experience that makes monorepo development tolerable.

Its architecture has a fundamental flaw. I put it there.

This is the story of how I found the flaw, why patching it was impossible, and what I built to replace it. If you’ve been following this series, this is the post that connects the other five.

Where lage came from

In 2020, the JS monorepo ecosystem had a gap. Bazel could orchestrate everything, but the ceremony was punishing: BUILD files, Starlark, a mental model designed for Google’s infrastructure, not for a team of twelve shipping a React app. BuildXL was powerful but pointed in the other direction: designed for 50-machine build farms, not for a developer on a laptop running npm test.

What JS teams needed was a task runner that understood the package graph. You have 200 packages. Package api depends on utils. Run build across everything in the right order, skip what hasn’t changed, parallelize where possible. No Starlark. No pip declarations. No build farm.

That’s what lage did. You wrote a pipeline declaration:

module.exports = {
  pipeline: {
    build: ["^build"],  // upstream deps first
    test: ["build"],    // own build first
    lint: [],           // no deps, run immediately
  }
};

Lage resolved the package graph from package.json dependencies, built a DAG, scheduled tasks in topological order with parallelism, and cached results by hashing declared inputs. Teams adopted it. At peak, lage orchestrated builds across some of the largest JS monorepos inside Microsoft.
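To make the expansion concrete, here is a minimal sketch of how a pipeline like the one above can be turned into a task graph and grouped into parallel batches. It is not lage's actual scheduler: the two-package workspace, the `package#task` ids, and the helper names `buildTaskGraph` and `scheduleWaves` are assumptions made for illustration.

```javascript
// Hypothetical two-package workspace: `api` depends on `utils`.
const packages = { utils: [], api: ["utils"] };

const pipeline = {
  build: ["^build"], // upstream packages' build first
  test: ["build"],   // same package's build first
  lint: [],          // no dependencies
};

// Expand the pipeline into a task graph keyed by "package#task".
function buildTaskGraph(packages, pipeline) {
  const graph = new Map();
  for (const [pkg, upstream] of Object.entries(packages)) {
    for (const [task, deps] of Object.entries(pipeline)) {
      const edges = [];
      for (const dep of deps) {
        if (dep.startsWith("^")) {
          // "^build": the same task in every direct upstream package.
          for (const up of upstream) edges.push(`${up}#${dep.slice(1)}`);
        } else {
          // "build": another task in the same package.
          edges.push(`${pkg}#${dep}`);
        }
      }
      graph.set(`${pkg}#${task}`, edges);
    }
  }
  return graph;
}

// Group tasks into waves. Every task in a wave has all of its
// dependencies in earlier waves, so a wave can run in parallel.
function scheduleWaves(graph) {
  const done = new Set();
  const waves = [];
  while (done.size < graph.size) {
    const ready = [...graph.keys()]
      .filter((id) => !done.has(id) && graph.get(id).every((d) => done.has(d)))
      .sort();
    if (ready.length === 0) throw new Error("cycle in task graph");
    for (const id of ready) done.add(id);
    waves.push(ready);
  }
  return waves;
}

const waves = scheduleWaves(buildTaskGraph(packages, pipeline));
console.log(waves);
// [ [ 'api#lint', 'utils#build', 'utils#lint' ],
//   [ 'api#build', 'utils#test' ],
//   [ 'api#test' ] ]
```

Both lint tasks land in the first wave because they declare no dependencies, while `api#build` waits for `utils#build`. A real scheduler would also skip packages that lack the script and would start a task as soon as its own dependencies finish, not when the whole wave does.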

I was proud of it. The design solved a real problem for real teams.

But it had a gap, and the gap was load-bearing.

File access was an afterthought

Lage’s cache worked like every other JS monorepo cache: you declared your inputs, lage hashed them, and if the hash matched, the cache returned stored output. If you forgot to declare an input, the cache returned stale results and called them hits. I covered this failure mode in detail in Post 1. The short version: declared-input trust does not scale. In a monorepo maintained by 40 engineers over 3 years, someone will forget a file. The cache will lie silently.
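The stale hit is easy to reproduce in miniature. The sketch below assumes a cache key built only from declared files. The file names, the in-memory `files` objects, and the helper `declaredInputHash` are hypothetical, not lage's actual hashing code.

```javascript
const { createHash } = require("node:crypto");

// Hash only the files the task declared, in a stable order.
function declaredInputHash(files, declaredInputs) {
  const hash = createHash("sha256");
  for (const path of [...declaredInputs].sort()) {
    hash.update(path).update("\0").update(files[path] ?? "").update("\0");
  }
  return hash.digest("hex");
}

// Hypothetical package: the build also reads tsconfig.base.json,
// but nobody added it to the declared inputs.
const declared = ["src/index.ts", "package.json"];
const before = {
  "src/index.ts": "export const x = 1;",
  "package.json": '{"name":"api"}',
  "tsconfig.base.json": '{"compilerOptions":{"target":"ES2019"}}',
};
// Someone changes the compile target. The build output would change.
const after = {
  ...before,
  "tsconfig.base.json": '{"compilerOptions":{"target":"ES2022"}}',
};

const staleHit =
  declaredInputHash(before, declared) === declaredInputHash(after, declared);
console.log(staleHit); // true: the key is unchanged, so the cache reports a hit
```

Nothing in the key reflects the edited file, so the cache has no way to tell that its stored output is out of date.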

I knew BuildXL had the fix. Sandbox every task. Observe what files the process reads and writes. Use those observations to build a second-phase fingerprint that catches what declarations miss. The two-phase fingerprinting scheme makes the cache correct without requiring developers to be perfect.
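A simplified sketch of the two-phase idea follows. It is not BuildXL's or rage's implementation: the class `TwoPhaseCache`, the helper `hashFiles`, the in-memory file maps, and the `"tsc -b"` command string are assumptions for illustration. The weak fingerprint covers the command and the declared inputs. Each stored entry also records the observed reads (the pathset) and a strong fingerprint over those files.

```javascript
const { createHash } = require("node:crypto");

// Hash a set of paths and their contents; absent files hash differently from empty ones.
function hashFiles(files, paths) {
  const hash = createHash("sha256");
  for (const path of [...paths].sort()) {
    const content = path in files ? "1" + files[path] : "0";
    hash.update(path).update("\0").update(content).update("\0");
  }
  return hash.digest("hex");
}

class TwoPhaseCache {
  constructor() {
    this.entries = new Map(); // weak fingerprint -> [{ pathset, strong, output }]
  }

  // Weak fingerprint: command plus declared inputs only.
  weak(command, files, declared) {
    return createHash("sha256")
      .update(command).update("\0").update(hashFiles(files, declared))
      .digest("hex");
  }

  // After a sandboxed run: keep the observed reads and a strong fingerprint over them.
  store(command, files, declared, observedReads, output) {
    const wf = this.weak(command, files, declared);
    const bucket = this.entries.get(wf) ?? [];
    bucket.push({ pathset: observedReads, strong: hashFiles(files, observedReads), output });
    this.entries.set(wf, bucket);
  }

  // Phase 1: the weak fingerprint finds candidates.
  // Phase 2: re-hash each candidate's pathset against the current files.
  lookup(command, files, declared) {
    const candidates = this.entries.get(this.weak(command, files, declared)) ?? [];
    const hit = candidates.find((c) => hashFiles(files, c.pathset) === c.strong);
    return hit ? hit.output : null;
  }
}

// Hypothetical task that reads an undeclared tsconfig.base.json.
const declared = ["src/index.ts"];
const before = { "src/index.ts": "export const x = 1;", "tsconfig.base.json": "ES2019" };
const after = { ...before, "tsconfig.base.json": "ES2022" };

const cache = new TwoPhaseCache();
cache.store("tsc -b", before, declared, [...declared, "tsconfig.base.json"], "dist@v1");

const hitUnchanged = cache.lookup("tsc -b", before, declared);
const hitAfterEdit = cache.lookup("tsc -b", after, declared);
console.log(hitUnchanged, hitAfterEdit); // dist@v1 null
```

The edit to the undeclared file leaves the weak fingerprint unchanged, so the candidate is still found, but the strong fingerprint no longer matches and the lookup correctly misses.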

So lage got a BuildXL integration path. You could run lage on top of BuildXL, and BuildXL’s file access monitoring would feed sandbox data back into the caching layer.

This was a hack. Lage reached into BuildXL’s internal file access mechanism rather than having a clean API between the task runner and the sandbox. The coupling was tight: lage couldn’t run with sandboxing unless BuildXL was present. BuildXL couldn’t expose its sandbox to other runners cleanly. The result worked at Microsoft but was non-portable and impossible to ship as something anyone outside the company could use.

The architecture’s original sin was not that lage lacked sandboxing. It was that file access observation was never a first-class primitive. It was grafted on from an external system, and the cache and sandbox lived in different codebases with different APIs and different assumptions.

[Figure: The Architecture’s Original Sin. lage: a task runner with a cache keyed on declared inputs only, plus BuildXL file access reporting (bolted on, external, non-portable). rage: a task runner with a two-phase cache (WF + pathset + SF) built on a sandbox that observes syscall-level file access (foundation, first-class, composable).]

Watching the ceiling arrive

The ceiling didn’t arrive all at once. Monorepos grew. New teams onboarded. The number of undeclared dependencies grew with them.

Every few months, someone filed a bug: “my build is stale but CI says it’s cached.” Every time, the root cause was the same. Find the missing input declaration, add it to the pipeline config, move on. The fix was always one line. The debugging cost was always two days.

The pattern was unmistakable. The system’s correctness depended on human discipline, and human discipline degrades as the repository grows. I tried partial fixes: better default globs, documentation about what to declare, tooling to detect common omissions. None addressed the structural problem. Lage’s cache was only as correct as its input declarations were complete, and completeness required every developer to understand every file that every build step touched. That is not a reasonable expectation for a tool used by hundreds of people who didn’t write the build configuration.

The right fix was the one BuildXL had known for years: observe instead of declare. Make the sandbox a first-class primitive. Let the system see the truth.

That fix couldn’t be patched into lage. The architecture assumed file access was external. The cache assumed declared inputs were sufficient. Rebuilding the foundation meant rebuilding everything above it.

Why Rust, not Go

TypeScript 7.0 Beta shipped April 21, 2026. The compiler was rewritten in Go: 10x faster overall, 30x faster type checking, 2.9x less memory. Microsoft chose Go, not Rust. I covered why tsgo makes the build harness more important in Post 4.

The language question for the harness was separate. If the primary workload is a Go binary (tsgo), does the harness itself need to be Rust?

Yes. For one specific reason.

The harness’s most correctness-critical component is the syscall sandbox. On macOS: DYLD_INSERT_LIBRARIES interposing. On Linux: eBPF tracepoints via aya. On Windows: DLL injection with inline hooking via retour. All three mechanisms involve unsafe memory manipulation, function pointer replacement, and cross-process injection. Rust’s ownership model and explicit unsafe boundaries make these operations auditable. You can grep for unsafe blocks and know exactly where the danger lives. Go brings a runtime with it: a garbage collector, a scheduler, and goroutine stacks that can be relocated as they grow. Hosting that runtime inside a hook injected into someone else’s process is a liability. Rust doesn’t have that problem because it has no runtime and no garbage collector to bring along.

| Concern | Rust | Go |
| --- | --- | --- |
| Syscall-level sandboxing (eBPF, ptrace, Detours) | Strong safety model | Possible but less idiomatic |
| Parallel task scheduling | Rayon, Tokio | Goroutines (simpler) |
| Single-binary CLI | Yes | Yes |
| Reusable library (no_std capable) | Yes | Runtime required |

Go would have been fine for a CLI tool. Rust was the right choice for a library of composable crates with a sandbox that needs to be correct at the syscall level.

The velocity

The spec was complete before the first line of Rust. Every crate boundary, every trait, every platform-specific feature flag was designed on paper. Then the implementation went fast.

17 crates. ~40,000 lines. Phases 1 through 11 in roughly 48 hours of implementation time. Not because the typing was fast. Because the spec was complete and the architecture was designed, not discovered.

rage/
├── crates/
│   ├── workspace-tools/         package graph, PM detection
│   ├── build-graph/             DAG, topological sort
│   ├── pipeline-config/         rage.json, policy resolution
│   ├── cache/                   two-phase fingerprinting
│   ├── scoping/                 git-diff affected packages
│   ├── scheduler/               wave-parallel execution
│   ├── sandbox/                 file-access monitoring
│   ├── sandbox-macos-dylib/     DYLD interpose cdylib
│   ├── sandbox-windows-detours/ Detours DLL injection
│   ├── plugin/                  EcosystemPlugin trait
│   ├── plugin-typescript/       TS detection, postinstall
│   ├── daemon/                  watcher, reconciler, HTTP/WS
│   └── telemetry/               Chrome DevTools traces
└── cli/                         thin binary, opinionated defaults
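The scoping entry in the tree above ("git-diff affected packages") names a general technique that can be sketched briefly: map changed files to the packages that own them, then follow reverse dependency edges. This is not rage's code. The workspace layout, the file paths, and the helper `affectedPackages` are assumptions for illustration.

```javascript
// Hypothetical workspace: package name -> directory and direct dependencies.
const workspace = {
  utils: { dir: "packages/utils", deps: [] },
  api: { dir: "packages/api", deps: ["utils"] },
  web: { dir: "packages/web", deps: ["api"] },
  docs: { dir: "packages/docs", deps: [] },
};

function affectedPackages(workspace, changedFiles) {
  // Step 1: packages that directly own a changed file.
  const affected = new Set();
  for (const [name, { dir }] of Object.entries(workspace)) {
    if (changedFiles.some((file) => file.startsWith(dir + "/"))) affected.add(name);
  }

  // Step 2: build reverse edges (dependency -> packages that depend on it).
  const dependents = new Map();
  for (const [name, { deps }] of Object.entries(workspace)) {
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep).push(name);
    }
  }

  // Step 3: breadth-first walk so transitive dependents are included.
  const queue = [...affected];
  while (queue.length > 0) {
    const pkg = queue.shift();
    for (const dependent of dependents.get(pkg) ?? []) {
      if (!affected.has(dependent)) {
        affected.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return [...affected].sort();
}

// In practice the list would come from something like `git diff --name-only <base>`.
const scoped = affectedPackages(workspace, ["packages/api/src/server.ts"]);
console.log(scoped); // [ 'api', 'web' ]
```

A change in `api` pulls in `web`, which depends on it, while `utils` and `docs` stay out of scope and their tasks can be skipped.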

Each phase had a plan document, TDD throughout, and a verification gate. Phase 3 (local cache) landed with 73 tests. Phase 9 (two-phase cache) landed with 165. Phase 11 (daemon with HTTP status page) landed with 192. By the end of the postinstall caching work: 350+ tests across the workspace.

The speed came from the spec. When you know what every crate’s public API looks like before you open your editor, you don’t spend time discovering the architecture. You spend time implementing it. Design documents convert uncertainty into typing.

What rage doesn’t have yet

This is not a victory lap. Rage is a working system with real correctness guarantees. It is not yet a production tool that 200 engineers can depend on tomorrow.

What lage still does better:

Documentation. Lage has years of guides, examples, and Stack Overflow answers. Rage has code comments and plan documents. A developer picking it up cold would struggle.

Ecosystem maturity. Lage has been hammered by real monorepos for years. Edge cases have been filed, triaged, and fixed. Rage has been tested against fixture workspaces and one production codebase. The long tail of weird things that real packages do hasn’t been explored.

Community. Lage has users, contributors, and institutional knowledge. Rage has one author and an AI pair programmer.

Remote cache. Rage has local cache only. The CacheProvider trait exists and the Azure Blob and S3 provider specs are written, but no remote cache implementation ships yet.

Linux sandbox. The eBPF sandbox via aya is designed and spec’d. The macOS and Windows sandboxes are implemented and tested. Linux is the gap. The fallback is unsandboxed execution with weak-fingerprint-only caching.

These are real gaps. Rage solves the correctness problem that lage cannot. Lage solves the adoption problem that rage has not yet faced. Both matter.

The thread

This series covered cache correctness, two-phase fingerprinting, syscall sandboxing, TypeScript 7’s implications for build tooling, and install caching, and this post ties them together. Six posts that look unrelated from the outside. They are one idea.

Post 1 showed the failure mode: declared-input trust doesn’t work at scale. Post 2 showed the fix: weak fingerprints find candidates, strong fingerprints verify them, the sandbox-derived pathset bridges the two. Post 3 showed how to build the sandbox that produces the pathset, across three operating systems, without root. Post 4 showed why a 10x faster compiler makes the harness more important, not less. Post 5 showed the unsexy problem nobody talks about: node_modules install caching, the CI bottleneck that persists even when everything else is fast.

The connecting thread: make correctness a property of the mechanism, not of human discipline.

Declared inputs are human discipline. Sandboxed observation is mechanism. Weak fingerprints are human discipline. Strong fingerprints from sandbox data are mechanism. Expecting developers to configure isolation flags is human discipline. Enforcing isolatedModules at the harness level is mechanism.

Every time you move a correctness guarantee from “someone has to remember” to “the system does this automatically,” you buy reliability that compounds over the lifetime of the project. The 200th engineer to join the monorepo gets the same correctness as the first, without reading a wiki page about cache input declarations.

That’s what rage is for. Not a faster lage. A correct one.

[Figure: The Series. Post 1, Your Build Cache Is Lying (the failure mode). Post 2, Two-Phase Fingerprinting (the fix). Post 3, Hooking Syscalls (the data source). Post 4, TS7 and the Harness (why it matters more). Post 5, Install Caching (the unsexy problem). Post 6, Why I Rewrote It (the full story). Caption: Make correctness a property of the mechanism, not of human discipline.]