
50KB Down to 2KB: Why We Stopped Parsing Source Files


A typical TypeScript module in a large codebase is about 50KB of source text. Imports, exports, JSX, type annotations, function bodies, comments. The bundler parses all of it on every build. Then it throws away everything except the dependency edges.

That’s the waste. A bundler’s graph analysis only cares about five things: what does this module export, what does it import, does it have side effects, what calls what, and does it touch any globals. That information fits in about 2KB. The other 48KB is function bodies, JSX markup, and type annotations that the graph never looks at.

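At 100 modules, nobody notices. At 50,000 modules, the difference is 2.5GB versus 100MB. The first competes with the editor, the browser, and the type checker for RAM on a developer laptop. The second stays resident without anyone noticing and, on CPUs with a large last-level cache, sits mostly in cache.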

The struct that replaces parsing

Every source file gets reduced to a ModuleSummary. This is the actual Rust struct from the Cloudpack summarizer:

pub struct ModuleSummary {
    pub exports: Vec<Export>,
    pub imports: Vec<Import>,
    pub side_effects: SideEffectMarker,
    pub call_edges: Vec<CallEdge>,
    pub ambient_refs: Vec<String>,
}

Five fields. Exports: what names does this module make available. Imports: what specifiers does it pull in, and which bindings. Side effects: can the bundler skip this module if nothing imports from it. Call edges: which exported function calls which other exported function, so the tree-shaker can reason about cross-function reachability. Ambient refs: does this module touch window, document, or globalThis.

That’s everything Phase 2 (the graph analysis) needs. It never reads source text. It operates on summaries only.
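The struct above references Export, Import, and CallEdge without showing them. As a rough sketch of what those might contain, here is a hypothetical set of definitions and a hand-built summary for a three-line module. Only ModuleSummary's five fields and the three SideEffectMarker states come from the summarizer; the inner field names (name, specifier, bindings, caller, callee) and example_summary are assumptions made for illustration.

```rust
// Hypothetical supporting types. Only ModuleSummary's five fields and the
// SideEffectMarker variants are taken from the post; the rest is assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Export {
    pub name: String, // exported binding name
}

#[derive(Debug, Clone, PartialEq)]
pub struct Import {
    pub specifier: String,     // e.g. "./locale"
    pub bindings: Vec<String>, // names pulled in from that specifier
}

#[derive(Debug, Clone, PartialEq)]
pub struct CallEdge {
    pub caller: String, // an exported function
    pub callee: String, // another exported function it calls
}

#[derive(Debug, Clone, PartialEq)]
pub enum SideEffectMarker {
    None,
    Possible { reason: String },
    Definite,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ModuleSummary {
    pub exports: Vec<Export>,
    pub imports: Vec<Import>,
    pub side_effects: SideEffectMarker,
    pub call_edges: Vec<CallEdge>,
    pub ambient_refs: Vec<String>,
}

/// The summary a summarizer might produce for this module:
///
///   import { getLocale } from "./locale";
///   export function formatDate(d) { return formatWith(d, getLocale()); }
///   export function formatWith(d, locale) { /* ... */ }
pub fn example_summary() -> ModuleSummary {
    ModuleSummary {
        exports: vec![
            Export { name: "formatDate".into() },
            Export { name: "formatWith".into() },
        ],
        imports: vec![Import {
            specifier: "./locale".into(),
            bindings: vec!["getLocale".into()],
        }],
        // Only imports and function declarations at the top level.
        side_effects: SideEffectMarker::None,
        call_edges: vec![CallEdge {
            caller: "formatDate".into(),
            callee: "formatWith".into(),
        }],
        // No window, document, or globalThis access.
        ambient_refs: vec![],
    }
}
```

The function bodies never appear in the result. Everything the graph needs about this module is the two export names, one import, and one call edge.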

[Figure: Source File vs Module Summary. A ~50KB source file holds import/export declarations, side-effect statements, call edges between exports, and ambient refs (window, document), plus function bodies, JSX markup, type annotations, and comments, which are discarded (96% of bytes). The ~2KB Module Summary, 25x smaller, holds exports, imports, side_effects, call_edges, and ambient_refs. At 50,000 modules: sources are 50,000 x 50KB = 2.5GB, summaries are 50,000 x 2KB = 100MB. 100MB fits in L3 cache.]

How the summarizer sees side effects

The hardest field to compute is side_effects. A module with side effects can’t be tree-shaken even if nothing imports from it, because removing it might change program behavior. The SideEffectMarker has three states:

pub enum SideEffectMarker {
    None,                         // safe to eliminate
    Possible { reason: String },  // heuristic: maybe
    Definite,                     // must keep
}

The analyzer walks every top-level statement in the SWC AST. Import and export declarations are never side effects. Function, class, and type declarations are pure. Variable declarations are pure only when every initializer is a literal, an arrow function, or an identifier.

Anything else gets flagged. A top-level expression statement like console.log('hello') is Possible. An assignment like window.APP_VERSION = '2.0.0' is Definite, because the module writes to a known ambient global.

The globals list is short and explicit:

const AMBIENT_GLOBALS: &[&str] = &[
    "window", "document", "globalThis",
    "global", "self", "navigator",
];

No LLM, no annotations, no cross-module inference. The only heuristic is the syntactic one behind Possible. Walk the AST, classify the top-level, report. The entire side-effect analyzer is 86 lines of Rust. The conservative default is correct: if you’re uncertain, mark it Possible and let Phase 2 keep the module.
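The classification rules can be sketched without SWC by standing in a toy statement type for the real AST. This is not the 86-line analyzer. TopLevel, Init, classify, and the reason strings are all hypothetical, and two behaviors are assumptions of the sketch: Definite wins over Possible when both appear, and an assignment to a non-ambient target is treated as Possible.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SideEffectMarker {
    None,
    Possible { reason: String },
    Definite,
}

/// Toy stand-in for a variable initializer in the SWC AST.
pub enum Init {
    Literal,
    ArrowFn,
    Ident,
    Call, // anything that might execute code
}

/// Toy stand-in for a top-level statement.
pub enum TopLevel {
    ImportOrExport,
    FnClassOrTypeDecl,
    VarDecl { inits: Vec<Init> },
    /// Assignment whose target is rooted at `root`: `window.X = 1` has root "window".
    Assign { root: String },
    Expr { text: String },
}

const AMBIENT_GLOBALS: &[&str] = &[
    "window", "document", "globalThis",
    "global", "self", "navigator",
];

/// Record the first reason seen; later ones are dropped.
fn note(slot: &mut Option<String>, reason: String) {
    if slot.is_none() {
        *slot = Some(reason);
    }
}

pub fn classify(stmts: &[TopLevel]) -> SideEffectMarker {
    let mut possible: Option<String> = None;
    for stmt in stmts {
        match stmt {
            // Declarations are never side effects.
            TopLevel::ImportOrExport | TopLevel::FnClassOrTypeDecl => {}
            // Pure only when every initializer is a literal, arrow function, or identifier.
            TopLevel::VarDecl { inits } => {
                let pure = inits
                    .iter()
                    .all(|i| matches!(i, Init::Literal | Init::ArrowFn | Init::Ident));
                if !pure {
                    note(&mut possible, "non-trivial initializer".to_string());
                }
            }
            TopLevel::Assign { root } => {
                // A write to a known ambient global must be kept (assumed to short-circuit).
                if AMBIENT_GLOBALS.contains(&root.as_str()) {
                    return SideEffectMarker::Definite;
                }
                note(&mut possible, format!("top-level assignment to {}", root));
            }
            TopLevel::Expr { text } => {
                note(&mut possible, format!("top-level expression: {}", text));
            }
        }
    }
    match possible {
        Some(reason) => SideEffectMarker::Possible { reason },
        None => SideEffectMarker::None,
    }
}
```

The shape matters more than the details: one pass over the top level, no descent into function bodies, and every unrecognized construct falls toward keeping the module.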

The two-tier cache

A summary only needs to be computed once per unique source file. The cache key is SHA-256(source):

let source = fs::read_to_string(path)?;
let hash = ContentHash::from_source(&source);

Same content, same hash, same summary. Rename the file, move it to a different directory, copy it into another package. The summary is the same because the source bytes are the same. This is the content-addressed property: identity follows content, not path.
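To make the content-addressed property concrete, here is a dependency-free sketch. The real key is SHA-256(source). This version substitutes the standard library's DefaultHasher purely so the example compiles without external crates, and that substitute would not be acceptable for a shared cache. SourceFile and cache_key are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for the post's ContentHash. ASSUMPTION: a 64-bit DefaultHasher
/// digest replaces SHA-256 here. It is not collision-resistant and not
/// guaranteed stable across Rust releases, so it illustrates the idea only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ContentHash(u64);

impl ContentHash {
    pub fn from_source(source: &str) -> Self {
        let mut hasher = DefaultHasher::new();
        hasher.write(source.as_bytes());
        ContentHash(hasher.finish())
    }
}

/// Hypothetical pairing of a path with its contents.
pub struct SourceFile<'a> {
    pub path: &'a str,
    pub source: &'a str,
}

impl SourceFile<'_> {
    /// The key depends on the bytes only; `path` is deliberately ignored.
    pub fn cache_key(&self) -> ContentHash {
        ContentHash::from_source(self.source)
    }
}
```

Two SourceFile values with different paths and identical source produce the same key, so a moved or copied file is a cache hit. Changing a single character produces a different key and forces a re-parse.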

The cache has two tiers:

File-level (the default). Every source file under active development is cached by SHA-256(source). Change one character, the hash changes, the summarizer re-parses. Change nothing, the cache returns the stored summary without touching SWC.

Package-level (the fast path). For node_modules, the cache key is SHA-256(package-name + version + resolved-dep-tree). If the lockfile says react@19.1.0 and nothing in React’s transitive dependency tree changed, every module in that package gets its summary from cache without reading a single source file from disk. At Teams scale, the majority of 50,000 modules are stable vendor packages. This tier means a new developer’s first build is fast, because CI already seeded the remote cache.
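The two tiers might be laid out roughly as follows. Everything in this sketch is an assumption: the names (TwoTierCache, package_key, file_key), the in-memory HashMap storage, the per-package map keyed by module path, and the use of DefaultHasher in place of SHA-256. It is meant to show only that the vendor path can answer from lockfile data without reading a source file.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

type Key = u64; // stand-in for a SHA-256 digest
type Summary = String; // stand-in for a serialized ModuleSummary

/// Hash a list of parts with a separator byte, so ("ab", "c") and
/// ("a", "bc") do not produce the same key.
fn digest(parts: &[&str]) -> Key {
    let mut h = DefaultHasher::new();
    for p in parts {
        h.write(p.as_bytes());
        h.write_u8(0);
    }
    h.finish()
}

/// File tier: keyed by the source text itself.
pub fn file_key(source: &str) -> Key {
    digest(&[source])
}

/// Package tier: keyed by name, version, and the resolved dependency tree,
/// all of which can come from the lockfile.
pub fn package_key(name: &str, version: &str, resolved_deps: &[&str]) -> Key {
    let mut parts = vec![name, version];
    parts.extend_from_slice(resolved_deps);
    digest(&parts)
}

#[derive(Default)]
pub struct TwoTierCache {
    files: HashMap<Key, Summary>,
    // package key -> (module path inside the package -> summary)
    packages: HashMap<Key, HashMap<String, Summary>>,
}

impl TwoTierCache {
    /// Vendor fast path: no source read, no hashing of file contents.
    pub fn get_vendor(&self, pkg: Key, module_path: &str) -> Option<&Summary> {
        self.packages.get(&pkg)?.get(module_path)
    }

    pub fn put_vendor(&mut self, pkg: Key, module_path: &str, summary: Summary) {
        self.packages
            .entry(pkg)
            .or_default()
            .insert(module_path.to_string(), summary);
    }

    /// First-party path: the caller has already read and hashed the source.
    pub fn get_file(&self, key: Key) -> Option<&Summary> {
        self.files.get(&key)
    }

    pub fn put_file(&mut self, key: Key, summary: Summary) {
        self.files.insert(key, summary);
    }
}
```

A change anywhere in the resolved dependency list changes the package key, so a transitive bump invalidates the whole package at once instead of file by file.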

The parallel loop

Summarization is embarrassingly parallel. Every file is independent. No file’s summary depends on any other file’s summary. This is the actual loop:

paths
    .par_iter()
    .map(|path| {
        let source = fs::read_to_string(path)?;
        let hash = ContentHash::from_source(&source);

        // Cache hit: return stored summary, skip SWC entirely
        if let Some(cached) = cache.get(&hash)? {
            return Ok(BundleGraphNode {
                id: hash,
                path: path.to_string_lossy().into_owned(),
                summary: cached,
                alive: false,
                chunk_id: None,
                source: None,
            });
        }

        // Cache miss: parse with SWC, store summary
        let summarizer = ModuleSummarizer::new();
        let node = summarizer.summarize(path)?;
        cache.put(&node.id, &node.summary)?;
        Ok(node)
    })
    .collect()

par_iter() is rayon. It distributes work across a thread pool with work-stealing. No locks in the loop, and no shared mutable state beyond the cache and a few counters. Each thread reads a file, hashes it, checks the cache, and either returns a stored summary or runs the SWC parse. The AtomicUsize counters for cache-hit tracking use Ordering::Relaxed because they’re statistics, not synchronization.

On a warm rebuild where one file changed out of 10,000, there are 9,999 cache hits and 1 SWC parse. Each cache hit reads the source (for hashing), does a SHA-256, and looks up a 2KB blob from disk. That’s it. No AST allocation, no visitor traversal, no memory pressure.
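The hit/miss accounting with Relaxed counters can be shown with the standard library alone. This sketch swaps rayon's work-stealing for fixed chunks on std::thread::scope, so it demonstrates the counter pattern and the lock-free structure, not rayon's scheduling. Stats, summarize_all, and expensive_parse are hypothetical, and the cache is an in-memory map keyed by source text where the real one is keyed by hash.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

pub struct Stats {
    pub hits: AtomicUsize,
    pub misses: AtomicUsize,
}

/// Stand-in for the SWC parse on a cache miss.
fn expensive_parse(src: &str) -> String {
    format!("summary({} bytes)", src.len())
}

/// Summarize every source on up to `workers` threads.
/// The cache is only read here, and each worker builds its own Vec,
/// so the counters are the only state the threads write in common.
pub fn summarize_all(
    sources: &[String],
    cache: &HashMap<String, String>,
    workers: usize,
    stats: &Stats,
) -> Vec<String> {
    let workers = workers.max(1);
    let chunk = ((sources.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = sources
            .chunks(chunk)
            .map(|part| {
                scope.spawn(move || {
                    part.iter()
                        .map(|src| match cache.get(src) {
                            Some(hit) => {
                                // Relaxed: a statistic, not a synchronization point.
                                stats.hits.fetch_add(1, Ordering::Relaxed);
                                hit.clone()
                            }
                            None => {
                                stats.misses.fetch_add(1, Ordering::Relaxed);
                                expensive_parse(src)
                            }
                        })
                        .collect::<Vec<String>>()
                })
            })
            .collect();

        // Joining in spawn order keeps results in input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Relaxed is enough because nothing reads the counters to make a decision while workers are running. The totals are read after the scope has joined every thread, and the join is what orders those reads after the increments.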

[Figure: The Parallel Summarizer. paths.par_iter() fans out with work-stealing to Thread 0, Thread 1, ... Thread N. Each thread runs read_to_string(path), SHA-256(source), then cache.get(&hash)?. A HIT returns the cached summary; a MISS runs the SWC parse. Warm rebuild (1 file changed, 10k modules): 9,999 hits, 1 miss. Zero shared mutable state. Zero locks. Zero coordination.]

The numbers

Measured on Apple Silicon (M-series), analysis pipeline only (SWC parse + cache + graph analysis, excluding the transform step):

Modules    Cold (ms)    Warm (ms)    Speedup
100        19           3            6.3x
500        73           27           2.7x
1,000      141          58           2.4x
5,000      694          297          2.3x
10,000     1,363        604          2.3x

The warm build at 10,000 modules is 604ms. That’s the time to hash 10,000 files, check 10,000 cache entries, SWC-parse the 1 file that changed, and rebuild the dependency graph.

The speedup looks like it converges at 2.3x, but that’s misleading. The warm build is dominated by I/O: reading 10,000 files from disk to compute their hashes. The actual cache lookup and the SWC parse for the single changed file are negligible. The bottleneck is read_to_string across 10,000 paths. With the package-level cache tier for node_modules, even those reads go away for vendor code, and the practical speedup on a repo where 40,000 of 50,000 modules are in node_modules is much higher.

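The large-scale figure is a projection, not a measurement. Extrapolating the cold numbers linearly, 50,000 modules take about 6.8 seconds (5 x 1,363ms). If 40,000 of those are vendor modules served by the package-level tier, the warm build only has to hash the 10,000 first-party files, which is the 604ms case measured above. That’s an 11x speedup, and the warm build’s cost is almost entirely reading files that didn’t change.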
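The 6.8 second and 11x figures can be reproduced from the table under two assumptions that belong to this sketch, not to any measurement: cold time keeps scaling linearly past 10,000 modules, and the warm cost at 50,000 modules stays at the measured 604ms because only about 10,000 first-party files still need hashing. MEASURED, speedup, and projected_cold_ms are hypothetical names.

```rust
/// (modules, cold_ms, warm_ms) as reported in the benchmark table.
pub const MEASURED: [(u32, f64, f64); 5] = [
    (100, 19.0, 3.0),
    (500, 73.0, 27.0),
    (1_000, 141.0, 58.0),
    (5_000, 694.0, 297.0),
    (10_000, 1_363.0, 604.0),
];

pub fn speedup(cold_ms: f64, warm_ms: f64) -> f64 {
    cold_ms / warm_ms
}

/// ASSUMPTION: cold time scales linearly from the largest measured point.
pub fn projected_cold_ms(modules: u32) -> f64 {
    let (n, cold_ms, _) = MEASURED[MEASURED.len() - 1];
    cold_ms * f64::from(modules) / f64::from(n)
}

/// ASSUMPTION: the warm cost at 50,000 modules equals the measured
/// 10,000-module warm time, because vendor reads are skipped entirely.
pub fn projected_speedup_at_50k() -> f64 {
    let (_, _, warm_ms) = MEASURED[MEASURED.len() - 1];
    speedup(projected_cold_ms(50_000), warm_ms)
}
```

With these inputs the projected cold time is 6,815ms and the ratio comes out a little above 11. If the vendor share is smaller than 80%, the warm side grows and the ratio drops toward the measured 2.3x.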

Why this matters for everything downstream

The compression ratio is not just a performance trick. It changes what’s architecturally possible.

When your entire module graph fits in 100MB of summaries, Phase 2 (graph analysis, tree-shaking, chunk assignment) becomes a single-threaded scan over memory. Even at a conservative 2GB/s of effective scan throughput, 100MB is about 50ms. The serial step that bundlers usually treat as the bottleneck becomes a rounding error.
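As a sketch of what a scan over summaries can look like, here is a breadth-first liveness pass that never touches source text. It is a simplified guess at one piece of Phase 2, not the real implementation. The trimmed Summary and Import types, live_modules, and the rule that a bare import keeps its target only when the target may have side effects are all assumptions.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Trimmed-down import: a resolved module id plus the bindings used from it.
pub struct Import {
    pub target: String,
    pub bindings: Vec<String>,
}

/// Trimmed-down summary carrying only what liveness needs.
pub struct Summary {
    pub imports: Vec<Import>,
    /// True when the side-effect marker is None.
    pub side_effect_free: bool,
}

/// Return the ids of modules that must be kept, starting from `entry`.
/// ASSUMED RULE: an import keeps its target if it uses at least one
/// binding, or if the target may have side effects.
pub fn live_modules(graph: &HashMap<String, Summary>, entry: &str) -> HashSet<String> {
    let mut alive = HashSet::new();
    let mut queue = VecDeque::new();

    if graph.contains_key(entry) {
        alive.insert(entry.to_string());
        queue.push_back(entry.to_string());
    }

    while let Some(id) = queue.pop_front() {
        for imp in &graph[&id].imports {
            if let Some(target) = graph.get(&imp.target) {
                let needed = !imp.bindings.is_empty() || !target.side_effect_free;
                // insert() returns false for ids already marked, so each
                // module is queued at most once.
                if needed && alive.insert(imp.target.clone()) {
                    queue.push_back(imp.target.clone());
                }
            }
        }
    }
    alive
}
```

Each module is visited once and each import edge is checked once, so the pass is linear in the size of the summaries. That linearity is why the total summary size, and not the source size, bounds the cost.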

When summaries are content-addressed, they’re safe to share across machines. CI summarizes once, developers pull from a remote cache, and a new developer’s first build skips Phase 1 for everything CI already processed. The summary is the unit of caching, and it survives renames, moves, and copies because it’s keyed to content, not path.

When the summarizer runs in parallel with no coordination between files, it scales with cores. Phase 1 has no serial section for Amdahl’s law to bite on. The limits that remain are disk reads, as the warm numbers show, and the short serial scan in Phase 2.

The 50KB-to-2KB compression makes all of this fall out naturally. Shrink the representation, and the system that operates on it gets faster, cacheable, and parallelizable without any additional cleverness.