# Hiding Work in the Waiting Time

There is a recurring optimization pattern in the Claude Code codebase that I call "hiding work in the waiting time." Once you understand this pattern, you will understand many of the designs in the code that seem "superfluous" at first glance.

> 💡 **Plain English**: It is like the **parallel optimization of getting up in the morning**—when the alarm rings, you do not put on your clothes first and then boil water (doing things one by one, serially). Instead, you turn on the kettle and go brush your teeth (executing in parallel). By the time you are dressed, the water is boiling. Claude Code stuffs every "might as well wait" time window with useful work.

> 🌍 **Industry Context**: "Hiding work in the waiting time" is not a pattern invented by Claude Code—it is the application of a classic computer science technique called **latency hiding**. CPU pipelining (1960s), instruction prefetching (1970s), TCP preconnection (1990s), and browser preload/prefetch (2010s) are all manifestations of the same idea. In the AI coding assistant space, **Cursor** hides model inference latency with tab prediction and background pre-generation; **GitHub Copilot** starts inferring the next completion while the user is still typing; **Aider** takes the opposite approach—pure synchronous serial execution with no speculative execution, trading simplicity for architectural cleanliness. What makes Claude Code unique is not that it "discovered this pattern," but that it **systematically applies latency hiding across three layers** (process startup, streaming tool execution, and speculative reasoning during user thinking time), and designs rollback mechanisms for each. This depth of implementation across all three layers is the most complete among similar tools.

---

## The Three Layers of Waiting Time

### Layer 1: Pre-import I/O (the "Head Start" at Program Startup)

> 💡 "Pre-import I/O" in the heading translates to "sneaking in some read/write operations before the code is loaded."

The following three lines are the first to run when Claude Code starts. You do not need to understand the code syntax; you only need to know that **as soon as the program starts, before it has had time to load other functional modules, it secretly initiates two "early read" operations**:

```javascript
// Top of main.tsx (before any imports)
profileCheckpoint('main_tsx_entry')  // Mark entry timestamp
startMdmRawRead()                    // Execute immediately: spawn a subprocess to read enterprise management policies
startKeychainPrefetch()              // Execute immediately: read auth credentials from macOS Keychain in parallel
```

Node.js (the runtime environment in which Claude Code runs) takes approximately 100–200ms to load all modules, because `import` pulls in the dependent code files one after another like a chain. (The source comment notes "~135ms"; that figure is an internal measurement by the development team under one specific build environment, and the actual time varies with hardware and module count.) During this window the CPU is busy processing module loading, and the program logic has not yet started.

These two calls put this time to use: `startMdmRawRead()` spawns a subprocess to read enterprise management policies (`plutil` on macOS, `reg query` on Windows); `startKeychainPrefetch()` simultaneously initiates two macOS Keychain reads (OAuth token and legacy API key). Otherwise, these two reads would be executed serially later in `applySafeConfigEnvironmentVariables()`, adding about 65ms of blocking time.
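The shape of the pattern is simple: start the slow read immediately, keep the promise, and await it only when the value is actually needed. The following is a minimal sketch under assumed names; the function body, the example Keychain service name, and `getPrefetchedCredentials` are illustrative, not the actual Claude Code internals:

```typescript
// Sketch only: start the slow Keychain read immediately and stash the
// promise; nothing awaits it while module loading is still in progress.
import { exec } from "node:child_process";
import { promisify } from "node:util";

const execAsync = promisify(exec);

let keychainPrefetchPromise: Promise<string | null> | null = null;

export function startKeychainPrefetch(): void {
  // Fire and forget: the subprocess runs while the import chain is still loading.
  keychainPrefetchPromise = execAsync(
    'security find-generic-password -s "example-service" -w' // illustrative service name
  )
    .then(({ stdout }) => stdout.trim())
    .catch(() => null); // a miss just means falling back to the slow path later
}

export async function getPrefetchedCredentials(): Promise<string | null> {
  // Called much later (e.g. while settings are applied); by then the read has
  // usually finished, so this await typically resolves immediately.
  return keychainPrefetchPromise ?? null;
}
```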

In addition, during the initialization phase in `init.ts`, there is an independent API preconnection optimization, `preconnectAnthropicApi()`—after settings are loaded but before the user starts typing, it "says hello" to Anthropic's servers in advance (TCP + TLS handshake—similar to dialing a number and waiting for the other party to pick up before a call, a process that usually takes 100–200ms). Because Bun's fetch shares a global keep-alive connection pool, subsequent real API requests can reuse the already established connection.
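A minimal sketch of the preconnection idea, assuming a runtime whose `fetch` keeps connections in a shared keep-alive pool (as the source notes for Bun); the function name is an assumption, not a copy of `preconnectAnthropicApi`, though the HEAD request matches the description in the code locations below:

```typescript
// Sketch only: one throwaway request completes DNS + TCP + TLS early so the
// first real API call can reuse the already-warm keep-alive connection.
export async function preconnectApi(baseUrl = "https://api.anthropic.com"): Promise<void> {
  try {
    // HEAD keeps the body empty; only the handshake matters here.
    await fetch(baseUrl, { method: "HEAD" });
  } catch {
    // Best effort: a failed preconnection must never block or delay startup.
  }
}
```

Because the warmed connection lives in the runtime's shared pool, the first real request skips the 100–200ms handshake; in proxy, mTLS, or Unix socket environments this warm-up is skipped, as noted below.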

**Cost of Layer 1**: This layer is almost zero risk. In the worst case, the prefetched data is not used (for example, the user exits immediately), and the only waste is a few system calls. Preconnection is automatically skipped in proxy/mTLS/Unix socket environments to avoid warming up the wrong connection pool. The increase in complexity is mainly in code organization—you must ensure these side effects are executed before `import`, which requires eslint rule exemptions and a strict file loading order.

### Layer 2: Tool Execution During Model Output

```
Traditional path:
  Model finishes output → parse tool_use → execute tool → wait for result

StreamingToolExecutor:
  Model emits a complete tool_use block → tool execution starts immediately
  Model continues outputting... tool runs in the background...
  Model stops outputting → tool result is already ready
```

While the model is still outputting "I will read these three files in order to...", the file I/O operations have already begun. For file reads (a few milliseconds) and Bash commands (potentially several seconds), this is a clear latency optimization.

The `StreamingToolExecutor` implementation has several key details. It maintains a state machine for each tool invocation (queued → executing → completed → result returned), and "complete" is defined by the Anthropic API streaming protocol: once the JSON input of a `tool_use` content block has been fully received (the `content_block_stop` event), the call is immediately added to the execution queue. Concurrency-safe tools (such as Read, Glob, and Grep) can run in parallel, while non-concurrency-safe tools (such as Bash) execute exclusively. Results are yielded through an ordered buffer in receive order, ensuring that even if tool B finishes before tool A, the results are returned to the model in A → B order.
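A simplified sketch of the ordered-buffer idea follows. It is not the real `StreamingToolExecutor`: it only serializes non-concurrency-safe calls behind the work queued ahead of them, and for brevity it waits for the stream of calls to end before yielding, whereas the real executor's scheduling rules (and its `siblingAbortController` error handling) are richer.

```typescript
// Simplified sketch: start each tool as soon as its input block is complete,
// but emit results strictly in the order the model issued the calls.
type ToolCall = { id: string; name: string; input: unknown; concurrencySafe: boolean };
type ToolResult = { id: string; output: string };

// Placeholder dispatcher; a real executor routes to Read/Glob/Grep/Bash handlers.
async function runTool(call: ToolCall): Promise<ToolResult> {
  return { id: call.id, output: `ran ${call.name}` };
}

export async function* executeStreaming(
  calls: AsyncIterable<ToolCall> // one item per completed tool_use block
): AsyncGenerator<ToolResult> {
  const pending: Promise<ToolResult>[] = [];

  for await (const call of calls) {
    if (call.concurrencySafe) {
      // Read-only tools start immediately and run in parallel.
      pending.push(runTool(call));
    } else {
      // Non-concurrency-safe tools (e.g. Bash) wait for everything already
      // queued ahead of them before starting.
      const ahead = [...pending];
      pending.push(Promise.allSettled(ahead).then(() => runTool(call)));
    }
  }

  // Ordered buffer: even if a later tool finished first, its result is held
  // until everything ahead of it has been yielded.
  for (const result of pending) {
    yield await result;
  }
}
```

The key invariant is the final loop: execution order may differ from completion order, but the buffer guarantees the model always receives results in the order it asked for them.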

**Cost of Layer 2**: Risk level is medium. When a tool execution fails, `StreamingToolExecutor` terminates other running tool subprocesses in the same batch via `siblingAbortController`, preventing wasted work from continuing. However, if a Bash command produces side effects (such as writing to a file) while executing, and subsequent model output changes the intent, these side effects cannot be automatically rolled back—the system relies on the model to detect and fix such inconsistencies in the next turn.

### Layer 3: Speculative Execution While the User Is Thinking

```
The user is reading the AI's response and thinking about the next step...

Meanwhile, the system:
  → Generates a prompt suggestion: predicting what the user is most likely to type
  → Uses the predicted content as input to call runForkedAgent and execute the entire AI reasoning loop in advance
  → If the user accepts the suggestion (presses Tab): inject the precomputed result, latency ≈ 0
  → If the user types something else: discard the prediction result and overlay files, process normally
```

The user's "thinking time" (potentially several seconds to tens of seconds) is utilized for AI inference. In the best case, the AI's response is already prepared the instant you press Enter.

**Prediction Strategy**: The speculation is not a guess at what natural language the user will say; instead, the `promptSuggestion` module generates a structured "next-step suggestion." Based on the current conversation context (the content of the most recent model response), the system makes a lightweight API call to generate the user's most likely next instruction (for example, "Continue implementing the remaining tests" or "Fix the type error mentioned above"). This suggestion appears in the user's input box; the user can press Tab to accept it or ignore it.
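As a rough sketch of what such a lightweight call could look like, using the public `@anthropic-ai/sdk`; the prompt text, model choice, and function name here are assumptions, not a reproduction of the actual `promptSuggestion` module:

```typescript
// Sketch only: a small, cheap model call that turns the latest assistant
// message into a one-line "likely next instruction" for the input box.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function generatePromptSuggestion(lastAssistantMessage: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-latest", // illustrative: any fast, cheap model works
    max_tokens: 64,
    system:
      "Given the assistant's last message, predict the user's most likely next instruction. Reply with that instruction only.",
    messages: [{ role: "user", content: lastAssistantMessage }],
  });
  const block = response.content[0];
  return block.type === "text" ? block.text.trim() : "";
}
```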

**Isolation Mechanism—the Overlay Filesystem**: The most critical engineering challenge of speculative execution is how to let the pre-executed Agent read and write files without polluting the user's actual workspace. The solution in `speculation.ts` is a copy-on-write overlay (a minimal sketch follows the list below):

- All write operations (Edit, Write, NotebookEdit) are redirected to a temporary directory `~/.claude/tmp/speculation/<pid>/<id>/`
- Read operations prefer the overlay for already modified files; unmodified files are read directly from their original paths
- Non-readonly Bash commands trigger a "boundary stop"—speculative execution halts at this point and does not continue
- If the user accepts the predicted result, files in the overlay are copied to the workspace via `copyOverlayToMain`; if rejected, the entire overlay directory is deleted via `safeRemoveOverlay`
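A minimal sketch of the copy-on-write overlay described above, with hypothetical class and method names; the acceptance path (`copyOverlayToMain`) would additionally copy the shadow files back over their real paths:

```typescript
// Sketch only: writes land in a per-run temp directory, reads prefer the
// overlay copy when one exists, and rejection is just deleting the directory.
import * as fs from "node:fs/promises";
import * as os from "node:os";
import * as path from "node:path";

export class SpeculationOverlay {
  private readonly root: string;

  constructor(id: string) {
    // Mirrors the layout described above: ~/.claude/tmp/speculation/<pid>/<id>/
    this.root = path.join(os.homedir(), ".claude", "tmp", "speculation", String(process.pid), id);
  }

  private overlayPath(realPath: string): string {
    // Map an absolute workspace path to its shadow location inside the overlay.
    return path.join(this.root, path.relative("/", path.resolve(realPath)));
  }

  async write(realPath: string, content: string): Promise<void> {
    const shadow = this.overlayPath(realPath);
    await fs.mkdir(path.dirname(shadow), { recursive: true });
    await fs.writeFile(shadow, content); // the real file is never touched
  }

  async read(realPath: string): Promise<string> {
    try {
      return await fs.readFile(this.overlayPath(realPath), "utf8"); // modified copy wins
    } catch {
      return fs.readFile(realPath, "utf8"); // unmodified files come from the workspace
    }
  }

  async discard(): Promise<void> {
    // Rejection path: deleting the overlay undoes every speculative write.
    await fs.rm(this.root, { recursive: true, force: true });
  }
}
```

Deleting the directory is the entire rollback, which is precisely why side effects that escape the filesystem (network requests, launched processes) cannot be undone the same way.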

**Execution Constraints**: Speculative execution does not run without limits. It sets upper bounds of `MAX_SPECULATION_TURNS = 20` and `MAX_SPECULATION_MESSAGES = 100`. Tool usage is subject to strict permission checks—only file edit operations that have already received auto-approval (`acceptEdits` or `bypassPermissions` mode) can proceed during speculation; operations requiring user confirmation cause speculation to stop at that point and wait for the user's decision.
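A sketch of how these guard rails might compose into a single stop check; the two constant names come from the source, everything else is an assumption:

```typescript
// Sketch only: speculation stops at hard turn / message limits and at any
// tool call that would require the real user's confirmation.
const MAX_SPECULATION_TURNS = 20;
const MAX_SPECULATION_MESSAGES = 100;

type SpecToolCall = { name: string; autoApproved: boolean; isReadonly: boolean };

export function shouldStopSpeculation(
  turns: number,
  messages: number,
  nextTool: SpecToolCall | null
): boolean {
  if (turns >= MAX_SPECULATION_TURNS || messages >= MAX_SPECULATION_MESSAGES) return true;
  if (nextTool === null) return false;
  // A non-readonly Bash command is a "boundary": speculation halts here.
  if (nextTool.name === "Bash" && !nextTool.isReadonly) return true;
  // Anything not already auto-approved must wait for the real user's decision.
  return !nextTool.autoApproved;
}
```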

**Pipeline Optimization**: Speculative execution has an even more aggressive optimization—when a speculative run completes, the system immediately generates the *next* prompt suggestion via `generatePipelinedSuggestion`, forming a continuous speculation chain. If the user accepts the current prediction, the next prediction is already waiting, achieving continuous pre-execution similar to a CPU pipeline.

> 📚 **Course Connection**: Layer 3 speculative execution is almost isomorphic to **branch prediction** in computer architecture courses—predict the next instruction, execute early, and rollback on misprediction (pipeline flush). The difference is that a CPU's rollback cost is a few dozen clock cycles, while Claude Code's rollback cost is wasted API token fees and cleanup of overlay files. In addition, the copy-on-write strategy of the overlay filesystem is the same idea as the COW page management after a process fork in **operating systems** courses—copy only when an actual write occurs, minimizing the overhead of isolation.
>
> For a complete technical analysis of the speculative execution subsystem (including the state machine, cache sharing strategy, and detailed interpretation of telemetry instrumentation), see **Part 3: Complete Analysis of the Speculative Execution Subsystem**.

**Cost of Layer 3**: This is the riskiest of the three layers. Every speculative execution consumes real API tokens—if the prediction is rejected, those tokens are pure waste. Currently, speculative execution is enabled only for Anthropic internal users (`USER_TYPE === 'ant'`), which itself indicates the team remains cautious about its cost-benefit ratio. In addition, while the overlay filesystem provides write isolation, it has no rollback capability for external side effects produced by Bash commands (such as network requests or process launches)—which is exactly why non-readonly Bash commands trigger a speculation stop.

---

## The Abstraction of This Pattern

**Formula**: Find places where "must wait for X to finish before continuing" → during the wait for X, do all work that does not depend on X's result

Three conditions:
1. There is a "waiting period" (module loading, model output, user interaction)
2. The work that can be done during the waiting period is deterministic (you know what to do)
3. Mistakes can be rolled back or isolated (Layer 1 prefetch results can be ignored; Layer 2 tool results are yielded in order and failing tools terminate their batch; Layer 3 file modifications are isolated via overlay and non-readonly operations trigger a boundary stop)

> ⚠️ **Note on Performance Data**: As of this writing, Anthropic has not publicly released performance benchmark data for Claude Code's latency hiding mechanisms (such as speculative execution hit rate, end-to-end latency improvement ratio, token waste rate, etc.). This chapter's analysis is based on the design intent and implementation mechanisms in the source code, not measured performance data. Telemetry events in the source code (such as the `time_saved_ms` and `tools_executed` fields recorded by `tengu_speculation`) indicate the team has conducted systematic internal performance tracking, but this data has not been made public. Readers who wish to replicate similar patterns in their own systems should establish their own benchmarks to verify the benefits.

---

## Similar Designs in Other Fields

It should be noted that the following examples are not "things Claude Code learned from"—on the contrary, **Claude Code is rediscovering and applying these classic patterns that have existed for decades**. Understanding this history will help you recognize and apply the same ideas in your own systems.

**CPU Pipelining and Speculative Execution**:
- When a CPU encounters a branch instruction, it predicts and executes without waiting for the branch condition to be determined
- If the prediction is correct, there is no delay; if wrong, the result is discarded and re-executed

**Database Prefetch**:
- Before query results are actually needed, the next page of data is loaded into the buffer in advance
- When the user pages through the data, it is already in memory, and no delay is felt

**Web Application Resource Preloading**:
- When the user's mouse hovers over a link, the target page begins preloading
- When the user clicks, the page is already cached

**Browser HTML Parsing**:
- When the browser encounters `<script src>`, it immediately requests the script without waiting for the current HTML parsing to finish

---

## When Designing Your Own AI Application

When waiting for a large language model's response (typically several seconds to tens of seconds), what can your application do at the same time?

Some ideas (a minimal sketch follows the list):
- Preload data the user might need (judging direction from the first few tokens of model output)
- Update certain parts of the UI (even though the main response is not yet complete)
- Run background tasks that do not depend on the response (logging, analytics, cache warming)
- Start the next expected API request (if the workflow is highly predictable)
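A minimal, framework-agnostic sketch of the simplest version of this pattern: start independent background work before awaiting the model, so the two overlap instead of running back to back (`callModel` and `warmCache` are placeholders for your own functions):

```typescript
// Sketch only: the waiting period for the model's answer also covers the
// background work, as long as that work does not depend on the answer.
async function callModel(prompt: string): Promise<string> {
  return "…"; // placeholder for your LLM client call (seconds to tens of seconds)
}

async function warmCache(userId: string): Promise<void> {
  // placeholder: prefetch data the user is likely to need for the next step
}

export async function handleTurn(userId: string, prompt: string): Promise<string> {
  const answerPromise = callModel(prompt);  // slow, dominates the turn's latency
  const warmupPromise = warmCache(userId);  // does not depend on the answer

  const answer = await answerPromise;       // the warmup ran "inside" this wait
  await warmupPromise.catch(() => {});      // background failures never fail the turn
  return answer;
}
```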

Every "waiting period" is an opportunity to hide work, but several important constraints must be kept in mind:
- **Prediction accuracy is key**: If the prediction hit rate falls below a certain threshold, the net benefit of hidden work becomes negative (wasted resources > time saved)
- **Parallelism is not free**: Parallel execution increases system complexity and debugging difficulty, and in resource-constrained environments (such as mobile) it may be slower than serial execution
- **There must be a rollback or isolation mechanism**: Speculative execution without rollback capability can lead to unrecoverable side effects

---

## Code Locations

- `src/main.tsx`, lines 1–20: Pre-import phase—before `import` statements, calls `startMdmRawRead()` (spawns MDM policy subprocess) and `startKeychainPrefetch()` (reads Keychain credentials in parallel)
- `src/utils/settings/mdm/rawRead.ts`: MDM policy prefetch implementation—spawns `plutil` (macOS) or `reg query` (Windows) subprocess
- `src/utils/secureStorage/keychainPrefetch.ts`: Keychain prefetch implementation—initiates parallel reads of OAuth token and legacy API key
- `src/utils/apiPreconnect.ts`: API preconnection implementation—during `init.ts` initialization, sends an HTTP HEAD request to complete TCP+TLS handshake (100–200ms)
- `src/services/tools/StreamingToolExecutor.ts`: Streaming tool executor—manages concurrent tool execution, ordered buffering, and error propagation
- `src/services/PromptSuggestion/speculation.ts`: Speculative execution engine—overlay filesystem, `runForkedAgent` invocation, boundary detection, and result injection
- `src/utils/forkedAgent.ts`: Forked Agent infrastructure—creates an isolated sub-Agent context that shares the parent's prompt cache

## Costs and Trade-offs

"Hiding work in the waiting time" is not a free lunch. The three layers have completely different risk levels and cost structures:

| Layer | Risk Level | Cost on Failure | Complexity Cost |
|------|-----------|-----------------|-----------------|
| Layer 1: Pre-import I/O | **Low** | Prefetched data goes unused, wasting a few system calls | Code must execute before `import`, requiring eslint exemptions and strict loading order |
| Layer 2: Streaming Tool Execution | **Medium** | Failing tool execution requires terminating the rest of the batch; commands with side effects (such as file writes) cannot be automatically rolled back | Requires maintaining ordered buffers, concurrency control, `siblingAbortController`, and other concurrent infrastructure |
| Layer 3: Speculative Execution | **High** | API tokens consumed when a prediction is rejected are pure waste; external side effects of non-readonly Bash cannot be rolled back | Overlay filesystem, permission checks, state isolation, pipelined suggestions—highest system complexity |

**Shared Limitation**: Latency hiding depends on correctly predicting "what happens next." Layer 1 predicts "the user will definitely need the API key and enterprise policy"—this is almost always true; Layer 2 predicts "the tool calls output by the model should be executed"—this is true in most cases; Layer 3 predicts "the user will accept the system's generated next-step suggestion"—this depends on suggestion quality and user habits, and the hit rate is far lower than the first two layers. The more uncertain the prediction, the less stable the net benefit of hidden work.

**On Complexity Cost**: Increased parallelism means increased debugging difficulty. When system behavior is abnormal, the state space to consider is much larger than that of a purely serial architecture. This also explains why Aider chose pure synchronous serial execution—architectural simplicity itself has engineering value, especially for tools that lean toward batch-style usage scenarios.
