# The Heart: `queryLoop` and Streaming Engine

The `queryLoop` in `query.ts` is the core engine of Claude Code—a `while(true)` loop that repeatedly executes "call API → process response → execute tools → call again." This chapter dives into the internal structure of that loop, including streaming token processing, tool call scheduling, and termination condition evaluation.

> **🌍 Industry Context**: The `while(true)` Agent Loop is the standard pattern across all LLM agent frameworks—LangChain's AgentExecutor, AutoGPT's main loop, and Cursor's Agent Mode all use similar structures. This isn't a pattern invented by Claude Code; it's the foundational paradigm of AI agents. Claude Code's differentiation lies not in the loop itself, but in three engineering decisions *inside* the loop: five-tier progressive context compaction, streaming parallel tool execution (`StreamingToolExecutor`), and a fine-grained concurrency safety classification mechanism. This chapter focuses on the design logic and trade-offs of these three decisions.

---

## Prologue: Systole and Diastole

A human heart beats 60–100 times per minute. Each heartbeat has two phases: **systole** (pumping blood out) and **diastole** (letting blood flow back in). Claude Code's `queryLoop` does exactly the same thing:

- **Systole**: Pump messages to the Anthropic API and wait for the AI's response
- **Diastole**: Process tool calls in the response, execute the tools, and "reflow" the results back into the message history

`while(true)` — each iteration is one heartbeat. When the AI stops calling tools (the "final answer"), the heart stops beating, and the conversation turn ends.

> **🔑 OS Analogy:** The `queryLoop` is like a smartphone's **home-screen loop**—you open an app → the app processes your action → returns a result → you tap the next one. `queryLoop` repeatedly executes "call AI → process reply → execute tools → call AI again," matching the exact rhythm of how you interact with your phone.
>
> 💡 **Plain English**: The `queryLoop` is like a **package-sorting assembly line**—receive packages (incoming messages) → sort them (AI decides what to do) → load the truck (call tools) → deliver (get results) → confirm receipt (return to AI for further decisions) → wait for the next package. The line keeps running until every package is delivered (the AI gives a final answer with no more tool calls).

---

## 1. The Skeleton of the Loop

`queryLoop()` in `query.ts` is an **AsyncGenerator** function—it doesn't return a result all at once, but streams events continuously like a faucet. The caller (QueryEngine) consumes these events via `for await`.

Each loop iteration (one "heartbeat") consists of the following steps:

```
┌─── One Heartbeat ─────────────────────────────────────┐
│                                                        │
│  1. Pre-call context compaction (5+1 tier mechanism)   │
│     a. applyToolResultBudget()  — trim tool result size│
│     b. snipCompactIfNeeded()    — snip unimportant     │
│                                   intermediate segments│
│     c. microcompact()           — eliminate redundancy │
│     d. applyCollapsesIfNeeded() — fold old dialogue    │
│     e. autocompact()            — auto full-text       │
│                                   compression if over  │
│                                   threshold            │
│                                                        │
│  2. Token budget check                                 │
│     Exceeds hard limit? → terminate immediately,       │
│     return blocking_limit                              │
│                                                        │
│  3. Call API (systole)                                 │
│     callModel() → Anthropic API, streaming request     │
│                                                        │
│  4. Streaming processing                               │
│     for await (message of stream):                     │
│       ├── text_delta → display to user in real time    │
│       ├── tool_use  → enqueue for execution            │
│       │              → StreamingToolExecutor starts    │
│       │                 immediately                     │
│       └── error     → determine if recoverable         │
│                                                        │
│  5. Post-sampling hooks                                │
│     executePostSamplingHooks()                         │
│                                                        │
│  6. Tool execution (diastole)                          │
│     runTools() → execute all tools in parallel         │
│       → each tool passes canUseTool() permission check │
│       → tool results → new UserMessage                 │
│                                                        │
│  7. Stop hooks                                         │
│     handleStopHooks()                                  │
│                                                        │
│  8. Decision: any more tool calls?                     │
│     ├── Yes → continue (next heartbeat)                │
│     └── No  → return (heart stops, turn ends)          │
│                                                        │
└────────────────────────────────────────────────────────┘
```

### Exit Conditions

The heart doesn't beat forever. Nine exit reasons can terminate the loop:

| Exit Reason | Meaning | Analogy |
|-------------|---------|---------|
| `stop` | AI finished normally, no more tool calls | Normal end of heartbeat |
| `max_turns` | Maximum turn limit reached | Heart rate too high, forced rest |
| `aborted_streaming` | User interrupted with Ctrl+C | Manual stop during surgery |
| `blocking_limit` | Context window full | Heart overloaded |
| `model_error` | API call failed | Arrhythmia |
| `image_error` | Image processing failed | Foreign object blockage |
| `token_budget_stop` | TOKEN_BUDGET reached 90% or diminishing returns | Marathon runner has had enough |
| `stop_hook_prevented` | Stop hook blocked continuation | Referee called the match |
| `max_output_tokens_exhausted` | max_output_tokens recovery failed 3 times | Resuscitation failed |

### TOKEN_BUDGET: The Continuation Mechanism Behind +500k

When a user appends `+500k` (or another number) to the end of a message, Claude Code doesn't simply "run a bit longer"—it enables a complete **client-side continuation strategy**.

**How it works**:

1. **Parse budget**: User input is parsed by `parseTokenBudget()` using a regex to extract the token count and create a `BudgetTracker`
2. **End-of-turn evaluation**: When the model normally `stop`s (no more tool calls), `checkTokenBudget()` steps in—this happens at the **final step** of post-hoc control
3. **Inject meta message**: If token usage is below 90% (`COMPLETION_THRESHOLD = 0.9`) and diminishing returns haven't appeared, the system injects an `isMeta: true` user message with the content: "Stopped at X% of token target (Y / Z). Keep working — do not summarize." (source: `utils/tokenBudget.ts:72`)
4. **Model thinks the user is urging it on**: This meta message is invisible to the UI but visible to the model—it thinks the user is prompting it to continue, so it begins the next round of work
5. **Diminishing returns early stop**: When `continuationCount >= 3` and the delta token increase for two consecutive turns is below 500 (`DIMINISHING_THRESHOLD`), the system judges that the model is "treading water" and forces a stop, even if budget remains

> 💡 **Plain English**: +500k is like a marathon aid station—after each leg (one turn), the system checks if you still have energy (token budget) and hands you water to keep running (injects a meta message). If it sees you running in place (diminishing returns), it cuts you off even if the station still has water—forcing you to stop to avoid waste.
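
To make the decision logic concrete, here is a minimal TypeScript sketch of the end-of-turn budget check, using the constants named above. The `BudgetTracker` shape and the function signature are illustrative assumptions; the real logic lives in `utils/tokenBudget.ts`.

```typescript
const COMPLETION_THRESHOLD = 0.9   // stop injecting once 90% of the budget is used
const DIMINISHING_THRESHOLD = 500  // a turn adding < 500 tokens counts as "treading water"

// Illustrative tracker shape — not the real type from the source.
interface BudgetTracker {
  targetTokens: number        // e.g. 500_000 parsed from "+500k"
  usedTokens: number
  continuationCount: number
  lastTurnDeltas: number[]    // token deltas of recent turns
}

type ContinuationDecision =
  | { kind: 'stop' }
  | { kind: 'continue'; metaMessage: string }

function checkTokenBudget(t: BudgetTracker): ContinuationDecision {
  const ratio = t.usedTokens / t.targetTokens

  // Budget essentially spent → let the model's normal stop stand.
  if (ratio >= COMPLETION_THRESHOLD) return { kind: 'stop' }

  // Diminishing returns: after 3 continuations, two consecutive
  // near-empty turns mean the model is treading water — force a stop.
  const [a, b] = t.lastTurnDeltas.slice(-2)
  if (t.continuationCount >= 3 && a < DIMINISHING_THRESHOLD && b < DIMINISHING_THRESHOLD) {
    return { kind: 'stop' }
  }

  // Otherwise inject an isMeta user message to urge the model onward.
  const pct = Math.round(ratio * 100)
  return {
    kind: 'continue',
    metaMessage:
      `Stopped at ${pct}% of token target (${t.usedTokens} / ${t.targetTokens}). ` +
      `Keep working — do not summarize.`,
  }
}
```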

**⚠️ taskBudget and TOKEN_BUDGET are different concepts** (source `query.ts:193-197` comments make this explicit):
- `taskBudget` is an API-level **server-side** budget (`output_config.task_budget`), enforced by Anthropic's servers
- `TOKEN_BUDGET` is a client-side **continuation policy**, executed locally by Claude Code. Both can coexist

### The Precise Three-Stage Sequence of Post-Hoc Control

End-of-heartbeat processing isn't a single judgment call; it's a **strictly ordered three-stage flow** (source `query.ts`, lines 1119–1355):

1. **Error recovery** (lines 1119–1256): Check for prompt-too-long (→ trigger reactiveCompact retry) and max-output-tokens (→ first upgrade to 64K and retry; if that fails, inject meta message "Output token limit hit. Resume directly — no apology, no recap...", up to 3 recoveries)
2. **Stop Hooks** (lines 1258–1306): Run `handleStopHooks()`, where business logic decides whether to block continuation
3. **TOKEN_BUDGET evaluation** (lines 1308–1355): Token budget is checked *last*—only if the first two stages didn't terminate the loop does budget judgment occur

**Why order matters**: First recover failed calls (error recovery), then check whether business rules block continuation (Stop Hooks), and finally decide whether it's worth continuing (TOKEN_BUDGET). If budget were checked first, a model truncation due to max-output-tokens might be misjudged as "model completed."
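
A sketch of this ordering as code makes the constraint explicit. Everything here—`TurnState`, `recoverFromApiError`, `runStopHooks`—is an illustrative stand-in for logic that actually spans `query.ts:1119-1355`:

```typescript
type Verdict = 'terminate' | 'continue' | 'done'

interface TurnState { budgetDecision: 'continue' | 'stop' /* ... */ }

declare function recoverFromApiError(t: TurnState): Promise<'ok' | 'retrying' | 'unrecoverable'>
declare function runStopHooks(t: TurnState): Promise<'pass' | 'blocked'>

async function postHocControl(turn: TurnState): Promise<Verdict> {
  // Stage 1 — error recovery. Runs first so a max-output-tokens truncation
  // is never misread as "model finished", and so prompt-too-long never
  // reaches the stop hooks (the death-spiral guard described below).
  const recovery = await recoverFromApiError(turn)
  if (recovery === 'unrecoverable') return 'terminate'
  if (recovery === 'retrying') return 'continue'

  // Stage 2 — stop hooks: business rules get their veto before any
  // budget logic can override them.
  if ((await runStopHooks(turn)) === 'blocked') return 'continue'

  // Stage 3 — TOKEN_BUDGET: only a genuinely clean stop reaches here.
  return turn.budgetDecision === 'continue' ? 'continue' : 'done'
}
```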

**Deeper motivation: preventing the death spiral infinite loop attack.** The source comment at `query.ts:1119-1183` explicitly states: "No recovery — surface the withheld error and exit. Do NOT fall through to stop hooks: ... Running stop hooks on prompt-too-long creates a death spiral: error → hook blocking → retry → error → ..."

> 💡 **What is a death spiral?** It's a specific class of self-harm attack/bug: an error triggers a hook that injects more context → context becomes fuller → next call is again prompt-too-long → triggers the hook again → injects more... looping until the system crashes. The source author specifically guarded against the possibility that a malicious or misconfigured hook could drag the system into an infinite loop of repeated compaction.

This defensive motivation reveals a deep design principle: **the three stages of post-hoc control aren't arbitrarily arranged; they're a systematic defense against specific attack/failure modes.** Error recovery must execute before Stop Hooks (to prevent hooks from being triggered infinitely), and Stop Hooks must execute before TOKEN_BUDGET (to prevent business blocking from being bypassed by budget continuation)—each stage's position has security significance.

---

## 2. Context Compaction: "Blood Purification" Before the Heart Beats

Before each heartbeat begins, the system compresses the message history. Just as blood must pass through the lungs for purification before the heart pumps it out.

Why compaction? Because the context window is finite—200,000 tokens on standard models (1M with the context-1M beta). The longer the conversation and the more tool calls, the more tokens accumulate; without compaction it fills up quickly.

The five-tier pre-processing compaction mechanism executes from light to heavy (with a sixth line of defense, reactiveCompact, triggered on API 413 errors; see end of this section):

> 📌 **On "five tiers" vs "six mechanisms" vs "three tiers"**: This chapter uses the narrative of "**five in-call tiers + the sixth defense line reactiveCompact**" because it focuses on the compaction logic inside the main `queryLoop`, whereas reactiveCompact is a posterior fallback (triggered only when the API returns 413, not on the main loop path). Other chapters use different framings:
> - **Q02 Why Context Compaction Needs Six Mechanisms** combines "five in-call + reactiveCompact" into **"six compaction mechanisms,"** emphasizing the complete defensive system
> - **Part 5 Chapter 1** uses the high abstraction of **"three tiers"** (light trim / medium compaction / heavy full-text compaction) for critical analysis
>
> These three descriptions correspond to **different abstraction granularities of the same system**, not contradictions. This chapter uses "five tiers" from an engineering perspective, Q02 uses "six mechanisms" from a systems perspective, and Part 5 uses "three tiers" from a classification perspective.

### Tier 1: Tool Result Trimming (applyToolResultBudget)

**Analogy**: Your suitcase is limited; first fold the bulkiest clothes more compactly.

Each tool result has a maximum size limit (`maxResultSizeChars`). Content beyond the limit is truncated. This is the lightest compaction—it only chops the large ones, leaving small ones untouched.
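
A minimal sketch of what Tier 1 amounts to, assuming a plain-string result and an invented truncation marker (the real `applyToolResultBudget` operates on structured message blocks):

```typescript
function trimToolResult(content: string, maxResultSizeChars: number): string {
  // Small results pass through untouched — only the bulky ones get folded.
  if (content.length <= maxResultSizeChars) return content
  const omitted = content.length - maxResultSizeChars
  return content.slice(0, maxResultSizeChars) +
    `\n[... ${omitted} characters truncated ...]`
}
```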

### Tier 2: Historical Snipping (snipCompactIfNeeded)

**Analogy**: A diary has hundreds of pages; tear out the "nothing happened today" pages in the middle.

Message segments marked `HISTORY_SNIP` are removed. These are typically intermediate content that is no longer important.

### Tier 3: Micro-Compaction (microcompact)

**Analogy**: Delete templated phrases like "above is the file content" that repeatedly appear in documents.

Eliminates redundant markers and duplicate content in tool results. Doesn't change semantics, only reduces "fluff."

### Tier 4: Context Collapse (applyCollapsesIfNeeded)

**Analogy**: Fold the first few months of your diary into a single summary—"March was mainly about refactoring."

`CONTEXT_COLLAPSE` replaces old dialogue segments with AI-generated summaries. Information is significantly reduced, but core points are preserved.

### Tier 5: Automatic Full-Text Compaction (autocompact)

**Analogy**: The suitcase really won't close—take everything out and repack it in vacuum bags.

When token usage exceeds the threshold, a **new AI instance** is spawned to summarize the entire conversation. This is the heaviest operation—it requires an extra API call, but can compress context to 1/3 or 1/5 of its original size.

**Design point**: The five tiers are triggered conditionally—every heartbeat **checks** whether each tier needs to run, but only executes if conditions are met (short-circuit evaluation, not a waterfall where everything always runs). The first three tiers (trimming, snipping, micro-compaction) have negligible check overhead (a few string comparisons), so running them every time has almost no performance impact. Tier 4 (collapse) and Tier 5 (full-text compaction) trigger only when token usage exceeds the threshold. The precise trigger threshold for `autocompact` comes from the source calculation formula (`autoCompact.ts:33-76`):

```
effectiveContextWindow = contextWindow - min(maxOutputTokens, 20_000)
autocompactThreshold   = effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS
                       (AUTOCOMPACT_BUFFER_TOKENS = 13_000)
```

For a standard 200K context window: threshold ≈ (200K − 20K reserved for summary output) − 13K ≈ **167K** (~83.5%). For models with the context-1M beta enabled, the ceiling is **1M** (`context.ts:71,86,89`), and the threshold scales proportionally larger. 99% of heartbeats complete context management within the first three tiers, without triggering the expensive fourth and fifth tiers.
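
The formula translates directly into a few lines of TypeScript. `AUTOCOMPACT_BUFFER_TOKENS` follows the source reference above; `MAX_SUMMARY_OUTPUT_RESERVE` is an assumed name for the 20K figure:

```typescript
const AUTOCOMPACT_BUFFER_TOKENS = 13_000
const MAX_SUMMARY_OUTPUT_RESERVE = 20_000  // assumed name for the 20K reserve

function autocompactThreshold(contextWindow: number, maxOutputTokens: number): number {
  // Reserve room for the summary output, then subtract the safety buffer.
  const effectiveContextWindow =
    contextWindow - Math.min(maxOutputTokens, MAX_SUMMARY_OUTPUT_RESERVE)
  return effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS
}

// Standard 200K window: (200_000 - 20_000) - 13_000 = 167_000 tokens (~83.5%)
autocompactThreshold(200_000, 32_000)  // → 167000
```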

> 💡 **Plain English**: Context compaction is like a **meeting note-taker's strategy**—step 1: fold up lengthy attachments (trim tool results) → step 2: tear out "nothing happened today" pages (snip useless segments) → step 3: remove repetitive fluff (micro-compaction) → step 4: compress months of content into a summary (collapse old dialogue) → step 5: the full recording is too long, replace it with an executive summary (full-text compaction). Most of the time, the first three steps are enough; you don't need to deploy the "nuclear option."

> **Sixth line of defense: reactiveCompact (responsive emergency compaction)**—if the API still returns a 413 error after the five tiers, the system immediately triggers an emergency full-text summary and retries. It only attempts once: if it gets another 413 after compaction, it's considered unrecoverable. See Q02 for details.

---

## 3. Streaming Processing: Catching the Ball While Running

Traditional API calls use a request-response pattern: send the request, wait for the complete response. But Claude Code uses **streaming**: every token the AI generates is sent to the client immediately.

This brings two benefits:
1. **User experience**: The user sees the AI "typing in real time," rather than waiting several seconds for a wall of text to appear
2. **Tool parallelism**: Tools mentioned by the AI can start executing while it's still speaking

### StreamingToolExecutor: Starting Work Before the Model Finishes Talking

Most agent frameworks adopt a "talk first, act later" pattern: AI generates full response → parse tool calls → execute tools. Three serial steps. Claude Code's StreamingToolExecutor breaks this order—it **executes on the fly** during streaming. This isn't a completely new invention (Cursor's Agent Mode has similar streaming handling), but Claude Code's implementation makes more finely grained engineering decisions in **concurrency safety classification**.

StreamingToolExecutor's flow: AI is talking → a `tool_use` block is detected → **immediately start executing that tool** → AI continues talking → another is detected → **immediately start that one too** → by the time the AI finishes speaking, the tools may already be done.

```
Timeline (traditional approach):
  AI speaks ████████████████████ → parse → execute tool A ███ → execute tool B ███ → execute tool C ███
  Total time: ═══════════════════════════════════════════════════════════════════════════

Timeline (StreamingToolExecutor):
  AI speaks ████████████████████
         tool A ███ (started while AI is still talking)
              tool B ███
                   tool C ███
  Total time: ════════════════════════ (significant time saved)
```

**Key constraint**: Only tools marked `isConcurrencySafe` can execute in parallel during the streaming phase. This classification mechanism is the genuinely valuable engineering decision in StreamingToolExecutor:

| Concurrency Safe | Tool Examples | Reason |
|------------------|---------------|--------|
| ✅ Parallelizable | Read, Glob, Grep, WebSearch | Pure read operations, no state mutation |
| ❌ Must be exclusive | Edit, Write, Bash, NotebookEdit | Modify files / execute commands, may conflict |
| ⚠️ Conditional | Agent (subagent) | Depends on the subagent's isolation level |
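
The scheduling rule this table encodes can be sketched as a simple partition. The `Tool` interface and `scheduleToolBatch` below are illustrative; the real `StreamingToolExecutor` interleaves this decision with stream parsing rather than batching after the fact:

```typescript
interface Tool {
  name: string
  isConcurrencySafe(): boolean
  call(input: unknown): Promise<string>
}

async function scheduleToolBatch(calls: { tool: Tool; input: unknown }[]): Promise<string[]> {
  const safe = calls.filter(c => c.tool.isConcurrencySafe())
  const unsafe = calls.filter(c => !c.tool.isConcurrencySafe())

  // Read-only tools (Read, Glob, Grep, ...) fan out in parallel.
  const safeResults = await Promise.all(safe.map(c => c.tool.call(c.input)))

  // Mutating tools (Edit, Write, Bash, ...) run strictly one at a time.
  const unsafeResults: string[] = []
  for (const c of unsafe) unsafeResults.push(await c.tool.call(c.input))

  // Note: a sketch-level simplification — results are grouped, not
  // re-ordered back into original call order.
  return [...safeResults, ...unsafeResults]
}
```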

> 📚 **Course Connection**: This is essentially the application-layer implementation of **instruction pipelining** from computer architecture courses. The role of `isConcurrencySafe` is equivalent to **hazard detection** in a CPU—identifying which instructions have data dependencies (RAW/WAW hazards); dependent ones must be serial, independent ones can be parallel.

> **🔑 OS Analogy:** It's like **downloading a video while watching it** on your phone—the downloader fetches the file in the background while the player streams it, neither waiting for the other, both happening simultaneously. But if two apps tried to write to the same file at once, you'd have problems—so the system needs to judge "which operations can run in parallel and which must queue up."

> 🌍 **Competitor Comparison**: Cursor's Agent Mode also supports streaming tool execution, but at finer granularity—Cursor operates directly on the AST (abstract syntax tree) rather than text-level Edit, enabling more fine-grained parallelism. Aider adopts a pure "talk first, act later" model with no streaming parallelism. Windsurf (Codeium)'s Cascade uses a multi-step plan-then-execute model, forming a classic architectural contrast of "planning vs flexibility" with Claude Code's reactive loop.

---

## 4. Tool Execution: The Ten-Step Permission Check of a "System Call"

When the heartbeat enters "diastole" (the tool execution phase), each tool call must pass through the permission system. This process is analyzed in detail in Part 3 Q05 and Part 4 Chapter 1; here we only describe its place in the `queryLoop`.

```
queryLoop discovers a tool_use block
  → canUseTool() permission check (full 10-step chain):
    → ① Tool existence check — does this tool name exist in the 40 built-in tool catalog + MCP dynamic tools?
    → ② Tool availability check — is it enabled in the current mode? (e.g., write tools disabled in plan mode)
    → ③ Enterprise Policy check — does MDM/remote policy allow it? 【Highest priority, non-overridable】
    → ④ Sandbox check — is the operation within the sandbox's allowed scope?
    → ⑤ Read-only fast path — tools with isReadOnly()=true bypass directly, skipping subsequent steps
    → ⑥ Permission rule matching — check the user's allow/deny rule list
    → ⑦ Permission mode determination — which external mode: default / acceptEdits / bypassPermissions / plan / dontAsk? (source `types/permissions.ts:16-22` EXTERNAL_PERMISSION_MODES)
    → ⑧ Iron Gate check — bypass-immune operations (e.g., Bash rm -rf) must be confirmed even in bypass mode
    → ⑨ Tool implementation self-check — content-sensitive check via tool.checkPermissions() (e.g., Bash subcommands)
    → ⑩ Final decision consolidation — "passthrough" converted to "ask", yielding the final ternary result of Allow / Ask / Deny
    → Result: Allow / Ask / Deny
  → Allow → tool.call() executes
  → Ask → UI permission dialog pops up, waits for user
  → Deny → return denial message to the AI
```

> 📚 **Design Pattern Connection**: This ten-step chain is the classic **Chain of Responsibility** pattern—each step can independently decide "pass" or "deny" without knowing the other steps exist. At the same time, step ⑤'s read-only fast path is an engineering application of **short-circuit evaluation**. Detailed analysis in Part 3 Q05.
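
A skeletal version of that chain shape, with the ten steps abstracted into a step list (the names and types here are illustrative, not the real signatures):

```typescript
type Decision = 'allow' | 'ask' | 'deny'

interface ToolRequest { toolName: string; input: unknown /* ... */ }

// Each step either short-circuits with a decision or passes the request on.
type Step = (req: ToolRequest) => Decision | 'passthrough'

function canUseTool(req: ToolRequest, steps: Step[]): Decision {
  for (const step of steps) {
    const verdict = step(req)
    if (verdict !== 'passthrough') return verdict  // short-circuit, e.g. step ⑤'s read-only fast path
  }
  // Step ⑩: anything still undecided is escalated to the user.
  return 'ask'
}
```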

**Tool execution results** are placed into a `tool_result` content block inside a `UserMessage`, following the Anthropic API message format. A single UserMessage can contain multiple `tool_result` blocks—this is the mechanism for batch-returning results after multiple tools execute in parallel during one heartbeat. From the AI's perspective, "I called several tools, and then the user told me the execution results." In reality, the "user" did nothing—these "user messages" are auto-generated by the system. This "deceptive abstraction" lets the AI remain unaware of the tool system's existence, preserving the simplicity of the API interface.

> 📚 **Design Pattern Connection**: This is the classic **Adapter Pattern**—the true source of `tool_result` is the tool runtime, but by wrapping it as a UserMessage, it adapts to the Anthropic API constraint of "only two roles, user and assistant, alternating in conversation."
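
Concretely, the synthesized message follows the public Anthropic Messages API format. A sketch of what a two-tool heartbeat returns (ids and contents invented):

```typescript
const toolResultMessage = {
  role: 'user' as const,
  content: [
    {
      type: 'tool_result' as const,
      tool_use_id: 'toolu_01A',   // matches the assistant's earlier tool_use block
      content: '# README\n...file contents...',
    },
    {
      type: 'tool_result' as const,
      tool_use_id: 'toolu_01B',
      content: 'Applied edit to README.md',
      is_error: false,            // failures also come back as tool_result blocks
    },
  ],
}
```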

---

## 5. The Design Choice of AsyncGenerator

Why is `queryLoop` an AsyncGenerator rather than a plain async function?

> 💡 **Plain English (no coding background needed)**: A regular function is like **ordering takeout**—you place the order, wait, and get everything served at once. An AsyncGenerator is like a **conveyor-belt hot pot**—the kitchen places dishes on the belt one by one; you eat one, and the next arrives; if you eat slowly, it waits; you can stop anytime. `queryLoop` is that conveyor belt: every AI thought, every tool result is a dish on the belt, and the frontend UI displays them to the user one plate at a time.

**Plain async function** (takeout mode: served all at once):
```typescript
async function queryLoop(): Promise<Result> {
  // Once finished, returns the result in one go—caller has no idea what happened in between
}
```

**AsyncGenerator** (conveyor belt mode: made and served continuously):
```typescript
async function* queryLoop(): AsyncGenerator<Event, Result> {
  yield progressEvent;   // "Conveyor belt sends out a plate"—intermediate state visible in real time
  yield toolResultEvent; // "Another plate out"—caller can update UI immediately
  return finalResult;    // "The main course"—conveyor belt stops
}
```

AsyncGenerator is a conventional choice in Node.js stream processing (not unique to Claude Code), but it matches AI agent requirements precisely:

1. **Intermediate states observable**: each `yield` exposes internal state to the caller, which can update the UI in real time
2. **Interruptible**: the caller can `return()` the generator at any time to implement Ctrl+C interruption
3. **Backpressure** (in plain terms: automatic slowdown when consumption can't keep up): when the caller processes slowly, the generator automatically pauses—unlike EventEmitter, which can cause memory pileup
4. **Composability**: multiple generators can be nested with `yield*`—a subagent's `queryLoop` can be embedded into a parent agent's `queryLoop`
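
Properties 2 and 4 are easy to demonstrate with a self-contained toy, independent of Claude Code's actual code:

```typescript
async function* subagentLoop(): AsyncGenerator<string> {
  yield 'subagent: reading file'
  yield 'subagent: done'
}

async function* queryLoopSketch(): AsyncGenerator<string> {
  yield 'thinking...'
  yield* subagentLoop()   // property 4: a child loop embeds seamlessly
  yield 'final answer'
}

async function main() {
  for await (const event of queryLoopSketch()) {
    console.log(event)
    // property 2: breaking out calls the generator's return() under the
    // hood — the same mechanism a Ctrl+C handler uses to tear down cleanly
    if (event === 'subagent: done') break
  }
}
main()
```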

> 📚 **Course Connection**: AsyncGenerator's backpressure mechanism is essentially the **producer-consumer model** from operating systems courses—`queryLoop` is the producer, the UI is the consumer, and `yield` is a bounded-capacity buffer. When the consumer can't keep up, the producer automatically blocks.
>
> **🔑 OS Analogy:** It's like a factory **conveyor belt**—`queryLoop` keeps placing products (events) on the belt, and the caller continuously takes them off the other end, with quality checks and packaging stages insertable in between.

> 🌍 **Why not other options?** In the Node.js ecosystem, RxJS Observable and EventEmitter can achieve similar event streams. AsyncGenerator's advantages are: it's native to the language (no external dependency needed), supports `for await` syntax (cleanest code), and has built-in backpressure (RxJS requires manual handling, EventEmitter has none). The trade-off is debugging difficulty—stack traces break at generator boundaries, and async error propagation paths are more complex than Promise chains.

### Real Source Code: The Core Skeleton of `queryLoop`

Below is extracted from `src/query.ts` (simplified, core structure preserved):

```typescript
// src/query.ts — core skeleton of queryLoop (simplified)

async function* queryLoop(
  params: QueryParams,
  consumedCommandUuids: string[],
): AsyncGenerator<StreamEvent | Message, Terminal> {
  // ① Immutable parameters—never change during the entire loop
  const { systemPrompt, canUseTool, maxTurns } = params

  // ② Mutable state—destructured at the start of each iteration, replaced wholesale on continue
  let state: State = {
    messages: params.messages,
    toolUseContext: params.toolUseContext,
    autoCompactTracking: undefined,
    hasAttemptedReactiveCompact: false,
    turnCount: 1,
    // ... also stopHookActive, pendingToolUseSummary, etc.
  }

  // ③ Heartbeat loop—while(true) until AI stops calling tools
  while (true) {
    let { toolUseContext } = state
    const { messages, turnCount } = state

    // ④ Five-tier compaction: trim → snip → micro-compact → collapse → auto full-text compact
    let messagesForQuery = messages
    messagesForQuery = await applyToolResultBudget(messagesForQuery, ...)
    messagesForQuery = snipCompactIfNeeded(messagesForQuery)
    messagesForQuery = (await microcompact(messagesForQuery, ...)).messages
    messagesForQuery = (await applyCollapsesIfNeeded(messagesForQuery, ...)).messages  // context collapse
    const { compactionResult } = await autocompact(messagesForQuery, ...)
    // after autocompact succeeds: messagesForQuery = buildPostCompactMessages(compactionResult)
    // → spliced directly back into current turn to continue, no query rebuild (source query.ts:528-535)

    // ⑤ Call API (systole)
    yield { type: 'stream_request_start' }
    const stream = callModel(messagesForQuery, systemPrompt, ...)

    // ⑥ Streaming processing—receive and execute tools on the fly
    for await (const event of stream) {
      // text_delta → display to user in real time
      // tool_use  → StreamingToolExecutor starts immediately
    }

    // ⑦ Tool execution (diastole) + permission check
    const toolResults = await runTools(toolUseBlocks, canUseTool, ...)

    // ⑧ Post-hoc control (three stages in order)
    // 8a. Error recovery: prompt-too-long → reactiveCompact; max-output-tokens → upgrade to 64K retry (max 3 times)
    // 8b. Stop Hooks: handleStopHooks() business rule blocking
    // 8c. TOKEN_BUDGET: checkTokenBudget() decides whether to inject meta message for continuation

    // ⑨ Decision: more tool calls? Or TOKEN_BUDGET injected a continuation message?
    if (!needsFollowUp) {
      return { reason: 'stop', ... }  // heart stops, turn ends
    }
    // otherwise continue, enter next heartbeat
  }
}
```

**Code highlights**:
- **Immutable vs mutable separation** (①②): `params` never changes during the loop; `state` is replaced wholesale on each continue (rather than field-by-field assignment)—the source comment says "Continue sites write `state = { ... }` instead of 9 separate assignments," preventing bugs where a field update is forgotten
- **`yield` is the conveyor belt exit** (⑤): each `yield` pushes events to the caller (QueryEngine → React UI), realizing AsyncGenerator's "make while delivering"
- **Five-tier compaction happens before the API call** (④): "purify the blood" at the start of each heartbeat to avoid exceeding the context window

---

## 6. Calling the API: The Complete Journey of a "System Call"

`callModel()` is the only "system call" `queryLoop` makes to the outside world—requesting the Anthropic API. It does more than you might expect:

### 6.1 Request Construction

```
Message history (messages array)
  + System prompt (system prompt)
  + Tool schemas (tools array, JSON Schema for each tool)
  + Model parameters (model, max_tokens, temperature=1)
  + Cache control markers (cache_control breakpoints)
```

### 6.2 Cache Boundary Optimization

A key feature of the Anthropic API is **Prompt Cache**. If the prefix of two consecutive requests is identical, the repeated portion doesn't need to be reprocessed. Per Anthropic's official documentation: cache reads cost **10%** of the base price, and latency can be reduced by roughly **85%**. Note, however, that the first cache write costs slightly **more** than the base price (~125%), so real savings only materialize when the prefix is reused multiple times—which is almost always true in Claude Code's multi-turn dialogue scenarios.

> 📚 **Course Connection**: This is conceptually identical to **HTTP caching** (ETag / Cache-Control) from computer networking courses—marking which content hasn't changed and only transmitting the delta. It's also similar to the **page cache** in operating systems: frequently read pages stay resident in memory to avoid repeated disk I/O.

The system places `cache_control` breakpoints according to Anthropic best practices (Claude Code, as Anthropic's own product, naturally takes full advantage of its own API features):
- End of the system prompt (the most stable part, almost always hits)
- End of the tool schema (tool lists rarely change during a session)
- Before the most recent messages (historical messages are static, only new ones change)

**Analogy**: Imagine writing a letter to the same person every day. The salutation ("Dear Zhang San, I'm Li Si") is identical each time. If the post office remembers this opening, you only need to mail the new part, and the opening is "automatically attached." Prompt Cache is that "post office memory."
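
In request terms, the three breakpoints land as `cache_control` markers in the standard Anthropic Messages API format. A sketch with abridged, invented values:

```typescript
const request = {
  model: 'claude-sonnet-4-5',
  max_tokens: 32_000,
  temperature: 1,
  system: [
    // Breakpoint 1: end of the system prompt (most stable prefix).
    { type: 'text', text: '...system prompt...', cache_control: { type: 'ephemeral' } },
  ],
  tools: [
    /* ...earlier tools... */
    // Breakpoint 2: a marker on the LAST tool caches the whole tool list.
    { name: 'Bash', description: '...', input_schema: { type: 'object' },
      cache_control: { type: 'ephemeral' } },
  ],
  messages: [
    // Breakpoint 3: a marker near the end of history; everything before it
    // becomes a stable cacheable prefix, only the newest turns are fresh.
    { role: 'user', content: [
      { type: 'text', text: '...latest user message...', cache_control: { type: 'ephemeral' } },
    ]},
  ],
}
```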

### 6.3 The Temperature=1 Design Decision

A surprising detail in the source code: Claude Code's main loop fixes temperature at **1** (`claude.ts:1694`: `options.temperatureOverride ?? 1`).

> 💡 **Plain English**: Temperature is like a **chef's creative freedom**—temperature=0 means strictly following the recipe (same dish every time), temperature=1 means allowing the chef to improvise (add spices, change the plating). You might think an AI coding assistant should use temperature=0 (for determinism), but Claude Code chose 1.

**Why not temperature=0?** Three reasons:

1. **API requirement**: The Anthropic API **requires** temperature=1 when extended thinking is enabled (source comment: "the API requires temperature: 1 when thinking is enabled, which is already the default"). Claude Code enables thinking by default, so this value is API-mandated
2. **Diversity needed**: Programming tasks aren't math problems—"write me a README" has countless reasonable answers. temperature=1 lets the AI provide diverse responses within a reasonable range, avoiding identical code generation every time
3. **Tool call determinism doesn't come from temperature**: What really needs determinism isn't text generation, but tool call decisions (which file to read, which line to edit). The reliability of these decisions comes from precise instructions in the system prompt and constraints in the tool schema, not from low temperature

**Interesting contrast**: **Not everywhere** in the system uses temperature=1. The Auto mode permission classifier (`yoloClassifier`) uses **temperature=0** (`yoloClassifier.ts:784`), because permission judgments need maximum determinism—"is this Bash command dangerous?" is not a place for creative improvisation.

| Scenario | Temperature | Reason |
|----------|-------------|--------|
| Main loop (`queryLoop`) | 1 | API requirement + diversity need |
| Permission classifier (yoloClassifier) | 0 | Safety judgments need maximum determinism |
| Hook evaluation (apiQueryHookHelper) | 0 | Structured judgments don't need randomness |
| Skill improvement (skillImprovement) | 0 | Structured output needs consistency |

> 📚 **Course Connection**: Temperature corresponds to the concept of **softmax temperature scaling** in machine learning—as T→0, the probability distribution tends toward one-hot (the highest-probability token is selected); as T→∞, it tends toward uniform (all tokens equally likely). T=1 is the unscaled original distribution.

### 6.4 Stream Consumption

The API doesn't return a complete answer all at once; it arrives **like a typewriter, one character at a time**. Imagine watching someone type in a chat app—"I" → "I am" → "I am here" → "I am here to help"—each small piece is a "chunk." `queryLoop` receives and processes them chunk by chunk using `for await`.

At the data level, each chunk is a JSON object (the example below is for developers to dive deeper; non-technical readers only need to understand "pieces arriving one by one"):

```
chunk 1: "beginning of text"         → UI prepares display area
chunk 2: text content "I am"         → UI immediately shows "I am"
chunk 3: text content "help"         → UI appends "help"
  ...
chunk N: "begin tool call: Read"     → StreamingToolExecutor captures, prepares to execute
chunk N+1: tool params '{"fil'       → params not complete yet, continue concatenating...
chunk N+2: tool params '_path":"README.md"}' → params complete! immediately execute Read
  ...
chunk M: "message end"               → this API call completes
```

Text blocks are displayed to the user in real time. `tool_use` blocks are captured by StreamingToolExecutor—when the tool parameter JSON is fully assembled (i.e., `partial_json` accumulates into valid JSON), execution begins immediately, without waiting for the AI to finish speaking. This is the technical implementation of "talking while working."

**Partial JSON parsing mechanism**: When the AI streams `tool_use` parameters, the JSON arrives character by character (`{"fil` → `{"file_p` → `{"file_path": "/src/main.tsx"}`). StreamingToolExecutor attempts to parse the accumulated `partial_json` into complete JSON on every new chunk. Once parsing succeeds (`JSON.parse` doesn't throw), the parameters are considered complete and tool execution launches immediately. This means: if the AI calls multiple tools in one response, the first tool starts executing as soon as its parameters are complete, without waiting for the second tool's parameters to finish streaming.
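
A minimal sketch of that "fire as soon as the JSON closes" rule (the collector shape is invented; the real executor also tracks block indices and `tool_use` ids per stream event):

```typescript
function makePartialJsonCollector(onComplete: (args: unknown) => void) {
  let buffer = ''
  let fired = false
  return (fragment: string) => {
    buffer += fragment
    if (fired) return
    try {
      const args = JSON.parse(buffer)  // throws until the JSON is complete
      fired = true
      onComplete(args)                 // launch the tool immediately
    } catch {
      /* not complete yet — wait for the next chunk */
    }
  }
}

// Usage: fragments arrive as '{"fil', 'e_path":"READ', 'ME.md"}'
const collect = makePartialJsonCollector(args => console.log('execute Read', args))
collect('{"fil')
collect('e_path":"READ')
collect('ME.md"}')  // → execute Read { file_path: 'README.md' }
```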

---

## 7. A Complete Heartbeat Example

Let's trace a concrete scenario—the user says "read README.md for me and add a line":

```
Heartbeat 1:
  [Compact] — conversation just started, nothing to compact
  [Call API] — send user message + system prompt
  [Stream response] — AI says: "Sure, let me read the file first"
                AI calls tool_use: Read(file_path="README.md")
                → StreamingToolExecutor immediately starts reading
  [Tool execution] — Read tool returns file content
  [Decision] — tool calls exist, continue

Heartbeat 2:
  [Compact] — not much history, skip
  [Call API] — send updated message history (including Read result)
  [Stream response] — AI says: "The file contents are... now I'll edit it"
                AI calls tool_use: Edit(file_path="README.md", ...)
  [Tool execution] — permission check → UI confirmation popup → user approves → execute Edit
  [Decision] — tool calls exist, continue

Heartbeat 3:
  [Compact] — still not much
  [Call API] — send updated message history (including Edit result)
  [Stream response] — AI says: "The modification is complete. I added a line..."
  [Tool execution] — no tool calls
  [Decision] — no tool calls, return { reason: 'stop' }

→ Loop ends. Three heartbeats complete one task.
→ Background tasks start: SessionMemory extraction, file snapshot, Prompt Suggestion
```

---

## 8. Design Trade-Offs and Competitor Comparison

### Decisions That Work Well

1. **Five-tier progressive compaction** is genuinely valuable engineering—99% of cases only need the first three tiers, avoiding unnecessary heavy compaction overhead. **Contrast with Aider**: Aider takes a completely different strategy—through its "repo map," it only sends necessary content upfront (the most relevant function signatures and class names from the source), rather than compressing after the fact. The trade-off between the two strategies: Claude Code's approach preserves full context (more flexible but more expensive), while Aider's approach saves tokens but may miss relevant code
2. **StreamingToolExecutor's concurrency safety classification**—finely distinguishing Read (parallelizable) from Edit (must be exclusive), rather than simple all-serial or all-parallel. Though the classification criteria are simple, they cover the vast majority of real-world scenarios
3. **AsyncGenerator choice** matches requirements well—intermediate states observable, interruptible, supports backpressure. This is a conventional choice in the Node.js ecosystem (not unique), but it's more suitable for agent scenarios than Promise or EventEmitter
4. **Permission checks embedded inside the loop**—each tool call is individually reviewed; there's no possibility of "batch bypassing"

### Costs and Limitations

1. **AsyncGenerator debugging is hard**—stack traces break at generator boundaries, and async error propagation paths are complex. This is the inherent cost of choosing the generator pattern
2. **StreamingToolExecutor's concurrency safety relies on manual labeling**—`isConcurrencySafe` is a boolean set manually by the tool author, with no static analysis or runtime detection to verify correctness. If a tool is incorrectly labeled safe, it can cause concurrency conflicts
3. **Inter-tier coupling in the five-tier compaction**—Tier 3 (microcompact, eliminating redundancy) modifies message content in the current heartbeat, but Tier 2 (snipCompactIfNeeded) in the **next** heartbeat may make different decisions based on already-modified content. Concrete example: after Tier 3 removes a duplicate tool result, Tier 2's previous assessment of "this segment is unimportant, can be snipped" may become invalid because the content it would have snipped is already gone
4. **No global timeout**—`max_turns` limits turn count but not time. If a Bash tool runs a 10-minute compilation command, the entire heartbeat stalls for 10 minutes. **Contrast with Cursor**: Cursor sets per-tool timeouts, which is safer but may also kill legitimate long-running operations by mistake

### Four-Layer Failure Recovery Model

The `queryLoop`'s resilience comes not just from "being able to run," but from "being able to get back up after falling." The system's failure recovery is organized into **four progressive layers**—each covering a different granularity of failure:

| Layer | Recovery Granularity | Mechanism | Typical Scenario |
|-------|---------------------|-----------|------------------|
| **Turn-level self-rescue** | Current turn | reactive compact (compress and retry on prompt-too-long), max-output-tokens upgrade retry (8K→64K, max 3 times), recovery message injection | API returns 413 / output truncated |
| **Task-level hosting** | Current task | Task lifecycle management (background/foreground switching, isBackgrounded, DreamTask background tidying) | User switches to another task |
| **Session-level resumption** | Current session | JSONL persistence + `--continue` (live truth first, then transcript) + File History snapshot | Process killed / laptop closed |
| **Remote-level reconnection** | Cross-environment | Bridge Pointer recovery (4h TTL) + worktree fanout + perpetual teardown without closing transport | Network disconnect / remote CCR restart |

> 💡 **Plain English**: The four recovery layers are like a race car's safety systems—seatbelt (turn-level, handles small bumps) → airbag (task-level, keeps racing after a crash) → roll cage (session-level, car can be repaired after a rollover) → rescue helicopter (remote-level, driver can be airlifted to continue even if the car is totaled). Each layer covers a more severe failure, but activation cost is also higher.

### Architecture Alternative Analysis

| Decision | Claude Code's Choice | Alternative | Why Not Chosen |
|----------|---------------------|-------------|----------------|
| Loop pattern | `while(true)` implicit state flow | Explicit state machine (THINKING→TOOL_CALLING→WAITING) | State machines are more auditable/persistable (checkpoint resume), but Claude Code's loop body is simple enough that this complexity isn't needed |
| Event stream | AsyncGenerator + `for await` | RxJS Observable | RxJS is more powerful (composition operators) but introduces an external dependency and a steep learning curve |
| Context management | Five-tier post-hoc compaction | Aider-style pre-filtering (repo map) | Pre-filtering saves tokens but may miss context; post-hoc compaction preserves full information at higher cost |
| Tool execution | Streaming parallel (StreamingToolExecutor) | Wait-then-execute | Streaming parallel saves time but adds concurrency complexity; most competitors choose the simpler serial approach |
| Tool concurrency | Binary by `isConcurrencySafe` | Fine-grained concurrency control (e.g., semaphores/resource locks) | Binary classification is simple and effective, covering the vast majority of scenarios; fine-grained control adds too much complexity |

---

> **[Chart placeholder 2.4-A]**: Heartbeat sequence diagram — a complete heartbeat showing systole/diastole, with time proportions for API call, tool execution, and compaction
> **[Chart placeholder 2.4-B]**: StreamingToolExecutor parallel timing comparison — time savings of serial vs streaming parallel
> **[Chart placeholder 2.4-C]**: Five-tier compaction waterfall diagram — trigger conditions and compression ratios from light to heavy

---

## TL;DR for Non-Technical Readers

> **If you only remember three things**:
>
> 1. **Claude Code is an assembly line that never stops turning.** You send a message, and it starts looping: ask AI → AI says what to do → does it → tells AI the result → AI says what to do next... until the AI says "done," and the line stops. A simple "read a file and change one line for me" takes three full rotations (read file, edit file, report results).
>
> 2. **It listens and acts simultaneously, not waiting for the AI to finish.** Like a phone call with a colleague—while they're still saying "take a look at the README," you've already opened the file; you don't wait for the whole sentence to finish. This makes the whole process much faster. But risky tasks like editing files or running commands must line up and execute one by one; you can't edit two files at once.
>
> 3. **It automatically "takes notes" when the conversation gets too long.** AI memory has a ceiling (~200,000 tokens, roughly a 300-page book). When it's nearly full, the system automatically summarizes the earlier conversation into a condensed abstract, freeing up space to continue. Like a meeting note-taker compressing three hours of recording into one page—you won't notice, but the information is still there.

---

## Code Landmarks

- `src/services/api/claude.ts` — `query()` function: wraps Anthropic API calls, streaming token processing, prompt cache boundary control
- `src/query.ts` — `queryLoop()`: the body of the `while(true)` heartbeat loop, including termination condition evaluation and compaction triggers
- `src/tools/tools.ts` — tool registry: aggregation of all built-in tools, where `isConcurrencySafe` is declared
- `src/utils/messages.ts` — message utility functions: message creation, text extraction, assistant message location
