# Why Can Tools Start Executing Before the Model Stops Speaking?

In traditional AI tool calling, the default assumption is "wait until the model finishes speaking before taking action." Claude Code breaks this assumption — it enables tools to execute in parallel while the model is still streaming its response, transforming idle wait time into productive work time. This chapter reveals the concurrency design behind `StreamingToolExecutor` and `runTools` batch scheduling, and how it compresses multi-tool-call latency from serial accumulation to parallel overlap.

> 💡 **Plain English**: It's like a restaurant chef who, upon hearing "first, bring me a soup," starts cooking the soup immediately without waiting for the guest to finish ordering the rest of the dishes.

### 🌍 Industry Context: The State of Concurrent AI Tool Execution

"Stream parsing + early execution" is not a brand-new concept, but its implementation varies across AI coding assistants:

- **OpenAI Function Calling**: The API supports the `parallel_tool_calls` parameter (introduced late 2023), allowing the model to output multiple tool calls in a single response. However, parallelization on the execution side is left to the client — the OpenAI SDK itself does not provide streaming early-execution capabilities.
- **Cursor**: Tool execution follows a "batch after full response" pattern. Because Cursor's tool calls (code edits, terminal commands) often require user confirmation, the benefits of early execution are limited.
- **Aider**: Tool calls are synchronous and serial — the model generates a complete response, edit instructions are parsed out, and applied one by one. There is no streaming parallel mechanism.
- **LangChain**: `AgentExecutor` executes tool calls serially by default; `LangGraph` supports parallel nodes via graph structures, but developers must manually design the parallel topology — it is not automatic.
- **Codex (OpenAI)**: Version v0.118.0 introduced parallel agent workflows and a Mailbox communication mechanism, supporting asynchronous interaction across multiple background processes. The underlying layer has been rewritten in Rust (95.6% of the codebase), giving it an inherent advantage in concurrent performance. However, its parallelization focuses on asynchronous communication between multiple agents rather than streaming early execution within a single agent.

Claude Code's `StreamingToolExecutor` implements two layers of optimization: (1) execute during the streaming parsing phase — without waiting for the full response; (2) automatically batch by concurrency safety — read-only tools run in parallel, write operations run serially. This combination of **streaming + safe batching** represents a relatively refined implementation among AI programming tools. OpenAI's `parallel_tool_calls` solves parallel output on the model side; Claude Code goes further on the client side to compress execution latency.

---

## The Question

The typical mental model of AI tool execution is: model generates full response → tool calls are discovered inside it → tools are executed → results are sent back to the model. But Claude Code has a design called `StreamingToolExecutor` that starts executing tools while the model is still outputting. How is this possible, and why does it matter?

---

## You Might Think...

You might think this requires some complex streaming protocol, or modifications to how the Anthropic API is called. In fact, the principle is straightforward, and its value is greater than you might expect.

---

## Here's How It Actually Works

### Key Insight: `tool_use` Blocks Can Appear Early

When Claude wants to use a tool, it generates a `tool_use` block in its response, roughly like this:

```json
{
  "type": "tool_use",
  "id": "tool_abc123",
  "name": "Read",
  "input": {"file_path": "src/main.tsx"}
}
```

In streaming mode, this block arrives incrementally — first `type`, then `name`, then the fields of `input`. **Once `input` is fully received, we have all the information needed to execute the tool.**

At this point, the model may still be producing more output — perhaps explaining why it wants to read this file, or emitting another tool call further down.
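The shape of these streaming events follows Anthropic's Messages API. Below is a minimal sketch (not Claude Code's actual parser) of how a client can detect the moment a `tool_use` block's input is complete and hand it off for early execution; `executeEarly` is a hypothetical callback standing in for `addTool()`.

```typescript
// Sketch: detecting a complete tool_use block from Anthropic's streaming events.
// Event shapes follow the public Messages streaming API; `executeEarly` is a
// hypothetical callback standing in for StreamingToolExecutor.addTool().

interface ToolUseInProgress {
  id: string;
  name: string;
  partialJson: string; // accumulated input_json_delta fragments
}

const inProgress = new Map<number, ToolUseInProgress>();

function onStreamEvent(
  event: any,
  executeEarly: (block: { id: string; name: string; input: unknown }) => void,
) {
  switch (event.type) {
    case "content_block_start":
      if (event.content_block.type === "tool_use") {
        // type, id, and name arrive first; input streams in afterwards
        inProgress.set(event.index, {
          id: event.content_block.id,
          name: event.content_block.name,
          partialJson: "",
        });
      }
      break;

    case "content_block_delta":
      if (event.delta.type === "input_json_delta") {
        const block = inProgress.get(event.index);
        if (block) block.partialJson += event.delta.partial_json;
      }
      break;

    case "content_block_stop": {
      const block = inProgress.get(event.index);
      if (block) {
        // The input JSON is now complete -- the tool can start executing
        // even though the model is still streaming later blocks.
        executeEarly({
          id: block.id,
          name: block.name,
          input: JSON.parse(block.partialJson || "{}"),
        });
        inProgress.delete(event.index);
      }
      break;
    }
  }
}
```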

### StreamingToolExecutor Design

What `StreamingToolExecutor` does is simple:

1. Every time the model produces a complete `tool_use` block, immediately call `addTool(block)`
2. The tool immediately enters the execution queue; if it is concurrency-safe (read-only), it starts running right away
3. The model continues outputting...
4. By the time the model fully stops, most tools may already have finished executing
5. Call `getRemainingResults()` to collect all results, yielding them in original order

**Key detail: Results are always yielded in the order the tools were called.** Even if tool B finishes before tool A, A's result is yielded first, then B's. This ensures consistency of the message history.
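A minimal sketch of that "start early, yield in order" guarantee, assuming an `addTool()` that kicks off execution immediately and a `getRemainingResults()` that awaits promises in call order — the real class layers batching, permissions, and abort handling on top of this:

```typescript
// Sketch of the "start early, yield in order" idea. The real
// StreamingToolExecutor also handles batching, permissions, and aborts;
// the `run` callback here is a stand-in for actual tool execution.

type ToolResult = { toolUseId: string; output: string };

class StreamingToolExecutorSketch {
  // One promise per tool, stored in call order.
  private pending: Promise<ToolResult>[] = [];

  addTool(
    block: { id: string; name: string; input: unknown },
    run: (name: string, input: unknown) => Promise<string>,
  ): void {
    // Kick off execution immediately -- do not wait for the model to stop.
    const promise = run(block.name, block.input).then((output) => ({
      toolUseId: block.id,
      output,
    }));
    this.pending.push(promise);
  }

  async *getRemainingResults(): AsyncGenerator<ToolResult> {
    // Await in the order the tools were called: even if tool B finishes
    // before tool A, A's result is yielded first.
    for (const promise of this.pending) {
      yield await promise;
    }
  }
}
```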

### How Much Gain?

> **[Chart placeholder 2.4-A]**: Comparative timing diagram — left side "Traditional Serial" vs right side "StreamingToolExecutor Parallel", intuitively showing three Read tools going from 150ms → 50ms

Consider a typical scenario: the model invokes three `Read` tools in one response, reading three files:

**Traditional path:**
```
Model outputs Read(a) → Model outputs Read(b) → Model outputs Read(c) → Model stops
→ Read a (50ms) → Read b (50ms) → Read c (50ms)
Total tool wait time: 150ms (serial)
```

**StreamingToolExecutor:**
```
Model outputs Read(a) → Immediately start reading a (50ms)
Model outputs Read(b) → Immediately start reading b (50ms, in parallel with a)
Model outputs Read(c) → Immediately start reading c (50ms, in parallel with a/b)
Model stops
→ All three files finish reading at roughly the same time (50ms)
Total tool wait time: 50ms (parallel)
```

When the tools are I/O-bound (file reads, network requests, Bash commands), this parallelization is dramatically effective.
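The same arithmetic can be demonstrated with a toy TypeScript snippet using three simulated 50ms reads (real gains depend on actual I/O latency):

```typescript
// Toy demonstration of the 150ms -> 50ms difference for three I/O-bound
// reads, simulated with timers. Real gains depend on actual I/O latency.

const fakeRead = (path: string) =>
  new Promise<string>((resolve) =>
    setTimeout(() => resolve(`contents of ${path}`), 50),
  );

async function serial() {
  const start = Date.now();
  await fakeRead("a");
  await fakeRead("b");
  await fakeRead("c");
  console.log(`serial:   ${Date.now() - start}ms`); // ~150ms
}

async function parallel() {
  const start = Date.now();
  await Promise.all([fakeRead("a"), fakeRead("b"), fakeRead("c")]);
  console.log(`parallel: ${Date.now() - start}ms`); // ~50ms
}

serial().then(parallel);
```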

> 📚 **Course Connection · Computer Architecture**: This is precisely the software embodiment of the **CPU instruction pipeline** taught in architecture courses. The classic 5-stage pipeline (IF→ID→EX→MEM→WB) overlaps multiple instructions at different stages, multiplying throughput. The `StreamingToolExecutor` design maps directly: model output is "instruction fetch" (IF), JSON parsing is "decode" (ID), and tool execution is "execute" (EX) — multiple tool calls advance simultaneously at different stages. The `siblingAbortController` (one error aborts siblings) is analogous to a **pipeline flush** — when branch prediction fails, subsequent instructions already in the pipeline are discarded.

---

## Broader Concurrency Design: Batch Scheduling in `runTools`

Even without `StreamingToolExecutor`, `runTools()` optimizes tool concurrency:

**Batched by concurrency safety:**
```
Suppose the AI calls: Read, Grep, Read, Edit, Bash
→ Batch 1 (concurrent): Read, Grep, Read   ← read-only, can run together
→ Batch 2 (serial): Edit                    ← write operation, runs alone
→ Batch 3 (serial): Bash                    ← write operation, runs alone
```

Each tool declares whether it supports concurrency via the `isConcurrencySafe(input)` method. Read-only tools return `true`; write operations return `false`. The system runs at most 10 tools concurrently (adjustable via the environment variable `CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY`).

> 📚 **Course Connection · Databases / Operating Systems**: This batching strategy of "reads concurrent, writes serial" aligns with the **readers-writer lock** principle from database courses — multiple readers can access a shared resource simultaneously, but a writer needs exclusive access. `isConcurrencySafe()` is essentially each tool declaring itself a "reader" or a "writer". It also resonates with the semaphore mechanism in operating systems — the max concurrency of 10 is simply the initial value of a counting semaphore.
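A minimal sketch of the partitioning rule follows; the `Tool` and `ToolCall` shapes are assumptions, and the real `partitionToolCalls()` in `toolOrchestration.ts` may group differently:

```typescript
// Sketch of the batching rule: consecutive concurrency-safe calls share one
// parallel batch; each unsafe call gets its own serial batch. The Tool and
// ToolCall shapes are assumptions; the real partitionToolCalls() may differ.

interface Tool {
  name: string;
  isConcurrencySafe(input: unknown): boolean; // the per-tool declaration described above
}

interface ToolCall {
  tool: Tool;
  input: unknown;
}

function partitionSketch(calls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let parallelBatch: ToolCall[] = [];

  for (const call of calls) {
    if (call.tool.isConcurrencySafe(call.input)) {
      parallelBatch.push(call); // read-only: keep accumulating
    } else {
      if (parallelBatch.length > 0) {
        batches.push(parallelBatch);
        parallelBatch = [];
      }
      batches.push([call]); // write operation: runs alone
    }
  }
  if (parallelBatch.length > 0) batches.push(parallelBatch);
  return batches;
}

// Read, Grep, Read, Edit, Bash -> [[Read, Grep, Read], [Edit], [Bash]];
// executing a parallel batch would additionally be capped at the configured
// max concurrency (e.g. 10).
```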

### An Unexpected Detail: Bash Errors Kill Sibling Processes

When multiple Bash tools are running in a concurrent batch (for example, the AI runs three Bash commands at once), if one of them errors out, `StreamingToolExecutor` immediately aborts the others via `siblingAbortController`.

The logic: if one step fails, the subsequent steps are probably wrong too — rather than letting them finish and waste resources, it's better to stop immediately and let the AI see the error before deciding what to do.
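A sketch of that behavior using a shared `AbortController`; the `runBash` runner and its signature are assumptions — only the abort-on-first-error pattern is the point:

```typescript
// Sketch of sibling abort: tools in one concurrent batch share an
// AbortController, and the first failure aborts the rest. `runBash` and its
// signature are assumptions -- any runner that honors an AbortSignal works.

async function runConcurrentBatch(
  commands: string[],
  runBash: (cmd: string, signal: AbortSignal) => Promise<string>,
) {
  const siblingAbortController = new AbortController();

  return Promise.allSettled(
    commands.map(async (cmd) => {
      try {
        return await runBash(cmd, siblingAbortController.signal);
      } catch (err) {
        // One failure aborts the sibling commands still in flight, so the AI
        // sees the error sooner instead of waiting for doomed work to finish.
        siblingAbortController.abort();
        throw err;
      }
    }),
  );
}
```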

---

## The Trade-offs Behind This Design

**The upside is obvious:** tool execution latency shrinks from "model output time + all tools serial time" to roughly just "model output time" (tool execution overlaps with model output), yielding significant speedups for I/O-heavy workflows.

**But there are costs:**

The system must continue consuming the model's output stream while tools are still executing. If the model's subsequent output depends on the results of these tools (which can happen when multiple interdependent tools are called in a single response), extra coordination is required. Fortunately, the model usually plans all independent parallel tool calls ahead of time in a single response; truly dependent tools are typically split across separate conversational turns.

Another cost: when a streaming fallback occurs (the primary model fails and switches to a fallback model), all tool results that have already started executing must be discarded (the `discard()` method), preventing orphaned `tool_result` blocks from interfering with the next request. This requires careful handling in the retry logic.

---

## What to Learn From This

**In any pipeline with "wait time," look for work that can be completed during that wait, and then parallelize it.**

Claude Code's tool execution parallelization exploits two "hidden wait windows":
1. While the model is still outputting but the tool parameters are already complete: launch tools in this window
2. When the model outputs multiple independent tool calls: execute all tools simultaneously

This is the same optimization idea as CPU pipeline execution, request preloading in web applications, and batch I/O in databases: **find the wait, and hide work inside it.**

---

## Code Locations

- `src/tools/StreamingToolExecutor.ts`: Full implementation; focus on `addTool()`, `getCompletedResults()`, `siblingAbortController`
- `src/utils/processToolCallGroup.ts`: Tool call grouping and batch execution logic
- `src/services/tools/toolOrchestration.ts`, `partitionToolCalls()` function (starting at line 90): Batch partitioning logic
- `src/services/tools/toolOrchestration.ts`, `getMaxToolUseConcurrency()` (line 7): Max concurrency configuration
- `src/query.ts`, lines 561-568: `StreamingToolExecutor` initialization condition (feature gate `streamingToolExecution`)
- `src/query.ts`, lines 1380-1408: Rendezvous point for collecting tool results

---

## Directions for Further Inquiry

- How is `isConcurrencySafe(input)` implemented for each tool? Why is `FileRead` safe, while `Edit` is not?
- How are permission checks (`canUseTool`) coordinated during concurrent tool execution? What happens if multiple tools simultaneously trigger permission confirmation dialogs?
- If the user submits a new message while tools are still executing, how does `interruptBehavior` affect the fate of those tools?

