# Why Does Context Compression Need Six Mechanisms Instead of One?

A deep dive into Claude Code's six-tier context-compression pipeline—from lightweight tool-result trimming to emergency full-context summarization—and why a single strategy cannot handle the diversity of real-world scenarios.

### 🌍 Industry Context: Context-Management Strategies in AI Coding Tools

Context-window management is a central challenge for all LLM applications, yet different tools approach it very differently:

- **Cursor**: Uses a "retrieval-augmented" strategy. Instead of trying to cram all history into the context, it embeds the entire codebase and retrieves only the most relevant snippets for each turn. Context compression is relatively simple, relying mainly on RAG (Retrieval-Augmented Generation) to avoid window overflow.
- **Aider**: Its Repo Map has evolved to the **AST (Abstract Syntax Tree)** level, distilling codebases of hundreds of thousands of lines into a dense graph of class definitions, function signatures, and call dependencies, dramatically reducing token consumption. When conversation history grows too long, it performs a single AI summary replacement—essentially just the "autocompact" tier from Claude Code's six mechanisms. However, the AST-level refinement significantly reduces the pressure on conversation summarization.
- **Kimi Code**: Under its Agent Swarm architecture, each sub-agent is confined to a very small context scope, fundamentally sidestepping the problem of managing a single ultra-long context. The coordinator handles macro-level context, while sub-agents each maintain their own micro-level context—a "divide and conquer" context strategy that stands in sharp contrast to Claude Code's "single-context, multi-tier compression" approach.
- **LangChain / LlamaIndex**: As framework-level solutions, they offer `ConversationSummaryBufferMemory` and `ConversationTokenBufferMemory`. These are generic solutions with no specialized handling for tool-call results.
- **Codex (OpenAI)**: The underlying layer has been rewritten in Rust (95.6%), giving it an inherent advantage in memory management. However, its context-management strategy focuses more on parallel agents, each maintaining an independent context, rather than on compressing a single long context.
- **Windsurf**: The Cascade Engine tracks the developer's cursor position, file-switching history, and terminal-output trajectory in real time, maintaining a highly dynamic context tree—combining RAG retrieval with continuous state awareness.
- **Cody (Sourcegraph)**: Deep Search combined with an MCP engine can trace architectural design documents across different Git repositories, building cross-microservice API call-dependency graphs for the LLM. This represents an enterprise-grade "code graph" approach to context management.

Claude Code's six-tier gradient compression is among the more refined designs in the industry. Most tools adopt one or two mechanisms (truncation + summarization), whereas Claude Code distinguishes six different "context too large" sub-scenarios and handles each separately. This layered method increases system complexity, but achieves a more nuanced balance between token cost and information retention.

---

## The Question

The Claude Code source code contains six distinct context-compression mechanisms: `toolResultBudget`, `snipCompact`, `microcompact`, `contextCollapse`, `autocompact`, and `reactiveCompact`. They execute in sequence before every AI call. Why so many? Isn't one enough?

> 💡 **Plain English**: It's like a **meeting minute-taker facing an ever-thickening notebook**—Step 1: fold up overly long attachments → Step 2: tear out blank pages that say "nothing happened today" → Step 3: remove duplicate entries → Step 4: write a one-page summary of last month's notes → Step 5: the whole notebook is too thick, so replace it with a distilled version → Step 6: the distilled version is still too thick? Perform an emergency compression. The six steps go from light to heavy; in most cases, the first three are sufficient.

---

## You Might Think…

Intuitively, you might assume that "when it's almost full, compress it—use an AI-generated summary to replace the old content" would be enough. Indeed, this is the most straightforward approach—mainstream solutions like Aider and LangChain's `ConversationSummaryMemory` are essentially this idea.

But reality is more complicated.

---

## How It Actually Works

Context compression is a **multi-dimensional trade-off**, and these six mechanisms each target a different dimension:

### Tier 1: toolResultBudget (Tool-Result Size Limit)

**Problem addressed:** A single tool call returns a giant result (e.g., reading a 100 KB file), but the AI quickly no longer needs that result.

**Approach:** Truncate the tool-result content according to a per-tool `maxResultSizeChars` limit, replacing the excess with a placeholder. This is done in memory with no AI call, so the cost is essentially zero.
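
A minimal sketch of the idea, assuming hypothetical budget values (the `maxResultSizeChars` limit is from the source description; the tool names and numbers here are illustrative only):

```typescript
interface ToolBudget {
  maxResultSizeChars: number;
}

// Assumed per-tool limits, for illustration; the real values live in Claude Code's source.
const DEFAULT_BUDGETS: Record<string, ToolBudget> = {
  Read: { maxResultSizeChars: 40_000 },
  Bash: { maxResultSizeChars: 20_000 },
};

function applyToolResultBudget(toolName: string, result: string): string {
  const budget = DEFAULT_BUDGETS[toolName];
  if (!budget || result.length <= budget.maxResultSizeChars) return result;
  const dropped = result.length - budget.maxResultSizeChars;
  // Replace the excess with a placeholder, as the article describes.
  return (
    result.slice(0, budget.maxResultSizeChars) +
    `\n[... ${dropped} chars truncated by toolResultBudget ...]`
  );
}
```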

**Why not replace it with full-context compression:** Because the issue here is not "the overall context is too large" but "a single result is too large." Summarizing the entire history to fix one oversized result is a sledgehammer cracking a nut: it is expensive and degrades everything else in the context, whereas truncation surgically removes exactly the excess.

---

### Tier 2: snipCompact (History Snipping) [feature: HISTORY_SNIP]

**Problem addressed:** The conversation history contains "intermediate process" segments—such as searching through dozens of files before finding the target. These searches consume a lot of tokens, but their information is already implicit in subsequent actions, so they can be safely deleted.

**Approach:** Use heuristics to identify snippable history segments and delete them directly, without generating a summary. This is faster than full-context summarization because no AI is involved.
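
To make "heuristic snipping" concrete, here is one possible rule, a sketch only and not Claude Code's actual heuristics: search results that a later successful file read superseded can be dropped wholesale:

```typescript
interface HistoryEntry {
  toolName?: string; // e.g. "Grep", "Glob", "Read"; undefined for plain messages
  content: string;
}

// Hypothetical heuristic: once a Read landed on the target file, the
// Grep/Glob results that led up to it are implicit in that Read.
function snipCompact(history: HistoryEntry[]): HistoryEntry[] {
  const lastRead = history.map(e => e.toolName).lastIndexOf("Read");
  return history.filter((entry, i) => {
    const isSearch = entry.toolName === "Grep" || entry.toolName === "Glob";
    return !(isSearch && i < lastRead); // delete superseded searches outright, no summary
  });
}
```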

**Why not replace it with full-context compression:** Because full-context compression is a "tear-down-and-rebuild" approach—it turns all history into an AI-generated summary, losing detail. Snipping is "precision removal": it only deletes parts known to be useless, preserving all remaining raw information.

---

### Tier 3: microcompact (Micro Compression)

**Problem addressed:** The same file is read many times (by the AI for inspection, by tools, or for edit confirmation), and each read is kept as an independent `tool_result` in the history, causing massive redundancy.

**Approach:** Identify duplicate `tool_result` content and replace older versions with a placeholder indicating "the file has not changed" (a sketch of this basic dedup idea follows the list below). There are actually three variants:

1. **Time-based microcompact**: If more than 60 minutes have passed since the last request (the API server-side cache TTL has expired), proactively clean up old tool results, since the cache will need rebuilding anyway.
2. **Cached microcompact**: Uses the API's `cache_edits` mechanism to directly edit tool results in the server-side cache, leaving local messages untouched—this only deletes at the API level without breaking the cache prefix.
3. **API-level microcompact** (`apiMicrocompact.ts`): Delegates cleanup entirely to the server via the `context_management` API parameter—the client tells the API, "When input tokens exceed 180K, automatically trim tool results to below 140K," without the client tracking any state. This marks an evolution in context management from "client-side negotiation" to "server-native support."
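
The core local dedup can be sketched as follows (names and message shapes are hypothetical; only the newest read of each file is kept intact):

```typescript
interface ToolResult {
  filePath: string;
  content: string;
}

function microcompact(results: ToolResult[]): ToolResult[] {
  const newest = new Map<string, number>(); // filePath -> index of its newest read
  results.forEach((r, i) => newest.set(r.filePath, i));
  return results.map((r, i) =>
    newest.get(r.filePath) === i
      ? r // keep the most recent read of each file
      : { ...r, content: `[unchanged; see later read of ${r.filePath}]` }
  );
}
```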

**Why not replace it with full-context compression:** The cost difference is an order of magnitude—microcompact is a pure in-memory operation (at most plus one cache-edit API call), whereas full-context compression requires a complete AI summarization call (much higher in both cost and latency).

---

### Tier 4: contextCollapse (Context Collapse) [feature: CONTEXT_COLLAPSE]

**Problem addressed:** The early stages of a conversation contain a lot of "exploratory work that is already completed" (initial code reading, requirements analysis, etc.). This content is not very helpful for subsequent work, but full-context summarization is too aggressive and would lose too much information.

**Approach:** "Fold" older conversation segments—replace them with a summary in the API context, while the REPL's UI history view still preserves the full version (the user can still scroll to see it). This is a design that separates the "view layer" from the "model layer."

**Special characteristic:** This is a *preventive* fold that triggers *before* a 413 error, acting earlier than `reactiveCompact`. It is also cheaper because it folds at a finer granularity and can reuse more cache.
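
The view-layer / model-layer split can be pictured with two projections over the same data (a sketch with assumed shapes, not the actual REPL implementation):

```typescript
interface Segment {
  messages: string[];      // raw transcript, always kept for the UI
  foldedSummary?: string;  // set once contextCollapse folds this segment
}

// Model layer: the API payload swaps folded segments for their summaries.
function toApiContext(segments: Segment[]): string[] {
  return segments.flatMap(s => (s.foldedSummary ? [s.foldedSummary] : s.messages));
}

// View layer: the REPL still renders the full history for scrolling.
function toUiHistory(segments: Segment[]): string[] {
  return segments.flatMap(s => s.messages);
}
```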

---

### Tier 5: autocompact (Automatic Full-Context Compression)

**Problem addressed:** The overall context is approaching the context-window limit, and the previous mechanisms are insufficient. A thorough compression is needed.

**Approach:** Use a fork agent (a child AI instance) to read the full conversation history and generate a high-quality summary, then replace all history with that summary. The trigger threshold is `context_window - 13,000 tokens`, reserving headroom for the upcoming AI response.

The summary is not freeform: the compression prompt requires a **structured summary with 9 sections**: task intent, key technical concepts, files and code snippets, errors and fixes, problem-solving process, **all user messages verbatim** (critical for intent tracking), pending tasks, current work, and optional next steps.

The output uses a two-stage structure: the model first generates an `<analysis>` draft block to organize its thoughts, then the `<summary>` formal summary. The `<analysis>` is stripped before entering the context, serving only as "scratch paper" that improves summary quality without consuming the post-compression token budget.

💡 **Plain English**: It's like organizing notes before an exam—first sketch out the key points on scratch paper (`<analysis>`), then copy the essentials into the official notebook (`<summary>`), then throw the scratch paper away.
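
Back to the trigger condition: it reduces to a simple threshold check. The constant and function names below come from the Code Landmarks at the end of this piece, while the exact signatures are assumed:

```typescript
const AUTOCOMPACT_BUFFER_TOKENS = 13_000; // headroom for the upcoming AI response

function getAutoCompactThreshold(contextWindow: number): number {
  return contextWindow - AUTOCOMPACT_BUFFER_TOKENS;
}

function shouldAutoCompact(inputTokens: number, contextWindow: number): boolean {
  return inputTokens >= getAutoCompactThreshold(contextWindow);
}
```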

#### Compaction Prompt Excerpt: NO_TOOLS_PREAMBLE

**Source**: `services/compact/prompt.ts` → `NO_TOOLS_PREAMBLE` (lines 19–26)

This text appears at the very beginning of every compaction prompt, before any substantive instruction—specifically designed for Sonnet 4.6+'s adaptive-thinking models, which are more likely to attempt tool calls even when asked to "output text only":

```
CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.

- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- You already have all the context you need in the conversation above.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an <analysis> block followed by a <summary> block.
```

**Design note**: Notice the extremely forceful tone ("CRITICAL," "REJECTED," "you will fail the task"). A source-code comment explains why: the fork agent used for compaction must inherit the parent's full tool set to hit the prompt cache, but `maxTurns: 1` means that if the model attempts a tool call, there is no second chance after the call is rejected—that compression attempt is wasted. Stating the consequence so explicitly is meant to quash any "let me try" impulse from the very start.

#### Compaction Prompt Excerpt: DETAILED_ANALYSIS_INSTRUCTION

**Source**: `services/compact/prompt.ts` → `DETAILED_ANALYSIS_INSTRUCTION_BASE` (lines 31–44)

This is the `<analysis>` draft-stage instruction embedded in the main summarization prompt:

```
Before providing your final summary, wrap your analysis in <analysis> tags to organize your thoughts 
and ensure you've covered all necessary points. In your analysis process:

1. Chronologically analyze each message and section of the conversation. For each section thoroughly identify:
   - The user's explicit requests and intents
   - Your approach to addressing the user's requests
   - Key decisions, technical concepts and code patterns
   - Specific details like:
     - file names
     - full code snippets
     - function signatures
     - file edits
   - Errors that you ran into and how you fixed them
   - Pay special attention to specific user feedback that you received, especially if the user told you 
     to do something differently.
2. Double-check for technical accuracy and completeness, addressing each required element thoroughly.
```

**Design note**: Notice the phrase "Pay special attention to specific user feedback that you received, especially if the user told you to do something differently." The easiest thing to lose during compression is not technical details, but **user corrections**—things like "no, don't do it that way" or "the approach I mentioned earlier doesn't work." If these corrections are lost in compression, the AI will repeat the same mistakes in subsequent conversation. The prompt specifically emphasizes this point, indicating it was learned from real-world pitfalls.

#### Compaction Prompt Excerpt: BASE_COMPACT_PROMPT (Full Version)

**Source**: `services/compact/prompt.ts` → `BASE_COMPACT_PROMPT` (lines 61–143)

This is the core prompt for full-context compression (`autocompact` / `reactiveCompact`). The `getCompactPrompt()` function sandwiches it between `NO_TOOLS_PREAMBLE` and `NO_TOOLS_TRAILER` before sending it to the fork agent:

```
Your task is to create a detailed summary of the conversation so far, paying close attention to the 
user's explicit requests and your previous actions.
This summary should be thorough in capturing technical details, code patterns, and architectural 
decisions that would be essential for continuing development work without losing context.

[...DETAILED_ANALYSIS_INSTRUCTION_BASE...]

Your summary should include the following sections:

1. Primary Request and Intent: Capture all of the user's explicit requests and intents in detail
2. Key Technical Concepts: List all important technical concepts, technologies, and frameworks discussed.
3. Files and Code Sections: Enumerate specific files and code sections examined, modified, or created. 
   Pay special attention to the most recent messages and include full code snippets where applicable 
   and include a summary of why this file read or edit is important.
4. Errors and fixes: List all errors that you ran into, and how you fixed them. Pay special attention 
   to specific user feedback that you received, especially if the user told you to do something 
   differently.
5. Problem Solving: Document problems solved and any ongoing troubleshooting efforts.
6. All user messages: List ALL user messages that are not tool results. These are critical for 
   understanding the users' feedback and changing intent.
7. Pending Tasks: Outline any pending tasks that you have explicitly been asked to work on.
8. Current Work: Describe in detail precisely what was being worked on immediately before this summary 
   request, paying special attention to the most recent messages from both user and assistant. Include 
   file names and code snippets where applicable.
9. Optional Next Step: List the next step that you will take that is related to the most recent work 
   you were doing. IMPORTANT: ensure that this step is DIRECTLY in line with the user's most recent 
   explicit requests, and the task you were working on immediately before this summary request. If your 
   last task was concluded, then only list next steps if they are explicitly in line with the users 
   request. Do not start on tangential requests or really old requests that were already completed 
   without confirming with the user first.
   If there is a next step, include direct quotes from the most recent conversation showing exactly 
   what task you were working on and where you left off. This should be verbatim to ensure there's no 
   drift in task interpretation.

[...example structure + custom instruction injection point...]
```

💡 **Plain English**: This prompt is like a **meeting-minutes template**—it doesn't just say "summarize this for me," but lists 9 specific columns to fill out, and requires a draft (`<analysis>`) to review whether anything was missed before final submission. Section 6's requirement to preserve "all user messages verbatim" is particularly interesting: even under compression, every word the user said must be kept in original form—because the user's wording carries intent, and an AI-paraphrased version may lose "tone" and "emphasis."

**Design logic of the 9 sections**:

| Section | Purpose | Information most easily lost |
|---------|---------|------------------------------|
| 1. Primary Request and Intent | Anchor the task goal | Late-stage requirement changes |
| 2. Key Technical Concepts | Maintain technical vocabulary | Domain-specific terminology |
| 3. Files and Code Sections | Code localization | File paths and code snippets |
| 4. Errors and fixes | Avoid repeating mistakes | User corrective behavior |
| 5. Problem Solving | Record solution reasoning | Explored dead ends |
| 6. All user messages | **Preserve original intent** | User tone and emphasis |
| 7. Pending Tasks | Don't miss TODOs | Midway inserted new tasks |
| 8. Current Work | Resume from interruption | Ongoing specific operations |
| 9. Optional Next Step | Continue work after compression | Original quotes prevent drift |

Section 9's requirement to "include direct quotes from the most recent conversation"—using verbatim quotes rather than AI summarization—is specifically designed to prevent task-interpretation drift ("task drift") when work resumes.

**Cost:** A full AI API call, expensive and slow. Source comments note that the p99.99 summary output is about 17,387 tokens, so 20,000 tokens of headroom are reserved for the summary. After compression, the system re-injects critical context: the top 5 most recently read files (5K tokens each, 50K budget), loaded Skill contents (5K each, 25K budget), Plan attachments, tool definitions, MCP configs, etc.—so the post-compression context is not a blank slate, but a "refined workbench."
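
The file re-injection step can be sketched as a budget walk (the per-item and total budgets are the figures cited above; the function and field names are assumed):

```typescript
const FILE_REINJECT = { perItem: 5_000, total: 50_000, maxItems: 5 };

function reinjectRecentFiles(files: { path: string; tokens: number }[]): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const f of files.slice(0, FILE_REINJECT.maxItems)) {
    const cost = Math.min(f.tokens, FILE_REINJECT.perItem); // per-file cap
    if (used + cost > FILE_REINJECT.total) break;           // total budget exhausted
    used += cost;
    kept.push(f.path);
  }
  return kept;
}
```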

**Circuit breaker:** If `autocompact` fails 3 times in a row, stop retrying—production data showed some sessions were wasting about 250K API calls per day because of this, so the circuit breaker was added.
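
A minimal sketch of that circuit breaker, with assumed names (the "3 consecutive failures" rule is from the source; the wrapper is illustrative):

```typescript
const MAX_CONSECUTIVE_FAILURES = 3;
let consecutiveFailures = 0;

async function tryAutoCompact(run: () => Promise<void>): Promise<boolean> {
  if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) return false; // breaker open: stop retrying
  try {
    await run();
    consecutiveFailures = 0; // any success resets the breaker
    return true;
  } catch {
    consecutiveFailures += 1;
    return false;
  }
}
```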

---

### Tier 6: reactiveCompact (Reactive Compression)

**Problem addressed:** Despite the first five tiers, sometimes an API 413 (`prompt_too_long`) error still occurs. This usually means the `autocompact` threshold did not have enough margin, or tool results suddenly ballooned within a single loop.

**Approach:** Upon receiving a 413, immediately trigger an emergency full-context summary compression, then retry. The difference from `autocompact` is: `autocompact` is "proactive compression when almost full," while `reactiveCompact` is "emergency compression after already full."

**Key detail:** A `hasAttemptedReactiveCompact` flag ensures `reactiveCompact` is attempted only once—if a 413 triggers again after compression, the error is considered unrecoverable and is surfaced directly. This prevents an infinite loop of "compress → fail → compress again → fail again."
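
The one-shot retry shape looks roughly like this (the `hasAttemptedReactiveCompact` flag is from the source; the surrounding wrapper is a sketch):

```typescript
async function callWithReactiveCompact(
  callApi: () => Promise<{ status: number }>,
  compact: () => Promise<void>,
): Promise<{ status: number }> {
  let hasAttemptedReactiveCompact = false;
  while (true) {
    const res = await callApi();
    if (res.status !== 413) return res;
    if (hasAttemptedReactiveCompact) {
      // A second 413 after compression is unrecoverable: surface it.
      throw new Error("prompt_too_long persists after reactive compaction");
    }
    hasAttemptedReactiveCompact = true;
    await compact(); // emergency full-context summarization, then retry once
  }
}
```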

---

## The Trade-offs Behind This Design

These six mechanisms form a **cost gradient**:

```
Cost from low to high:
toolResultBudget < snipCompact ≈ microcompact < contextCollapse < autocompact ≈ reactiveCompact
```

The system always tries the cheapest method first, and only escalates when cheaper methods are insufficient. This is a greedy strategy: while guaranteeing it won't crash, it preserves as much raw context as possible (lower-cost compression loses less information) and minimizes API call costs.
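
The escalation itself can be pictured as a cheapest-first loop (purely illustrative; the tiers would be passed in the order above, with `reactiveCompact` absent because it only fires on an actual 413):

```typescript
type Context = { messages: string[] };
type Mechanism = (ctx: Context) => Context;

// Rough chars-to-tokens heuristic, for illustration only.
const estimateTokens = (ctx: Context): number =>
  Math.ceil(ctx.messages.join("").length / 4);

function compressUntilFits(ctx: Context, limit: number, tiers: Mechanism[]): Context {
  for (const tier of tiers) {
    if (estimateTokens(ctx) <= limit) break; // stop at the cheapest sufficient tier
    ctx = tier(ctx);
  }
  return ctx;
}
```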

> 📚 **Course Connection · Operating Systems / Computer Architecture**: This cost gradient is highly isomorphic to the **memory hierarchy** taught in OS courses: CPU cache levels go L1 → L2 → L3 → main memory → disk, with decreasing access speed and increasing capacity, and the system prefers the fastest tier. Claude Code's compression gradient goes toolResultBudget → snip → microcompact → contextCollapse → autocompact → reactiveCompact, with increasing cost and increasing compression strength, and the system prefers the cheapest mechanism. The design philosophy is identical: **use a layered strategy to achieve optimal balance between speed (cost) and capacity (compression strength).** This also resonates with the "buffer replacement policy" from database courses—data of different temperatures is handled with different eviction strategies.

**Cost:** System complexity increases significantly. The six mechanisms have subtle interactions (for example, `snip` runs before `autocompact`, and the tokens it saves affect whether `autocompact` triggers; both `reactiveCompact` and `contextCollapse` can handle 413s, so precedence must be explicitly defined). The source code contains many comments specifically addressing these interactions, showing that these edge cases have indeed been hit in practice.

---

## What We Can Learn From This

**When a system faces a "continuous constraint-violation scenario" (here, "context is too long"), designing a single response mechanism is almost never optimal.**

The better approach is to decompose the problem into different sub-scenarios, analyze the constraints and costs of each, and then design a specialized solution for each.

The six mechanisms here correspond to six distinct "context too long" scenarios:
- Single result too large → truncation
- Redundant duplication → deduplication
- Useless history → precise snipping
- Overall somewhat large → progressive folding
- Approaching the limit → proactive compression
- Already exceeded the limit → emergency compression

In software design, the appeal of a "one-size-fits-all" solution comes from simplicity, but in practice it often just shifts complexity inside that single solution (making it either too aggressive or too conservative). The right design is to identify the real sub-problems and choose the most appropriate tool for each.

---

## Code Landmarks

- `src/query.ts`, `queryLoop()`: invocation order of the six mechanisms (around lines 380–544)
- `src/services/compact/autoCompact.ts`: `getAutoCompactThreshold()`, `AUTOCOMPACT_BUFFER_TOKENS` (13,000 tokens)
- `src/services/compact/compact.ts`: full-context compression implementation (`runForkedAgent` call)
- `src/query.ts`, `isWithheld413` handling: lines 1062–1183, division of labor between `reactiveCompact` and `contextCollapse`
- `src/query.ts`, max_output_tokens recovery message: lines 1224–1229

---

## Directions for Further Inquiry

- What prompt does `autocompact` use when generating a summary? How does it decide what to keep and what to compress? (→ see `services/compact/prompt.ts`)
- How does microcompact's "cache editing" work? Does the Anthropic API really support editing the prompt cache?
- How is the "view layer / model layer separation" UI of `contextCollapse` implemented? How does the user see folded history?

