# Apply These Ideas to Your Project

This book analyzes the engineering decisions inside Claude Code. This chapter distills the most valuable ideas into practical, actionable advice.

> ⚠️ **Evaluate before adopting**: this chapter does not suggest you implement all six ideas—it is a **menu**, not a **set meal**. Every idea has preconditions and signs of over-engineering. Evaluate your project's stage (prototyping, growth, or scaling) before selectively adopting. "Copying the big tech playbook" is one of the most common mistakes engineers make.

> 🌍 **Industry Context**: Claude Code's engineering thinking does not exist in a vacuum—the field of AI application engineering is rapidly converging on a shared set of best practices. For each idea below, we compare how mainstream frameworks implement it, so you can make informed choices rather than blindly following one approach.
>
> - **LangChain / LlamaIndex** standardized "tool calling + Agent loop" as a generic pattern, lowering the barrier to entry. However, during 2024–2025 they were abandoned by many projects due to over-abstraction and "abstraction leakage" (when a framework claims to shield you from low-level details, but those details keep "leaking" out in practice, forcing you to deal with them anyway—like buying a fully automatic dishwasher that still requires manual pre-rinsing). Many teams migrated to lighter alternatives such as LiteLLM / Instructor.
> - **Vercel AI SDK** wrapped streaming output and token management into a frontend-friendly API, embodying the "tokens as first-class citizens" philosophy on the web.
> - **CrewAI / AutoGen** explored different orchestration patterns for multi-Agent collaboration—CrewAI uses a role-based object-oriented approach, while AutoGen uses conversation-driven multi-turn negotiation.
> - **Cursor** also adopted a tool-use + permission loop in its Agent mode, with extensibility expressed through the layered `.cursorrules` file system.
>
> Claude Code's decision to **build its own Agent loop without using any framework** is itself worth contemplating. Its unique value lies in being one of the few systems validated at **true production scale** (millions of DAU, billions of tokens per week), proving these ideas are more than theoretical. But this also means: if your project is far smaller, copying them verbatim may backfire.

---

## Idea 1: Hide Work Inside Waiting Time

> 💡 **Plain English**: think of it like **parallelizing your morning routine**—flip the kettle on (start I/O), brush your teeth and get dressed while it boils (module loading), and by the time you're dressed the water is ready. The key is finding windows where you are "going to wait anyway," and stuffing useful work into them.

**Core formula**: find places where "we must wait for X" → do everything that does not depend on X while waiting.

**In your AI application**:

When you call an LLM API and wait for the response (possibly 3–30 seconds), ask:
- Can you infer the response direction from the first few streamed tokens and preload data early?
- Are there UI updates that do not depend on the AI response and can be done first?
- Are there background tasks (logging, analytics, cache warming) that can run concurrently?

**Minimum viable implementation (MVP level)**:

```javascript
// Bad—serial waiting
const response = await llm.complete(prompt)
await fetchRelatedData(response)

// Better—parallel requests, but this is only "parallelism," not "speculative execution"
const [response, sideData] = await Promise.all([
  llm.complete(prompt),
  prefetchLikelyNeededData()
])
```

**The real difficulty: what if you guess wrong?** The `Promise.all` above is just simple parallel request execution—anyone who has written async JavaScript can do it. True speculative execution is called "speculative" because it **must handle rollback and isolation when the prediction is wrong**. `Promise.all` rejects as a whole the moment any of its promises rejects, so a failed speculative request takes the main request down with it—unacceptable in production.

**Production-grade implementation**:

```javascript
// True speculative execution requires: independent error isolation + cancellability + result validation
const controller = new AbortController()

// Start the speculative task first. Wrap it so a rejection is captured as a value
// (allSettled-style isolation): a speculative failure can never crash the main flow.
const specPromise = prefetchLikelyNeededData({ signal: controller.signal })
  .then(value => ({ status: 'fulfilled', value }),
        reason => ({ status: 'rejected', reason }))

// The main request must succeed: if it fails, cancel the speculation and rethrow
let mainResult
try {
  mainResult = await llm.complete(prompt)
} catch (err) {
  controller.abort()  // don't leave the speculative task running after the main flow dies
  throw err
}

// Speculative result: only use if successful and still relevant, otherwise silently discard
const specResult = await specPromise
if (specResult.status === 'fulfilled' && isStillRelevant(specResult.value, mainResult)) {
  usePreloadedData(specResult.value)
} else {
  // Guess was wrong, or the speculative task failed—no problem, fall back to the normal path
  const data = await fetchRelatedData(mainResult)
  usePreloadedData(data)
}
```

Claude Code's source code follows exactly this approach for speculative execution: using `AbortController` for cancellability, paired with independent error boundaries to ensure speculative failures do not affect the main flow. The complexity is an order of magnitude higher than a naive `Promise.all`.

> 📚 **Course Connection**: CPU branch prediction is a good analogy, but do not forget the other half of the analogy—speculative execution in CPUs works because of the **reorder buffer** (ROB) holding pre-execution results and **precise exception handling** enabling clean rollback. In your code, `AbortController` is your ROB, and the settled-result wrapper (`Promise.allSettled`-style error isolation) is your precise exception handling. "Speculative execution" without a rollback mechanism is just parallel requesting in name only.

> 🚫 **Over-engineering signal**: if your AI application is a low-frequency internal tool (e.g., fewer than 1,000 requests per day), the latency savings of speculative execution (a few hundred milliseconds) will be virtually imperceptible, while code complexity and debugging difficulty rise significantly. In that scenario, simple serial calls are the better choice. **Rule of thumb: consider speculative execution only when API call latency accounts for > 30% of total interaction time and your user volume is large enough that you care about P95 latency.**

---

## Idea 2: Tokens Are First-Class Citizens

**Core question**: which operations in your system consume tokens? How many per call? Is any of that consumption unnecessary?

**Practical checklist**:

- [ ] Is there content in the system prompt that could be omitted? (For read-only operations, writing-style instructions can be removed.)
- [ ] Do you check prompt cache compatibility before every API call?
- [ ] Do you use different model sizes for different subtasks? (Haiku for summarization, Sonnet/Opus for the main task.)
- [ ] Do tool descriptions have length limits? (Prevent one tool description from filling the entire context.)
- [ ] Is there a threshold controlling how often "background AI tasks" trigger? (Prevent memory extraction from running on every response.)
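
Two of these checklist items translate directly into code, as sketched below. This is illustrative only, not Claude Code's implementation: route auxiliary subtasks to a smaller model (item 3), and gate a background task like memory extraction behind a trigger interval (item 5). All names and model IDs are placeholders.

```typescript
// Illustrative sketch: model routing per subtask + a threshold gate on background work.
type Message = { role: 'user' | 'assistant'; content: string }
declare function callModel(req: { model: string; messages: Message[] }): Promise<unknown>

const MODEL_FOR_TASK = {
  summarize: 'small-model-id',  // Haiku-class: cheap and fast, fine for auxiliary work
  main: 'large-model-id',       // Sonnet/Opus-class: reserved for the main task
} as const

const MEMORY_EXTRACTION_INTERVAL = 10  // run at most once every 10 responses
let turnsSinceExtraction = 0

async function maybeExtractMemory(history: Message[]): Promise<void> {
  if (++turnsSinceExtraction < MEMORY_EXTRACTION_INTERVAL) return  // below threshold: skip
  turnsSinceExtraction = 0
  await callModel({ model: MODEL_FOR_TASK.summarize, messages: history })
}
```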

**Cache parameter caveats**:

If your system has multiple AI instances sharing the same prompt cache (e.g., the main request and a background monitor), ensure all instances use **exactly identical** API parameters (model, system, tools, messages, thinking config). Any parameter difference will break cache sharing and can multiply costs.

> ⚠️ **Frequent pitfall: parameter order matters too**. Anthropic's prompt caching checks not only parameter values but also **parameter order**—the same tools list in a different order will cause a cache miss. This is especially dangerous for systems using **dynamic tool sets**: if you dynamically select a subset of tools based on context, and the tools array passed to the API differs in order each time, cache hit rate will plummet. **Solution: always sort parameter arrays by a fixed rule (e.g., alphabetical by tool name) before passing them to the API.**
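
A minimal sketch of that fix, assuming a simple `Tool` shape: canonicalize the tools array at the single choke point where requests are built, so dynamically selected subsets always serialize in the same order.

```typescript
// Canonicalize the tools array so every caller sends identically ordered parameters.
// The Tool shape here is an assumption for illustration.
type Tool = { name: string; description: string; input_schema: object }

function normalizeTools(tools: Tool[]): Tool[] {
  // Fixed rule: alphabetical by tool name. Copy first; never mutate the caller's array.
  return [...tools].sort((a, b) => a.name.localeCompare(b.name))
}

// Usage: await client.messages.create({ ..., tools: normalizeTools(selectedTools) })
```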

> 🚫 **Over-engineering signal**: if your monthly API bill is under $100, the ROI of spending time optimizing token consumption is low. Focus on product features instead. **Rule of thumb: systematically optimize tokens only when token costs exceed 15% of total operating costs.**

---

## Idea 3: Design AI as Composable Modules

**Core interface**:

```typescript
type AIQuery = {
  messages: Message[]
  systemPrompt: string
  querySource: string     // identifies the caller, for logging and permissions
  canUseTool: (tool: Tool, input: unknown) => Permission  // each caller defines its permission boundary
  setAppState?: (updater: (state: AppState) => AppState) => void  // whether shared state may be modified (usually a no-op for child AIs)
  tools?: Tool[]          // tool set can be restricted
}
```

Notice the `canUseTool: (tool, input) => Permission` signature—permission checks consider not only "**which** tool" but also "**what parameters** were passed." This means the same `BashTool` can be auto-approved when `input = "ls"` (listing a directory—harmless) yet require human confirmation when `input = "rm -rf /"` (force-deleting every file on the machine—extremely dangerous). This **parameter-level** permission control is the true essence of Claude Code's permission system.
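
A minimal sketch of such a parameter-level check. The `Permission` shape and the matching rules below are illustrative assumptions; Claude Code itself parses Bash commands with an AST parser (see the Final Thoughts below) rather than regex matching.

```typescript
type Permission = { behavior: 'allow' | 'deny' | 'ask'; reason?: string }

// Illustrative only: the decision depends on the tool's input, not just its name.
const canUseTool = (tool: Tool, input: unknown): Permission => {
  if (tool.name === 'BashTool') {
    const command = String((input as { command?: unknown })?.command ?? '')
    if (/^\s*(ls|pwd|cat)\b/.test(command)) return { behavior: 'allow' }  // harmless reads
    if (/\brm\s+-\w*f/.test(command)) return { behavior: 'deny', reason: 'destructive delete' }
    return { behavior: 'ask' }  // anything unrecognized: human confirms
  }
  return { behavior: 'ask' }
}
```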

Do not create separate execution code for each AI role. One unified execution engine + different parameter configurations = dramatically improved maintainability.

**Comparison with frameworks—two philosophies of composition**:

Two fundamentally different routes exist here:

| | Claude Code route | CrewAI / AutoGen route |
|---|---|---|
| **Approach** | unified execution engine + different configurations | each Agent is an independent class / role |
| **Programming paradigm** | functional: one function, different configs | object-oriented: different classes, different behaviors |
| **Pros** | high maintainability, consistent behavior | high flexibility, Agents can have completely different execution flows |
| **Cons** | all Agent capabilities are constrained to the same interface | maintenance cost escalates sharply as Agent count grows |

**When should you break uniformity?** When a sub-Agent needs a completely different execution flow (e.g., the main Agent only needs single-turn tool-use, while the sub-Agent needs a multi-turn autonomous loop), the unified engine becomes a constraint. At that point, consider giving that Agent its own execution path, but keep `querySource` and the permission interface uniform.

**Sub-AI boundaries**:

A sub-AI instance should:
- ✅ have an independent AbortController
- ✅ have a restricted tool set (only the tools it needs)
- ✅ have a clear `querySource` identifier (in a multi-Agent system, `querySource` solves a fundamental problem: when something goes wrong, you can trace back to **which Agent** initiated **which operation chain**—critical for auditability in AI systems)
- ❌ not modify the parent AI's global state (`setAppState` should usually be a no-op, though note: certain state updates such as tool result caching may need exceptions—after distinguishing permissions by `querySource`, Claude Code allows some state updates from sub-Agents)
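
Put together, these boundaries are just a narrower configuration of the same `AIQuery` interface. A hypothetical sketch; `runQuery` and the two tool objects are assumed names, not Claude Code's actual API:

```typescript
declare function runQuery(q: AIQuery, opts: { signal: AbortSignal }): Promise<unknown>
declare const grepTool: Tool
declare const globTool: Tool

const subController = new AbortController()  // independent lifecycle: cancel the child, not the parent

const subQuery: AIQuery = {
  messages: [{ role: 'user', content: 'Find all callers of queryLoop' }],
  systemPrompt: 'You are a focused code-search agent.',
  querySource: 'agent:code-search',  // traceable origin: which Agent started this chain
  canUseTool: (tool) =>
    tool.name === 'GrepTool' || tool.name === 'GlobTool'
      ? { behavior: 'allow' }
      : { behavior: 'deny', reason: 'outside sub-agent scope' },  // restricted boundary
  setAppState: undefined,            // child cannot mutate parent state
  tools: [grepTool, globTool],       // only the tools it needs
}

const report = await runQuery(subQuery, { signal: subController.signal })
```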

> 🚫 **Over-engineering signal**: if your project has only one AI call pattern (user asks → AI answers), a composable architecture is unnecessary. **Rule of thumb: introduce a unified engine when you find yourself copy-pasting AI call code and tweaking parameters.**

---

## Idea 4: Multiple Layers of Defense (Fail-Closed)

**Design checklist**:

1. **Which operations are bypass-immune?** Hard-code non-circumventable limits that are not controlled by configuration.
2. **When the judgment is uncertain, what do you choose?** The default should be the more conservative behavior.
3. **What happens on repeated failures?** Automatic degradation (e.g., iron gate falling back to human approval).
4. **How do you distinguish "refuse to execute" from "refuse but tell the model"?** (Exit code semantics: 2 = show to model, other non-zero = show to user.)

**Minimum viable implementation**:

```python
def check_permission(operation, context):
    # Layer 1: hard-coded absolute prohibitions (bypass-immune)
    if operation.path in NEVER_TOUCH_PATHS:
        return Deny("System file - never allowed")
    
    # Layer 2: configured deny rules
    if matches_deny_rules(operation, context.settings):
        return Deny("Policy")
    
    # Layer 3: independent classifier check (note: not letting the operation check itself!)
    if not classifier.is_permitted(operation, context):
        return Deny("Classifier rejected")
    
    # Layer 4: configured allow rules
    if matches_allow_rules(operation, context.settings):
        return Allow()
    
    # Layer 5: default (conservative)
    return Ask("Unknown operation - need approval")
```

> ⚠️ **Security principle: the inspected should not be the inspector**. Layer 3 uses an **independent `classifier`** to judge whether an operation is permitted, rather than letting the `operation` object call `self.is_permitted()`. Why? If the `operation` object is poisoned by prompt injection, its self-check may return a malicious result—like letting the suspect write their own acquittal. Claude Code's actual implementation uses an independent classifier for judgment, not tool self-inspection.

**Production-grade addition: permission caching**. In an Agent loop, multiple tool calls may fire per second; the latency of a 5-layer permission check accumulates. Claude Code mitigates this with a permissions cache—skipping repeated checks for previously judged identical operations. Once your "minimum viable implementation" goes to production, **a caching layer is almost mandatory**.
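
A sketch of that cache, with `checkPermission` standing in for the multi-layer check above (the `Operation` and `Context` shapes are assumed):

```typescript
type Decision = { kind: 'allow' | 'deny' | 'ask'; reason?: string }
type Operation = { tool: string; input: unknown }
type Context = { querySource: string }
declare function checkPermission(op: Operation, ctx: Context): Decision

// Clear this whenever settings change, or cached decisions go stale
const permissionCache = new Map<string, Decision>()

function cachedCheckPermission(op: Operation, ctx: Context): Decision {
  // The key must cover everything the decision depends on: caller, tool, exact input
  // (production code would canonicalize the input serialization, per Idea 2)
  const key = `${ctx.querySource}:${op.tool}:${JSON.stringify(op.input)}`
  const cached = permissionCache.get(key)
  if (cached) return cached

  const decision = checkPermission(op, ctx)  // full multi-layer evaluation
  // Cache only allow/deny; an 'ask' must reach the human every time
  if (decision.kind !== 'ask') permissionCache.set(key, decision)
  return decision
}
```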

> 📚 **Course Connection**: fail-closed multi-layer defense is a direct application of the **Defense in Depth** principle from information security courses. The 5-layer permission check above is analogous to firewall rule evaluation in network security: evaluate layer by layer in priority order, with a final default-deny rule. It also maps to security property verification in **formal methods** courses—fail-closed guarantees that "any operation not explicitly permitted is denied," a security invariant that can be formally proven.

> 🚫 **Over-engineering signal**: in a single-user desktop app or internal prototype, a 5-layer permission check is unnecessary complexity. **Rule of thumb: multi-layer defense only matters when your system is exposed externally (users can trigger sensitive behaviors like file operations or network requests via natural language) and user count > 1. For single-user internal tools, one hard-coded deny list + default ask is enough.**

---

## Idea 5: Observability Is Part of the Product

**When designing each feature, simultaneously design its observability**:

- While this feature executes, can users see what it is doing?
- When it fails, is there enough information to diagnose the cause?
- Are there metrics to know whether this feature is actually effective (rather than just assuming it is)?

**In practice**:

```python
# Bad (observability as an afterthought)
result = await ai_classify(tool_call)
# ... later discovered classifier misbehavior, added logs

# Good (observability built-in)
start_time = time.monotonic()
result = await ai_classify(tool_call)
log_event('classifier_result', {
    'tool': tool_call.name,
    'decision': result.decision,
    'duration_ms': (time.monotonic() - start_time) * 1000,
    'reason': result.reason,
})
```

Treat log events as code that is just as important as feature code, not as "debugging print statements."

**Observability needs unique to AI applications**: the `log_event` + `duration_ms` pattern above has been standard practice since the microservices era of 2015 (supported by OpenTelemetry / Zipkin / Jaeger). The real challenge of AI applications is tracking **AI-specific metrics**:

- **Token consumption**: how many input/output tokens each call used, and cache hit rate
- **Classifier decision chain**: why the permission classifier made this judgment, and on what basis
- **Tool call chain**: how many tool calls a single user request triggered, and the result of each
- **Agent loop depth**: how many iterations the Agent entered before completing the task, and whether there is a risk of infinite loops

LangChain's `CallbackHandler` and Vercel AI SDK's `onToken` / `onFinish` hooks provide partial capabilities, but if you build your own Agent loop, you must instrument these metrics yourself.
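
If you do own the loop, the instrumentation can be one structured event per iteration. The sketch below is hypothetical (every function name and response shape is assumed) and covers token consumption, the tool call chain, and loop depth; the classifier decision chain belongs in the permission layer from Idea 4:

```typescript
declare function callModel(q: AIQuery): Promise<{
  usage: { input_tokens: number; output_tokens: number; cache_read_input_tokens?: number }
  toolCalls: { name: string }[]
}>
declare function applyToolResults(q: AIQuery, calls: { name: string }[]): Promise<AIQuery>
declare function logEvent(name: string, data: object): void

async function agentLoop(query: AIQuery, maxIterations = 25) {
  for (let depth = 0; depth < maxIterations; depth++) {
    const t0 = performance.now()
    const response = await callModel(query)
    logEvent('agent_iteration', {
      querySource: query.querySource,
      depth,                                                         // Agent loop depth
      inputTokens: response.usage.input_tokens,                      // token consumption
      outputTokens: response.usage.output_tokens,
      cacheReadTokens: response.usage.cache_read_input_tokens ?? 0,  // cache hit signal
      toolCalls: response.toolCalls.map(t => t.name),                // tool call chain
      durationMs: performance.now() - t0,
    })
    if (response.toolCalls.length === 0) return response  // no more tools: task complete
    query = await applyToolResults(query, response.toolCalls)
  }
  logEvent('agent_loop_exhausted', { querySource: query.querySource, maxIterations })
  throw new Error('Agent loop exceeded max iterations')  // hard stop against infinite loops
}
```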

> 🚫 **Over-engineering signal**: if your AI application is still validating product hypotheses, `console.log` is sufficient. **Rule of thumb: introduce structured observability the first time you encounter the debugging trap of "the AI is behaving strangely and I don't know why."**

---

## Idea 6: Hooks System as the Standard Template for Extensibility

The hooks/plugin system is one of the oldest patterns in software engineering—from Emacs hooks to Webpack's tapable, from Git hooks to WordPress actions/filters, this pattern has at least a 40-year history. Claude Code did not invent hooks, but it demonstrates how to adapt this classic pattern to the **AI Agent** context.

If your AI system needs to support user-customizable behavior, a hooks system is a mature pattern. Among the five design points below, the first three are basic elements of any hooks system, while the last two are **special requirements** for AI application scenarios:

1. **Define event nodes**: emit events at every meaningful node in the system
2. **Define exit code semantics**: agree on what different exit codes mean (blocking / non-blocking / for AI to see / for user to see)
3. **Secure defaults**: require explicit trust establishment, not open by default
4. **Source labeling** (AI-specific): label each hook with its source, for auditing and prioritization—in a multi-Agent system, you need to know which Agent triggered which hook
5. **Async support** (AI-specific): AI Agent operations are often long-running, so some hooks need to run asynchronously and wake the system via a notification mechanism

> ⚠️ **Exit code semantics demand extreme care**. What if a user-written hook script returns an unexpected exit code? This is a fail-open vs fail-closed design decision. Claude Code chooses **fail-closed**—unknown exit codes are treated as blocking. This sacrifices some user experience (a user's hook might block an operation unexpectedly due to an unrelated bug), but gains safety guarantees. Your project must make this choice based on its own security requirements.
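
In code, fail-closed means the `default` branch blocks. A sketch following the exit-code semantics described in this chapter (`runHook` and its result shape are assumed):

```typescript
type HookResult = { exitCode: number; stdout: string; stderr: string }
declare function runHook(hookCommand: string, event: object): Promise<HookResult>

async function interpretHookResult(hookCommand: string, event: object) {
  const { exitCode, stdout, stderr } = await runHook(hookCommand, event)
  switch (exitCode) {
    case 0:
      return { blocked: false }                                     // success: continue
    case 2:
      return { blocked: true, audience: 'model', message: stderr }  // block, feed back to the AI
    default:
      // Every other code, including unexpected ones, lands here: fail closed
      return { blocked: true, audience: 'user', message: stderr || stdout }
  }
}
```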

Cursor's `.cursorrules` layered system offers another approach: instead of controlling behavior through exit codes, it uses a priority hierarchy of rule files (project-level > user-level > global-level) to achieve extensibility. If your needs lean more toward "configuration" than "scripting," this declarative approach may be cleaner.

> 🚫 **Over-engineering signal**: if your AI application has fewer than 100 users and no third-party integration needs, a hooks system is pure over-engineering. **Rule of thumb: introduce hooks when you receive the third feature request of "can you automatically do Y before/after X?"**

---

## Final Thoughts

These ideas were not invented by Claude Code—utilizing waiting time comes from CPU speculative execution, token management is analogous to memory management, multi-layer defense comes from security engineering, and hooks come from Unix extension-point design. Frameworks and products such as LangChain, Vercel AI SDK, CrewAI, AutoGen, and Cursor are also practicing similar philosophies in their respective domains, each with its own trade-offs, which we have compared throughout this chapter.

Claude Code's contribution lies not in "being first," but in **systematically combining these existing engineering ideas into one product** and validating their combined effectiveness at real production scale (millions of users, billions of tokens per week). Individually, you may have seen each idea in a textbook or another project; but combining them into a coherent engineering system is Claude Code's unique value as a "living case study."

This "combinatorial value" can be summarized as **systemic effects**: the Dream system makes KAIROS smarter → the Bash AST parser makes the permission system possible → permissions make autonomous execution safe → autonomous execution makes KAIROS feasible → **three-layer compression**[^compact-layers] makes long sessions possible → long sessions make multi-Agent coordination a reality. Remove any one layer, and the effectiveness of the rest drops. That is the essence of excellent architecture—the whole is greater than the sum of its parts.

[^compact-layers]: The "three-layer compression" here is a highly abstract shorthand—referring to **lightweight pruning / medium compression / heavy full-text compression** as the three broad categories. The full technical details are described as "six compression mechanisms" in **Q02 Why Context Compression Requires Six Mechanisms**, and as "five in-call compression layers + sixth defense line reactiveCompact" in **Part 2 Chapter 4 The `queryLoop`**. These three descriptions correspond to different abstraction granularities of the same system—"three layers" is used here to simplify critical analysis, not to contradict other chapters.

This is the greatest value of this codebase: it is not just Claude Code; it is a living case study in engineering AI applications at scale.

---

## Code Landmarks

- `src/main.tsx`: the startup sequence—a complete example of "hiding work inside waiting time"
- `src/services/api/claude.ts`: the parameter injection pattern and `querySource` routing logic—the core implementation of "designing AI as composable modules" (focus on how parameters are dynamically assembled by `querySource`, rather than the `query()` function itself)
- `src/utils/permissions/permissions.ts`: the 10-step permission state machine—the culmination of the multi-layer defense idea
