# If I Were to Redesign It

This is a thought experiment: if you were to design a system like Claude Code from scratch, which design decisions would you make differently?

> 💡 **Plain English**: This is like the **post-move-in review after renovating a house**—only after living in it do you realize there aren't enough power outlets in the kitchen, the bathroom door should have been on the other side, or there isn't enough storage space. That doesn't mean the house was poorly built; it just means "if I could do it over, these are the things I'd improve." Claude Code is the same—many design choices have revealed room for improvement in practice.

This is not a dismissal of Claude Code's design—many decisions have reasonable historical justifications. It is an exercise to help deepen your understanding of trade-offs.

> 🌍 **Industry Context**: Redesigning from scratch is not mere speculation—others in the industry have already explored different directions. **Cursor** delegates token-budget management to the editor layer, letting users intuitively see the difference between "long context" and "short context" modes through the UI, partially addressing the token-budget transparency issue discussed in Redesign 1 of this chapter. **Devin** (Cognition Labs) designs its AI Agent as a fully autonomous virtual developer running in a cloud sandbox; its Agent collaboration graph (Redesign 2 in this chapter) is explicitly visible—users can observe in real time what Devin is doing. **Aider**, an open-source terminal AI coding tool, takes the minimal route: no hooks system, no plugin ecosystem, no multi-Agent orchestration, yet its codebase is only a few thousand lines and can be read in an afternoon. These different choices show that behind every "cost" in Claude Code's current design lies an alternative path not taken.

---

## Redesign 1: Make the Token Budget an Explicit Type System

> 💡 **What is a "type system"?** Simply put, it is labeling everything in your code—"this is a number," "this is text," "this is a token budget." If you accidentally use "text" as a "number," the system throws an error while you are still writing the code, rather than waiting for the program to crash at runtime. An "explicit type system" means making the token budget one of those labels, so anyone who writes it wrong gets stopped by the system immediately.

Claude Code considers token costs extensively at runtime (`omitClaudeMd`, `CacheSafeParams`, six layers of compression), but these considerations are scattered throughout the codebase without a unified abstraction.

**Current problems**:
- Engineers adding new features must proactively learn "will this affect cache?" and "how many tokens will this add?"
- This knowledge is implicit, not visible at the code level
- The constraints of `CacheSafeParams` are comments, not compile-time checks

**If redesigning**: introduce a `TokenBudget<T>` type, making token consumption part of the type system:

```typescript
// Pseudocode: Budget, CacheSafePrompt, and SubAgent are sketches, not real types
type Budget = { limit: number }

type TokenBudget = {
  systemPrompt: Budget
  tools: Budget
  history: Budget
}

// Branded type: values can only be produced by a constructor that copies the
// parent request's cache params, so a param mismatch becomes a type error
type CacheSafePrompt = string & { readonly __cacheSafe: unique symbol }

interface SubAgent { /* elided */ }

declare function createSubAgent(
  prompt: CacheSafePrompt,  // Compile-time guarantee: same params as parent request
  budget: TokenBudget,
): SubAgent
```

This way, the constraints of `CacheSafeParams` would not be a "DO NOT" in a comment, but a type error.

> ⚠️ **Feasibility analysis: the boundaries of the TypeScript type system**
>
> The pseudocode above shows the ideal state, but we must honestly face a technical fact: **token count is a runtime property, and TypeScript's structural type system cannot check it at compile time.** How many tokens a string contains depends on the specific tokenizer (different models use different tokenizers), the actual runtime content, and even changing API pricing rules. This is fundamentally different from Rust's ownership system—ownership is a fixed semantic rule that can be fully determined at compile time, whereas token counting depends on runtime information and cannot be statically derived.
>
> A more realistic approach is a combination of a **runtime budget tracker + development-time lint rules** ("lint" means automated code-checking tools—like spell-check in Word, but for code conventions). Specifically:
> - Use a `TokenBudgetTracker` class at runtime to trace token consumption by component, providing `.allocate()` / `.consume()` / `.remaining()` APIs
> - Use custom ESLint rules to check "does every function that constructs a prompt accept a budget parameter?"
> - Run integration tests in CI to verify that typical scenarios do not exceed component token budgets
>
> Ironically, this is precisely a more structured version of what Claude Code already does—upgrading the comment constraints of `CacheSafeParams` into lint rules, and consolidating scattered token calculations into a single tracker. The improvement is incremental, not revolutionary.
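
The tracker half of that combination can be sketched in a few dozen lines. Everything here is illustrative: `TokenBudgetTracker` and the component names are hypothetical, not actual Claude Code APIs.

```typescript
// Hypothetical sketch of a runtime token budget tracker, not Claude Code's API.
type Component = "systemPrompt" | "tools" | "history";

class TokenBudgetTracker {
  private limits = new Map<Component, number>();
  private used = new Map<Component, number>();

  // Reserve a per-component token limit up front
  allocate(component: Component, limit: number): void {
    this.limits.set(component, limit);
    this.used.set(component, 0);
  }

  // Record consumption; throw (fail closed) if the component exceeds its budget
  consume(component: Component, tokens: number): void {
    const limit = this.limits.get(component);
    if (limit === undefined) throw new Error(`no budget allocated for ${component}`);
    const next = (this.used.get(component) ?? 0) + tokens;
    if (next > limit) {
      throw new Error(`${component} budget exceeded: ${next} > ${limit}`);
    }
    this.used.set(component, next);
  }

  remaining(component: Component): number {
    return (this.limits.get(component) ?? 0) - (this.used.get(component) ?? 0);
  }
}
```

The fail-closed `consume()` is the one design choice worth debating: throwing on overflow surfaces budget bugs in CI integration tests, whereas silently truncating would hide them.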

> **Why didn't Claude Code do this already?** Tokens are a runtime property, LLM pricing changes frequently, and different models use different tokenizers—meaning any compile-time solution would necessarily be incomplete. The Claude Code team chose a more pragmatic path: using comment constraints and code review to manage token-related logic, and devoting engineering effort to more urgent feature iterations. This decision is reasonable during rapid iteration—the cost is that new engineers are prone to pitfalls (not knowing that a certain change will break cache), but the benefit is that the team does not have to maintain a complex budget infrastructure.
>
> **Migration cost**: introducing a centralized budget tracker would require refactoring all prompt construction paths (involving multiple functions in `src/services/api/claude.ts`), estimated at 2–4 weeks of engineering work. The main risk is not the volume of code changes, but that budget allocation strategies themselves require extensive tuning—allocating too much wastes tokens, allocating too little causes truncation, and the optimal allocation varies by user scenario.

> 📚 **Course Connection**: Elevating runtime constraints into checkable rules draws on the idea of **Dependent Types**. But note the distance between ideal and reality: academic dependent types can encode arbitrary constraints, while industrial TypeScript can only offer limited guarantees at the structural level.

---

## Redesign 2: Model AI Instances as an Explicit Directed Graph

There are currently seven kinds of AI instances in the system, and their relationships are hidden inside the `querySource` field and code flow. The main AI can spawn sub-Agents, which can in turn spawn sub-Agents; the SessionMemory AI runs in the background; the Hook Agent is created on demand.

**Current problems**:
- There is no single place to see "how many AIs are currently running and how they relate to each other"
- There is no explicit limit on how deep sub-Agents can nest (only implicit token and time limits)
- When debugging multi-AI collaboration, it is hard to trace which AI did what

**If redesigning**: model AI collaboration as an explicit directed acyclic graph (DAG):

```
MainAgent
├── SpeculationAgent (background)
├── SessionMemoryAgent (background, post-sampling)
└── SubAgent[0]
    ├── SubAgent[0.0]
    └── HookAgent (stop condition)
```

This graph should be visible at runtime (via an `/agents` command or similar interface), with each node having explicit lifecycle, resource limits, and cancellation propagation rules.

Claude Code already has `AppState.tasks` to track sub-Agents, and the `querySource` field already provides identity tags for each AI instance—this shows the team has recognized the need for Agent traceability and implemented a first pass. The redesign proposal should build incrementally on this existing foundation, not assume a blank slate.
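
A minimal sketch of such a runtime topology model, building on the existing `AppState.tasks` idea. The `AgentNode` shape and `renderTree` helper are hypothetical, not part of Claude Code:

```typescript
// Hypothetical runtime agent tree backing an `/agents`-style view.
interface AgentNode {
  id: string;
  kind: "main" | "sub" | "hook" | "memory" | "speculation";
  children: AgentNode[];
}

// Render the tree with the box-drawing style used in the diagram above:
// each child gets a branch marker, and grandchildren inherit a continuation prefix
function renderTree(node: AgentNode): string[] {
  const lines: string[] = [node.id];
  node.children.forEach((child, i) => {
    const isLast = i === node.children.length - 1;
    const branch = isLast ? "└── " : "├── ";
    const continuation = isLast ? "    " : "│   ";
    const childLines = renderTree(child);
    lines.push(branch + childLines[0]);
    for (const rest of childLines.slice(1)) {
      lines.push(continuation + rest);
    }
  });
  return lines;
}
```

An `/agents` command would only need to maintain this tree on Agent creation/destruction and print `renderTree(root).join("\n")`.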

> ⚠️ **Analogy calibration: call stack, not container orchestration**
>
> We need to correct a potentially misleading analogy. Claude Code's Agents are not long-running services—they are **ephemeral, on-demand sequences of LLM calls**. A SubAgent's lifecycle is typically only a few seconds to a few tens of seconds; creation and teardown are lightweight. Analogizing this pattern to Kubernetes Pod orchestration or Airflow DAG scheduling is comparing lightweight function calls to heavyweight container orchestration, off by several orders of magnitude.
>
> A more fitting analogy is the **function call stack or async task tree**: the main Agent calls a sub-Agent, which may call another sub-Agent, and upon completion the result bubbles back up—this is the recursive function call pattern. DAG visualization still has value (like a debugger's call stack view), but its purpose is to help developers understand calling relationships, not to perform Kubernetes-style resource scheduling and container orchestration.

> **Why didn't Claude Code do this already?** The dynamic creation pattern of Agents means the graph topology is unknowable before runtime—the main Agent decides whether to spawn a sub-Agent based on the complexity of the user request, and the sub-Agent decides whether to subdivide further based on execution progress. Requiring static declaration of Agent topology would limit this flexibility. Moreover, Claude Code's Agents typically nest only 2–3 levels deep, so debugging complexity is limited—in most scenarios, the existing `querySource` tracing mechanism is sufficient.
>
> **Migration cost**: adding a runtime Agent topology view (something like an `/agents` command) is relatively lightweight—mainly maintaining an Agent tree in `AppState` and updating nodes on creation/destruction, estimated at 1–2 weeks of engineering work. But the real cost is ongoing maintenance: every change to Agent spawning logic must synchronize with topology tracking, and the visualization UI must handle render jitter from rapidly created and destroyed Agents.

> 📚 **Course Connection**: Modeling runtime relationships between AI instances as a queryable data structure corresponds to the process tree in **operating systems** courses (the `pstree` command) and trace propagation in **distributed systems** courses (such as OpenTelemetry's span tree). Claude Code's current implicit Agent relationships are like troubleshooting which program is eating all the memory on a computer with no Task Manager—you know something is wrong, but you can't see who is causing it.

---

## Redesign 3: Make CLAUDE.md a Versioned Contract

CLAUDE.md is currently a passive file—Claude reads it on every session startup. But this creates some problems:

**Current problems**:
- After CLAUDE.md is modified, Claude won't use the new content until the next session
- There is no mechanism to tell Claude "this rule is more important than that rule"
- Priority among multiple CLAUDE.md files is determined by loading order—the source comments in `src/utils/claudemd.ts` explicitly state: "Files are loaded in reverse order of priority, i.e. the latest files are highest priority with the model paying more attention to them." In other words, higher-priority files are placed later in the prompt, relying on the model's tendency to pay more attention to later content. This mechanism is deterministic (loading order: global → user → project root → current directory, with nearer directories loaded later), but the "enforcement strength" of priority depends on the model's attention allocation behavior rather than an explicit priority directive—this is indeed fragile.

**If redesigning**: turn CLAUDE.md into a versioned, prioritized, conflict-detecting contract system:

```yaml
# .claude/contracts/security.md
version: "1.2"
priority: critical  # critical / normal / hint
scope: workspace    # global / workspace / directory
conflicts_with:     # report if these contracts conflict
  - "performance"
rules:
  - Never commit to main directly
  - Always run tests before committing
```

This way, the system could detect conflicts at load time, merge by priority, and detect breaking changes when users modify rules.
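
Load-time conflict detection and priority merging can be sketched briefly. The `Contract` shape mirrors the YAML above; none of these types exist in Claude Code:

```typescript
// Hypothetical load-time merge for the contract system proposed above.
type Priority = "critical" | "normal" | "hint";

interface Contract {
  name: string;
  priority: Priority;
  conflictsWith: string[]; // the YAML's conflicts_with field
  rules: string[];
}

const rank: Record<Priority, number> = { critical: 0, normal: 1, hint: 2 };

// Sort contracts by priority (critical first) and report declared conflicts
function mergeContracts(
  contracts: Contract[],
): { ordered: Contract[]; conflicts: string[] } {
  const names = new Set(contracts.map((c) => c.name));
  const conflicts: string[] = [];
  for (const c of contracts) {
    for (const other of c.conflictsWith) {
      if (names.has(other)) conflicts.push(`${c.name} conflicts with ${other}`);
    }
  }
  const ordered = [...contracts].sort((a, b) => rank[a.priority] - rank[b.priority]);
  return { ordered, conflicts };
}
```

Note the contrast with the current mechanism: priority here is an explicit sort key, not a bet on the model paying more attention to later prompt content.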

> ⚠️ **Honest admission**: the core of this proposal—version numbers, priority, scope, conflict detection—is not new in the JavaScript toolchain. ESLint's cascade config, Prettier's config merging, and even 2012's `.editorconfig` all implement directory-level scope and rule merging. The real contribution here is not inventing a new pattern, but applying mature configuration-management practices to an AI Agent instruction system.

> **Why didn't Claude Code do this already?** CLAUDE.md's core value lies in **zero barrier to entry**—it is just a Markdown file that anyone can edit, with no need to learn YAML schemas, priority syntax, or conflict-detection rules. Adding a contract mechanism would introduce learning costs: users would need to understand the difference between `critical` and `normal`, the inheritance logic of `scope`, and the meaning of conflict reports. For most users who only place a single CLAUDE.md at the project root, this complexity is unnecessary. The Claude Code team chose the "simple file + deterministic loading order" approach, covering 90% of use cases with the lowest possible user friction.
>
> **Migration cost**: migrating from Markdown files to YAML contracts would require designing migration tools (automatically converting existing CLAUDE.md files to the new format), providing backward compatibility (the new system must read the old format), and updating all documentation and tutorials. A larger hidden cost is community adaptation—existing CLAUDE.md best practices, community templates, and team standards would all need to be rewritten.

---

## Redesign 4: Hooks Need Finer-Grained Permissions

Today's hooks are binary: either trusted, meaning they can execute arbitrary shell commands (the same text-based instructions you would type into a terminal, such as deleting files or installing software), or untrusted and completely disabled.

**Current problems**:
- You cannot say "this hook may only read files, no network requests"
- You cannot say "this hook's shell command timeout is 5 seconds instead of 10 minutes"
- Plugin hooks and user-defined hooks share the same capability boundary

**If redesigning**: add capability declarations to hooks:

```json
{
  "PreToolUse": [{
    "matcher": "Edit",
    "hooks": [{
      "type": "command",
      "command": "lint.sh $CLAUDE_FILE_PATHS",
      "capabilities": {
        "network": false,
        "filesystem": "read-only",
        "timeout": 5000
      }
    }]
  }]
}
```

Ideally, the system would run hooks inside a sandbox, enforcing the declared capability boundaries. This would make the hooks permission model clearer and reduce the blast radius of malicious hooks.

> ⚠️ **Implementation complexity should not be underestimated**
>
> The JSON configuration above looks simple and elegant, but there is a vast engineering chasm between "declaring a capability field" and "enforcing that capability at runtime." To achieve process-level read-only filesystem isolation:
> - On **macOS**, you would need seatbelt profiles (`sandbox-exec`)—but Apple has deprecated this, and documentation is extremely sparse
> - On **Linux**, you would need seccomp-bpf filters or namespace isolation, which are complex to configure and behave inconsistently across distributions
> - On **Windows**, the process-isolation mechanism is completely different and requires a separate implementation path
>
> There is also a **performance cost**: launching a sandbox environment for every hook execution adds significant latency. For a `PreToolUse` hook on the critical path, is an extra 50–200ms of sandbox startup acceptable? During frequent editing operations, these latencies add up.
>
> A more incremental approach: first implement **declarative capability auditing** (declaration only, no enforcement, used for auditing and prompting), then gradually introduce **lightweight isolation** (such as a `--read-only` mounted tmpdir rather than a full sandbox). This mirrors the evolution of Claude Code's existing permission system—first do fail-closed user confirmation, then consider automated sandboxing.
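
The first step of that incremental path, declaration-only auditing, needs no sandbox at all. A sketch with hypothetical shapes, loosely following the JSON config above:

```typescript
// Hypothetical audit pass over declared hook capabilities: no enforcement,
// just warnings the user can review before trusting a hook.
interface HookCapabilities {
  network: boolean;
  filesystem: "read-only" | "read-write";
  timeout: number; // milliseconds
}

function auditHook(command: string, caps: HookCapabilities): string[] {
  const warnings: string[] = [];
  if (caps.network) warnings.push(`"${command}" declares network access`);
  if (caps.filesystem === "read-write") warnings.push(`"${command}" declares write access`);
  if (caps.timeout > 60_000) warnings.push(`"${command}" declares a timeout over 60s`);
  return warnings;
}
```

This buys most of the security UX benefit (users see what a hook claims to need) at zero runtime cost, and leaves OS-level enforcement as a later, separable step.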

> **Why didn't Claude Code do this already?** The hooks system itself is still in early stages (the simplicity of settings.json configuration shows the team is intentionally keeping complexity low). While users are still exploring how to use hooks, introducing sandbox mechanisms too early would increase configuration complexity, reduce debugging convenience, and might block legitimate use cases because the sandbox is too restrictive. The current "trust user configuration" strategy aligns with Claude Code's overall security philosophy: a hook explicitly configured by the user is equivalent to a command the user types into the terminal, and the security responsibility lies with the user.
>
> **Migration cost**: sandbox implementation requires cross-platform adaptation (three isolation mechanisms for macOS/Linux/Windows), estimated at 4–8 weeks of heavy systems programming. The bigger challenge is long-term maintenance—OS sandbox APIs change frequently (e.g., macOS seatbelt deprecation), requiring continuous follow-up.

---

## Redesign 5: Context Compression Needs User Participation

Today's context compression is fully automatic; users cannot see when it happens or what is lost.

**A better design should**:
- Notify users before compression occurs (like Notion AI prompting "this conversation is getting long, I'm going to compress it")
- Let users mark "this message is important, do not compress it"
- Show "what is the summary" after compression (not just implicit continuation)

SessionMemory is a partial solution, but its nine fixed sections may not capture everything users consider important.

> ⚠️ **This is a UX improvement, not an architecture redesign**
>
> We must honestly admit: the "notify before compression," "let users mark important information," and "show the summary" ideas mentioned above are essentially product-manager requirement descriptions, not architecture-level redesigns. To elevate this to the architecture level, the real question to answer is: **how do you design a context-management architecture that supports "selective compression"?** For example, a tiered context ring buffer—hot data in L1 (fully preserved), warm data in L2 (summarized), cold data in L3 (keyword index only). A user's "important" mark is equivalent to pinning specific information to the L1 tier. But this raises a new question: if users mark 80% of the content as "important," does compression still matter?

> **Why didn't Claude Code do this already?** The automation of context compression is an intentional design decision—it reduces user cognitive load. Most users do not want (and should not need) to understand the technical details of token budgets and context windows. Popping a notification before every compression would interrupt workflow and force users to make a decision they may not have enough information to make well ("which parts of this conversation are most important for the upcoming tasks?"). Claude Code chooses to let SessionMemory automatically extract key information, making a trade-off that favors user experience over user control.
>
> **Migration cost**: UI-level notifications and marking features are relatively lightweight (1–2 weeks). But a bottom-up selective compression architecture (tiered context ring buffer) would require redesigning the entire context-management pipeline, involving coordinated changes to prompt construction, token counting, and cache strategy across multiple modules, estimated at 6–10 weeks.

---

## Summary: Good Design Decisions and What Could Be Better

**Worth keeping and learning from**:
- The tool interface (`Tool.ts`) is clean and extensible
- The multi-layered defense and fail-closed principles of the permission system
- The `querySource` mechanism for distinguishing different AI instances
- The breadth of event coverage in the hooks system

**Worth improving**:
- Make token budgets visible, inferable system constraints
- Make the AI collaboration graph an explicit runtime model
- Add capability declarations and sandboxes to hooks
- Context compression needs more user participation

These improvements do not dismiss the current design; they build on it to make the system's constraints and behaviors more explicit, observable, and inferable. We must honestly admit that each of these "redesigns" carries an element of hindsight. The Claude Code team made pragmatic choices under the pressure of rapid iteration—get the system running, serve users well, and optimize the architecture gradually. This is a reasonable strategy in engineering reality. Some of the improvements we propose (such as typed token budgets) are quite difficult to realize within the existing TypeScript type system, while others (such as Agent DAG visualization) require additional runtime overhead. The tension between ideal design and engineering reality is itself one of the most central topics in software engineering.

---

## Code Anchors

The existing implementations corresponding to the redesign goals discussed in this chapter:

- `src/services/api/claude.ts` — Redesign 1 (token budget explicitness): current token-budget calculations are scattered across multiple functions in this file
- `src/utils/forkedAgent.ts` — Redesign 2 (Agent collaboration graph): current sub-Agent relationships are implicit, lacking a runtime topology model
- `src/hooks/` — Redesign 3 (hook sandbox): current hook execution has no capability declarations and no sandbox isolation
- `src/Tool.ts` — Design worth preserving: the tool interface is clean, extensible, and one of the most successful abstractions in the architecture

---

## Beyond the Current Perspective: From Prompt Engineering to AI Behavioral Engineering

After studying 124 prompt templates, there is a broader evolution worth discussing on its own: what Anthropic practices in Claude Code is no longer "Prompt Engineering" in the usual sense, but something more systematic—**AI Behavioral Engineering**.

**Characteristics of traditional Prompt Engineering**: write a paragraph, test the result, tweak it, test again. Low technical rigor, not reproducible, heavily dependent on personal intuition, no regression mechanism. This was how most teams actually worked in 2022–2023.

**The new paradigm revealed by the Claude Code source code** manifests in four dimensions:

**1. Exhaustive enumeration of anti-patterns**

Take the verification Agent's system prompt as an example: instead of telling the AI to "be careful," it exhaustively lists every observed failure mode, with concrete examples and counter-examples. Building this kind of "negative knowledge base" is closer to **security engineering** (threat modeling) than to prose writing: first list everything that could go wrong, then defend against each one.

**2. Persona design**

The exploration Agent and planning Agent system prompts are not merely functional descriptions—they construct a complete **cognitive framework**: what are this AI's values, what default tendencies should it have when facing ambiguity, and in what situations should it stop execution rather than push forward. This is the methodology of personality psychology applied to AI behavior design.

**3. Cognitive science mapping**

The Dream system's prompts (`autoDream/promptTemplates.ts`) use neuroscience metaphors to describe the memory consolidation process—not just functional specifications, but building a cognitive model of "how an AI should simulate human memory consolidation." Mapping cognitive science concepts onto AI behavior design is an interdisciplinary, intentional methodological choice.

**4. Quantitative validation**

As noted in the previous chapter, the eval comments in `src/memdir/memoryTypes.ts` (`H1: 0/2 → 3/3`, `H2: 0/2 → 3/3`, etc.) show that Anthropic quantitatively evaluates every prompt change. This introduces scientific experimental methods into software engineering—a practice almost nonexistent in traditional software development (you would not write in an ordinary function's comments, "this algorithm improved F1 from 0.67 to 0.91 on the benchmark test set").

**What this shift means**

| Dimension | Prompt Engineering (old) | AI Behavioral Engineering (new) |
|-----------|--------------------------|---------------------------------|
| Knowledge form | Implicit personal skill | Explicit, codifiable rules |
| Validation method | Subjective feel | Quantitative eval test sets |
| Design approach | Inside-out ("I feel this is good") | Outside-in (first define failure modes, then defend) |
| Interdisciplinary borrowing | Almost none | Security engineering, cognitive science, personality psychology |
| Iteration record | None | Historical accuracy and known failures preserved in comments |

This evolution is not just "more professional Prompt Engineering"—it represents an engineering culture shift: **from "I wrote a paragraph and it feels good" to "I have 12 eval cases, 3/12 passed before the change, 11/12 passed after, and the remaining 1 failure is documented in Known gap."**

For teams building AI products, this shift has direct practical implications: it is not "hire someone who can write prompts," but **establish an eval-driven prompt iteration mechanism**—even starting with 5 test cases is better than zero. The Claude Code source code shows what this mechanism looks like at production scale.
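
What "even 5 test cases" looks like in practice can be sketched in a few lines. Everything here is illustrative (the harness, the cases, the fake model), not from the Claude Code source:

```typescript
// Hypothetical minimal prompt-eval harness: run each case against a model,
// count passes, and record failures, echoing the H1/H2 comment style cited above.
interface EvalCase {
  name: string;
  input: string;
  passes: (output: string) => boolean; // cheap programmatic check on the output
}

function runEvals(
  cases: EvalCase[],
  model: (input: string) => string,
): { passed: number; failed: string[] } {
  const failed: string[] = [];
  for (const c of cases) {
    if (!c.passes(model(c.input))) failed.push(c.name);
  }
  return { passed: cases.length - failed.length, failed };
}
```

Run it before and after each prompt change, commit the two scores next to the prompt, and the `failed` list becomes the "Known gap" documentation.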

> 💡 **Plain English**: Traditional Prompt Engineering is like a **chef seasoning by experience**—stir-fry a dish, taste it, "hmm, a bit more salt." AI Behavioral Engineering is more like **an R&D department developing a product formula**—design 20 taste testers, record scores before and after each formula adjustment, and note on the recipe card, "increased ingredient #3 from 5g to 7g, blind-test average score rose from 6.2 to 8.1, but still needs improvement for nut-allergic individuals (known gap)." Both are "seasoning," but one is a craft, the other is a science.
