# Multiple Lines of Defense Aren't Paranoia—They're Necessary

In Claude Code, "security" isn't a module; it's a design philosophy woven through the entire system.

> 🌍 **Industry Context**: Multi-layered security defense has a classic name in information security—**Defense in Depth**, formally introduced by the U.S. National Security Agency (NSA) in the early 2000s, though its roots trace back to layered military positional defense. In software security, both OWASP's security design principles and Microsoft's SDL (Security Development Lifecycle) list defense in depth as a core tenet. In the AI coding assistant space: **Cursor** employs a sandbox plus user confirmation model, and its 2025 Agent mode has introduced tool-level permission approvals (including allowlist/blocklist in yolo mode); **Aider** is comparatively weak on security, relying mainly on manual user diff review; **GitHub Copilot**, because it only does code completion (without executing commands), has a fundamentally smaller attack surface—not because its "security model is lighter," but because it doesn't need most of these six layers at all. Claude Code chose the "more powerful but more dangerous" path—it can execute arbitrary Bash commands and read/write the filesystem—and this fundamental capability-risk trade-off explains why it needs more defenses than its peers. It isn't that Claude Code invented defense in depth; it's that the capability scope of AI Agents (especially code execution privileges) transformed defense in depth from a "recommended practice" into a "survival necessity."

---

## The Same Question, Answered at Six Different Levels

Take "should this Bash command be allowed?" as an example. The system makes this judgment across six distinct layers:

**Layer 1: Global Policy (policySettings)**

Enterprise MDM or local configuration files can declare at the system level that certain tool categories are never permitted. This layer runs before all other judgments and cannot be bypassed. In the source code, this corresponds to step 1a of `hasPermissionsToUseToolInner()`—the entire tool is intercepted by a deny rule.
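
As a rough sketch of how such a short-circuit might look—the rule shape and helper names below are illustrative stand-ins, not the actual source:

```typescript
// Illustrative sketch of a policy-level deny check; the rule format and
// names are simplified stand-ins, not the real Claude Code data structures.
interface PolicyRule {
  tool: string      // e.g. 'Bash'
  pattern?: RegExp  // optional match against the tool's input text
}

type Decision = { behavior: 'allow' | 'deny' | 'ask'; reason?: string }

function checkPolicyDeny(
  denyRules: PolicyRule[],
  toolName: string,
  inputText: string,
): Decision | null {
  for (const rule of denyRules) {
    if (rule.tool !== toolName) continue
    if (rule.pattern && !rule.pattern.test(inputText)) continue
    // A policy deny short-circuits everything that follows — no later layer,
    // not even bypassPermissions mode, gets a chance to override it.
    return { behavior: 'deny', reason: `blocked by enterprise policy for ${toolName}` }
  }
  return null // policy has no opinion; fall through to the next layer
}
```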

**Layer 2: Tool Self-Check (checkPermissions)**

Every tool implements its own `checkPermissions()` method. The Bash tool matches commands against known dangerous patterns with plain string inspection (e.g., `rm -rf`—forced recursive deletion, a single command that can wipe an entire directory tree or disk; `chmod 777`—making a file readable, writable, and executable by everyone; see `bashPermissions.ts`). The Edit tool checks whether a path falls inside sensitive directories like `.git/`, `.claude/`, `.vscode/`, or `.idea/`, or matches dangerous filenames like `.bashrc` or `.gitconfig` (the `DANGEROUS_DIRECTORIES` and `DANGEROUS_FILES` lists in `filesystem.ts`).

> **Note**: What we're describing here is "pattern matching," not "intent understanding"—`checkPermissions()` contains no AI reasoning whatsoever; it's purely rule-driven string inspection.
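
A minimal sketch of what this kind of rule-driven self-check looks like (the patterns are examples only; the real lists live in `bashPermissions.ts` and `filesystem.ts`):

```typescript
// Minimal sketch of a rule-driven tool self-check — no AI involved,
// just regex inspection of the command text. Patterns are examples only.
const DANGEROUS_COMMAND_PATTERNS: RegExp[] = [
  /\brm\s+-[a-z]*r[a-z]*f/, // rm -rf and flag-order variants like -Rf
  /\brm\s+-[a-z]*f[a-z]*r/, // rm -fr
  /\bchmod\s+777\b/,        // world-readable/writable/executable
]

function checkBashCommand(command: string): 'ask' | 'pass' {
  return DANGEROUS_COMMAND_PATTERNS.some(p => p.test(command))
    ? 'ask'  // escalate: a later layer (or the user) must approve
    : 'pass' // nothing matched; defer to the next layer
}

// checkBashCommand('rm -rf build/')  // -> 'ask'
// checkBashCommand('ls -la')         // -> 'pass'
```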

**Layer 3: Safety Check (Bypass-Immune)**

Within the permission system sits a special Safety Check layer that executes **before** the `bypassPermissions` check (source code step 1g):

```typescript
// permissions.ts step 1g:
// Safety checks (e.g. .git/, .claude/, .vscode/, shell configs) are
// bypass-immune — they must prompt even in bypassPermissions mode.
// checkPathSafetyForAutoEdit returns {type:'safetyCheck'} for these paths.
if (
  toolPermissionResult?.behavior === 'ask' &&
  toolPermissionResult.decisionReason?.type === 'safetyCheck'
) {
  return toolPermissionResult
}
```

This means that even if the user enables the highest-privilege "bypass all permission checks" mode, the following paths still cannot be auto-modified:

- **Dangerous directories**: `.git/`, `.vscode/`, `.idea/`, `.claude/` (the entire directory tree, not just config files)
- **Dangerous files**: `.gitconfig`, `.gitmodules`, `.bashrc`, `.bash_profile`, `.zshrc`, `.zprofile`, `.profile`, `.ripgreprc`, `.mcp.json`, `.claude.json`
- **Claude's own config**: `.claude/settings.json`, `.claude/settings.local.json`, and everything under `.claude/commands/`, `.claude/agents/`, and `.claude/skills/`

This isn't protection for "two or three paths"—it's a comprehensive, hardcoded safety net covering multiple attack vectors. We'll come back to why this layer represents the most profound design thinking in the entire model.

**Layer 4/5: User Confirmation or Auto-Mode Classifier (Runtime XOR)**

Here, we need to be candid about an important fact: **Layer 4 (user confirmation) and Layer 5 (auto-mode classifier) are mutually exclusive at runtime; they are not stacked.**

- In default mode, if the preceding layers yield no definitive answer, the system pops up a permission confirmation dialog, leaving the final decision to the user—this is Layer 4.
- In auto mode, an AI classifier replaces user confirmation to judge whether a tool invocation is safe—this is Layer 5.

The two never take effect simultaneously. Strictly speaking, at any given moment, at most five layers of defense are actively working. From a "design completeness" perspective, the system does cover six distinct safety-check mechanisms; but the claim that "six layers stack, with each layer catching the holes in the one above it" does not hold between Layers 4 and 5—they are two implementations of the same decision point, each serving different usage scenarios.

> 💡 **Plain English**: This is more like an airport security checkpoint with two lanes—a staffed regular lane (user confirmation) and a self-service express lane (AI classifier). You don't go through both at once, but regardless of which lane you choose, the X-ray machine (tool self-check) and the air marshal (bypass-immune) are still on duty.
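
Conceptually, the two lanes are one branch at a single decision point rather than two stacked filters—something like this hypothetical dispatch (both helper names are made up for illustration):

```typescript
// Hypothetical dispatch for Layers 4/5: one decision point, two lanes.
type Verdict = 'allow' | 'deny'

declare function promptUserForApproval(toolName: string): Promise<Verdict> // Layer 4 stand-in
declare function classifyWithSideQuery(toolName: string): Promise<Verdict> // Layer 5 stand-in

async function resolveUndecided(mode: 'default' | 'auto', toolName: string): Promise<Verdict> {
  // The two branches never run for the same tool call — hence "runtime XOR".
  return mode === 'default'
    ? promptUserForApproval(toolName) // a human decides
    : classifyWithSideQuery(toolName) // an extra LLM call decides instead
}
```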

**The Iron Gate Rule of Layer 5**

The auto-mode classifier also has a safety valve: `denialTracking.ts` tracks the classifier's refusal count. If refusals reach 3 consecutive (`maxConsecutive: 3`) or 20 cumulative (`maxTotal: 20`), the system automatically falls back to manual approval mode. This is the so-called "Iron Gate"—when the auto-approval mechanism behaves abnormally, the system downgrades to the more conservative human-in-the-loop mode.

We need to distinguish two completely different scenarios: **the classifier actively refuses** (it understood the operation and deemed it unsafe) versus **the classifier is unavailable** (API error, network issue). The former triggers the refusal-counting logic above; the latter is governed by the `tengu_iron_gate_closed` feature flag—defaulting to fail-closed (outright denial), but switchable via remote config to fail-open (fallback to user confirmation).
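
A sketch of that bookkeeping might look like the following—the thresholds match `denialTracking.ts`, but the surrounding class shape is illustrative:

```typescript
// Sketch of the Iron Gate bookkeeping. Thresholds match denialTracking.ts;
// the class shape is an illustration, not the actual code.
const LIMITS = { maxConsecutive: 3, maxTotal: 20 }

class DenialTracker {
  private consecutive = 0
  private total = 0

  recordClassifierDecision(approved: boolean): void {
    if (approved) {
      this.consecutive = 0 // only the consecutive counter resets on approval
    } else {
      this.consecutive += 1
      this.total += 1
    }
  }

  // Once either threshold trips, auto mode downgrades to manual approval.
  shouldFallbackToPrompting(): boolean {
    return this.consecutive >= LIMITS.maxConsecutive || this.total >= LIMITS.maxTotal
  }
}
```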

**Layer 6: Hooks (Extensible Defenses)**

Users can inject custom safety checks before tool execution via the `PreToolUse` hook:

```bash
# PreToolUse hook: block any command that touches a "production" path.
# The tool call arrives as JSON on stdin; exiting with code 2 blocks it.
input=$(cat)
command=$(printf '%s' "$input" | jq -r '.tool_input.command // empty')
if printf '%s' "$command" | grep -q "production"; then
  echo "Operations on production paths are not allowed" >&2
  exit 2
fi
```

Hooks are the only layer defined by the user. This means Claude Code admits it cannot anticipate all business-scenario security needs, and chooses to hand the power of defining defenses to the user. This is the philosophy of **Security as a Platform**—the same design philosophy as AWS IAM letting users define custom policies, or Kubernetes Admission Webhooks letting users inject custom admission logic.

---

## Why Are These Layers Needed?

> 💡 **Plain English**: These defenses are more like an **airport security screening process**—check-in desk verifies your passport (enterprise policy) → metal detector at the security gate (tool self-check) → X-ray scans your luggage (Safety Check) → manual bag inspection or self-service lane (user confirmation / AI classifier) → secondary verification at the boarding gate (Hooks). Not every layer is stopping the same thing—the passport checks identity, the X-ray checks for prohibited items, and manual inspection handles uncertain cases.

**Each layer addresses a different threat model:**

| Layer | Threat Defended Against | Always Active at Runtime |
|-------|------------------------|--------------------------|
| policySettings | Compliance requirements, enterprise policy | Yes |
| checkPermissions | AI misoperation (pattern-matching known dangerous commands) | Yes |
| bypass-immune | Worst-case scenario: assets that must be protected even in bypass mode | Yes |
| User confirmation | High-risk decisions that AI shouldn't make alone | Default mode only |
| Classifier | Automatic review in auto mode | Auto mode only |
| hooks | User-defined business-logic security rules | Yes (if configured) |

No single layer can cover all threats. Enterprise policy cannot predict every business scenario; tool self-check cannot understand the semantics of an operation; user confirmation is absent in auto mode; hooks require proactive user configuration...

The value of these layers doesn't lie in the number "six stacked on top of each other." It lies in the fact that each layer defends against a different type of threat—what rules can't cover goes to AI judgment, what AI can't judge goes to humans, and certain bottom lines cannot be crossed by anyone.

---

## bypass-immune: When the User Themselves Is the Threat Source

bypass-immune is the part of the defense model that embodies the most engineering wisdom, and it deserves a close look.

Traditional security models assume the user is trustworthy—the permission system exists to "help the user do the right thing." But bypass-immune acknowledges an AI Agent era reality: **users may enable the highest-privilege mode without understanding the consequences, and the system needs to protect users from themselves.**

Imagine this scenario: the user is annoyed by repeatedly popping permission confirmations and enables `bypassPermissions` mode. Now, if the AI is misled by a malicious prompt injection into modifying `.git/config` to add a malicious remote repository, or modifying `.bashrc` to inject backdoor commands—without bypass-immune, these operations would sail right through, because the user has actively turned off all safety checks.

The design of bypass-immune answers a profound question: **There are some things that should not become editable just because the user said "I trust you."**

This idea has clear precedents in operating systems:

- **macOS SIP (System Integrity Protection)**: Even the root user cannot modify system directories like `/System` or `/usr` (except `/usr/local`). Disabling SIP requires rebooting into recovery mode—deliberately designed to be "impossible to turn off during normal operations."
- **Linux immutable file attribute**: Files marked with `chattr +i` cannot be modified even by root without first running `chattr -i`. This is an intentional speed bump—the extra step forces an administrator to confirm "I really do want to modify this file."
- **UEFI Secure Boot** (which Windows relies on): firmware-level verification of the boot chain that the operating system itself cannot bypass.

Claude Code's bypass-immune applies the same philosophy to AI Agents. Looking at the source, its implementation is scattered across multiple points in the permission pipeline:

1. **Step 1g** (`hasPermissionsToUseToolInner`): Before the `bypassPermissions` mode check, decisions of type `safetyCheck` are returned directly as `ask`, bypassing subsequent auto-allow logic.
2. **Auto-mode entry** (`hasPermissionsToUseTool`): Even in auto mode, paths flagged `safetyCheck` with `classifierApprovable: false` are not handed to the classifier—they are immune to every automatic approval route.
3. **`checkPathSafetyForAutoEdit()`** (`filesystem.ts`): Specifically determines which paths are "dangerous," returning a `classifierApprovable` flag to distinguish "requires human confirmation" from "classifier can handle."

**If you're building your own AI Agent**, the principle for determining the bypass-immune list is: find the files whose modification would cause the security model itself to fail. For Claude Code, `.claude/settings.json` controls permission rules, `.git/config` controls code provenance, and `.bashrc` controls the shell environment—modifying these files is like pulling the bottom block from a Jenga tower, causing all other security layers to rest on a compromised foundation.
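
In that spirit, here is a simplified sketch of a bypass-immune path check; the lists are abbreviated and the return shape is approximated from the description of `checkPathSafetyForAutoEdit()` above, not copied from the source:

```typescript
// Simplified sketch of a bypass-immune path check. Lists are abbreviated
// and the return shape is approximated from the description of filesystem.ts.
import * as path from 'node:path'

const DANGEROUS_DIRECTORIES = ['.git', '.claude', '.vscode', '.idea']
const DANGEROUS_FILES = ['.gitconfig', '.bashrc', '.zshrc', '.mcp.json']

interface SafetyCheck {
  type: 'safetyCheck'
  classifierApprovable: boolean // false => only a human may approve this edit
}

function checkPathSafety(filePath: string): SafetyCheck | null {
  const segments = filePath.split(path.sep)
  if (segments.some(s => DANGEROUS_DIRECTORIES.includes(s))) {
    return { type: 'safetyCheck', classifierApprovable: false }
  }
  if (DANGEROUS_FILES.includes(path.basename(filePath))) {
    return { type: 'safetyCheck', classifierApprovable: false }
  }
  return null // not a protected path; the normal permission flow applies
}
```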

---

## Fail Closed

> 📚 **Course Connection**: Fail Closed vs. Fail Open is a foundational concept in **computer networking** and **operating systems** courses. Firewalls default to a "deny all" policy; expired OAuth tokens are rejected; a failed TLS certificate validation drops the connection—all of these fail closed.

In security engineering, fail closed is a **baseline requirement**, not an innovation—a security system that can't manage it doesn't deserve to be called secure. That Claude Code fails closed is simply correct; what's really worth discussing isn't that fact itself, but **how fail-closed plays out concretely in an AI Agent scenario.**

From the source code, Claude Code's fail-closed implementation has three distinct strategies, corresponding to different failure scenarios:

**Scenario 1: Classifier API Unavailable** (network error, service outage)

```typescript
// permissions.ts: handling when classifier is unavailable
if (classifierResult.unavailable) {
  if (getFeatureValue_CACHED_WITH_REFRESH(
    'tengu_iron_gate_closed', true,
    CLASSIFIER_FAIL_CLOSED_REFRESH_MS  // refreshes every 30 minutes
  )) {
    // Fail closed: deny outright, do not fall back to user confirmation
    return { behavior: 'deny', ... }
  }
  // Fail open: fall back to normal permission flow (user confirmation)
  return result
}
```

Note that `tengu_iron_gate_closed` is a remote feature flag defaulting to `true` (fail-closed). This means Anthropic can remotely switch to fail-open mode if the classifier experiences a large-scale outage, preventing auto mode from completely freezing for all users. This is operational pragmatism—pure fail-closed at massive scale risks turning "the safety mechanism itself" into a single point of failure for availability.

**Scenario 2: Classifier Context Window Overflow** (conversation history too long)

This is a deterministic error—conversations only get longer, and won't recover on their own. The system skips the Iron Gate logic entirely and falls back to user confirmation. In headless (unattended) mode, it throws an `AbortError` to terminate the task.

**Scenario 3: Classifier Repeatedly Refuses** (3 consecutive or 20 cumulative refusals)

This implies one of two possibilities: either the task is genuinely doing something dangerous, or the classifier is being overly conservative. Either way, automated processing is no longer reliable. The system downgrades to manual approval, letting a human decide.
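
Putting the three scenarios side by side, the degradation logic reduces to something like the sketch below; the names are illustrative, and the real logic is spread across `permissions.ts` and `denialTracking.ts`:

```typescript
// Sketch of how each classifier failure mode maps to a degradation strategy.
type ClassifierFailure = 'unavailable' | 'contextOverflow' | 'repeatedRefusals'
type Degradation = 'deny' | 'askUser' | 'abortTask'

function degrade(failure: ClassifierFailure, ironGateClosed: boolean, headless: boolean): Degradation {
  switch (failure) {
    case 'unavailable':
      // Transient outage: the remote flag decides (default is fail-closed).
      return ironGateClosed ? 'deny' : 'askUser'
    case 'contextOverflow':
      // Deterministic and unrecoverable: skip the Iron Gate entirely.
      return headless ? 'abortTask' : 'askUser'
    case 'repeatedRefusals':
      // Automation is no longer trustworthy; a human takes over.
      return 'askUser'
  }
}
```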

> 💡 **Plain English**: It's like an autonomous vehicle's takeover logic—emergency-brake when the sensors can't read the road (classifier unavailable); hand the steering wheel back to the human when the system keeps second-guessing itself (repeated refusals). Different failure modes call for different degradation strategies.

---

## Performance Cost: How Long Do Six Layers Take?

The classic dilemma of defense in depth is that more layers mean more security but also more latency. How much delay do six safety checks (especially Layer 5's AI classifier inference) actually add?

**Layers 1–4 (policy / self-check / Safety Check / rule matching) have virtually zero overhead.** They are all in-memory string matches and rule lookups, executing in microseconds.

**Layer 5 (AI classifier) is the sole heavyweight.** Every time a tool requiring safety approval is invoked in auto mode, the system must make an additional LLM API call (`sideQuery`). From the telemetry fields in the source code, we can see that the system precisely tracks classifier performance metrics:

```typescript
// Telemetry records in permissions.ts
classifierInputTokens: classifierResult.usage?.inputTokens,
classifierOutputTokens: classifierResult.usage?.outputTokens,
classifierDurationMs: classifierResult.durationMs,
classifierCostUSD,  // cost per classifier call in USD
```

The system does two things to mitigate this overhead:

1. **Fast-path bypass for the classifier.** Before calling the AI classifier, the system checks two fast paths: `acceptEdits` mode simulation (if the operation would have been allowed in that mode anyway, such as editing files within the working directory, it's approved directly) and the safe-tool allowlist (`isAutoModeAllowlistedTool`, which approves read-only tools like Read outright). A large proportion of daily operations are intercepted by these two paths and never trigger a classifier API call—see the sketch after this list.

2. **Two-stage classifier design.** The source code's `classifyYoloActionXml` implements a two-stage judgment: Stage 1 is a fast preliminary assessment (fewer tokens), and only if the initial judgment is uncertain does it proceed to Stage 2 for deep analysis. The `cache_control` parameter allows system prompts and conversation prefixes to be cached and reused within the same session (1-hour TTL), further reducing repeated token consumption.
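
A sketch of that approval order—only `isAutoModeAllowlistedTool` and `classifyYoloAction` are real names from the source; the other helpers are hypothetical simplifications:

```typescript
// Sketch of the auto-mode approval order. isAutoModeAllowlistedTool and
// classifyYoloAction are real names from the source; the rest are hypothetical.
type Behavior = 'allow' | 'deny' | 'ask'

declare function isAutoModeAllowlistedTool(toolName: string): boolean
declare function wouldAcceptEditsAllow(toolName: string, input: unknown): boolean
declare function classifyYoloAction(toolName: string, input: unknown): Promise<Behavior>

async function approveInAutoMode(toolName: string, input: unknown): Promise<Behavior> {
  // Fast path 1: read-only tools (e.g. Read) never need an LLM judgment.
  if (isAutoModeAllowlistedTool(toolName)) return 'allow'

  // Fast path 2: if acceptEdits mode would have allowed it anyway
  // (e.g. an edit inside the working directory), skip the classifier.
  if (wouldAcceptEditsAllow(toolName, input)) return 'allow'

  // Slow path: one extra LLM round-trip, tracked per turn for telemetry.
  return classifyYoloAction(toolName, input)
}
```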

**The cumulative latency effect of the classifier is significant.** In auto-mode scenarios with frequent Bash command execution (e.g., consecutive `git add`, `npm install`, compile and run), every command not on the fast path adds an LLM round-trip delay. In the source code, `addToTurnClassifierDuration()` tracks the cumulative classifier time within each turn, and this number is reported to analytics alongside the main loop's token consumption—indicating that Anthropic itself is closely watching this overhead ratio.

This is a real trade-off: auto mode trades speed for safety. Users choose auto mode to reduce interruptions from manual confirmations, but each classifier call adds latency. In safety-critical scenarios this cost is worth it, but for low-risk batch operations, configuring allow rules to exploit the fast path is key to optimizing the experience.

---

## The Same Pattern in the Plugin System

Plugin security is also multi-layered:

1. **Workspace trust**: plugin hooks don't execute before trust is established
2. **Name protection**: prevents spoofing of official names
3. **Blocklist**: known malicious plugins are denied outright
4. **Enterprise control**: `allowManagedHooksOnly` can ban all user plugins
5. **Priority downgrade**: plugin hooks always have lower priority than user settings

---

## From Claude Code to Your System: A Three-Tier Implementation Guide

If you're building a system that needs to control AI Agent behavior, here are layered implementation recommendations based on source code analysis—not a vague "more layers are better than one," but concrete priorities based on team size and risk level.

### Minimum Viable Security Model (Three Layers, for Individual Developers / Small Teams)

**First Priority: Tool Self-Check (corresponds to checkPermissions)**

Every tool implements its own `checkPermissions()` method, using pattern matching to intercept known dangerous operations. This is the highest ROI layer—low implementation cost (pure string matching), but it catches most misoperations.

Follow Claude Code's approach: maintain a `DANGEROUS_FILES` and `DANGEROUS_DIRECTORIES` list. You don't need to copy Claude Code's list verbatim, but the core principle is the same—identify the files in your system whose modification would cause the security model itself to fail.

**Second Priority: bypass-immune Paths**

Hardcode a non-bypassable protection list. The criterion: if file X were tampered with, would your safety checks still work? If the answer is "no," X belongs on the bypass-immune list. For most systems, this should at least include:
- Security configuration files themselves (e.g., files defining permission rules)
- Version control config (`.git/config`)
- Shell/environment config (`.bashrc`, `.zshrc`, etc.)

**Third Priority: User Confirmation Dialog**

Pop a confirmation for all operations not on the "known safe" list. This is the simplest safety net—when uncertain, hand it to a human.
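
Stitched together, the minimum model is a short pipeline. The sketch below assumes a single `decide()` entry point; every name and list in it is an illustrative placeholder for your own system's assets, not Claude Code's:

```typescript
// Minimal three-layer decision pipeline for your own agent. All names and
// lists are illustrative placeholders for your system's own assets.
type Decision = 'allow' | 'deny' | 'ask'

const BYPASS_IMMUNE_SUFFIXES = ['.git/config', '.bashrc', '.zshrc', 'permission-rules.json']
const DANGEROUS_PATTERNS = [/\brm\s+-[a-z]*r[a-z]*f/, /\bchmod\s+777\b/]
const KNOWN_SAFE = [/^ls\b/, /^cat\b/, /^git status\b/]

function decide(command: string, targetPath: string | null, bypassAll: boolean): Decision {
  // Layer A: tool self-check — cheap pattern matching on known-bad commands.
  if (DANGEROUS_PATTERNS.some(p => p.test(command))) return 'ask'

  // Layer B: bypass-immune — these paths need a human even in bypass mode.
  if (targetPath && BYPASS_IMMUNE_SUFFIXES.some(s => targetPath.endsWith(s))) return 'ask'

  // Layer C: user confirmation — anything not known-safe falls to the human,
  // unless the user has explicitly enabled bypass mode.
  if (bypassAll) return 'allow'
  return KNOWN_SAFE.some(p => p.test(command)) ? 'allow' : 'ask'
}
```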

### Full Security Model (Six Layers, for Enterprise / High-Risk Deployments)

Add to the minimum model:
- **Enterprise policy layer** (policySettings): let operations teams control available tool scopes via MDM or centralized config
- **AI classifier layer** (auto mode): use a separate LLM call for semantic-level safety judgment of operations. Requires extra API cost and latency, but can understand contexts that rule matching cannot cover
- **Hooks extensibility layer**: hand the power of defining security rules to the people who best understand the business—the end users

### Implementation Cost Reference

| Layer | Dev Time Estimate | Runtime Overhead | Maintenance Complexity |
|-------|------------------|------------------|------------------------|
| Tool self-check | 1–2 days | Near zero | Low (update dangerous-pattern lists) |
| bypass-immune | Half a day | Near zero | Low (list rarely changes) |
| User confirmation | 1–3 days | Human wait only (no compute cost) | Low |
| Enterprise policy | 1–2 weeks | Near zero | Medium (policy management UI) |
| AI classifier | 2–4 weeks | High (one LLM call per invocation) | High (prompt tuning, false positive handling) |
| Hooks system | 1–2 weeks | Depends on hook implementation | Medium (needs docs and examples) |

Key decision point: if your Agent doesn't need auto mode (all sensitive operations are user-confirmed), the AI classifier layer can be omitted entirely—it was specifically designed for the "unattended automatic execution" scenario.

---

## Code Landmarks

- `src/utils/permissions/permissions.ts` — Core permission pipeline: `hasPermissionsToUseToolInner()` implements the full step 1a–2b flow, while `hasPermissionsToUseTool()` handles auto/dontAsk mode transitions and classifier invocation at the outer layer
- `src/utils/permissions/denialTracking.ts` — Iron Gate rule: tracks refusal counts (thresholds at 3 consecutive / 20 cumulative), `shouldFallbackToPrompting()` decides whether to trigger downgrade
- `src/utils/permissions/yoloClassifier.ts` — Auto-mode classifier: entry point `classifyYoloAction()`, two-stage XML classifier `classifyYoloActionXml()`, including cache_control and fast-path optimizations
- `src/utils/permissions/filesystem.ts` — bypass-immune implementation: `checkPathSafetyForAutoEdit()`, `DANGEROUS_FILES`, `DANGEROUS_DIRECTORIES` lists, and the concrete judgment logic in `isDangerousFilePathToAutoEdit()`
- `src/utils/permissions/permissionSetup.ts` — Permission initialization: parses tool permission rules from CLI arguments and config files
- `src/hooks/` — Hook system: `PreToolUse` / `PermissionRequest` and other events, forming the programmable extension layer for permissions
