# The Voice System Deep Dive

This chapter unpacks Claude Code's experimental voice-input subsystem: how "hold-to-talk" converts speech into text and seamlessly injects it into the existing text-input pipeline.

---

> **🌍 Industry Context**: Voice input is emerging as a competitive differentiator for AI coding tools. **GitHub Copilot Voice** (previewed in 2022) was the earliest attempt, using the VS Code Speech extension to turn voice into code, but it was English-only and is no longer updated. **Cursor** integrated voice input in version 0.40+ (using the Whisper API), supporting voice interaction in Composer mode. **Aider** supports voice through a third-party `aider-voice` plugin, but it is not an official feature. **Codex (OpenAI)** currently does not support voice input. The key technical challenge of implementing voice input in a CLI terminal is that, unlike a browser, the terminal has no Web Audio API; it must rely on external recording tools (SoX) or platform-native modules. This is the fundamental reason Claude Code's voice architecture is more complex than GUI-editor solutions.

---

## Chapter Guide

The voice system is an experimental subsystem in Claude Code 2.1.88 that lets users replace keyboard typing with "hold-to-talk" voice input. Spoken content is transcribed into text and then submitted to Claude.

**OS Analogy**: The voice system resembles an OS **input-method framework** (such as macOS Input Method Kit or Linux IBus): it intercepts one physical input (voice), processes it in the middle (STT speech recognition), converts it into another input (text), and finally injects it into the existing text-input pipeline. The entire process is completely transparent to the downstream conversation system.

> 💡 **Plain English**: The voice system is like a **smart speaker (Xiao Ai / Alexa)**—first wake it up (activate with a hotkey), then speak (STT converts speech to text), and finally it understands and executes your command (submits to Claude). The difference is that Claude Code does not always listen; you must actively hold down a key.

## Architectural Distribution

Unlike Bridge or Buddy, the voice-system code is not centralized in one directory; it is scattered across multiple locations:

| File | Location | Responsibility |
|------|----------|--------------|
| `voiceModeEnabled.ts` | `src/voice/` | Feature gating and permission checks |
| `voice.ts` | `src/commands/voice/` | `/voice` command implementation (toggle) |
| `voice.tsx` | `src/context/` | Voice state context (React Context) |
| `useVoice.ts` | `src/hooks/` | Core recording + STT connection hook |
| `useVoiceEnabled.ts` | `src/hooks/` | Enabled-state hook |
| `useVoiceIntegration.tsx` | `src/hooks/` | Key-hold detection + text-injection integration hook |
| `voiceStreamSTT.ts` | `src/services/` | Anthropic `voice_stream` endpoint STT client |
| `voiceKeyterms.ts` | `src/services/` | Voice keywords (code-term correction) |
| `VoiceIndicator.tsx` | `src/components/` | Voice recording status indicator |
| `VoiceModeNotice.tsx` | `src/components/` | Voice-mode notice component |

## 1. Triple-Layered Feature Gating

The voice system's enablement checks are among the strictest of all subsystems, involving three independent gate layers.

### 1.1 Compile-Time Gate

`voiceModeEnabled.ts` lines 20-23:

```typescript
export function isVoiceGrowthBookEnabled(): boolean {
  return feature('VOICE_MODE')
    ? !getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_quartz_disabled', false)
    : false
}
```

`feature('VOICE_MODE')` is a Bun compile-time constant. If the `VOICE_MODE` feature is not enabled at build time, this function is dead-code-eliminated (DCE), and none of the voice-system code enters the final artifact.

GrowthBook's `tengu_amber_quartz_disabled` is an **inverted kill-switch**—`false` by default means "not disabled," and setting it to `true` emergency-shuts off the voice feature. Note that `tengu_amber_quartz` is an obfuscated name that actually refers to the voice feature.

### 1.2 Authentication Gate

```typescript
export function hasVoiceAuth(): boolean {
  if (!isAnthropicAuthEnabled()) return false
  const tokens = getClaudeAIOAuthTokens()
  return Boolean(tokens?.accessToken)
}
```

Voice requires Anthropic OAuth authentication—API Key, Bedrock, Vertex, and Foundry are all ineligible. The reason is that STT uses the claude.ai `voice_stream` endpoint, which is only open to OAuth users.

### 1.3 Runtime Complete Check

```typescript
export function isVoiceModeEnabled(): boolean {
  return hasVoiceAuth() && isVoiceGrowthBookEnabled()
}
```

The three gates take effect in layers: `feature('VOICE_MODE')` is resolved at build time; at runtime, `hasVoiceAuth()` (OAuth) is evaluated first, then the GrowthBook remote switch.

## 2. /voice Command Implementation

`src/commands/voice/voice.ts` implements the `/voice` slash command as a **toggle switch**:

```typescript
export const call: LocalCommandCall = async () => {
  // 1. master switch check
  if (!isVoiceModeEnabled()) {
    if (!isAnthropicAuthEnabled()) {
      return { type: 'text', value: 'Voice mode requires a Claude.ai account.' }
    }
    return { type: 'text', value: 'Voice mode is not available.' }
  }

  const isCurrentlyEnabled = currentSettings.voiceEnabled === true

  // 2. disable branch (simple)
  if (isCurrentlyEnabled) {
    updateSettingsForSource('userSettings', { voiceEnabled: false })
    logEvent('tengu_voice_toggled', { enabled: false })
    return { type: 'text', value: 'Voice mode disabled.' }
  }

  // 3. enable branch (requires pre-checks)
  // 3a. check recording hardware availability
  const recording = await checkRecordingAvailability()
  if (!recording.available) return { type: 'text', value: recording.reason }

  // 3b. check voice_stream API availability
  if (!isVoiceStreamAvailable()) return { ... }

  // 3c. check recording tool (SoX)
  const deps = await checkVoiceDependencies()
  if (!deps.available) return { ... }

  // 3d. request microphone permission in advance
  if (!(await requestMicrophonePermission())) return { ... }

  // 4. all checks passed, enable
  updateSettingsForSource('userSettings', { voiceEnabled: true })
}
```

The pre-checks on enablement are very thorough: recording hardware → API → SoX tool → microphone permission. If any step fails, a precise error message is returned.

## 3. Voice State Management

### 3.1 React Context Store

`src/context/voice.tsx` uses a custom Store pattern to manage voice state:

```typescript
export type VoiceState = {
  voiceState: 'idle' | 'recording' | 'processing'  // three-state machine
  voiceError: string | null                         // error message
  voiceInterimTranscript: string                    // real-time transcription
  voiceAudioLevels: number[]                        // audio levels (visualization)
  voiceWarmingUp: boolean                           // warming-up flag
}
```

Three-state machine model:
```
idle --[hold key]--> recording --[release key]--> processing --[result received]--> idle
  ^                                                  |
  |___________[error or timeout]_____________________|
```
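
As a minimal sketch (hypothetical helper names; the actual store wires these transitions inside `useVoice` rather than a standalone reducer), the transitions read as a pure function:

```typescript
// Hypothetical sketch of the transitions above, not the actual store code.
type VoicePhase = 'idle' | 'recording' | 'processing'

type VoiceEvent = 'HOLD_DETECTED' | 'KEY_RELEASED' | 'FINAL_RESULT' | 'ERROR_OR_TIMEOUT'

function nextPhase(current: VoicePhase, event: VoiceEvent): VoicePhase {
  switch (current) {
    case 'idle':
      return event === 'HOLD_DETECTED' ? 'recording' : 'idle'
    case 'recording':
      if (event === 'KEY_RELEASED') return 'processing'       // stop capture, wait for final transcript
      if (event === 'ERROR_OR_TIMEOUT') return 'idle'
      return 'recording'
    case 'processing':
      return event === 'FINAL_RESULT' || event === 'ERROR_OR_TIMEOUT' ? 'idle' : 'processing'
  }
}
```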

### 3.2 External Sync Store

Using the standard React 18+ `useSyncExternalStore` + selector pattern lets consuming components re-render only when the slice of state they care about changes:

```typescript
export function useVoiceState<T>(selector: (state: VoiceState) => T): T {
  const store = useVoiceStore()
  const get = () => selector(store.getState())
  return useSyncExternalStore(store.subscribe, get, get)
}
```
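
For example, a consumer that derives a single boolean from the state re-renders only when that boolean flips (hook name is hypothetical):

```typescript
// Hypothetical consumer of the store above: the selector narrows the snapshot
// to one primitive, so audio-level or interim-transcript updates do not re-render it.
function useIsRecording(): boolean {
  return useVoiceState(state => state.voiceState === 'recording')
}
```

Because `useSyncExternalStore` compares snapshots with `Object.is`, selectors should return primitives or stable references; a selector that builds a fresh object on every call would defeat the optimization.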

This is the standard approach in the React ecosystem for managing external state (used internally by Zustand, Jotai, and similar libraries); we will not elaborate further here.

## 4. Key-Hold Detection and Recording Integration

### 4.1 Hold-Detection Algorithm

`useVoiceIntegration.tsx` implements a fine-grained key-detection logic to distinguish a **hold** from **normal typing**.

Key constants:

```typescript
const RAPID_KEY_GAP_MS = 120;          // auto-repeat threshold (terminal auto-repeat 30-80 ms)
const MODIFIER_FIRST_PRESS_FALLBACK_MS = 2000;  // modifier first-press timeout
const HOLD_THRESHOLD = 5;              // bare character keys need 5 rapid repeats to activate
const WARMUP_THRESHOLD = 2;            // show warmup feedback on the 2nd repeat
```

The detection strategy splits into two cases:

**Modifier combinations** (e.g., Ctrl+Space): activate on the first press, because modifier combos are never normal typing.

**Bare character keys** (e.g., Space): require 5 rapid repeats to activate, preventing normal typing from being misclassified as a "hold."

The default configuration is a bare Space key. All modifiers are `false`, so the default trigger is not a modifier combo; instead, Space must register **5 rapid repeats** (holding it down so the terminal's auto-repeat fires, or tapping it in quick succession) before recording activates:

```typescript
const DEFAULT_VOICE_KEYSTROKE: ParsedKeystroke = {
  key: ' ', ctrl: false, alt: false, shift: false, meta: false, super: false
}
```

Users can change this in settings to a modifier combo (e.g., `ctrl: true`), at which point the strategy switches to "activate on first press."

The **progressive-feedback design of `WARMUP_THRESHOLD`** deserves special attention. `WARMUP_THRESHOLD = 2` means warmup feedback fires on the second rapid repeat (`voiceWarmingUp: true` triggers a UI hint), instead of waiting for the fifth. This is a **perceived-latency optimization**: by the second repeat the user already knows "the system heard me and is getting ready," rather than holding the key with zero feedback until recording finally starts. In a terminal environment, where rich visual feedback (hover tips, animation transitions, etc.) is unavailable, this early feedback has an outsized impact on user experience.
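
A minimal sketch of this counting logic under the constants above (function and variable names are hypothetical; the real hook also tracks release timing and state-machine integration):

```typescript
// Hypothetical sketch: presses arriving within RAPID_KEY_GAP_MS of each other count
// as one "hold" streak (the terminal's 30-80 ms auto-repeat easily qualifies).
let repeatCount = 0
let lastPressAt = 0

function onVoiceKey(now: number, isModifierCombo: boolean): 'ignore' | 'warmup' | 'activate' {
  if (isModifierCombo) return 'activate'                    // combos can never be normal typing

  repeatCount = now - lastPressAt <= RAPID_KEY_GAP_MS ? repeatCount + 1 : 1
  lastPressAt = now

  if (repeatCount >= HOLD_THRESHOLD) return 'activate'      // 5th rapid repeat: start recording
  if (repeatCount >= WARMUP_THRESHOLD) return 'warmup'      // 2nd repeat: set voiceWarmingUp
  return 'ignore'                                           // still looks like normal typing
}
```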

### 4.2 Keyboard Event Matching

The `matchesKeyboardEvent` function (lines 61-73) handles various terminal keyboard-event pitfalls:

```typescript
function matchesKeyboardEvent(e: KeyboardEvent, target: ParsedKeystroke): boolean {
  const key = e.key === 'space' ? ' ' 
            : e.key === 'return' ? 'enter' 
            : e.key.toLowerCase()
  if (key !== target.key) return false
  if (e.ctrl !== target.ctrl) return false
  if (e.shift !== target.shift) return false
  // KeyboardEvent.meta in the terminal conflates alt/option (ESC-prefix limitation)
  if (e.meta !== (target.alt || target.meta)) return false
  if (e.superKey !== target.super) return false
  return true
}
```

The comment reveals a terminal limitation: `KeyboardEvent.meta` cannot distinguish between Alt and Meta (because the terminal uses an ESC prefix for both), so they are treated as the same key.

## 5. STT Speech Recognition

### 5.1 voice_stream Endpoint

Speech recognition uses Anthropic's `voice_stream` endpoint, streaming over WebSocket.

From the client code in `voiceKeyterms.ts`, we can see that the keyword list is passed to the server in Deepgram's `keywords` parameter format (`term:boost` weight pairs), and the WebSocket initialization message field names (e.g., `interim_results`, `keywords`) closely match the Deepgram Streaming API. This strongly suggests the backend STT engine uses Deepgram, but no direct Deepgram domain or SDK reference exists in the source code, so this judgment is a **reasonable inference rather than a confirmed fact**.

### 5.2 Complete STT Pipeline Data Flow

From the user holding down the key to text appearing in the input box, the entire STT pipeline passes through the following stages:

```
microphone → audio capture (SoX/native module) → PCM encoding → WebSocket binary frames → voice_stream endpoint
                                                                                             ↓
input box ← text injection ← final transcript ← interim transcript ← STT engine recognition ← server-side decoding
```

**Stage 1: Audio Capture**. The `useVoice` hook starts a recording process in the `recording` state—macOS uses a native audio module that directly calls the Core Audio API, while Linux uses the SoX command-line tool (`rec` command) for capture. Audio is sampled in PCM (Pulse-Code Modulation) format, which is uncompressed raw audio data.

**Stage 2: WebSocket Streaming**. `voiceStreamSTT.ts` establishes the WebSocket connection to the `voice_stream` endpoint. When initializing the connection, it sends a config message containing:
- `language`: the user's selected language code (BCP-47 format)
- `keywords`: the domain keyword list from `voiceKeyterms.ts` and their weights
- `interim_results: true`: requests intermediate recognition results from the server

Audio data is sent to the server in real time as binary frames, rather than waiting until recording ends to upload in one batch—this is the key difference between streaming STT and batch STT. Users can see words appear gradually while they are still speaking.

**Stage 3: Result Reception and State Updates**. The server returns recognition results as JSON messages over the same WebSocket connection. `voiceStreamSTT.ts` parses these messages, distinguishes interim and final results, and updates the corresponding fields in `VoiceState`. After the user releases the key, the client sends an end signal; the server finishes processing the last audio segment and returns the final transcript.

**Stage 4: Text Injection**. `useVoiceIntegration.tsx` watches `VoiceState` changes and, upon receiving the final transcript, injects the text into the `InputPrompt` text box, merging seamlessly with text entered from the keyboard.
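
A condensed sketch of Stages 2-4, using the `ws` client for illustration. The endpoint URL, the end-of-audio signal, and any response field names beyond those cited above (`language`, `keywords`, `interim_results`) are assumptions, not the actual protocol:

```typescript
import WebSocket from 'ws'

// Hypothetical streaming client; message shapes are illustrative, not the real protocol.
function startVoiceStream(opts: {
  url: string                                // the voice_stream endpoint
  accessToken: string                        // OAuth access token (Section 1.2)
  language: string                           // BCP-47 code, e.g. 'en'
  keywords: string[]                         // "term:boost" pairs from voiceKeyterms
  onInterim: (text: string) => void
  onFinal: (text: string) => void
}) {
  const ws = new WebSocket(opts.url, {
    headers: { Authorization: `Bearer ${opts.accessToken}` },  // OAuth during handshake
  })

  ws.on('open', () => {
    // Stage 2: config message precedes any audio frames
    ws.send(JSON.stringify({
      language: opts.language,
      keywords: opts.keywords,
      interim_results: true,
    }))
  })

  ws.on('message', raw => {
    // Stage 3: split interim vs final results ("is_final" is an assumed field name)
    const msg = JSON.parse(raw.toString())
    if (msg.is_final) opts.onFinal(msg.transcript)
    else opts.onInterim(msg.transcript)
  })

  return {
    sendAudio: (pcmChunk: Buffer) => ws.send(pcmChunk),          // binary frames during recording
    finish: () => ws.send(JSON.stringify({ type: 'finalize' })), // assumed end signal on key release
  }
}
```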

> 💡 **Plain English**: This flow is like **live subtitles**—while watching a foreign film, a translator types captions as they listen (interim) and then proofreads the final version (final). The difference in Claude Code is that the "subtitles" are your own words, and once confirmed they become instructions for Claude.

### 5.3 Connection Management and Error Recovery

The WebSocket connection in `voiceStreamSTT.ts` must handle several edge cases:

- **Authentication**: Uses the OAuth access token in the `Authorization` header during the WebSocket handshake; when the token expires, it must be refreshed before reconnecting.
- **Connection drop**: When network fluctuation causes the WebSocket to disconnect, the current recording session ends, the state machine returns to `idle`, and the user must hold the key again to trigger a new session.
- **Unsupported language**: If the client sends a language code the server does not allow, the WebSocket closes with status code 1008 (Policy Violation).

There is no auto-reconnect mechanism here—this is a pragmatic design choice. Voice input is a short interaction (usually a few seconds to a few tens of seconds), unlike chat scenarios that require a long-lived connection. If the connection drops, having the user hold the key again to start a new session aligns better with the hold-to-talk interaction expectation than silently reconnecting in the background.
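
Continuing the sketch above, a close handler consistent with this behavior could look like the following (`setVoiceError` and `setVoicePhase` stand in for the real store updates):

```typescript
// Hypothetical close handling: every close ends the session; there is no auto-reconnect.
ws.on('close', (code: number, reason: Buffer) => {
  if (code === 1008) {
    setVoiceError(`Unsupported language: ${reason.toString()}`)   // policy violation from server
  } else if (code !== 1000) {
    setVoiceError('Voice connection lost. Hold the key again to retry.')
  }
  setVoicePhase('idle')   // back to idle; the next hold starts a fresh session
})
```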

### 5.4 Multilingual Support

`useVoice.ts` lines 42-89 define a mapping from language names to BCP-47 codes:

```typescript
const LANGUAGE_NAME_TO_CODE: Record<string, string> = {
  english: 'en',
  spanish: 'es', español: 'es', espanol: 'es',
  french: 'fr', français: 'fr', francais: 'fr',
  japanese: 'ja', 日本語: 'ja',
  german: 'de', deutsch: 'de',
  korean: 'ko', 한국어: 'ko',
  hindi: 'hi', हिन्दी: 'hi',
  russian: 'ru', русский: 'ru',
  chinese: 'zh', 中文: 'zh',   // note: confirm server-side support
  // ... 20+ languages total
}
```

Each language supports both the English name and the native name (e.g., `français` / `francais` / `french` all map to `fr`).

The critical constraint is noted in the comment:

```
// This list must be a SUBSET of the server-side supported_language_codes allowlist
// (GrowthBook: speech_to_text_voice_stream_config).
// If the CLI sends a code the server rejects, the WebSocket closes with
// 1008 "Unsupported language" and voice breaks.
```

If the client sends a language code the server does not support, the WebSocket closes with status 1008—so the client must conservatively stick to known-supported languages.
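
A sketch of the conservative resolution this implies (the fallback-to-English behavior is an assumption, not confirmed from the source):

```typescript
// Hypothetical lookup: only emit codes the client-side map knows about;
// anything unrecognized falls back to 'en' rather than risking a 1008 close.
function resolveLanguageCode(userLanguage: string | undefined): string {
  if (!userLanguage) return 'en'
  const normalized = userLanguage.trim().toLowerCase()
  return LANGUAGE_NAME_TO_CODE[normalized] ?? 'en'
}
```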

> **📚 Course Connection**: The streaming STT interim/final dual-result model is more accurately analogized to **Optimistic UI rendering**—interim results are shown immediately to eliminate perceived latency, and when the final result arrives it **fully replaces** (rather than merges or converges with) the interim content. This differs from "eventual consistency" in distributed systems: eventual consistency emphasizes multiple replicas eventually converging to the same state, whereas STT interim results are directly overwritten by the final result.

### 5.5 Real-Time Intermediate Results and Audio Levels

As described in Section 5.2, STT returns two kinds of results:
- **interim transcript**: real-time intermediate results displayed in the input box (users can see words "growing"), corresponding to the `voiceInterimTranscript` field in `VoiceState`
- **final transcript**: the final confirmed result, which fully replaces the interim content before being submitted to Claude

A distinction worth emphasizing: `voiceAudioLevels` in `VoiceState` is **locally captured** audio-level data on the client side, used to drive the waveform visualization animation in the `VoiceIndicator` component. It is a completely independent data stream from the STT transcript: the former comes from real-time amplitude sampling of the local microphone input, while the latter comes from the server's speech-recognition result. Even if the WebSocket connection is delayed or interrupted, the audio-level animation continues to display normally because it does not depend on the server response.
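
As an illustration of how such a level could be derived locally from 16-bit PCM (a hypothetical helper; the actual sampling code may differ):

```typescript
// Hypothetical local level meter: a 0-1 RMS amplitude per PCM chunk.
// It never touches the network, which is why the waveform keeps animating
// even if the STT WebSocket stalls.
function levelFromPcm16(chunk: Buffer): number {
  const samples = Math.floor(chunk.length / 2)
  let sumSquares = 0
  for (let i = 0; i < samples; i++) {
    const s = chunk.readInt16LE(i * 2) / 32768   // normalize to [-1, 1)
    sumSquares += s * s
  }
  return samples === 0 ? 0 : Math.sqrt(sumSquares / samples)
}
```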

## 6. Domain Vocabulary Correction (voiceKeyterms)—The Core Competency of Voice Coding

In programming scenarios, STT misrecognition of code terms is the **number-one killer of the voice-input experience**. When a general-purpose voice model hears "git" as "get", "npm" as "and PM", or turns "kubectl" into something entirely unrelated, developers quickly abandon voice input. `voiceKeyterms.ts` is the module that addresses this problem.

### 6.1 Composition of the Keyword List

`voiceKeyterms.ts` maintains a list of high-frequency programming terms, covering multiple categories:

- **Tools and runtimes**: `npm`, `yarn`, `pnpm`, `bun`, `deno`, `webpack`, `vite`, `eslint`, `prettier`
- **Version control**: `git`, `GitHub`, `GitLab`, `rebase`, `cherry-pick`, `stash`
- **Languages and frameworks**: `TypeScript`, `JavaScript`, `React`, `Vue`, `Svelte`, `Next.js`, `Rust`, `Go`
- **Claude Code terminology**: `Claude`, `Anthropic`, `MCP`, `tool use`, `slash command`

### 6.2 Weight Passing Mechanism

Keywords are passed to the server-side STT engine via the WebSocket initialization message in `term:boost` format:

```typescript
// keyword format example (Deepgram keywords parameter format)
keywords: ["TypeScript:2", "npm:1.5", "git:2", "React:1.5", ...]
```

The `boost` value controls how strongly the STT engine prefers this keyword when deciding between multiple candidate words. A higher boost means that when the speech is ambiguous (e.g., "git" and "get" sound extremely similar), the engine is more likely to choose the word from the keyword list.
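
A sketch of how such a list could be assembled (the exact terms and boost values in `voiceKeyterms.ts` may differ):

```typescript
// Hypothetical assembly: terms that collide with common English words get a higher boost.
const KEYTERMS: Array<[term: string, boost: number]> = [
  ['git', 2], ['TypeScript', 2], ['npm', 1.5], ['React', 1.5], ['kubectl', 2],
]

const keywords = KEYTERMS.map(([term, boost]) => `${term}:${boost}`)
// → ["git:2", "TypeScript:2", "npm:1.5", "React:1.5", "kubectl:2"]
```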

### 6.3 Why This Is a Core UX Differentiator

For a general voice assistant, misrecognizing "git" as "get" is a minor blemish—context usually lets the user figure out the intended meaning. But for **programming voice input**, every term can directly become code or a command:

- If a user says "git rebase main" and it is recognized as "get re-base main," the injected prompt gives Claude a completely different instruction.
- If a user says "use pnpm to install dependencies" and "pnpm" is recognized as "p-n-p-m" or a meaningless word, Claude may fail to execute correctly.

This makes domain vocabulary correction a **must-have foundational capability** rather than a nice-to-have. Claude Code solves the out-of-the-box problem with a prebuilt keyword list, but there is currently no user-customizable keyword interface—for teams using niche toolchains (e.g., specific internal CLI tools), this is a noteworthy extension point.

## 7. Conditional Loading and Dead Code Elimination

Lines 21-33 of `useVoiceIntegration.tsx` use the `feature('VOICE_MODE') ? require(real) : { stub }` pattern for conditional loading:

```typescript
const voiceNs: {
  useVoice: typeof import('./useVoice.js').useVoice;
} = feature('VOICE_MODE')
  ? require('./useVoice.js')
  : {
      useVoice: ({ enabled: _e }) => ({
        state: 'idle' as const,
        handleKeyEvent: (_fallbackMs?: number) => {}
      })
    };
```

This is the standard pattern for feature flags + conditional `require`, serving two core purposes:

1. **React Hooks rule compliance**: React requires that Hooks be called unconditionally and in the same order on every render, never inside conditional branches. With the stub function, the component can call `useVoice` unconditionally, avoiding violations (see the sketch after this list).
2. **Test spy compatibility**: Capturing the module namespace object (`voiceNs`) rather than destructuring the exported function ensures `spyOn(voiceNs, 'useVoice')` can intercept correctly. This solves the common pitfall in ESM live-binding environments where direct destructured imports break spies.
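
A sketch of the call site this enables (hook and return names follow the stub shown above; the real hook's surface is larger):

```typescript
// Hypothetical call site: voiceNs.useVoice is always a function (real or stub),
// so the component calls it unconditionally and stays within the Rules of Hooks.
function useVoiceIntegrationSketch(enabled: boolean) {
  const { state, handleKeyEvent } = voiceNs.useVoice({ enabled })   // never behind an if
  return { isRecording: state === 'recording', handleKeyEvent }
}
```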

## 8. Recording Toolchain

Voice recording depends on external tools:

- **macOS**: uses a native audio module
- **Linux / others**: requires SoX (Sound eXchange)

The `/voice` command checks these dependencies when enabling; if missing, it provides installation instructions:

```typescript
const deps = await checkVoiceDependencies()
if (!deps.available) {
  const hint = deps.installCommand
    ? `\nInstall audio recording tools? Run: ${deps.installCommand}`
    : '\nInstall SoX manually for audio recording.'
  return { type: 'text', value: `No audio recording tool found.${hint}` }
}
```
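
A sketch of what this dependency probe could look like using Bun's shell API (the real `checkVoiceDependencies` and its install hints may differ):

```typescript
import { $ } from 'bun'

// Hypothetical probe: look for SoX's `rec` binary on PATH and suggest an install command.
async function checkSox(): Promise<{ available: boolean; installCommand?: string }> {
  const { exitCode } = await $`which rec`.quiet().nothrow()
  if (exitCode === 0) return { available: true }
  const installCommand =
    process.platform === 'darwin' ? 'brew install sox' : 'sudo apt-get install sox'
  return { available: false, installCommand }
}
```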

## Critical Analysis

### Strengths

1. **Progressive enablement**: compile-time gate → auth gate → remote switch → hardware check → tool check → permission check. Six layers of defense ensure users never encounter the awkward "feature is on but doesn't work" experience.
2. **Precise key-hold detection**: distinguishing modifier combos (immediate activation) from bare character keys (require holding) avoids false triggers, which is notoriously hard to get right in a terminal environment.
3. **Conditional loading**: the combination of compile-time feature flag + runtime require/noop stub yields zero overhead when disabled, while remaining compliant with React Hooks rules.
4. **Multilingual support**: accepting both English and localized names (e.g., `日本語`, `русский`) lowers the configuration barrier.

### Weaknesses

1. **Scattered code**: voice-related code is spread across 10+ files in 6 directories, lacking a unified entry point or architecture document, making it hard for new developers to understand.
2. **Obfuscated naming**: `tengu_amber_quartz_disabled` prevents outsiders from guessing the feature's meaning, but it also hurts internal code readability.
3. **Mandatory OAuth dependency**: voice is fully tied to Anthropic OAuth; enterprise users on API Key cannot use voice. This restriction could be lifted if the STT endpoint were independent of the auth system.
4. **`src/voice/` contains only one file**: the entire `src/voice/` directory only holds `voiceModeEnabled.ts` (55 lines). The directory has little reason to exist and could be merged elsewhere.
5. **SoX dependency**: relying on SoX on Linux is an old-school choice. Modern Linux desktop environments have native recording APIs via PipeWire/PulseAudio, and SoX is often missing in container/WSL environments.
6. **Terminal keyboard limitations**: because the terminal cannot distinguish Alt from Meta, available voice shortcut combinations are restricted—an inherent flaw of terminal applications, but one worth documenting explicitly.

### Technical Positioning

The voice system is clearly in an **early experimental stage**—scattered code organization, a single-file directory, and compile-time feature gating all indicate this feature is still iterating rapidly. Voice input is no longer novel in AI coding tools (GitHub Copilot Voice and Cursor both have precedents), but implementing a complete hold-to-talk + streaming STT + text-injection pipeline inside a CLI terminal makes Claude Code the most complete solution to date. When your hands are occupied with code in the terminal, voice becomes a natural supplementary input modality—the entire industry is exploring this direction.
