Telemetry and Observability Deconstructed

Telemetry is not an afterthought in Claude Code—it is a product feature. More than 7,400 lines of code implement **five independent data-egress channels** (standard OTLP, BigQuery metrics, Beta detailed tracing, Datadog event logs, and 1P First-Party event reporting), serving five distinct consumers: enterprise users integrating with their own observability platforms, Anthropic product teams analyzing usage patterns, internal debug teams visualizing agent hierarchies, operations teams monitoring in real time, and compliance auditors archiving events. This chapter dissects the architecture of these five channels, the eight OTel Counters, the Span hierarchy, the privacy data-governance mechanisms, and the Perfetto Chrome Trace–based agent visualization system.

> **Source locations**: `src/utils/telemetry/` (4 core files, 3,363 lines), `src/services/analytics/` (9 files, 4,040 lines)

> 💡 **Plain English**: The telemetry system is like a car equipped with five different cameras. The dashcam is for the owner (OTLP enterprise export), the black box is for the service center (BigQuery product metrics), the high-speed action camera is for engineers (Beta detailed tracing), real-time telemetry goes to a remote monitoring center (Datadog event logs), and the complete trip log is archived in the cloud (1P event reporting). Each camera records independently without interfering with the others, yet all are governed by a single "privacy switch"—you can turn off any part you do not want recorded at any time.

### 🌍 Industry Context: Observability Practices for AI Tools

Observability is standard practice in distributed systems, but AI programming tools vary widely in their investment:

- **LangChain/LangSmith**: The industry benchmark for AI observability. LangSmith offers complete trace visualization, token-usage tracking, and latency analysis. Claude Code's Span hierarchy is conceptually very similar to LangSmith's Run Tree—both model AI call chains as parent-child Span trees.
- **Cursor**: Internal telemetry is invisible to users, with no OTLP enterprise-export capability. Users only see coarse-grained token-usage stats.
- **Aider**: Provides command-level logs and token counting, but no structured trace system; debugging relies on text logs.
- **GitHub Copilot**: The server side has comprehensive telemetry (used to improve the model), but client-side observability is limited; enterprise admins view usage statistics through the GitHub Admin Console.
- **OpenTelemetry (OTel)**: Claude Code's choice of OTel as the enterprise telemetry standard is the correct industry decision—OTel is a graduated CNCF project supported by Datadog, Grafana, Honeycomb, and other major platforms. Codex (OpenAI) currently offers no OTel integration.
- **Weights & Biases / Helicone**: Domain-specific observability platforms for AI that provide prompt tracing, cost analysis, and more, but require additional integration.

Claude Code's five parallel data channels (OTLP + BigQuery + Beta Tracing + Datadog + 1P) represent a fairly complex implementation among AI programming tools. OTel integration is a baseline requirement for enterprise software in 2025; Claude Code's support demonstrates it is "enterprise-ready." The choice of Perfetto Chrome Trace is also pragmatic—it reuses a mature performance-analysis toolchain instead of building a custom visualizer. It is worth noting that these five channels are more likely the organic result of multiple iterations than a one-time unified architecture—each path has its own serialization format, batching policy, retry logic, and sampling mechanism.

---

## Overview

Claude Code's telemetry system is not an afterthought—it is **a product feature**. Across more than 7,400 lines of code, the system implements **five independent data-egress channels**, eight OTel Counters, a carefully designed Span hierarchy (interaction → llm_request / tool → blocked_on_user / execution), a type-system–based privacy data-governance mechanism, and Perfetto Chrome Trace–based agent visualization. Each channel has its own consumer:

| Channel | Consumer | Code volume |
|---------|----------|-------------|
| OTLP exporters | Enterprise users (self-hosted Datadog/Grafana/Honeycomb) | `instrumentation.ts` 825 lines |
| BigQuery metrics | Anthropic product teams (usage-pattern analysis) | Inside `instrumentation.ts` |
| Beta detailed tracing | Internal debug teams (Span-level debug traces) | `betaSessionTracing.ts` 491 lines |
| Datadog event logs | Operations teams (real-time monitoring and alerting) | `datadog.ts` 307 lines |
| 1P First-Party event reporting | Compliance auditing + product analysis (protobuf event archive) | `firstPartyEventLoggingExporter.ts` 806 lines + `firstPartyEventLogger.ts` 449 lines |

---

> **[Figure placeholder 3.8-A]**: The five data-egress channels—OTLP exporters, BigQuery metrics, Beta detailed tracing, Datadog event logs, and 1P event reporting—shown in parallel, annotated with activation conditions, data formats, transport protocols, and privacy levels.

> **[Figure placeholder 3.8-B]**: Span hierarchy diagram—interaction → llm_request / tool → blocked_on_user / execution parent-child relationships.

---

## 1. Initialization and the Five Data-Egress Channels

### 1.1 Startup Flow

`initializeTelemetry()` (`instrumentation.ts:421-701`) is the telemetry entry point; it activates the first three paths conditionally. `initializeAnalyticsSink()` (`sink.ts`) activates the remaining two paths at application startup. The complete comparison of all five channels:

| Channel | Activation condition | Data format | Transport protocol | Batching policy | Failure handling | Privacy level |
|---------|---------------------|-------------|-------------------|-----------------|------------------|---------------|
| OTLP exporters | `CLAUDE_CODE_ENABLE_TELEMETRY` env var | OTel spans/metrics/logs | OTLP HTTP | Metrics 60s, Logs 5s, Traces 5s | OTel SDK standard retry | User-controlled |
| BigQuery metrics | 1P API (non-Bedrock/Vertex/**Foundry**) or C4E or Teams | OTel metrics | Periodic reader | 5 minutes | — | Anthropic internal |
| Beta detailed tracing | `ENABLE_BETA_TRACING_DETAILED=1` + `BETA_TRACING_ENDPOINT` + org allowlist | OTel spans + custom attributes | OTLP HTTP (defaults to Honeycomb, but endpoint is configurable) | BatchSpanProcessor | OTel SDK standard | Thinking output ant-only |
| Datadog event logs | `tengu_log_datadog_events` GrowthBook gate + not killswitched | JSON logs | HTTPS POST (`DATADOG_LOGS_ENDPOINT`) | Flush every 15s or 100 messages | Fire-and-forget (failure only logs an error) | **Strips `_PROTO_*` fields** |
| 1P First-Party event reporting | Not `isAnalyticsDisabled()` + not killswitched | Protobuf (`ClaudeCodeInternalEvent`) | HTTPS POST (`/api/event_logging/batch`) | OTel BatchLogRecordProcessor (5s or 200 records) | **Persistent retry** (JSONL local storage + quadratic backoff) | **Receives full `_PROTO_*` fields** |

This table reveals a critical architectural trait: **the five channels operate independently**. Each path has its own serialization format, batching policy, retry logic, and sampling mechanism—more the result of successive layering than a unified pipeline with multiple sinks. Likely reasons: (a) historical independent evolution, (b) different reliability requirements (1P has persistent retry; Datadog is fire-and-forget), and (c) different data-sensitivity levels (`_PROTO_*` fields only travel on 1P).

### 1.2 BigQuery Metrics Activation Conditions

`isBigQueryMetricsEnabled()` (`instrumentation.ts:336-347`) is enabled only for:
- 1P API customers (**excluding** Bedrock, Vertex, and **Foundry** users—`config.ts:24`'s `isAnalyticsDisabled()` explicitly excludes `CLAUDE_CODE_USE_FOUNDRY`)
- Claude for Enterprise (C4E) users
- Claude for Teams users

Note that Foundry, like Bedrock and Vertex, is a third-party-hosted deployment path (gated by `CLAUDE_CODE_USE_FOUNDRY`). Excluding it means the product behavior of enterprise users who deploy through Foundry is invisible to Anthropic product teams—an intentional data-isolation boundary.

### 1.3 OTLP Exporter Configuration

`getOTLPExporterConfig()` (`instrumentation.ts:749-825`) supports:
- mTLS (mutual TLS)
- CA certificate chains
- HTTP proxies
- Dynamic headers (generated via the `otelHeadersHelper` script)

### 1.4 Datadog Event Log Channel

`datadog.ts` (307 lines) implements an independent event-log pipeline that batches and reports events via `DATADOG_LOGS_ENDPOINT` (`https://http-intake.logs.us5.datadoghq.com/api/v2/logs`).

**Key design details**:

- **Event allowlist**: `DATADOG_ALLOWED_EVENTS` is a hardcoded Set containing roughly 30 `tengu_*` events (e.g., `tengu_init`, `tengu_api_error`, `tengu_exit`, `tengu_tool_use_success`) and the `chrome_bridge_*` series. Only allowlisted events are sent—this is both a security boundary and a cost control.
- **Batch sending**: Flush every 15 seconds or 100 messages (`DEFAULT_FLUSH_INTERVAL_MS = 15000`, `MAX_BATCH_SIZE = 100`).
- **Privacy isolation**: Datadog is a general-access backend, so all events sent to it are processed through `stripProtoFields()` in `sink.ts`—any PII-tagged fields prefixed with `_PROTO_*` are fully stripped.
- **Hardcoded client token**: `DATADOG_CLIENT_TOKEN` is embedded directly in source. It is a client token (not an API key), so the security risk is limited, but once the source enters public circulation the value is exposed.
- **GrowthBook gate control**: Remote on/off switch via `tengu_log_datadog_events`.
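
The sketch below condenses this batching policy. The constants, allowlist examples, and endpoint come from the description above; the queue plumbing, the event-object shape, and the omission of the client-token header are illustrative assumptions, not the actual implementation.

```typescript
// Sketch of the Datadog batching policy described above. Constants mirror the text;
// queue/flush plumbing and the event shape are assumptions. The real sender also
// attaches the hardcoded client token header (omitted here).
const DATADOG_ALLOWED_EVENTS = new Set(['tengu_init', 'tengu_api_error', 'tengu_exit', 'tengu_tool_use_success'])
const DEFAULT_FLUSH_INTERVAL_MS = 15_000
const MAX_BATCH_SIZE = 100

const queue: Array<Record<string, unknown>> = []
let flushTimer: ReturnType<typeof setTimeout> | undefined

function trackDatadogEventSketch(eventName: string, metadata: Record<string, unknown>): void {
  if (!DATADOG_ALLOWED_EVENTS.has(eventName)) return // only allowlisted events are ever sent
  queue.push({ event: eventName, ...metadata })      // field naming is assumed
  if (queue.length >= MAX_BATCH_SIZE) {
    void flush()                                     // flush at 100 messages...
  } else if (!flushTimer) {
    flushTimer = setTimeout(() => void flush(), DEFAULT_FLUSH_INTERVAL_MS) // ...or every 15 seconds
  }
}

async function flush(): Promise<void> {
  if (flushTimer) { clearTimeout(flushTimer); flushTimer = undefined }
  const batch = queue.splice(0, queue.length)
  if (batch.length === 0) return
  try {
    await fetch('https://http-intake.logs.us5.datadoghq.com/api/v2/logs', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(batch),
    })
  } catch (err) {
    console.error('datadog flush failed', err) // fire-and-forget: failures are only logged
  }
}
```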

### 1.5 1P First-Party Event Reporting Channel

`firstPartyEventLoggingExporter.ts` (806 lines) + `firstPartyEventLogger.ts` (449 lines) implement the most complex data-reporting channel.

**Reliability mechanisms**—the only channel among the five with full failure recovery:

1. **Append-only failed-event storage**: Events that fail to send are appended as JSONL to a local file (`~/.claude/telemetry/1p_failed_events.<uuid>.jsonl`) and can be re-read after process restart.
2. **Quadratic backoff**: Retry intervals grow as `baseDelay * attempt^2` after failures, preventing backend overload.
3. **Retry-queue flush on success**: When an export succeeds—indicating the endpoint is healthy—the system immediately attempts to send previously failed events.
4. **401 auth degradation**: On authentication errors, the system retries without authentication—a defensive strategy to prevent event loss during OAuth token expiration windows.
5. **Large-batch chunking**: Oversized event sets are split into smaller batches to avoid overly large request bodies.
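
A minimal sketch of mechanisms 1 and 2—append-only JSONL persistence plus quadratic backoff—follows. `sendBatch()`, the attempt count, and the exact file layout are assumptions for illustration; the real exporter speaks protobuf and also drains the retry queue on success.

```typescript
import { appendFile } from 'node:fs/promises'
import { setTimeout as sleep } from 'node:timers/promises'

// Hypothetical path; the real file name embeds a uuid (1p_failed_events.<uuid>.jsonl).
const FAILED_EVENTS_FILE = `${process.env.HOME}/.claude/telemetry/1p_failed_events.example.jsonl`
const BASE_DELAY_MS = 1_000

async function exportWithRecovery(events: object[], maxAttempts = 3): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await sendBatch(events)                    // POST /api/event_logging/batch
      return                                     // success: the real code would now flush the retry queue
    } catch {
      await sleep(BASE_DELAY_MS * attempt ** 2)  // quadratic backoff: 1s, 4s, 9s, ...
    }
  }
  // Out of attempts: persist as append-only JSONL so a later process can replay the events.
  await appendFile(FAILED_EVENTS_FILE, events.map(e => JSON.stringify(e)).join('\n') + '\n')
}

declare function sendBatch(events: object[]): Promise<void> // hypothetical transport
```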

**Data format**: The 1P channel uses Protobuf (`ClaudeCodeInternalEvent`). Events are converted via `to1PEventFormat()` and sent to the `/api/event_logging/batch` endpoint.

### 1.6 Sink Unified Scheduling (`sink.ts`)

`sink.ts` (114 lines) is the unified scheduling layer for the Datadog and 1P channels. Its core logic:

```
logEventImpl(eventName, metadata)
  ├── shouldSampleEvent(eventName) → sampling decision
  ├── shouldTrackDatadog() → Datadog channel
  │   └── trackDatadogEvent(eventName, stripProtoFields(metadata))  // strips PII
  └── logEventTo1P(eventName, metadataWithSampleRate)  // keeps full _PROTO_* fields
```

Note the critical comment on lines 66-67: Datadog is a "general-access backend," so `stripProtoFields()` strips PII-tagged fields; 1P receives the full payload, and the exporter routes `_PROTO_*` keys into privileged protobuf columns.

### 1.7 Event Sampling System

`firstPartyEventLogger.ts:57-80` implements dynamic sampling remotely controlled via GrowthBook:

- **Config name**: `tengu_event_sampling_config`—a JSON object mapping each event type to a `sample_rate` (a number between 0 and 1).
- **Unconfigured events default to 100% sampling**—recorded in full by default.
- **Events with a 0 sample rate are completely dropped**—a variant of a remote killswitch.
- **Sample rate attached to metadata**: Sampled events have a `sample_rate` field appended to their metadata, allowing downstream analysis to extrapolate totals.

This means Anthropic can adjust telemetry granularity **without shipping a new release**—dynamically balancing cost against information completeness. This remotely controllable sampling architecture offers far greater operational flexibility than a static sampling rate.
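
A sketch of the sampling decision under the configuration shape described above; the accessor name mirrors the `getDynamicConfig_CACHED_MAY_BE_STALE` call shown in the killswitch snippet below, and the exact types are assumptions.

```typescript
// Sketch of per-event sampling driven by the tengu_event_sampling_config dynamic config.
type SamplingConfig = Partial<Record<string, { sample_rate: number }>>

function sampleDecision(eventName: string): { send: boolean; sampleRate: number } {
  const config = getDynamicConfig_CACHED_MAY_BE_STALE<SamplingConfig>('tengu_event_sampling_config', {})
  const rate = config[eventName]?.sample_rate ?? 1       // unconfigured events default to 100%
  if (rate <= 0) return { send: false, sampleRate: 0 }   // a 0 rate acts as a per-event killswitch
  return { send: Math.random() < rate, sampleRate: rate }
}

// Sampled events carry their rate so downstream analysis can extrapolate totals.
function withSampleRate(metadata: Record<string, unknown>, sampleRate: number) {
  return { ...metadata, sample_rate: sampleRate }
}

declare function getDynamicConfig_CACHED_MAY_BE_STALE<T>(name: string, fallback: T): T
```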

### 1.8 Sink Killswitch Mechanism

`sinkKillswitch.ts` (25 lines) implements an emergency shutdown via the GrowthBook dynamic config `tengu_frond_boric`, which can remotely disable specific data channels.

```typescript
// Obfuscated name to prevent discovery and abuse by users
const SINK_KILLSWITCH_CONFIG_NAME = 'tengu_frond_boric'
export type SinkName = 'datadog' | 'firstParty'

export function isSinkKilled(sink: SinkName): boolean {
  const config = getDynamicConfig_CACHED_MAY_BE_STALE<
    Partial<Record<SinkName, boolean>>
  >(SINK_KILLSWITCH_CONFIG_NAME, {})
  return config?.[sink] === true
}
```

**Design points**:

- **Fail-open policy**: If the config is missing or malformed, the default is an empty object `{}`, meaning all sinks remain on. This is an intentional choice—better to over-send data than silently drop it.
- **Obfuscated naming**: `tengu_frond_boric` is semantically meaningless, deliberately chosen to avoid discovery and manual disabling by end users in GrowthBook configurations.
- **Purpose**: This is an incident-response mechanism. When a data pipeline breaks (e.g., the Datadog endpoint is down and causing client timeouts), the channel can be severed immediately without restarting clients.

## 2. The Eight Counters

`state.ts:952-989` defines the core metrics:

| Counter | Metric name | Unit | Dimensions |
|---------|-------------|------|------------|
| `sessionCounter` | `claude_code.session.count` | count | — |
| `locCounter` | `claude_code.lines_of_code.count` | count | `type`: added/removed |
| `prCounter` | `claude_code.pull_request.count` | count | — |
| `commitCounter` | `claude_code.commit.count` | count | — |
| `costCounter` | `claude_code.cost.usage` | USD | — |
| `tokenCounter` | `claude_code.token.usage` | tokens | — |
| `codeEditToolDecisionCounter` | `claude_code.code_edit_tool.decision` | count | `decision`: accept/reject, `tool`: Edit/Write/NotebookEdit |
| `activeTimeCounter` | `claude_code.active_time.total` | seconds | — |

The dimension choices reveal product priorities: the **accept/reject rate of code-edit tools** has its own Counter (`codeEditToolDecisionCounter`), indicating the team closely monitors the quality of AI-generated code changes.
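
These are ordinary OTel instruments. The minimal sketch below shows how two of them could be declared and incremented with `@opentelemetry/api`; metric names and dimensions come from the table, while the meter name and description strings are assumptions.

```typescript
import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('claude-code') // meter name assumed

const locCounter = meter.createCounter('claude_code.lines_of_code.count', {
  description: 'Lines of code added or removed by Claude Code',
})
const decisionCounter = meter.createCounter('claude_code.code_edit_tool.decision', {
  description: 'User accept/reject decisions for code-edit tools',
})

// Dimensions are attached per data point as attributes.
locCounter.add(12, { type: 'added' })
decisionCounter.add(1, { decision: 'accept', tool: 'Edit' })
```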

## 3. Span Hierarchy Architecture

### 3.1 Standard Spans

`sessionTracing.ts` (928 lines) defines a three-tier Span hierarchy:

```
claude_code.interaction (root span)
├── user_prompt (attribute, omitted when privacy controls are on)
├── interaction.sequence (monotonically increasing counter)
│
├── claude_code.llm_request (parent: interaction via otelContext)
│   ├── model, speed ('fast'/'normal')
│   ├── query_source (agent name)
│   ├── input_tokens, output_tokens
│   ├── cache_read_tokens, cache_creation_tokens
│   ├── ttft_ms (time-to-first-token latency)
│   └── [Beta] response.model_output, response.thinking_output
│
├── claude_code.tool (parent: interaction)
│   ├── tool_name
│   ├── claude_code.tool.blocked_on_user
│   │   ├── decision, source
│   │   └── duration_ms
│   └── claude_code.tool.execution
│       ├── success, error
│       └── duration_ms
│
└── claude_code.hook (Beta only, parent: tool or interaction)
    ├── hook_event, hook_name
    └── num_success, num_blocking, num_non_blocking_error
```
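
The parent-child links in this tree are established through OTel context propagation. A minimal sketch using `@opentelemetry/api`—span and attribute names follow the hierarchy above, while the attribute values are placeholders:

```typescript
import { trace, context } from '@opentelemetry/api'

const tracer = trace.getTracer('claude-code-session') // tracer name assumed

// Root span for one user interaction.
const interaction = tracer.startSpan('claude_code.interaction', {
  attributes: { 'interaction.sequence': 1 },
})

// Children are parented by passing a context that carries the interaction span.
const parentCtx = trace.setSpan(context.active(), interaction)
const llmRequest = tracer.startSpan(
  'claude_code.llm_request',
  { attributes: { model: 'claude-sonnet-example', query_source: 'main' } }, // placeholder values
  parentCtx,
)

llmRequest.setAttributes({ input_tokens: 1200, output_tokens: 340, ttft_ms: 450 })
llmRequest.end()
interaction.end()
```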

### 3.2 WeakRef + AsyncLocalStorage Span Lifecycle Management

`sessionTracing.ts:66-75` implements a carefully designed memory-management strategy that is highly unusual in standard OTel usage:

```
activeSpans: Map<string, WeakRef<SpanContext>>   // weak reference
strongSpans: Map<string, SpanContext>             // strong reference
interactionContext: AsyncLocalStorage<SpanContext> // ALS as GC root
toolContext: AsyncLocalStorage<SpanContext>        // ALS as GC root
```

**Design logic**:
- **interaction and tool Spans** are stored in `AsyncLocalStorage` (ALS). ALS acts as a GC root—as long as the current asynchronous context is alive, the SpanContext cannot be collected. `activeSpans` only holds `WeakRef`s; when ALS is cleared (`enterWith(undefined)`) and no other code holds a reference, the GC can reclaim the SpanContext and the WeakRef automatically becomes stale.
- **LLM request, blocked-on-user, tool execution, and hook Spans** are not managed in ALS, so they need the `strongSpans` Map to hold strong references—otherwise the GC might reclaim the SpanContext before `endLLMRequestSpan()` is called.

This is a **GC-aware Span management pattern**: it leverages JavaScript's garbage collector to automatically clean up Spans that are no longer needed, instead of manually tracking the lifecycle of every Span. For Claude Code's long-running CLI processes (especially cron-driven sessions that may run for days), this design significantly reduces the risk of memory leaks.
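
A distilled sketch of the pattern follows, with simplified types and hypothetical function names; the real module tracks richer span-context objects and clears ALS with `enterWith(undefined)` rather than scoping through `run()`.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'
import type { Span } from '@opentelemetry/api'

interface SpanContextEntry { span: Span; startedAt: number } // simplified shape

// Interaction/tool spans: ALS keeps them alive for the duration of the async context;
// the map only holds WeakRefs, so once the context ends the GC may collect them.
const interactionContext = new AsyncLocalStorage<SpanContextEntry>()
const activeSpans = new Map<string, WeakRef<SpanContextEntry>>()

// LLM-request / execution / hook spans are not ALS-managed, so strong references
// are required until the matching end*Span() call.
const strongSpans = new Map<string, SpanContextEntry>()

function startInteraction(id: string, entry: SpanContextEntry, run: () => void): void {
  activeSpans.set(id, new WeakRef(entry))
  interactionContext.run(entry, run) // ALS acts as the GC root while `run` is in flight
}

function getInteraction(id: string): SpanContextEntry | undefined {
  return activeSpans.get(id)?.deref() // undefined once the GC has reclaimed the entry
}
```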

### 3.3 Other Key Design Decisions

- **SPAN_TTL_MS = 30 minutes** (`sessionTracing.ts:79`): Orphan spans are cleaned up every 60 seconds—a second line of defense beyond WeakRef.
- **`isEnhancedTelemetryEnabled()`** (`sessionTracing.ts:126-143`) priority: env override > ant build > GrowthBook cached gate.
- **Parallel request handling**: `endLLMRequestSpan()` handles parallel requests via an explicit span parameter (`sessionTracing.ts:353-401`).

> 📚 **Course Connection (Distributed Systems)**: The Span hierarchy directly maps to the **distributed tracing** concept taught in distributed-systems courses. The interaction → llm_request / tool → execution parent-child relationships form a call tree, consistent with the Trace/Span model introduced in Google's Dapper paper (2010). OTel's Context Propagation mechanism—passing the parent Span ID to child Spans through `otelContext`—is equivalent to causality tracking in distributed systems.

## 4. Privacy and Data Governance System

The most common mistake in telemetry systems is inadvertently recording user code or file paths. Claude Code has established a multi-layered privacy governance mechanism—from TypeScript's type system to runtime field routing.

### 4.1 Type-System–Encoded Privacy Review—`AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS`

`analytics/index.ts:19` defines a TypeScript `never` type:

```typescript
export type AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS = never
```

This is not a runtime check—it is **a development-time forcing function**. Whenever a developer wants to record string-typed metadata in an analytics event, they must explicitly cast it `as AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS`. The deliberately inconvenient length of this type name forces the developer to pause and ask: "Am I absolutely sure this string does not contain code or file paths?"

> 💡 **Plain English**: This is like having to hand-write "I confirm this large cash withdrawal is my own decision" at a bank counter—it does not actually stop you from withdrawing the money, but it makes you think for an extra second before you act.
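
A hypothetical call site illustrating the forcing function: `logEvent`, the import path, and the field name are assumptions. The cast type-checks because the tag type is `never`, but its real job is to make the author attest to the field's contents at review time.

```typescript
import type { AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS } from '../analytics' // path assumed

// Hypothetical call site: the long cast is the "signature" a developer must write
// for every free-form string that enters analytics metadata.
logEvent('tengu_tool_use_success', {
  tool_name: 'Edit' as AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS,
})

declare function logEvent(eventName: string, metadata: Record<string, unknown>): void // hypothetical
```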

### 4.2 `_PROTO_*` Field Routing—Privileged Channel for PII Data

The same module defines a second type tag:

```typescript
export type AnalyticsMetadata_I_VERIFIED_THIS_IS_PII_TAGGED = never
```

Metadata keys prefixed with `_PROTO_` are PII (personally identifiable information) tagged fields. The routing rules for these fields are:

1. **1P channel** (`firstPartyEventLoggingExporter.ts`): Receives the full payload; the exporter hoists `_PROTO_*` keys to top-level protobuf fields, which have privileged access controls in BigQuery.
2. **Datadog channel** (`sink.ts`): Calls `stripProtoFields()` before sending to **fully strip** all `_PROTO_*` keys—Datadog never sees unmasked PII.
3. **Other channels**: Channels not routed through the sink (OTLP, Beta tracing) have their own independent privacy controls (`isEnhancedTelemetryEnabled()`, ant-only restrictions, etc.).

The `stripProtoFields()` function (`index.ts:45-58`) includes an optimization: when the payload contains no `_PROTO_*` keys, it returns the original reference without copying, avoiding unnecessary object allocations on the high-frequency event path.
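
A sketch matching that description, including the fast-path optimization; the exact signature in `index.ts` may differ.

```typescript
// Drop `_PROTO_*` keys before a payload reaches a general-access backend such as Datadog.
function stripProtoFieldsSketch(metadata: Record<string, unknown>): Record<string, unknown> {
  // Fast path: if nothing is PII-tagged, return the original reference without copying.
  if (!Object.keys(metadata).some(key => key.startsWith('_PROTO_'))) return metadata

  return Object.fromEntries(
    Object.entries(metadata).filter(([key]) => !key.startsWith('_PROTO_')),
  )
}
```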

### 4.3 Three-Level Privacy Control Matrix

`privacyLevel.ts` defines three privacy levels that form a control matrix against channel activation states:

| Privacy level | OTLP | BigQuery | Beta Tracing | Datadog | 1P |
|---------------|------|----------|--------------|---------|-----|
| `default` | On when user-enabled | Conditionally enabled | Conditionally enabled | Gate-controlled | On |
| `no-telemetry` | Off | Off | Off | Off | Off |
| `essential-traffic` | Off | Off | Off | Off | Off |

An important philosophical note: "whoever owns the telemetry controls it." OTLP is enabled by the user (via `CLAUDE_CODE_ENABLE_TELEMETRY`), so `DISABLE_TELEMETRY` does not affect it—a user's choice to export their own data is not overridden by Anthropic's privacy settings. In contrast, `isAnalyticsDisabled()` turns off Datadog and 1P because those are consumed by Anthropic.

### 4.4 `isAnalyticsDisabled()` Multi-Condition Check

`config.ts`'s `isAnalyticsDisabled()` returns true if any of the following conditions are met:
- `NODE_ENV === 'test'` (test environment)
- `CLAUDE_CODE_USE_BEDROCK` (AWS Bedrock deployment)
- `CLAUDE_CODE_USE_VERTEX` (Google Vertex deployment)
- `CLAUDE_CODE_USE_FOUNDRY` (Microsoft Foundry deployment)
- `isTelemetryDisabled()` (user explicitly disabled telemetry)

This means user data from all third-party cloud deployments (Bedrock/Vertex/Foundry) **does not** flow into Anthropic's Datadog and 1P channels—this is both a privacy guarantee and a compliance boundary.
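
A sketch of the check as described, with environment-variable handling simplified to truthiness:

```typescript
// Any one condition disables the Anthropic-consumed channels (Datadog + 1P).
function isAnalyticsDisabledSketch(): boolean {
  return (
    process.env.NODE_ENV === 'test' ||
    Boolean(process.env.CLAUDE_CODE_USE_BEDROCK) ||
    Boolean(process.env.CLAUDE_CODE_USE_VERTEX) ||
    Boolean(process.env.CLAUDE_CODE_USE_FOUNDRY) ||
    isTelemetryDisabled()
  )
}

declare function isTelemetryDisabled(): boolean // cited from config.ts; signature assumed
```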

## 5. Beta Detailed Tracing—System Prompt Hashing

### 5.1 Activation Conditions

`isBetaTracingEnabled()` (`betaSessionTracing.ts:78-98`) requires three conditions to be met simultaneously:
1. `ENABLE_BETA_TRACING_DETAILED=1`
2. `BETA_TRACING_ENDPOINT` is set
3. SDK/headless mode **or** the organization is on the `tengu_trace_lantern` GrowthBook gate allowlist

**On the export destination**: Beta tracing uses the standard `@opentelemetry/exporter-trace-otlp-http`, with the endpoint configured via the `BETA_TRACING_ENDPOINT` environment variable. The 60KB truncation comment mentions a Honeycomb limit, but the precise statement is that the default configuration points to Honeycomb while the protocol is standard OTLP—the endpoint can target any OTLP-compatible backend.

### 5.2 System Prompt Hash Deduplication

`betaSessionTracing.ts:129-131`:

```typescript
hashSystemPrompt() → 'sp_<12-char SHA256>'
```

Within the same session, identical system prompts are recorded only once (tracked via a `seenHashes` Set). This avoids transmitting a massive system prompt on every turn—a significant bandwidth savings for long sessions. This is essentially a memoization pattern: "do not send the same string twice."
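
A sketch of the dedup pattern under assumed attribute names (`system_prompt_hash`, `system_prompt` are placeholders):

```typescript
import { createHash } from 'node:crypto'

// Hash the prompt, remember the hash, and only attach the full text the first time
// a given prompt is seen in the session.
const seenHashes = new Set<string>()

function hashSystemPromptSketch(prompt: string): string {
  return 'sp_' + createHash('sha256').update(prompt).digest('hex').slice(0, 12)
}

function systemPromptAttributes(prompt: string): Record<string, string> {
  const hash = hashSystemPromptSketch(prompt)
  if (seenHashes.has(hash)) return { system_prompt_hash: hash }   // repeat: hash only
  seenHashes.add(hash)
  return { system_prompt_hash: hash, system_prompt: prompt }      // first occurrence: full text
}
```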

### 5.3 New-Context Incremental Tracing

`betaSessionTracing.ts:334-399`: Beta tracing does not transmit the full context every time; instead it computes the **incremental delta**:

- Tracks `lastReportedMessageHash` independently per `querySource` (agent name)
- Only transmits user messages added since the last report
- Separates `<system-reminder>` tag contents from regular context

### 5.4 Privacy Rules

| Data type | Visibility |
|-----------|------------|
| System prompt (hash + preview) | All users |
| Model output | All users |
| Thinking output | **ant-only** |
| Tool schema | All users (hash deduplication) |

Note: `user_prompt` is **always recorded** when Beta tracing is enabled (via `addBetaInteractionAttributes()`), subject only to 60KB truncation. The actual privacy control is `isEnhancedTelemetryEnabled()`, which determines whether standard enhanced telemetry is active, plus the `ant-only` restriction on thinking output.

### 5.5 Content Truncation

`truncateContent()` (`betaSessionTracing.ts:103-117`)—60KB limit; comments note this is a Honeycomb safety boundary.

### 5.6 Session Compaction Coordination

`clearBetaTracingState()` (`betaSessionTracing.ts:65-68`) is called after session compaction—because compaction rewrites message history, previous hash-based incremental tracking must be reset.

## 6. Perfetto Chrome Trace (ant-only)—A 1,120-Line Agent Visualization System

`perfettoTracing.ts` (1,120 lines) is not a simple trace-format converter—it is a complete agent runtime visualization system implementing the Chrome Trace Event format, viewable directly in `ui.perfetto.dev` or Chrome's `chrome://tracing`.

### 6.1 Activation

Controlled via environment variables (`feature('PERFETTO_TRACE')` is eliminated in non-ant builds):
- `CLAUDE_CODE_PERFETTO_TRACE=1`: Enabled; trace file written to `~/.claude/traces/trace-<session-id>.json`
- `CLAUDE_CODE_PERFETTO_TRACE=<path>`: Enabled; trace file written to the specified path
- `CLAUDE_CODE_PERFETTO_WRITE_INTERVAL_S=<seconds>`: Periodic writing (default is only on exit)

### 6.2 Agent Hierarchy Mapping—The Process/Thread ID Model

This is the most ingenious design in the Perfetto implementation. The Chrome Trace format was originally built for multi-process browsers (one process per tab); Claude Code cleverly maps the agent hierarchy onto this model:

- **Process ID (pid)**: The main agent uses pid=1; each sub-agent (teammate in swarm mode) gets an incrementing process ID—managed by `getProcessIdForAgent()`.
- **Thread ID (tid)**: The DJB2 hash of the agent name is used as the thread ID (`stringToNumericHash(agentName)`), ensuring events from agents with the same name align on the same "thread track."
- **Metadata events**: `process_name` and `thread_name` metadata events are generated for each agent; the Perfetto UI uses these to label tracks.

In team swarm mode, parallel execution across multiple agents appears as **multi-process parallelism** on the Perfetto timeline—you can visually see which agents are working simultaneously, what each is doing, and which tools they invoke.
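
A sketch of the mapping and the resulting Chrome Trace metadata events: the DJB2 hash variant and the `ph: 'M'` metadata-event shape are standard Chrome Trace conventions, while the surrounding bookkeeping is assumed.

```typescript
const processIds = new Map<string, number>()
let nextPid = 2 // pid 1 is reserved for the main agent

function getProcessIdForAgentSketch(agentName: string): number {
  if (agentName === 'main') return 1
  let pid = processIds.get(agentName)
  if (pid === undefined) { pid = nextPid++; processIds.set(agentName, pid) }
  return pid
}

// DJB2 string hash, coerced to an unsigned 32-bit thread id so same-named agents
// always land on the same thread track.
function stringToNumericHashSketch(s: string): number {
  let hash = 5381
  for (let i = 0; i < s.length; i++) hash = ((hash << 5) + hash + s.charCodeAt(i)) | 0
  return hash >>> 0
}

// Chrome Trace "M" (metadata) events label the tracks in the Perfetto UI.
function metadataEventsFor(agentName: string) {
  const pid = getProcessIdForAgentSketch(agentName)
  const tid = stringToNumericHashSketch(agentName)
  return [
    { name: 'process_name', ph: 'M', pid, tid, args: { name: agentName } },
    { name: 'thread_name', ph: 'M', pid, tid, args: { name: agentName } },
  ]
}
```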

### 6.3 100,000 Event Limit and Eviction Policy

`MAX_EVENTS = 100_000` (`perfettoTracing.ts:106`)—this seemingly arbitrary number has real operational reasoning. Source comments explain:

> "Cron-driven sessions run for days; 22 push sites x many turns would otherwise grow unboundedly"

This reveals an important Claude Code usage pattern: **cron-driven long-running sessions** that may run for days, during which the roughly 22 code locations that push trace events fire on every turn—without a cap, the in-memory buffer would grow without bound. At roughly 300 bytes per event, the 100K limit translates to about 30MB—enough to debug any scenario.

**Eviction policy**: When the event count hits the limit, the oldest half of events are dropped (amortized O(1)). Note that metadata events (process/thread names) are stored separately in the `metadataEvents` array and are **not subject to eviction**—the Perfetto UI needs them to label tracks correctly.
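
A sketch of the eviction behavior described above:

```typescript
// When the buffer hits the cap, drop the oldest half; metadata events live in a
// separate array and are never evicted.
const MAX_EVENTS = 100_000
const events: object[] = []
const metadataEvents: object[] = [] // process_name / thread_name labels: exempt from eviction

function pushEventSketch(event: object): void {
  if (events.length >= MAX_EVENTS) {
    events.splice(0, Math.floor(events.length / 2)) // amortized O(1) per pushed event
  }
  events.push(event)
}
```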

### 6.4 Stale Span Cleanup

Similar to the SPAN_TTL mechanism in `sessionTracing.ts`, Perfetto has its own stale-span cleanup:

- `STALE_SPAN_TTL_MS = 30 minutes`
- `pendingSpans` is checked every 60 seconds
- Expired spans are automatically ended and marked `evicted: true`—still visible in Perfetto, but clearly labeled

### 6.5 Periodic Writing

The `CLAUDE_CODE_PERFETTO_WRITE_INTERVAL_S` environment variable allows setting a periodic write interval. This is critical for long-running scenarios—if writing only happened on exit, a process crash would lose all trace data. Each periodic write is a **full snapshot** (not incremental), meaning it contains the entirety of `metadataEvents + events`.

### 6.6 Recorded Event Types

The Perfetto trace records the following event types, each mapped to a different Chrome Trace Event phase:

- **API requests**: TTFT (time-to-first-token), TTLT (time-to-last-token), prompt length, cache stats, model ID, speculative flag
- **Tool execution**: Tool name, execution duration, token usage
- **User waiting**: User input wait time (blocked-on-user)
- **Agent hierarchy**: Parent-child agent relationships, agent lifecycle

Even as an ant-only feature, Perfetto's **multi-agent visualization approach** is highly valuable for practitioners—mapping agent hierarchies onto Chrome Trace's process/thread model is an elegant, zero-cost visualization solution.

## 7. GrowthBook Remote Control Plane

GrowthBook is more than a feature-gate platform—it is the **remote control plane** for the entire telemetry system. Here are all telemetry-related GrowthBook configurations:

| Config name | Type | Function |
|-------------|------|----------|
| `tengu_trace_lantern` | Feature Gate | Beta tracing organization allowlist |
| enhanced telemetry gate | Feature Gate (cached) | Standard enhanced telemetry switch |
| `tengu_log_datadog_events` | Feature Gate | Datadog event log switch |
| `tengu_event_sampling_config` | Dynamic Config (JSON) | Per-event-type sampling rate configuration |
| `tengu_frond_boric` | Dynamic Config (JSON) | Sink killswitch (obfuscated name) |
| `tengu_1p_event_batch_config` | Dynamic Config (JSON) | 1P batch-send parameters (delay, batch size, retry count, etc.) |

**Degradation strategy**: All GrowthBook query functions carry the `_CACHED_MAY_BE_STALE` suffix, indicating locally cached values are used. When GrowthBook is unreachable:
- Feature Gate: uses the last successfully fetched cached value
- Dynamic Config: uses code-level defaults (`{}` or specific default objects)
- **Fail-open design**: killswitch missing defaults to all sinks on; sampling config missing defaults to full recording

This means GrowthBook availability does not become a single point of failure for the telemetry system—but after prolonged unreachability, cached values may become significantly stale.

## 8. Design Trade-offs and Assessment

**Strengths**:
1. **Type-system–encoded privacy review**—the `AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS` type tag and `_PROTO_*` field routing elevate privacy compliance from runtime checks to development-time forced awareness. This is the design that most reflects security-engineering thinking in the entire telemetry system.
2. **Complete failure recovery for 1P events**—persistent storage + quadratic backoff + auth degradation is far richer than OTel's standard `BatchSpanProcessor`.
3. **Remotely controllable event sampling**—via GrowthBook's `tengu_event_sampling_config`, telemetry granularity can be adjusted dynamically without a release.
4. **Sink killswitch mechanism**—incident-response–grade defensive design with obfuscated naming to prevent abuse.
5. The existence of `codeEditToolDecisionCounter` shows the team is optimizing AI code-edit quality in a data-driven way.
6. Perfetto maps agent hierarchies onto Chrome Trace's process/thread model—zero-cost reuse of a mature performance-analysis toolchain.
7. **WeakRef + AsyncLocalStorage Span lifecycle management**—a GC-aware design that reduces memory-leak risk in long-running scenarios.
8. Session compaction coordination ensures tracing-state consistency.
9. OTLP configuration supports mTLS + dynamic headers, meeting enterprise security requirements.

**Costs**:
1. **Maintenance burden of five independent pipelines**—each path has its own serialization format, batching policy, retry logic, and sampling mechanism; there is no unified event bus + multi-sink architecture. Likely reasons: historical evolution, differing reliability requirements, differing data-sensitivity levels.
2. The 30-minute SPAN_TTL means early spans in ultra-long sessions may be cleaned up—potentially losing debugging information.
3. Beta tracing's ant-only restriction on thinking output indicates privacy boundaries are still being cautiously evaluated.
4. The 60KB truncation is a compromise driven by Honeycomb system limits—particularly large tool outputs will be truncated.
5. BigQuery metrics exclude Bedrock/Vertex/Foundry users—product behavior for these users is a blind spot.
6. Perfetto being ant-only means enterprise users cannot access agent-hierarchy visualization—a missed opportunity for a high-value debugging feature.
7. **Implicit dependency on GrowthBook as the remote configuration center**—Beta tracing allowlist, enhanced telemetry gate, event sampling rates, and sink killswitch all depend on GrowthBook. Although there is a `_CACHED_MAY_BE_STALE` degradation strategy, behavior after prolonged unreachability and cache expiration is not clearly defined.

---

*Quality self-check:*
- [x] Coverage: five data-egress channels + 8 Counters + Span hierarchy + WeakRef lifecycle + privacy data governance + Beta tracing + Perfetto (extended) + GrowthBook remote control plane + Sink Killswitch + event sampling
- [x] Fidelity: all line numbers and code snippets verified against source; line counts consistent with `wc -l`
- [x] Depth: `_PROTO_*` field routing, 1P failure-recovery mechanism, Perfetto eviction policy, WeakRef span management
- [x] Critique: identifies five-pipeline maintenance cost, GrowthBook single-point dependency, SPAN_TTL limitation, missed opportunity of Perfetto ant-only restriction
