# Dictation pipeline

> End-to-end flow from hotkey press through audio capture (16 kHz PCM), Cartesia WebSocket STT, optional TranscriptRewriter polish, FocusedEditable target resolution, PasteService Cmd+V insertion, and TranscriptHistoryStore persistence. Latency stages and prewarm behavior.

- Repository: cartesia-ai/InkIt
- GitHub: https://github.com/cartesia-ai/InkIt
- Human docs: https://www.grok-wiki.com/public/docs/cartesia-ai-inkit-18975554254b
- Complete Markdown: https://www.grok-wiki.com/public/docs/cartesia-ai-inkit-18975554254b/llms-full.txt

## Source Files

- `InkIt/AppCoordinator.swift`
- `InkIt/AudioCaptureService.swift`
- `InkIt/CartesiaStreamingClient.swift`
- `InkIt/TranscriptRewriter.swift`
- `InkIt/PasteService.swift`
- `InkIt/TranscriptHistoryStore.swift`
- `InkIt/ContextSnapshot.swift`

---

---
title: "Dictation pipeline"
description: "End-to-end flow from hotkey press through audio capture (16 kHz PCM), Cartesia WebSocket STT, optional TranscriptRewriter polish, FocusedEditable target resolution, PasteService Cmd+V insertion, and TranscriptHistoryStore persistence. Latency stages and prewarm behavior."
---

`AppCoordinator` owns the dictation hot path: `HotkeyManager` callbacks start and stop a take, `AudioCaptureService` streams 16 kHz PCM into `CartesiaStreamingClient`, the final transcript optionally passes through `TranscriptRewriter`, `FocusedEditable` verifies an editable target at release, `PasteService` inserts via Cmd+V, and `TranscriptHistoryStore` persists the row with per-stage latency. State transitions (`DictationState`) and HUD fields (`liveTranscript`, `audioReady`, `inputLevel`) mirror that pipeline on the main actor.

## Pipeline overview

```mermaid
sequenceDiagram
    participant HK as HotkeyManager
    participant AC as AppCoordinator
    participant AU as AudioCaptureService
    participant STT as CartesiaStreamingClient
    participant RW as TranscriptRewriter
    participant FE as FocusedEditable
    participant PS as PasteService
    participant HS as TranscriptHistoryStore

    HK->>AC: onPress
    AC->>AC: resolve pasteTargetApp, TargetAppSnapshot
    AC->>FE: enableWebAccessibility(pid)
    AC->>RW: prewarm() (if polish enabled)
    AC->>STT: connect()
    AC->>AU: start(onChunk → sendAudio)
    AU-->>AC: onReady → audioReady = true
    STT-->>AC: onTranscriptUpdate → liveTranscript

    HK->>AC: onRelease (hold mode) or second press (toggle)
    AC->>AC: releaseTime = now, state = finalizing
    AC->>AU: stop()
    AC->>STT: finalizeAndClose()
    STT-->>AC: onClosed(finalText)

    AC->>RW: rewriteWithoutContext (optional)
    AC->>FE: current()
    alt editable focus
        AC->>PS: paste(text, targetApp)
        PS-->>AC: completion(true)
        AC->>HS: add(text, latency, polish)
    else no editable focus
        AC->>HS: add(text, latency, polish)
        AC->>AC: heldInHistory notice
    end
```

| Stage | Owner | Trigger | Output |
| --- | --- | --- | --- |
| Hotkey | `HotkeyManager` → `AppCoordinator` | Press / release (hold) or toggle press | `startDictation()` / `stopDictation()` |
| Audio | `AudioCaptureService` | `start()` after guards pass | 16 kHz mono `pcm_s16le` chunks |
| STT | `CartesiaStreamingClient` | `connect()` at press; `finalizeAndClose()` at release | Cumulative transcript via WebSocket |
| Polish | `TranscriptRewriter` | After `onClosed` when `correctionEnabled` and key present | Corrected text or raw fallback |
| Focus | `FocusedEditable` | After polish, before paste | `isEditable` + verified `NSRunningApplication` |
| Paste | `PasteService` | Editable focus confirmed | Cmd+V into target app |
| History | `TranscriptHistoryStore` | After polish (paste or held) | SwiftData `TranscriptRecord` row |

## Hotkey press: start dictation

`handleHotkeyPress()` branches on `SettingsStore.dictationMode`:

- **Hold** — press sets `isHotkeyHeld = true` and calls `startDictation()`.
- **Toggle** — press starts recording when idle; a second press while `.recording` calls `stopDictation()`.

`startDictation()` runs only from `.idle`, `.heldInHistory`, or `.error`. In-flight states (`.recording`, `.finalizing`, `.rewriting`, `.pasting`) ignore duplicate starts.

Pre-flight guards, in order:

1. Non-empty `cartesiaAPIKey` (Keychain).
2. Microphone permission (`permissions.hasMicrophone`).
3. Accessibility permission (`permissions.hasAccessibility`).

On success the coordinator sets `state = .recording`, `audioReady = false`, clears `liveTranscript`, optionally plays a start sound, and resolves the paste target:

- Frontmost app if it is not InkIt.
- Otherwise `lastExternalApp` (seeded at launch from frontmost or first regular non-InkIt app).

`TargetAppSnapshot.capture(from:)` records bundle ID, localized name, and PID for logging and future per-app behavior. It does not read on-screen content.

<Note>
Onboarding trial (`beginOnboardingTrial`) routes the final transcript to the Try It box only when InkIt is frontmost. Live preview is suppressed during trial; words appear all at once on release. Trial skips polish, paste, and LLM prewarm.
</Note>

## Audio capture (16 kHz PCM)

`AudioCaptureService` uses `AVAudioEngine` with a tap on the input node. Each buffer is:

1. Converted to **16 kHz, mono, 16-bit signed little-endian PCM** (`pcm_s16le`).
2. Delivered on a dedicated `inkit.audio` queue to the `onChunk` callback.
3. Metered on the main queue as a normalized 0…1 peak level (`onLevel` → `inputLevel`).

Device selection reads `settings.preferredInputDeviceUID` before `start()`. A pinned UID that is unplugged falls back to the system default input.

### `audioReady` and Bluetooth delay

`audioReady` stays `false` until the mic is genuinely producing signal:

- **Signal path** — first tap buffer with level above `readyLevelThreshold` (0.03) fires `onReady` once per take.
- **Fallback** — `readyFallbackDelay` (0.6 s) fires `onReady` anyway so a silent room does not leave the HUD stuck on "preparing."

Bluetooth mics often need 200–500 ms to switch from A2DP output to HFP input; digital silence during that window is lost at the hardware level. The HUD waits for `audioReady` so users do not speak into the dead gap.

`stop()` removes the tap, stops the engine, and **`queue.sync`** drains in-flight conversions before clearing the converter — preventing loss of the tail of the last word.

## Cartesia WebSocket STT

A fresh `CartesiaStreamingClient` is created per take with the user's Cartesia API key.

<ParamField body="endpoint" type="wss URL" required>
`wss://api.cartesia.ai/stt/turns/websocket`
</ParamField>

<ParamField body="query params" type="object" required>
`model=ink-2`, `encoding=pcm_s16le`, `sample_rate=16000`, `cartesia_version=2026-03-01`
</ParamField>

<ParamField body="headers" type="object" required>
`X-API-Key`, `Cartesia-Version: 2026-03-01`
</ParamField>

Audio flows as binary WebSocket frames. `turn.update` and `turn.eager_end` update the in-flight turn; completed `turn.end` events append to `completedTurns`. `onTranscriptUpdate` drives `liveTranscript` unless onboarding suppresses preview.

### Pending audio buffer

Audio captured before the server's `connected` event is held in `pendingAudio` and flushed in order in `handleConnected()`. Early frames are not sent until the session is ready, avoiding leading-word loss.

### Live transcript HUD

`client.onTranscriptUpdate` sets `liveTranscript` on the main actor. The notch HUD shows cumulative text during `.recording`.

## Hotkey release: finalize

`stopDictation()` runs only from `.recording`:

1. `releaseTime = DispatchTime.now()` — anchor for all latency measurements.
2. `state = .finalizing`.
3. Optional stop sound.
4. `audio.stop()` — last PCM chunks flushed.
5. `client.finalizeAndClose()` — sends `{"type":"close"}` when connected.

The client completes on the final `turn.end` after close (happy path), socket close, or a grace timer (3.0 s with content, 2.0 s when empty). `onClosed` delivers the joined, trimmed transcript.

Empty transcripts collapse to `.idle` with no history row and no error.

STT failures route through `handleSTTFailure`, which may set `cartesiaKeyInvalid` or `cartesiaOutOfCredits` and surfaces `failure.notchMessage` via `.error` state.

## Optional polish (`TranscriptRewriter`)

After `onClosed`, `correctedTranscript(raw:targetSnapshot:)` runs when both `settings.correctionEnabled` and a non-empty key for `settings.rewriteProvider` are present. Otherwise the raw transcript passes through with `PolishOutcome.off`.

When polish runs:

1. `state = .rewriting`.
2. `rewriteWithoutContext(transcript:)` POSTs to the configured `LLMProvider` endpoint (OpenAI-compatible chat completions, or Anthropic Messages API).
3. On success — polished text plus `original` for diffing (`PolishOutcome.polished`).
4. On failure — raw text pasted anyway (`PolishOutcome.failed`) with `PolishFailure` reason for the history tooltip.

Polish never blocks delivery: the user always gets at least the raw transcript.

Successful Cartesia transcription clears `cartesiaKeyInvalid` / `cartesiaOutOfCredits`. Polish outcomes update `polishKeyInvalid` / `polishOutOfCredits` on auth or billing failures.

## Focus resolution (`FocusedEditable`)

Paste target is resolved at **press**, but editability is verified at **release** via `FocusedEditable.current()`:

- Runs off the main thread under a **1.5 s** AX budget (`AX.run`).
- Fast path: system-wide focused element with settable `AXValue`.
- Chromium/Electron fallback: app-element descent, `AXFocused` subtree scan (max 1500 nodes), system-wide container descent.

If no editable field is focused, the transcript is saved to history with `pasteMs: 0`, `showHeldInHistoryNotice()` sets `.heldInHistory` for 2.5 s, and Cmd+V is not synthesized.

## Paste (`PasteService`)

When focus is editable, `state = .pasting` and `PasteService.paste(text:targetApp:)`:

1. Snapshots existing pasteboard items.
2. Writes dictated text with session tag `com.cartesia.InkIt.PasteSession` and transient type `org.nspasteboard.TransientType` (clipboard managers skip the entry).
3. Activates `targetApp` only when it is not already frontmost.
4. Waits `clipboardSettleDelay` (80 ms), plus `activationFocusDelay` (120 ms) when activation was needed.
5. Synthesizes Cmd+V via `CGEvent`.
6. Reports completion immediately (perceived paste latency ends here).
7. Restores the prior pasteboard after `clipboardRestoreDelay` (400 ms) if the session tag still matches.

Paste uses `focus.app ?? capturedTargetApp` — the release-verified app wins over the press-time target.

## History persistence (`TranscriptHistoryStore`)

`history.add(_:original:latency:polish:failure:)` inserts a `TranscriptRecord` into SwiftData and mirrors it in the published `entries` array (newest first).

<ResponseField name="Latency" type="object">
Per-stage milliseconds from hotkey release: `transcribeMs` (release → final transcript), `polishMs` (transcript → polish done), `pasteMs` (polish → paste completion). `totalMs` exposes only `transcribeMs + polishMs` for user-facing stats; `pasteMs` is recorded but excluded from totals.
</ResponseField>

<ResponseField name="Entry" type="object">
`text`, `timestamp`, optional `latency`, `original` (pre-polish), `polish` outcome (`off` / `polished` / `failed`), optional `failure` details.
</ResponseField>

On-disk persistence falls back to an in-memory SwiftData container if the on-disk store cannot open; rows survive the session but not relaunch. `lifetimeWords` in UserDefaults advances only when a row durably persists.

## Latency stages

All measurements use monotonic `DispatchTime` from `releaseTime` (set in `stopDictation()`).

| Segment | Start | End | Stored as |
| --- | --- | --- | --- |
| Transcribe | Hotkey release | `onClosed` transcript arrival | `transcribeMs` |
| Polish | Transcript arrival | Polish completion | `polishMs` (0 when off) |
| Paste | Polish completion | Paste completion callback | `pasteMs` |

Onboarding trial captures only `transcribeMs` in `lastTrialLatency` (no polish or paste).

Home "avg time to text" and history breakdowns use `totalMs = transcribeMs + polishMs`.

## Prewarm behavior

Three prewarm paths overlap work during the hold window so post-release latency stays low:

### LLM connection prewarm

At key-press, when polish is enabled and a provider key exists (and not onboarding trial), `AppCoordinator` constructs a `TranscriptRewriter` and calls `prewarm()`:

- Issues a `HEAD` request to `provider.endpoint` with a 2.5 s timeout.
- Warms DNS, TCP, and TLS on the same `URLSession` instance reused for the polish POST.
- Stored in `warmRewriter` and consumed once in `correctedTranscript`; reset on every `startDictation`.

### Chromium accessibility prewarm

At key-press, `FocusedEditable.enableWebAccessibility(pid:)` sets `AXManualAccessibility` and `AXEnhancedUserInterface` on the target app off the main thread. Chromium/Electron apps build their accessible tree lazily; without this, the release-time focus check can race the tree and wrongly hold transcripts in History.

### STT pending-audio buffer

Not a network prewarm, but the same principle: audio during WebSocket handshake is buffered and flushed after `connected`, so the first spoken words are not dropped while the socket initializes.

<Warning>
`warmRewriter` is skipped if settings change mid-hold or prewarm was not started. `correctedTranscript` falls back to a fresh `TranscriptRewriter` instance — polish still runs, but without connection reuse.
</Warning>

## Error and edge paths

| Condition | Behavior |
| --- | --- |
| Empty / whitespace transcript | `.idle`, no history row |
| STT named failure (401, 402, 429, offline, 5xx mid-hold) | `.error` with notch message; may set service-issue flags |
| STT `.unknown` with content | Deliver transcript silently |
| STT post-close 5xx with no content | Collapse to empty (rapid tap) |
| Polish failure | Raw text; `PolishOutcome.failed` in history |
| No editable focus at release | History only; `.heldInHistory` notice |
| Paste failure | `.error("Paste failed")` |
| Audio start failure | `.error`; STT session cancelled |

Duplicate InkIt bundle instances are detected at launch and surfaced as `lastError` — Accessibility grants apply per bundle path.

## Module map

```text
HotkeyManager
    └─ AppCoordinator (orchestrator, DictationState, HUD bindings)
           ├─ AudioCaptureService ──► CartesiaStreamingClient (WebSocket STT)
           ├─ TranscriptRewriter (optional polish)
           ├─ FocusedEditable + TargetAppSnapshot (target resolution)
           ├─ PasteService (clipboard + Cmd+V)
           └─ TranscriptHistoryStore (SwiftData persistence + Latency)
```

## Related pages

<CardGroup>
<Card title="Dictation state machine" href="/dictation-state-machine">
`DictationState` transitions, HUD coupling, and error dwell behavior.
</Card>
<Card title="Cartesia STT reference" href="/cartesia-stt-reference">
WebSocket contract, server events, `STTFailure` classification, and close-path metrics.
</Card>
<Card title="Paste and focus reference" href="/paste-and-focus-reference">
Clipboard session tagging, timing constants, and AX focus walk details.
</Card>
<Card title="Configure polish" href="/configure-polish">
Enable polish, pick provider and model, and interpret `PolishUIState`.
</Card>
<Card title="Input device selection" href="/input-device-selection">
Pin a microphone UID, Bluetooth profile switch, and `audioReady` thresholds.
</Card>
</CardGroup>
