# Dictation state machine

> DictationState lifecycle: idle, recording, finalizing, rewriting, pasting, heldInHistory, and error. Transitions on hotkey press/release, STT completion, polish outcome, paste result, and empty-transcript collapse. audioReady and liveTranscript HUD coupling.

- Repository: cartesia-ai/InkIt
- GitHub: https://github.com/cartesia-ai/InkIt
- Human docs: https://www.grok-wiki.com/public/docs/cartesia-ai-inkit-18975554254b
- Complete Markdown: https://www.grok-wiki.com/public/docs/cartesia-ai-inkit-18975554254b/llms-full.txt

## Source Files

- `InkIt/AppCoordinator.swift`
- `InkIt/NotchHUD.swift`
- `InkIt/SettingsStore.swift`
- `InkIt/FeedbackSoundPlayer.swift`

---

---
title: Dictation state machine
description: DictationState lifecycle from hotkey press through STT, polish, paste, and recovery paths — including audioReady and liveTranscript HUD coupling.
---

`DictationState` is the single source of truth for where a dictation take sits in its lifecycle. `AppCoordinator` owns the enum, publishes it to SwiftUI observers, and drives every transition from hotkey input, Cartesia STT completion, optional polish, paste outcome, and empty-transcript collapse. The notch HUD and menu bar read the same state but surface different subsets: the island stays live only during `.recording`, while post-release work runs mostly in the menu bar.

## States

`DictationState` is an `Equatable` enum with seven cases. Associated data appears only on `.error(String)`.

| State | Meaning | Menu bar label | Notch HUD |
| --- | --- | --- | --- |
| `idle` | No dictation in flight | `Ink` | Hidden (collapsed into notch) |
| `recording` | Hotkey active; audio streaming to Cartesia | `● Ink` | Live island with preparing cue or waveform |
| `finalizing` | Key released; audio stopped; STT session closing | `… Ink` | Hidden |
| `rewriting` | Optional LLM polish in progress | `✎ Ink` | Hidden |
| `pasting` | Cmd+V insertion into focused app | `↩ Ink` | Hidden |
| `heldInHistory` | Transcript saved to History; no editable field at release | `⬇ Ink` | Notice: "Saved to History" |
| `error(String)` | Recoverable failure with short message | `⚠ Ink` | Error notice with warning glyph |

`finalizing`, `rewriting`, and `pasting` are intentionally silent in the notch. The island contracts on hotkey release so polishing and pasting do not compete with the user's cursor; the menu bar still reflects those phases.

## State diagram

```mermaid
stateDiagram-v2
    [*] --> idle

    idle --> recording: hotkey start (from idle / heldInHistory / error)
    heldInHistory --> recording: hotkey start
    error --> recording: hotkey start

    recording --> finalizing: hotkey stop (hold release or toggle press)
    recording --> error: permission / API / audio / STT failure

    finalizing --> idle: empty transcript collapse
    finalizing --> rewriting: polish enabled + key present
    finalizing --> pasting: polish off or skipped, editable focus OK
    finalizing --> heldInHistory: no editable field at release
    finalizing --> idle: onboarding trial (no paste)

    rewriting --> pasting: editable focus OK
    rewriting --> heldInHistory: no editable field

    pasting --> idle: paste success
    pasting --> error: paste failed

    heldInHistory --> idle: 2.5s timer
    error --> idle: 1.5s dwell after release
```

## Hotkey-driven transitions

Hotkey semantics depend on `DictationMode` in `SettingsStore`:

<ParamField body="hold" type="DictationMode">
Press starts dictation; release stops it and begins the post-release pipeline (`finalizing` → polish/paste).
</ParamField>

<ParamField body="toggle" type="DictationMode">
First press starts; second press stops and pastes. Release is ignored — only the next press ends the take.
</ParamField>

`startDictation()` accepts entry only from `idle`, `heldInHistory`, or `error`. Any other state (`recording`, `finalizing`, `rewriting`, `pasting`) returns immediately, preventing double-starts mid-flight.

On a successful start:

1. `state` becomes `.recording`
2. `audioReady` resets to `false`
3. `liveTranscript` clears
4. Optional start feedback sound plays
5. Paste target and Chromium AX prewarm run at press time
6. Cartesia WebSocket connects and `AudioCaptureService` begins streaming 16 kHz PCM

`stopDictation()` requires `.recording`. It records `releaseTime` for latency metrics, sets `.finalizing`, plays the stop cue, stops audio, and calls `finalizeAndClose()` on the STT client.

In hold mode, releasing while already in `.error` arms the error dismiss timer instead of calling `stopDictation()` — the user has finished reacting to the failure.

## Post-release pipeline

When Cartesia delivers the final transcript via `onClosed`, the coordinator branches:

### Empty transcript collapse

If the trimmed final text is empty, state returns directly to `idle` with no history row and no notch notice. This covers silence, rapid tap-and-release, and benign STT disconnects that `CartesiaStreamingClient.reportFailureOrCollapse` routes to a silent close.

### Onboarding trial path

When `routesFinalTranscriptToOnboarding` is active and InkIt is frontmost, the raw transcript lands in `liveTranscript`, trial latency is captured, and state returns to `idle` — no polish, no paste.

### Polish branch

When correction is enabled and an LLM API key exists, `correctedTranscript` sets `.rewriting` and awaits `TranscriptRewriter.rewriteWithoutContext`. A connection warmed at key-press (`warmRewriter`) is consumed here. Polish failure never blocks the user: the coordinator falls back to the raw transcript and records the failure on the history row.

### Focus check and paste

Before pasting, `FocusedEditable.current()` re-checks whether an editable field is focused at release. If not, the transcript is added to `TranscriptHistoryStore`, `showHeldInHistoryNotice()` sets `.heldInHistory` for 2.5 seconds, then auto-clears to `idle`.

If focus is valid, state becomes `.pasting` and `PasteService` synthesizes Cmd+V. Success writes history and returns to `idle`; failure calls `setError("Paste failed")`.

## Error handling

`setError(_:)` is the central failure path:

- Clears paste-target snapshots
- Sets `lastError` and `.error(message)`
- Stops audio and cancels the STT client
- Arms a cancelable 1.5s dismiss via `armErrorDismiss()`

While `isHotkeyHeld` is true in hold mode, the error notice stays on screen; dismiss re-arms until release. After release, the 1.5s tail gives a readable confirmation before returning to `idle`.

`handleSTTFailure` maps `STTFailure` to notch copy (`notchMessage`) and persists `cartesiaKeyInvalid` / `cartesiaOutOfCredits` for fixable causes. STT errors can surface mid-hold so the user stops talking into a dead session.

Pre-flight guards in `startDictation()` also route to `.error` for missing Cartesia key, microphone denial, or missing Accessibility (with throttled Settings re-open).

## audioReady and the preparing cue

`audioReady` is separate from `DictationState` but tightly coupled to `.recording` in the notch HUD.

<ResponseField name="audioReady" type="Bool">
Published by `AppCoordinator`. `false` at dictation start; `true` after `AudioCaptureService.onReady` fires once per take.
</ResponseField>

`AudioCaptureService` signals readiness when input level exceeds `readyLevelThreshold` (0.03) or, as a backstop, after `readyFallbackDelay` (0.6 s). The threshold distinguishes real signal from the digital silence Bluetooth mics emit during A2DP→HFP profile switch (~200–500 ms).

In `NotchHUDView.liveContent`:

- `audioReady == false` → `HUDPreparingDot` (soft pulsing dot)
- `audioReady == true` → `HUDWaveform` driven by `inputLevel`

The still→moving transition is the "start speaking" cue. It prevents users from talking into the dead gap before hardware capture is live.

## liveTranscript coupling

<ResponseField name="liveTranscript" type="String">
Interim STT text from `CartesiaStreamingClient.onTranscriptUpdate`. Cleared on each `startDictation()`.
</ResponseField>

The notch HUD does **not** render `liveTranscript` — only the waveform/preparing cue. Interim words surface elsewhere:

- **Normal dictation:** streaming updates during `.recording` (when not suppressed)
- **Onboarding Try It:** `suppressLivePreview` holds interim text back; the final transcript lands all at once in `TryItPracticeCard` via `onChange(of: transcript)`
- **Post-trial:** final raw text may populate `liveTranscript` before the card copies it into the editable field

`TryItPracticeCard` treats `.finalizing`, `.rewriting`, and `.pasting` collectively as "finalizing" for UI enablement — the send button stays disabled until the pipeline completes.

## Observable surfaces

| Surface | What it reads |
| --- | --- |
| Notch HUD | `state` (live / notice / error only), `audioReady`, `inputLevel` |
| Menu bar | `state` → `menuBarLabel`, `menuBarIconName`, `statusText`, `statusColor` |
| Try It card | `state`, `liveTranscript` |
| Home / History | Written on successful paste or held-in-history path, not on empty collapse |

## Design constraints

- **Press-time vs release-time targets:** Paste target is resolved at hotkey press (with AX prewarm), but editability is verified at release. Stale focus yields `heldInHistory`, not a blind paste.
- **Captured closures:** `onClosed` captures `pasteTargetApp` and `contextTargetSnapshot` at recording start so mid-flight `setError` cannot wipe a valid target.
- **Restart from notices:** `heldInHistory` and `error` are intentionally restartable — pressing the hotkey again means "dictate again" without waiting for auto-dismiss.
- **Latency anchor:** `releaseTime` (monotonic `DispatchTime`) anchors transcribe, polish, and paste millisecond breakdowns on history rows.

## Related pages

<Card href="/dictation-pipeline" title="Dictation pipeline" icon="workflow">
End-to-end flow from hotkey through audio capture, STT, polish, paste, and history persistence.
</Card>

<Card href="/hotkey-bindings" title="Hotkey bindings" icon="keyboard">
Hold vs toggle semantics, Carbon hotkeys, and Fn/Globe event-tap registration.
</Card>

<Card href="/paste-and-focus-reference" title="Paste and focus reference" icon="clipboard">
FocusedEditable checks, paste timing, and the heldInHistory fallback when no field is focused.
</Card>

<Card href="/input-device-selection" title="Input device selection" icon="mic">
Bluetooth profile switch delay and how readyLevelThreshold / readyFallbackDelay backstop audioReady.
</Card>
