# Durable Execution & Checkpointing

> Compares Eve's workflow-backed session/turn/step model with automatic step checkpoints and parked work, against Flue's pluggable run and agent-execution stores, dispatch queue, and workflow run IDs exposed over HTTP.

- Repository: vercel/eve-with-withastro-flue
- GitHub: https://github.com/vercel/eve
- Human wiki: https://www.grok-wiki.com/public/wiki/vercel-eve-with-withastro-flue-43b600348681
- Complete Markdown: https://www.grok-wiki.com/public/wiki/vercel-eve-with-withastro-flue-43b600348681/llms-full.txt

## Source Files

- `vercel-eve:docs/concepts/execution-model-and-durability.md`
- `vercel-eve:docs/concepts/default-harness.md`
- `vercel-eve:packages/eve/src/client/session.ts`
- `withastro-flue:packages/runtime/src/runtime/run-store.ts`
- `withastro-flue:packages/runtime/src/sql-run-store.ts`
- `withastro-flue:packages/runtime/src/agent-execution-store.ts`
- `withastro-flue:packages/runtime/src/runtime/dispatch-queue.ts`

---

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [vercel-eve/docs/concepts/execution-model-and-durability.md](vercel-eve/docs/concepts/execution-model-and-durability.md)
- [vercel-eve/docs/concepts/default-harness.md](vercel-eve/docs/concepts/default-harness.md)
- [vercel-eve/docs/concepts/sessions-runs-and-streaming.md](vercel-eve/docs/concepts/sessions-runs-and-streaming.md)
- [vercel-eve/packages/eve/src/client/session.ts](vercel-eve/packages/eve/src/client/session.ts)
- [vercel-eve/packages/eve/src/execution/workflow-runtime.ts](vercel-eve/packages/eve/src/execution/workflow-runtime.ts)
- [vercel-eve/packages/eve/src/execution/turn-workflow.ts](vercel-eve/packages/eve/src/execution/turn-workflow.ts)
- [vercel-eve/packages/eve/src/execution/durable-session-store.ts](vercel-eve/packages/eve/src/execution/durable-session-store.ts)
- [vercel-eve/packages/eve/src/execution/subagent-adapter.ts](vercel-eve/packages/eve/src/execution/subagent-adapter.ts)
- [withastro-flue/AGENTS.md](withastro-flue/AGENTS.md)
- [withastro-flue/packages/runtime/src/runtime/run-store.ts](withastro-flue/packages/runtime/src/runtime/run-store.ts)
- [withastro-flue/packages/runtime/src/sql-run-store.ts](withastro-flue/packages/runtime/src/sql-run-store.ts)
- [withastro-flue/packages/runtime/src/agent-execution-store.ts](withastro-flue/packages/runtime/src/agent-execution-store.ts)
- [withastro-flue/packages/runtime/src/runtime/dispatch-queue.ts](withastro-flue/packages/runtime/src/runtime/dispatch-queue.ts)
- [withastro-flue/packages/runtime/src/node/agent-coordinator.ts](withastro-flue/packages/runtime/src/node/agent-coordinator.ts)
- [withastro-flue/packages/runtime/src/runtime/flue-app.ts](withastro-flue/packages/runtime/src/runtime/flue-app.ts)

</details>


Both Eve and Flue solve the same underlying problem — agent work must survive crashes, redeploys, and long waits for human input — but they choose different durability boundaries and storage models. Eve wraps the agent loop in a managed Workflow SDK execution graph where every model step is an automatic checkpoint. Flue splits agent sessions from workflow runs and exposes pluggable stores, a dispatch admission queue, and HTTP-addressable run IDs that clients can inspect independently of the agent conversation model.

Understanding where each framework checkpoints, what it replays on recovery, and how clients resume work is the key to integrating either system safely.

## Conceptual split: one durable conversation vs. layered stores

| Dimension | Eve (`vercel-eve`) | Flue (`withastro-flue`) |
|-----------|-------------------|-------------------------|
| Primary durable unit | **Session** — a long-lived conversation spanning days | **Agent session** (persistent harness state) and **workflow run** (finite job) are separate |
| Checkpoint grain | **Step** — one model call plus its tool calls | **Turn journal phase** inside a **submission attempt**; workflow runs tracked as `RunRecord` |
| Execution engine | Workflow SDK (`start`, `resumeHook`, `"use step"`) owned by the runtime | Coordinator + `AgentSubmissionStore` leases; workflow modules get their own `runId` |
| Resume handle | `continuationToken` (park/resume) + `sessionId` (stream) | `dispatchId` for dispatched agent input; opaque `run_<ulid>` for workflow runs |
| Message ordering | No durable per-session FIFO; client should serialize sends | Per-session submission queue: at most one runnable head per session |
| Storage model | Framework-owned workflow persistence | Pluggable `PersistenceAdapter` (`RunStore`, `AgentExecutionStore`, `EventStreamStore`) |
| HTTP inspection | `GET /eve/v1/session/<sessionId>/stream` (NDJSON) | `GET /runs/:runId` (Durable Streams); `GET /runs/:runId?meta` (JSON run record) |

Sources: [vercel-eve/docs/concepts/execution-model-and-durability.md:9-36](), [withastro-flue/AGENTS.md:7-20](), [withastro-flue/packages/runtime/src/agent-execution-store.ts:336-353]()

## Eve: workflow-backed session / turn / step model

### Three nesting levels

Eve organizes durable work as:

1. **Session** — the whole conversation or task; survives process restarts and redeploys without configuration.
2. **Turn** — one user message and all work it triggers until the agent responds.
3. **Step** — a durable checkpoint inside a turn (one model call and the tool calls it makes).

Every turn runs as a durable workflow. The runtime checkpoints progress and serializes durable state at each step boundary. Agent code inside tools and the sandbox feels synchronous even though the session underneath is durable.

Sources: [vercel-eve/docs/concepts/execution-model-and-durability.md:9-16]()

### Runtime architecture

Eve's workflow runtime uses a long-lived **driver workflow** that owns the event stream and dispatches each turn as a **child workflow run**:

```text
Client                    Eve runtime                         Workflow SDK
  |                            |                                    |
  |-- POST /session ---------->| start(workflowEntry) ------------->| driver run (pinned deployment)
  |<-- sessionId, token ------|                                    |
  |-- GET .../stream --------->| event stream (NDJSON)              |
  |                            | start(turnWorkflow) -------------->| child turn run (latest deployment)
  |                            |   turnStep() -> "use step" ------->| checkpoint boundary
  |                            |   park -> resumeHook(token) ----->| suspend (no compute)
  |-- POST + continuationToken>| resumeHook ----------------------->| resume parked turn
```

The driver is pinned to the deployment that called `start()`, while child turn workflows route to the latest deployment. Session snapshots travel inside workflow step results as the atomic persistence boundary for program memory.

Sources: [vercel-eve/packages/eve/src/execution/workflow-runtime.ts:74-98](), [vercel-eve/packages/eve/src/execution/turn-workflow.ts:24-40](), [vercel-eve/packages/eve/src/execution/durable-session-store.ts:1-16]()

### Checkpoint and replay semantics

On crash, timeout, or redeploy mid-turn:

- **Completed steps never re-run** — Eve replays the recorded result.
- **A step interrupted mid-execution re-runs** — non-idempotent side effects (charges, emails) need idempotency keys or approval gates.

There is nothing to configure; Eve owns the workflow lifecycle and sessions are durable by default. Workflow primitives (`start()`, `resumeHook()`, etc.) are implementation details; channels, tools, and hooks never touch them directly.

Sources: [vercel-eve/docs/concepts/execution-model-and-durability.md:18-24](), [vercel-eve/packages/eve/src/execution/subagent-adapter.ts:100-112]()
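
The two replay rules can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Eve's implementation: `StepJournal` and `PaymentProvider` are hypothetical names, and the idempotency key format is invented for the example. It shows a step that crashes after its side effect, re-runs, and still charges only once because the key is stable across attempts.

```typescript
// Hypothetical names throughout: StepJournal and PaymentProvider are
// illustrations, not part of Eve or the Workflow SDK.
class StepJournal {
  private completed = new Map<string, unknown>();

  // Runs `body` at most once to completion per stepId; a recorded
  // result is replayed instead of calling `body` again.
  runStep<T>(stepId: string, body: () => T): T {
    if (this.completed.has(stepId)) {
      return this.completed.get(stepId) as T; // completed step: replay
    }
    const result = body(); // if this throws, nothing is recorded
    this.completed.set(stepId, result);
    return result;
  }
}

class PaymentProvider {
  private receipts = new Map<string, string>();
  chargesMade = 0;

  // A repeated call with the same key returns the first receipt.
  charge(idempotencyKey: string, amountCents: number): string {
    const prior = this.receipts.get(idempotencyKey);
    if (prior !== undefined) return prior;
    this.chargesMade += 1;
    const receipt = `receipt-${this.chargesMade}-${amountCents}`;
    this.receipts.set(idempotencyKey, receipt);
    return receipt;
  }
}

const journal = new StepJournal();
const payments = new PaymentProvider();
let crashOnce = true;

function chargeStep(): string {
  // The key is derived from stable identifiers, so a re-run reuses it.
  const receipt = payments.charge("session-1/turn-4/charge", 1999);
  if (crashOnce) {
    crashOnce = false;
    throw new Error("simulated crash after the side effect");
  }
  return receipt;
}

let firstAttemptFailed = false;
try {
  journal.runStep("turn-4/step-2", chargeStep);
} catch {
  firstAttemptFailed = true; // interrupted step: no result was recorded
}
const rerun = journal.runStep("turn-4/step-2", chargeStep); // body re-runs
const replayed = journal.runStep("turn-4/step-2", chargeStep); // recorded result
```

Without the idempotency key, the re-run would have charged a second time, which is why interrupted steps with external side effects need either a key like this or an approval gate in front of them.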

### Parked work

Some work must wait: human tool approval, `ask_question`, OAuth sign-in, or a long-running subagent. At those points the turn **parks durably**. The workflow suspends and holds no compute until the awaited input arrives. When it does, the conversation resumes exactly where it left off.

Parked states surface on the event stream as `input.requested`, `authorization.required`, or `session.waiting`. Clients resume by POSTing to the session endpoint with the current `continuationToken`; a stale token is rejected.

OAuth and terminal callbacks use unguessable workflow hook tokens on framework-owned routes (`/eve/v1/callback/:token`, `/eve/v1/connections/:name/callback/:token`) that call `resumeHook(token, payload)`.

Sources: [vercel-eve/docs/concepts/execution-model-and-durability.md:26-28](), [vercel-eve/docs/concepts/sessions-runs-and-streaming.md:8-15,45-61](), [vercel-eve/packages/eve/src/protocol/routes.ts:77-99]()
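
Stale-token rejection can be modeled in a few lines. The sketch below is hypothetical: `ParkedSession` is not an Eve type, and it assumes that each successful resume rotates the token, which the text implies by referring to the "current" token but does not spell out. Real tokens are unguessable; the predictable strings here exist only to keep the example readable.

```typescript
// Hypothetical in-memory model; Eve's real token handling is internal.
type ResumeResult =
  | { ok: true; nextToken: string }
  | { ok: false; reason: "stale_token" };

class ParkedSession {
  private sessionId: string;
  private generation = 0;

  constructor(sessionId: string) {
    this.sessionId = sessionId;
  }

  // Predictable on purpose; a real token would be random and opaque.
  get currentToken(): string {
    return `${this.sessionId}:${this.generation}`;
  }

  // Accepts only the token issued for the current park point.
  resume(token: string): ResumeResult {
    if (token !== this.currentToken) {
      return { ok: false, reason: "stale_token" };
    }
    this.generation += 1; // assumption: the next park point gets a new token
    return { ok: true, nextToken: this.currentToken };
  }
}

const parked = new ParkedSession("sess-1");
const tokenAtPark = parked.currentToken;
const firstResume = parked.resume(tokenAtPark); // accepted
const secondResume = parked.resume(tokenAtPark); // rejected: token is stale
```

A client that sees a rejection like `secondResume` should re-read the stream for the latest token rather than retry with the old one.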

### Client session state

The Eve client tracks `continuationToken`, `sessionId`, and `streamIndex` across `send()` calls. To persist a session and resume it later, serialize `ClientSession.state`. When a human-in-the-loop (HITL) input response targets a session that is temporarily not found, the client retries delivery up to 10 times.

Sources: [vercel-eve/packages/eve/src/client/session.ts:30-53,262-296]()
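
Persisting the client state might look like the following round trip. The three field names come from the paragraph above; their types, and the helper names `saveState` and `loadState`, are assumptions made for this sketch and may not match Eve's published types.

```typescript
// Assumed shape: field names are from the text, types are guesses.
interface SessionStateSnapshot {
  sessionId: string;
  continuationToken: string;
  streamIndex: number;
}

// Hypothetical helper: turn the state into a string for any key-value store.
function saveState(state: SessionStateSnapshot): string {
  return JSON.stringify(state);
}

// Hypothetical helper: validate before trusting stored data.
function loadState(raw: string): SessionStateSnapshot {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("session state must be a JSON object");
  }
  const { sessionId, continuationToken, streamIndex } =
    value as Record<string, unknown>;
  if (
    typeof sessionId !== "string" ||
    typeof continuationToken !== "string" ||
    typeof streamIndex !== "number"
  ) {
    throw new Error("session state is missing required fields");
  }
  return { sessionId, continuationToken, streamIndex };
}

const saved = saveState({
  sessionId: "sess-1",
  continuationToken: "tok-abc",
  streamIndex: 42,
});
const restored = loadState(saved);
```

Validating on load matters because a stored `continuationToken` may already be stale by the time the state is restored; the shape check only guards against corrupt data, not against expiry.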

### What Eve does not guarantee

Eve does **not** maintain a durable FIFO queue of user messages per session. The `continuationToken` is a resume handle for the session's current workflow hook, not a general message-queue address. For deterministic behavior, send one user turn at a time and wait for `session.waiting` before sending the next message. Channels that receive bursts should keep their own per-session queue.

Sources: [vercel-eve/docs/concepts/execution-model-and-durability.md:30-36]()
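
A channel that receives bursts could serialize sends with a per-session queue along these lines. This is a minimal sketch, assuming the channel can observe `session.waiting` on the event stream; `PerSessionSendQueue` and its `Send` callback are hypothetical, and the queue is in-memory only, so a production channel would need to persist it.

```typescript
// Hypothetical channel-side helper; not part of Eve.
type Send = (sessionId: string, message: string) => void;

class PerSessionSendQueue {
  private pending = new Map<string, string[]>();
  private inFlight = new Set<string>();
  private send: Send;

  constructor(send: Send) {
    this.send = send;
  }

  // Queue a message; it is delivered immediately only if the session is idle.
  enqueue(sessionId: string, message: string): void {
    const queue = this.pending.get(sessionId) ?? [];
    queue.push(message);
    this.pending.set(sessionId, queue);
    this.pump(sessionId);
  }

  // Call this when the event stream reports `session.waiting`.
  onSessionWaiting(sessionId: string): void {
    this.inFlight.delete(sessionId);
    this.pump(sessionId);
  }

  // Sends the next queued message if no turn is in flight for the session.
  private pump(sessionId: string): void {
    if (this.inFlight.has(sessionId)) return;
    const next = this.pending.get(sessionId)?.shift();
    if (next === undefined) return;
    this.inFlight.add(sessionId);
    this.send(sessionId, next);
  }
}

const delivered: string[] = [];
const sendQueue = new PerSessionSendQueue((id, msg) => {
  delivered.push(`${id}:${msg}`);
});

// A burst of three messages: only the first is sent right away.
sendQueue.enqueue("s1", "a");
sendQueue.enqueue("s1", "b");
sendQueue.enqueue("s1", "c");
const afterBurst = delivered.slice();

sendQueue.onSessionWaiting("s1"); // turn finished: "b" goes out
sendQueue.onSessionWaiting("s1"); // turn finished: "c" goes out
```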

## Flue: pluggable stores, dispatch queue, and workflow run IDs

### Terminology: runs are workflow-only

Flue distinguishes agent conversations from workflow jobs:

- **Agent path**: persistent instances, harnesses, sessions, operations, and turns. Direct prompts and `dispatch()` inputs live here; they correlate by `dispatchId`, not `runId`.
- **Workflow path**: `workflows/<name>.ts` exports `run(...)`; each invocation gets a unique `ctx.id === runId`.

`GET /runs/:runId` and `flue logs` inspect **workflow runs only**.

Sources: [withastro-flue/AGENTS.md:7-20]()

### Persistence adapter bundle

Users configure durability by exporting a `PersistenceAdapter` from `db.ts`. At startup the framework calls `migrate()` (if present), then `connect()` to obtain three stores. The execution store exposes two sub-stores, so the table has four rows:

| Store | Responsibility |
|-------|----------------|
| `executionStore.sessions` | Agent session snapshots |
| `executionStore.submissions` | Durable submission lifecycle, turn journals, stream chunks, leases |
| `runStore` | Workflow run records (`active` / `completed` / `errored`) |
| `eventStreamStore` | Append-only durable event streams for agents and runs |

Adapters exist for SQLite, Postgres, MySQL, Redis, MongoDB, and Cloudflare Durable Object SQLite. Schema versioning is enforced at boot — an unknown or newer stored version fails loudly before any read/write.

Sources: [withastro-flue/packages/runtime/src/agent-execution-store.ts:336-397](), [withastro-flue/packages/runtime/src/sql-run-store.ts:1-8,121-143]()
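
The boot order and the fail-loud version check can be sketched as follows. Everything here is a simplified assumption: `AdapterSketch` is not Flue's `PersistenceAdapter` type, the methods are synchronous where the real ones presumably return promises, the version numbers are invented, and the point at which Flue actually performs the check is not confirmed by the sources above.

```typescript
// Hypothetical, synchronous model of the boot sequence.
const SUPPORTED_SCHEMA_VERSIONS = new Set<number>([1, 2, 3]); // invented values

interface AdapterSketch {
  migrate?: () => void;
  connect: () => { storedSchemaVersion: number };
}

function boot(adapter: AdapterSketch): number {
  if (adapter.migrate) adapter.migrate(); // optional, runs first
  const { storedSchemaVersion } = adapter.connect();
  if (!SUPPORTED_SCHEMA_VERSIONS.has(storedSchemaVersion)) {
    // Fail before any store is handed to the rest of the runtime.
    throw new Error(`unsupported schema version ${storedSchemaVersion}`);
  }
  return storedSchemaVersion;
}

const calls: string[] = [];
const goodAdapter: AdapterSketch = {
  migrate: () => {
    calls.push("migrate");
  },
  connect: () => {
    calls.push("connect");
    return { storedSchemaVersion: 3 };
  },
};
const bootedVersion = boot(goodAdapter);

// A database written by a newer release is rejected at boot.
let newerRejected = false;
try {
  boot({ connect: () => ({ storedSchemaVersion: 4 }) });
} catch {
  newerRejected = true;
}
```

Rejecting at boot, rather than on first query, keeps an older binary from writing rows in a layout that a newer schema no longer expects.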

### RunStore: workflow run lifecycle

`RunStore` persists one record per workflow run:

```typescript
// withastro-flue/packages/runtime/src/runtime/run-store.ts
export type RunStatus = 'active' | 'completed' | 'errored';

export interface RunStore {
  createRun(input: CreateRunInput): Promise<void>;   // idempotent, first-writer-wins
  endRun(input: EndRunInput): Promise<void>;
  getRun(runId: string): Promise<RunRecord | null>;
  lookupRun(runId: string): Promise<RunPointer | null>;
  listRuns(opts?: ListRunsOpts): Promise<ListRunsResponse>;
}
```

`createRun` uses `INSERT OR IGNORE` semantics, so replaying a `runId` leaves the existing row untouched and can never flip a terminal record back to `active`. The SQL adapter stores rows in `flue_runs`, with indexes that support listing by workflow name and status.

Sources: [withastro-flue/packages/runtime/src/runtime/run-store.ts:3-127](), [withastro-flue/packages/runtime/src/sql-run-store.ts:38-66,121-143]()
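
First-writer-wins creation is easy to see in an in-memory model. The sketch below is a synchronous simplification, not an implementation of the `RunStore` interface: the class name, the method parameters, the reduced record shape, and the rule that `endRun` ignores already-terminal rows are all assumptions for illustration.

```typescript
// Synchronous in-memory simplification; the real RunStore methods return
// promises and take structured inputs. Field set is a hypothetical subset.
type SketchRunStatus = "active" | "completed" | "errored";

interface RunRecordSketch {
  runId: string;
  workflowName: string;
  status: SketchRunStatus;
}

class InMemoryRuns {
  private rows = new Map<string, RunRecordSketch>();

  // Mirrors INSERT OR IGNORE: an existing row is left untouched.
  createRun(runId: string, workflowName: string): void {
    if (this.rows.has(runId)) return;
    this.rows.set(runId, { runId, workflowName, status: "active" });
  }

  // Assumption: only an active run can be ended; terminal rows stay terminal.
  endRun(runId: string, status: "completed" | "errored"): void {
    const row = this.rows.get(runId);
    if (row === undefined || row.status !== "active") return;
    row.status = status;
  }

  // Returns a copy so callers cannot mutate stored state.
  getRun(runId: string): RunRecordSketch | null {
    const row = this.rows.get(runId);
    return row ? { ...row } : null;
  }
}

const runs = new InMemoryRuns();
runs.createRun("run_example", "nightly-report");
runs.endRun("run_example", "completed");
runs.createRun("run_example", "nightly-report"); // replayed create: ignored
const afterReplay = runs.getRun("run_example");
```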

### Agent submission store and turn journal

Agent durability lives in `AgentSubmissionStore`, a backend-neutral contract covering submission admission, turn journals, stream chunk segments, attempt markers, and lease management.

**Submission states**: `queued` → `running` → `settled`

**Turn journal phases** (checkpoint progression inside one attempt):

```text
before_provider → provider_started → tool_request_recorded → committed
```

Each journal tracks `checkpointLeafId`, optional `toolRequest`, `streamKey`, and commit metadata (`committedLeafId`). Methods like `beginTurnJournal`, `updateTurnJournalPhase`, `commitTurnJournal`, and `replaceTurnJournalAttempt` implement compare-and-set semantics so concurrent coordinators cannot double-commit or steal ownership.

Default durability knobs: `DURABILITY_DEFAULT_MAX_ATTEMPTS = 10`, `DURABILITY_DEFAULT_TIMEOUT_MS = 3_600_000` (1 hour), `LEASE_DURATION_MS = 30_000` (30 seconds).

Sources: [withastro-flue/packages/runtime/src/agent-execution-store.ts:19-26,92-215,256-312]()
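
The compare-and-set behavior can be sketched with a single in-memory journal row. The phase names come from the progression above; `TurnJournalSketch`, its method names, and the rule that a replaced attempt keeps the current phase are assumptions for illustration and are not Flue's store contract.

```typescript
// Hypothetical in-memory model of compare-and-set phase updates.
const PHASES = [
  "before_provider",
  "provider_started",
  "tool_request_recorded",
  "committed",
] as const;
type Phase = (typeof PHASES)[number];

class TurnJournalSketch {
  private attemptId: string;
  private current: Phase = "before_provider";

  constructor(attemptId: string) {
    this.attemptId = attemptId;
  }

  get phase(): Phase {
    return this.current;
  }

  // Succeeds only if the caller owns the attempt, saw the current phase,
  // and is moving forward in the progression.
  advance(attemptId: string, expected: Phase, next: Phase): boolean {
    const owns = this.attemptId === attemptId;
    const fresh = this.current === expected;
    const forward = PHASES.indexOf(next) > PHASES.indexOf(expected);
    if (!owns || !fresh || !forward) return false;
    this.current = next;
    return true;
  }

  // Recovery: a new attempt takes over only from the attempt it observed.
  replaceAttempt(expectedAttemptId: string, newAttemptId: string): boolean {
    if (this.attemptId !== expectedAttemptId) return false;
    if (this.current === "committed") return false;
    this.attemptId = newAttemptId;
    return true;
  }
}

const turnJournal = new TurnJournalSketch("attempt-a");
const aStarts = turnJournal.advance("attempt-a", "before_provider", "provider_started");
const bSteals = turnJournal.advance("attempt-b", "provider_started", "committed");
const takeover = turnJournal.replaceAttempt("attempt-a", "attempt-b");
const aAfterTakeover = turnJournal.advance("attempt-a", "provider_started", "committed");
const bCommits = turnJournal.advance("attempt-b", "provider_started", "committed");
const doubleCommit = turnJournal.advance("attempt-b", "provider_started", "committed");
```

In a SQL-backed store the same check would typically be a conditional `UPDATE ... WHERE attempt_id = ? AND phase = ?`, with the affected-row count playing the role of the boolean returned here.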

### Dispatch queue

`DispatchQueue.enqueue()` admits work durably through the coordinator:

```typescript
// withastro-flue/packages/runtime/src/node/agent-coordinator.ts (createNodeDispatchQueue)
async enqueue(input: DispatchInput): Promise<DispatchReceipt> {
  const admission = await coordinator.admitDispatch(input);
  // exact replay → original receipt; conflicting replay → throw
  // admission persisted in SQL; processing is async via claim loop
}
```

Admission is idempotent keyed by `dispatchId`. The coordinator runs a claim loop with lease heartbeats, reconciles interrupted submissions from a previous process on startup, and enforces **at most one runnable head per session** — later queued work in the same session waits until earlier submissions settle.

Sources: [withastro-flue/packages/runtime/src/runtime/dispatch-queue.ts:3-13](), [withastro-flue/packages/runtime/src/node/agent-coordinator.ts:56-88,369-398](), [withastro-flue/packages/runtime/src/agent-execution-store.ts:228-256]()
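
The admission and claim rules can be modeled in memory as below. This is a sketch, not the coordinator: `AdmissionSketch` is hypothetical, every field other than `dispatchId` is an assumption, and leases and heartbeats are left out. It shows idempotent admission keyed by `dispatchId` and the single runnable head per session.

```typescript
// Hypothetical in-memory model; only `dispatchId` is a name from the text.
interface DispatchInputSketch {
  dispatchId: string;
  sessionId: string;
  payload: string;
}

interface SubmissionSketch extends DispatchInputSketch {
  state: "queued" | "running" | "settled";
}

class AdmissionSketch {
  private byId = new Map<string, SubmissionSketch>();
  private order: SubmissionSketch[] = [];

  // Exact replay returns the original submission; a different payload
  // under the same dispatchId is a conflict and throws.
  admit(input: DispatchInputSketch): SubmissionSketch {
    const existing = this.byId.get(input.dispatchId);
    if (existing) {
      const same =
        existing.sessionId === input.sessionId &&
        existing.payload === input.payload;
      if (!same) throw new Error(`conflicting replay for ${input.dispatchId}`);
      return existing;
    }
    const submission: SubmissionSketch = { ...input, state: "queued" };
    this.byId.set(input.dispatchId, submission);
    this.order.push(submission);
    return submission;
  }

  // Claims the oldest queued submission whose session has no earlier
  // unsettled work ahead of it.
  claimNext(): SubmissionSketch | null {
    const blocked = new Set<string>();
    for (const s of this.order) {
      if (s.state === "settled") continue;
      if (s.state === "queued" && !blocked.has(s.sessionId)) {
        s.state = "running";
        return s;
      }
      blocked.add(s.sessionId); // running, or queued behind earlier work
    }
    return null;
  }

  settle(dispatchId: string): void {
    const s = this.byId.get(dispatchId);
    if (s) s.state = "settled";
  }
}

const admissions = new AdmissionSketch();
const d1 = admissions.admit({ dispatchId: "d1", sessionId: "s1", payload: "hi" });
admissions.admit({ dispatchId: "d2", sessionId: "s1", payload: "again" });
admissions.admit({ dispatchId: "d3", sessionId: "s2", payload: "other" });
const d1Replay = admissions.admit({ dispatchId: "d1", sessionId: "s1", payload: "hi" });

const claimOrder: string[] = [];
claimOrder.push(admissions.claimNext()?.dispatchId ?? "none"); // d1
claimOrder.push(admissions.claimNext()?.dispatchId ?? "none"); // d3: d2 waits on d1
claimOrder.push(admissions.claimNext()?.dispatchId ?? "none"); // none
admissions.settle("d1");
claimOrder.push(admissions.claimNext()?.dispatchId ?? "none"); // d2
```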

### Stream persistence policy

Flue selectively persists streamed events:

- **Buffered** (~3 s flush): `text_delta`, `thinking_start`, `thinking_delta`, `thinking_end` — avoids one storage write per chunk.
- **Excluded entirely**: `turn_request` — would grow storage quadratically and expose full prompts to every stream reader.

Interrupted-stream recovery reads throttled `StreamChunkWriter` segments; `message_end` carries the complete message for history replay.

Sources: [withastro-flue/packages/runtime/src/runtime/run-store.ts:129-165]()
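
The policy above can be approximated by a writer that buffers delta events and drops `turn_request`. This is a sketch under assumptions: `BufferedEventWriter` is a hypothetical name, not Flue's `StreamChunkWriter`; the 3000 ms interval is taken from the "~3 s" figure; and flushing is driven by a caller-supplied clock rather than a timer so the behavior is easy to follow. Timestamps are relative to writer creation.

```typescript
// Event names come from the list above; everything else is assumed.
const BUFFERED_TYPES = new Set<string>([
  "text_delta",
  "thinking_start",
  "thinking_delta",
  "thinking_end",
]);
const EXCLUDED_TYPES = new Set<string>(["turn_request"]);
const FLUSH_INTERVAL_MS = 3_000; // assumption based on "~3 s"

interface StreamEventSketch {
  type: string;
  data: string;
}

class BufferedEventWriter {
  // Each entry models one storage write.
  readonly writes: StreamEventSketch[][] = [];
  private buffer: StreamEventSketch[] = [];
  private lastFlushAt = 0;

  append(event: StreamEventSketch, nowMs: number): void {
    if (EXCLUDED_TYPES.has(event.type)) return; // never persisted
    if (BUFFERED_TYPES.has(event.type)) {
      this.buffer.push(event);
      if (nowMs - this.lastFlushAt >= FLUSH_INTERVAL_MS) this.flush(nowMs);
      return;
    }
    // Other events: flush pending deltas first to keep order, then write.
    this.flush(nowMs);
    this.writes.push([event]);
  }

  flush(nowMs: number): void {
    if (this.buffer.length > 0) {
      this.writes.push(this.buffer);
      this.buffer = [];
    }
    this.lastFlushAt = nowMs;
  }
}

const writer = new BufferedEventWriter();
writer.append({ type: "turn_request", data: "full prompt" }, 0); // dropped
writer.append({ type: "text_delta", data: "Hel" }, 100);
writer.append({ type: "text_delta", data: "lo" }, 200);
writer.append({ type: "text_delta", data: " wor" }, 3100); // interval elapsed
writer.append({ type: "text_delta", data: "ld" }, 3200);
writer.append({ type: "message_end", data: "Hello world" }, 3300);
```

Six appended events produce three storage writes here: one batch of three deltas, one batch of one delta, and the `message_end` event on its own.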

### HTTP surface for workflow runs

The mounted `flue()` sub-app exposes:

| Route | Purpose |
|-------|---------|
| `POST /workflows/:name` | Start a workflow run; default `202` with `streamUrl` and `runId`; `?wait=result` for sync JSON |
| `GET/HEAD /runs/:runId` | Durable Streams protocol read (catch-up, long-poll, SSE) |
| `GET /runs/:runId?meta` | Plain JSON `RunRecord` (status, payload, result, timing) |
| `POST /agents/:name/:id` | Agent prompt admission (`202` + `streamUrl`); not a workflow run |

Run IDs are opaque `run_<ulid>` values. Clients observe live and historical run events at a stable URL without knowing which agent or instance owns the run.

Sources: [withastro-flue/packages/runtime/src/runtime/flue-app.ts:253-296](), [withastro-flue/packages/runtime/test/routing.test.ts:753-794]()
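
A client can derive every URL it needs from the route table. The helpers below are a sketch: the function names and the `https://example.test/flue` mount point are hypothetical, and only the path shapes and the `wait=result` and `meta` query strings are taken from the table. The `runId` is passed through unparsed, since it is opaque.

```typescript
// Hypothetical helpers; only the route shapes come from the table above.
function joinBase(base: string, path: string): string {
  // Tolerate a trailing slash on the mount point.
  return `${base.replace(/\/+$/, "")}${path}`;
}

// POST target for starting a workflow run.
function startWorkflowUrl(base: string, name: string, waitForResult = false): string {
  const url = joinBase(base, `/workflows/${encodeURIComponent(name)}`);
  return waitForResult ? `${url}?wait=result` : url;
}

// GET/HEAD target for the durable event stream of a run.
function runStreamUrl(base: string, runId: string): string {
  return joinBase(base, `/runs/${encodeURIComponent(runId)}`);
}

// GET target for the plain JSON run record.
function runMetaUrl(base: string, runId: string): string {
  return `${runStreamUrl(base, runId)}?meta`;
}

const base = "https://example.test/flue/"; // hypothetical mount point
const startUrl = startWorkflowUrl(base, "nightly-report");
const syncUrl = startWorkflowUrl(base, "nightly-report", true);
const streamUrl = runStreamUrl(base, "run_EXAMPLE");
const metaUrl = runMetaUrl(base, "run_EXAMPLE");
```

In practice the `202` response already carries a `streamUrl`, so a client would normally use that value directly and build only the `?meta` variant itself.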

## Side-by-side lifecycle diagrams

### Eve turn lifecycle (workflow-owned)

```mermaid
stateDiagram-v2
    [*] --> TurnStarted: user message delivered
    TurnStarted --> StepRunning: turnWorkflow → turnStep
    StepRunning --> StepCompleted: model + tools finish
    StepCompleted --> StepRunning: more tool rounds
    StepCompleted --> Parked: HITL / OAuth / subagent wait
    Parked --> StepRunning: resumeHook(continuationToken)
    StepCompleted --> TurnCompleted: terminal reply
    TurnCompleted --> SessionWaiting: conversation mode
    TurnCompleted --> SessionCompleted: task mode done
    SessionWaiting --> TurnStarted: next message + token
    StepRunning --> StepFailed: unrecoverable error
    StepFailed --> SessionFailed
```

Sources: [vercel-eve/docs/concepts/sessions-runs-and-streaming.md:37-63](), [vercel-eve/packages/eve/src/execution/turn-workflow.ts:38-56]()

### Flue agent submission lifecycle (store-owned)

```mermaid
stateDiagram-v2
    [*] --> Queued: admitDispatch / admitDirect
    Queued --> Running: claimSubmission (lease acquired)
    Running --> JournalBeforeProvider: beginTurnJournal
    JournalBeforeProvider --> ProviderStarted: updateTurnJournalPhase
    ProviderStarted --> ToolRecorded: tool_request_recorded
    ToolRecorded --> Committed: commitTurnJournal
    Committed --> Settled: completeSubmission / failSubmission
    Running --> Queued: requeueSubmissionBeforeInputApplied
    Running --> Running: replaceTurnJournalAttempt (recovery)
    Settled --> [*]
```

Sources: [withastro-flue/packages/runtime/src/agent-execution-store.ts:30-31,94-98,168-215]()

## Portable ideas and integration pitfalls

**What transfers well**

- **Separate resume from inspect handles.** Eve's `continuationToken` vs. `sessionId` and Flue's `dispatchId` vs. `runId` both avoid overloading one identifier for wake-up and observation.
- **Explicit park semantics.** Both frameworks suspend compute during waits rather than polling; clients must understand the parked/waiting signal before sending follow-ups.
- **Idempotent admission.** Flue's `createRun` and `admitDispatch` first-writer-wins patterns mirror Eve's completed-step replay guarantees — safe retries require stable keys.
- **Checkpoint at model boundaries.** Eve steps and Flue turn journal phases both anchor durability where LLM/provider state transitions occur.

**Pitfalls when porting patterns**

| Pitfall | Eve behavior | Flue behavior |
|---------|-------------|---------------|
| Treating `runId` as session ID | Eve uses `sessionId` for both stream and session scope | `runId` is workflow-only; agent work uses instance + session + `dispatchId` |
| Burst message sends | Best-effort fold at workflow boundaries; no durable FIFO | Per-session submission queue with single runnable head |
| Assuming all stream events are persisted | Full NDJSON protocol events are durable in workflow history | `turn_request` is excluded; text and thinking delta events are buffered and flushed about every 3 s |
| Custom storage | Not exposed — Workflow SDK owns persistence | `PersistenceAdapter` is the extension point |
| Mid-step side effects | Interrupted steps re-run | Recovery via `replaceTurnJournalAttempt` + attempt markers + lease reconciliation |

## Summary

Eve optimizes for **opinionated, zero-config durability**: every agent session is already a Workflow SDK graph with automatic step checkpoints, parked turns that release compute, and a simple client contract (`continuationToken` + `sessionId`). Authors write tools and state, not workflow code.

Flue optimizes for **composable, store-backed durability**: agent submissions flow through an admitted dispatch queue with turn journals, leases, and startup reconciliation, while workflow runs get first-class `runId` records and HTTP stream endpoints backed by interchangeable databases. The framework makes the split between long-lived agent sessions and finite workflow jobs explicit.

Choose Eve when you want the runtime to own checkpoint boundaries end-to-end. Choose Flue when you need to pick your database, inspect workflow runs over HTTP independently, and reason separately about dispatched agent work versus orchestrated workflow invocations.
