# Enrichment workflows

> How enrichment agents read source metadata, ground on external docs or code, emit OKF bundles or mdcode artifacts, and hand off to kcmd push for catalog publication.

- Repository: GoogleCloudPlatform/knowledge-catalog
- GitHub: https://github.com/GoogleCloudPlatform/knowledge-catalog
- Human docs: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5
- Complete Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms-full.txt

## Source Files

- `okf/src/enrichment_agent/runner.py`
- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/src/modes/table_mode.py`
- `agents/enrichment/src/modes/doc_mode.py`
- `toolbox/enrichment/README.md`
- `samples/enrichment/README.md`

---


Knowledge Catalog ships four enrichment surfaces that share a read → ground → emit → publish pattern. The OKF `enrichment-agent` CLI writes vendor-neutral OKF bundles from BigQuery, with an optional web pass. `agents/enrichment/src/agent_runner.py` writes mdcode workspaces using only the read-only `kcmd init`, `pull`, and `reference` commands. `toolbox/enrichment` exposes a TypeScript `kcagent enrich` with pluggable MCP tools and skills. `samples/enrichment` demonstrates download → enrich → publish against the catalog APIs. None of the agents call `kcmd push`: publication is always a separate step that you run after reviewing the local output.

## Enrichment surfaces

| Surface | Entry point | Source metadata | Output artifact | Publication |
| --- | --- | --- | --- | --- |
| OKF enrichment agent | `enrichment-agent enrich` | BigQuery API via `BigQuerySource` | OKF bundle directory (`--out`) | Exchange or import; not wired to `kcmd push` |
| Catalog enrichment agent | `agent_runner.py` | `kcmd init` + `pull` or `reference` | mdcode workspace (`catalog.yaml` + `catalog/`) | `kcmd push` |
| Toolbox agent | `kcagent enrich` | `kcmd pull` snapshot in `--catalog-path` | Updated mdcode in workspace | `kcmd push` |
| Python sample | `python3 -m enrichment.enrich` | Downloaded snapshot (`enrichment.download`) | Updated metadata directory | `python3 -m enrichment.publish` or `kcmd push` |

<Note>
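Model and cloud configuration follow a bring-your-own-cloud, bring-your-own-key (BYOC/BYOK) approach: pass `--project`, `--location`, and `--model` to the catalog agent; the OKF agent takes `--model` and uses the billing project from Application Default Credentials (ADC). No provider is hardcoded beyond what you configure at runtime.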
</Note>

## Shared workflow pattern

Every enrichment path follows the same lifecycle: discover concepts or entries from a source, attach external grounding, generate enriched prose or structured aspects, optionally refine, then publish.

```mermaid
flowchart TB
  subgraph sources["Source metadata"]
    BQ["BigQuery API / INFORMATION_SCHEMA"]
    KC["kcmd pull / reference"]
  end

  subgraph grounding["External grounding"]
    Drive["Google Drive / local .md"]
    Web["Web crawl seeds"]
    GH["GitHub MCP repo exploration"]
    FB["User-feedback proposals"]
    Usage["BQ query-history signal"]
  end

  subgraph agents["Enrichment agents"]
    OKF["okf/enrichment_agent"]
    CAT["agents/enrichment agent_runner"]
    TB["toolbox/kcagent"]
  end

  subgraph artifacts["Local artifacts"]
    OKFB["OKF bundle"]
    MDC["mdcode workspace"]
  end

  subgraph publish["Publication (user step)"]
    PUSH["kcmd push"]
    API["catalog API publish"]
  end

  BQ --> OKF
  KC --> CAT
  KC --> TB
  Drive --> CAT
  Drive --> TB
  Web --> OKF
  GH --> CAT
  FB --> CAT
  Usage --> CAT
  OKF --> OKFB
  CAT --> MDC
  TB --> MDC
  MDC --> PUSH
  MDC --> API
```

## OKF bundle workflow

The OKF enrichment agent (`okf/src/enrichment_agent/`) implements a pipeline controlled by `EnrichmentRunner`: two enrichment passes followed by index regeneration.

1. **BQ pass** — `BigQuerySource.list_concepts()` enumerates dataset and table concepts (wildcard shard families collapse to one concept per prefix). For each concept, `build_bq_agent` runs an ADK agent with tools to `read_concept`, `sample_rows`, and `write_concept_doc`.
2. **Web pass** (optional) — When `--web-seed` or `--web-seed-file` is set, `build_web_agent` crawls outward from seeds with hard limits (`--web-max-pages`, `--web-max-depth`, host/path constraints). Fetched pages enrich existing concepts or land in `references/<slug>`.
3. **Index regeneration** — `regenerate_indexes()` rebuilds progressive-disclosure `index.md` files across the bundle.
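
The shard-family collapse in the BQ pass can be pictured with a small sketch. It assumes date-sharded tables follow the `<prefix>YYYYMMDD` naming convention and that a family is keyed by its prefix. The function name `collapse_shards` and the `tables/` concept prefix handling are illustrative and are not taken from `BigQuerySource`.

```python
import re
from collections import OrderedDict

# Assumption: a shard is a table name that ends in an 8-digit date suffix.
_SHARD = re.compile(r"^(?P<prefix>.+_)(?P<date>\d{8})$")


def collapse_shards(table_names):
    """Map each concept ID to its member tables, merging shards by prefix."""
    concepts = OrderedDict()
    for name in table_names:
        match = _SHARD.match(name)
        # Sharded tables share one concept keyed by prefix; others keep their name.
        key = match.group("prefix") if match else name
        concepts.setdefault(f"tables/{key}", []).append(name)
    return concepts


concepts = collapse_shards(["events_20240101", "events_20240102", "orders"])
print(list(concepts))
```

Under these assumptions the two `events_` shards share a single `tables/events_` concept, which matches the `--concept tables/events_` scoping form used elsewhere on this page.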

<Steps>
<Step title="Run BQ-then-web enrichment">

```bash
enrichment-agent enrich \
  --source bq \
  --dataset my-project.my_dataset \
  --out ./bundle \
  --web-seed https://cloud.google.com/bigquery/docs \
  --model gemini-2.5-pro
```

Use `--concept tables/events_` to scope a single concept. Pass `--no-web` to skip the web pass.

</Step>
<Step title="Inspect the bundle">

Each concept becomes a markdown file with YAML frontmatter (`type`, `title`, `description`, `timestamp`, optional `resource`). Run `enrichment-agent visualize --bundle ./bundle` to emit `viz.html`.

</Step>
</Steps>

OKF output is designed for version control, agent context loading, and cross-system exchange — not direct Dataplex push. See [Open Knowledge Format](/open-knowledge-format) for bundle semantics.

## mdcode catalog workflow

`agent_runner.py` dispatches three modes. Mode is inferred when `--mode` is empty: `--dataset` implies `table`, otherwise `doc`. `context_overlay` must be set explicitly.
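
The inference rule above is small enough to state as code. This is a minimal sketch of the described behavior, not the dispatcher in `agent_runner.py`; the function name `infer_mode` and the error handling are assumptions.

```python
VALID_MODES = ("table", "doc", "context_overlay")


def infer_mode(mode: str = "", dataset: str = "") -> str:
    """Return the explicit mode, or infer table/doc from the presence of a dataset."""
    if mode:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        return mode
    # context_overlay is never inferred; it has to be passed explicitly.
    return "table" if dataset else "doc"
```

For example, `infer_mode(dataset="my-project.my_dataset")` returns `"table"`, while `infer_mode()` with no arguments returns `"doc"`.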

### Table mode

Table mode discovers BigQuery tables exclusively through kcmd:

1. `kcmd init --bigquery-dataset <project>.<dataset>` + manifest declaring schema, overview, and queries aspects.
2. `kcmd pull` writes `catalog/<project>.<dataset>/<table>.yaml` with live schema.
3. Grounding docs are fetched from `--folders` / `--docs` (Drive or local markdown), summarized, and relevance-routed per table (threshold 0.5).
4. Optional `INFORMATION_SCHEMA` usage signal, doc-extracted SQL, and user-feedback `golden_sql` merge into `<table>.queries.md`.
5. Per-table `<table>.overview.md` sidecars are written; pulled `.dataplex-types.global.overview.md` duplicates are removed to prevent silent overwrite on push.
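
Step 3's relevance routing can be sketched as a threshold filter. The 0.5 threshold comes from the text above; the score dictionary shape, the inclusive comparison, and the name `route_docs` are assumptions made for illustration, and how scores are actually computed is not shown here.

```python
RELEVANCE_THRESHOLD = 0.5  # threshold stated in the table-mode steps


def route_docs(scores, threshold=RELEVANCE_THRESHOLD):
    """Group doc IDs by table, keeping only pairs that meet the threshold.

    scores maps (table, doc_id) -> relevance in [0, 1].
    Docs for each table are returned most relevant first.
    """
    routed = {}
    for (table, doc_id), score in scores.items():
        if score >= threshold:  # assumption: the boundary value is included
            routed.setdefault(table, []).append((score, doc_id))
    return {
        table: [doc_id for _, doc_id in sorted(pairs, key=lambda p: -p[0])]
        for table, pairs in routed.items()
    }
```

A table with no document at or above the threshold simply gets no routed grounding in this sketch.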

<ParamField body="--dataset" type="string" required>
Fully qualified `project.dataset`.
</ParamField>

<ParamField body="--folders" type="list">
Drive folder URLs/IDs and/or local markdown directories for grounding.
</ParamField>

<ParamField body="--glossaries" type="list">
Dataplex glossaries as `project.location.glossaryId`. Enables column→term linking via `LinkingAgent`; `kcmd push` reconciles `entryLinks.definition`.
</ParamField>

<ParamField body="--include_usage" type="bool" default="true">
Fetch BQ query-history patterns into the `queries` aspect. Requires `dataplex.entryGroups.useQueriesAspect` permission on push.
</ParamField>

### Doc mode

Doc mode builds a knowledge-base entry group:

1. `kcmd init --entry-group <project>.<location>.<entryGroupId>` + pull existing KB entries as seed inputs.
2. Recursive depth-weighted crawl of `--docs` (depth 0 spine) and `--folders` (depth 1 children), max depth 2.
3. Map-reduce summarization: per-doc neutral cards (cache-aware via `KC_ENRICH_CACHE_MODE=summary`), then topic-shaped batch reduction.
4. `EnumerationAgent` produces categories and entries; each entry gets deterministic YAML (`dataplex-types.global.generic` aspect) plus `<id>.overview.md` under `catalog/<category>/`.

Pre-existing KC overviews are preserved as writer grounding: the agent extends published content rather than dropping it, unless the new grounding contradicts it.
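
The depth rules in step 2 can be illustrated with a breadth-first traversal. This sketch only records the depth at which each document is first reached; how depth then weights summarization is not modeled. The `links` mapping and the function name `crawl` are hypothetical stand-ins for whatever link extraction the agent performs.

```python
from collections import deque

MAX_DEPTH = 2  # maximum depth stated in the doc-mode steps


def crawl(docs, folder_children, links, max_depth=MAX_DEPTH):
    """Return {doc_id: depth}, keeping the shallowest depth seen for each doc.

    docs are the depth-0 spine, folder_children start at depth 1,
    and links maps a doc ID to the doc IDs it references.
    """
    depths = {}
    queue = deque([(d, 0) for d in docs] + [(c, 1) for c in folder_children])
    while queue:
        doc, depth = queue.popleft()
        if depth > max_depth or doc in depths:
            continue  # too deep, or already reached at a shallower depth
        depths[doc] = depth
        for target in links.get(doc, []):
            queue.append((target, depth + 1))
    return depths
```

Because the queue is seeded in depth order, the first visit to a document is always its shallowest one.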

### Context overlay mode

Context overlay mirrors table mode but separates ownership:

- 1P BigQuery entries arrive read-only via `kcmd reference` as `<table>.ref.yaml` + `<table>.ref.overview.md`.
- A new overlay entry per table is created in `--entry-group` as `<table>.yaml` + `<table>.overview.md`.
- Only overlay pairs are pushable; `.ref.*` mirrors stay read-only.

Use this when you need richer descriptions without modifying live `@bigquery` entries.
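
The ownership split comes down to a filename convention, which the following sketch makes explicit. It assumes that a `.ref.` segment in the file name is what marks a read-only mirror, as in the patterns listed above; the helper names and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def is_reference(path: str) -> bool:
    """True for read-only mirrors such as <table>.ref.yaml."""
    return ".ref." in PurePosixPath(path).name


def pushable(paths):
    """Keep only overlay files, dropping every .ref.* mirror."""
    return [p for p in paths if not is_reference(p)]
```

Given `orders.yaml`, `orders.overview.md`, `orders.ref.yaml`, and `orders.ref.overview.md` in one directory, only the first two would be returned by `pushable`.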

<RequestExample>

```bash
export PYTHONPATH=agents/enrichment/src

python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-project.my_dataset \
  --folders=https://drive.google.com/drive/folders/ABC123 \
  --topic="E-commerce analytics" \
  --project=my-gcp-project \
  --location=global \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</RequestExample>

## Context sources

All catalog-agent modes accept overlapping grounding inputs. Priority matters when sources conflict.

| Source | Flags | Modes | Behavior |
| --- | --- | --- | --- |
| Google Drive / local markdown | `--docs`, `--folders` | table, doc, context_overlay | Doc mode crawls recursively; table mode relevance-routes per table |
| GitHub repository | `--repo`, `--repo_ref`, `--repo_subdir`, `--mcp_config` | all | GitHub MCP explores code; doc mode seeds KB entries from it; table and context_overlay modes add it to the relevance-router pool |
| User feedback | `--feedback_dir`, `--feedback_files` | all | `{proposals: [...]}` JSON; **highest priority**, overrides Drive and usage signals |
| BQ usage history | `--include_usage`, `--usage_window_days`, `--usage_scope` | table, context_overlay | `INFORMATION_SCHEMA.JOBS_BY_*` patterns into `queries` aspect |
| Dataplex glossaries | `--glossaries` | table | Reference pull + column linking into entry YAML |
| Web URLs | `--web-seed`, `--web-seed-file` | OKF only | Bounded crawl with host/path guards |
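
The host/path guards in the Web URLs row can be pictured as a scope check applied to each candidate link. This is a sketch under stated assumptions: the function name `in_scope`, its parameters, and the default depth of 2 are illustrative and do not reflect the actual defaults of `--web-max-depth` or the crawler's real checks.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_hosts, path_prefix="/", depth=0, max_depth=2):
    """Return True if a URL is allowed by host, path prefix, and depth limits."""
    parts = urlparse(url)
    return (
        parts.scheme in ("http", "https")
        and parts.hostname in allowed_hosts
        and parts.path.startswith(path_prefix)
        and depth <= max_depth
    )
```

A page cap such as `--web-max-pages` would sit outside this check, as a counter on the crawl loop itself.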

<Warning>
User-feedback proposals with `golden_sql` emit into the `queries` aspect with `source: USER` and take precedence in sidecar ordering. Feedback in doc mode is prepended globally to every entry writer prompt because proposals target table/column FQNs, not KB entry IDs.
</Warning>
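
The precedence of user-sourced queries can be expressed as a stable sort. Only the `source: USER` label comes from the text above; the other source labels (`USAGE`, `DOC`), the dictionary shape, and the name `order_queries` are hypothetical.

```python
def order_queries(queries):
    """Put USER-sourced queries first, preserving input order otherwise."""
    # sorted() is stable, so entries with the same rank keep their relative order.
    return sorted(queries, key=lambda q: q.get("source") != "USER")


queries = [
    {"sql": "SELECT COUNT(*) FROM orders", "source": "USAGE"},
    {"sql": "SELECT id FROM orders LIMIT 10", "source": "DOC"},
    {"sql": "SELECT SUM(total) FROM orders", "source": "USER"},
]
ordered = order_queries(queries)
```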

## Output artifacts

### OKF bundle layout

:::files
bundle/
├── index.md
├── datasets/
│   └── my_dataset.md
├── tables/
│   └── events_.md
└── references/
    └── some-external-doc.md
:::

Concept documents carry required frontmatter keys and markdown body sections (schema, sample rows, citations). `write_concept_doc` enforces completeness and merges with existing on-disk content via `read_existing_doc`.
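
A completeness check of the kind described might look like the sketch below. It treats `type`, `title`, `description`, and `timestamp` as the required keys, based on the frontmatter list in the OKF workflow section, and it parses flat `key: value` lines by hand to stay dependency-free. A real implementation would likely use a YAML parser, and neither function here is the actual `write_concept_doc` logic.

```python
REQUIRED_KEYS = ("type", "title", "description", "timestamp")


def read_frontmatter(text):
    """Parse flat key: value frontmatter delimited by --- lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("unterminated frontmatter") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def missing_keys(text):
    """List required frontmatter keys that are absent or empty."""
    meta = read_frontmatter(text)
    return [key for key in REQUIRED_KEYS if not meta.get(key)]
```

An empty result from `missing_keys` means the document carries every required key; the optional `resource` key is deliberately not checked.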

### mdcode workspace layout

Table mode writes under the bq-dataset scope:

:::files
output_dir/
├── catalog.yaml
└── catalog/
    └── my-project.my_dataset/
        ├── orders.yaml
        ├── orders.overview.md
        └── orders.queries.md
:::

Doc mode nests by enumeration category:

:::files
output_dir/
├── catalog.yaml
└── catalog/
    └── customer-360/
        ├── orders-entry.yaml
        └── orders-entry.overview.md
:::

Context overlay adds `.ref.*` mirrors alongside overlay pairs. Trajectory files (`trajectory.json`) and `refine_session.json` support evaluation and interactive refinement.

## Refinement before publication

After the initial run, refine without re-reading sources:

<ParamField body="--interactive" type="bool">
Stay in a `refine>` REPL reusing loaded `EnrichmentSession` context.
</ParamField>

<ParamField body="--refine_instruction" type="string">
Apply one refinement turn from saved `refine_session.json`, then exit. Used by webapp persist+re-invoke flows.
</ParamField>

Refinement operations are `rewrite` (regenerate selected overviews) and `answer` (Q&A without file changes). Table-mode re-enumeration recategorizes only — entries are pinned 1:1 to dataset tables. Doc-mode re-enumeration can add, remove, or move entries.

## Publication handoff

Enrichment agents stop at local artifacts. Publishing is explicit:

<Steps>
<Step title="Review local output">

```bash
cd /tmp/enrich_out
kcmd status          # see pending aspect changes
git diff catalog/    # or diff against a prior pull
```

</Step>
<Step title="Push to Knowledge Catalog">

```bash
kcmd push --validate-only    # schema check only
kcmd push --dry-run          # validate without writing
kcmd push                    # upload publishing.aspects from catalog.yaml
```

Set `CLOUDSDK_CORE_PROJECT` and authenticate via `gcloud auth application-default login`. The `publishing` section in `catalog.yaml` controls which aspects and entry links reconcile — reference layers (`.ref.*` in overlay mode) are never pushed.

</Step>
<Step title="Verify in catalog">

Confirm overview and queries aspects appear on target entries. If `queries` push fails with 403, check `dataplex.entryGroups.useQueriesAspect` permission. If overview push reports success but content is unchanged, verify no duplicate `.dataplex-types.global.overview.md` sidecar overwrote your `.overview.md` file.

</Step>
</Steps>
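
The duplicate-sidecar check from the verification step can be automated before pushing. The sketch below assumes the pulled duplicate is named `<entry>.dataplex-types.global.overview.md` and sits next to `<entry>.overview.md`; that naming is inferred from the suffix mentioned above, and `find_duplicate_overviews` is a hypothetical helper, not part of kcmd.

```python
from pathlib import Path

DUPLICATE_SUFFIX = ".dataplex-types.global.overview.md"


def find_duplicate_overviews(catalog_dir):
    """Return pulled overview sidecars that coexist with an agent-written one."""
    duplicates = []
    for path in sorted(Path(catalog_dir).rglob(f"*{DUPLICATE_SUFFIX}")):
        entry = path.name[: -len(DUPLICATE_SUFFIX)]
        # Only flag the pulled file when the agent's sidecar is also present.
        if (path.parent / f"{entry}.overview.md").exists():
            duplicates.append(path)
    return duplicates
```

Running it against the `catalog/` directory and getting an empty list suggests no pulled sidecar is positioned to shadow an enriched overview.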

The Python sample (`samples/enrichment`) follows the same handoff with `python3 -m enrichment.publish --dir <output>` as an alternative to `kcmd push` for demonstration datasets.

<Tip>
Run `python -m eval --output-dir <path>` against generated mdcode to score structural validity, hallucination risk, and cross-run consistency before pushing to production entry groups.
</Tip>

## Toolbox customizable agent

`toolbox/enrichment` packages `kcagent`, a TypeScript agent that enriches an existing kcmd workspace with custom MCP tools and skills:

```bash
kcmd init --bigquery-dataset <project>.<dataset>
kcmd pull
kcagent enrich --catalog-path . --tools-path tools --prompt-path prompt.md
kcmd push
```

Configure `tools/mcp.json` (for example `md-fileset` for local markdown corpora) and `tools/skills/*/SKILL.md` to describe tool usage. This path suits organizations that need custom source connectors without modifying the Python catalog agent.

## Mode selection

| Goal | Recommended path |
| --- | --- |
| Portable, git-friendly knowledge exchange | OKF `enrichment-agent enrich` |
| Enrich live BigQuery table overviews in-place | Catalog agent `table` mode + `kcmd push` |
| Build a knowledge base from Google Docs | Catalog agent `doc` mode + `kcmd push` |
| Richer docs without touching `@bigquery` entries | Catalog agent `context_overlay` mode |
| Custom MCP tools and prompt-driven enrichment | `toolbox/kcagent enrich` |
| Learn the API publish flow | `samples/enrichment` download → enrich → publish |

## Related pages

<CardGroup>
<Card title="Produce OKF bundles" href="/produce-okf-bundles">
Run the OKF enrichment agent with BigQuery sources, web crawl seeds, and concept scoping.
</Card>
<Card title="Run catalog enrichment agent" href="/run-catalog-enrichment-agent">
Execute table, doc, or context_overlay modes with Drive, GitHub, feedback, and glossary inputs.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces with kcmd and reconcile entry links without modifying reference layers.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize kcmd workspaces, pull snapshots, and understand the mdcode layout agents write into.
</Card>
<Card title="Evaluate enrichment output" href="/evaluate-enrichment-output">
Score runs with dynamic golden-free metrics or golden-based evaluation before publication.
</Card>
<Card title="Toolbox enrichment demo" href="/toolbox-enrichment-demo">
End-to-end TypeScript demo with kcmd, kcagent, and md-fileset MCP tools.
</Card>
</CardGroup>
