# Run the catalog enrichment agent

> Execute table, doc, or context_overlay modes with Drive, local markdown, GitHub, feedback, glossary, and usage-signal inputs; refine output interactively before publishing.

- Repository: GoogleCloudPlatform/knowledge-catalog
- GitHub: https://github.com/GoogleCloudPlatform/knowledge-catalog
- Human docs: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5
- Complete Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms-full.txt

## Source Files

- `agents/enrichment/README.md`
- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/src/engine.py`
- `agents/enrichment/src/modes/context_overlay_mode.py`
- `agents/enrichment/src/tools/kcmd_tools.py`
- `agents/enrichment/src/refine.py`
- `agents/enrichment/src/tools/github_tools.py`

---


The catalog enrichment agent generates **Metadata as Code** (mdcode) for Google Cloud Knowledge Catalog (Dataplex). It reads source material—Google Drive documents, local Markdown, BigQuery metadata, optional GitHub repositories, user-feedback proposals, and query-usage signals—and writes enriched YAML and Markdown artifacts under a local output directory. The agent talks to the catalog **only through `kcmd`** (read-only `init`, `pull`, and `reference`); you publish with `kcmd push`.

Entry point: `agents/enrichment/src/agent_runner.py`.

## Choose a mode

Three enrichment flows are available. Mode is selected with `--mode` or inferred when omitted (`--dataset` implies `table`; otherwise `doc`). `context_overlay` is never inferred—you must pass it explicitly.

| Mode | Target | What it produces |
|------|--------|------------------|
| `table` | BigQuery dataset (`--dataset`) | Enriched overviews and `queries` aspects on live `@bigquery` table entries |
| `doc` | Entry group (`--entry_group`) | Knowledge-base entries from crawled docs (map-reduce → enumerate → write) |
| `context_overlay` | Dataset + entry group | New overlay entries per table in an editable group; 1P tables pulled read-only via `kcmd reference` |

<AccordionGroup>
<Accordion title="Table mode — enrich live BigQuery entries">

`kcmd init --bigquery-dataset` and `kcmd pull` scaffold the workspace. The agent routes Drive or local Markdown documents to each table, writes enriched `<table>.overview.md` sidecars, and optionally emits `<table>.queries.md` from `INFORMATION_SCHEMA` query history plus SQL extracted from routed docs. With `--glossaries`, columns are mapped to Dataplex glossary terms and a field-level `links.definition` is injected for each mapped column.

</Accordion>
<Accordion title="Doc mode — build a knowledge base from documents">

Crawls Google Docs (and optional Drive folders or local Markdown directories), map-reduces them through a topic lens, enumerates canonical entries, and fans out per-entry overview writers. Requires `--entry_group` to already exist—the agent does not create entry groups.

</Accordion>
<Accordion title="Context-overlay mode — enrich without touching live tables">

Like table mode for routing and writing, but 1P BigQuery entries are pulled read-only via `kcmd reference` as `<table>.ref.yaml` mirrors. One new generic overlay entry per table is created in your editable `--entry_group`. The `queries` aspect attaches to the overlay, not the live table.

</Accordion>
</AccordionGroup>
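
The inference rule described at the top of this section can be sketched in a few lines. This is an illustration only, not the code in `agent_runner.py`; the function name `resolve_mode` and the `VALID_MODES` tuple are hypothetical.

```python
from typing import Optional

# The three modes named in this guide.
VALID_MODES = ("table", "doc", "context_overlay")


def resolve_mode(mode: Optional[str], dataset: Optional[str]) -> str:
    """Return the explicit --mode, or infer one from the presence of --dataset."""
    if mode:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        return mode
    # context_overlay is never inferred, so only table and doc are possible here.
    return "table" if dataset else "doc"
```

Under this sketch, passing `--dataset` together with `--entry_group` but no `--mode` still resolves to `table`, which is why overlay runs need `--mode=context_overlay` spelled out.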

## Prerequisites

<Steps>
<Step title="Build kcmd">

```bash
cd agents/mdcode
npm install
npm run build   # -> agents/mdcode/dist/kcmd
```

The agent resolves `kcmd` automatically at `agents/mdcode/dist/kcmd` (override with `$KCMD_BIN`). Add `dist` to `PATH` only if you plan to run `kcmd push` yourself.
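
The lookup order above (the `$KCMD_BIN` override first, then the built binary) might look roughly like the following. The helper name `resolve_kcmd` and the `repo_root` parameter are assumptions for illustration; the agent's own resolution logic may differ.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_kcmd(repo_root: str, env: Mapping[str, str] = os.environ) -> Path:
    """Prefer $KCMD_BIN when set, otherwise the binary built under agents/mdcode/dist."""
    override = env.get("KCMD_BIN")
    if override:
        candidate = Path(override)
    else:
        candidate = Path(repo_root) / "agents" / "mdcode" / "dist" / "kcmd"
    if not candidate.is_file():
        raise FileNotFoundError(
            f"kcmd not found at {candidate}; run `npm run build` in agents/mdcode or set KCMD_BIN"
        )
    return candidate
```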

</Step>
<Step title="Install Python dependencies">

```bash
python3 -m venv ~/.venv/kc-enrich
source ~/.venv/kc-enrich/bin/activate
pip install -r agents/enrichment/src/requirements.txt
```

`google-cloud-bigquery` powers usage signals; `mcp` is needed only for a local stdio GitHub MCP server (the default hosted remote works without it).

</Step>
<Step title="Authenticate">

```bash
gcloud auth application-default login \
  --scopes='openid,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive.readonly'
```

Vertex AI project, location, and model are supplied per run via flags—nothing is hardcoded.

</Step>
</Steps>

## Required flags

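The following flags apply to every invocation regardless of mode. `--project`, `--model`, and `--output_dir` are required; `--location` and `--topic` fall back to their defaults when omitted: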

<ParamField body="--project" type="string" required>
Google Cloud project hosting the Vertex AI model.
</ParamField>

<ParamField body="--model" type="string" required>
Vertex AI model id for reasoning-heavy steps (e.g. `gemini-2.5-pro`). High-volume structured steps use `KC_LIGHT_MODEL` when set, otherwise the main model.
</ParamField>
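
As a rough sketch of the model selection just described, high-volume steps pick `KC_LIGHT_MODEL` when it is set and non-empty, and everything else uses `--model`. The function name `pick_model` and the `high_volume` flag are hypothetical; how the agent actually classifies a step is not documented here.

```python
import os
from typing import Mapping


def pick_model(main_model: str, high_volume: bool, env: Mapping[str, str] = os.environ) -> str:
    """Use KC_LIGHT_MODEL for high-volume structured steps when it is set."""
    light = env.get("KC_LIGHT_MODEL", "").strip()
    if high_volume and light:
        return light
    # Reasoning-heavy steps, or no light model configured: use --model.
    return main_model
```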

<ParamField body="--output_dir" type="string" required>
Local directory for the generated mdcode tree, `trajectory.json`, and `refine_session.json`.
</ParamField>

<ParamField body="--location" type="string" default="global">
Vertex AI location (e.g. `us-central1`).
</ParamField>

<ParamField body="--topic" type="string" default="Metadata enrichment">
Free-text use case that steers enrichment and doc-mode topic reduction.
</ParamField>

Mode-specific requirements:

| Flag | `doc` | `table` | `context_overlay` |
|------|:-----:|:-------:|:-----------------:|
| `--dataset` | — | required | required |
| `--entry_group` | required | — | required |
| `--folders` | optional | optional | optional |
| `--docs` | optional | — | optional |
| `--tables` | — | — | optional |
| `--include_usage` | — | optional (default `true`) | optional (default `true`) |
| `--glossaries` | — | optional | — |
| `--feedback_dir` / `--feedback_files` | optional | optional | optional |
| `--repo` / `--repo_ref` / `--repo_subdir` | optional | optional | optional |
| `--interactive` / `--refine_instruction` | optional | optional | optional |

See [Enrichment agent flags reference](/enrichment-agent-flags) for the full flag matrix.
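
A pre-flight check for a full enrichment run can be derived directly from the required cells of the matrix above. The sketch below is illustrative: `missing_flags` and the two lookup tables are hypothetical names, and the agent performs its own validation.

```python
from typing import Dict, List, Optional

# Required for every enrichment run, per the parameter list above.
ALWAYS_REQUIRED = ("project", "model", "output_dir")

# Mode-specific required flags, copied from the matrix.
MODE_REQUIRED = {
    "doc": ("entry_group",),
    "table": ("dataset",),
    "context_overlay": ("dataset", "entry_group"),
}


def missing_flags(mode: str, flags: Dict[str, Optional[str]]) -> List[str]:
    """Return the required flags that are absent or empty for the given mode."""
    required = ALWAYS_REQUIRED + MODE_REQUIRED[mode]
    return [f"--{name}" for name in required if not flags.get(name)]
```

For example, an overlay run that passes `--dataset` but forgets `--entry_group` would report `["--entry_group"]` before any model call is made.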

## Configure source inputs

### Google Drive and local Markdown

`--folders` and `--docs` accept a comma-separated mixed list. Each entry is classified format-first:

1. `http://` / `https://` → Google Drive (Doc or folder URL)
2. Ends in `.md` / `.markdown` → local Markdown file
3. Path-shaped (`/abs`, `./rel`, `~/path`, or contains `/`) → local directory (read recursively) or file
4. Bare name that exists on disk → local
5. Otherwise → Google Drive ID

In **doc mode**, a local `.md` in `--docs` is a depth-0 spine doc; a directory contributes depth-1 children. In **table** and **context_overlay** modes, local files join the relevance-router candidate pool alongside Drive documents.
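
The five classification rules can be read as an ordered chain of checks. The following sketch mirrors that order; the function names, the returned labels, and the case-insensitive matching of the scheme and extension are assumptions made for illustration, not the agent's actual implementation.

```python
import os
from typing import Callable, List, Tuple


def classify_source(entry: str, exists: Callable[[str], bool] = os.path.exists) -> str:
    """Classify one --folders/--docs entry using the format-first rules above."""
    item = entry.strip()
    lowered = item.lower()
    if lowered.startswith(("http://", "https://")):
        return "drive_url"          # rule 1: Google Drive Doc or folder URL
    if lowered.endswith((".md", ".markdown")):
        return "local_markdown"     # rule 2: local Markdown file
    if item.startswith(("/", "./", "~/")) or "/" in item:
        return "local_path"         # rule 3: path-shaped, directory or file
    if exists(os.path.expanduser(item)):
        return "local_path"         # rule 4: bare name that exists on disk
    return "drive_id"               # rule 5: anything else is a Drive ID


def classify_sources(value: str, exists: Callable[[str], bool] = os.path.exists) -> List[Tuple[str, str]]:
    """Split a comma-separated flag value and classify each non-empty entry."""
    return [(part.strip(), classify_source(part, exists)) for part in value.split(",") if part.strip()]
```

Because rule 4 depends on the working directory, a bare name such as `corpus` can resolve to a local path on one machine and a Drive ID on another; a `./` prefix removes the ambiguity.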

### BigQuery usage signal

For `table` and `context_overlay` modes, `--include_usage` (default `true`) fetches `INFORMATION_SCHEMA` query history and emits `<table>.queries.md` sidecars conforming to the Dataplex `queries` aspect.

<ParamField body="--usage_window_days" type="integer" default="30">
Days of query history to aggregate.
</ParamField>

<ParamField body="--usage_scope" type="enum" default="auto">
`auto` tries `JOBS_BY_PROJECT` then falls back to `JOBS_BY_USER`; `project` requires project-wide access; `user` reads only the caller's queries.
</ParamField>
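
The `auto` fallback can be pictured as trying each view in order and moving on when access is denied. In the sketch below, `fetch_usage` and `run_query` are hypothetical, and `PermissionError` stands in for whatever access-denied error the BigQuery client raises in practice.

```python
from typing import Callable, List, Tuple

# Views tried for each --usage_scope value, in order.
VIEWS_BY_SCOPE = {
    "auto": ("JOBS_BY_PROJECT", "JOBS_BY_USER"),
    "project": ("JOBS_BY_PROJECT",),
    "user": ("JOBS_BY_USER",),
}


def fetch_usage(scope: str, run_query: Callable[[str], List[dict]]) -> Tuple[str, List[dict]]:
    """Return (view_used, rows), falling back to the next view on access errors."""
    last_error = None
    for view in VIEWS_BY_SCOPE[scope]:
        try:
            return view, run_query(view)
        except PermissionError as exc:
            last_error = exc  # remember the failure and try the next view, if any
    raise last_error
```

With `scope="project"` there is no second view to try, so an access error surfaces immediately rather than silently narrowing the signal to the caller's own queries.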

<ParamField body="--anonymize_users" type="boolean" default="false">
Replace user emails with stable SHA hashes in the usage signal.
</ParamField>
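
A stable hash means the same email always maps to the same token, so per-user query patterns stay distinguishable without exposing addresses. The sketch below assumes SHA-256, lowercase normalization, a `user_` prefix, and a 12-character truncation; none of these details are confirmed by this guide.

```python
import hashlib


def anonymize_email(email: str, length: int = 12) -> str:
    """Map an email to a stable, non-reversible token (assumed scheme: SHA-256)."""
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"user_{digest[:length]}"
```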

### Glossary column linking (table mode only)

<ParamField body="--glossaries" type="string">
Comma-separated Dataplex glossaries as `project.location.glossaryId`. Maps BigQuery columns to glossary terms and injects field-level `links.definition` into entry YAML.
</ParamField>

### User-feedback proposals (all modes)

<ParamField body="--feedback_dir" type="string">
Directory of feedback files (`.md`/`.json`) walked recursively. Each file holds JSON shaped `{"proposals": [...]}`.
</ParamField>

<ParamField body="--feedback_files" type="string">
Explicit comma-separated feedback file paths; combinable with `--feedback_dir`.
</ParamField>

Feedback is the **highest-priority context source**—proposals override conflicting information from Drive docs, semantic search, or `INFORMATION_SCHEMA`-derived patterns. In table and overlay modes, proposals route per-table by `target_asset.name` FQN; `eval_candidate.golden_sql` from valid proposals becomes a `[Source: User Feedback]` entry in the `queries` aspect.
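
To make the file shape concrete, the sketch below walks a feedback directory, reads every `.md` or `.json` file as JSON shaped `{"proposals": [...]}`, and groups `eval_candidate.golden_sql` by `target_asset.name`. The function names are hypothetical, and the agent's own validity checks on proposals are not reproduced; here a proposal is routed whenever both fields are present.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def load_proposals(feedback_dir: str) -> List[dict]:
    """Recursively collect proposals from .md/.json feedback files."""
    proposals: List[dict] = []
    for path in sorted(Path(feedback_dir).rglob("*")):
        if path.is_file() and path.suffix in (".md", ".json"):
            payload = json.loads(path.read_text(encoding="utf-8"))
            proposals.extend(payload.get("proposals", []))
    return proposals


def golden_sql_by_table(proposals: List[dict]) -> Dict[str, List[str]]:
    """Group golden SQL by the fully qualified table name it targets."""
    routed = defaultdict(list)
    for proposal in proposals:
        target = (proposal.get("target_asset") or {}).get("name")
        sql = (proposal.get("eval_candidate") or {}).get("golden_sql")
        if target and sql:
            routed[target].append(sql)
    return dict(routed)
```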

### GitHub source code (all modes)

<ParamField body="--repo" type="string">
GitHub repo as `owner/name` or URL. A code-understanding agent explores the repo via the GitHub MCP server and distills code component cards.
</ParamField>

<ParamField body="--repo_ref" type="string">
Branch, tag, or SHA (default: repo default branch).
</ParamField>

<ParamField body="--repo_subdir" type="string">
Path prefix to scope exploration (e.g. `src/server`).
</ParamField>

<ParamField body="--mcp_config" type="string">
Path to `mcp.json` describing the GitHub MCP server. Falls back to `KC_ENRICH_MCP_CONFIG`, then the hosted remote server. Select server entry with `KC_ENRICH_GITHUB_MCP_SERVER` (default `github_remote`).
</ParamField>

Set a Personal Access Token before running:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_...
```

<CodeGroup>

```json title="mcp.json — remote and local servers"
{
  "mcpServers": {
    "github_remote": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {"Authorization": "Bearer ${GITHUB_PERSONAL_ACCESS_TOKEN}"}
    },
    "github": {
      "type": "stdio",
      "command": "github-mcp-server",
      "args": ["stdio"],
      "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PERSONAL_ACCESS_TOKEN}"}
    }
  }
}
```

</CodeGroup>
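
The resolution order for the MCP configuration (flag, then `KC_ENRICH_MCP_CONFIG`, then the hosted remote) and the `${GITHUB_PERSONAL_ACCESS_TOKEN}` placeholders can be sketched as follows. The helper names are hypothetical, and the built-in default is assumed to match the `github_remote` entry shown above.

```python
import json
import os
import string
from typing import Any, Mapping, Optional

# Assumption: the hosted fallback is equivalent to the github_remote entry above.
DEFAULT_CONFIG = {
    "mcpServers": {
        "github_remote": {
            "type": "http",
            "url": "https://api.githubcopilot.com/mcp/",
            "headers": {"Authorization": "Bearer ${GITHUB_PERSONAL_ACCESS_TOKEN}"},
        }
    }
}


def _expand(value: Any, env: Mapping[str, str]) -> Any:
    """Substitute ${VAR} placeholders in every string of a nested structure."""
    if isinstance(value, str):
        return string.Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {key: _expand(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [_expand(item, env) for item in value]
    return value


def resolve_github_server(mcp_config: Optional[str], env: Mapping[str, str] = os.environ) -> dict:
    """Pick the config file, then the server entry, then expand placeholders."""
    path = mcp_config or env.get("KC_ENRICH_MCP_CONFIG")
    if path:
        with open(path, encoding="utf-8") as handle:
            config = json.load(handle)
    else:
        config = DEFAULT_CONFIG
    name = env.get("KC_ENRICH_GITHUB_MCP_SERVER", "github_remote")
    return _expand(config["mcpServers"][name], env)
```

`safe_substitute` leaves unknown placeholders untouched, so a missing token shows up as a literal `${GITHUB_PERSONAL_ACCESS_TOKEN}` in the header instead of raising.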

In **doc mode**, distinct components surface as their own knowledge-base entries. In **table** and **context_overlay** modes, cards join the relevance router's candidate pool so code that reads or writes a table grounds that table's overview and queries aspect.

## Run the agent

<Steps>
<Step title="Set PYTHONPATH">

```bash
export PYTHONPATH=agents/enrichment/src
```

</Step>
<Step title="Run a mode">

<Tabs>
<Tab title="Table">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=https://drive.google.com/drive/folders/<id>,./local_md_corpus \
  --topic="Customer 360 data" \
  --project=my-gcp-project \
  --location=us-central1 \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</Tab>
<Tab title="Doc">

Create the entry group first:

```bash
gcloud dataplex entry-groups create myEntryGroup \
  --project=my-gcp-project --location=us-central1
```

Then run:

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=doc \
  --docs="https://docs.google.com/document/d/<id>,./notes/data_model.md" \
  --folders=<drive_folder_id_or_url> \
  --topic="Order pipeline documentation" \
  --entry_group=my-gcp-project.us-central1.myEntryGroup \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</Tab>
<Tab title="Context overlay">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=context_overlay \
  --dataset=my-proj.analytics \
  --entry_group=my-gcp-project.us-central1.overlayGroup \
  --folders=<drive_folder_id_or_url> \
  --tables=orders,customers \
  --topic="Enriched table context" \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</Tab>
</Tabs>

</Step>
<Step title="Verify output">

```bash
find /tmp/enrich_out -type f
```

Expected artifacts:

:::files
/tmp/enrich_out/
├── catalog.yaml          # kcmd manifest (written by agent via kcmd init)
├── catalog/              # per-entry YAML + sidecar Markdown
├── trajectory.json       # tool-call log of what the agent read and produced
└── refine_session.json   # saved session for refinement re-invocation
:::

In **context_overlay** mode, the dataset directory also holds read-only mirrors next to each table's overlay files:

```
catalog/bigquery/<project>/<dataset>/
├── orders.ref.yaml           # read-only 1P entry (kcmd reference)
├── orders.ref.overview.md    # existing 1P overview, if any
├── orders.yaml               # pushable overlay entry
├── orders.overview.md        # enriched overview
└── orders.queries.md         # queries aspect sidecar
```
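
A quick sanity check on the overlay layout is to confirm that every `<table>.ref.yaml` mirror has its pushable counterparts. The sketch below treats `<table>.yaml` and `<table>.overview.md` as expected and the other sidecars as optional; that split and the helper name `check_overlay_dir` are assumptions based on the listing above.

```python
from pathlib import Path
from typing import Dict, List

REF_SUFFIX = ".ref.yaml"
# Files assumed to accompany every mirrored table in an overlay run.
EXPECTED_SUFFIXES = (".yaml", ".overview.md")


def check_overlay_dir(dataset_dir: str) -> Dict[str, List[str]]:
    """Map each mirrored table to the expected overlay files that are missing."""
    root = Path(dataset_dir)
    tables = sorted(p.name[: -len(REF_SUFFIX)] for p in root.glob(f"*{REF_SUFFIX}"))
    report: Dict[str, List[str]] = {}
    for table in tables:
        report[table] = [
            f"{table}{suffix}"
            for suffix in EXPECTED_SUFFIXES
            if not (root / f"{table}{suffix}").is_file()
        ]
    return report
```

An empty list for every table means each read-only mirror has an overlay entry and an enriched overview ready for review.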

</Step>
</Steps>

## Refine output interactively

After the initial run, refine without re-reading source docs or re-pulling the dataset. Each entry stores its grounding prompt in `refine_session.json`, so refinement reuses loaded context.

<Tabs>
<Tab title="Interactive REPL">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=./local_md_corpus \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out \
  --interactive
```

At the `refine>` prompt you can rewrite overviews, add sections, re-enumerate entries, or ask questions. Built-in commands are `:entries`, `:show <id>`, and `:quit`. On a non-TTY, `--interactive` is a no-op.

</Tab>
<Tab title="Single refinement turn">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --refine_instruction="make the orders overview more concise" \
  --output_dir=/tmp/enrich_out \
  --project=my-gcp-project \
  --model=gemini-2.5-pro
```

Requires a prior run's `refine_session.json`. Skips the enrichment pipeline entirely.
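
When scripting single-turn refinement, it can help to check for the saved session before launching the agent. The sketch below builds the same command line as the example above and fails early if `refine_session.json` is missing; the helper name `build_refine_command` is hypothetical.

```python
import sys
from pathlib import Path
from typing import List


def build_refine_command(output_dir: str, instruction: str, project: str, model: str) -> List[str]:
    """Return the argv for a single refinement turn, or raise if no session exists."""
    session = Path(output_dir) / "refine_session.json"
    if not session.is_file():
        raise FileNotFoundError(f"{session} not found; run the agent once before refining")
    return [
        sys.executable,
        "agents/enrichment/src/agent_runner.py",
        f"--refine_instruction={instruction}",
        f"--output_dir={output_dir}",
        f"--project={project}",
        f"--model={model}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root with `PYTHONPATH` set as in the run steps.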

</Tab>
</Tabs>

Refinement operations:

| Operation | Effect |
|-----------|--------|
| `rewrite` | Re-generate one or more entry overviews with a change |
| `reenumerate` | Add, remove, split, merge, or recategorize entries. Fully supported in doc mode; table and overlay modes allow recategorization only, because entries are pinned to dataset tables |
| `answer` | Respond to a question about the output; no files change |
| `noop` | Ask for clarification when the request is ambiguous |

## Publish enriched metadata

The agent only generates mdcode. Pushing to Dataplex is a separate step that you run yourself:

```bash
cd /tmp/enrich_out
CLOUDSDK_CORE_PROJECT=<project> CLOUDSDK_COMPUTE_REGION=<region> kcmd push
```

See [Publish enriched metadata](/publish-enriched-metadata) for push options, entry-link reconciliation, and reference-layer constraints.

## Evaluate before publishing

Score a run with the golden-free evaluator (no reference answers required):

```bash
cd agents/enrichment
pip install -r eval/requirements.txt
python -m eval --output-dir /tmp/enrich_out
```

The evaluator writes `eval_report.md` next to `trajectory.json`. See [Evaluate enrichment output](/evaluate-enrichment-output).

## Troubleshooting

| Symptom | Likely cause | What to check |
|---------|--------------|---------------|
| `kcmd not found` | Binary not built | `cd agents/mdcode && npm run build` or set `$KCMD_BIN` |
| `--entry_group is required` | Missing flag in doc/overlay mode | Pass `project.location.entryGroupId`; create the group with `gcloud dataplex entry-groups create` first |
| No reference tables pulled | Dataset or permissions | Verify `--dataset` and read access to `@bigquery` entries |
| GitHub code context empty | MCP auth or scope | Confirm `GITHUB_PERSONAL_ACCESS_TOKEN`; check `[Code]` log lines for tool-call counts |
| `queries` push 403 | Missing permission | Caller needs `dataplex.entryGroups.useQueriesAspect`; overview still publishes |
| Refinement skipped | Non-interactive shell (`--interactive` is a no-op without a TTY) | Use `--refine_instruction` for a single-turn, webapp-style refinement |

More signals in [Troubleshooting](/troubleshooting).

## Related pages

<CardGroup cols={2}>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node.js setup, and credential configuration.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How agents read metadata, ground on external sources, and hand off to kcmd push.
</Card>
<Card title="Enrichment agent flags" href="/enrichment-agent-flags">
Complete `agent_runner.py` flag reference by mode.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces and reconcile entry links.
</Card>
<Card title="Evaluate enrichment output" href="/evaluate-enrichment-output">
Golden-free and golden-based scoring of enrichment runs.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize kcmd workspaces and pull catalog snapshots.
</Card>
</CardGroup>
