# Run the catalog enrichment agent

> Execute table, doc, or context_overlay modes with Drive, local markdown, GitHub, feedback, glossary, and usage-signal inputs; refine output interactively before publishing.

- Repository: GoogleCloudPlatform/knowledge-catalog
- GitHub: https://github.com/GoogleCloudPlatform/knowledge-catalog
- Human docs: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5
- Complete Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms-full.txt

## Source Files

- `agents/enrichment/README.md`
- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/src/engine.py`
- `agents/enrichment/src/modes/context_overlay_mode.py`
- `agents/enrichment/src/tools/kcmd_tools.py`
- `agents/enrichment/src/refine.py`
- `agents/enrichment/src/tools/github_tools.py`

---

The catalog enrichment agent generates **Metadata as Code** (mdcode) for Google Cloud Knowledge Catalog (Dataplex). It reads source material (Google Drive documents, local Markdown, BigQuery metadata, optional GitHub repositories, user-feedback proposals, and query-usage signals) and writes enriched YAML and Markdown artifacts under a local output directory. The agent talks to the catalog **only through `kcmd`** (read-only `init`, `pull`, and `reference`); you publish with `kcmd push`.

Entry point: `agents/enrichment/src/agent_runner.py`.

## Choose a mode

Three enrichment flows are available. Select the mode with `--mode`, or omit it and let the agent infer one: `--dataset` implies `table`; otherwise the mode is `doc`. `context_overlay` is never inferred, so you must pass it explicitly.

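The inference rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the code in `agent_runner.py`; the function name `infer_mode` and the `VALID_MODES` set are hypothetical.

```python
from typing import Optional

# The three modes named in this page.
VALID_MODES = {"table", "doc", "context_overlay"}


def infer_mode(mode: Optional[str], dataset: Optional[str]) -> str:
    """Hypothetical helper: resolve the mode the way this page describes."""
    if mode:
        # An explicit --mode always wins; this is the only way to get context_overlay.
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        return mode
    # No --mode given: --dataset implies table, anything else falls back to doc.
    return "table" if dataset else "doc"


print(infer_mode(None, "my-proj.analytics"))               # table
print(infer_mode(None, None))                              # doc
print(infer_mode("context_overlay", "my-proj.analytics"))  # context_overlay
```

Note that passing both `--dataset` and `--entry_group` without `--mode` still resolves to `table` under this rule, which is why overlay runs need the explicit flag.
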
| Mode | Target | What it produces |
|------|--------|------------------|
| `table` | BigQuery dataset (`--dataset`) | Enriched overviews and `queries` aspects on live `@bigquery` table entries |
| `doc` | Entry group (`--entry_group`) | Knowledge-base entries from crawled docs (map-reduce → enumerate → write) |
| `context_overlay` | Dataset + entry group | New overlay entries per table in an editable group; first-party (1P) tables pulled read-only via `kcmd reference` |

**`table` mode.** `kcmd init --bigquery-dataset` and `kcmd pull` scaffold the workspace. The agent routes Drive or local Markdown documents to each table, writes enriched `.overview.md` sidecars, and optionally emits a per-table `.queries.md` sidecar from `INFORMATION_SCHEMA` query history plus SQL extracted from routed docs. With `--glossaries`, columns are mapped to Dataplex glossary terms and field-level `links.definition` values are injected.

**`doc` mode.** Crawls Google Docs (and optional Drive folders or local Markdown directories), map-reduces them through a topic lens, enumerates canonical entries, and fans out per-entry overview writers. The `--entry_group` must already exist; the agent does not create entry groups.

**`context_overlay` mode.** Works like table mode for routing and writing, but 1P BigQuery entries are pulled read-only via `kcmd reference` as per-table `.ref.yaml` mirrors. One new generic overlay entry per table is created in your editable `--entry_group`. The `queries` aspect attaches to the overlay, not the live table.

## Prerequisites

Build `kcmd`:

```bash
cd agents/mdcode
npm install
npm run build   # -> agents/mdcode/dist/kcmd
```

The agent resolves `kcmd` automatically at `agents/mdcode/dist/kcmd` (override with `$KCMD_BIN`). Add `dist` to `PATH` only if you plan to run `kcmd push` yourself.

Create a Python virtual environment and install the dependencies:

```bash
python3 -m venv ~/.venv/kc-enrich
source ~/.venv/kc-enrich/bin/activate
pip install -r agents/enrichment/src/requirements.txt
```

`google-cloud-bigquery` powers usage signals; `mcp` is needed only for a local stdio GitHub MCP server (the default hosted remote works without it).

Authenticate with Application Default Credentials:

```bash
gcloud auth application-default login \
  --scopes='openid,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive.readonly'
```

Vertex AI project, location, and model are supplied per run via flags; nothing is hardcoded.

## Required flags

Every invocation requires these flags regardless of mode:

- `--project`: Google Cloud project hosting the Vertex AI model.
- `--model`: Vertex AI model id for reasoning-heavy steps (e.g. `gemini-2.5-pro`). High-volume structured steps use `KC_LIGHT_MODEL` when set, otherwise the main model.
- `--output_dir`: Local directory for the generated mdcode tree, `trajectory.json`, and `refine_session.json`.
- `--location`: Vertex AI location (e.g. `us-central1`).
- `--topic`: Free-text use case that steers enrichment and doc-mode topic reduction.

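As a rough illustration, a wrapper script could check the always-required flags before launching the agent. The sketch below assumes the flag names used in this page's run examples (`--project`, `--model`, `--output_dir`, `--location`, `--topic`); the helpers `missing_required` and `build_argv` are hypothetical and are not part of the agent, whose own argument parsing may behave differently.

```python
from typing import Dict, List

# Entry point named in this page.
RUNNER = "agents/enrichment/src/agent_runner.py"

# Flags this section lists as required for every run (assumed names, see lead-in).
ALWAYS_REQUIRED = ("project", "model", "output_dir", "location", "topic")


def missing_required(flags: Dict[str, str]) -> List[str]:
    """Return the always-required flags that are absent or empty."""
    return [f"--{name}" for name in ALWAYS_REQUIRED if not flags.get(name)]


def build_argv(flags: Dict[str, str]) -> List[str]:
    """Hypothetical helper: build the command line, failing early on gaps."""
    missing = missing_required(flags)
    if missing:
        raise ValueError("missing required flags: " + ", ".join(missing))
    # Mode-specific flags (e.g. dataset, entry_group) pass through unchanged.
    return ["python3", RUNNER] + [f"--{name}={value}" for name, value in flags.items()]


argv = build_argv({
    "mode": "table",
    "dataset": "my-proj.analytics",
    "project": "my-gcp-project",
    "model": "gemini-2.5-pro",
    "output_dir": "/tmp/enrich_out",
    "location": "us-central1",
    "topic": "Customer 360 data",
})
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once `PYTHONPATH` is set as shown under "Run the agent".
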
Mode-specific requirements:

| Flag | `doc` | `table` | `context_overlay` |
|------|:-----:|:-------:|:-----------------:|
| `--dataset` | n/a | required | required |
| `--entry_group` | required | n/a | required |
| `--folders` | optional | optional | optional |
| `--docs` | optional | n/a | optional |
| `--tables` | n/a | n/a | optional |
| `--include_usage` | n/a | optional (default `true`) | optional (default `true`) |
| `--glossaries` | n/a | optional | n/a |
| `--feedback_dir` / `--feedback_files` | optional | optional | optional |
| `--repo` / `--repo_ref` / `--repo_subdir` | optional | optional | optional |
| `--interactive` / `--refine_instruction` | optional | optional | optional |

See [Enrichment agent flags reference](/enrichment-agent-flags) for the full flag matrix.

## Configure source inputs

### Google Drive and local Markdown

`--folders` and `--docs` accept a comma-separated mixed list. Each entry is classified format-first:

1. `http://` / `https://` → Google Drive (Doc or folder URL)
2. Ends in `.md` / `.markdown` → local Markdown file
3. Path-shaped (`/abs`, `./rel`, `~/path`, or contains `/`) → local directory (read recursively) or file
4. Bare name that exists on disk → local
5. Otherwise → Google Drive ID

In **doc mode**, a local `.md` in `--docs` is a depth-0 spine doc; a directory contributes depth-1 children. In **table** and **context_overlay** modes, local files join the relevance-router candidate pool alongside Drive documents.

### BigQuery usage signal

For `table` and `context_overlay` modes, `--include_usage` (default `true`) fetches `INFORMATION_SCHEMA` query history and emits per-table `.queries.md` sidecars conforming to the Dataplex `queries` aspect.

Related options (see the [Enrichment agent flags reference](/enrichment-agent-flags) for the flag names):

- Days of query history to aggregate.
- Which history view to read: `auto` tries `JOBS_BY_PROJECT` then falls back to `JOBS_BY_USER`; `project` requires project-wide access; `user` reads only the caller's queries.
- Whether to replace user emails with stable SHA hashes in the usage signal.

### Glossary column linking (table mode only)

`--glossaries` takes comma-separated Dataplex glossaries as `project.location.glossaryId`. It maps BigQuery columns to glossary terms and injects field-level `links.definition` into entry YAML.

### User-feedback proposals (all modes)

- `--feedback_dir`: Directory of feedback files (`.md`/`.json`) walked recursively. Each file holds JSON shaped `{"proposals": [...]}`.
- `--feedback_files`: Explicit comma-separated feedback file paths; combinable with `--feedback_dir`.

Feedback is the **highest-priority context source**: proposals override conflicting information from Drive docs, semantic search, or `INFORMATION_SCHEMA`-derived patterns. In table and overlay modes, proposals route per-table by `target_asset.name` FQN; `eval_candidate.golden_sql` from valid proposals becomes a `[Source: User Feedback]` entry in the `queries` aspect.

### GitHub source code (all modes)

- `--repo`: GitHub repo as `owner/name` or URL. A code-understanding agent explores the repo via the GitHub MCP server and distills code component cards.
- `--repo_ref`: Branch, tag, or SHA (default: repo default branch).
- `--repo_subdir`: Path prefix to scope exploration (e.g. `src/server`).
- MCP configuration: Path to `mcp.json` describing the GitHub MCP server. Falls back to `KC_ENRICH_MCP_CONFIG`, then the hosted remote server. Select the server entry with `KC_ENRICH_GITHUB_MCP_SERVER` (default `github_remote`).

Set a Personal Access Token before running:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_...
```

```json title="mcp.json - remote and local servers"
{
  "mcpServers": {
    "github_remote": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {"Authorization": "Bearer ${GITHUB_PERSONAL_ACCESS_TOKEN}"}
    },
    "github": {
      "type": "stdio",
      "command": "github-mcp-server",
      "args": ["stdio"],
      "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PERSONAL_ACCESS_TOKEN}"}
    }
  }
}
```

In **doc mode**, distinct components surface as their own knowledge-base entries. In **table** and **context_overlay** modes, cards join the relevance router's candidate pool so code that reads or writes a table grounds that table's overview and queries aspect.

## Run the agent

Set `PYTHONPATH`:

```bash
export PYTHONPATH=agents/enrichment/src
```

**`table` mode**

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=https://drive.google.com/drive/folders/,./local_md_corpus \
  --topic="Customer 360 data" \
  --project=my-gcp-project \
  --location=us-central1 \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

**`doc` mode**

Create the entry group first:

```bash
gcloud dataplex entry-groups create myEntryGroup \
  --project=my-gcp-project --location=us-central1
```

Then run:

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=doc \
  --docs="https://docs.google.com/document/d/,./notes/data_model.md" \
  --folders= \
  --topic="Order pipeline documentation" \
  --entry_group=my-gcp-project.us-central1.myEntryGroup \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

**`context_overlay` mode**

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=context_overlay \
  --dataset=my-proj.analytics \
  --entry_group=my-gcp-project.us-central1.overlayGroup \
  --folders= \
  --tables=orders,customers \
  --topic="Enriched table context" \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

List the generated files:

```bash
find /tmp/enrich_out -type f
```

Expected artifacts:

:::files
/tmp/enrich_out/
├── catalog.yaml          # kcmd manifest (written by agent via kcmd init)
├── catalog/              # per-entry YAML + sidecar Markdown
├── trajectory.json       # tool-call log of what the agent read and produced
└── refine_session.json   # saved session for refinement re-invocation
:::

In **context_overlay** mode, each table directory also contains read-only mirrors:

```
catalog/bigquery///
├── orders.ref.yaml            # read-only 1P entry (kcmd reference)
├── orders.ref.overview.md     # existing 1P overview, if any
├── orders.yaml                # pushable overlay entry
├── orders.overview.md         # enriched overview
└── orders.queries.md          # queries aspect sidecar
```

## Refine output interactively

After the initial run, refine without re-reading source docs or re-pulling the dataset. Each entry stores its grounding prompt in `refine_session.json`, so refinement reuses loaded context.

**Interactive session (`--interactive`)**

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=./local_md_corpus \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out \
  --interactive
```

At the `refine>` prompt you can rewrite overviews, add sections, re-enumerate entries, or ask questions. Commands: `:entries`, `:show `, `:quit`. The flag is a no-op on a non-TTY.

**Single instruction (`--refine_instruction`)**

```bash
python3 agents/enrichment/src/agent_runner.py \
  --refine_instruction="make the orders overview more concise" \
  --output_dir=/tmp/enrich_out \
  --project=my-gcp-project \
  --model=gemini-2.5-pro
```

This form requires a prior run's `refine_session.json` and skips the enrichment pipeline entirely.

Refinement operations:

| Operation | Effect |
|-----------|--------|
| `rewrite` | Re-generate one or more entry overviews with a change |
| `reenumerate` | Add, remove, split, merge, or recategorize entries (doc mode fully; table/overlay modes re-categorize only, since entries are pinned to dataset tables) |
| `answer` | Respond to a question about the output; no files change |
| `noop` | Ask for clarification when the request is ambiguous |

## Publish enriched metadata

The agent generates mdcode only. Pushing to Dataplex is your step:

```bash
cd /tmp/enrich_out
CLOUDSDK_CORE_PROJECT= CLOUDSDK_COMPUTE_REGION= kcmd push
```

See [Publish enriched metadata](/publish-enriched-metadata) for push options, entry-link reconciliation, and reference-layer constraints.

## Evaluate before publishing

Score a run with the golden-free evaluator (no reference answers required):

```bash
cd agents/enrichment
pip install -r eval/requirements.txt
python -m eval --output-dir /tmp/enrich_out
```

The evaluator writes `eval_report.md` next to `trajectory.json`. See [Evaluate enrichment output](/evaluate-enrichment-output).

## Troubleshooting

| Symptom | Likely cause | What to check |
|---------|--------------|---------------|
| `kcmd not found` | Binary not built | `cd agents/mdcode && npm run build` or set `$KCMD_BIN` |
| `--entry_group is required` | Missing flag in doc/overlay mode | Pass `project.location.entryGroupId`; create the group with `gcloud dataplex entry-groups create` first |
| No reference tables pulled | Dataset or permissions | Verify `--dataset` and read access to `@bigquery` entries |
| GitHub code context empty | MCP auth or scope | Confirm `GITHUB_PERSONAL_ACCESS_TOKEN`; check `[Code]` log lines for tool-call counts |
| `queries` push 403 | Missing permission | Caller needs `dataplex.entryGroups.useQueriesAspect`; overview still publishes |
| Refinement skipped | Non-interactive shell | Use `--refine_instruction` for webapp-style single-turn refine |

More signals in [Troubleshooting](/troubleshooting).

## Related pages

- Prerequisites, Python and Node.js setup, and credential configuration.
- How agents read metadata, ground on external sources, and hand off to kcmd push.
- [Enrichment agent flags reference](/enrichment-agent-flags): Complete `agent_runner.py` flag reference by mode.
- [Publish enriched metadata](/publish-enriched-metadata): Push mdcode workspaces and reconcile entry links.
- [Evaluate enrichment output](/evaluate-enrichment-output): Golden-free and golden-based scoring of enrichment runs.
- Initialize kcmd workspaces and pull catalog snapshots.