Agent-readable docs

Knowledge Catalog Documentation

Reference for Knowledge Catalog context engineering: Open Knowledge Format bundles, Metadata as Code (kcmd) sync, enrichment agents, discovery samples, and evaluation workflows for data stewards and agent builders.

Pages

  1. Overview - Knowledge Catalog tooling surface: OKF bundles, kcmd Metadata as Code workspaces, enrichment and discovery agents, and the shortest paths to produce and publish metadata context.
  2. Installation - Prerequisites, Python and Node.js setup, package installs, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud Application Default Credentials.
  3. Quickstart - First successful runs: initialize a kcmd workspace and pull metadata, produce an OKF bundle from BigQuery, or run the catalog enrichment agent and inspect output.
  4. Open Knowledge Format - OKF v0.1 bundle structure, concept documents, frontmatter fields, index.md progressive disclosure, and cross-link semantics for vendor-neutral knowledge exchange.
  5. Metadata as Code - kcmd workspace model: catalog.yaml manifest, YAML and Markdown layouts, pull/push sync, reference layers, entry links, and glossary scope for Knowledge Catalog metadata.
  6. Enrichment workflows - How enrichment agents read source metadata, ground on external docs or code, emit OKF bundles or mdcode artifacts, and hand off to kcmd push for catalog publication.
  7. Sync catalog metadata - Initialize a kcmd workspace for BigQuery, knowledge base, entry group, BigLake, or glossary scope; pull snapshots; check status; and push local edits back to Knowledge Catalog.
  8. Produce OKF bundles - Run the OKF enrichment agent against a BigQuery source with optional web crawl seeds, concept scoping, and two-pass BQ-then-web enrichment into a versionable bundle directory.
  9. Run the catalog enrichment agent - Execute table, doc, or context_overlay modes with Drive, local markdown, GitHub, feedback, glossary, and usage-signal inputs; refine output interactively before publishing.
  10. Publish enriched metadata - Push mdcode workspaces with kcmd, publish sample enrichment output via catalog APIs, and reconcile entry links and aspects without modifying read-only reference layers.
  11. Visualize OKF bundles - Generate self-contained viz.html graph viewers from OKF bundles with force-directed layouts, concept detail panels, backlinks, and in-browser markdown rendering.
  12. Run the discovery agent - Deploy the Knowledge Catalog discovery agent with ADK: required GCP APIs and IAM roles, environment variables, and root-agent or AgentTool integration patterns.

Complete Markdown

# Knowledge Catalog Documentation

> Reference for Knowledge Catalog context engineering: Open Knowledge Format bundles, Metadata as Code (kcmd) sync, enrichment agents, discovery samples, and evaluation workflows for data stewards and agent builders.

## Context Links

- [Agent index](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms.txt)
- [Human interactive docs](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5)
- [GitHub repository](https://github.com/GoogleCloudPlatform/knowledge-catalog)

## Repository Metadata

- Repository: GoogleCloudPlatform/knowledge-catalog
- Branch: main
- Generated: 2026-06-15T02:58:52.928Z
- Updated: 2026-06-15T03:01:11.890Z
- Runtime: Grok CLI
- Format: Documentation
- Pages: 22

## Page Index

- 01. [Overview](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/01-overview.md) - Knowledge Catalog tooling surface: OKF bundles, kcmd Metadata as Code workspaces, enrichment and discovery agents, and the shortest paths to produce and publish metadata context.
- 02. [Installation](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/02-installation.md) - Prerequisites, Python and Node.js setup, package installs, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud Application Default Credentials.
- 03. [Quickstart](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/03-quickstart.md) - First successful runs: initialize a kcmd workspace and pull metadata, produce an OKF bundle from BigQuery, or run the catalog enrichment agent and inspect output.
- 04. [Open Knowledge Format](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/04-open-knowledge-format.md) - OKF v0.1 bundle structure, concept documents, frontmatter fields, index.md progressive disclosure, and cross-link semantics for vendor-neutral knowledge exchange.
- 05. [Metadata as Code](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/05-metadata-as-code.md) - kcmd workspace model: catalog.yaml manifest, YAML and Markdown layouts, pull/push sync, reference layers, entry links, and glossary scope for Knowledge Catalog metadata.
- 06. [Enrichment workflows](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/06-enrichment-workflows.md) - How enrichment agents read source metadata, ground on external docs or code, emit OKF bundles or mdcode artifacts, and hand off to kcmd push for catalog publication.
- 07. [Sync catalog metadata](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/07-sync-catalog-metadata.md) - Initialize a kcmd workspace for BigQuery, knowledge base, entry group, BigLake, or glossary scope; pull snapshots; check status; and push local edits back to Knowledge Catalog.
- 08. [Produce OKF bundles](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/08-produce-okf-bundles.md) - Run the OKF enrichment agent against a BigQuery source with optional web crawl seeds, concept scoping, and two-pass BQ-then-web enrichment into a versionable bundle directory.
- 09. [Run the catalog enrichment agent](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/09-run-the-catalog-enrichment-agent.md) - Execute table, doc, or context_overlay modes with Drive, local markdown, GitHub, feedback, glossary, and usage-signal inputs; refine output interactively before publishing.
- 10. [Publish enriched metadata](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/10-publish-enriched-metadata.md) - Push mdcode workspaces with kcmd, publish sample enrichment output via catalog APIs, and reconcile entry links and aspects without modifying read-only reference layers.
- 11. [Visualize OKF bundles](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/11-visualize-okf-bundles.md) - Generate self-contained viz.html graph viewers from OKF bundles with force-directed layouts, concept detail panels, backlinks, and in-browser markdown rendering.
- 12. [Run the discovery agent](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/12-run-the-discovery-agent.md) - Deploy the Knowledge Catalog discovery agent with ADK: required GCP APIs and IAM roles, environment variables, and root-agent or AgentTool integration patterns.
- 13. [Evaluate enrichment output](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/13-evaluate-enrichment-output.md) - Score enrichment runs with dynamic golden-free metrics or golden-based eval: structural validity, hallucination checks, fact recall, consistency across runs, and report artifacts.
- 14. [kcmd CLI reference](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/14-kcmd-cli-reference.md) - kcmd commands, init flags per source type, pull and push options including dry-run, force, validate-only, reference pull, and authentication via gcloud ADC.
- 15. [catalog.yaml manifest reference](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/15-catalog.yaml-manifest-reference.md) - scope, snapshot, publishing, reference, aliases, entry and aspect types, entryLinks reconciliation rules, and layout selection for YAML versus Markdown knowledge-base mode.
- 16. [kcmd MCP server reference](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/16-kcmd-mcp-server-reference.md) - MCP server startup, workspace path binding, and agent tools for pull, push, list-entries, lookup-entry, and modify-entry in agentic metadata workflows.
- 17. [OKF enrichment-agent CLI reference](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/17-okf-enrichment-agent-cli-reference.md) - enrichment-agent enrich and visualize subcommands, BigQuery source flags, web crawl constraints, concept scoping, model selection, and environment variables.
- 18. [Enrichment agent flags reference](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/18-enrichment-agent-flags-reference.md) - agent_runner.py flags by mode: table, doc, context_overlay; source inputs, usage signal, glossaries, feedback, GitHub MCP, refinement, and required Vertex project and model values.
- 19. [OKF bundle recipes](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/19-okf-bundle-recipes.md) - Copy-paste enrichment recipes for GA4 merchandise store, Stack Overflow, and Bitcoin public datasets with seed files, exact commands, and expected bundle outputs.
- 20. [Toolbox enrichment demo](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/20-toolbox-enrichment-demo.md) - End-to-end TypeScript demo: kcmd init and pull, kcagent enrich with md-fileset MCP tools, fileset skills, prompt configuration, and BigQuery demo dataset setup.
- 21. [Troubleshooting](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/21-troubleshooting.md) - Common auth, billing, push conflict, web crawl cap, glossary provisioning, and model credential failures with verification signals from tests and README constraints.
- 22. [Contributing](https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/22-contributing.md) - CLA requirements, fork-and-PR workflow, style expectations, and test commands for Python pytest and TypeScript npm run test across package directories.

## Source File Index

- `agents/enrichment/eval/__main__.py`
- `agents/enrichment/eval/dynamic_eval.py`
- `agents/enrichment/eval/golden_eval.py`
- `agents/enrichment/eval/goldens/README.md`
- `agents/enrichment/eval/goldens/TEMPLATE.json`
- `agents/enrichment/eval/metrics.py`
- `agents/enrichment/README.md`
- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/src/engine.py`
- `agents/enrichment/src/linking.py`
- `agents/enrichment/src/modes/context_overlay_mode.py`
- `agents/enrichment/src/modes/doc_mode.py`
- `agents/enrichment/src/modes/table_mode.py`
- `agents/enrichment/src/refine.py`
- `agents/enrichment/src/requirements.txt`
- `agents/enrichment/src/tools/bq_usage_tools.py`
- `agents/enrichment/src/tools/feedback_tools.py`
- `agents/enrichment/src/tools/github_tools.py`
- `agents/enrichment/src/tools/kcmd_tools.py`
- `agents/mdcode/docs/concept.md`
- `agents/mdcode/docs/design.md`
- `agents/mdcode/package.json`
- `agents/mdcode/README.md`
- `agents/mdcode/src/libts/gcp/context.ts`
- `agents/mdcode/src/libts/layout.ts`
- `agents/mdcode/src/libts/manifest.ts`
- `agents/mdcode/src/libts/snapshot.ts`
- `agents/mdcode/src/libts/sync.ts`
- `agents/mdcode/src/tool/commands.ts`
- `agents/mdcode/src/tool/main.ts`
- `agents/mdcode/src/tool/mcp.ts`
- `CODE_OF_CONDUCT.md`
- `CONTRIBUTING.md`
- `LICENSE.md`
- `okf/bundles/crypto_bitcoin/index.md`
- `okf/bundles/stackoverflow/datasets/stackoverflow.md`
- `okf/bundles/stackoverflow/index.md`
- `okf/pyproject.toml`
- `okf/README.md`
- `okf/samples/crypto_bitcoin/README.md`
- `okf/samples/ga4_merch_store/README.md`
- `okf/samples/ga4_merch_store/seeds.txt`
- `okf/samples/stackoverflow/README.md`
- `okf/samples/stackoverflow/seeds.txt`
- `okf/SPEC.md`
- `okf/src/enrichment_agent/agent.py`
- `okf/src/enrichment_agent/bundle/document.py`
- `okf/src/enrichment_agent/bundle/index.py`
- `okf/src/enrichment_agent/bundle/paths.py`
- `okf/src/enrichment_agent/cli.py`
- `okf/src/enrichment_agent/prompts/enrichment_instruction.md`
- `okf/src/enrichment_agent/prompts/web_ingestion_instruction.md`
- `okf/src/enrichment_agent/runner.py`
- `okf/src/enrichment_agent/sources/bigquery.py`
- `okf/src/enrichment_agent/tools/web_tools.py`
- `okf/src/enrichment_agent/viewer/generator.py`
- `okf/src/enrichment_agent/viewer/static/viz.css`
- `okf/src/enrichment_agent/viewer/static/viz.js`
- `okf/src/enrichment_agent/viewer/templates/viz.html`
- `okf/src/enrichment_agent/web/fetcher.py`
- `okf/tests/test_bigquery_source.py`
- `okf/tests/test_web_fetcher.py`
- `README.md`
- `samples/discovery/agent.py`
- `samples/discovery/README.md`
- `samples/discovery/requirements.txt`
- `samples/discovery/SKILL.md`
- `samples/discovery/tools.py`
- `samples/discovery/utils.py`
- `samples/enrichment/README.md`
- `samples/enrichment/src/enrichment/enrich.py`
- `samples/enrichment/src/enrichment/publish.py`
- `samples/enrichment/src/env.sh`
- `samples/enrichment/src/tools/fileskb/README.md`
- `samples/README.md`
- `toolbox/enrichment/package.json`
- `toolbox/enrichment/README.md`
- `toolbox/enrichment/src/agent/enrich/agent.ts`
- `toolbox/enrichment/src/agent/enrich/command.ts`
- `toolbox/enrichment/src/tools/md/server.ts`
- `toolbox/mdcode/docs/concept.md`
- `toolbox/mdcode/docs/spec.md`
- `toolbox/mdcode/README.md`
- `toolbox/mdcode/src/tool/commands.ts`
- `toolbox/mdcode/src/tool/main.ts`
- `toolbox/mdcode/src/tool/mcp.ts`
- `toolbox/README.md`

---

## 01. Overview

> Knowledge Catalog tooling surface: OKF bundles, kcmd Metadata as Code workspaces, enrichment and discovery agents, and the shortest paths to produce and publish metadata context.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/01-overview.md
- Generated: 2026-06-15T02:52:38.507Z

### Source Files

- `README.md`
- `toolbox/README.md`
- `okf/README.md`
- `agents/enrichment/README.md`
- `samples/README.md`
- `agents/mdcode/README.md`

---
title: "Overview"
description: "Knowledge Catalog tooling surface: OKF bundles, kcmd Metadata as Code workspaces, enrichment and discovery agents, and the shortest paths to produce and publish metadata context."
---

GoogleCloudPlatform/knowledge-catalog ships four cooperating surfaces for Knowledge Catalog (Dataplex): `kcmd` Metadata as Code workspaces under `toolbox/mdcode` and `agents/mdcode`, catalog enrichment agents that emit mdcode or OKF bundles, an OKF specification and reference producer under `okf/`, and ADK-based discovery and enrichment samples under `samples/`. All publish paths converge on version-controlled artifacts—YAML/Markdown mdcode trees or plain-markdown OKF bundles—that agents and humans can read without proprietary SDKs.

## Repository layout

| Path | Role | Primary CLIs / entrypoints |
|------|------|---------------------------|
| `toolbox/mdcode/` | Metadata as Code CLI, library, MCP server | `kcmd init`, `pull`, `push`, `reference`, `mcp` |
| `toolbox/enrichment/` | TypeScript enrichment harness | `kcagent enrich` (depends on `kcmd`) |
| `agents/mdcode/` | Same `kcmd` package, vendored for Python agents | `kcmd` (built to `agents/mdcode/dist/kcmd`) |
| `agents/enrichment/` | Python catalog enrichment agent + eval | `agent_runner.py`, `python -m eval` |
| `okf/` | Open Knowledge Format v0.1 + reference producer | `python -m enrichment_agent enrich`, `visualize` |
| `samples/discovery/` | Knowledge Catalog search agent (ADK) | `adk run …` |
| `samples/enrichment/` | Python enrichment sample + publish helpers | `enrichment/enrich.py` |

<Note>
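`toolbox/` and `agents/` both carry the same TypeScript `kcmd` package (`toolbox/mdcode` and `agents/mdcode`). The enrichment capability is implemented in TypeScript under `toolbox/enrichment/` and in Python under `agents/enrichment/`. The Python enrichment agent shells out to `agents/mdcode/dist/kcmd` and never calls the Dataplex API directly.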
</Note>

## Metadata formats

Two artifact families cover the same problem—making catalog knowledge agent-ready—with different sync models.

| Format | Location | Structure | Sync target | Best for |
|--------|----------|-----------|-------------|----------|
| **Metadata as Code (mdcode)** | `catalog.yaml` + `catalog/` tree | YAML entries, Markdown sidecars (`.overview.md`), read-only `.ref.yaml` reference layers | Knowledge Catalog via `kcmd pull` / `push` | Live catalog publication, glossary links, entry groups |
| **Open Knowledge Format (OKF)** | Bundle directory (format specified in `okf/SPEC.md`) | Directory of `.md` files with YAML frontmatter, optional `index.md` | Files only (tarball, git, static host); no catalog API required | Vendor-neutral exchange, offline bundles, progressive disclosure |

OKF is intentionally provider-neutral: any producer (human, agent, export pipeline) can write bundles; any consumer (LLM, static viewer, search index) can read them. The `okf/` enrichment agent is one reference producer; its `visualize` subcommand (`python -m enrichment_agent visualize`) emits a self-contained `viz.html` graph viewer.
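
As a rough illustration of how little a consumer needs, the following Python sketch walks a bundle directory and splits each `.md` file into frontmatter and body. It is not part of the `okf/` package: the names `split_frontmatter` and `read_bundle` are hypothetical, and the parser only handles flat `key: value` lines, so a real consumer would more likely use a YAML library.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (frontmatter, body).

    Assumption: only flat `key: value` lines are parsed. Nested YAML
    values need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top of the file
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # opening delimiter never closed: treat as plain Markdown
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip().strip('"')
    return meta, "\n".join(lines[end + 1:])


def read_bundle(bundle_dir: str) -> dict[str, dict[str, str]]:
    """Map each .md file (path relative to the bundle root) to its frontmatter."""
    root = Path(bundle_dir)
    return {
        path.relative_to(root).as_posix(): split_frontmatter(
            path.read_text(encoding="utf-8")
        )[0]
        for path in sorted(root.rglob("*.md"))
    }
```

Files without a frontmatter block, such as a plain `index.md`, simply map to an empty dictionary.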

mdcode is the catalog-native path: `catalog.yaml` declares `scope`, `snapshot`, `publishing`, and optional `reference` blocks that control which entries, aspects, and `entryLinks` are pulled, grounded, and pushed.
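
A quick sanity check on a manifest can be sketched without a YAML dependency. The snippet below only looks for the top-level block names mentioned above (`scope`, `snapshot`, `publishing`, and the optional `reference`); it makes no claim about what goes inside each block, and the helper names are hypothetical rather than anything `kcmd` provides.

```python
import re

# Block names taken from the description above. Treating the first three as
# expected and `reference` as optional is an assumption of this sketch.
EXPECTED_BLOCKS = ("scope", "snapshot", "publishing")
OPTIONAL_BLOCKS = ("reference",)

# A top-level YAML key starts in column 0 and is followed by a colon.
_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*):")


def top_level_keys(yaml_text: str) -> list[str]:
    """Return top-level mapping keys in order, ignoring indented lines and comments."""
    keys = []
    for line in yaml_text.splitlines():
        match = _TOP_LEVEL_KEY.match(line)
        if match:
            keys.append(match.group(1))
    return keys


def missing_blocks(yaml_text: str) -> list[str]:
    """List expected blocks that the manifest text does not declare."""
    present = set(top_level_keys(yaml_text))
    return [name for name in EXPECTED_BLOCKS if name not in present]
```

Reading `catalog.yaml` with `Path("catalog.yaml").read_text()` and passing the text to `missing_blocks` gives a cheap pre-flight check before running `kcmd pull` or `kcmd push`.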

## Architecture

```mermaid
flowchart TB
  subgraph sources [Source metadata]
    BQ[BigQuery datasets]
    Drive[Google Drive / local .md]
    GH[GitHub repos via MCP]
    Web[Web crawl seeds]
  end

  subgraph produce [Enrichment producers]
    PyAgent[agents/enrichment/src/agent_runner.py]
    TSAgent[toolbox/enrichment/kcagent]
    OKFAgent[okf/enrichment_agent]
  end

  subgraph artifacts [Version-controlled artifacts]
    Mdcode[catalog.yaml + catalog/ mdcode tree]
    OKFBundle[OKF bundle directory]
  end

  subgraph publish [Catalog publication]
    Kcmd[kcmd pull / push / reference]
    Viz[okf visualize → viz.html]
  end

  subgraph consume [Consumption]
    Catalog[Knowledge Catalog service]
    Agents[Downstream AI agents]
    MCP[kcmd MCP server]
  end

  BQ --> PyAgent
  Drive --> PyAgent
  GH --> PyAgent
  BQ --> OKFAgent
  Web --> OKFAgent
  BQ --> TSAgent

  PyAgent --> Kcmd
  TSAgent --> Kcmd
  Kcmd --> Mdcode
  OKFAgent --> OKFBundle

  Mdcode --> Kcmd
  Kcmd -->|push| Catalog
  OKFBundle --> Viz
  OKFBundle --> Agents
  Mdcode --> MCP
  MCP --> Agents
  Catalog --> Agents
```

## kcmd: Metadata as Code workspace

`kcmd` manages a local workspace bound to a `catalog.yaml` manifest. Initialization selects exactly one primary source type; BigQuery mode accepts multiple datasets via repeated `--bigquery-dataset`.

<ParamField body="--bigquery-dataset" type="string">
BigQuery scope as `project.dataset`. YAML layout for tables, views, schemas.
</ParamField>

<ParamField body="--kb" type="string">
Knowledge base entry group as `project.location.entryGroupId`. Markdown layout for wiki-style pages.
</ParamField>

<ParamField body="--entry-group" type="string">
Custom entry group scope. YAML layout for user-defined entries.
</ParamField>

<ParamField body="--glossary" type="string">
Business glossary scope as `project.location.glossary-id` (comma-separated or location mode). `push` updates existing glossary metadata but does not create glossary tree nodes.
</ParamField>
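
When scripting workspace setup, the repeated `--bigquery-dataset` form can be assembled programmatically. This is a minimal sketch that uses only the `init` subcommand and flag described above; `kcmd_init_argv` is a hypothetical helper, and the `project.dataset` check is deliberately simple (it would reject legacy domain-scoped project IDs).

```python
def kcmd_init_argv(datasets: list[str]) -> list[str]:
    """Build an argv list for `kcmd init` with one flag per BigQuery dataset."""
    if not datasets:
        raise ValueError("at least one project.dataset scope is required")
    argv = ["kcmd", "init"]
    for scope in datasets:
        project, sep, dataset = scope.partition(".")
        # Assumption: exactly one dot separating a project ID and a dataset ID.
        if not (sep and project and dataset) or "." in dataset:
            raise ValueError(f"expected project.dataset, got {scope!r}")
        argv += ["--bigquery-dataset", scope]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from inside the workspace directory.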

Core sync commands:

| Command | Effect |
|---------|--------|
| `kcmd init` | Scaffold `catalog.yaml` for the chosen scope |
| `kcmd pull` | Download editable metadata into `catalog/` |
| `kcmd reference` | Pull read-only `.ref.yaml` grounding layers (never pushed) |
| `kcmd push` | Upload local edits; reconcile `entryLinks` when configured |
| `kcmd mcp` | Expose `list-entries`, `lookup-entry`, `modify-entry` over MCP |

Authentication uses `gcloud` Application Default Credentials for both CLI and MCP.

## Enrichment agents

Three enrichment paths read source metadata, ground on external docs or code, and emit artifacts ready for inspection or publication.

### Python catalog enrichment agent (`agents/enrichment/`)

`agent_runner.py` dispatches three modes. The agent talks to the catalog only through read-only `kcmd init` / `pull` / `reference`; you run `kcmd push` to publish.

| Mode | Required flags | Output |
|------|----------------|--------|
| `table` | `--dataset`, `--project`, `--model`, `--output_dir` | Enriched table overviews on live `@bigquery` entries; optional `queries` aspect and glossary column links |
| `doc` | `--entry_group`, `--project`, `--model`, `--output_dir` | Knowledge-base mdcode snapshot from Drive or local Markdown |
| `context_overlay` | `--dataset`, `--entry_group`, `--project`, `--model`, `--output_dir` | New overlay entries in an owned entry group; live BigQuery entries stay read-only |

Shared optional inputs across modes: `--folders` (Drive or local `.md`), `--feedback_dir` / `--feedback_files` (highest-priority proposals), `--repo` (GitHub MCP code context), `--include_usage` (BigQuery `INFORMATION_SCHEMA` query history), `--interactive` / `--refine_instruction` (post-run refinement).
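
The required-flag table above translates directly into a pre-flight check. The sketch below is a wrapper one might write around the runner, not code from `agent_runner.py`; `missing_flags` and `runner_argv` are hypothetical helpers, and the `--name=value` form follows the usage shown elsewhere in this overview.

```python
# Required flags per mode, copied from the table above.
REQUIRED_FLAGS = {
    "table": ("dataset", "project", "model", "output_dir"),
    "doc": ("entry_group", "project", "model", "output_dir"),
    "context_overlay": ("dataset", "entry_group", "project", "model", "output_dir"),
}


def missing_flags(mode: str, flags: dict[str, str]) -> list[str]:
    """Return required flag names that are absent or empty for the given mode."""
    if mode not in REQUIRED_FLAGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(REQUIRED_FLAGS)}")
    return [name for name in REQUIRED_FLAGS[mode] if not flags.get(name)]


def runner_argv(mode: str, flags: dict[str, str]) -> list[str]:
    """Build the agent_runner.py command line, failing early on missing flags."""
    missing = missing_flags(mode, flags)
    if missing:
        raise ValueError(f"mode {mode!r} is missing required flags: {missing}")
    argv = ["python3", "agents/enrichment/src/agent_runner.py", f"--mode={mode}"]
    argv += [f"--{name}={value}" for name, value in flags.items()]
    return argv
```

Optional inputs such as `folders` or `repo` can be added to the same `flags` dictionary; they pass through unchanged.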

### TypeScript enrichment harness (`toolbox/enrichment/`)

`kcagent enrich` runs against an initialized mdcode workspace with MCP tools and a prompt file:

```bash
kcmd init --bigquery-dataset <projectId>.<datasetId>
kcmd pull
kcagent enrich --catalog-path . --tools-path tools --prompt-path prompt.md
```

The toolbox demo wires an `md-fileset` MCP server and fileset skills for organizational Markdown grounding.

### OKF enrichment agent (`okf/`)

`python -m enrichment_agent enrich` runs a two-pass pipeline: a BigQuery pass writes one OKF concept per advertised concept, then an optional web pass crawls `--web-seed` / `--web-seed-file` URLs with `--web-max-pages` and `--web-allowed-host` caps. Use `--no-web` for BQ-only runs. Sample recipes live under `okf/samples/` with checked-in bundles under `okf/bundles/`.
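
To make the two web caps concrete, here is a small sketch of what a page cap combined with a host allow-list means for a candidate URL list. It is an illustration only, not the logic in `okf/src/enrichment_agent/web/fetcher.py`: the function name is hypothetical, and exact host matching (no subdomain wildcards) is an assumption.

```python
from urllib.parse import urlparse


def select_crawl_urls(
    candidates: list[str],
    allowed_hosts: list[str],
    max_pages: int,
) -> list[str]:
    """Keep at most max_pages unique http(s) URLs whose host is allow-listed."""
    allowed = {host.lower() for host in allowed_hosts}
    selected: list[str] = []
    seen: set[str] = set()
    for url in candidates:
        if len(selected) >= max_pages:
            break  # page cap reached
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if parsed.scheme not in ("http", "https") or host not in allowed:
            continue  # outside the allowed hosts, or not a web URL
        if url in seen:
            continue  # never count the same page twice
        seen.add(url)
        selected.append(url)
    return selected
```

With a cap of 2 and `a.com` as the only allowed host, a list containing links to both `a.com` and `b.com` would yield the first two distinct `a.com` URLs.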

## Discovery agent

`samples/discovery/` implements an ADK agent that calls the Knowledge Catalog Search API. It decomposes natural-language questions, issues multiple search queries, and reranks results. Run as a root agent (`discovery_agent` → `root_agent` in `agent.py`) or embed as an `AgentTool` in a parent agent. Required APIs: `dataplex.googleapis.com`, `aiplatform.googleapis.com`, `serviceusage.googleapis.com`.
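
The source does not spell out how the sample reranks results. As a stand-in that shows the general shape of merging several query result lists into one ranking, the sketch below uses reciprocal rank fusion (RRF), a generic technique that is not claimed to be what `samples/discovery/` does. The function name and the conventional `k=60` constant are assumptions.

```python
from collections import defaultdict


def fuse_rankings(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with reciprocal rank fusion.

    Each entry scores 1 / (k + rank) per list it appears in; entries that
    rank well across several queries rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, entry in enumerate(results, start=1):
            scores[entry] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by entry name for determinism.
    return sorted(scores, key=lambda entry: (-scores[entry], entry))
```

For example, given three sub-query result lists `["a", "b", "c"]`, `["b", "a"]`, and `["b", "d"]`, entry `b` ranks first because it appears near the top of all three.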

## Shortest paths to metadata context

<Steps>
<Step title="Sync existing catalog metadata">
Initialize a workspace, pull a snapshot, and inspect entries locally.

```bash
cd my-workspace
kcmd init --bigquery-dataset my-project.my-dataset
kcmd pull
```

Verify: `catalog/` contains YAML entries and `catalog.yaml` lists `scope` and `snapshot` types.
</Step>

<Step title="Produce an OKF bundle from BigQuery">
Generate a portable, git-friendly knowledge bundle without catalog push.

```bash
cd okf
python3 -m venv .venv && .venv/bin/pip install -e '.[dev]'
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset my-project.my-dataset \
  --web-seed-file seeds.txt \
  --out ./bundles/my-bundle
```

Verify: the bundle directory contains concept `.md` files with frontmatter. Optionally generate `viz.html` with `.venv/bin/python -m enrichment_agent visualize --bundle ./bundles/my-bundle`.
</Step>

<Step title="Enrich catalog metadata with the Python agent">
Generate mdcode from BigQuery plus Drive or local Markdown, then publish.

```bash
export PYTHONPATH=agents/enrichment/src
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-project.my-dataset \
  --folders=./local_md_corpus \
  --project=my-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out

cd /tmp/enrich_out && kcmd push
```

Verify: `find /tmp/enrich_out -type f` lists files under `catalog/` (including enriched overview sidecars) and `trajectory.json`.
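
The same check can be scripted, for example as a gate before `kcmd push`. This is a hedged sketch: `summarize_output` is a hypothetical helper, and it assumes overview sidecars are files whose names end in `.overview.md`.

```python
from pathlib import Path


def summarize_output(output_dir: str) -> dict[str, object]:
    """Report whether an enrichment output directory has the expected pieces."""
    root = Path(output_dir)
    return {
        "has_catalog_dir": (root / "catalog").is_dir(),
        "has_trajectory": (root / "trajectory.json").is_file(),
        # Assumption: sidecars are any files ending in `.overview.md`.
        "overview_sidecars": sorted(
            path.relative_to(root).as_posix()
            for path in root.rglob("*.overview.md")
        ),
    }
```

An empty `overview_sidecars` list after a `table` run is a signal to inspect the run before publishing.
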
</Step>

<Step title="Discover assets with the search agent">
Deploy the discovery agent for semantic catalog search.

```bash
# Run from the repository root so the agent path below resolves.
pip install -r samples/discovery/requirements.txt
export GOOGLE_CLOUD_PROJECT=my-project
export GOOGLE_GENAI_USE_VERTEXAI=True
adk run samples/discovery
```

Verify: agent responds to natural-language data-asset queries using `knowledge_catalog_search`.
</Step>
</Steps>

## Evaluation and quality gates

Before publishing mdcode, score a run with the golden-free evaluator:

```bash
cd agents/enrichment
python -m eval --output-dir /tmp/enrich_out
```

Deterministic metrics (`structural_validity`, `perf`) always run. Judge-based metrics (`hallucination_free`, `redundancy_index`, `disambiguation_efficacy`, `absence_of_contradictions`) activate when Vertex AI ADC is configured. Optional `--golden` files add concept recall, fact recall, and consistency metrics across repeated runs.

## Prerequisites at a glance

| Requirement | Used by |
|-------------|---------|
| Node.js + npm | Build `kcmd` and `kcagent` (`npm run build` in package dirs) |
| Python 3.11+ (the OKF install example uses `python3.13`) | `agents/enrichment`, `okf/`, `samples/` |
| `gcloud` ADC | `kcmd`, BigQuery, Drive read scopes for enrichment |
| Vertex AI or Gemini credentials | Enrichment and discovery agents (`--project`, `--model`, or `GEMINI_API_KEY`) |
| GitHub PAT (optional) | `--repo` code context via GitHub MCP |

<Warning>
Public BigQuery datasets are readable but query bytes bill against your project. Web enrichment enforces `--web-max-pages` and domain filters inside the fetch tool.
</Warning>

## Choosing a path

| Goal | Start here |
|------|------------|
| Bi-directional sync with Knowledge Catalog | `kcmd init` → `pull` → edit → `push` |
| Portable, vendor-neutral knowledge for agents | OKF `enrich` → git or static host |
| Rich table docs grounded on Drive, usage SQL, glossaries | `agent_runner.py --mode=table` |
| Wiki/knowledge base from docs | `agent_runner.py --mode=doc` |
| Enriched context without touching live BQ entries | `agent_runner.py --mode=context_overlay` |
| Natural-language asset discovery | `samples/discovery` ADK agent |
| Agent-driven metadata edits in CI | `kcmd mcp` with `modify-entry` |

<Info>
Model backends in this repository default to Vertex AI / Gemini, but both OKF bundles and mdcode trees are consumable by any agent that reads files—no provider lock-in at the artifact layer.
</Info>

## Next

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node.js setup, package installs, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud ADC.
</Card>
<Card title="Quickstart" href="/quickstart">
First successful runs: initialize a kcmd workspace and pull metadata, produce an OKF bundle from BigQuery, or run the catalog enrichment agent and inspect output.
</Card>
<Card title="Open Knowledge Format" href="/open-knowledge-format">
OKF v0.1 bundle structure, concept documents, frontmatter fields, and cross-link semantics.
</Card>
<Card title="Metadata as Code" href="/metadata-as-code">
kcmd workspace model: catalog.yaml manifest, pull/push sync, reference layers, and glossary scope.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How enrichment agents read source metadata, ground on external docs or code, and hand off to kcmd push.
</Card>
</CardGroup>

---

## 02. Installation

> Prerequisites, Python and Node.js setup, package installs, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud Application Default Credentials.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/02-installation.md
- Generated: 2026-06-15T02:53:00.431Z

### Source Files

- `okf/pyproject.toml`
- `okf/README.md`
- `agents/mdcode/package.json`
- `toolbox/enrichment/package.json`
- `agents/enrichment/src/requirements.txt`
- `samples/discovery/requirements.txt`
- `samples/enrichment/src/env.sh`

---
title: "Installation"
description: "Prerequisites, Python and Node.js setup, package installs, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud Application Default Credentials."
---

Knowledge Catalog tooling in this repository ships as independent install surfaces—`kcmd` (TypeScript/Bun binary), Python enrichment agents, and sample workflows—each with its own package manifest but a shared dependency on `gcloud` Application Default Credentials (ADC) for GCP APIs and a separate model credential path for LLM-backed agents.

## Install surfaces

| Surface | Path | Runtime | Primary binary / entrypoint |
| --- | --- | --- | --- |
| Metadata as Code (`kcmd`) | `agents/mdcode/` (mirror: `toolbox/mdcode/`) | Node.js, npm, Bun | `dist/kcmd` |
| Catalog enrichment agent | `agents/enrichment/` | Python 3.11+ | `agents/enrichment/src/agent_runner.py` |
| OKF enrichment agent | `okf/` | Python ≥ 3.11 | `enrichment-agent` / `python -m enrichment_agent` |
| Toolbox enrichment harness | `toolbox/enrichment/` | Node.js, npm, Bun | `dist/kcagent`, `dist/md-fileset` |
| Discovery agent sample | `samples/discovery/` | Python 3.11+ | ADK CLI against `agent.py` |
| Enrichment sample | `samples/enrichment/` | Python 3.11+ | `python3 -m enrichment.*` |

<Note>
Install only the surfaces you plan to run. The catalog enrichment agent shells out to a built `kcmd` binary; the toolbox `kcagent` package depends on the sibling `toolbox/mdcode` build.
</Note>

## Prerequisites

| Requirement | Used by | Verification |
| --- | --- | --- |
| **gcloud CLI** | `kcmd`, BigQuery, Knowledge Catalog APIs, ADC | `gcloud --version` |
| **Python 3.11+** | OKF agent (`requires-python = ">=3.11"`), catalog enrichment agent, samples | `python3 --version` |
| **Node.js + npm** (recent LTS) | `kcmd`, `kcagent` builds | `node --version`, `npm --version` |
| **Bun** (via `npm install`) | Compiles standalone `kcmd` and `kcagent` binaries | Installed as a devDependency; invoked by `npm run build` |

For discovery-agent deployment you also need a GCP project with Knowledge Catalog (`dataplex.googleapis.com`), Vertex AI (`aiplatform.googleapis.com`), and Service Usage (`serviceusage.googleapis.com`) APIs enabled, plus IAM roles that grant `dataplex.projects.search`, `aiplatform.endpoints.predict`, and `serviceusage.services.use`.
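
A small helper can turn the API list above into a `gcloud services enable` invocation and flag anything still disabled. This is a sketch under stated assumptions: the helper names are hypothetical, and `missing_apis` expects newline-separated service names such as the output of `gcloud services list --enabled --format="value(config.name)"`.

```python
# API names copied from the paragraph above.
REQUIRED_APIS = (
    "dataplex.googleapis.com",
    "aiplatform.googleapis.com",
    "serviceusage.googleapis.com",
)


def enable_apis_argv(project_id: str) -> list[str]:
    """Build a `gcloud services enable` command for the required APIs."""
    return ["gcloud", "services", "enable", *REQUIRED_APIS, "--project", project_id]


def missing_apis(enabled_services: str) -> list[str]:
    """Return required APIs absent from a newline-separated list of enabled services."""
    enabled = {line.strip() for line in enabled_services.splitlines() if line.strip()}
    return [api for api in REQUIRED_APIS if api not in enabled]
```

IAM role assignment is a separate step and is not covered by this helper.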

## Install `kcmd`

`kcmd` is the Metadata as Code CLI and MCP server. Build it from `agents/mdcode` (or the equivalent `toolbox/mdcode` tree).

<Steps>
<Step title="Install Node dependencies">

```bash
cd agents/mdcode
npm install
```

</Step>

<Step title="Build the standalone binary">

```bash
npm run build
```

Produces `agents/mdcode/dist/kcmd`. The build compiles TypeScript (`build:libts`) then uses Bun to emit a single executable (`build:tool`).

</Step>

<Step title="Optional: add kcmd to PATH">

```bash
echo "export PATH=\"$(pwd)/dist:\$PATH\"" >> ~/.zshrc
source ~/.zshrc
which kcmd
```

The catalog enrichment agent resolves `agents/mdcode/dist/kcmd` automatically; PATH is only required when you invoke `kcmd` yourself (for example `kcmd push`).

</Step>
</Steps>

<Tabs>
<Tab title="agents/mdcode">

Canonical path referenced by `agents/enrichment/src/tools/kcmd_tools.py`.

```bash
cd agents/mdcode && npm install && npm run build
```

</Tab>

<Tab title="toolbox/mdcode">

Used by the toolbox enrichment demo and `kcagent` package (`kcmd: file:../mdcode`).

```bash
cd toolbox/mdcode && npm install && npm run build
```

</Tab>
</Tabs>

## Install Python agents

### Catalog enrichment agent

<Steps>
<Step title="Build kcmd first">

Follow the `kcmd` steps above. The agent never calls the Dataplex API directly—it shells out to `kcmd init`, `kcmd pull`, and `kcmd reference`.

</Step>

<Step title="Create a virtual environment and install dependencies">

```bash
python3 -m venv ~/.venv/kc-enrich
source ~/.venv/kc-enrich/bin/activate
pip install -r agents/enrichment/src/requirements.txt
```

Core packages: `google-adk`, `google-genai`, `google-api-python-client`, `google-auth`, `google-cloud-bigquery`, `pyyaml`, `requests`, `absl-py`. Install `mcp` only when `--repo` uses a local stdio GitHub MCP server; the default hosted remote works without it.

</Step>

<Step title="Set PYTHONPATH for direct invocation">

```bash
export PYTHONPATH=agents/enrichment/src
```

</Step>
</Steps>

### OKF enrichment agent

From the `okf/` directory:

```bash
python3.13 -m venv .venv
.venv/bin/pip install --index-url https://pypi.org/simple/ -e '.[dev]'
```

The package (`enrichment-agent`) requires Python ≥ 3.11 and installs `google-adk`, `google-cloud-bigquery`, `pyyaml`, `pydantic`, and `markdownify`. The `enrichment-agent` console script maps to `enrichment_agent.cli:main`.

Run tests after install:

```bash
.venv/bin/pytest
```

### Toolbox `kcagent`

```bash
cd toolbox/enrichment
npm install
npm run build
```

Produces `dist/kcagent` and `dist/md-fileset`. Requires a built `toolbox/mdcode/dist/kcmd` on the relative path `../../mdcode/dist/kcmd`.

### Samples

<Tabs>
<Tab title="Discovery sample">

```bash
python3 -m venv /tmp/kcsearch
source /tmp/kcsearch/bin/activate
cd samples/discovery
pip3 install -r requirements.txt
```

Dependencies: `google-adk`, `google-cloud-dataplex`, `google-api-core`.

</Tab>

<Tab title="Enrichment sample">

```bash
git clone https://github.com/GoogleCloudPlatform/knowledge-catalog.git
cd knowledge-catalog/samples/enrichment/src
source env.sh --install
```

`env.sh --install` creates `.venv` and runs `pip install -r requirements.txt`. It exports `GOOGLE_GENAI_USE_VERTEXAI=True` and reads the active gcloud project into `KC_ENRICH_SAMPLE_PROJECT`.

</Tab>
</Tabs>

### Evaluation tooling (optional)

To score enrichment output with `python -m eval --run`, install both requirement files:

```bash
pip install -r agents/enrichment/src/requirements.txt \
            -r agents/enrichment/eval/requirements.txt
```

## Credential configuration

```mermaid
flowchart LR
  subgraph gcloud["gcloud CLI"]
    ADC["ADC token"]
    Proj["config project"]
    Region["compute/region"]
  end
  subgraph consumers["Consumers"]
    kcmd["kcmd / MCP"]
    BQ["BigQuery client"]
    Catalog["Knowledge Catalog HTTP"]
  end
  subgraph llm["LLM backends"]
    Vertex["Vertex AI"]
    Studio["Gemini API key"]
  end
  ADC --> kcmd
  ADC --> BQ
  ADC --> Catalog
  Proj --> kcmd
  Region --> kcmd
  Vertex --> OKF["OKF agent"]
  Vertex --> CatalogAgent["Catalog enrichment agent"]
  Studio --> OKF
```

### Application Default Credentials

`kcmd` obtains tokens by shelling out to gcloud:

```bash
gcloud auth application-default login
gcloud config set project <project-id>
gcloud config set compute/region <region>
```

`ApiContext.default()` reads the active project (`gcloud config get-value project`), compute region (`gcloud config get-value compute/region`), and ADC access token (`gcloud auth application-default print-access-token`). All three must be non-empty or `kcmd` fails fast. Tokens refresh automatically on HTTP 401 via `gcloud auth application-default print-access-token`.

<Warning>
`kcmd` requires a configured compute region, not just a project. Set `gcloud config set compute/region` before running `kcmd pull` or `kcmd push`.
</Warning>
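
The fail-fast rule above (project, region, and token must all be non-empty) can be approximated in a few lines. This Python sketch is not `kcmd`'s implementation; `read_gcloud_context` and the injectable `run` callable are hypothetical names, chosen so the check can be exercised without a real gcloud installation. Only the three gcloud commands come from this page.

```python
import subprocess
from typing import Callable, Dict, List

# The three gcloud commands this page says ApiContext.default() reads.
GCLOUD_COMMANDS: Dict[str, List[str]] = {
    "project": ["gcloud", "config", "get-value", "project"],
    "region": ["gcloud", "config", "get-value", "compute/region"],
    "token": ["gcloud", "auth", "application-default", "print-access-token"],
}


def _run_gcloud(argv: List[str]) -> str:
    """Run one gcloud command and return its trimmed stdout."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout.strip()


def read_gcloud_context(run: Callable[[List[str]], str] = _run_gcloud) -> Dict[str, str]:
    """Collect project, region, and token; raise if any value is empty."""
    values = {name: run(argv) for name, argv in GCLOUD_COMMANDS.items()}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("gcloud returned no value for: " + ", ".join(missing))
    return values
```

Calling `read_gcloud_context()` with no arguments shells out to gcloud; passing a stub for `run` makes the same check usable in tests.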

For catalog enrichment with Google Drive sources, request Drive read scope at login:

```bash
gcloud auth application-default login \
  --scopes='openid,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive.readonly'
```

The enrichment sample also sets a quota project:

```bash
gcloud auth application-default set-quota-project $CLOUD_PROJECT
```

### BigQuery

BigQuery clients use ADC. Public datasets (for example `bigquery-public-data.*`) are readable, but query bytes bill against the caller's project:

```bash
gcloud auth application-default login
gcloud config set project <your-billing-project>
```

The OKF agent accepts an optional `--billing-project` flag; when omitted, the BigQuery client uses the ADC default project.

### Vertex AI and Gemini

Model credentials are separate from catalog ADC. Choose one backend:

<Tabs>
<Tab title="Vertex AI (catalog enrichment agent)">

The catalog enrichment agent always sets Vertex mode from CLI flags—no manual env export is required at runtime:

<ParamField body="--project" type="string" required>
GCP project for the Vertex AI model. Also sets `GOOGLE_CLOUD_PROJECT`.
</ParamField>

<ParamField body="--location" type="string">
Vertex AI region. Default: `global`.
</ParamField>

<ParamField body="--model" type="string" required>
Model ID, for example `gemini-2.5-pro`.
</ParamField>

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=<project>.<dataset> \
  --project=<your_gcp_project> \
  --location=us-central1 \
  --model=gemini-2.5-pro \
  --output_dir=<local_output_dir>
```

</Tab>

<Tab title="Vertex AI (OKF / discovery samples)">

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<project-id>
export GOOGLE_CLOUD_LOCATION=<region>
```

Discovery sample sets `GOOGLE_GENAI_USE_VERTEXAI=True` and `GOOGLE_CLOUD_PROJECT` before running via ADK.

</Tab>

<Tab title="Gemini API key (OKF agent)">

For AI Studio instead of Vertex:

```bash
export GEMINI_API_KEY=<your-api-key>
```

Do not set `GOOGLE_GENAI_USE_VERTEXAI` when using an API key. The OKF agent default model is `gemini-flash-latest` (override with `--model`).

</Tab>
</Tabs>

### Optional credentials

| Variable / secret | When needed |
| --- | --- |
| `KCMD_BIN` | Override auto-resolved `agents/mdcode/dist/kcmd` path |
| `GITHUB_PERSONAL_ACCESS_TOKEN` | `--repo` GitHub source via GitHub MCP server |
| `KC_ENRICH_MCP_CONFIG` | Custom MCP server configuration for GitHub tools |
| `KC_ENRICH_SAMPLE_PROJECT` | Set automatically by `samples/enrichment/src/env.sh` from gcloud config |
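
The `KCMD_BIN` override can be pictured as a lookup order. The sketch below is an illustration only: `resolve_kcmd` is a hypothetical helper, and the final PATH fallback is an assumption rather than documented agent behavior. The override and the `agents/mdcode/dist/kcmd` default are the only parts taken from this page.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def resolve_kcmd(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to kcmd, or None when no candidate is found."""
    env = os.environ if env is None else env
    # 1. An explicit KCMD_BIN override always wins.
    override = env.get("KCMD_BIN")
    if override:
        return override
    # 2. The build output documented on this page.
    built = repo_root / "agents" / "mdcode" / "dist" / "kcmd"
    if built.is_file():
        return str(built)
    # 3. Assumed fallback: whatever `kcmd` is found on PATH.
    return shutil.which("kcmd", path=env.get("PATH"))
```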

## Verify installation

<Steps>
<Step title="Confirm gcloud ADC">

```bash
gcloud auth application-default print-access-token | head -c 20
gcloud config get-value project
gcloud config get-value compute/region
```

Each command should return a non-empty value.

</Step>

<Step title="Confirm kcmd binary">

```bash
agents/mdcode/dist/kcmd --help
# or, if on PATH:
kcmd --help
```

</Step>

<Step title="Confirm Python agent imports">

```bash
source ~/.venv/kc-enrich/bin/activate
export PYTHONPATH=agents/enrichment/src
python3 -c "import engine; print('ok')"
```

</Step>

<Step title="Confirm OKF agent CLI">

```bash
cd okf && .venv/bin/enrichment-agent --help
```

</Step>

<Step title="Confirm toolbox binaries (if built)">

```bash
toolbox/enrichment/dist/kcagent --help
toolbox/mdcode/dist/kcmd --help
```

</Step>
</Steps>

<Check>
A successful install returns help text from each built binary and a non-empty ADC token. The catalog enrichment agent additionally requires `--project` and `--model` at run time; missing values raise `UsageError` before any enrichment work starts.
</Check>

## Environment variable reference

| Variable | Set by | Purpose |
| --- | --- | --- |
| `GOOGLE_GENAI_USE_VERTEXAI` | User or agent (`agent_runner.py`, `env.sh`) | Route `google-genai` calls through Vertex AI |
| `GOOGLE_CLOUD_PROJECT` | User, flags, or `env.sh` | Vertex project and genai client project |
| `GOOGLE_CLOUD_LOCATION` | User or `--location` flag | Vertex region (default `global` in catalog agent) |
| `GEMINI_API_KEY` | User | AI Studio authentication for OKF agent |
| `KCMD_BIN` | User | Explicit path to `kcmd` binary |
| `GITHUB_PERSONAL_ACCESS_TOKEN` | User | GitHub MCP server PAT for `--repo` |
| `GCP_LOG` | User | Enable verbose HTTP logging in `kcmd` `ApiContext` |
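
A preflight check can catch conflicting model settings before an agent run. This is a minimal sketch under stated assumptions: `detect_backend` is a hypothetical helper, and treating `true` or `1` (case-insensitive) as enabled is a guess at how the flag is interpreted, not documented `google-genai` behavior.

```python
from typing import Mapping


def detect_backend(env: Mapping[str, str]) -> str:
    """Return "vertex" or "gemini_api_key" based on the variables in the table."""
    # Assumption: "true"/"1" in any letter case means Vertex mode is on.
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "").strip().lower() in ("true", "1")
    has_key = bool(env.get("GEMINI_API_KEY"))
    if use_vertex and has_key:
        # This page advises against combining the two settings.
        raise ValueError("unset GOOGLE_GENAI_USE_VERTEXAI when using GEMINI_API_KEY")
    if use_vertex:
        if not env.get("GOOGLE_CLOUD_PROJECT"):
            raise ValueError("Vertex AI mode needs GOOGLE_CLOUD_PROJECT")
        return "vertex"
    if has_key:
        return "gemini_api_key"
    raise ValueError("no model backend configured")
```

Pass `os.environ` to check the current shell, or a plain dict to test a planned configuration.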

## Next

<CardGroup>
<Card title="Quickstart" href="/quickstart">
First successful runs: initialize a kcmd workspace, produce an OKF bundle, or run the catalog enrichment agent.
</Card>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
Commands, init flags per source type, pull/push options, and ADC authentication behavior.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth, billing, push conflict, and model credential failures with verification signals.
</Card>
</CardGroup>

---

## 03. Quickstart

> First successful runs: initialize a kcmd workspace and pull metadata, produce an OKF bundle from BigQuery, or run the catalog enrichment agent and inspect output.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/03-quickstart.md
- Generated: 2026-06-15T02:52:33.637Z

### Source Files

- `agents/mdcode/README.md`
- `okf/README.md`
- `okf/src/enrichment_agent/cli.py`
- `agents/enrichment/src/agent_runner.py`
- `toolbox/mdcode/README.md`
- `samples/enrichment/README.md`

---
title: "Quickstart"
description: "First successful runs: initialize a kcmd workspace and pull metadata, produce an OKF bundle from BigQuery, or run the catalog enrichment agent and inspect output."
---

Knowledge Catalog tooling in this repository exposes three independent first-run surfaces: the `kcmd` CLI for Metadata as Code sync (`agents/mdcode`), the OKF enrichment agent for vendor-neutral markdown bundles (`okf/`), and the catalog enrichment agent that emits mdcode artifacts for `kcmd push` (`agents/enrichment/`). Each path below completes with inspectable filesystem output and a concrete verification command.

<Note>
Complete prerequisite setup — Node.js, Python, package installs, and credential configuration — is documented on the [Installation](/installation) page. This page assumes Application Default Credentials via `gcloud auth application-default login`.
</Note>

## Choose a path

| Path | CLI entrypoint | Primary output | Typical next step |
|------|----------------|----------------|-------------------|
| Sync catalog metadata | `kcmd` | `catalog.yaml` + `catalog/` YAML or Markdown entries | Edit locally, then `kcmd push` |
| Produce an OKF bundle | `python -m enrichment_agent enrich` | Directory of OKF concept `.md` files + `index.md` | `visualize` or version in git |
| Run catalog enrichment | `python3 agents/enrichment/src/agent_runner.py` | mdcode tree with overview sidecars + `trajectory.json` | Review, then `kcmd push` |

```mermaid
flowchart LR
  subgraph kcmd_path ["kcmd workspace"]
    init["kcmd init"]
    pull["kcmd pull"]
    catalog["catalog/ entries"]
    init --> pull --> catalog
  end
  subgraph okf_path ["OKF enrichment"]
    enrich["enrichment_agent enrich"]
    bundle["OKF bundle/"]
    enrich --> bundle
  end
  subgraph agent_path ["Catalog enrichment agent"]
    runner["agent_runner.py"]
    mdcode["mdcode output_dir/"]
    runner --> mdcode
  end
  catalog --> push["kcmd push"]
  mdcode --> push
```

## Shared prerequisites

<Steps>
<Step title="Authenticate to Google Cloud">

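```bash
gcloud auth application-default login
gcloud config set project <your-gcp-project-id>
gcloud config set compute/region <region>
```

`kcmd` and the BigQuery-backed agents use Application Default Credentials. `kcmd` also requires a configured compute region and fails fast without one. Table-mode enrichment additionally requires Vertex AI access via `--project` and `--model`.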

</Step>
<Step title="Build kcmd (required for paths 1 and 3)">

```bash
cd agents/mdcode
npm install
npm run build
export PATH="$(pwd)/dist:$PATH"
which kcmd
```

The catalog enrichment agent shells out to `agents/mdcode/dist/kcmd` automatically; adding `dist/` to `PATH` lets you run `kcmd push` from any output directory.

</Step>
</Steps>

---

## Path 1: Initialize a kcmd workspace and pull metadata

`kcmd init` scaffolds `catalog.yaml` and selects a workspace mode. `kcmd pull` downloads editable metadata into `catalog/`.

<Steps>
<Step title="Create a workspace directory">

```bash
mkdir -p ~/kc-workspace && cd ~/kc-workspace
```

</Step>
<Step title="Initialize for a BigQuery dataset">

<ParamField body="--bigquery-dataset" type="string" required>
Dataset identifier as `project-id.dataset-id`. Repeat the flag to include multiple datasets in one workspace.
</ParamField>

```bash
kcmd init --bigquery-dataset <project-id>.<dataset-id>
```

Other init modes: `--kb` (Markdown knowledge base), `--entry-group`, `--biglake-namespace` (with `--iceberg`), or `--glossary`.

</Step>
<Step title="Pull a metadata snapshot">

```bash
kcmd pull
```

Pull writes entry files under `catalog/` — `.yaml` for data assets or `.md` for knowledge-base mode — plus optional `.ref.yaml` reference layers when declared in the manifest.

</Step>
<Step title="Verify the snapshot">

```bash
kcmd status
ls -R catalog/
```

<Check>
Success signals: `catalog.yaml` exists at the workspace root; `catalog/bigquery/<project>/<dataset>/` contains one `.yaml` file per table or view; `kcmd status` reports the local snapshot state without auth errors.
</Check>

</Step>
</Steps>

<RequestExample>

```bash title="BigQuery workspace init and pull"
mkdir -p ~/kc-bq-demo && cd ~/kc-bq-demo
kcmd init --bigquery-dataset my-project.analytics
kcmd pull
kcmd status
```

</RequestExample>

<ResponseExample>

```text title="Expected layout after pull"
catalog.yaml
catalog/
└── bigquery/
    └── my-project/
        └── analytics/
            ├── my-project.analytics.yaml
            └── analytics/
                ├── orders.yaml
                └── customers.yaml
```

</ResponseExample>

---

## Path 2: Produce an OKF bundle from BigQuery

The OKF enrichment agent (`enrichment_agent`) runs a BigQuery pass that writes one OKF concept document per advertised concept, then an optional web pass that enriches from seed URLs.

<Steps>
<Step title="Install the OKF agent">

```bash
cd okf
python3 -m venv .venv
.venv/bin/pip install --index-url https://pypi.org/simple/ -e '.[dev]'
```

</Step>
<Step title="Configure model credentials">

<Tabs>
<Tab title="Vertex AI">

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<your-gcp-project-id>
export GOOGLE_CLOUD_LOCATION=<region>
```

</Tab>
<Tab title="AI Studio">

```bash
export GEMINI_API_KEY=<your-api-key>
```

</Tab>
</Tabs>

BigQuery reads public datasets with ADC; query bytes bill to your configured project.

</Step>
<Step title="Run enrichment against a public dataset">

<CodeGroup>

```bash title="BQ-only (fastest first run)"
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
  --no-web \
  --out ./bundles/my-first-bundle
```

```bash title="BQ + web pass (seeded docs)"
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
  --web-seed-file samples/ga4_merch_store/seeds.txt \
  --out ./bundles/ga4_merch_store
```

</CodeGroup>

<ParamField body="--source" type="string" required>
Source adapter. Currently `bq` (BigQuery).
</ParamField>

<ParamField body="--dataset" type="string" required>
BigQuery dataset as `project.dataset`.
</ParamField>

<ParamField body="--out" type="path" required>
Bundle root directory to create or update.
</ParamField>

<ParamField body="--no-web" type="flag">
Skip the web crawl pass entirely.
</ParamField>

<ParamField body="--concept" type="string">
Enrich a single concept id (e.g. `tables/events_`). Repeatable.
</ParamField>

</Step>
<Step title="Inspect the bundle">

```bash
find ./bundles/my-first-bundle -name '*.md' | head -20
cat ./bundles/my-first-bundle/index.md
```

<Check>
Success signals: stderr reports `Enriched N concept(s) into <out>`; the bundle contains `index.md` at each directory level, concept files under paths like `datasets/` and `tables/`, and YAML frontmatter with `type`, `title`, and `resource` fields on each concept.
</Check>

</Step>
<Step title="Generate an interactive graph viewer (optional)">

```bash
.venv/bin/python -m enrichment_agent visualize \
  --bundle ./bundles/my-first-bundle
# macOS; on Linux use xdg-open instead
open ./bundles/my-first-bundle/viz.html
```

</Step>
</Steps>

:::files
path/to/bundle/
├── index.md
├── datasets/
│   └── ga4_obfuscated_sample_ecommerce.md
├── tables/
│   ├── index.md
│   └── events_.md
└── references/          # present when web pass runs
    └── metrics/
        └── event_count.md
:::

---

## Path 3: Run the catalog enrichment agent and inspect output

`agent_runner.py` dispatches to `table`, `doc`, or `context_overlay` modes. For a first run without Google Drive, use **table mode** with a local Markdown corpus and a BigQuery dataset the agent discovers via `kcmd init` + `kcmd pull`.

<Steps>
<Step title="Install Python dependencies">

```bash
python3 -m venv ~/.venv/kc-enrich
source ~/.venv/kc-enrich/bin/activate
pip install -r agents/enrichment/src/requirements.txt
export PYTHONPATH=agents/enrichment/src
```

</Step>
<Step title="Run table-mode enrichment">

<ParamField body="--mode" type="enum">
`table`, `doc`, or `context_overlay`. Omit to infer: `--dataset` present implies `table`.
</ParamField>

<ParamField body="--project" type="string" required>
GCP project hosting the Vertex AI model.
</ParamField>

<ParamField body="--model" type="string" required>
Vertex model id, e.g. `gemini-2.5-pro`.
</ParamField>

<ParamField body="--dataset" type="string" required>
BigQuery dataset as `project.dataset`.
</ParamField>

<ParamField body="--output_dir" type="path" required>
Local directory for the generated mdcode tree.
</ParamField>

<ParamField body="--folders" type="string">
Comma-separated Google Drive folder URLs/IDs and/or local directories of `.md` files used as grounding context.
</ParamField>

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=<project>.<dataset> \
  --folders=agents/enrichment/eval/corpora/thelook_ecommerce \
  --topic="E-commerce analytics metadata" \
  --project=<your-gcp-project> \
  --location=us-central1 \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

The agent runs read-only `kcmd init` and `kcmd pull` internally, then writes `<table>.overview.md` sidecars next to pulled entry YAML files.

</Step>
<Step title="Inspect generated artifacts">

```bash
find /tmp/enrich_out -type f | sort
head -40 /tmp/enrich_out/trajectory.json
ls /tmp/enrich_out/catalog/
```

<Check>
Success signals: `catalog.yaml` and `catalog/<project>.<dataset>/` exist; each enriched table has a `<table>.yaml` entry and a `<table>.overview.md` sidecar; `trajectory.json` records tool calls (`read_local_md`, `fetch_gdoc`, etc.) for downstream evaluation.
</Check>

</Step>
<Step title="Review a table overview">

```bash
# Replace with an actual table name from your dataset
cat /tmp/enrich_out/catalog/<project>.<dataset>/<table>.overview.md
```

Overview sidecars carry the enriched prose; entry YAML retains the pulled schema as the source of truth.

</Step>
<Step title="Optional interactive refinement">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=<project>.<dataset> \
  --folders=agents/enrichment/eval/corpora/thelook_ecommerce \
  --project=<your-gcp-project> \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out \
  --interactive
```

The `refine>` REPL reuses loaded context without re-pulling the dataset.

</Step>
</Steps>
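
The sidecar success signal can also be checked programmatically. The following sketch assumes only what this page states: entry files end in `.yaml`, reference layers end in `.ref.yaml`, and each enriched table gets a `<table>.overview.md` sibling. `tables_missing_overview` is a hypothetical helper; any dataset-level entry YAML under `catalog/` would be reported too.

```python
from pathlib import Path
from typing import List


def tables_missing_overview(output_dir: Path) -> List[Path]:
    """Return entry YAML files under catalog/ that have no .overview.md sibling."""
    missing = []
    for entry in sorted((output_dir / "catalog").rglob("*.yaml")):
        if entry.name.endswith(".ref.yaml"):
            continue  # reference layers are not table entries
        sidecar = entry.with_name(entry.stem + ".overview.md")
        if not sidecar.exists():
            missing.append(entry)
    return missing
```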

<ResponseExample>

```text title="Table-mode log excerpt"
[kcmd] 🔎 Discovering my-project.analytics via kcmd init + pull ...
[kcmd] OK: ...
[kcmd] 📑 orders (12 cols)
[kcmd] 📑 customers (8 cols)
```

</ResponseExample>

<Warning>
The enrichment agent generates mdcode and runs read-only `kcmd` commands only. Publishing enriched metadata to Knowledge Catalog is a separate `kcmd push` step from `--output_dir`.
</Warning>

---

## Compare outputs

| Artifact | kcmd pull | OKF bundle | Catalog enrichment agent |
|----------|-----------|------------|--------------------------|
| Manifest | `catalog.yaml` | — | `catalog.yaml` (from `kcmd init`) |
| Entry format | YAML (`.yaml`) or Markdown (`.md`) | OKF concept `.md` + frontmatter | YAML entry + `.overview.md` sidecar |
| Schema source | Pulled from catalog | Embedded in concept body | Pulled via `kcmd pull` (not rewritten) |
| Run log | CLI stdout | stderr summary line | `trajectory.json` |
| Publish path | `kcmd push` | Exchange as files; no `kcmd` step | `kcmd push` from `--output_dir` |

OKF bundles are vendor-neutral and portable across tools. mdcode output from the catalog enrichment agent is designed for direct `kcmd push` into Knowledge Catalog.

---

## Quick troubleshooting

<AccordionGroup>
<Accordion title="kcmd pull returns auth or permission errors">

Re-run `gcloud auth application-default login` and confirm `gcloud config get-value project` matches the dataset project. Verify Dataplex/Knowledge Catalog API access for the target project.

</Accordion>
<Accordion title="OKF enrich exits on missing --dataset">

`--dataset` is required when `--source bq`. Use the fully qualified form `project.dataset`.

</Accordion>
<Accordion title="Catalog enrichment agent reports no tables pulled">

Confirm `--dataset` uses `project.dataset` format, ADC is valid, and the dataset has readable `@bigquery` catalog entries. Check `[kcmd]` log lines for the underlying `kcmd init` + `kcmd pull` result.

</Accordion>
<Accordion title="agent_runner.py requires --project and --model">

Both flags are mandatory in every mode. The agent configures Vertex AI from `--project`, `--location` (default `global`), and `--model`.

</Accordion>
</AccordionGroup>

## Next

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node.js setup, package installs, and credential configuration.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Deeper kcmd pull/push workflows, reference layers, and glossary scope.
</Card>
<Card title="Produce OKF bundles" href="/produce-okf-bundles">
Two-pass BQ-then-web enrichment, concept scoping, and web crawl constraints.
</Card>
<Card title="Run catalog enrichment agent" href="/run-catalog-enrichment-agent">
All three modes, Drive and GitHub inputs, glossary linking, and refinement.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces with `kcmd push` and reconcile entry links.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth, billing, push conflict, and model credential failures.
</Card>
</CardGroup>

---

## 04. Open Knowledge Format

> OKF v0.1 bundle structure, concept documents, frontmatter fields, index.md progressive disclosure, and cross-link semantics for vendor-neutral knowledge exchange.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/04-open-knowledge-format.md
- Generated: 2026-06-15T02:52:39.022Z

### Source Files

- `okf/SPEC.md`
- `okf/src/enrichment_agent/bundle/document.py`
- `okf/src/enrichment_agent/bundle/index.py`
- `okf/src/enrichment_agent/bundle/paths.py`
- `okf/bundles/stackoverflow/index.md`
- `okf/README.md`

---
title: Open Knowledge Format
description: OKF v0.1 bundle structure, concept documents, frontmatter fields, index.md progressive disclosure, and cross-link semantics for vendor-neutral knowledge exchange.
---

Open Knowledge Format (OKF) v0.1 is a vendor-neutral way to ship catalog knowledge as a directory of UTF-8 Markdown files with YAML frontmatter. A bundle is self-describing: humans can read it with ordinary file tools, agents can parse it without a proprietary SDK, and version control can diff it like source code. OKF standardizes only the structural conventions needed for interoperability; producers remain free to organize domains, extend frontmatter, and choose tooling.

## What OKF is for

OKF targets four goals:

1. Give enrichment agents a universal write target.
2. Give consumption agents predictable traversal rules.
3. Enable knowledge exchange across organizations and systems.
4. Require only a small set of fields so partial or agent-generated bundles stay useful.

OKF is **not** a fixed taxonomy, storage layer, or replacement for domain schemas such as Avro, Protobuf, or OpenAPI. It **references** those assets through concept documents and external citations.

## Bundle structure

A **knowledge bundle** is a directory tree of `.md` files. Directory layout is producer-defined; folders group related concepts but do not encode relationship types.

:::files
path/to/bundle/
├── index.md                      # Optional directory listing (progressive disclosure)
├── log.md                        # Optional update history
├── <concept>.md                  # Concept at bundle root
└── <subdirectory>/
    ├── index.md
    ├── <concept>.md
    └── <nested>/
        └── …
:::

Bundles may be distributed as:

- A git repository (recommended — history, attribution, diffs).
- A tarball or zip archive.
- A subdirectory inside a larger repository.

### Reserved filenames

These filenames have defined meaning at any hierarchy level and **must not** be used for concept documents:

| Filename | Purpose |
|----------|---------|
| `index.md` | Directory listing for progressive disclosure |
| `log.md` | Chronological update history for that scope |

Every other `.md` file is a concept document.

## Concept documents

Each **concept** is one Markdown file with two parts:

1. A YAML **frontmatter** block delimited by `---` on its own lines at the top of the file.
2. A Markdown **body** with free-form content.

The **concept ID** is the file path within the bundle with the `.md` suffix removed. For example, `tables/users.md` has concept ID `tables/users`.

Each concept ID segment must match `[A-Za-z0-9_][A-Za-z0-9_.\-]*`. Tools map IDs to paths with `concept_id_to_path` and back with `path_to_concept_id`.
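
The ID rules above are small enough to sketch. The functions below reuse the names `concept_id_to_path` and `path_to_concept_id` for readability, but the bodies are an illustrative reconstruction from the segment pattern and the reserved filenames, not the code in `okf/src/enrichment_agent/bundle/paths.py`; real signatures may differ. `is_valid_concept_id` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

# Segment pattern quoted from this page.
SEGMENT = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.\-]*")
RESERVED = {"index.md", "log.md"}


def is_valid_concept_id(concept_id: str) -> bool:
    """True when every '/'-separated segment matches the pattern."""
    return all(SEGMENT.fullmatch(seg) for seg in concept_id.split("/"))


def concept_id_to_path(concept_id: str) -> PurePosixPath:
    """Map a concept ID to its bundle-relative file path."""
    if not is_valid_concept_id(concept_id):
        raise ValueError(f"invalid concept ID: {concept_id!r}")
    return PurePosixPath(concept_id + ".md")


def path_to_concept_id(path: str) -> str:
    """Map a bundle-relative .md path back to its concept ID."""
    p = PurePosixPath(path)
    if p.suffix != ".md" or p.name in RESERVED:
        raise ValueError(f"not a concept document: {path!r}")
    return str(p.with_suffix(""))
```

Because a segment must start with a letter, digit, or underscore, `..` and empty segments are rejected, which keeps IDs from escaping the bundle root.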

### Frontmatter fields

OKF v0.1 conformance requires only a non-empty `type` field. The Knowledge Catalog enrichment agent enforces a stricter write contract for generated bundles.

<ParamField body="type" type="string" required>
Short string identifying the concept kind. Examples: `BigQuery Table`, `BigQuery Dataset`, `Reference`, `Playbook`. Types are not centrally registered; consumers must tolerate unknown values.
</ParamField>

<ParamField body="title" type="string">
Human-readable display name. If omitted, consumers may derive a title from the filename. Required by the enrichment agent's `write_concept_doc` tool.
</ParamField>

<ParamField body="description" type="string">
One-sentence summary used in `index.md` entries, search snippets, and previews. Required by the enrichment agent.
</ParamField>

<ParamField body="resource" type="string">
Canonical URI for the underlying asset (for example, a BigQuery table API URL or console link). Omit for abstract concepts.
</ParamField>

<ParamField body="tags" type="string[]">
YAML list of short categorization strings. Tag-browsing views can be synthesized at consumption time by scanning frontmatter; OKF does not define a separate tag file format.
</ParamField>

<ParamField body="timestamp" type="string">
ISO 8601 datetime of the last meaningful change. The enrichment agent auto-fills UTC time when omitted.
</ParamField>

Producers may add arbitrary extension keys. Consumers should preserve unknown keys on round-trip and must not reject documents because of unrecognized fields.

**Enrichment agent key order.** When writing through `write_concept_doc`, frontmatter is reordered to: `type`, `resource`, `title`, `description`, `tags`, `timestamp`, then any extensions.
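
The key order can be expressed as a small reordering step. This sketch is not the `write_concept_doc` implementation; `reorder_frontmatter` is a hypothetical helper, and keeping extension keys in their original relative order is an assumption.

```python
from typing import Any, Dict

# Order quoted from this page; extension keys follow.
CANONICAL_ORDER = ("type", "resource", "title", "description", "tags", "timestamp")


def reorder_frontmatter(frontmatter: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with canonical keys first, then extension keys."""
    ordered = {key: frontmatter[key] for key in CANONICAL_ORDER if key in frontmatter}
    for key, value in frontmatter.items():
        if key not in ordered:
            ordered[key] = value  # assumed: extensions keep their input order
    return ordered
```

Python dicts preserve insertion order, so serializing the result with key sorting disabled keeps the layout.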

### Body conventions

The body has no required sections. These headings carry conventional meaning:

| Heading | Purpose |
|---------|---------|
| `# Schema` | Structured description of columns, fields, or enumerations |
| `# Examples` | Concrete usage examples, often fenced code blocks |
| `# Common query patterns` | SQL or API usage patterns (enrichment agent convention) |
| `# Citations` | External sources backing claims in the body |

The enrichment agent expects, in order: short prose, `# Schema`, `# Common query patterns` (for tables), and `# Citations`. During the web enrichment pass, writes that shrink an existing BigQuery Table's `# Schema` field set or `# Citations` entry count are rejected to preserve metadata-grounded content.

<RequestExample>
````markdown
---
type: BigQuery Table
title: Users
description: One row per registered Stack Overflow user.
resource: https://bigquery.googleapis.com/v2/projects/bigquery-public-data/datasets/stackoverflow/tables/users
tags: [Stack Overflow, users, profiles]
timestamp: 2026-05-28T23:32:24+00:00
---

This table stores user profiles for the [stackoverflow](../datasets/stackoverflow.md) dataset.

# Schema

* `id` (INTEGER) - Unique identifier for the user.
* `display_name` (STRING) - Publicly visible name.

# Common query patterns

```sql
SELECT id, display_name, reputation
FROM `bigquery-public-data.stackoverflow.users`
ORDER BY reputation DESC
LIMIT 10
```

# Citations

[1] [Stack Overflow Users Table](https://bigquery.googleapis.com/v2/projects/bigquery-public-data/datasets/stackoverflow/tables/users)
````
</RequestExample>
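
The conventional section order (`# Schema`, then `# Common query patterns`, then `# Citations`) can be checked with a short scan of the body. This is a simplified sketch: `top_level_headings` and `follows_expected_order` are hypothetical helpers, and the fence handling only toggles on lines that start with three backticks, which is enough to ignore `#` comments inside SQL examples.

```python
from typing import List

EXPECTED_ORDER = ["Schema", "Common query patterns", "Citations"]


def top_level_headings(body: str) -> List[str]:
    """Return H1 heading texts, ignoring lines inside fenced code blocks."""
    headings, in_fence = [], False
    for line in body.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if not in_fence and line.startswith("# "):
            headings.append(line[2:].strip())
    return headings


def follows_expected_order(body: str) -> bool:
    """True when the conventional headings that are present appear in order."""
    present = [h for h in top_level_headings(body) if h in EXPECTED_ORDER]
    return present == [h for h in EXPECTED_ORDER if h in present]
```

Missing sections do not fail the check, in line with the rule that the body has no required sections.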

## `index.md` and progressive disclosure

An `index.md` may appear in any directory, including the bundle root. Index files contain **no frontmatter** (except optionally at bundle root for version declaration — see Versioning). The body lists directory contents under section headings so humans and agents can browse one level at a time instead of loading the entire corpus.

```markdown
# BigQuery Table

* [Users](users.md) - One row per registered Stack Overflow user.
* [Votes](votes.md) - Records of upvotes and downvotes on posts.

# Subdirectories

* [references](references/index.md) - Enumerated types and internal references.
```

Entries should include each linked concept's `description` from frontmatter. The enrichment agent's `regenerate_indexes` groups concepts by `type`, sorts entries alphabetically by title, and synthesizes subdirectory blurbs when a folder has multiple children. Single-child directories reuse the child's description.

`index.md` files are navigation aids, not concepts. Graph viewers and concept walkers skip them.
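
The grouping and sorting behavior can be illustrated with a small renderer. This is not `regenerate_indexes`; `render_index` is a hypothetical helper that covers only the concept sections. Ordering the sections by type name and sorting titles case-insensitively are assumptions, and subdirectory blurbs are omitted.

```python
from collections import defaultdict
from typing import Dict, List


def render_index(concepts: List[Dict[str, str]]) -> str:
    """Render index.md text from dicts with type, title, description, and path keys."""
    by_type = defaultdict(list)
    for concept in concepts:
        by_type[concept["type"]].append(concept)
    sections = []
    for type_name in sorted(by_type):  # assumed section order
        lines = [f"# {type_name}", ""]
        for c in sorted(by_type[type_name], key=lambda c: c["title"].lower()):
            lines.append(f"* [{c['title']}]({c['path']}) - {c['description']}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"
```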

<Steps>
<Step title="Open the bundle root index">
Read `index.md` at the bundle root to see top-level subdirectories and any root-level concepts.
</Step>
<Step title="Drill into a section">
Follow a subdirectory link such as `tables/index.md` to see concepts grouped by type.
</Step>
<Step title="Open a concept document">
Follow a concept link to load frontmatter metadata and the full body.
</Step>
</Steps>

## Cross-linking

Concepts express relationships beyond parent/child directory structure with standard Markdown links. The relationship kind (joins-with, depends-on, parent-of, and so on) is conveyed by surrounding prose, not by link syntax. Graph consumers typically treat links as directed, untyped edges.

### Link forms

| Form | Example | Notes |
|------|---------|-------|
| Bundle-relative absolute | `[customers](/tables/customers.md)` | SPEC-recommended; stable when moving documents within a subdirectory |
| File-relative | `[users](users.md)` from `tables/events.md` | Resolves correctly when browsing plain files (GitHub, local filesystem) |
| Parent traversal | `[dataset](../datasets/stackoverflow.md)` | Typical pattern from a table to its dataset |

OKF consumers **must tolerate broken links**. A missing target is not malformed; it may represent knowledge not yet authored.

### Producer and consumer guidance

The OKF specification recommends bundle-relative absolute paths starting with `/`. The enrichment agent instructs producers to use **file-relative paths only** and avoid leading `/` so links render correctly on GitHub. The bundled graph viewer extracts edges only from relative `.md` links resolved within the bundle; absolute `/…` links and external URLs are skipped for edge construction but still work as navigation in rendered Markdown.

Rules enforced by the enrichment agent when writing:

- Link only to concept IDs returned by `list_concepts()`.
- Do not link from headers, fenced code blocks, or schema field listings.
- Do not self-link.
- One link per concept mention per section is sufficient.
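
The edge-construction behavior described for the graph viewer (relative `.md` links only, resolved within the bundle) can be sketched as follows. `extract_edges` is a hypothetical helper, not the bundled viewer's code; the link regex is deliberately simple, and dropping self-links and duplicate targets is an assumption.

```python
import posixpath
import re
from typing import List

LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
RESERVED = {"index.md", "log.md"}


def extract_edges(source_id: str, body: str) -> List[str]:
    """Return target concept IDs for file-relative .md links in body."""
    base = posixpath.dirname(source_id)
    targets, in_fence = [], False
    for line in body.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # links inside code blocks are not edges
        for href in LINK.findall(line):
            href = href.split("#", 1)[0]
            if "://" in href or href.startswith("/") or not href.endswith(".md"):
                continue  # external URL, bundle-absolute link, or non-concept target
            resolved = posixpath.normpath(posixpath.join(base, href))
            if resolved.startswith("..") or posixpath.basename(resolved) in RESERVED:
                continue  # escapes the bundle root, or points at index.md/log.md
            target = resolved[: -len(".md")]
            if target != source_id and target not in targets:
                targets.append(target)
    return targets
```

Targets are returned whether or not the file exists, which matches the rule that consumers must tolerate broken links.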

## Citations

External claims should be listed under `# Citations` at the bottom of the document, numbered:

```markdown
# Citations

[1] [BigQuery table schema](https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders)
[2] [Internal data quality runbook](https://wiki.acme.internal/data/quality)
```

Citation targets may be absolute URLs, bundle-relative paths, or paths into a `references/` subtree that mirrors external material as first-class OKF concepts.
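
A consumer that wants citations as structured data can scan the `# Citations` section line by line. This is a small sketch under the assumption that every entry follows the `[n] [title](target)` shape shown above; `parse_citations` is a hypothetical name, and entries in any other shape are silently skipped.

```python
import re

# One citation per line: [n] [title](target)
CITATION_RE = re.compile(r"^\[(\d+)\]\s+\[([^\]]+)\]\(([^)]+)\)\s*$")


def parse_citations(markdown: str) -> dict[int, tuple[str, str]]:
    """Return {number: (title, target)} for entries under the '# Citations' heading."""
    citations: dict[int, tuple[str, str]] = {}
    in_section = False
    for line in markdown.splitlines():
        if line.startswith("# "):
            # Any other top-level heading ends the citations section.
            in_section = line.strip() == "# Citations"
            continue
        if in_section:
            match = CITATION_RE.match(line.strip())
            if match:
                citations[int(match.group(1))] = (match.group(2), match.group(3))
    return citations
```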

## `log.md` (optional)

A `log.md` at any hierarchy level records changes for that scope. Format is a flat list grouped by date, newest first:

```markdown
# Directory Update Log

## 2026-05-22
* **Update**: Added [Customer Metrics](/tables/customer-metrics.md).
* **Creation**: Established [Dataplex Playbook](/playbooks/dataplex.md).

## 2026-05-15
* **Initialization**: Created foundational directory structure.
```

Date headings use ISO 8601 `YYYY-MM-DD`. Leading bold labels (`**Update**`, `**Creation**`, `**Deprecation**`) are conventions, not requirements.
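
The date-grouped structure is easy to read mechanically. The sketch below parses a `log.md` into dated groups and checks the newest-first ordering; `parse_log` and `is_newest_first` are hypothetical helpers, and the choice to raise `ValueError` on a non-ISO heading is an assumption of this example rather than documented consumer behavior.

```python
from datetime import date


def parse_log(markdown: str) -> list[tuple[date, list[str]]]:
    """Return [(day, [entry, ...]), ...] in file order."""
    groups: list[tuple[date, list[str]]] = []
    for line in markdown.splitlines():
        if line.startswith("## "):
            # date.fromisoformat raises ValueError for anything that is not YYYY-MM-DD.
            groups.append((date.fromisoformat(line[3:].strip()), []))
        elif line.startswith(("* ", "- ")) and groups:
            groups[-1][1].append(line[2:].strip())
    return groups


def is_newest_first(groups: list[tuple[date, list[str]]]) -> bool:
    """True when date headings appear in descending order."""
    days = [day for day, _entries in groups]
    return days == sorted(days, reverse=True)
```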

## Conformance

A bundle is **conformant with OKF v0.1** when:

1. Every non-reserved `.md` file has parseable YAML frontmatter.
2. Every frontmatter block contains a non-empty `type` field.
3. Every present `index.md` or `log.md` follows the structures described above.

Consumers should treat all other constraints as soft guidance. Consumers **must not** reject a bundle because of:

- Missing optional frontmatter fields
- Unknown `type` values or extension keys
- Broken cross-links
- Missing `index.md` files

This permissive model keeps bundles useful as they grow, refactor, and are partially generated by agents.
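
The first two conformance rules can be approximated with the standard library alone. This sketch is deliberately naive: it looks for a `---` delimited block and a top-level `type:` line instead of running a real YAML parser, and it does not check rule 3 (the structure of `index.md` and `log.md`). `conformance_errors` and its helpers are hypothetical names.

```python
from __future__ import annotations

import posixpath

RESERVED = {"index.md", "log.md"}


def frontmatter_block(text: str) -> str | None:
    """Return the text between the opening and closing '---' lines, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening delimiter without a close


def has_nonempty_type(block: str) -> bool:
    """Naive check for a top-level 'type: <value>' line (not a YAML parse)."""
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep and key == "type" and value.strip().strip("\"'"):
            return True
    return False


def conformance_errors(files: dict[str, str]) -> list[str]:
    """files maps bundle-relative paths to file contents."""
    errors = []
    for path, text in sorted(files.items()):
        if not path.endswith(".md") or posixpath.basename(path) in RESERVED:
            continue
        block = frontmatter_block(text)
        if block is None:
            errors.append(f"{path}: missing frontmatter")
        elif not has_nonempty_type(block):
            errors.append(f"{path}: missing or empty type")
    return errors
```

Note what the checker leaves out: unknown `type` values, missing optional fields, broken links, and absent `index.md` files are never reported, in line with the must-not-reject list.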

## Versioning

This repository ships OKF **version 0.1**. Future revisions use `<major>.<minor>` semantics: minor bumps add backward-compatible optional fields; major bumps may break required fields or reserved filenames.

Bundles may declare their target version with `okf_version: "0.1"` in **bundle-root `index.md` frontmatter** — the only place frontmatter is permitted on an `index.md`. Consumers that do not understand the declared version should attempt best-effort consumption.

## Example bundle layout

The repository includes three reference bundles under `okf/bundles/`:

| Bundle | Domain |
|--------|--------|
| `ga4/` | GA4 e-commerce sample dataset |
| `stackoverflow/` | Stack Overflow public dataset |
| `crypto_bitcoin/` | Bitcoin blocks and transactions |

A typical Stack Overflow bundle organizes `datasets/`, `tables/`, and `references/` subtrees, each with its own `index.md`, and cross-links such as a table pointing to its parent dataset with `../datasets/stackoverflow.md`.

## Produce, visualize, and publish

OKF bundles in this project are commonly produced by the OKF enrichment agent (BigQuery metadata plus optional web crawl) or the catalog enrichment agent, then optionally visualized or published into a Knowledge Catalog workspace.

<AccordionGroup>
<Accordion title="Enrichment agent write contract vs OKF minimum">
OKF conformance requires only a non-empty `type` field in each concept document's frontmatter. The enrichment agent's `write_concept_doc` requires `type`, `title`, `description`, and `timestamp` (auto-filled when absent). This stricter contract keeps auto-generated `index.md` entries informative and bundles consistent for downstream catalog sync.
</Accordion>
<Accordion title="Graph consumption behavior">
The `visualize` subcommand walks all concept `.md` files, builds nodes from frontmatter, and draws directed edges from relative cross-links. Missing link targets are skipped without error. Backlinks ("Cited by") are computed from the reverse link graph in the generated `viz.html` viewer.
</Accordion>
</AccordionGroup>
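
Backlinks are the reverse of the directed link graph. The following sketch shows the idea on plain tuples; it is not the viewer's actual code, and `backlinks` is a hypothetical function name.

```python
from collections import defaultdict


def backlinks(nodes: set[str], edges: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Return {target: [sources that link to it]} for targets that exist in the bundle."""
    cited_by: dict[str, set[str]] = defaultdict(set)
    for source, target in edges:
        if target not in nodes:
            continue  # missing link targets are skipped without error
        cited_by[target].add(source)
    return {target: sorted(sources) for target, sources in cited_by.items()}
```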

## Related pages

<CardGroup cols={2}>
<Card title="Overview" icon="book-open" href="/overview">
Knowledge Catalog tooling surface and shortest paths to produce and publish metadata context.
</Card>
<Card title="Produce OKF bundles" icon="package" href="/produce-okf-bundles">
Run the OKF enrichment agent against BigQuery with optional web crawl seeds.
</Card>
<Card title="Visualize OKF bundles" icon="network" href="/visualize-okf-bundles">
Generate self-contained `viz.html` graph viewers from bundle cross-links.
</Card>
<Card title="OKF bundle recipes" icon="flask" href="/okf-bundle-recipes">
Copy-paste recipes for GA4, Stack Overflow, and Bitcoin sample bundles.
</Card>
<Card title="Enrichment workflows" icon="workflow" href="/enrichment-workflows">
How agents read source metadata, emit OKF bundles, and hand off to catalog publication.
</Card>
<Card title="Metadata as Code" icon="code" href="/metadata-as-code">
kcmd workspace model for syncing enriched metadata into Knowledge Catalog.
</Card>
</CardGroup>

---

## 05. Metadata as Code

> kcmd workspace model: catalog.yaml manifest, YAML and Markdown layouts, pull/push sync, reference layers, entry links, and glossary scope for Knowledge Catalog metadata.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/05-metadata-as-code.md
- Generated: 2026-06-15T02:53:27.764Z

### Source Files

- `agents/mdcode/README.md`
- `agents/mdcode/docs/concept.md`
- `agents/mdcode/src/libts/manifest.ts`
- `agents/mdcode/src/libts/snapshot.ts`
- `agents/mdcode/src/libts/sync.ts`
- `toolbox/mdcode/docs/concept.md`

---
title: "Metadata as Code"
description: "kcmd workspace model: catalog.yaml manifest, YAML and Markdown layouts, pull/push sync, reference layers, entry links, and glossary scope for Knowledge Catalog metadata."
---

Metadata as Code is the `kcmd` workspace model in `agents/mdcode`: a directory rooted at `catalog.yaml` that mirrors Knowledge Catalog entries, aspects, and entry links as versionable YAML and Markdown files, with `kcmd pull`, `kcmd reference`, and `kcmd push` synchronizing editable snapshots against Dataplex.

## Workspace model

A kcmd workspace is a filesystem directory that acts as the unit of synchronization with Knowledge Catalog. The manifest (`catalog.yaml`) declares which GCP resources to manage, which metadata types to snapshot locally, which subsets to publish, and optional read-only reference scopes. Editable artifacts live under `catalog/`; the layout engine (`CatalogSnapshot` + `CatalogLayout`) maps service resources to local file paths based on scope type.

```mermaid
flowchart TB
  subgraph workspace["kcmd workspace"]
    manifest["catalog.yaml"]
    catalog["catalog/"]
    ref["*.ref.yaml siblings"]
  end

  subgraph kcmd["kcmd CLI / MCP"]
    pull["pull"]
    reference["reference"]
    push["push"]
  end

  subgraph service["Knowledge Catalog / Dataplex"]
    entries["Entries + Aspects"]
    links["EntryLinks"]
    glossary["Glossary hierarchy"]
  end

  manifest --> pull
  manifest --> reference
  manifest --> push
  pull --> catalog
  reference --> ref
  catalog --> push
  pull <--> entries
  pull <--> links
  reference <--> entries
  push --> entries
  push --> links
  push --> glossary
```

<Info>
Authentication uses gcloud Application Default Credentials (`gcloud auth application-default login`). The CLI and MCP server share the same workspace binding.
</Info>

## Scope types and layouts

`kcmd init` requires exactly one primary source type. The init flag writes `scope` into `catalog.yaml` and selects the on-disk layout automatically.

| Source type | Init flag | Scope prefix | Layout | Target resource |
| --- | --- | --- | --- | --- |
| BigQuery | `--bigquery-dataset` | `bq-dataset` | Standard (YAML) | Tables, views, schemas in `@bigquery` |
| Knowledge base | `--kb` | `kb` | Documents (Markdown) | Wiki/doc entries in an Entry Group |
| Entry group | `--entry-group` | `entryGroup` | Standard (YAML) | Custom user-managed entries |
| BigLake (Iceberg) | `--biglake-namespace --iceberg` | `biglake-iceberg-namespace` | Standard (YAML) | Iceberg table metadata |
| Glossary | `--glossary` | `glossary` | Standard (YAML) | Business glossary terms and categories |

BigQuery mode accepts multiple datasets by repeating `--bigquery-dataset` or by declaring an array in `scope`. Glossary mode supports comma-separated IDs, display-name lookup, or location mode (`--glossary my-project.us-central1`) to manage all glossaries in a location.
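
The prefix-to-layout mapping in the table can be expressed as a small lookup. This is an illustrative sketch only: `parse_scope` and `workspace_layout` are hypothetical names, and the real selection logic lives in the TypeScript layout engine.

```python
from __future__ import annotations

# Scope prefix -> on-disk layout, transcribed from the table above.
LAYOUT_BY_PREFIX = {
    "bq-dataset": "standard",
    "kb": "documents",
    "entryGroup": "standard",
    "biglake-iceberg-namespace": "standard",
    "glossary": "standard",
}


def parse_scope(scope: str) -> tuple[str, str, str]:
    """Split '<type>.<resource-id>' into (prefix, resource_id, layout)."""
    prefix, sep, resource_id = scope.partition(".")
    if not sep or not resource_id or prefix not in LAYOUT_BY_PREFIX:
        raise ValueError(f"unrecognized scope: {scope!r}")
    return prefix, resource_id, LAYOUT_BY_PREFIX[prefix]


def workspace_layout(scope: str | list[str]) -> str:
    """Return the layout for a single scope or an array of scopes of one source type."""
    scopes = [scope] if isinstance(scope, str) else list(scope)
    parsed = [parse_scope(item) for item in scopes]
    if len({prefix for prefix, _rid, _layout in parsed}) != 1:
        raise ValueError("a workspace manages exactly one primary source type")
    return parsed[0][2]
```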

### Standard layout (YAML + sidecars)

Used for `bq-dataset`, `entryGroup`, `biglake-*`, and `glossary` scopes. Each entry is a `<entry-id>.yaml` file. Unstructured aspects (for example `overview`) split into sidecar Markdown files named `<entry-id>.<aspect-alias>.md`. Reference baselines are sibling `*.ref.yaml` files.

:::files
/
├── catalog.yaml
└── catalog/
    └── bigquery/
        └── my-project/
            ├── my-dataset.yaml
            └── my-dataset/
                ├── orders.yaml
                ├── orders.ref.yaml
                └── orders.overview.md
:::
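
The naming rules for the standard layout can be summarized in a few lines. The helper below is a hypothetical illustration that builds the file names for one entry; it does not touch the filesystem and makes no claim about how `CatalogLayout` computes the parent directory.

```python
import posixpath


def standard_layout_paths(entry_dir, entry_id, sidecar_aspects=(), with_reference=False):
    """Return the file paths that represent one entry in the standard layout."""
    paths = [posixpath.join(entry_dir, f"{entry_id}.yaml")]
    if with_reference:
        # Read-only baseline written next to the editable entry.
        paths.append(posixpath.join(entry_dir, f"{entry_id}.ref.yaml"))
    for alias in sidecar_aspects:
        # Unstructured aspects split out as <entry-id>.<aspect-alias>.md.
        paths.append(posixpath.join(entry_dir, f"{entry_id}.{alias}.md"))
    return paths
```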

### Documents layout (Markdown-first)

Used for `kb` scopes. Each entry is a single `.md` file: structured metadata in YAML frontmatter, with `overview.content` promoted to the Markdown body.

:::files
/
├── catalog.yaml
└── catalog/
    └── my-namespace/
        └── my-project/
            └── my-location/
                ├── page1.md
                └── playbooks/mbr.md
:::

## catalog.yaml manifest

The manifest drives all sync behavior. `CatalogManifest.load` validates scope, snapshot, publishing, and optional reference blocks.

<ParamField body="scope" type="string | string[]" required>
Primary resource(s) to manage. Format: `<type>.<resource-id>`. Examples: `bq-dataset.my-project.my-dataset`, `kb.my-project.us-central1.my-kb`, `glossary.my-project.global.my-glossary`. Multi-dataset scopes use a YAML array of `bq-dataset.*` entries.
</ParamField>

<ParamField body="resourceAlias" type="object">
Optional alias map for aspect types, glossaries, and entry link types. Built-in Dataplex types already have predefined aliases (`bigquery-table`, `schema`, `overview`, `definition`, `synonym`, `related`, `schema-join`).
</ParamField>

<ParamField body="snapshot" type="object">
Entry, aspect, and entry link types to download locally. Required aspects of listed entry types are implicitly included. `entryLinks` triggers `lookupEntryLinks` on pull.
</ParamField>

<ParamField body="publishing" type="object">
Subset of snapshot types that `kcmd push` writes back. Must be a subset of `snapshot`; publishing types not in snapshot cause validation errors.
</ParamField>

<ParamField body="reference" type="object">
Read-only scope for grounding. `reference.scope` can differ from the primary scope (for example, pull schemas from a dataset while publishing enrichments to another). `reference.snapshot` mirrors `snapshot` structure.
</ParamField>

<RequestExample>

```yaml title="catalog.yaml — BigQuery enrichment workspace"
scope: bq-dataset.my-project.my-dataset

snapshot:
  entries:
    - dataplex-types.global.bigquery-table
  aspects:
    - dataplex-types.global.schema
    - dataplex-types.global.overview
  entryLinks:
    - definition
    - synonym

publishing:
  aspects:
    - dataplex-types.global.overview
  entryLinks:
    - definition

reference:
  scope: bq-dataset.my-project.my-dataset
  snapshot:
    entries:
      - dataplex-types.global.bigquery-table
    aspects:
      - dataplex-types.global.schema
    entryLinks:
      - definition
```

</RequestExample>
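
The subset rule between `publishing` and `snapshot` can be checked before running any command. This is a sketch over an already-parsed manifest (a plain `dict`), not the validation that `CatalogManifest.load` performs; `publishing_errors` is a hypothetical name, and the sketch ignores required aspects that a snapshot includes implicitly.

```python
def publishing_errors(manifest: dict) -> list[str]:
    """Report every publishing type that is not declared in the matching snapshot list."""
    snapshot = manifest.get("snapshot") or {}
    publishing = manifest.get("publishing") or {}
    errors = []
    for section, names in publishing.items():
        allowed = set(snapshot.get(section) or [])
        for name in names or []:
            if name not in allowed:
                errors.append(f"publishing.{section}: {name} is not in snapshot.{section}")
    return errors


# The manifest from the example above, as a Python dict.
manifest = {
    "scope": "bq-dataset.my-project.my-dataset",
    "snapshot": {
        "entries": ["dataplex-types.global.bigquery-table"],
        "aspects": ["dataplex-types.global.schema", "dataplex-types.global.overview"],
        "entryLinks": ["definition", "synonym"],
    },
    "publishing": {
        "aspects": ["dataplex-types.global.overview"],
        "entryLinks": ["definition"],
    },
}
print(publishing_errors(manifest))  # an empty list means the subset rule holds
```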

## Pull, reference, and push

### Pull editable metadata

`kcmd pull` lists entries from the scoped source, calls `lookupEntry` for each entry whose type is declared in `snapshot.entries`, and writes files under `catalog/`. When `snapshot.entryLinks` is declared, pull also calls `lookupEntryLinks` and inlines results into entry YAML.

<Steps>
<Step title="Initialize the workspace">

```bash
kcmd init --bigquery-dataset my-project.my-dataset
```

This writes `catalog.yaml` with the correct `scope` prefix and layout selection.

</Step>
<Step title="Configure snapshot and publishing">

Edit `catalog.yaml` to declare which entry types, aspects, and entry links to manage locally and which to publish.

</Step>
<Step title="Pull metadata">

```bash
kcmd pull
```

Verify `.yaml` or `.md` files appear under `catalog/` matching your scope hierarchy.

</Step>
</Steps>

### Pull reference layers

`kcmd reference` downloads read-only metadata defined in the `reference:` block. Files are saved as `*.ref.yaml` siblings to editable entries. Reference files are indexed separately and marked non-modifiable — `push` skips them via `isModifiable`.

<Warning>
Reference layers are never pushed. Use them as authoritative baselines for enrichment agents; diff live `.yaml` against `.ref.yaml` to surface only your changes.
</Warning>

When `reference.snapshot.entryLinks` is set, reference pull includes pre-edit link state so diffs do not treat existing links as enrichment additions.
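
Diffing an editable entry against its reference baseline needs nothing beyond the standard library. The sketch below uses `difflib.unified_diff` on file contents; `enrichment_diff` is a hypothetical helper, and it assumes the two files were pulled with comparable snapshot settings so that differences reflect local edits.

```python
import difflib


def enrichment_diff(ref_text: str, live_text: str, entry: str) -> str:
    """Return a unified diff from <entry>.ref.yaml to <entry>.yaml ('' when identical)."""
    return "".join(
        difflib.unified_diff(
            ref_text.splitlines(keepends=True),
            live_text.splitlines(keepends=True),
            fromfile=f"{entry}.ref.yaml",
            tofile=f"{entry}.yaml",
        )
    )
```

In a workspace, the two strings would come from reading `orders.ref.yaml` and `orders.yaml` in the same directory.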

### Push local edits

`kcmd push` iterates modifiable entries, converts local metadata to Dataplex API representations, and creates or updates entries and entry links.

| Behavior | Detail |
| --- | --- |
| Auto-create entries | Missing entries and parent Entry Groups are created during push (non-ingested scopes) |
| Aspect filtering | Only aspects listed in `publishing.aspects` are sent; required ingested aspects are skipped |
| Entry link reconciliation | When `publishing.entryLinks` is set, push compares local vs remote links by normalized target + path; matches are kept, new links created, unmatched remote links deleted |
| Glossary tree | `kcmd push` never creates Glossary, GlossaryCategory, or GlossaryTerm resources; it fails fast if they are missing |
| Glossary metadata updates | Descriptions and labels on existing glossary resources can be updated |
| Flags | `--force` overwrites conflicts; `--validate-only` validates without pushing; `--dry-run` logs planned mutations |

<Check>
EntryLinks that reference glossary terms (for example `definition` links from a BQ column to a term) are catalog metadata and are created/deleted normally by push. The no-create rule applies only to the glossary hierarchy itself.
</Check>
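
Entry link reconciliation amounts to set arithmetic over normalized keys. This is a simplified sketch, assuming each link has already been reduced to a `(target, path)` pair with `""` as the path for entry-level links; the normalization itself (proxy unwrapping, project ID handling) is not reproduced here, and `reconcile_links` is a hypothetical name.

```python
def reconcile_links(local, remote):
    """Plan link mutations from normalized (target, path) pairs."""
    local_keys = set(local)
    remote_keys = set(remote)
    return {
        "keep": sorted(local_keys & remote_keys),    # present on both sides
        "create": sorted(local_keys - remote_keys),  # only in the workspace
        "delete": sorted(remote_keys - local_keys),  # only in the catalog
    }
```

Because unmatched remote links land in `delete`, a plan like this is worth inspecting with `--dry-run` before a real push.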

## Entry links

Entry links are first-class artifacts in pull and push. Declare link types in `snapshot.entryLinks` to fetch them; declare a subset in `publishing.entryLinks` to reconcile them on push. Omit `publishing.entryLinks` to read links without mutating them.

**Column-level links** carry a `Schema.<field>` source path. On pull, these are inlined under `aspects.schema.fields[].links`. On push, the path is reconstructed as `Schema.${field.name}`.

**Entry-level links** without a schema path appear under the top-level `links` block.

**Target resolution** uses a human-readable form for glossary terms (`<project>.<location>.<glossary-display-name>.<term-display-name>`) while preserving the full UID resource path in `id` for round-trip push. Matching during reconciliation unwraps `@dataplex` proxy entries and normalizes project IDs to avoid spurious delete-and-recreate cycles.

<RequestExample>

```yaml title="Column-level definition link (excerpt)"
aspects:
  schema:
    fields:
      - name: customer_id
        dataType: STRING
        mode: NULLABLE
        links:
          definition:
            - target: my-project.global.business-glossary.customer-id
              id: projects/my-project/locations/global/glossaries/biz/terms/customer-id

links:
  related:
    - target: my-other-project.us.docs-eg.runbook-page
```

</RequestExample>
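
To see how column-level and entry-level links relate, it helps to flatten an entry into `(source path, link type, target)` triples. The sketch below works on an already-parsed entry (a plain `dict` shaped like the YAML excerpt above); `flatten_links` is a hypothetical helper, and it only walks top-level schema fields, not nested records.

```python
def flatten_links(entry: dict) -> list[tuple[str, str, str]]:
    """Return (source_path, link_type, target); source_path is '' for entry-level links."""
    triples = []
    fields = ((entry.get("aspects") or {}).get("schema") or {}).get("fields") or []
    for field in fields:
        for link_type, links in (field.get("links") or {}).items():
            for link in links:
                # Column-level links carry a Schema.<field> source path.
                triples.append((f"Schema.{field['name']}", link_type, link["target"]))
    for link_type, links in (entry.get("links") or {}).items():
        for link in links:
            triples.append(("", link_type, link["target"]))
    return triples
```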

Built-in entry link aliases include `definition`, `synonym`, `related`, and `schema-join`, each mapping to `dataplex-types.global.*` link types.

## Glossary scope

A Business Glossary can be the primary workspace scope (`glossary.<project>.<location>.<glossary-id>`). The local hierarchy mirrors the glossary tree under `catalog/glossaries/`:

```yaml title="Glossary term entry"
name: glossaries/Business Glossary (biz)/terms/customer-id
type: glossaryTerm
displayName: customer-id
description: Unique identifier for a customer record.
parent: projects/my-project/locations/global/glossaries/biz
```

Glossaries also work as `reference.scope` so enrichment workspaces can ground on business vocabulary without owning glossary CRUD. Provision glossary resources out-of-band (Dataplex console or `gcloud dataplex glossaries create`) before the first push; after that, `kcmd pull` followed by `kcmd push` manages metadata on the existing nodes.

<Tabs>
<Tab title="Single glossary">

```bash
kcmd init --glossary my-project.us-central1.my-glossary-id
```

</Tab>
<Tab title="Multiple glossaries">

```bash
kcmd init --glossary my-project.us-central1.glossary-a,glossary-b
```

</Tab>
<Tab title="Location mode">

```bash
kcmd init --glossary my-project.us-central1
```

</Tab>
</Tabs>

## Agent integration

Metadata as Code artifacts are the interchange format for enrichment agents and human-in-the-loop review. Agents read and modify workspace files; `kcmd push` publishes approved changes. The built-in MCP server exposes `list-entries`, `lookup-entry`, and `modify-entry` tools bound to a workspace path, enabling agentic metadata workflows without coupling to a specific model provider.

<CardGroup>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize workspaces per source type, pull snapshots, check status, and push edits back to Knowledge Catalog.
</Card>
<Card title="catalog.yaml reference" href="/catalog-manifest-reference">
Full manifest field reference: scope, snapshot, publishing, reference, aliases, and layout selection rules.
</Card>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
Command flags for init, pull, push, reference, dry-run, force, and validate-only.
</Card>
</CardGroup>

## Related pages

<CardGroup>
<Card title="Overview" href="/overview">
Knowledge Catalog tooling surface and shortest paths to produce and publish metadata context.
</Card>
<Card title="Quickstart" href="/quickstart">
First successful runs: initialize a workspace, pull metadata, and inspect output.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How enrichment agents read source metadata, emit mdcode artifacts, and hand off to kcmd push.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces and reconcile entry links without modifying reference layers.
</Card>
<Card title="kcmd MCP reference" href="/kcmd-mcp-reference">
MCP server startup, workspace binding, and agent tools for pull, push, and modify-entry.
</Card>
</CardGroup>

---

## 06. Enrichment workflows

> How enrichment agents read source metadata, ground on external docs or code, emit OKF bundles or mdcode artifacts, and hand off to kcmd push for catalog publication.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/06-enrichment-workflows.md
- Generated: 2026-06-15T02:53:43.208Z

### Source Files

- `okf/src/enrichment_agent/runner.py`
- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/src/modes/table_mode.py`
- `agents/enrichment/src/modes/doc_mode.py`
- `toolbox/enrichment/README.md`
- `samples/enrichment/README.md`

---
title: "Enrichment workflows"
description: "How enrichment agents read source metadata, ground on external docs or code, emit OKF bundles or mdcode artifacts, and hand off to kcmd push for catalog publication."
---

Knowledge Catalog ships four enrichment surfaces that share a read → ground → emit → publish pattern: the OKF `enrichment-agent` CLI writes vendor-neutral OKF bundles from BigQuery (plus an optional web pass); `agents/enrichment/src/agent_runner.py` writes mdcode workspaces through read-only `kcmd init` / `pull` / `reference`; `toolbox/enrichment` exposes a TypeScript `kcagent enrich` with pluggable MCP tools and skills; and `samples/enrichment` demonstrates download → enrich → publish against catalog APIs. None of the agents call `kcmd push` — publication is always your step after reviewing local output.

## Enrichment surfaces

| Surface | Entry point | Source metadata | Output artifact | Publication |
| --- | --- | --- | --- | --- |
| OKF enrichment agent | `enrichment-agent enrich` | BigQuery API via `BigQuerySource` | OKF bundle directory (`--out`) | Exchange or import; not wired to `kcmd push` |
| Catalog enrichment agent | `agent_runner.py` | `kcmd init` + `pull` or `reference` | mdcode workspace (`catalog.yaml` + `catalog/`) | `kcmd push` |
| Toolbox agent | `kcagent enrich` | `kcmd pull` snapshot in `--catalog-path` | Updated mdcode in workspace | `kcmd push` |
| Python sample | `python3 -m enrichment.enrich` | Downloaded snapshot (`enrichment.download`) | Updated metadata directory | `python3 -m enrichment.publish` or `kcmd push` |

<Note>
Model and cloud configuration follow a bring-your-own-cloud, bring-your-own-key (BYOC/BYOK) approach: pass `--project`, `--location`, and `--model` to the catalog agent; the OKF agent takes `--model` and uses the ADC billing project. No provider is hardcoded beyond what you configure at runtime.
</Note>

## Shared workflow pattern

Every enrichment path follows the same lifecycle: discover concepts or entries from a source, attach external grounding, generate enriched prose or structured aspects, optionally refine, then publish.

```mermaid
flowchart TB
  subgraph sources["Source metadata"]
    BQ["BigQuery API / INFORMATION_SCHEMA"]
    KC["kcmd pull / reference"]
  end

  subgraph grounding["External grounding"]
    Drive["Google Drive / local .md"]
    Web["Web crawl seeds"]
    GH["GitHub MCP repo exploration"]
    FB["User-feedback proposals"]
    Usage["BQ query-history signal"]
  end

  subgraph agents["Enrichment agents"]
    OKF["okf/enrichment_agent"]
    CAT["agents/enrichment agent_runner"]
    TB["toolbox/kcagent"]
  end

  subgraph artifacts["Local artifacts"]
    OKFB["OKF bundle"]
    MDC["mdcode workspace"]
  end

  subgraph publish["Publication (user step)"]
    PUSH["kcmd push"]
    API["catalog API publish"]
  end

  BQ --> OKF
  KC --> CAT
  KC --> TB
  Drive --> CAT
  Drive --> TB
  Web --> OKF
  GH --> CAT
  FB --> CAT
  Usage --> CAT
  OKF --> OKFB
  CAT --> MDC
  TB --> MDC
  MDC --> PUSH
  MDC --> API
```

## OKF bundle workflow

The OKF enrichment agent (`okf/src/enrichment_agent/`) implements a two-pass pipeline, followed by index regeneration, controlled by `EnrichmentRunner`:

1. **BQ pass** — `BigQuerySource.list_concepts()` enumerates dataset and table concepts (wildcard shard families collapse to one concept per prefix). For each concept, `build_bq_agent` runs an ADK agent with tools to `read_concept`, `sample_rows`, and `write_concept_doc`.
2. **Web pass** (optional) — When `--web-seed` or `--web-seed-file` is set, `build_web_agent` crawls outward from seeds with hard limits (`--web-max-pages`, `--web-max-depth`, host/path constraints). Fetched pages enrich existing concepts or land in `references/<slug>`.
3. **Index regeneration** — `regenerate_indexes()` rebuilds progressive-disclosure `index.md` files across the bundle.
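
The shard-collapsing behavior in the BQ pass can be pictured as follows. This sketch assumes the common BigQuery date-sharding convention of a `_YYYYMMDD` suffix; the agent's actual detection rule is not documented here, and `collapse_shards` is a hypothetical name.

```python
import re

# Assumption: a shard is '<prefix>_' followed by exactly eight digits (YYYYMMDD).
SHARD_RE = re.compile(r"^(?P<prefix>.+_)\d{8}$")


def collapse_shards(table_ids):
    """Return concept names in first-seen order, one per shard prefix or plain table."""
    concepts = []
    seen = set()
    for table_id in table_ids:
        match = SHARD_RE.match(table_id)
        concept = match.group("prefix") if match else table_id
        if concept not in seen:
            seen.add(concept)
            concepts.append(concept)
    return concepts
```

Under this assumption a family such as `events_20210101`, `events_20210102` becomes the single concept `events_`, which matches the `tables/events_` form used for concept scoping.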

<Steps>
<Step title="Run BQ-then-web enrichment">

```bash
enrichment-agent enrich \
  --source bq \
  --dataset my-project.my_dataset \
  --out ./bundle \
  --web-seed https://cloud.google.com/bigquery/docs \
  --model gemini-2.5-pro
```

Use `--concept tables/events_` to scope a single concept. Pass `--no-web` to skip the web pass.

</Step>
<Step title="Inspect the bundle">

Each concept becomes a markdown file with YAML frontmatter (`type`, `title`, `description`, `timestamp`, optional `resource`). Run `enrichment-agent visualize --bundle ./bundle` to emit `viz.html`.

</Step>
</Steps>

OKF output is designed for version control, agent context loading, and cross-system exchange — not direct Dataplex push. See [Open Knowledge Format](/open-knowledge-format) for bundle semantics.

## mdcode catalog workflow

`agent_runner.py` dispatches three modes. Mode is inferred when `--mode` is empty: `--dataset` implies `table`, otherwise `doc`. `context_overlay` must be set explicitly.
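
The inference rule is small enough to state as code. This is a sketch of the rule as described, not the runner's implementation; `infer_mode` is a hypothetical function.

```python
VALID_MODES = ("table", "doc", "context_overlay")


def infer_mode(mode: str = "", dataset: str = "") -> str:
    """Return the explicit mode, or infer table/doc from whether a dataset was given."""
    if mode:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        return mode
    # context_overlay is never inferred; it must be passed explicitly.
    return "table" if dataset else "doc"
```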

### Table mode

Table mode discovers BigQuery tables exclusively through kcmd:

1. `kcmd init --bigquery-dataset <project>.<dataset>` + manifest declaring schema, overview, and queries aspects.
2. `kcmd pull` writes `catalog/<project>.<dataset>/<table>.yaml` with live schema.
3. Grounding docs are fetched from `--folders` / `--docs` (Drive or local markdown), summarized, and relevance-routed per table (threshold 0.5).
4. Optional `INFORMATION_SCHEMA` usage signal, doc-extracted SQL, and user-feedback `golden_sql` merge into `<table>.queries.md`.
5. Per-table `<table>.overview.md` sidecars are written; pulled `.dataplex-types.global.overview.md` duplicates are removed to prevent silent overwrite on push.
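
Relevance routing in step 3 can be pictured as a threshold filter per table. In this sketch the scores are supplied by the caller (how the agent computes them is not covered here), the comparison is assumed to be inclusive at 0.5, and `route_docs` is a hypothetical name.

```python
from __future__ import annotations

RELEVANCE_THRESHOLD = 0.5


def route_docs(
    scores: dict[str, dict[str, float]], threshold: float = RELEVANCE_THRESHOLD
) -> dict[str, list[str]]:
    """scores[table][doc] is a relevance score; return docs per table, most relevant first."""
    routed = {}
    for table, doc_scores in scores.items():
        kept = [doc for doc, score in doc_scores.items() if score >= threshold]
        # Sort by descending score, then by name for a deterministic order.
        routed[table] = sorted(kept, key=lambda doc: (-doc_scores[doc], doc))
    return routed
```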

<ParamField body="--dataset" type="string" required>
Fully qualified `project.dataset`.
</ParamField>

<ParamField body="--folders" type="list">
Drive folder URLs/IDs and/or local markdown directories for grounding.
</ParamField>

<ParamField body="--glossaries" type="list">
Dataplex glossaries as `project.location.glossaryId`. Enables column→term linking via `LinkingAgent`; `kcmd push` reconciles `entryLinks.definition`.
</ParamField>

<ParamField body="--include_usage" type="bool" default="true">
Fetch BQ query-history patterns into the `queries` aspect. Requires `dataplex.entryGroups.useQueriesAspect` permission on push.
</ParamField>

### Doc mode

Doc mode builds a knowledge-base entry group:

1. `kcmd init --entry-group <project>.<location>.<entryGroupId>` + pull existing KB entries as seed inputs.
2. Recursive depth-weighted crawl of `--docs` (depth 0 spine) and `--folders` (depth 1 children), max depth 2.
3. Map-reduce summarization: per-doc neutral cards (cache-aware via `KC_ENRICH_CACHE_MODE=summary`), then topic-shaped batch reduction.
4. `EnumerationAgent` produces categories and entries; each entry gets deterministic YAML (`dataplex-types.global.generic` aspect) plus `<id>.overview.md` under `catalog/<category>/`.

Pre-existing KC overviews are preserved as writer grounding — the agent extends rather than drops published content unless contradicted.

### Context overlay mode

Context overlay mirrors table mode but separates ownership:

- First-party (1P) BigQuery entries arrive read-only via `kcmd reference` as `<table>.ref.yaml` + `<table>.ref.overview.md`.
- A new overlay entry per table is created in `--entry-group` as `<table>.yaml` + `<table>.overview.md`.
- Only overlay pairs are pushable; `.ref.*` mirrors stay read-only.

Use this when you need richer descriptions without modifying live `@bigquery` entries.

<RequestExample>

```bash
export PYTHONPATH=agents/enrichment/src

python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-project.my_dataset \
  --folders=https://drive.google.com/drive/folders/ABC123 \
  --topic="E-commerce analytics" \
  --project=my-gcp-project \
  --location=global \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</RequestExample>

## Context sources

All catalog-agent modes accept overlapping grounding inputs. Priority matters when sources conflict.

| Source | Flags | Modes | Behavior |
| --- | --- | --- | --- |
| Google Drive / local markdown | `--docs`, `--folders` | table, doc, context_overlay | Doc mode crawls recursively; table mode relevance-routes per table |
| GitHub repository | `--repo`, `--repo_ref`, `--repo_subdir`, `--mcp_config` | all | GitHub MCP explores code; doc mode seeds KB entries; table modes join router pool |
| User feedback | `--feedback_dir`, `--feedback_files` | all | `{proposals: [...]}` JSON; **highest priority**, overrides Drive and usage signals |
| BQ usage history | `--include_usage`, `--usage_window_days`, `--usage_scope` | table, context_overlay | `INFORMATION_SCHEMA.JOBS_BY_*` patterns into `queries` aspect |
| Dataplex glossaries | `--glossaries` | table | Reference pull + column linking into entry YAML |
| Web URLs | `--web-seed`, `--web-seed-file` | OKF only | Bounded crawl with host/path guards |

<Warning>
User-feedback proposals with `golden_sql` emit into the `queries` aspect with `source: USER` and take precedence in sidecar ordering. Feedback in doc mode is prepended globally to every entry writer prompt because proposals target table/column FQNs, not KB entry IDs.
</Warning>
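
The precedence of user feedback in sidecar ordering can be sketched as a stable sort. Only the `source: USER` label comes from the text above; the `sql` key and the other source labels in the usage below are illustrative assumptions, and `order_queries` is a hypothetical name.

```python
def order_queries(queries: list[dict]) -> list[dict]:
    """Put USER-sourced queries first; all other entries keep their original order."""
    # sorted() is stable, and False sorts before True, so USER entries move to the front.
    return sorted(queries, key=lambda query: query.get("source") != "USER")
```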

## Output artifacts

### OKF bundle layout

:::files
bundle/
├── index.md
├── datasets/
│   └── my_dataset.md
├── tables/
│   └── events_.md
└── references/
    └── some-external-doc.md
:::

Concept documents carry required frontmatter keys and markdown body sections (schema, sample rows, citations). `write_concept_doc` enforces completeness and merges with existing on-disk content via `read_existing_doc`.

### mdcode workspace layout

Table mode writes under the bq-dataset scope:

:::files
output_dir/
├── catalog.yaml
└── catalog/
    └── my-project.my_dataset/
        ├── orders.yaml
        ├── orders.overview.md
        └── orders.queries.md
:::

Doc mode nests by enumeration category:

:::files
output_dir/
├── catalog.yaml
└── catalog/
    └── customer-360/
        ├── orders-entry.yaml
        └── orders-entry.overview.md
:::

Context overlay adds `.ref.*` mirrors alongside overlay pairs. Trajectory files (`trajectory.json`) and `refine_session.json` support evaluation and interactive refinement.

## Refinement before publication

After the initial run, refine without re-reading sources:

<ParamField body="--interactive" type="bool">
Stay in a `refine>` REPL reusing loaded `EnrichmentSession` context.
</ParamField>

<ParamField body="--refine_instruction" type="string">
Apply one refinement turn from saved `refine_session.json`, then exit. Used by webapp persist+re-invoke flows.
</ParamField>

Refinement operations are `rewrite` (regenerate selected overviews) and `answer` (Q&A without file changes). Table-mode re-enumeration recategorizes only — entries are pinned 1:1 to dataset tables. Doc-mode re-enumeration can add, remove, or move entries.

## Publication handoff

Enrichment agents stop at local artifacts. Publishing is explicit:

<Steps>
<Step title="Review local output">

```bash
cd /tmp/enrich_out
kcmd status          # see pending aspect changes
git diff catalog/    # or diff against a prior pull
```

</Step>
<Step title="Push to Knowledge Catalog">

```bash
kcmd push --validate-only    # validate without pushing
kcmd push --dry-run          # log planned mutations without writing
kcmd push                    # publish the types listed under publishing in catalog.yaml
```

Set `CLOUDSDK_CORE_PROJECT` and authenticate via `gcloud auth application-default login`. The `publishing` section in `catalog.yaml` controls which aspects and entry links reconcile — reference layers (`.ref.*` in overlay mode) are never pushed.

</Step>
<Step title="Verify in catalog">

Confirm overview and queries aspects appear on target entries. If `queries` push fails with 403, check `dataplex.entryGroups.useQueriesAspect` permission. If overview push reports success but content is unchanged, verify no duplicate `.dataplex-types.global.overview.md` sidecar overwrote your `.overview.md` file.

</Step>
</Steps>

The Python sample (`samples/enrichment`) follows the same handoff with `python3 -m enrichment.publish --dir <output>` as an alternative to `kcmd push` for demonstration datasets.

<Tip>
Run `python -m eval --output-dir <path>` against generated mdcode to score structural validity, hallucination risk, and cross-run consistency before pushing to production entry groups.
</Tip>

## Toolbox customizable agent

`toolbox/enrichment` packages `kcagent`, a TypeScript agent that enriches an existing kcmd workspace with custom MCP tools and skills:

```bash
kcmd init --bigquery-dataset <project>.<dataset>
kcmd pull
kcagent enrich --catalog-path . --tools-path tools --prompt-path prompt.md
kcmd push
```

Configure `tools/mcp.json` (for example `md-fileset` for local markdown corpora) and `tools/skills/*/SKILL.md` to describe tool usage. This path suits organizations that need custom source connectors without modifying the Python catalog agent.

## Mode selection

| Goal | Recommended path |
| --- | --- |
| Portable, git-friendly knowledge exchange | OKF `enrichment-agent enrich` |
| Enrich live BigQuery table overviews in-place | Catalog agent `table` mode + `kcmd push` |
| Build a knowledge base from Google Docs | Catalog agent `doc` mode + `kcmd push` |
| Richer docs without touching `@bigquery` entries | Catalog agent `context_overlay` mode |
| Custom MCP tools and prompt-driven enrichment | `toolbox/kcagent enrich` |
| Learn the API publish flow | `samples/enrichment` download → enrich → publish |

## Related pages

<CardGroup>
<Card title="Produce OKF bundles" href="/produce-okf-bundles">
Run the OKF enrichment agent with BigQuery sources, web crawl seeds, and concept scoping.
</Card>
<Card title="Run catalog enrichment agent" href="/run-catalog-enrichment-agent">
Execute table, doc, or context_overlay modes with Drive, GitHub, feedback, and glossary inputs.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces with kcmd and reconcile entry links without modifying reference layers.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize kcmd workspaces, pull snapshots, and understand the mdcode layout agents write into.
</Card>
<Card title="Evaluate enrichment output" href="/evaluate-enrichment-output">
Score runs with dynamic golden-free metrics or golden-based evaluation before publication.
</Card>
<Card title="Toolbox enrichment demo" href="/toolbox-enrichment-demo">
End-to-end TypeScript demo with kcmd, kcagent, and md-fileset MCP tools.
</Card>
</CardGroup>

---

## 07. Sync catalog metadata

> Initialize a kcmd workspace for BigQuery, knowledge base, entry group, BigLake, or glossary scope; pull snapshots; check status; and push local edits back to Knowledge Catalog.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/07-sync-catalog-metadata.md
- Generated: 2026-06-15T02:54:30.169Z

### Source Files

- `agents/mdcode/README.md`
- `agents/mdcode/src/tool/commands.ts`
- `agents/mdcode/src/libts/sync.ts`
- `agents/mdcode/src/tool/main.ts`
- `toolbox/mdcode/README.md`
- `toolbox/mdcode/src/tool/commands.ts`

---
title: "Sync catalog metadata"
description: "Initialize a kcmd workspace for BigQuery, knowledge base, entry group, BigLake, or glossary scope; pull snapshots; check status; and push local edits back to Knowledge Catalog."
---

`kcmd` in `agents/mdcode` implements Metadata as Code sync against the Dataplex Catalog API: `init` writes `catalog.yaml`, `pull` and `reference` materialize remote metadata into `catalog/`, and `push` publishes editable local artifacts back. `CatalogSync` in the TypeScript library owns the pull/push engine; the CLI in `src/tool/main.ts` exposes `init`, `pull`, `push`, `reference`, and `mcp`.

<Note>
Install `kcmd` from `agents/mdcode` (`npm install` then `npm run build`) or run `npx kcmd`. Authentication uses gcloud Application Default Credentials.
</Note>

## Prerequisites

Before initializing a workspace:

1. Enable Dataplex / Knowledge Catalog APIs and grant IAM to list, lookup, and modify catalog entries.
2. Authenticate with ADC:

```bash
gcloud auth application-default login
```

`ApiContext.default()` reads the active gcloud project, compute region, and access token. Missing values cause init or sync to fail immediately.

## Sync lifecycle

```mermaid
sequenceDiagram
  participant User
  participant kcmd as kcmd CLI
  participant Snapshot as CatalogSnapshot
  participant Sync as CatalogSync
  participant API as Dataplex Catalog API

  User->>kcmd: init --scope-flag
  kcmd->>Snapshot: write catalog.yaml
  User->>kcmd: pull
  kcmd->>Sync: pull()
  Sync->>API: lookupEntry / lookupEntryLinks
  API-->>Sync: entries + links
  Sync->>Snapshot: _storeResource → catalog/
  User->>kcmd: edit local files
  User->>kcmd: push
  kcmd->>Sync: push()
  Sync->>Snapshot: listEntries / _fetchResource
  Sync->>API: createEntry / modifyEntry / reconcile EntryLinks
```

| Phase | Command | Local output | Remote effect |
|-------|---------|--------------|---------------|
| Bootstrap | `kcmd init` | `catalog.yaml` | None |
| Download | `kcmd pull` | `catalog/**/*.yaml` or `*.md` | Read-only |
| Grounding | `kcmd reference` | `catalog/**/*.ref.yaml` | Read-only |
| Publish | `kcmd push` | None (local files are left unchanged) | Creates/updates entries; reconciles entry links |

## Initialize a workspace

`kcmd init` requires exactly one primary source type. The flag selects workspace mode, `catalog.yaml` `scope`, and on-disk layout (YAML for data assets, Markdown for knowledge bases).

| Mode | Flag | ID format | Layout |
|------|------|-----------|--------|
| BigQuery | `--bigquery-dataset` | `project.dataset` (repeat flag for multiple datasets) | YAML |
| Knowledge base | `--kb` | `project.location.entry-group-id` | Markdown (`.md`) |
| Entry group | `--entry-group` | `project.location.entry-group-id` | YAML |
| BigLake (Iceberg) | `--biglake-namespace` + `--iceberg` | `project.catalog.namespace` | YAML |
| Glossary | `--glossary` | `project.location.glossary-id` (comma-separated or location-only) | YAML under `catalog/glossaries/` |

<Steps>
<Step title="Create the workspace directory">

```bash
mkdir my-catalog-workspace && cd my-catalog-workspace
```

</Step>
<Step title="Run init for your scope">

<CodeGroup>
```bash BigQuery
kcmd init --bigquery-dataset my-project.my_dataset
```

```bash Knowledge base
kcmd init --kb my-project.us-central1.my-kb-id
```

```bash Entry group
kcmd init --entry-group my-project.us-central1.my-entry-group
```

```bash BigLake Iceberg
kcmd init --biglake-namespace my-project.my-catalog.my-namespace --iceberg
```

```bash Glossary
kcmd init --glossary my-project.us-central1.my-glossary-id
```
</CodeGroup>

Add `--pull` to initialize and immediately download metadata:

```bash
kcmd init --bigquery-dataset my-project.my_dataset --pull
```

</Step>
<Step title="Verify catalog.yaml">

Init prints the generated manifest. A BigQuery workspace produces a scope like `bq-dataset.my-project.my_dataset`. Customize `snapshot`, `publishing`, and optional `reference` blocks before the first pull.

</Step>
</Steps>

<ParamField body="--bigquery-dataset" type="string[]">
One or more dataset IDs as `project.dataset`. Multiple flags merge into a single multi-dataset workspace.
</ParamField>

<ParamField body="--kb" type="string" required>
Knowledge base entry group as `project.location.entry-group-id`. Uses Markdown layout.
</ParamField>

<ParamField body="--entry-group" type="string" required>
Custom Dataplex entry group as `project.location.entry-group-id`.
</ParamField>

<ParamField body="--biglake-namespace" type="string" required>
BigLake namespace as `project.catalog.namespace`. Requires `--iceberg`; other metastores are rejected.
</ParamField>

<ParamField body="--glossary" type="string" required>
Glossary scope: single ID, comma-separated IDs, display name, or location-only (`project.location`) for all glossaries in a location.
</ParamField>

<ParamField body="--pull" type="boolean">
Run `kcmd pull` immediately after writing `catalog.yaml`.
</ParamField>
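
The ID shapes listed for each init flag can be checked before invoking `kcmd init`. The sketch below is a hypothetical pre-flight helper, not part of `kcmd`: the `EXPECTED_PARTS` mapping and the `check_scope_id` name are assumptions, and it covers only the single-ID form of each flag (not comma-separated or location-only glossary scopes).

```python
# Hypothetical pre-flight check; kcmd performs its own validation.
EXPECTED_PARTS = {
    "--bigquery-dataset": 2,   # project.dataset
    "--kb": 3,                 # project.location.entry-group-id
    "--entry-group": 3,        # project.location.entry-group-id
    "--biglake-namespace": 3,  # project.catalog.namespace
    "--glossary": 3,           # project.location.glossary-id (single-ID form only)
}


def check_scope_id(flag: str, value: str) -> list[str]:
    """Split a scope ID into segments, raising ValueError on a shape mismatch."""
    if flag not in EXPECTED_PARTS:
        raise ValueError(f"unknown flag: {flag}")
    parts = value.split(".")
    expected = EXPECTED_PARTS[flag]
    if len(parts) != expected or not all(parts):
        raise ValueError(
            f"{flag} expects {expected} dot-separated segments, got {value!r}"
        )
    return parts
```

Counting segments is deliberately naive: an identifier that itself contains a dot would be rejected here even if `kcmd` accepts it.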

## Pull editable metadata

`kcmd pull` loads `catalog.yaml`, enumerates entries from the scoped source, calls `lookupEntry` for each matching entry type, and writes files under `catalog/`.

```bash
kcmd pull
```

<ParamField body="--dry-run" type="boolean">
Log `[DRY-RUN] Pull Resource: …` without writing files.
</ParamField>

### What pull fetches

- **Entries** listed in `snapshot.entries` (or all entries when the list is empty).
- **Aspects** named in `snapshot.aspects`, passed to `lookupEntry` as the aspect filter.
- **Entry links** when `snapshot.entryLinks` is set: `lookupEntryLinks` runs per entry. Column-level links with a `Schema.<field>` source path land under `aspects.schema.fields[].links`; entry-level links appear under top-level `links`. Omit `snapshot.entryLinks` to skip link download.
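
The link placement rule in the last bullet can be sketched as follows. The dictionary shapes, the `source_path` key, and the `place_links` helper are illustrative assumptions; the real `CatalogSync` structures may differ. Falling back to entry-level placement when the named field is not found is also an assumption of this sketch.

```python
def place_links(entry: dict, links: list[dict]) -> dict:
    """Attach links to an entry: column-level under schema fields, the rest at top level."""
    fields = entry.get("aspects", {}).get("schema", {}).get("fields", [])
    by_name = {f["name"]: f for f in fields}
    for link in links:
        path = link.get("source_path", "")
        if path.startswith("Schema."):
            field = by_name.get(path[len("Schema."):])
            if field is not None:
                # Column-level link: aspects.schema.fields[].links
                field.setdefault("links", []).append(link)
                continue
        # Entry-level link: top-level links
        entry.setdefault("links", []).append(link)
    return entry
```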

### Layout-specific output

| Scope layout | File pattern | Example |
|--------------|--------------|---------|
| Standard (BQ, entry group, BigLake, glossary) | `catalog/<namespace>/…/<entry-id>.yaml` | `catalog/bigquery/my-project/my_dataset/orders.yaml` |
| Documents (knowledge base) | `catalog/<entry-id>.md` with YAML frontmatter | `catalog/getting-started.md` |

Long-form aspect text can detach into sidecar files such as `orders.dataplex-types.global.overview.md`.

## Pull reference layers

`kcmd reference` downloads read-only metadata declared in the manifest `reference:` block. Reference files sit next to their editable counterparts with a `.ref.yaml` suffix and are never pushed.

```bash
kcmd reference
```

Typical `catalog.yaml` reference block:

```yaml
scope: entryGroup.my-project.us-central1.my-eg
reference:
  scope: bq-dataset.my-project.my_dataset
  snapshot:
    entries:
      - dataplex-types.global.bigquery-table
    aspects:
      - dataplex-types.global.schema
    entryLinks:
      - definition
```

Reference pull honors `reference.snapshot.entryLinks` the same way `pull` honors `snapshot.entryLinks`, so diffs between live `.yaml` and `.ref.yaml` show only enrichment deltas.

<Warning>
Files ending in `.ref.yaml` are skipped during `push`. `isModifiable()` returns false when an entry has only a reference path.
</Warning>

## Check local changes

The design doc and toolbox README describe `kcmd status` for detecting local modifications against a saved checksum state. In the current `agents/mdcode` implementation:

- `CatalogSync.status()` throws `Not yet implemented`.
- `src/tool/main.ts` does not register a `status` subcommand.

Until `status` ships, inspect changes with version control or filesystem diff:

```bash
git diff catalog/
git status catalog/
```

For enrichment workflows, compare editable files against `.ref.yaml` baselines to isolate agent-added metadata.
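
One way to run that comparison without `git` is a plain text diff between an entry file and its `.ref.yaml` sibling. This is a minimal sketch using only the standard library; the `diff_against_reference` helper is hypothetical, and because it compares text rather than parsed YAML, key reordering or formatting changes will show up as differences.

```python
import difflib
from pathlib import Path


def diff_against_reference(editable: Path) -> str:
    """Return a unified diff between an editable entry file and its .ref.yaml sibling."""
    reference = editable.with_name(editable.stem + ".ref.yaml")
    if not reference.exists():
        return ""  # no baseline pulled for this entry
    ref_lines = reference.read_text(encoding="utf-8").splitlines(keepends=True)
    new_lines = editable.read_text(encoding="utf-8").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            ref_lines, new_lines, fromfile=reference.name, tofile=editable.name
        )
    )
```

For example, `diff_against_reference(Path("catalog/bigquery/my-project/my_dataset/orders.yaml"))` compares against `orders.ref.yaml` in the same directory.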

## Push local edits

`kcmd push` walks modifiable entries in `catalog/`, converts local YAML or Markdown to Dataplex API payloads, and applies creates or updates.

```bash
kcmd push
```

<ParamField body="--dry-run" type="boolean">
Log planned creates, updates, and EntryLink mutations without calling the API.
</ParamField>

<ParamField body="--force" type="boolean">
Declared on the CLI; conflict override is not yet wired in `CatalogSync.push`.
</ParamField>

<ParamField body="--validate-only" type="boolean">
Declared on the CLI; pre-push validation is not yet wired in `CatalogSync.push`.
</ParamField>

### Push behavior by resource type

**Catalog entries**

- Missing entries: `push` auto-creates the parent entry group (if needed) and the entry from local file paths.
- Existing entries: `modifyEntry` updates `aspects` and, for non-ingested sources, `entry_source`.
- Only aspects listed in `publishing.aspects` are written back.

**Entry links**

When `publishing.entryLinks` is set, `push` reconciles local vs remote links per entry:

1. Normalize both sides (unwrap `@dataplex` proxies, canonicalize project IDs).
2. Keep matching links in place.
3. Delete remote links of configured types with no local match.
4. Create local links missing remotely.

Omit or leave `publishing.entryLinks` empty to disable link mutations.
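
The four reconciliation steps amount to a keyed set comparison. The sketch below illustrates that idea under stated assumptions: links are plain dictionaries with `type` and `target` keys, `plan_link_changes` and `by_type_and_target` are hypothetical names, and normalization only maps project numbers to project IDs (unwrapping `@dataplex` proxies is left out because its format is not described here).

```python
def by_type_and_target(project_ids: dict[str, str]):
    """Build a key function; project_ids maps project numbers to project IDs."""
    def key(link: dict) -> tuple[str, str]:
        parts = link["target"].split("/")
        # Canonicalize "projects/<number>/..." to "projects/<id>/..."
        if len(parts) > 1 and parts[0] == "projects":
            parts[1] = project_ids.get(parts[1], parts[1])
        return (link["type"], "/".join(parts))
    return key


def plan_link_changes(local, remote, configured_types, normalize):
    """Return (keep, delete, create) for links whose type is configured for publishing."""
    local_by_key = {normalize(l): l for l in local if l["type"] in configured_types}
    remote_by_key = {normalize(r): r for r in remote if r["type"] in configured_types}
    keep = [r for k, r in remote_by_key.items() if k in local_by_key]
    delete = [r for k, r in remote_by_key.items() if k not in local_by_key]
    create = [l for k, l in local_by_key.items() if k not in remote_by_key]
    return keep, delete, create
```

With an empty `configured_types`, all three lists are empty, which mirrors the rule that an unset `publishing.entryLinks` disables link mutations. Remote links of unconfigured types never enter the comparison, so they are never deleted.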

**Glossary hierarchy**

<Warning>
`kcmd push` does not create `Glossary`, `GlossaryCategory`, or `GlossaryTerm` resources. Missing glossary nodes cause push to fail fast with an explicit error. Provision glossaries via the Dataplex console or `gcloud dataplex glossaries create` first, then `pull` and `push` to update descriptions and labels on existing nodes. EntryLinks that reference glossary terms are created and deleted normally.
</Warning>

### Auto-creation rules

| Resource | Created by push? |
|----------|------------------|
| Entry group | Yes, when missing |
| Catalog entry | Yes, when missing |
| Entry link | Yes (when `publishing.entryLinks` configured) |
| Glossary / category / term | No — must exist before push |

## Configure sync scope in catalog.yaml

The manifest drives every sync operation. Key fields:

<ResponseField name="scope" type="string | string[]">
Primary source of truth. Supported prefixes: `bq-dataset.*`, `entryGroup.*`, `kb.*`, `biglake-namespace.*`, `biglake-iceberg-namespace.*`, `glossary.<project>.<location>.<id>`.
</ResponseField>

<ResponseField name="snapshot" type="object">
`entries`, `aspects`, and optional `entryLinks` to download on `pull`.
</ResponseField>

<ResponseField name="publishing" type="object">
Subset of `snapshot` aspects and `entryLinks` written on `push`. Publishing entry link types must appear in `snapshot.entryLinks`.
</ResponseField>

<ResponseField name="reference" type="object">
Optional read-only scope and `reference.snapshot` for `kcmd reference`.
</ResponseField>

Example manifest for a BigQuery dataset with link sync:

```yaml
scope: bq-dataset.my-project.my_dataset

snapshot:
  entries:
    - dataplex-types.global.bigquery-table
  aspects:
    - dataplex-types.global.schema
    - dataplex-types.global.overview
  entryLinks:
    - definition

publishing:
  aspects:
    - dataplex-types.global.overview
  entryLinks:
    - definition
```
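
The subset rule between `publishing` and `snapshot` can be checked on a parsed manifest before pushing. This is a hedged sketch: `check_publishing_subset` is a hypothetical helper that takes the manifest as an already-loaded dictionary (YAML parsing is left to the caller), and it is not how `kcmd` itself validates the file.

```python
def check_publishing_subset(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means publishing stays within snapshot."""
    snapshot = manifest.get("snapshot") or {}
    publishing = manifest.get("publishing") or {}
    problems = []
    for key in ("aspects", "entryLinks"):
        allowed = set(snapshot.get(key) or [])
        for item in publishing.get(key) or []:
            if item not in allowed:
                problems.append(
                    f"publishing.{key} lists {item!r}, which is missing from snapshot.{key}"
                )
    return problems
```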

## Workspace layout

:::files
my-workspace/
├── catalog.yaml
└── catalog/
    └── bigquery/
        └── my-project/
            ├── my_dataset.yaml
            └── my_dataset/
                ├── orders.yaml
                ├── orders.ref.yaml
                └── orders.dataplex-types.global.overview.md
:::

Knowledge base workspaces replace `.yaml` entry files with `.md` documents. Glossary workspaces mirror the glossary tree under `catalog/glossaries/`.
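
When scripting over a workspace, it can help to separate editable files from reference layers and sidecars by name. The classifier below is a rough heuristic inferred from the file names shown in the layout; the `classify` function and its category labels are assumptions, and an entry ID that itself contains a dot would be misclassified.

```python
from pathlib import PurePosixPath


def classify(path: str) -> str:
    """Classify a file under catalog/ by its name alone (heuristic)."""
    name = PurePosixPath(path).name
    if name.endswith(".ref.yaml"):
        return "reference"  # read-only layer, skipped by push
    if name.endswith(".yaml"):
        return "entry"
    if name.endswith(".md"):
        # Extra dots before .md suggest an aspect sidecar,
        # e.g. orders.dataplex-types.global.overview.md
        return "sidecar" if name.count(".") > 1 else "document"
    return "other"
```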

## Agent-driven sync

Start the MCP server to let agents read and modify the local snapshot:

```json
{
  "mcpServers": {
    "kcmd": {
      "command": "npx",
      "args": ["-y", "kcmd", "mcp", "--path", "/absolute/path/to/workspace"]
    }
  }
}
```

MCP tools: `list-entries`, `lookup-entry`, `modify-entry`. Run `kcmd pull` and `kcmd push` from the CLI (or enrichment pipelines) to sync with the remote catalog after agent edits.

## Common failures

| Symptom | Likely cause | Verification |
|---------|--------------|--------------|
| `Unable to retrieve project, location, or token` | Missing gcloud ADC or config | `gcloud auth application-default login` |
| `Must provide either --entry-group, --bigquery-dataset, …` | No init flag | Pass exactly one source flag |
| `Must specify --iceberg when initializing a BigLake namespace` | BigLake without `--iceberg` | Add `--iceberg` |
| Pull skips entries silently | 403 on `lookupEntry` (missing resource or IAM) | Confirm entry exists in console; check permissions |
| `Glossary term '…' does not exist` on push | Glossary node not provisioned | Create via console/gcloud, then `pull` |
| `Failed to create entry group` | IAM lacks `entryGroups.create` | Grant Dataplex admin or entry-group create role |
| Entry links deleted and recreated every push | Project ID vs number mismatch | Ensure normalization; avoid hand-editing `id` fields |

## End-to-end workflow

<Steps>
<Step title="Initialize and pull">

```bash
kcmd init --bigquery-dataset my-project.ecommerce --pull
```

</Step>
<Step title="Optional: pull reference baselines">

Add a `reference:` block to `catalog.yaml`, then:

```bash
kcmd reference
```

</Step>
<Step title="Edit metadata locally">

Update aspect YAML, sidecar Markdown, or entry link targets under `catalog/`.

</Step>
<Step title="Preview push">

```bash
kcmd push --dry-run
```

</Step>
<Step title="Publish">

```bash
kcmd push
```

Expect `Successfully pushed catalog entries.` on success.

</Step>
</Steps>

## Related pages

<CardGroup>
<Card title="Metadata as Code" href="/metadata-as-code">
Workspace model, manifest fields, reference layers, and entry link semantics.
</Card>
<Card title="Quickstart" href="/quickstart">
First successful init, pull, and inspect workflow.
</Card>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
Full command and flag reference for init, pull, push, and reference.
</Card>
<Card title="catalog.yaml manifest reference" href="/catalog-manifest-reference">
Scope, snapshot, publishing, and entryLinks reconciliation rules.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push enrichment output and reconcile entry links after agent runs.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth, billing, push conflict, and glossary provisioning failures.
</Card>
</CardGroup>

---

## 08. Produce OKF bundles

> Run the OKF enrichment agent against a BigQuery source with optional web crawl seeds, concept scoping, and two-pass BQ-then-web enrichment into a versionable bundle directory.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/08-produce-okf-bundles.md
- Generated: 2026-06-15T02:53:50.950Z

### Source Files

- `okf/README.md`
- `okf/src/enrichment_agent/cli.py`
- `okf/src/enrichment_agent/runner.py`
- `okf/src/enrichment_agent/agent.py`
- `okf/src/enrichment_agent/sources/bigquery.py`
- `okf/src/enrichment_agent/prompts/enrichment_instruction.md`
- `okf/src/enrichment_agent/prompts/web_ingestion_instruction.md`

---
title: "Produce OKF bundles"
description: "Run the OKF enrichment agent against a BigQuery source with optional web crawl seeds, concept scoping, and two-pass BQ-then-web enrichment into a versionable bundle directory."
---

The `enrichment-agent` CLI (`python -m enrichment_agent enrich`) reads BigQuery dataset metadata through a pluggable `Source` interface, runs a Google ADK agent per concept to emit OKF markdown documents, optionally augments those docs from seeded web pages, and regenerates `index.md` files across the output bundle directory.

## What you produce

An OKF bundle is a directory of markdown files with YAML frontmatter. Each BigQuery concept becomes one document; the web pass may add `references/` docs and augment existing primary concepts. The bundle is plain files—suitable for git, static hosting, or downstream agent consumption.

:::files
bundles/<name>/
├── index.md                    # Auto-generated directory index
├── datasets/
│   ├── index.md
│   └── <dataset_id>.md
├── tables/
│   ├── index.md
│   └── <table_id>.md           # Sharded families use prefix (e.g. events_.md)
└── references/                 # Optional, from web pass
    ├── metrics/
    ├── joins/
    └── <slug>.md
:::

<Info>
OKF bundles are vendor-neutral. The enrichment agent is one producer; the format itself is defined in the OKF specification and is not tied to a model provider or serving system.
</Info>

## Two-pass enrichment

Enrichment runs in two sequential passes orchestrated by `EnrichmentRunner.enrich_all()`:

| Pass | Agent | Input | Output |
|------|-------|-------|--------|
| BQ pass | `okf_bq_enrichment_agent` | BigQuery metadata per concept | One OKF doc per advertised concept |
| Web pass | `okf_web_ingestion_agent` | Seed URLs and crawl constraints | Augmented primary docs and optional `references/` docs |

```mermaid
sequenceDiagram
    participant CLI as enrichment_agent CLI
    participant Runner as EnrichmentRunner
    participant BQ as BigQuerySource
    participant BQAgent as okf_bq_enrichment_agent
    participant WebAgent as okf_web_ingestion_agent
    participant Bundle as bundle_root/

    CLI->>Runner: enrich_all(only?)
    Runner->>BQ: list_concepts()
    loop Each concept
        Runner->>BQAgent: enrich_concept(ref)
        BQAgent->>Bundle: write_concept_doc
    end
    opt web_seeds provided
        Runner->>WebAgent: run_web_pass()
        WebAgent->>Bundle: augment / mint references
    end
    Runner->>Bundle: regenerate_indexes()
```

**BQ pass.** For each `ConceptRef` from the source, the agent calls `read_concept_raw`, optionally `sample_rows`, and writes exactly one document via `write_concept_doc`. Documents include prose, `# Schema`, `# Common query patterns`, and `# Citations`.

**Web pass.** When seed URLs are provided, a separate agent crawls outward from seeds using `fetch_url`. For each fetched page it enriches existing concepts, mints `references/<slug>` docs, or skips. Hard limits are enforced inside the tool—not by prompt alone.

Skip the web pass with `--no-web`, or omit seeds entirely.

## Prerequisites

<Steps>
<Step title="Install the OKF package">

From the `okf/` directory:

```bash
python3.13 -m venv .venv
.venv/bin/pip install --index-url https://pypi.org/simple/ -e '.[dev]'  # quoted so zsh does not glob the brackets
```

The CLI entry point is `enrichment-agent`; the module form `python -m enrichment_agent` is equivalent.

</Step>

<Step title="Configure BigQuery credentials">

```bash
gcloud auth application-default login
gcloud config set project <your-billing-project>
```

Public datasets are readable, but query bytes bill against the caller's project. Override billing with `--billing-project`.

</Step>

<Step title="Configure model credentials">

Use one of:

<Tabs>
<Tab title="AI Studio">

Set `GEMINI_API_KEY`.

</Tab>
<Tab title="Vertex AI">

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<id>
export GOOGLE_CLOUD_LOCATION=<region>
```

</Tab>
</Tabs>

Default model is `gemini-flash-latest` (override with `--model`).

</Step>
</Steps>

## Run enrichment

<RequestExample>

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
    --web-seed-file samples/ga4_merch_store/seeds.txt \
    --out ./bundles/ga4_merch_store
```

</RequestExample>

<ResponseExample>

```text
Enriched 12 concept(s) into bundles/ga4_merch_store; web pass used 3 seed(s)
```

</ResponseExample>

### Required flags

<ParamField body="--source" type="string" required>
Source adapter. Currently only `bq` (BigQuery).
</ParamField>

<ParamField body="--dataset" type="string" required>
BigQuery dataset in `project.dataset` form (for example `bigquery-public-data.ga4_obfuscated_sample_ecommerce`).
</ParamField>

<ParamField body="--out" type="path" required>
Bundle root directory. Created if missing.
</ParamField>

### Concept scoping

<ParamField body="--concept" type="string">
Enrich only the given concept id. Repeatable. Format is slash-separated segments matching the source's concept ids, for example `tables/events_` or `datasets/ga4_obfuscated_sample_ecommerce`.
</ParamField>

Use concept scoping to iterate on a single table without re-running the full dataset:

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
    --web-seed-file samples/ga4_merch_store/seeds.txt \
    --out ./bundles/ga4_merch_store \
    --concept tables/events_
```

Unknown concept ids raise `ValueError` before enrichment starts.

### Web crawl configuration

| Flag | Default | Purpose |
|------|---------|---------|
| `--web-seed` | — | Single seed URL; repeatable |
| `--web-seed-file` | — | File with one URL per line (`#` comments allowed); repeatable |
| `--no-web` | `false` | Skip web pass entirely |
| `--web-max-pages` | `100` | Hard cap on pages fetched per run |
| `--web-max-depth` | `2` | Max hop distance from any seed (seeds are depth 0) |
| `--web-allowed-host` | seed hosts only | Extra hostnames the crawler may fetch; repeatable |
| `--web-allowed-path-prefix` | no restriction | Only fetch URLs whose path starts with one of these prefixes; repeatable |
| `--web-denied-path-substring` | — | Reject URLs whose path contains these substrings; repeatable |

Seed files support `#` comment lines:

```text
# GA4 BigQuery Export — schema reference
https://support.google.com/analytics/answer/7029846
```
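
Reading a seed file in this shape takes only a few lines. The sketch below is an assumption about the parsing rules (blank lines and lines starting with `#` are skipped, everything else is a URL), not the CLI's actual loader; `read_seed_urls` is a hypothetical name. It does not treat a trailing `#...` as a comment, since `#` can introduce a URL fragment.

```python
def read_seed_urls(text: str) -> list[str]:
    """Return seed URLs from seed-file text, skipping blank lines and # comment lines."""
    urls = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls
```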

Allowed hosts default to the netloc of each seed URL. The `fetch_url` tool rejects URLs outside allowed hosts, over budget, beyond max depth, on denied path substrings, or not reachable from the seed link graph.
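
Those rejection rules can be pictured as a single policy check that runs before any fetch. The following is an illustrative sketch only: `CrawlPolicy`, its field names, and every reason string other than `"max_pages reached"` are assumptions, and the link-graph reachability check is omitted because it needs crawl state that is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    seeds: list[str]
    max_pages: int = 100
    max_depth: int = 2
    extra_hosts: set[str] = field(default_factory=set)
    allowed_path_prefixes: tuple[str, ...] = ()
    denied_path_substrings: tuple[str, ...] = ()

    def allowed_hosts(self) -> set[str]:
        # Defaults to the netloc of each seed, plus any explicitly allowed hosts.
        return {urlsplit(s).netloc for s in self.seeds} | self.extra_hosts

    def rejection_reason(self, url: str, depth: int, pages_fetched: int) -> Optional[str]:
        """Return why a URL is rejected, or None if it may be fetched."""
        parts = urlsplit(url)
        if parts.netloc not in self.allowed_hosts():
            return "host not allowed"
        if pages_fetched >= self.max_pages:
            return "max_pages reached"
        if depth > self.max_depth:
            return "max_depth exceeded"
        if self.allowed_path_prefixes and not parts.path.startswith(self.allowed_path_prefixes):
            return "path prefix not allowed"
        if any(s in parts.path for s in self.denied_path_substrings):
            return "path denied"
        return None
```

Keeping the limits in code rather than in the prompt means the agent cannot talk its way past them, which is the point of enforcing them inside the tool.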

<Warning>
When `fetch_url` returns `"max_pages reached"` or an `error` field, treat it as final. Do not retry rejected URLs in the same run.
</Warning>

### Other flags

<ParamField body="--billing-project" type="string">
Google Cloud project billed for BigQuery queries. Defaults to the default project from Application Default Credentials.
</ParamField>

<ParamField body="--model" type="string">
Gemini model id. Default: `gemini-flash-latest`.
</ParamField>

<ParamField body="-v, --verbose" type="boolean">
Enable debug logging for enrichment agent events.
</ParamField>

## BigQuery concepts

`BigQuerySource` advertises one concept per dataset plus one per table. Sharded tables matching `prefix_######` (6–8 digit suffix) collapse into a single wildcard concept at `tables/<prefix>` with a representative shard for schema sampling.
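
The shard-collapsing rule can be approximated with a regular expression. This is a sketch under assumptions: the `SHARD` pattern and the `group_concepts` helper are hypothetical, the prefix is taken to include the trailing underscore (matching the `tables/events_` example), and even a lone table with a matching suffix is treated as a sharded family, which may not match `BigQuerySource` exactly.

```python
import re
from collections import defaultdict

# Prefix ending in "_" followed by a 6-8 digit suffix, e.g. events_20210131.
SHARD = re.compile(r"^(?P<prefix>.+_)(?P<suffix>\d{6,8})$")


def group_concepts(table_ids: list[str]) -> dict[str, list[str]]:
    """Map concept id -> member table ids, collapsing sharded tables into one concept."""
    concepts = defaultdict(list)
    for table_id in table_ids:
        match = SHARD.match(table_id)
        key = match.group("prefix") if match else table_id
        concepts[f"tables/{key}"].append(table_id)
    return dict(concepts)
```

A representative shard for schema sampling could then be picked from each member list, for example the lexicographically largest ID.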

Concept ids map to filesystem paths:

| Concept id | Document path |
|------------|---------------|
| `datasets/<dataset_id>` | `datasets/<dataset_id>.md` |
| `tables/<table_id>` | `tables/<table_id>.md` |
| `references/<slug>` | `references/<slug>.md` |

The BQ agent tools are `list_concepts`, `read_concept_raw`, `sample_rows`, `read_existing_doc`, and `write_concept_doc`.

## Web pass behavior

The web agent augments BQ-produced docs under strict rules:

- **Augmentation, not rewrite.** Existing `#` headings, schema field listings, and citations must be preserved. The tool refuses writes that shrink `# Schema` field sets or reduce `# Citations` entry counts on `BigQuery Table` docs.
- **Reference minting.** Pages that define reusable entities, metrics, enums, or conventions may become `references/<slug>.md` docs when they pass topic-shape, citation, and reuse gates.
- **Structured extractions.** Metrics go to `references/metrics/<slug>.md`; join paths to `references/joins/<a>__<b>.md`. These bypass the reference-minting gates.
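
The augmentation guard in the first bullet can be sketched as a before/after comparison of two sections. Everything here is an assumption made for illustration: the helper names, the idea that schema fields appear as bullets starting with a backticked name, and the returned messages. The actual write tool may parse documents differently, and this version does not account for `# ` lines inside fenced code.

```python
import re


def section_items(markdown: str, heading: str) -> list[str]:
    """Return bullet items under the top-level '# <heading>' section."""
    items, inside = [], False
    for line in markdown.splitlines():
        if line.startswith("# "):
            inside = line[2:].strip() == heading
        elif inside and line.lstrip().startswith(("- ", "* ")):
            items.append(line.strip()[2:].strip())
    return items


def field_names(markdown: str) -> set[str]:
    """Collect backticked field names that open each bullet under '# Schema'."""
    names = set()
    for item in section_items(markdown, "Schema"):
        match = re.match(r"`([^`]+)`", item)
        if match:
            names.add(match.group(1))
    return names


def check_augmentation(old: str, new: str) -> list[str]:
    """Return reasons to refuse a write; an empty list means the update only adds."""
    problems = []
    missing = field_names(old) - field_names(new)
    if missing:
        problems.append(f"missing schema field(s): {sorted(missing)}")
    if len(section_items(new, "Citations")) < len(section_items(old, "Citations")):
        problems.append("fewer citations than before")
    return problems
```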

Web agent tools add `fetch_url` to the BQ tool set.

## Bundle output and indexes

After both passes, `regenerate_indexes()` writes or updates `index.md` at every directory level in the bundle. Each index groups child concepts by their `type` frontmatter field and lists each one with a link and its `description` one-liner.

Documents require frontmatter keys `type`, `title`, `description`, and `timestamp` (auto-filled when omitted). Recommended keys are `resource` and `tags`.
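
A quick post-run check for the required keys might look like the sketch below. It is a simplified illustration, not the agent's own validator: `parse_frontmatter` and `missing_keys` are hypothetical helpers, and the parser handles only flat `key: value` lines rather than full YAML (list-valued keys such as `tags` are ignored).

```python
REQUIRED = ("type", "title", "description", "timestamp")


def parse_frontmatter(doc: str) -> dict[str, str]:
    """Parse flat 'key: value' frontmatter between the leading '---' fences."""
    lines = doc.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        # Skip indented or list lines; nested YAML is out of scope here.
        if sep and key.strip() and not key.startswith((" ", "\t", "-")):
            meta[key.strip()] = value.strip().strip('"')
    return meta


def missing_keys(doc: str) -> list[str]:
    """Return required frontmatter keys that are absent or empty."""
    meta = parse_frontmatter(doc)
    return [k for k in REQUIRED if not meta.get(k)]
```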

<Check>
Verify a successful run by confirming concept markdown files exist under `datasets/` and `tables/`, optional `references/` content appears when seeds were used, and `index.md` files are present at the bundle root and in subdirectories.
</Check>

## Version and iterate

Bundles are directories of plain files. Commit them to git for diff-based review, re-run with `--concept` to refine individual docs, or point `--out` at an existing bundle so `read_existing_doc` lets the agent refine rather than rewrite.

Pre-built sample bundles live under `okf/bundles/` (GA4, Stack Overflow, Bitcoin). Matching recipes with exact commands and seed files are under `okf/samples/`.

## Troubleshooting

| Symptom | Likely cause | Action |
|---------|--------------|--------|
| `--dataset is required for --source bq` | Missing dataset flag | Pass `--dataset project.dataset` |
| `dataset must be in 'project.dataset' form` | Malformed dataset id | Use two-part identifier |
| `Unknown concept(s): ...` | Invalid `--concept` id | Run once without `--concept` to see which concept ids the source advertises |
| Web pass produces no references | Seeds too broad or budget exhausted | Add focused seed URLs; raise `--web-max-pages` or tighten `--web-allowed-path-prefix` |
| `Refusing to write: ... missing ... field(s)` | Web agent replaced schema | Re-run with augmentation-aware prompts; preserve existing `# Schema` |
| `max_pages reached` in logs | Crawl budget spent | Increase `--web-max-pages` or reduce seed scope |

<AccordionGroup>
<Accordion title="BQ-only enrichment">

Omit seeds or pass `--no-web`:

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset <project>.<dataset> \
    --no-web \
    --out ./bundles/<name>
```

</Accordion>

<Accordion title="Restrict crawl to documentation paths">

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset <project>.<dataset> \
    --web-seed-file seeds.txt \
    --web-allowed-path-prefix /docs/ \
    --web-denied-path-substring /login \
    --web-max-pages 50 \
    --out ./bundles/<name>
```

</Accordion>
</AccordionGroup>

## Next

<CardGroup>
<Card title="Open Knowledge Format" href="/open-knowledge-format">
OKF v0.1 bundle structure, frontmatter fields, index.md progressive disclosure, and cross-link semantics.
</Card>
<Card title="OKF bundle recipes" href="/okf-bundle-recipes">
Copy-paste commands for GA4, Stack Overflow, and Bitcoin public datasets with seed files and expected outputs.
</Card>
<Card title="Visualize OKF bundles" href="/visualize-okf-bundles">
Generate self-contained `viz.html` graph viewers from produced bundles.
</Card>
<Card title="OKF enrichment CLI reference" href="/okf-enrichment-cli-reference">
Full flag and environment variable reference for `enrich` and `visualize` subcommands.
</Card>
</CardGroup>

---

## 09. Run the catalog enrichment agent

> Execute table, doc, or context_overlay modes with Drive, local markdown, GitHub, feedback, glossary, and usage-signal inputs; refine output interactively before publishing.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/09-run-the-catalog-enrichment-agent.md
- Generated: 2026-06-15T02:54:30.472Z

### Source Files

- `agents/enrichment/README.md`
- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/src/engine.py`
- `agents/enrichment/src/modes/context_overlay_mode.py`
- `agents/enrichment/src/tools/kcmd_tools.py`
- `agents/enrichment/src/refine.py`
- `agents/enrichment/src/tools/github_tools.py`

---
title: Run the catalog enrichment agent
description: Execute table, doc, or context_overlay modes with Drive, local markdown, GitHub, feedback, glossary, and usage-signal inputs; refine output interactively before publishing.
---

The catalog enrichment agent generates **Metadata as Code** (mdcode) for Google Cloud Knowledge Catalog (Dataplex). It reads source material—Google Drive documents, local Markdown, BigQuery metadata, optional GitHub repositories, user-feedback proposals, and query-usage signals—and writes enriched YAML and Markdown artifacts under a local output directory. The agent talks to the catalog **only through `kcmd`** (read-only `init`, `pull`, and `reference`); you publish with `kcmd push`.

Entry point: `agents/enrichment/src/agent_runner.py`.

## Choose a mode

Three enrichment flows are available. Mode is selected with `--mode` or inferred when omitted (`--dataset` implies `table`; otherwise `doc`). `context_overlay` is never inferred—you must pass it explicitly.

| Mode | Target | What it produces |
|------|--------|------------------|
| `table` | BigQuery dataset (`--dataset`) | Enriched overviews and `queries` aspects on live `@bigquery` table entries |
| `doc` | Entry group (`--entry_group`) | Knowledge-base entries from crawled docs (map-reduce → enumerate → write) |
| `context_overlay` | Dataset + entry group | New overlay entries per table in an editable group; 1P tables pulled read-only via `kcmd reference` |
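
The inference rule for an omitted `--mode` reduces to a small function. This is a sketch of the rule as described here, with a hypothetical `resolve_mode` name; the actual argument handling in `agent_runner.py` may differ.

```python
from typing import Optional

MODES = {"table", "doc", "context_overlay"}


def resolve_mode(mode: Optional[str], dataset: Optional[str]) -> str:
    """Pick the enrichment mode: explicit --mode wins, else infer from --dataset."""
    if mode is not None:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        return mode
    # context_overlay is never inferred; it must be passed explicitly.
    return "table" if dataset else "doc"
```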

<AccordionGroup>
<Accordion title="Table mode — enrich live BigQuery entries">

`kcmd init --bigquery-dataset` and `kcmd pull` scaffold the workspace. The agent routes Drive or local Markdown documents to each table, writes enriched `<table>.overview.md` sidecars, and optionally emits `<table>.queries.md` from `INFORMATION_SCHEMA` query history plus SQL extracted from routed docs. With `--glossaries`, columns are mapped to Dataplex glossary terms and field-level `links.definition` are injected.

</Accordion>
<Accordion title="Doc mode — build a knowledge base from documents">

Crawls Google Docs (and optional Drive folders or local Markdown directories), map-reduces them through a topic lens, enumerates canonical entries, and fans out per-entry overview writers. Requires `--entry_group` to already exist—the agent does not create entry groups.

</Accordion>
<Accordion title="Context-overlay mode — enrich without touching live tables">

Like table mode for routing and writing, but 1P BigQuery entries are pulled read-only via `kcmd reference` as `<table>.ref.yaml` mirrors. One new generic overlay entry per table is created in your editable `--entry_group`. The `queries` aspect attaches to the overlay, not the live table.

</Accordion>
</AccordionGroup>

## Prerequisites

<Steps>
<Step title="Build kcmd">

```bash
cd agents/mdcode
npm install
npm run build   # -> agents/mdcode/dist/kcmd
```

The agent resolves `kcmd` automatically at `agents/mdcode/dist/kcmd` (override with `$KCMD_BIN`), so it does not need `kcmd` on `PATH`. Add `dist` to `PATH` only for the `kcmd push` commands you run yourself.

</Step>
<Step title="Install Python dependencies">

```bash
python3 -m venv ~/.venv/kc-enrich
source ~/.venv/kc-enrich/bin/activate
pip install -r agents/enrichment/src/requirements.txt
```

`google-cloud-bigquery` powers usage signals; `mcp` is needed only for a local stdio GitHub MCP server (the default hosted remote works without it).

</Step>
<Step title="Authenticate">

```bash
gcloud auth application-default login \
  --scopes='openid,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive.readonly'
```

Vertex AI project, location, and model are supplied per run via flags—nothing is hardcoded.

</Step>
</Steps>

## Required flags

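These flags apply to every invocation regardless of mode. `--project`, `--model`, and `--output_dir` are required; `--location` and `--topic` fall back to their defaults when omitted: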

<ParamField body="--project" type="string" required>
Google Cloud project hosting the Vertex AI model.
</ParamField>

<ParamField body="--model" type="string" required>
Vertex AI model id for reasoning-heavy steps (e.g. `gemini-2.5-pro`). High-volume structured steps use `KC_LIGHT_MODEL` when set, otherwise the main model.
</ParamField>

<ParamField body="--output_dir" type="string" required>
Local directory for the generated mdcode tree, `trajectory.json`, and `refine_session.json`.
</ParamField>

<ParamField body="--location" type="string" default="global">
Vertex AI location (e.g. `us-central1`).
</ParamField>

<ParamField body="--topic" type="string" default="Metadata enrichment">
Free-text use case that steers enrichment and doc-mode topic reduction.
</ParamField>

Mode-specific requirements:

| Flag | `doc` | `table` | `context_overlay` |
|------|:-----:|:-------:|:-----------------:|
| `--dataset` | — | required | required |
| `--entry_group` | required | — | required |
| `--folders` | optional | optional | optional |
| `--docs` | optional | — | optional |
| `--tables` | — | — | optional |
| `--include_usage` | — | optional (default `true`) | optional (default `true`) |
| `--glossaries` | — | optional | — |
| `--feedback_dir` / `--feedback_files` | optional | optional | optional |
| `--repo` / `--repo_ref` / `--repo_subdir` | optional | optional | optional |
| `--interactive` / `--refine_instruction` | optional | optional | optional |

See [Enrichment agent flags reference](/enrichment-agent-flags) for the full flag matrix.
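
The mode inference rule from "Choose a mode" and the requirements table above can be combined into a small pre-flight check. This is a minimal sketch, not the agent's own validation: `resolve_mode`, `missing_flags`, and the plain-dict flag representation are hypothetical, and the single-turn `--refine_instruction` path (which skips the pipeline) is out of scope here.

```python
# Hypothetical pre-flight check; names are illustrative, not part of agent_runner.py.
ALWAYS_REQUIRED = ("project", "model", "output_dir")
REQUIRED_BY_MODE = {
    "doc": ("entry_group",),
    "table": ("dataset",),
    "context_overlay": ("dataset", "entry_group"),
}


def resolve_mode(flags: dict) -> str:
    """Explicit --mode wins; otherwise --dataset implies table, else doc."""
    mode = flags.get("mode")
    if mode:
        return mode
    # context_overlay is never inferred, so it does not appear here.
    return "table" if flags.get("dataset") else "doc"


def missing_flags(flags: dict) -> list[str]:
    """Return the required flags that are absent or empty for the resolved mode."""
    mode = resolve_mode(flags)
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown mode: {mode}")
    needed = ALWAYS_REQUIRED + REQUIRED_BY_MODE[mode]
    return [f"--{name}" for name in needed if not flags.get(name)]
```

Running such a check before a long enrichment run surfaces a missing `--entry_group` immediately instead of after source documents have been crawled.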

## Configure source inputs

### Google Drive and local Markdown

`--folders` and `--docs` accept a comma-separated mixed list. Each entry is classified format-first:

1. `http://` / `https://` → Google Drive (Doc or folder URL)
2. Ends in `.md` / `.markdown` → local Markdown file
3. Path-shaped (`/abs`, `./rel`, `~/path`, or contains `/`) → local directory (read recursively) or file
4. Bare name that exists on disk → local
5. Otherwise → Google Drive ID
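
The five rules above are order-dependent, which a short sketch makes explicit. This is an illustration under assumptions, not the agent's code: the function names and return labels are hypothetical, and treating the scheme and extension checks as case-insensitive is a guess.

```python
import os


def classify_source(entry: str) -> str:
    """Classify one --folders/--docs item, checking the rules in order."""
    value = entry.strip()
    lowered = value.lower()  # assumption: scheme and extension match case-insensitively
    if lowered.startswith(("http://", "https://")):
        return "drive_url"          # rule 1: Google Drive Doc or folder URL
    if lowered.endswith((".md", ".markdown")):
        return "local_markdown"     # rule 2: local Markdown file
    if value.startswith(("/", "./", "~/")) or "/" in value:
        return "local_path"         # rule 3: path-shaped, directory or file
    if os.path.exists(value):
        return "local_path"         # rule 4: bare name that exists on disk
    return "drive_id"               # rule 5: everything else is a Drive ID


def split_sources(flag_value: str) -> list[tuple[str, str]]:
    """Split a comma-separated flag value and classify each non-empty item."""
    items = [item.strip() for item in flag_value.split(",")]
    return [(item, classify_source(item)) for item in items if item]
```

Because rule 4 consults the filesystem, the same bare name can classify differently depending on the working directory; prefixing local inputs with `./` avoids that ambiguity.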

In **doc mode**, a local `.md` in `--docs` is a depth-0 spine doc; a directory contributes depth-1 children. In **table** and **context_overlay** modes, local files join the relevance-router candidate pool alongside Drive documents.

### BigQuery usage signal

For `table` and `context_overlay` modes, `--include_usage` (default `true`) fetches `INFORMATION_SCHEMA` query history and emits `<table>.queries.md` sidecars conforming to the Dataplex `queries` aspect.

<ParamField body="--usage_window_days" type="integer" default="30">
Days of query history to aggregate.
</ParamField>

<ParamField body="--usage_scope" type="enum" default="auto">
`auto` tries `JOBS_BY_PROJECT` then falls back to `JOBS_BY_USER`; `project` requires project-wide access; `user` reads only the caller's queries.
</ParamField>

<ParamField body="--anonymize_users" type="boolean" default="false">
Replace user emails with stable SHA hashes in the usage signal.
</ParamField>
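
To illustrate what a "stable" hash means for `--anonymize_users`, here is a minimal sketch. The specific choices are assumptions: the source only says "SHA hashes", so SHA-256, lowercase normalization, the 16-character truncation, and the name `anonymize_user` are all hypothetical.

```python
import hashlib


def anonymize_user(email: str, length: int = 16) -> str:
    """Map an email to a deterministic pseudonym (assumed SHA-256, truncated)."""
    normalized = email.strip().lower()  # assumption: case and padding are ignored
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]
```

The same email always yields the same token, so per-user query patterns remain groupable without exposing addresses.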

### Glossary column linking (table mode only)

<ParamField body="--glossaries" type="string">
Comma-separated Dataplex glossaries as `project.location.glossaryId`. Maps BigQuery columns to glossary terms and injects field-level `links.definition` into entry YAML.
</ParamField>

### User-feedback proposals (all modes)

<ParamField body="--feedback_dir" type="string">
Directory of feedback files (`.md`/`.json`) walked recursively. Each file holds JSON shaped `{"proposals": [...]}`.
</ParamField>

<ParamField body="--feedback_files" type="string">
Explicit comma-separated feedback file paths; combinable with `--feedback_dir`.
</ParamField>

Feedback is the **highest-priority context source**—proposals override conflicting information from Drive docs, semantic search, or `INFORMATION_SCHEMA`-derived patterns. In table and overlay modes, proposals route per-table by `target_asset.name` FQN; `eval_candidate.golden_sql` from valid proposals becomes a `[Source: User Feedback]` entry in the `queries` aspect.
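
The loading and per-table routing described above might look like the following sketch. It assumes each feedback file is wholly JSON and that unreadable files are skipped; `load_proposals` and `route_by_table` are hypothetical names, not functions from the agent.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_proposals(feedback_dir=None, feedback_files=()):
    """Collect proposals from explicit files plus a recursively walked directory."""
    paths = [Path(p) for p in feedback_files]
    if feedback_dir:
        paths += sorted(
            p for p in Path(feedback_dir).rglob("*")
            if p.is_file() and p.suffix in {".md", ".json"}
        )
    proposals = []
    for path in paths:
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # assumption: unreadable or non-JSON files are ignored
        items = payload.get("proposals") if isinstance(payload, dict) else None
        if isinstance(items, list):
            proposals.extend(p for p in items if isinstance(p, dict))
    return proposals


def route_by_table(proposals):
    """Group proposals by target_asset.name (the table FQN)."""
    routed = defaultdict(list)
    for proposal in proposals:
        target = proposal.get("target_asset")
        name = target.get("name") if isinstance(target, dict) else None
        if name:
            routed[name].append(proposal)
    return dict(routed)
```

Proposals without a `target_asset.name` fall out of the routing map in this sketch; how the agent treats them is not specified here.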

### GitHub source code (all modes)

<ParamField body="--repo" type="string">
GitHub repo as `owner/name` or URL. A code-understanding agent explores the repo via the GitHub MCP server and distills code component cards.
</ParamField>

<ParamField body="--repo_ref" type="string">
Branch, tag, or SHA (default: repo default branch).
</ParamField>

<ParamField body="--repo_subdir" type="string">
Path prefix to scope exploration (e.g. `src/server`).
</ParamField>

<ParamField body="--mcp_config" type="string">
Path to `mcp.json` describing the GitHub MCP server. Falls back to `KC_ENRICH_MCP_CONFIG`, then the hosted remote server. Select server entry with `KC_ENRICH_GITHUB_MCP_SERVER` (default `github_remote`).
</ParamField>

Set a Personal Access Token before running:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_...
```

<CodeGroup>

```json title="mcp.json — remote and local servers"
{
  "mcpServers": {
    "github_remote": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {"Authorization": "Bearer ${GITHUB_PERSONAL_ACCESS_TOKEN}"}
    },
    "github": {
      "type": "stdio",
      "command": "github-mcp-server",
      "args": ["stdio"],
      "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PERSONAL_ACCESS_TOKEN}"}
    }
  }
}
```

</CodeGroup>
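
To check which server entry a given environment would select, and that the token placeholder resolves, a sketch like the one below can help. How the agent itself expands `${...}` placeholders is not documented here, so the `string.Template` expansion and the helper names are assumptions.

```python
from string import Template


def expand_placeholders(value, env):
    """Recursively substitute ${VAR} placeholders; unknown variables are left as-is."""
    if isinstance(value, str):
        return Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {k: expand_placeholders(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(v, env) for v in value]
    return value


def resolve_github_server(config: dict, env: dict):
    """Pick the server named by KC_ENRICH_GITHUB_MCP_SERVER (default github_remote)."""
    name = env.get("KC_ENRICH_GITHUB_MCP_SERVER", "github_remote")
    servers = config.get("mcpServers", {})
    if name not in servers:
        raise KeyError(f"server entry {name!r} not found in mcp.json")
    return name, expand_placeholders(servers[name], env)
```

Pass `json.load(open("mcp.json"))` and `dict(os.environ)` to run it against a real configuration.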

In **doc mode**, distinct components surface as their own knowledge-base entries. In **table** and **context_overlay** modes, cards join the relevance router's candidate pool so code that reads or writes a table grounds that table's overview and queries aspect.

## Run the agent

<Steps>
<Step title="Set PYTHONPATH">

```bash
export PYTHONPATH=agents/enrichment/src
```

</Step>
<Step title="Run a mode">

<Tabs>
<Tab title="Table">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=https://drive.google.com/drive/folders/<id>,./local_md_corpus \
  --topic="Customer 360 data" \
  --project=my-gcp-project \
  --location=us-central1 \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</Tab>
<Tab title="Doc">

Create the entry group first:

```bash
gcloud dataplex entry-groups create myEntryGroup \
  --project=my-gcp-project --location=us-central1
```

Then run:

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=doc \
  --docs="https://docs.google.com/document/d/<id>,./notes/data_model.md" \
  --folders=<drive_folder_id_or_url> \
  --topic="Order pipeline documentation" \
  --entry_group=my-gcp-project.us-central1.myEntryGroup \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</Tab>
<Tab title="Context overlay">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=context_overlay \
  --dataset=my-proj.analytics \
  --entry_group=my-gcp-project.us-central1.overlayGroup \
  --folders=<drive_folder_id_or_url> \
  --tables=orders,customers \
  --topic="Enriched table context" \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

</Tab>
</Tabs>

</Step>
<Step title="Verify output">

```bash
find /tmp/enrich_out -type f
```

Expected artifacts:

:::files
/tmp/enrich_out/
├── catalog.yaml          # kcmd manifest (written by agent via kcmd init)
├── catalog/              # per-entry YAML + sidecar Markdown
├── trajectory.json       # tool-call log of what the agent read and produced
└── refine_session.json   # saved session for refinement re-invocation
:::

In **context_overlay** mode, each table directory also contains read-only mirrors:

```text
catalog/bigquery/<project>/<dataset>/
├── orders.ref.yaml           # read-only 1P entry (kcmd reference)
├── orders.ref.overview.md    # existing 1P overview, if any
├── orders.yaml               # pushable overlay entry
├── orders.overview.md        # enriched overview
└── orders.queries.md         # queries aspect sidecar
```
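
The artifact listings above can be checked programmatically before moving on. This is a minimal sketch; the helper names are hypothetical, and the suffix list simply mirrors the filenames shown above.

```python
from pathlib import Path

# Top-level artifacts named in the expected-output listing.
EXPECTED_ARTIFACTS = ("catalog.yaml", "catalog", "trajectory.json", "refine_session.json")


def missing_artifacts(output_dir: str) -> list[str]:
    """Return the expected top-level artifacts that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).exists()]


def count_sidecars(output_dir: str) -> dict[str, int]:
    """Count files under catalog/ by suffix, most specific suffix first."""
    suffixes = (".ref.overview.md", ".ref.yaml", ".overview.md", ".queries.md", ".yaml")
    counts = {suffix: 0 for suffix in suffixes}
    for path in (Path(output_dir) / "catalog").rglob("*"):
        if not path.is_file():
            continue
        for suffix in suffixes:
            if path.name.endswith(suffix):
                counts[suffix] += 1
                break  # stop at the first (most specific) match
    return counts
```

In context-overlay mode, equal counts for `.yaml` and `.ref.yaml` suggest every overlay entry has its read-only mirror beside it.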

</Step>
</Steps>

## Refine output interactively

After the initial run, refine without re-reading source docs or re-pulling the dataset. Each entry stores its grounding prompt in `refine_session.json`, so refinement reuses loaded context.

<Tabs>
<Tab title="Interactive REPL">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=./local_md_corpus \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out \
  --interactive
```

At the `refine>` prompt you can rewrite overviews, add sections, re-enumerate entries, or ask questions. Commands: `:entries`, `:show <id>`, `:quit`. No-op on a non-TTY.

</Tab>
<Tab title="Single refinement turn">

```bash
python3 agents/enrichment/src/agent_runner.py \
  --refine_instruction="make the orders overview more concise" \
  --output_dir=/tmp/enrich_out \
  --project=my-gcp-project \
  --model=gemini-2.5-pro
```

Requires a prior run's `refine_session.json`. Skips the enrichment pipeline entirely.

</Tab>
</Tabs>

Refinement operations:

| Operation | Effect |
|-----------|--------|
| `rewrite` | Re-generate one or more entry overviews with a change |
| `reenumerate` | Add, remove, split, merge, or recategorize entries (doc mode fully; table/overlay modes re-categorize only—entries are pinned to dataset tables) |
| `answer` | Respond to a question about the output; no files change |
| `noop` | Ask for clarification when the request is ambiguous |

## Publish enriched metadata

The agent only generates mdcode. Pushing to Dataplex is a separate step that you run yourself:

```bash
cd /tmp/enrich_out
CLOUDSDK_CORE_PROJECT=<project> CLOUDSDK_COMPUTE_REGION=<region> kcmd push
```

See [Publish enriched metadata](/publish-enriched-metadata) for push options, entry-link reconciliation, and reference-layer constraints.

## Evaluate before publishing

Score a run with the golden-free evaluator (no reference answers required):

```bash
cd agents/enrichment
pip install -r eval/requirements.txt
python -m eval --output-dir /tmp/enrich_out
```

Writes `eval_report.md` next to `trajectory.json`. See [Evaluate enrichment output](/evaluate-enrichment-output).

## Troubleshooting

| Symptom | Likely cause | What to check |
|---------|--------------|---------------|
| `kcmd not found` | Binary not built | `cd agents/mdcode && npm install && npm run build`, or set `$KCMD_BIN` |
| `--entry_group is required` | Missing flag in doc/overlay mode | Pass `project.location.entryGroupId`; create the group with `gcloud dataplex entry-groups create` first |
| No reference tables pulled | Dataset or permissions | Verify `--dataset` and read access to `@bigquery` entries |
| GitHub code context empty | MCP auth or scope | Confirm `GITHUB_PERSONAL_ACCESS_TOKEN`; check `[Code]` log lines for tool-call counts |
| `queries` push 403 | Missing permission | Caller needs `dataplex.entryGroups.useQueriesAspect`; overview still publishes |
| Refinement skipped | Non-interactive shell (`--interactive` is a no-op on a non-TTY) | Use `--refine_instruction` for a single non-interactive refinement turn |

More signals in [Troubleshooting](/troubleshooting).

## Related pages

<CardGroup cols={2}>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node.js setup, and credential configuration.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How agents read metadata, ground on external sources, and hand off to kcmd push.
</Card>
<Card title="Enrichment agent flags" href="/enrichment-agent-flags">
Complete `agent_runner.py` flag reference by mode.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces and reconcile entry links.
</Card>
<Card title="Evaluate enrichment output" href="/evaluate-enrichment-output">
Golden-free and golden-based scoring of enrichment runs.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize kcmd workspaces and pull catalog snapshots.
</Card>
</CardGroup>

---

## 10. Publish enriched metadata

> Push mdcode workspaces with kcmd, publish sample enrichment output via catalog APIs, and reconcile entry links and aspects without modifying read-only reference layers.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/10-publish-enriched-metadata.md
- Generated: 2026-06-15T02:54:53.762Z

### Source Files

- `agents/enrichment/README.md`
- `agents/mdcode/src/tool/commands.ts`
- `samples/enrichment/src/enrichment/publish.py`
- `samples/enrichment/src/enrichment/enrich.py`
- `toolbox/enrichment/README.md`
- `agents/mdcode/README.md`


Enrichment agents in this repository **generate** local mdcode artifacts (`catalog.yaml`, `catalog/` entries, Markdown sidecars, and optional `*.ref.yaml` reference layers) but do not call Dataplex publish APIs themselves. Publication is a separate step: run `kcmd push` from the workspace root for the primary Metadata as Code path, or `python -m enrichment.publish` from `samples/enrichment` for a direct overview-aspect update via the Dataplex Catalog API.

## Publication paths

| Path | When to use | What gets published | Reference layers |
|------|-------------|---------------------|------------------|
| `kcmd push` | Enrichment agent output, toolbox demo, any mdcode workspace | Aspects and entry types listed under `publishing` in `catalog.yaml`; optional `entryLinks` reconciliation | `*.ref.yaml` and `*.ref.*.md` are **never** pushed |
| `python -m enrichment.publish` | `samples/enrichment` workflow only | `overview` aspect on existing `@bigquery` table entries | N/A (flat `*.md` snapshot, not mdcode) |

<Info>
The catalog enrichment agent (`agents/enrichment`) shells out to read-only `kcmd init`, `kcmd pull`, and `kcmd reference` commands. You run `kcmd push` after reviewing or evaluating the generated tree.
</Info>

```mermaid
sequenceDiagram
  participant Agent as enrichment agent
  participant WS as mdcode workspace
  participant KCMD as kcmd push
  participant DP as Dataplex Catalog API

  Agent->>WS: write catalog.yaml + catalog/ + sidecars
  Agent->>WS: kcmd reference (optional .ref.yaml)
  Note over WS: *.ref.yaml = read-only grounding
  WS->>KCMD: kcmd push [--dry-run]
  KCMD->>DP: modifyEntry (publishing.aspects)
  KCMD->>DP: create/delete EntryLink (publishing.entryLinks)
  Note over KCMD,DP: .ref.yaml entries skipped (no local editable path)
```

## Prerequisites

- **Authentication**: Application Default Credentials via `gcloud auth application-default login`. Set `CLOUDSDK_CORE_PROJECT` (and optionally `CLOUDSDK_COMPUTE_REGION`) when pushing from an enrichment output directory.
- **Built `kcmd`**: `cd agents/mdcode && npm install && npm run build` produces `agents/mdcode/dist/kcmd`. Add `dist/` to `PATH` or invoke the binary directly.
- **Complete manifest**: Enrichment modes write a full `catalog.yaml` with `snapshot` and `publishing` blocks. A bare `scope:` line from `kcmd init` alone causes `kcmd push` to load no entry types and silently no-op.
- **Pre-existing resources** (mode-dependent):
  - **Doc mode**: target entry group must exist before enrichment (`gcloud dataplex entry-groups create …`).
  - **Context-overlay mode**: editable entry group must exist; live `@bigquery` entries are read-only via `kcmd reference`.
  - **Glossary terms**: `kcmd push` updates metadata on existing glossary terms but does not create glossaries, categories, or terms.

<Warning>
`kcmd push` fails fast when a referenced glossary term, category, or glossary does not exist. Bootstrap glossary structure out-of-band (console or `gcloud dataplex glossaries create` / `glossary-terms create`), then `kcmd pull` before pushing link metadata.
</Warning>

## Publish with kcmd push

### Workspace layout after enrichment

Editable files live beside read-only reference mirrors:

```text
output_dir/
├── catalog.yaml
└── catalog/
    └── bigquery/<project>/<dataset>/
        ├── orders.yaml              # editable — pushed
        ├── orders.overview.md       # sidecar — merged on push
        ├── orders.queries.md        # sidecar (table / overlay modes)
        ├── orders.ref.yaml          # reference — skipped on push
        └── orders.ref.overview.md   # reference sidecar — skipped
```

`CatalogSnapshot.isModifiable` returns true only when an entry has a **local** (non-`.ref.yaml`) path. Reference entries index under `.ref.yaml` but are excluded from push iteration.
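
The rule above can be mirrored in a few lines to preview which entry files a push would consider. This is a Python sketch of the idea only; the real check lives in the TypeScript `CatalogSnapshot`, and `push_candidates` is a hypothetical helper that looks at filenames alone.

```python
from pathlib import PurePosixPath


def push_candidates(entry_paths):
    """Split entry YAML paths (under catalog/) into editable and reference lists."""
    editable, reference = [], []
    for raw in entry_paths:
        name = PurePosixPath(raw).name
        if not name.endswith(".yaml"):
            continue  # sidecar Markdown follows its entry; skip it here
        if name.endswith(".ref.yaml"):
            reference.append(raw)   # read-only mirror, excluded from push
        else:
            editable.append(raw)    # local path, so the entry is modifiable
    return editable, reference
```

An entry that appears only in the `reference` list has no local editable path and would not be touched by `kcmd push`.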

### Push flags

<ParamField body="--dry-run" type="boolean">
  Log create, modify, and delete operations without calling the Catalog API.
</ParamField>

<ParamField body="--validate-only" type="boolean">
  Validate the local snapshot against the service without applying changes.
</ParamField>

<ParamField body="--force" type="boolean">
  Overwrite service metadata, ignoring potential conflicts.
</ParamField>

### Standard push workflow

<Steps>
<Step title="Review local artifacts">

Inspect the enrichment output before publishing:

```bash
find /tmp/enrich_out -type f | head -50
# Overlay mode: compare the baseline mirror against the enriched overview
cd /tmp/enrich_out/catalog/bigquery/<project>/<dataset>
diff -u orders.ref.overview.md orders.overview.md
```

Optionally score the run with the evaluator before pushing: run `python -m eval --output-dir /tmp/enrich_out` from `agents/enrichment`.

</Step>

<Step title="Dry-run push">

From the workspace root (directory containing `catalog.yaml`):

```bash
cd /tmp/enrich_out
# Assumes kcmd is on PATH; otherwise use the full path to agents/mdcode/dist/kcmd
CLOUDSDK_CORE_PROJECT=<project> \
  kcmd push --dry-run
```

Confirm the log shows only intended `Modify Entry`, `Create Entry`, and `EntryLink` operations. No `[DRY-RUN]` lines should target `*.ref.yaml` entries.

</Step>

<Step title="Push to the catalog">

```bash
CLOUDSDK_CORE_PROJECT=<project> CLOUDSDK_COMPUTE_REGION=<region> \
  kcmd push
```

On success the CLI prints `Successfully pushed catalog entries.`

</Step>

<Step title="Verify in the catalog">

Re-pull the affected scope and diff against your pre-push tree, or inspect entries in the Dataplex console. For table mode, confirm `overview` and `queries` aspects updated on `@bigquery` entries. For context-overlay mode, confirm new generic entries exist in your entry group while `@bigquery` entries are unchanged.

</Step>
</Steps>

<RequestExample>

```bash
cd /tmp/enrich_out
CLOUDSDK_CORE_PROJECT=my-gcp-project CLOUDSDK_COMPUTE_REGION=us-central1 \
  kcmd push --dry-run
```

</RequestExample>

<ResponseExample>

```text
Pushing catalog entries...
[DRY-RUN] Modify Entry projects/my-gcp-project/locations/us/entryGroups/@bigquery/entries/bigquery.googleapis.com/projects/my-gcp-project/datasets/analytics/tables/orders (updateMask: aspects, aspects: 655216118709.global.overview,...)
Successfully pushed catalog entries.
```

</ResponseExample>
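
For larger workspaces, the dry-run log can be summarized rather than read line by line. The sketch below assumes every operation line follows the shape of the single sample above (`[DRY-RUN] <operation> <resource name> (...)`); other line formats may exist, and `summarize_dry_run` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the one sample line in the response example.
DRY_RUN_LINE = re.compile(r"^\[DRY-RUN\]\s+(?P<op>.+?)\s+(?P<target>projects/\S+)")


def summarize_dry_run(log_text: str):
    """Return (operation counts, list of target resource names) from a dry-run log."""
    counts = Counter()
    targets = []
    for line in log_text.splitlines():
        match = DRY_RUN_LINE.match(line.strip())
        if match:
            counts[match.group("op")] += 1
            targets.append(match.group("target"))
    return counts, targets
```

Comparing the target list against the tables you meant to enrich is a quick way to catch a wrong scope before the real push.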

## Publishing by enrichment mode

### Table mode (`--mode=table`)

The agent runs `kcmd init --bigquery-dataset` + `kcmd pull`, enriches each table's `overview` (and optionally `queries`) aspect, then writes sidecar Markdown. Push targets live `@bigquery` table entries.

Default manifest aspects published:

- `dataplex-types.global.overview`
- `dataplex-types.global.queries`

With `--glossaries`, the manifest also declares `snapshot.entryLinks: [definition, synonym]` and `publishing.entryLinks: [definition]`. The linking step injects column-level `links.definition` into `<table>.yaml`; `kcmd push` reconciles those links to Dataplex.

### Doc mode (`--mode=doc`)

The agent creates knowledge-base entries (generic entry type + `overview` aspect) under your `--entry_group`. The entry group must exist before enrichment runs; at push time, `kcmd push` may create entries and entry groups that do not yet exist remotely.

### Context-overlay mode (`--mode=context_overlay`)

Read-only 1P BigQuery metadata is pulled via `kcmd reference` into `*.ref.yaml`. The agent writes **new** overlay entries (`<table>.yaml` + `<table>.overview.md`, plus `<table>.queries.md` when the `queries` aspect is generated) in your editable entry group. Only these overlay files are pushed; the `.ref.*` mirror of the live table is never modified or published.

## Manifest controls

The `publishing` block in `catalog.yaml` determines what `kcmd push` writes. The `reference` scope is pull-only.

| Key | Role on push |
|-----|--------------|
| `publishing.aspects` | Aspect types uploaded via `modifyEntry` |
| `publishing.entries` | Entry types eligible for create/update (doc and overlay modes) |
| `publishing.entryLinks` | Link types reconciled per entry; must be a subset of `snapshot.entryLinks` |
| `reference.scope` | Read-only pull via `kcmd reference`; never pushed |

Example publishing block for table enrichment with glossary links:

```yaml
publishing:
  aspects:
    - dataplex-types.global.overview
    - dataplex-types.global.queries
  entryLinks:
    - definition
```

Omit `publishing.entryLinks` (or leave it empty) to disable link mutations entirely — useful when you only want to read links without taking responsibility for reconciling them.
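
The manifest constraints in this section (a non-empty `publishing` block, and `publishing.entryLinks` being a subset of `snapshot.entryLinks`) can be linted before a push. This is a minimal sketch that assumes `catalog.yaml` has already been parsed into a dict with a YAML parser of your choice; `manifest_problems` is a hypothetical helper, not a `kcmd` feature.

```python
def manifest_problems(manifest: dict) -> list[str]:
    """Return human-readable problems found in a parsed catalog.yaml manifest."""
    problems = []
    publishing = manifest.get("publishing") or {}
    snapshot = manifest.get("snapshot") or {}

    # A bare scope-only manifest gives push nothing to write.
    if not publishing.get("aspects") and not publishing.get("entries"):
        problems.append("publishing lists no aspects or entries")

    # Link types to reconcile must also be pulled in the snapshot.
    published = set(publishing.get("entryLinks") or [])
    pulled = set(snapshot.get("entryLinks") or [])
    extra = sorted(published - pulled)
    if extra:
        problems.append(
            "publishing.entryLinks not in snapshot.entryLinks: " + ", ".join(extra)
        )
    return problems
```

An empty result does not guarantee a successful push; it only rules out the two manifest mistakes described here.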

## Entry link reconciliation

When `publishing.entryLinks` is set, `CatalogSync.push` compares local links from entry YAML against remote `lookupEntryLinks` results for the configured types.

Reconciliation rules:

1. **Match** — normalized target + source path (project ID/number agnostic, `@dataplex` proxy unwrapped). Existing remote links with a local match are kept.
2. **Create** — local links with no remote match are created.
3. **Delete** — remote links of the configured types with no local match are deleted.
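
The three rules amount to a keyed set difference, sketched below. This is not the `CatalogSync.push` implementation: the `Link` tuple, the caller-supplied `normalize` function (standing in for the project ID/number and `@dataplex` handling), and including the link type in the match key are all assumptions made for illustration.

```python
from typing import Callable, Iterable, NamedTuple


class Link(NamedTuple):
    link_type: str    # e.g. "definition"
    source_path: str  # "" for entry-level, "Schema.<field>" for column-level
    target: str       # target entry name


def reconcile(
    local: Iterable[Link],
    remote: Iterable[Link],
    normalize: Callable[[str], str] = lambda target: target,
):
    """Return (keep, create, delete) lists following the match/create/delete rules."""
    def key(link: Link):
        return (link.link_type, link.source_path, normalize(link.target))

    local_by_key = {key(link): link for link in local}
    remote_by_key = {key(link): link for link in remote}

    keep = [remote_by_key[k] for k in remote_by_key if k in local_by_key]
    create = [local_by_key[k] for k in local_by_key if k not in remote_by_key]
    delete = [remote_by_key[k] for k in remote_by_key if k not in local_by_key]
    return keep, create, delete
```

Without normalization, a link that differs only by project number versus project ID would land in both `create` and `delete`, which is the spurious cycle the matching rule is designed to avoid.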

Column-level links (source path `Schema.<field>`) are stored under `aspects.schema.fields[].links` in entry YAML. Entry-level links appear under the top-level `links` block.

<AccordionGroup>
<Accordion title="Diff reference vs enriched links">

When `reference.snapshot.entryLinks` is declared, `kcmd reference` includes pre-edit link state in `*.ref.yaml`. Compare live `<table>.yaml` against `<table>.ref.yaml` to see only what enrichment added or removed before pushing.

</Accordion>

<Accordion title="Glossary definition links">

Glossary terms pulled as reference (`catalog/glossaries/.../*.ref.yaml`) ground the LinkingAgent but are not pushed. Only the `definition` links injected into editable table YAML are reconciled on push.

</Accordion>
</AccordionGroup>

## Sample enrichment API publish

The `samples/enrichment` package demonstrates a lighter-weight path that bypasses mdcode. It downloads table overviews into flat Markdown files, enriches them, then publishes via `dataplex.CatalogServiceClient.update_entry` with an aspects-only field mask.

<Steps>
<Step title="Download snapshot">

```bash
python3 -m enrichment.download \
  --dir ../sample/metadata.initial \
  --dataset ${CLOUD_PROJECT}.kc_enrich_sample_data
```

</Step>

<Step title="Enrich">

```bash
python3 -m enrichment.enrich \
  --dir ../sample/metadata.initial \
  --output-dir ../sample/metadata.new \
  --config-dir ../sample/config
```

</Step>

<Step title="Review diff">

```bash
git diff --no-index ../sample/metadata.initial ../sample/metadata.new
```

</Step>

<Step title="Publish">

```bash
python3 -m enrichment.publish --dir ../sample/metadata.new
```

Each `*.md` file is converted back to a Dataplex `Entry` protobuf and `update_entry` is called with `update_mask.paths=['aspects']` and `aspect_keys` set to the overview aspect key.

</Step>
</Steps>

<Note>
The sample publish path updates only the `overview` aspect on existing entries. It does not reconcile entry links, create entries, or manage reference layers. For production enrichment from `agents/enrichment` or `toolbox/enrichment`, use `kcmd push`.
</Note>

## Toolbox demo publish

The TypeScript toolbox demo (`toolbox/enrichment`) follows the mdcode path: `kcmd init` + `kcmd pull`, `kcagent enrich`, then push from the demo workspace:

```bash
cd demo
../../mdcode/dist/kcmd pull
../dist/kcagent enrich --catalog-path . --tools-path tools --prompt-path prompt.md
../../mdcode/dist/kcmd push
```

## Agent-driven publish (MCP)

Agents can publish through the kcmd MCP server (`kcmd mcp --path <workspace>`), which exposes `modify-entry` and related tools. The push semantics are the same as the CLI: only modifiable local entries and manifest-declared publishing types are affected. Reference layers remain read-only.

## Troubleshooting

| Symptom | Likely cause | Mitigation |
|---------|--------------|------------|
| Push succeeds but nothing changes | `publishing` block missing or empty; bare `scope:` manifest | Ensure enrichment wrote a complete manifest with `publishing.aspects` (and `publishing.entries` for doc/overlay modes) |
| `Glossary term does not exist` | Term referenced in links but not provisioned | Create term via `gcloud dataplex glossary-terms create`, then `kcmd pull` |
| `Failed to create entry group` | IAM or quota on Dataplex entry groups | Verify `dataplex.entryGroups.create` permission; create group manually |
| Spurious link delete/create cycles | Project number vs ID mismatch in link targets | Rely on built-in normalization (fixed in `CatalogSync.push`); ensure targets use consistent FQN form from `kcmd pull` |
| Context overlay modified live BQ entry | Pushed wrong files or wrong scope | Confirm `scope:` points at your entry group, not `bq-dataset`; verify only `<table>.yaml` (not `<table>.ref.yaml`) changed locally |
| `kcmd not found` | Binary not built or not on PATH | `cd agents/mdcode && npm install && npm run build`; set `KCMD_BIN` or add `dist/` to PATH |

## Next

<CardGroup>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
  Initialize workspaces, pull snapshots, and understand the full pull/push lifecycle before publishing.
</Card>
<Card title="Catalog manifest reference" href="/catalog-manifest-reference">
  Configure `snapshot`, `publishing`, `reference`, and `entryLinks` reconciliation rules in `catalog.yaml`.
</Card>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
  Full `kcmd push` flags, init modes, and authentication via gcloud ADC.
</Card>
<Card title="Evaluate enrichment output" href="/evaluate-enrichment-output">
  Score structural validity and grounding before you push enriched metadata.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
  Auth, billing, push conflict, and glossary provisioning failures.
</Card>
</CardGroup>

---

## 11. Visualize OKF bundles

> Generate self-contained viz.html graph viewers from OKF bundles with force-directed layouts, concept detail panels, backlinks, and in-browser markdown rendering.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/11-visualize-okf-bundles.md
- Generated: 2026-06-15T02:54:38.906Z

### Source Files

- `okf/README.md`
- `okf/src/enrichment_agent/cli.py`
- `okf/src/enrichment_agent/viewer/generator.py`
- `okf/src/enrichment_agent/viewer/templates/viz.html`
- `okf/src/enrichment_agent/viewer/static/viz.js`
- `okf/src/enrichment_agent/viewer/static/viz.css`


The `enrichment-agent visualize` subcommand (also invokable as `python -m enrichment_agent visualize`) walks an OKF bundle directory, extracts concepts and cross-links from markdown files, and writes a single self-contained `viz.html` file with an embedded graph and in-browser markdown renderer. No model credentials, network access, or backend server is required at generation time; the viewer loads Cytoscape.js and marked from a CDN only when you open the HTML in a browser.

<Note>
The visualizer is a proof-of-concept **consumer** of OKF bundles. Any tool that reads markdown can consume bundles; this viewer is one bundled option for exploring graph-shaped knowledge.
</Note>

## Prerequisites

- The `enrichment-agent` package installed from the `okf/` directory (see [Installation](/installation)).
- An OKF bundle directory on disk — either produced by the enrichment agent or authored by hand. Bundles in this repository include `okf/bundles/ga4/`, `okf/bundles/stackoverflow/`, and `okf/bundles/crypto_bitcoin/`, each with a pre-generated `viz.html`.

No BigQuery, Vertex AI, or Gemini credentials are needed for visualization. Generation is entirely local file I/O.

## Generate a visualization

<Steps>
<Step title="Point at a bundle directory">

Pass `--bundle` with the root of an OKF bundle (the directory that contains concept markdown files and optional `index.md` navigation files).

</Step>
<Step title="Run the visualize subcommand">

<CodeGroup>
```bash title="Module invocation"
.venv/bin/python -m enrichment_agent visualize \
    --bundle ./bundles/ga4
```

```bash title="Console script"
enrichment-agent visualize \
    --bundle ./bundles/crypto_bitcoin
```
</CodeGroup>

</Step>
<Step title="Verify output">

On success, the CLI prints counts to stderr and writes the HTML file.

<ResponseExample>
```text
Wrote 14 concept(s), 42 edge(s), 287431 bytes → bundles/ga4/viz.html
```
</ResponseExample>

Open the output path in a browser. The default location is `<bundle>/viz.html`.

</Step>
</Steps>

### Custom output path and display name

```bash
.venv/bin/python -m enrichment_agent visualize \
    --bundle ./bundles/crypto_bitcoin \
    --out /tmp/btc.html \
    --name "Bitcoin OKF"
```

The `--name` value appears in the viewer header and browser title. When omitted, the bundle directory name is used.

## CLI reference

<ParamField body="--bundle" type="path" required>
Root directory of the OKF bundle to visualize.
</ParamField>

<ParamField body="--out" type="path">
Output HTML path. Defaults to `<bundle>/viz.html`.
</ParamField>

<ParamField body="--name" type="string">
Display name shown in the viewer header. Defaults to the bundle directory name.
</ParamField>

| Flag | Default | Description |
|------|---------|-------------|
| `--bundle` | *(required)* | Bundle root directory |
| `--out` | `<bundle>/viz.html` | Output HTML path |
| `--name` | bundle directory name | Header display name |

## How generation works

```mermaid
flowchart LR
  subgraph cli ["enrichment_agent/cli.py"]
    V["visualize subcommand"]
  end
  subgraph gen ["viewer/generator.py"]
    W["_walk_concepts"]
    G["_build_graph"]
    E["embed template + assets"]
  end
  subgraph out ["Output"]
    H["viz.html"]
  end
  V --> W
  W --> G
  G --> E
  E --> H
```

`generate_visualization(bundle_root, out_path, bundle_name=None)` in `enrichment_agent.viewer` performs four steps:

1. **Walk concepts** — Recursively find every `*.md` file under the bundle root.
2. **Parse frontmatter** — Load each file with `OKFDocument.parse`. Files that fail parsing are skipped silently.
3. **Extract links** — Scan markdown bodies for relative `.md` link targets and resolve them to concept IDs.
4. **Embed assets** — Inline `viz.css` and `viz.js` into `viz.html`, inject the graph JSON, and write a single HTML file.

<ResponseField name="return value" type="dict">
Generation returns counts: `concepts` (node count), `edges` (directed edge count), and `bytes` (output file size).
</ResponseField>

### Concept discovery rules

| Rule | Behavior |
|------|----------|
| `index.md` files | Excluded from the graph (navigation indexes, not concepts) |
| Parse failures | Skipped; bundle generation continues |
| Concept ID | Relative path from bundle root without `.md` suffix (e.g. `tables/events_`) |
| Frontmatter fields used | `type`, `title`, `description`, `resource`, `tags` |
| Missing frontmatter keys | Falls back to `"Unknown"` type, concept ID as title, empty strings for optional fields |
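
The discovery rules above can be sketched in a few lines. This is a minimal illustration, not the code in `viewer/generator.py`: the function name `concept_ids` is hypothetical, and frontmatter parsing (and therefore the parse-failure rule) is left out.

```python
import tempfile
from pathlib import Path


def concept_ids(bundle_root: Path) -> list[str]:
    """Return sorted concept IDs for every non-index Markdown file in a bundle."""
    ids = []
    for path in bundle_root.rglob("*.md"):
        if path.name == "index.md":
            continue  # navigation indexes are not graphed
        # Concept ID: path relative to the bundle root, without the .md suffix.
        ids.append(path.relative_to(bundle_root).with_suffix("").as_posix())
    return sorted(ids)


# Small demo bundle built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tables").mkdir()
    (root / "tables" / "events_.md").write_text("---\ntype: BigQuery Table\n---\n")
    (root / "index.md").write_text("# Index\n")
    demo_ids = concept_ids(root)

print(demo_ids)
```
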

### Link extraction and edges

Cross-links are detected with a regex that matches markdown link targets ending in `.md`, optionally followed by an anchor fragment.

| Link form | Included as edge? |
|-----------|-------------------|
| Relative (`../tables/users.md`, `events.md`) | Yes, if target resolves inside the bundle |
| Absolute in-bundle (`/tables/users.md` in body) | No edge; skipped during extraction, then rewired for in-viewer navigation in the detail panel |
| External (`https://…`) | No — skipped during extraction |
| Absolute path starting with `/` in link target | No — skipped during extraction |
| Dangling target (file not in bundle) | No edge created |
| Self-link | No edge created |
| Duplicate source→target pair | Deduplicated |

Edges are **directed**: source is the citing concept, target is the linked concept.
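
The edge rules in the table can be approximated as follows. This is a sketch under stated assumptions: the regex `LINK_RE` and the function `extract_edges` are hypothetical stand-ins for the generator's internals, and the input is assumed to be a mapping of concept ID to raw markdown body.

```python
import posixpath
import re

# Assumed pattern: a markdown link target ending in .md, optionally followed by #anchor.
LINK_RE = re.compile(r"\]\(([^)\s#]+\.md)(?:#[^)\s]*)?\)")


def extract_edges(bodies: dict[str, str]) -> list[tuple[str, str]]:
    """Return deduplicated, directed (source, target) pairs between known concepts."""
    edges = set()
    for source, body in bodies.items():
        base = posixpath.dirname(source)
        for target in LINK_RE.findall(body):
            if "://" in target or target.startswith("/"):
                continue  # external URLs and absolute paths are skipped
            resolved = posixpath.normpath(posixpath.join(base, target))
            concept_id = resolved[: -len(".md")]
            if concept_id == source or concept_id not in bodies:
                continue  # self-links and dangling targets create no edge
            edges.add((source, concept_id))
    return sorted(edges)


demo_bodies = {
    "tables/events": (
        "[users](users.md) [again](./users.md#cols) [ds](../datasets/ga4.md) "
        "[ext](https://example.com/a.md) [abs](/tables/users.md) "
        "[gone](missing.md) [self](events.md)"
    ),
    "tables/users": "",
    "datasets/ga4": "[events](../tables/events.md)",
}
demo_edges = extract_edges(demo_bodies)
print(demo_edges)
```

Using a set for the intermediate result is what removes duplicate source→target pairs, such as the two links to `users.md` in the demo.
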

## Embedded graph data model

The generator serializes a JSON blob into `window.BUNDLE` inside the HTML:

```text
BUNDLE
├── nodes[]          # Cytoscape node elements
│   └── data
│       ├── id, label, type, description, resource, tags
│       ├── color    # from type palette
│       └── size     # 30 + min(60, len(body) // 200)
├── edges[]          # Cytoscape edge elements
│   └── data: { id, source, target }
├── bodies{}         # concept id → raw markdown body
├── types[]          # sorted unique type strings
└── palette{}        # known type → hex color
```

### Node color palette

| Concept type | Color |
|--------------|-------|
| `BigQuery Dataset` | `#8b5cf6` |
| `BigQuery Table` | `#3b82f6` |
| `Reference` | `#10b981` |
| Any other type | `#94a3b8` (default) |

Node diameter scales with body length (capped), so concepts with more prose appear slightly larger on the graph.
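
As a worked example of the size formula and palette fallback documented above, the sketch below computes both values for a node. The helper name `node_style` is hypothetical; the formula and colors are copied from this page.

```python
PALETTE = {
    "BigQuery Dataset": "#8b5cf6",
    "BigQuery Table": "#3b82f6",
    "Reference": "#10b981",
}
DEFAULT_COLOR = "#94a3b8"


def node_style(concept_type: str, body: str) -> dict:
    """Return the color and size for a node, per the documented formula."""
    # Base diameter 30, plus 1 per 200 characters of body, capped at +60.
    size = 30 + min(60, len(body) // 200)
    color = PALETTE.get(concept_type, DEFAULT_COLOR)
    return {"color": color, "size": size}


print(node_style("BigQuery Table", ""))        # empty body: base size
print(node_style("Reference", "x" * 1000))     # 1000 chars adds 5
print(node_style("Glossary", "x" * 50_000))    # unknown type, capped size
```
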

## Browser viewer

The generated `viz.html` is a split-pane application: a Cytoscape.js graph on the left (~60% width) and a concept detail panel on the right (~40%).

### Graph interactions

| Control | Behavior |
|---------|----------|
| Click node | Opens detail panel; selects and centers the node |
| Click canvas background | Clears selection |
| Search box | Dims nodes whose title, concept ID, or tags do not match the query |
| Type filter | Dims all nodes except the selected `type` |
| Layout selector | Re-lays out the graph: `cose` (force-directed, default), `concentric`, `breadthfirst`, `circle`, `grid` |
| Reset view | Fits graph to viewport and clears selection |

On load, the viewer auto-selects the first `BigQuery Dataset` node if one exists; otherwise it selects the first concept.

### Detail panel

For the selected concept, the panel shows:

- **Type chip** — colored by the type palette
- **Title and concept ID**
- **Frontmatter** — description, resource (as external link), tags (as chips)
- **Rendered body** — markdown parsed in-browser with marked (GFM enabled)
- **Cited by** — reverse-link backlinks computed from the edge graph

Internal markdown links in the form `/path/to/concept.md` are rewired to navigate within the viewer instead of loading a file path. External links open in a new tab.

<Info>
Cytoscape.js `3.28.1` and marked `12.0.0` are loaded from jsDelivr CDN when the page opens. Bundle content itself is fully embedded in the HTML at generation time — no fetch of bundle files occurs in the browser.
</Info>

## Output layout

:::files
okf/bundles/<name>/
├── datasets/
├── tables/
├── references/
├── index.md              # navigation only; not graphed
└── viz.html              # default output (--out overrides path)
:::

The HTML file inlines all CSS and JavaScript from `enrichment_agent/viewer/static/`. Template placeholders (`__BUNDLE_NAME__`, `__BUNDLE_DATA__`) are replaced at generation time. You can commit `viz.html` next to the bundle, host it on a static file server, or share it as a standalone artifact.

## Programmatic use

Import `generate_visualization` directly for custom pipelines:

```python
from pathlib import Path
from enrichment_agent.viewer import generate_visualization

stats = generate_visualization(
    Path("./bundles/ga4"),
    Path("./bundles/ga4/viz.html"),
    bundle_name="GA4 E-commerce",
)
# stats == {"concepts": N, "edges": M, "bytes": K}
```

## Troubleshooting

<AccordionGroup>
<Accordion title="FileNotFoundError: Bundle directory not found">

`--bundle` must point to an existing directory. The generator does not create bundle content — produce a bundle first with `enrich` or author markdown concepts manually.

</Accordion>

<Accordion title="Graph has fewer nodes than expected">

- `index.md` files are intentionally excluded.
- Markdown files with invalid YAML frontmatter are skipped during parsing.
- Check that concept files use the standard `---` frontmatter delimiter.

</Accordion>

<Accordion title="Expected cross-links missing from the graph">

Links must target relative `.md` paths that resolve inside the bundle. External URLs, absolute `/` paths in link targets, and links to non-existent concepts do not produce edges. Verify link syntax matches `[label](../path/to/concept.md)`.

</Accordion>

<Accordion title="Viewer loads but graph area is empty">

Open the browser developer console. If Cytoscape.js or marked fails to load from the CDN (network policy, offline environment), the graph will not render. The embedded bundle data is still present in the page source.

</Accordion>

<Accordion title="enrichment_agent module not found">

Install the package from `okf/`:

```bash
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
```

</Accordion>
</AccordionGroup>

## Related pages

<CardGroup>
<Card title="Open Knowledge Format" href="/open-knowledge-format">
Bundle structure, frontmatter fields, and cross-link semantics that the visualizer reads.
</Card>
<Card title="Produce OKF bundles" href="/produce-okf-bundles">
Run the enrichment agent to generate bundle directories you can visualize.
</Card>
<Card title="OKF enrichment CLI reference" href="/okf-enrichment-cli-reference">
Full `enrich` and `visualize` subcommand reference including BigQuery source flags.
</Card>
<Card title="OKF bundle recipes" href="/okf-bundle-recipes">
Copy-paste recipes for GA4, Stack Overflow, and Bitcoin bundles with sample `viz.html` outputs.
</Card>
</CardGroup>

---

## 12. Run the discovery agent

> Deploy the Knowledge Catalog discovery agent with ADK: required GCP APIs and IAM roles, environment variables, and root-agent or AgentTool integration patterns.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/12-run-the-discovery-agent.md
- Generated: 2026-06-15T02:55:25.668Z

### Source Files

- `samples/discovery/README.md`
- `samples/discovery/agent.py`
- `samples/discovery/tools.py`
- `samples/discovery/utils.py`
- `samples/discovery/SKILL.md`
- `samples/discovery/requirements.txt`

---
title: "Run the discovery agent"
description: "Deploy the Knowledge Catalog discovery agent with ADK: required GCP APIs and IAM roles, environment variables, and root-agent or AgentTool integration patterns."
---

The Knowledge Catalog discovery agent in `samples/discovery/` is a Google ADK `llm_agent.Agent` that answers natural-language questions by calling Knowledge Catalog semantic search through `dataplex_v1.CatalogServiceClient.search_entries`. The agent loads its behavior from `SKILL.md`, uses `gemini-3-flash-preview` on Vertex AI, and exposes a single tool—`knowledge_catalog_search`—that returns catalog entry metadata for the LLM to decompose, batch, merge, and rerank.

## Architecture

```mermaid
flowchart TB
  subgraph adk["ADK runtime"]
    CLI["adk run parent folder"]
    Root["root_agent or parent Agent"]
    Disc["discovery_agent / knowledge_catalog_discovery_agent"]
    Skill["SKILL.md instruction"]
    Tool["knowledge_catalog_search"]
  end

  subgraph gcp["Google Cloud"]
    Vertex["Vertex AI Gemini"]
    KC["Knowledge Catalog Search API"]
  end

  CLI --> Root
  Root -->|"AgentTool (optional)"| Disc
  Disc --> Skill
  Disc --> Vertex
  Disc --> Tool
  Tool -->|"search_entries semantic_search=true"| KC
```

| Component | Module | Responsibility |
| --- | --- | --- |
| Agent definition | `samples/discovery/agent.py` | Builds `discovery_agent` with model, name, description, instruction, and tools |
| Search tool | `samples/discovery/tools.py` | Calls `CatalogServiceClient.search_entries` against `projects/{project}/locations/global` |
| Project resolution | `samples/discovery/utils.py` | Reads `GOOGLE_CLOUD_PROJECT` for consumer project and model path |
| Agent instruction | `samples/discovery/SKILL.md` | Semantic decomposition, predicate rules, parallel search batching, result merging |

## Prerequisites

### Required GCP APIs

| API | Service name |
| --- | --- |
| Knowledge Catalog | `dataplex.googleapis.com` |
| Vertex AI | `aiplatform.googleapis.com` |
| Service Usage | `serviceusage.googleapis.com` |

### Required IAM permissions

| Permission | Typical role |
| --- | --- |
| `dataplex.projects.search` | `roles/dataplex.viewer` |
| `aiplatform.endpoints.predict` | `roles/aiplatform.user` |
| `serviceusage.services.use` | `roles/serviceusage.serviceUsageConsumer` |

<Note>
Configure Application Default Credentials before running the agent. See [Installation](/installation) for `gcloud auth application-default login` and project setup.
</Note>

## Install dependencies

<Steps>
<Step title="Clone and enter the sample">

```bash
git clone https://github.com/GoogleCloudPlatform/knowledge-catalog.git
cd knowledge-catalog/samples/discovery
```

</Step>
<Step title="Create a virtual environment and install packages">

```bash
python3 -m venv /tmp/kcsearch
source /tmp/kcsearch/bin/activate
pip3 install -r requirements.txt
```

Packages from `requirements.txt`:

| Package | Purpose |
| --- | --- |
| `google-adk` | ADK agent runtime, `llm_agent.Agent`, ADK CLI |
| `google-cloud-dataplex` | `CatalogServiceClient` for Knowledge Catalog search |
| `google-api-core` | API error types (`PermissionDenied`) |

</Step>
</Steps>

## Environment variables

<ParamField body="GOOGLE_CLOUD_PROJECT" type="string" required>
Consumer GCP project ID. Used by `get_consumer_project()` in `utils.py` to build the Vertex model path and the Knowledge Catalog search parent `projects/{id}/locations/global`. Raises `ValueError` if unset.
</ParamField>

<ParamField body="GOOGLE_GENAI_USE_VERTEXAI" type="boolean" required>
Set to `True` so ADK routes Gemini calls through Vertex AI instead of the Gemini API.
</ParamField>

<RequestExample>

```bash
export GOOGLE_CLOUD_PROJECT=my-consumer-project
export GOOGLE_GENAI_USE_VERTEXAI=True
```

</RequestExample>

The agent resolves the model at startup:

```python
GEMINI_MODEL = f"projects/{consumer_project}/locations/global/publishers/google/models/gemini-3-flash-preview"
```
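
The project lookup and model path can be sketched as below. This is an approximation of the behavior described on this page, not the contents of `utils.py`: the helper `model_path` is hypothetical, and the error message is taken from the troubleshooting table.

```python
import os


def get_consumer_project() -> str:
    """Read the consumer project from the environment, failing loudly if unset."""
    project = os.environ.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise ValueError("GOOGLE_CLOUD_PROJECT environment variable is required")
    return project


def model_path(project: str, model: str = "gemini-3-flash-preview") -> str:
    """Build the Vertex AI publisher model path for a project (hypothetical helper)."""
    return f"projects/{project}/locations/global/publishers/google/models/{model}"
```

Calling `model_path(get_consumer_project())` with `GOOGLE_CLOUD_PROJECT=my-consumer-project` yields a path under `projects/my-consumer-project/locations/global/`.
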

## Deployment patterns

The sample supports two ADK integration paths. Both use `adk run` against the **parent folder** that contains the agent package directory.

### Pattern 1: Root agent

Use when the discovery agent is the only agent in the deployment.

1. In `samples/discovery/agent.py`, rename `discovery_agent` to `root_agent`.
2. Run ADK against the parent of the agent source directory.

For the stock sample layout, the agent source lives in `samples/discovery/`, so the parent is `samples/`:

```bash
adk run samples
```

<Warning>
ADK requires the exported symbol to be named `root_agent`. The sample ships with `discovery_agent` so it can also be imported as a sub-agent.
</Warning>

### Pattern 2: Sub-agent via AgentTool

Use when a custom orchestrator delegates catalog search to the discovery agent. Copy the discovery package into your parent agent folder:

:::files
my_custom_agent/
├── agent.py
└── knowledge_catalog_discovery_agent/
    ├── SKILL.md
    ├── agent.py
    ├── tools.py
    └── utils.py
:::

Import `discovery_agent` from the copied package and wrap it with ADK `AgentTool` per the [ADK multi-agent docs](https://adk.dev/agents/multi-agents/#c-explicit-invocation-agenttool). Run against the parent folder:

```bash
adk run my_custom_agent
```

<Info>
The sub-agent is registered as `knowledge_catalog_discovery_agent` (see `agent.py` `name=` and `SKILL.md` frontmatter). Parent agents invoke it explicitly through `AgentTool` rather than as the default root.
</Info>

## Agent definition

The agent is constructed in `agent.py`:

| Field | Value |
| --- | --- |
| `name` | `knowledge_catalog_discovery_agent` |
| `description` | Searches Knowledge Catalog for data entries based on natural-language user queries |
| `model` | `google_llm.Gemini(model=GEMINI_MODEL)` |
| `instruction` | Contents of `SKILL.md` loaded by `load_instruction()` |
| `tools` | `[tools.knowledge_catalog_search]` |

## Search tool reference

`knowledge_catalog_search(query: str)` in `tools.py` calls the Knowledge Catalog Search API.

| Request field | Value |
| --- | --- |
| `name` | `projects/{GOOGLE_CLOUD_PROJECT}/locations/global` |
| `query` | Natural-language or predicate-qualified search string |
| `page_size` | `50` |
| `semantic_search` | `True` |
| API endpoint | `dataplex.googleapis.com` |

<ResponseField name="results" type="array">
On success, a list of objects with `entry_name`, `system`, `resource_id`, and `display_name` extracted from `result.dataplex_entry`.
</ResponseField>

<ResponseExample>

```json
{
  "results": [
    {
      "entry_name": "projects/my-project/locations/global/entryGroups/@bigquery/entries/my-table",
      "system": "BIGQUERY",
      "resource_id": "my-project.my_dataset.my_table",
      "display_name": "my_table"
    }
  ]
}
```

</ResponseExample>

Error shapes returned to the LLM:

| Returned value | Condition |
| --- | --- |
| `{"Error obtaining consumer project": "..."}` | `GOOGLE_CLOUD_PROJECT` missing |
| `{"error": "Permission denied: ..."}` | `PermissionDenied` from the API |
| `{"error": "An unexpected error occurred: ..."}` | Other exceptions |
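
If you wrap the tool in your own code, the shapes above can be told apart with a small classifier. This is an illustrative sketch: `classify_tool_result` and its return labels are hypothetical and not part of the sample.

```python
def classify_tool_result(result: dict) -> str:
    """Map a knowledge_catalog_search return value to a coarse status label."""
    if "results" in result:
        return "ok"
    if "Error obtaining consumer project" in result:
        return "missing_project"
    message = str(result.get("error", ""))
    if message.startswith("Permission denied"):
        return "permission_denied"
    return "unexpected_error"


print(classify_tool_result({"results": []}))
print(classify_tool_result({"error": "Permission denied: caller lacks access"}))
```
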

## Agent search behavior

`SKILL.md` drives multi-step retrieval beyond a single API call:

1. **Understand the query** — preserve user-supplied predicates such as `type=table`.
2. **Semantic decomposition** — break business questions into data-engineering terms; generate up to three distinct query variations plus a **baseline search** (the verbatim user request).
3. **Predicate extraction** — map keywords to official predicates; embed `projectid=` constraints inside the `query` string argument.
4. **Parallel search** — batch searches to minimize round trips.
5. **Merge and rank** — deduplicate by `entry_name`, filter irrelevant hits, sort by relevance, return full entry names.
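
In the sample, merging and ranking are carried out by the LLM following `SKILL.md`. The deduplication part of step 5 is deterministic, though, and can be sketched as below. The function `merge_results` is hypothetical, and relevance ranking is deliberately not modeled.

```python
def merge_results(batches: list[list[dict]]) -> list[dict]:
    """Deduplicate search hits by entry_name, keeping the first occurrence."""
    merged: dict[str, dict] = {}
    for batch in batches:
        for entry in batch:
            merged.setdefault(entry["entry_name"], entry)
    return list(merged.values())


baseline = [{"entry_name": "entries/orders", "display_name": "orders"}]
variation = [
    {"entry_name": "entries/orders", "display_name": "orders"},
    {"entry_name": "entries/users", "display_name": "users"},
]
demo_merged = merge_results([baseline, variation])
print([entry["entry_name"] for entry in demo_merged])
```
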

<AccordionGroup>
<Accordion title="Official search predicates">

| Predicate | Operators | Common triggers |
| --- | --- | --- |
| `type` | `=` | `table`, `dataset` |
| `system` | `=` | `bigquery`, `cloud_sql`, `dataplex` |
| `description` | `=` | `description` (only when user explicitly refers to description) |
| `name` | `:`, `=`, `!=` | `name` (only when user explicitly refers to resource name) |
| `displayname` | `:`, `=`, `!=` | `display name` |
| `projectid` | `=`, `:` | `project`, `project id` |
| `parent` | `=`, `:` | `parent` |

Logical operators `AND` and `OR` must be uppercase. Negation uses a leading hyphen (for example `-name:foo`). Knowledge Catalog search does not interpret double quotes in free text.

</Accordion>
<Accordion title="Example predicate queries">

| Natural language | Search query |
| --- | --- |
| BigQuery tables containing foo in project bar | `system=bigquery AND type=table AND name:foo AND projectid=bar` |
| Tables not containing foo | `type=table AND -name:foo` |
| Tables from project foo-1 or bar-1 | `type=table AND (projectid:foo-1 OR projectid:bar-1)` |
| All datasets | `type=dataset` |

</Accordion>
</AccordionGroup>
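
The predicate queries in the examples are plain strings, so they can be composed with simple helpers. The sketch below is illustrative only; `all_of`, `any_of`, and `negate` are hypothetical names and are not part of `tools.py`.

```python
def all_of(*clauses: str) -> str:
    """Join clauses with an uppercase AND."""
    return " AND ".join(clauses)


def any_of(*clauses: str) -> str:
    """Join clauses with an uppercase OR and wrap the group in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def negate(clause: str) -> str:
    """Negate a clause with a leading hyphen."""
    return "-" + clause


print(all_of("system=bigquery", "type=table", "name:foo", "projectid=bar"))
print(all_of("type=table", negate("name:foo")))
print(all_of("type=table", any_of("projectid:foo-1", "projectid:bar-1")))
```

The resulting string is what would be passed as the `query` argument of `knowledge_catalog_search`.
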

## Verification

After `adk run`, send a natural-language query such as *"Show me BigQuery tables in project my-project"*.

Expected signals:

- The agent issues one or more `knowledge_catalog_search` calls with predicates embedded in the query string.
- Successful responses include `results` with `entry_name`, `system`, `resource_id`, and `display_name`.
- Permission failures surface the `Permission denied` error string from the tool rather than crashing the agent loop.

## Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `GOOGLE_CLOUD_PROJECT environment variable is required` | Missing env var | Export `GOOGLE_CLOUD_PROJECT` before `adk run` |
| `Permission denied` in tool output | Missing `dataplex.projects.search` | Grant `roles/dataplex.viewer` or equivalent on the consumer project |
| Vertex AI auth errors | Missing ADC or `GOOGLE_GENAI_USE_VERTEXAI` | Run `gcloud auth application-default login`; set `GOOGLE_GENAI_USE_VERTEXAI=True` |
| `adk run` cannot find agent | Wrong folder or symbol name | Use parent folder path; ensure `root_agent` is exported for standalone mode |
| Empty or irrelevant results | Query lacks predicates or uses double quotes | Follow `SKILL.md` predicate rules; avoid quoted free text |

See [Troubleshooting](/troubleshooting) for cross-cutting auth and billing issues.

## Related pages

<CardGroup>
<Card title="Overview" href="/overview">
Knowledge Catalog tooling surface: discovery agents, enrichment agents, OKF bundles, and kcmd workspaces.
</Card>
<Card title="Installation" href="/installation">
Python setup, package installs, and Application Default Credentials for Vertex AI and BigQuery.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How enrichment agents produce metadata context that discovery agents later search.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Pull catalog entries into a kcmd workspace so enrichment and discovery share the same metadata layer.
</Card>
</CardGroup>

---

## 13. Evaluate enrichment output

> Score enrichment runs with dynamic golden-free metrics or golden-based eval: structural validity, hallucination checks, fact recall, consistency across runs, and report artifacts.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/13-evaluate-enrichment-output.md
- Generated: 2026-06-15T02:55:19.585Z

### Source Files

- `agents/enrichment/eval/__main__.py`
- `agents/enrichment/eval/dynamic_eval.py`
- `agents/enrichment/eval/golden_eval.py`
- `agents/enrichment/eval/metrics.py`
- `agents/enrichment/eval/goldens/README.md`
- `agents/enrichment/eval/goldens/TEMPLATE.json`

---
title: "Evaluate enrichment output"
description: "Score enrichment runs with dynamic golden-free metrics or golden-based eval: structural validity, hallucination checks, fact recall, consistency across runs, and report artifacts."
---

The `agents/enrichment/eval` package scores enrichment agent output from `agents/enrichment/`. Run `python -m eval` from `agents/enrichment/` to score an existing output directory (`catalog/` plus `trajectory.json`) or to execute golden cases end-to-end with `--run`. Scores are normalized 0–1 internally and displayed 0–100 in the terminal scorecard and Markdown reports.

<Info>
Judge-based metrics use Vertex AI via Application Default Credentials. Set `GOOGLE_CLOUD_PROJECT` (or pass `--project` in `--run` mode) and run `gcloud auth application-default login`. Without auth, deterministic metrics still run; judge metrics show `n/a`.
</Info>

## Evaluation modes

| Mode | Command pattern | Input | Output |
|------|-----------------|-------|--------|
| **Score (dynamic)** | `--output-dir <dir>` | Agent output dir with `catalog/` and `trajectory.json` | Scorecard + `eval_report.md` in the output dir |
| **Score (golden)** | `--output-dir <dir> --golden <file>` | Same, plus a golden JSON answer key | Scorecard + report in `$TMPDIR/kc_golden_eval_reports/` |
| **Run case** | `--run --goldens <file> --project <p>` | Golden with a `run` block | Agent runs N times, then scores each; reports in a timestamped batch folder |

```text
agents/enrichment/
├── eval/
│   ├── __main__.py          # CLI entry: python -m eval
│   ├── dynamic_eval.py      # Golden-free scoring
│   ├── golden_eval.py       # Golden-based scoring
│   ├── metrics.py           # Deterministic + LLM-as-judge metrics
│   ├── aggregate.py         # Multi-run roll-up + consistency
│   ├── runner.py            # --run agent orchestration
│   ├── loaders.py           # catalog/ + trajectory.json readers
│   └── goldens/             # Bundled cases + TEMPLATE.json
└── src/agent_runner.py      # Spawned by --run
```

## Prerequisites

<Steps>
<Step title="Install dependencies">

From `agents/enrichment/`:

```bash
pip install -r eval/requirements.txt
```

For `--run` (which spawns the agent), also install agent dependencies and build `kcmd`:

```bash
pip install -r src/requirements.txt
cd ../mdcode && npm run build
```

</Step>

<Step title="Configure Vertex AI auth">

```bash
export GOOGLE_CLOUD_PROJECT=<your-project>
gcloud auth application-default login
```

In `--run` mode, the CLI sets `GOOGLE_CLOUD_PROJECT` from `--project` and sets `GOOGLE_GENAI_USE_VERTEXAI=True`.

</Step>

<Step title="Produce or locate agent output">

Score mode expects an output directory containing:

- `catalog/` — generated Metadata-as-Code (entry YAML, `.overview.md`, optional `.queries.md`)
- `trajectory.json` — tool calls, responses, token usage, latency, `agent_type`

This is the same `--output_dir` passed to `agent_runner.py`.

</Step>
</Steps>
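
Before scoring, you may want to confirm that an output directory has the layout score mode expects. The pre-flight check below is a sketch, not part of the `eval` package: `check_output_dir` is a hypothetical helper, and it assumes `agent_type` is a top-level key of `trajectory.json`.

```python
import json
import tempfile
from pathlib import Path


def check_output_dir(output_dir: Path) -> list[str]:
    """Return a list of problems that would prevent scoring (empty if none)."""
    problems = []
    if not (output_dir / "catalog").is_dir():
        problems.append("missing catalog/ directory")
    trajectory = output_dir / "trajectory.json"
    if not trajectory.is_file():
        problems.append("missing trajectory.json")
        return problems
    try:
        data = json.loads(trajectory.read_text())
    except json.JSONDecodeError:
        problems.append("trajectory.json is not valid JSON")
        return problems
    # Assumption: agent_type sits at the top level of the trajectory object.
    if not isinstance(data, dict) or "agent_type" not in data:
        problems.append("trajectory.json has no agent_type")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    empty_problems = check_output_dir(out)
    (out / "catalog").mkdir()
    (out / "trajectory.json").write_text(json.dumps({"agent_type": "doc"}))
    ready_problems = check_output_dir(out)

print(empty_problems)
print(ready_problems)
```
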

## Score an existing run

### Dynamic (golden-free) eval

Dynamic eval needs no reference answers. It grounds hallucination checks in `trajectory.json` tool responses and scores structural validity, performance, groundedness, and rubric dimensions.

```bash
cd agents/enrichment
python -m eval --output-dir /tmp/enrich_out
python -m eval --output-dir /tmp/enrich_out --model gemini-2.5-pro
```

<Check>
Verification: the terminal prints a scorecard with metrics out of 100, and `eval_report.md` appears next to `trajectory.json` with full untruncated rationales.
</Check>

### Golden-based eval

Pass `--golden` to compare output against a hand-authored answer key. Golden eval runs the full dynamic metric set plus golden-specific metrics (concept recall, fact recall, section coverage, and others depending on mode).

```bash
python -m eval --output-dir /tmp/enrich_out --golden eval/goldens/supply_chain.json
python -m eval --output-dir /tmp/enrich_out --golden eval/goldens/thelook_ecommerce.json --persona analyst
```

Score several goldens at once:

```bash
python -m eval --output-dir /tmp/enrich_out \
  --goldens eval/goldens/supply_chain.json,eval/goldens/phone_services.json
```

Golden reports land in `$TMPDIR/kc_golden_eval_reports/` as `golden_report_<golden>__<run>.md`.

## Run golden cases end-to-end

`--run` generates Metadata-as-Code via the agent (using the golden's `run` block), repeats each case `--runs` times, scores every run, and aggregates results.

```bash
python -m eval --run --goldens eval/goldens/thelook_ecommerce.json \
  --project <your_gcp_project> --model gemini-2.5-pro --runs 3
```

Run multiple bundled cases:

```bash
python -m eval --run --project <p> --goldens \
  eval/goldens/thelook_ecommerce.json,eval/goldens/financial_services.json,\
eval/goldens/phone_services.json,eval/goldens/supply_chain.json
```

Dry-run to inspect the plan without executing:

```bash
python -m eval --run --goldens eval/goldens/supply_chain.json --project <p> --dry-run
```

### Bundled runnable goldens

| Golden | Mode | Setup |
|--------|------|-------|
| `thelook_ecommerce.json` | table | Copies `bigquery-public-data.thelook_ecommerce` into your project |
| `financial_services.json` | doc | Grounds on `eval/corpora/financial_services` |
| `phone_services.json` | doc | Grounds on `eval/corpora/phone_services` |
| `supply_chain.json` | doc | Grounds on `eval/corpora/supply_chain` |

Doc-mode goldens use `{project}.global.kc-eval-<name>` entry groups; `{project}` is replaced at run time.

## Metrics reference

Scores are 0–1 internally. The scorecard and reports display 0–100. `None` means the metric self-skipped (excluded from the average).
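
A minimal sketch of that averaging rule follows. It assumes a flat mapping of metric name to score and a hypothetical helper name, `scorecard_average`; whether report-only or informational metrics are filtered out beforehand is not modeled here.

```python
from typing import Optional


def scorecard_average(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Average the non-skipped scores (0 to 1) and return the result on a 0 to 100 scale."""
    present = [score for score in scores.values() if score is not None]
    if not present:
        return None  # every metric self-skipped
    return round(100 * sum(present) / len(present), 1)


demo_scores = {
    "structural_validity": 1.0,
    "hallucination_free": 0.5,
    "redundancy_index": None,  # self-skipped, so excluded from the average
}
print(scorecard_average(demo_scores))
```
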

### Dynamic metrics (always attempted)

| Metric | Type | Gated | Description |
|--------|------|-------|-------------|
| `structural_validity` | Deterministic | Yes | Entry YAML parses, required fields present, entry type matches mode, overviews are clean Markdown |
| `perf` | Deterministic | No | Token usage, latency, output size — report-only, always passes |
| `hallucination_free` | Judge | Yes | Fraction of extracted factual claims grounded in retrieved source (+ table schema in table mode) |
| `redundancy_index` | Judge | Yes | Novel synthesis vs tautological schema restatement |
| `disambiguation_efficacy` | Judge | Yes | Entry grain and purpose explicit enough to distinguish from similar entries |
| `absence_of_contradictions` | Judge | Yes | No conflicting join keys, enums, metrics, or freshness across entries |

### Golden-specific metrics (when declared)

| Metric | Type | Mode | Golden field | Description |
|--------|------|------|--------------|-------------|
| `concept_recall` | Judge | doc | `expected_topics` | Expected concepts produced as entries (semantic match) |
| `concept_precision` | Judge | doc | `expected_topics`, `acceptable_extra_concepts` | Produced entries map to expected concepts |
| `fact_recall` | Judge | doc / table | `expected_topics[].golden_facts` or `tables[].golden_facts` | Golden facts conveyed in matched entries |
| `enrichment_diversity` | Deterministic | both | `expected_headings` | Expected sections present (queries sidecar satisfies "Sample Queries") |
| `business_terms_presence` | Judge | both | `business_terms` | Expected terms covered (semantic/flavor match) |
| `business_terms_validity` | Judge | both | `business_terms` | Terms have dedicated per-term Metadata-as-Code files; scores are typically low today |
| `entry_grounding` | Deterministic | table | — | Generated entries correspond to real dataset tables |
| `trajectory` | Deterministic | both | `trajectory` | `must_call` / `must_not_call` tool categories from `trajectory.json` |
| `context_preservation` | Judge | both | `prebaked_facts` | Pre-existing facts preserved through enrichment |
| `persona_alignment` | Judge | doc | `personas` + `--persona` | Output emphasizes persona focus areas, retains shared concepts |

Default pass thresholds for concept and fact metrics: 0.7 (`concept_recall`, `concept_precision`, `fact_recall`).

### Cross-run consistency metrics

When `--run --runs N` produces N independent agent outputs (distinct output dirs), two informational stability metrics are added. They never gate the case and are excluded from the average.

| Metric | Description |
|--------|-------------|
| `concept_consistency` | Same set of concepts produced across runs (semantic match via judge) |
| `content_consistency` | Recurring concepts state consistent facts across runs |

With fewer than two independent runs, consistency metrics show `n/a` with guidance to use `--run --runs N`.

<Warning>
`--runs` on a single `--output-dir` in score mode re-scores the same output and does not measure cross-run stability. Use `--run --runs N` for independent runs.
</Warning>

## Report artifacts

### Dynamic eval report

Written to `<output_dir>/eval_report.md` alongside `trajectory.json`. Contains full rationales, insights, per-run breakdowns (when aggregated), and telemetry.

### Golden eval reports

| Artifact | Location | Contents |
|----------|----------|----------|
| Per-run report | `$TMPDIR/kc_golden_eval_reports/golden_run_<time>_<id>/<golden>/run<N>.md` | Full metrics for one agent run |
| Aggregate report | `.../<golden>/aggregate.md` | Mean scores, per-metric `run_scores`, per-run breakdown |
| Manifest | `.../manifest.json` | Run ID, project, model, cases, per-case averages |
| Score-only golden | `$TMPDIR/kc_golden_eval_reports/golden_report_<golden>__<run>.md` | Report when scoring existing output |

### Terminal scorecard

The CLI prints a formatted table with metric name, score (0–100), and truncated rationale. Multi-run cases prefix rationales with `runs k/n [s1, s2, …]`. Use `--json` for machine-readable output.

<ResponseExample>

```text
Dynamic eval — /tmp/enrich_out/mdcode
  mode: doc  (agent_type=doc)

  metric                              score   rationale
  ----------------------------------- ------- ----------------------------------------
  structural_validity                 100.0   All 8 generated entries are valid...
  perf                                100.0   Completed in 142s. Used 48,231 tokens...
  hallucination_free                   92.3   2 of 27 claims unsupported...
  redundancy_index                     78.5   ...
  ----------------------------------- -------
  AVERAGE                              89.2

  tokens: 48,231 (in 41,102 / out 7,129)  ·  tool calls: 14  ·  latency: 142.0s
```

</ResponseExample>

## Golden file schema

Start from `eval/goldens/TEMPLATE.json` and keep only the fields that match your enrichment mode.

```jsonc
{
  "expected_topics": [
    {
      "canonical": "Reorder Point",
      "flavor_hints": ["reorder level", "ROP"],
      "golden_facts": ["ROP = average daily demand × lead time, plus safety stock."]
    }
  ],
  "acceptable_extra_concepts": [{"name": "SKU", "aliases": ["stock keeping unit"]}],
  "tables": [{"table": "order_items", "golden_facts": ["grain is one unit per order line"]}],
  "expected_headings": ["Lineage", "Sample Queries"],
  "business_terms": ["session", "event"],
  "trajectory": {"must_call": [], "must_not_call": ["dataset_pull"]},
  "personas": {
    "analyst": {
      "instruction": "Focus on operational metrics",
      "focus_areas": ["inventory turnover"],
      "shared_concepts": ["order lifecycle"]
    }
  },
  "run": {
    "mode": "table",
    "topic": "Metadata enrichment",
    "folders": "eval/corpora/my_corpus",
    "entry_group": "{project}.global.kc-eval-foo",
    "setup": {
      "copy_public_dataset": {
        "source": "bigquery-public-data.thelook_ecommerce",
        "dataset": "thelook_ecommerce"
      }
    }
  }
}
```

Mode is auto-detected from `trajectory.json` `agent_type` (`doc`, `table`, or `context_overlay`). Context overlay runs skip entry-type checks and table-only metrics.
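
To see which golden-specific metrics a golden file would declare, you can map its fields back to the metrics table. The sketch below is an illustration built from that table, not code from `golden_eval.py`: `declared_metrics` and `resolve_entry_group` are hypothetical helpers, and the mapping ignores mode and CLI flags such as `--persona`.

```python
from typing import Optional

# Golden field -> metrics it enables, per the golden-specific metrics table.
FIELD_TO_METRICS = {
    "expected_topics": ["concept_recall", "concept_precision", "fact_recall"],
    "tables": ["fact_recall"],
    "expected_headings": ["enrichment_diversity"],
    "business_terms": ["business_terms_presence", "business_terms_validity"],
    "trajectory": ["trajectory"],
    "prebaked_facts": ["context_preservation"],
    "personas": ["persona_alignment"],
}


def declared_metrics(golden: dict) -> list[str]:
    """List metrics enabled by the non-empty fields of a golden, without duplicates."""
    metrics: list[str] = []
    for field, names in FIELD_TO_METRICS.items():
        if golden.get(field):
            metrics.extend(name for name in names if name not in metrics)
    return metrics


def resolve_entry_group(golden: dict, project: str) -> Optional[str]:
    """Substitute the {project} placeholder in the run block's entry_group."""
    template = golden.get("run", {}).get("entry_group")
    return template.replace("{project}", project) if template else None


demo_golden = {
    "expected_headings": ["Lineage", "Sample Queries"],
    "trajectory": {"must_call": [], "must_not_call": ["dataset_pull"]},
    "run": {"entry_group": "{project}.global.kc-eval-foo"},
}
print(declared_metrics(demo_golden))
print(resolve_entry_group(demo_golden, "my-project"))
```
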

### Building goldens

1. **Author deliberately** — write `golden_facts` and `expected_topics` for scenarios you care about.
2. **Work backward from documented data** — hold out human-written descriptions, run the agent, use held-out text as `golden_facts`.
3. **Harvest from human review** — capture approved or corrected entries as goldens.

<Note>
A golden is an imperfect oracle. Spot-check low scores — a correct output not listed in the golden can register as a false miss.
</Note>

## CLI flags

<ParamField body="--output-dir" type="string">
Agent output directory containing `catalog/` and `trajectory.json`. Required in score mode. Accepts a comma-separated list to pair with multiple `--goldens`. In `--run` mode, it is an optional report root (defaults to `$TMPDIR/kc_golden_eval_reports`).
</ParamField>

<ParamField body="--golden" type="string">
Single golden JSON file for golden-based scoring.
</ParamField>

<ParamField body="--goldens" type="string">
Comma-separated golden files. Run or score several cases at once.
</ParamField>

<ParamField body="--run" type="boolean">
Execute each golden's `run` block on the agent, then score. Requires `--project`.
</ParamField>

<ParamField body="--project" type="string" required>
GCP project for agent execution and dataset copy. Required whenever `--run` is set. Also sets `GOOGLE_CLOUD_PROJECT` for the judge.
</ParamField>

<ParamField body="--runs" type="integer">
Number of times to run each case. Defaults to 3 in `--run` mode. Valid only with golden case-runs; rejected for dynamic eval on a single output directory.
</ParamField>

<ParamField body="--concurrency" type="integer">
Max concurrent agent processes in `--run`. Default 2, overridable via `KC_EVAL_MAX_CONCURRENCY`.
</ParamField>

<ParamField body="--model" type="string">
Vertex AI model for agent (`--run`) and judge. Default `gemini-2.5-pro`.
</ParamField>

<ParamField body="--persona" type="string">
Persona ID from the golden's `personas` block. Golden mode only.
</ParamField>

<ParamField body="--dry-run" type="boolean">
Print the `--run` plan without executing.
</ParamField>

<ParamField body="--json" type="boolean">
Emit raw JSON results instead of the formatted scorecard.
</ParamField>

## Mode-specific behavior

### Doc mode

Uses `expected_topics` for concept recall/precision and fact recall. Trajectory checks derive tool categories from `trajectory.json` `tool_uses` (mapped to `drive_fetch`, `dataset_pull`, `github_fetch`).

### Table mode

Uses `tables[].golden_facts` for fact recall. Adds `entry_grounding` (no invented tables). Hallucination grounding includes pulled schema, reference sidecars (`.ref.yaml`, `.ref.overview.md`), and generated YAML.

### Context overlay mode

Keeps `agent_type=context_overlay`. Structural validity skips entry-type enforcement. Table-only metrics (`entry_grounding`, per-table fact recall) do not apply. Score with `business_terms`, `expected_headings`, and trajectory-grounded `hallucination_free`.

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Judge metrics show `n/a` | Missing Vertex auth | Set `GOOGLE_CLOUD_PROJECT`, run `gcloud auth application-default login` |
| `[score] skip … no trajectory.json` | Agent run incomplete | Check agent dependencies, confirm `kcmd` is built, and read the agent logs in the output directory |
| `error: --run needs --project` | Missing project in run mode | Pass `--project <your-gcp-project>` |
| `error: --runs only applies to golden case-runs` | `--runs` with dynamic eval | Drop `--runs` or use `--run --goldens` |
| Agent exits non-zero in `--run` | Setup or quota failure | Check `[run] FAILED` stderr tail; verify BigQuery access for dataset copy |
| Consistency metrics `n/a` | Single independent run | Use `--run --runs 3` for distinct outputs |
| `business_terms_validity` low | Agent emits no per-term files | Expected gap; check `business_terms_presence` instead |
| `context_preservation` low | Agent regenerates from scratch | Merge-into-existing path not yet implemented |

<AccordionGroup>
<Accordion title="Environment variables">

| Variable | Purpose |
|----------|---------|
| `GOOGLE_CLOUD_PROJECT` | Vertex AI project for judge (and agent in `--run`) |
| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `True` by `--run` |
| `GOOGLE_CLOUD_LOCATION` | Vertex location (default `global`) |
| `KC_EVAL_MAX_CONCURRENCY` | Default `--concurrency` cap (default 2) |
| `KC_AGENT_DIR` | Override path to `agents/enrichment/src` |

</Accordion>

<Accordion title="Hallucination check internals">

The judge extracts atomic domain claims from overviews, then verifies each claim against the full grounding corpus in overlapping chunks (45K chars, 1.5K overlap). Claims are checked in parallel (up to 3 workers). A claim is hallucinated only if no chunk supports it. Table mode adds schema/reference metadata to the grounding corpus.
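
The chunking rule above can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's code: `chunk_text`, `is_hallucinated`, and the `supports` callback (standing in for the judge model call) are hypothetical names, and the parallel workers are omitted.

```python
from typing import Callable, List


def chunk_text(text: str, size: int = 45_000, overlap: int = 1_500) -> List[str]:
    """Split text into windows of `size` chars; neighbours share `overlap` chars."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window reached the end of the corpus
        start += step
    return chunks


def is_hallucinated(claim: str, corpus: str,
                    supports: Callable[[str, str], bool]) -> bool:
    """A claim is hallucinated only if no chunk of the corpus supports it."""
    return not any(supports(claim, chunk) for chunk in chunk_text(corpus))
```

The overlap keeps a supporting sentence that straddles a window boundary visible in at least one chunk, which is why a single supporting chunk is enough to clear a claim.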

</Accordion>
</AccordionGroup>

## Next

<CardGroup>
<Card title="Run catalog enrichment agent" href="/run-catalog-enrichment-agent">
Execute table, doc, or context_overlay enrichment modes and produce the output directories this evaluator scores.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
End-to-end flow from source metadata through enrichment to catalog publication.
</Card>
<Card title="Produce OKF bundles" href="/produce-okf-bundles">
Alternative enrichment path that emits OKF bundles instead of mdcode workspaces.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push scored mdcode workspaces to Knowledge Catalog with kcmd after eval passes your bar.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth, billing, kcmd build, and model credential failures across the tooling surface.
</Card>
</CardGroup>

---

## 14. kcmd CLI reference

> kcmd commands, init flags per source type, pull and push options including dry-run, force, validate-only, reference pull, and authentication via gcloud ADC.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/14-kcmd-cli-reference.md
- Generated: 2026-06-15T02:55:37.457Z

### Source Files

- `agents/mdcode/src/tool/main.ts`
- `agents/mdcode/src/tool/commands.ts`
- `agents/mdcode/README.md`
- `toolbox/mdcode/src/tool/main.ts`
- `toolbox/mdcode/src/tool/commands.ts`
- `agents/mdcode/src/libts/gcp/context.ts`


`kcmd` is the Metadata as Code CLI in `agents/mdcode`. It initializes a workspace (`catalog.yaml` plus a `catalog/` tree), syncs Knowledge Catalog (Dataplex) metadata with `pull` and `push`, fetches read-only reference layers with `reference`, and exposes an MCP server for agent workflows. All catalog API calls authenticate through gcloud Application Default Credentials (ADC).

<Info>
The published npm package name is `kcmd` (version `0.1.0`). The compiled binary reports CLI version `1.0.0`. Build from source with `npm run build` in `agents/mdcode`, or run via `npx kcmd`.
</Info>

## Command summary

| Command | Purpose | Key flags |
| --- | --- | --- |
| `kcmd init` | Create `catalog.yaml` for a source scope | Source-type flags (exactly one required), `--pull` |
| `kcmd pull` | Download editable metadata into `catalog/` | `--dry-run` |
| `kcmd push` | Publish local edits to the catalog service | `--dry-run`, `--force`, `--validate-only` |
| `kcmd reference` | Pull read-only `.ref.yaml` reference layers | None |
| `kcmd mcp` | Start the MCP server over stdio | `--path` |

`pull`, `push`, and `reference` require a workspace root containing `catalog.yaml`. Command handlers resolve the snapshot from the current working directory (`.`).

## Authentication

`kcmd` does not accept API keys on the command line. `ApiContext.default()` shells out to gcloud for project, region, and token:

```bash
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
gcloud config set compute/region YOUR_REGION
```

On HTTP 401 responses, the API client refreshes the ADC token and retries once. If project, region, or token cannot be resolved, initialization fails with:

`Unable to retrieve project, location, or token. Ensure gcloud is configured.`

Set `GCP_LOG` to any truthy value to enable request/response debug logging.
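
The refresh-and-retry-once behavior on HTTP 401 can be pictured with a minimal sketch. All names here (`call_with_refresh`, `send`, `get_token`, `refresh_token`) are hypothetical stand-ins; the actual client lives in the TypeScript sources and may differ in detail.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status, body), simplified for illustration


def call_with_refresh(send: Callable[[str], Response],
                      get_token: Callable[[], str],
                      refresh_token: Callable[[], str]) -> Response:
    """Send a request; on 401, refresh the token and retry exactly once."""
    status, body = send(get_token())
    if status == 401:
        # Assumption: a 401 means the cached ADC token expired.
        status, body = send(refresh_token())
    return status, body
```

Retrying only once means a second 401 is returned to the caller instead of looping, which surfaces a real credential problem quickly.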

## `kcmd init`

Creates `catalog.yaml` in the current directory and prints the manifest to stdout. Provide **exactly one** primary source flag. An optional `--pull` runs `kcmd pull` immediately after initialization.

### Source types and ID formats

| Source type | Flag | ID format | Local layout |
| --- | --- | --- | --- |
| BigQuery dataset | `--bigquery-dataset` | `projectId.datasetId` | YAML (`.yaml`) |
| Knowledge base | `--kb` | `projectId.locationId.entryGroupId` | Markdown (`.md`) |
| Entry group | `--entry-group` | `projectId.locationId.entryGroupId` | YAML |
| BigLake Iceberg namespace | `--biglake-namespace` + `--iceberg` | `projectId.catalogId.namespaceId` | YAML |
| Business glossary | `--glossary` | See glossary formats below | YAML under `catalog/glossaries/` |

<ParamField body="--entry-group" type="string">
Dataplex EntryGroup identifier as `project.location.id`. The entry group does not need to exist yet; init succeeds and the group can be created on a later `push`.
</ParamField>

<ParamField body="--bigquery-dataset" type="string[]">
One or more BigQuery datasets as `project.datasetId`. Repeat the flag for multiple datasets (`--bigquery-dataset ds1 --bigquery-dataset ds2`). Each dataset must exist in BigQuery before init.
</ParamField>

<ParamField body="--kb" type="string">
Knowledge Base EntryGroup as `project.location.id`. Selects the Markdown layout for human-authored wiki content.
</ParamField>

<ParamField body="--biglake-namespace" type="string">
BigLake namespace as `project.catalogId.namespaceId`. Must be paired with `--iceberg`; non-Iceberg metastores are rejected.
</ParamField>

<ParamField body="--iceberg" type="boolean">
Required when using `--biglake-namespace`. Without it, init exits with an error.
</ParamField>

<ParamField body="--glossary" type="string">
Glossary scope in one of these forms:
- **Single glossary by ID**: `project.location.glossary-id`
- **Multiple glossaries**: `project.location.glossary-a,glossary-b` (comma-separated)
- **By display name**: `project.location.My Business Glossary` (falls back from ID lookup)
- **Location mode** (all glossaries in a location): `project.location`
</ParamField>
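
To make the four `--glossary` forms concrete, here is a rough sketch of how such a value could be split. `GlossaryScope` and `parse_glossary_scope` are hypothetical names, and the sketch assumes that neither the project nor the location segment contains a dot; whether a name is an ID or a display name is left to a later lookup.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlossaryScope:
    project: str
    location: str
    # IDs or display names; an empty list means location mode.
    glossaries: List[str] = field(default_factory=list)


def parse_glossary_scope(value: str) -> GlossaryScope:
    """Split `project.location[.name[,name...]]` into its parts."""
    parts = value.split(".", 2)  # keep everything after the 2nd dot together
    if len(parts) < 2 or not parts[0].strip() or not parts[1].strip():
        raise ValueError(f"expected project.location[.glossary], got '{value}'")
    names: List[str] = []
    if len(parts) == 3:
        names = [n.strip() for n in parts[2].split(",") if n.strip()]
    return GlossaryScope(parts[0], parts[1], names)
```

For example, `parse_glossary_scope("my-project.us-central1")` yields an empty `glossaries` list, which corresponds to location mode.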

<ParamField body="--pull" type="boolean">
After writing `catalog.yaml`, immediately run `kcmd pull` to populate `catalog/`.
</ParamField>

<RequestExample>

```bash title="BigQuery workspace"
kcmd init --bigquery-dataset my-project.my_dataset --pull
```

```bash title="Knowledge base (Markdown layout)"
kcmd init --kb my-project.us-central1.my-kb-id
```

```bash title="BigLake Iceberg namespace"
kcmd init --biglake-namespace my-project.my-catalog.my-namespace --iceberg
```

```bash title="Glossary (location mode)"
kcmd init --glossary my-project.us-central1
```

</RequestExample>

If no source flag is provided, init prints an error and exits with code `1`.

## `kcmd pull`

Lists resources in the manifest `scope`, fetches matching entries and configured aspects from the Dataplex Catalog API, and writes them under `catalog/`. When `snapshot.entryLinks` is declared in `catalog.yaml`, `pull` also calls `lookupEntryLinks` and inlines links into entry YAML (column-level links under `aspects.schema.fields[].links`).

<ParamField body="--dry-run" type="boolean">
List resources that would be pulled without writing local files. Each candidate is logged as `[DRY-RUN] Pull Resource: <resource-name>`.
</ParamField>

<ResponseExample>

```text title="Successful pull"
Pulling catalog entries...
Successfully updated local snapshot.
```

</ResponseExample>

Entries the service returns as missing or inaccessible (non-200 `lookupEntry`) are skipped silently during pull.

## `kcmd push`

Reads modifiable local entries (files whose path does not end in `.ref.yaml`), compares them to remote state, and applies creates, updates, and EntryLink reconciliation. Files ending in `.ref.yaml` are never pushed.

Push behavior highlights:

- **Auto-provision entries**: Missing entries trigger `createEntry`; missing parent entry groups trigger `createEntryGroup`.
- **Glossary hierarchy**: `kcmd` does **not** create `Glossary`, `GlossaryCategory`, or `GlossaryTerm` resources. Missing glossary nodes fail fast. Existing glossary resources can have descriptions and labels updated.
- **EntryLink reconciliation**: When `publishing.entryLinks` is set, push compares local and remote links (project ID/number normalized, `@dataplex` proxy unwrapped), deletes remote-only links, and creates local-only links.

<ParamField body="--dry-run" type="boolean">
Simulate push mutations without calling the catalog API. Logs planned creates, modifies, deletes, and EntryLink operations prefixed with `[DRY-RUN]`.
</ParamField>

<ParamField body="--force" type="boolean">
Documented intent: overwrite service metadata, ignoring conflicts. **Not yet wired** in `CatalogSync.push`; passing this flag currently has no effect on sync behavior.
</ParamField>

<ParamField body="--validate-only" type="boolean">
Documented intent: validate the local snapshot against the service without publishing. **Not yet wired** in `CatalogSync.push`; passing this flag currently has no effect on sync behavior.
</ParamField>

<Warning>
Use `--dry-run` to preview push side effects today. `--force` and `--validate-only` are exposed on the CLI but are not implemented in the sync engine yet (`CatalogSync.validate()` and `CatalogSync.status()` are also unimplemented). A planned `kcmd status` command is described in design docs but is not registered in the current CLI.
</Warning>

<RequestExample>

```bash title="Preview push changes"
kcmd push --dry-run
```

```bash title="Publish local edits"
kcmd push
```

</RequestExample>

## `kcmd reference`

Pulls read-only metadata defined in the manifest `reference:` block into sibling `*.ref.yaml` files. The block requires `reference.scope`; `reference.snapshot` (entries, aspects, entryLinks) is optional. When `reference.snapshot.entryLinks` is set, reference pull includes pre-edit link state so diffs against enriched `.yaml` files surface only your changes.

Reference files are indexed separately from editable files and are excluded from `push` via `isModifiable()`.
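
As an illustration of the editable-versus-reference split, the sketch below partitions workspace paths by the `.ref.yaml` suffix described above. It is a Python approximation, not the TypeScript `isModifiable()` itself; `is_modifiable` and `split_workspace` are hypothetical names, and only the suffix rule stated on this page is applied.

```python
from typing import Iterable, List, Tuple

REF_SUFFIX = ".ref.yaml"


def is_modifiable(path: str) -> bool:
    """Editable files are those that do not end in the reference suffix."""
    return not path.endswith(REF_SUFFIX)


def split_workspace(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (editable, reference) lists, preserving input order."""
    editable: List[str] = []
    reference: List[str] = []
    for path in paths:
        (editable if is_modifiable(path) else reference).append(path)
    return editable, reference
```

Only the first list would be a candidate for `push`; the second is kept for grounding and diffing.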

<Steps>
<Step title="Configure reference in catalog.yaml">
Add a `reference:` block with `scope` and `snapshot` matching the grounding metadata you need (for example, authoritative schemas).
</Step>
<Step title="Run reference pull">
From the workspace root, run `kcmd reference`.
</Step>
<Step title="Verify output">
Confirm `*.ref.yaml` siblings appear next to editable entry files under `catalog/`.
</Step>
</Steps>

## `kcmd mcp`

Starts an MCP server on stdio bound to a workspace path. Tools: `list-entries`, `lookup-entry`, and `modify-entry` (local snapshot updates only; does not call `push`).

<ParamField body="--path" type="string" default=".">
Path to the workspace root containing `catalog.yaml`, absolute or relative. Defaults to the current directory when omitted; agent configs typically pass an absolute path.
</ParamField>

<CodeGroup>

```json title="Gemini CLI / MCP client config"
{
  "mcpServers": {
    "kcmd": {
      "command": "npx",
      "args": ["-y", "kcmd", "mcp", "--path", "/absolute/path/to/workspace"]
    }
  }
}
```

```bash title="Local binary"
kcmd mcp --path /path/to/workspace
```

</CodeGroup>

## Exit codes and errors

| Code | Meaning |
| --- | --- |
| `0` | Command succeeded |
| `1` | Command failed (validation error, sync failure, missing manifest, unknown subcommand) |

Sync failures print `Error pulling/pushing catalog entries:` or `Error pulling reference entries:` followed by `result.details`. Uncaught exceptions print `Error: <message>`.

## Workspace layout

After `init` and `pull`, a typical BigQuery workspace looks like:

:::files
/
├── catalog.yaml          # Manifest: scope, snapshot, publishing, reference
└── catalog/
    └── bigquery/
        └── project-id/
            └── dataset-id/
                ├── table.yaml
                └── table.ref.yaml   # From kcmd reference (read-only)
:::

Knowledge Base workspaces use `.md` files with YAML frontmatter instead of standalone `.yaml` entry files. See the `catalog.yaml` manifest reference for the full field list and layout details.

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, package install, and gcloud ADC setup.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
End-to-end init, pull, status check, and push workflow by source type.
</Card>
<Card title="catalog.yaml manifest reference" href="/catalog-manifest-reference">
`scope`, `snapshot`, `publishing`, `reference`, and entry link configuration.
</Card>
<Card title="kcmd MCP server reference" href="/kcmd-mcp-reference">
MCP tool schemas, workspace binding, and agent integration patterns.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth failures, push conflicts, and glossary provisioning errors.
</Card>
</CardGroup>

---

## 15. catalog.yaml manifest reference

> scope, snapshot, publishing, reference, aliases, entry and aspect types, entryLinks reconciliation rules, and layout selection for YAML versus Markdown knowledge-base mode.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/15-catalog.yaml-manifest-reference.md
- Generated: 2026-06-15T02:56:28.508Z

### Source Files

- `agents/mdcode/README.md`
- `agents/mdcode/src/libts/manifest.ts`
- `agents/mdcode/docs/concept.md`
- `agents/mdcode/docs/design.md`
- `toolbox/mdcode/docs/spec.md`
- `agents/mdcode/src/libts/layout.ts`


The `catalog.yaml` manifest is the configuration contract for a kcmd Metadata as Code workspace. It declares which Knowledge Catalog resources the workspace owns, which entry and aspect types to pull locally, which subsets to publish back, and optional read-only reference layers for grounding. `kcmd init` writes a minimal manifest containing only `scope`; you extend `snapshot`, `publishing`, and `reference` before running `pull`, `reference`, or `push`.

## Manifest location

The manifest lives at the workspace root:

```text
my-workspace/
├── catalog.yaml          # Sync configuration
└── catalog/              # Pulled and edited metadata artifacts
```

`CatalogManifest.load()` validates the file with a Zod schema, resolves the scope into a `CatalogSource`, and selects the on-disk layout from that source type.

## Top-level structure

| Field | Required | Purpose |
| --- | --- | --- |
| `scope` | Yes | Primary catalog resource(s) this workspace manages |
| `resourceAlias` | No | Custom aliases for aspect, glossary, or entryLink types |
| `snapshot` | No | Entry, aspect, and entryLink types to pull into `catalog/` |
| `publishing` | No | Subset of snapshot types written back on `push` |
| `reference` | No | Read-only scope and snapshot for `*.ref.yaml` grounding layers |
| `entryLinkTypes` | No | Optional top-level list; persisted by `save()` but link sync is driven by `snapshot.entryLinks` and `publishing.entryLinks` |

<RequestExample>

```yaml title="Full manifest example"
scope: bq-dataset.my-project-id.my-dataset-id

resourceAlias:
  guidelines:
    aspect: data-agents-project.global.guidelines

snapshot:
  entries:
    - dataplex-types.global.bigquery-table
  aspects:
    - dataplex-types.global.schema
    - dataplex-types.global.bigquery-table
    - dataplex-types.global.storage
    - dataplex-types.global.overview
    - guidelines
  entryLinks:
    - definition
    - synonym

publishing:
  aspects:
    - dataplex-types.global.overview
    - guidelines
  entryLinks:
    - definition

reference:
  scope: bq-dataset.my-project-id.my-dataset-id
  snapshot:
    entries:
      - dataplex-types.global.bigquery-table
    aspects:
      - dataplex-types.global.schema
      - dataplex-types.global.overview
    entryLinks:
      - definition
      - synonym
```

</RequestExample>

## scope

<ParamField body="scope" type="string | string[]" required>
Unified identifier for the workspace source of truth. Format: `<type>.<name>`.

For BigQuery multi-dataset workspaces, `scope` may be a YAML array of `bq-dataset.<project>.<dataset>` strings. Array scopes support only `bq-dataset`; mixing scope types in one array is rejected.
</ParamField>

### Supported scope types

| Scope prefix | `kcmd init` flag | Name format | On-disk layout |
| --- | --- | --- | --- |
| `bq-dataset` | `--bigquery-dataset` | `<project>.<dataset>` (repeat flag for multiple) | YAML (Standard) |
| `kb` | `--kb` | `<project>.<location>.<entry-group-id>` | Markdown (Documents) |
| `entryGroup` | `--entry-group` | `<project>.<location>.<entry-group-id>` | YAML (Standard) |
| `biglake-namespace` / `biglake-iceberg-namespace` | `--biglake-namespace --iceberg` | `<project>.<catalog>.<namespace>` | YAML (Standard) |
| `glossary` | `--glossary` | `<project>.<location>` or `<project>.<location>.<glossary-id>` (comma-separated IDs or display names) | YAML (Standard) |

The scope type determines:

- Which GCP APIs `kcmd` calls during `pull` and `push`
- The directory hierarchy under `catalog/`
- Whether entries are **ingested** (system-managed, e.g. BigQuery tables) or user-managed (Knowledge Base pages, custom entries)
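
A minimal sketch of scope parsing, assuming the prefixes and layouts from the table above. `parse_scope` and `LAYOUTS` are hypothetical names; the real `CatalogManifest.load()` is TypeScript and also resolves a `CatalogSource`, which is not modeled here.

```python
from typing import List, Tuple, Union

# Scope prefix -> on-disk layout, taken from the table above.
LAYOUTS = {
    "bq-dataset": "standard",
    "entryGroup": "standard",
    "biglake-namespace": "standard",
    "biglake-iceberg-namespace": "standard",
    "glossary": "standard",
    "kb": "documents",
}


def parse_scope(scope: Union[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Return (scope_type, name, layout) for each scope string."""
    is_array = not isinstance(scope, str)
    values = list(scope) if is_array else [scope]
    if not values:
        raise ValueError("scope array cannot be empty")
    parsed = []
    for value in values:
        scope_type, sep, name = value.partition(".")
        if not sep or not name:
            raise ValueError(f"scope '{value}' is invalid")
        if scope_type not in LAYOUTS:
            # Wording of this message is an assumption, not kcmd output.
            raise ValueError(f"unknown scope type '{scope_type}'")
        if is_array and scope_type != "bq-dataset":
            raise ValueError("Unsupported scope type in multiple scopes")
        parsed.append((scope_type, name, LAYOUTS[scope_type]))
    return parsed
```

Splitting on the first dot only keeps the remainder (`<project>.<dataset>`, `<project>.<location>.<id>`, and so on) intact for type-specific parsing.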

<AccordionGroup>
<Accordion title="BigQuery multi-dataset scope">

```yaml
scope:
  - bq-dataset.my-project.dataset-a
  - bq-dataset.my-project.dataset-b
```

`kcmd` joins the datasets internally and emits one scope entry per dataset when saving the manifest.

</Accordion>
<Accordion title="Glossary location mode">

```yaml
scope: glossary.my-project.us-central1
```

When no glossary ID is provided after `<project>.<location>`, the workspace operates in location mode and resolves glossaries at pull time.

</Accordion>
</AccordionGroup>

## snapshot

The `snapshot` block defines what metadata `kcmd pull` downloads into editable local files.

<ParamField body="snapshot.entries" type="string[]">
Entry types to include. Each value must be a three-part type reference: `<project>.<location>.<type-id>` (for example `dataplex-types.global.bigquery-table`). Required aspects for listed entry types are fetched implicitly during type registration.
</ParamField>

<ParamField body="snapshot.aspects" type="string[]">
Aspect types to pull. Values may be fully qualified three-part references or short aliases such as `overview` or `schema`. Aliases resolve through built-in defaults and any custom `resourceAlias` mappings.
</ParamField>

<ParamField body="snapshot.entryLinks" type="string[]">
EntryLink types to fetch during `pull`. When present and non-empty, `kcmd` calls `lookupEntryLinks` for every pulled entry and inlines results into entry artifacts. Omit or leave empty to skip link fetching entirely.
</ParamField>

### How pulled entryLinks are stored

| Link source path | Local placement |
| --- | --- |
| Entry-level (no `Schema.<field>` path) | Top-level `links.<type>[]` on the entry YAML or Markdown frontmatter |
| Column-level (`Schema.<field>` path) | `aspects.schema.fields[].links.<type>[]` |

Glossary term targets are written in human-readable form (`<project>.<location>.<glossary-display>.<term-display>`) with the full UID resource path preserved in each link's `id` field for round-trip `push`.
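
The placement rule in the table can be sketched like this. `place_link` is a hypothetical helper operating on a plain dictionary that mirrors the entry YAML; it assumes the source path for column-level links has the form `Schema.<field>` as shown above.

```python
SCHEMA_PREFIX = "Schema."


def place_link(entry: dict, link_type: str, link: dict, source_path: str = "") -> None:
    """Attach a pulled link at entry level or under the matching schema field."""
    if source_path.startswith(SCHEMA_PREFIX):
        field_name = source_path[len(SCHEMA_PREFIX):]
        fields = entry.get("aspects", {}).get("schema", {}).get("fields", [])
        for field in fields:
            if field.get("name") == field_name:
                # Column-level: aspects.schema.fields[].links.<type>[]
                field.setdefault("links", {}).setdefault(link_type, []).append(link)
                return
        raise KeyError(f"no schema field named '{field_name}'")
    # Entry-level: top-level links.<type>[]
    entry.setdefault("links", {}).setdefault(link_type, []).append(link)
```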

## publishing

The `publishing` block declares which local edits `kcmd push` writes back to Knowledge Catalog. **Publishing must be a subset of snapshot**: aspect, entry, and entryLink types listed in `publishing` must also appear in the corresponding `snapshot` lists, or manifest loading fails.

<ParamField body="publishing.entries" type="string[]">
Entry types eligible for publish-side create/delete semantics. Must be listed in `snapshot.entries`.
</ParamField>

<ParamField body="publishing.aspects" type="string[]">
Aspect types written on `push`. Aspects not listed here are kept locally for context but not uploaded. For ingested entries (BigQuery), required aspects are never pushed even if listed.
</ParamField>

<ParamField body="publishing.entryLinks" type="string[]">
EntryLink types reconciled on `push`. Must be listed in `snapshot.entryLinks`. Omit the field or leave it empty to avoid taking responsibility for link mutations — useful for read-only link workspaces.
</ParamField>
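
The subset rule can be checked before running `kcmd` with a small pre-flight script. The sketch below is hypothetical (`check_publishing_subset` is not part of `kcmd`) and compares names exactly as written; it assumes the manifest has already been loaded into a dictionary and does not resolve aliases, so `overview` and `dataplex-types.global.overview` count as different names here.

```python
from typing import List


def check_publishing_subset(manifest: dict) -> List[str]:
    """Return one message per publishing type that snapshot does not list."""
    snapshot = manifest.get("snapshot") or {}
    publishing = manifest.get("publishing") or {}
    errors = []
    for kind in ("entries", "aspects", "entryLinks"):
        allowed = set(snapshot.get(kind) or [])
        for name in publishing.get(kind) or []:
            if name not in allowed:
                errors.append(
                    f"publishing.{kind}: '{name}' is not listed in snapshot.{kind}"
                )
    return errors
```

An empty result means every published type is also pulled, which is the precondition for the manifest to load.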

<ResponseExample>

```yaml title="Enrichment-focused publishing"
snapshot:
  entries:
    - dataplex-types.global.bigquery-table
  aspects:
    - overview
    - descriptions
    - queries
    - schema          # local context only
  entryLinks:
    - definition

publishing:
  aspects:
    - overview
    - descriptions
    - queries
  entryLinks:
    - definition
```

</ResponseExample>

Files ending in `.ref.yaml` are always skipped during `push`.

## reference

The optional `reference` block configures read-only grounding layers fetched by `kcmd reference` (not `pull`). Reference artifacts are saved as sibling `*.ref.yaml` files and are never pushed.

<ParamField body="reference.scope" type="string | string[]" required>
Source scope for reference data. Uses the same `<type>.<name>` format as the primary `scope`. Commonly points at the same BigQuery dataset or a glossary used for enrichment grounding.
</ParamField>

<ParamField body="reference.snapshot" type="object">
Same shape as the primary `snapshot` block: `entries`, `aspects`, and optional `entryLinks`. When `reference.snapshot.entryLinks` is set, `.ref.yaml` baselines include pre-edit link state so diffs against live `.yaml` files surface only enrichment changes.
</ParamField>

<Steps>
<Step title="Pull editable metadata">

Run `kcmd pull` to populate `catalog/` from the primary `scope` and `snapshot` configuration.

</Step>
<Step title="Pull reference baselines">

Run `kcmd reference` to fetch `*.ref.yaml` siblings from `reference.scope` and `reference.snapshot`.

</Step>
<Step title="Edit and push">

Modify editable files only. `push` uploads aspects and entryLinks declared in `publishing`, ignoring `.ref.yaml`.

</Step>
</Steps>

## resourceAlias

<ParamField body="resourceAlias" type="record<string, record<string, string>>">
Maps a short alias to exactly one resource. Each alias entry must contain a single key among `aspect`, `glossary`, or `entryLink`.

```yaml
resourceAlias:
  guidelines:
    aspect: data-agents-project.global.guidelines
  ecommerce:
    glossary: data-gov-project.global.ecommerce-glossary
```

Custom aliases cannot override built-in defaults. Duplicate aliases or duplicate resource mappings raise manifest parse errors.
</ParamField>

### Built-in aliases

These resolve without declaring `resourceAlias`:

| Alias | Resolves to | Resource kind |
| --- | --- | --- |
| `bigquery-dataset` | `dataplex-types.global.bigquery-dataset` | aspect |
| `bigquery-table` | `dataplex-types.global.bigquery-table` | aspect |
| `schema` | `dataplex-types.global.schema` | aspect |
| `storage` | `dataplex-types.global.storage` | aspect |
| `overview` | `dataplex-types.global.overview` | aspect |
| `definition` | `dataplex-types.global.definition` | entryLink |
| `synonym` | `dataplex-types.global.synonym` | entryLink |
| `related` | `dataplex-types.global.related` | entryLink |
| `schema-join` | `dataplex-types.global.schema-join` | entryLink |

Use project IDs, not project numbers, when authoring metadata references.
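
Alias resolution can be sketched as a lookup over the built-in table plus any custom `resourceAlias` mappings. This is an approximation with hypothetical names (`BUILT_IN_ALIASES`, `resolve_type`) and error wording of its own; the real loader is TypeScript and may apply the checks in a different order.

```python
from typing import Dict, Optional

# (resource kind, fully qualified reference), copied from the table above.
BUILT_IN_ALIASES = {
    "bigquery-dataset": ("aspect", "dataplex-types.global.bigquery-dataset"),
    "bigquery-table": ("aspect", "dataplex-types.global.bigquery-table"),
    "schema": ("aspect", "dataplex-types.global.schema"),
    "storage": ("aspect", "dataplex-types.global.storage"),
    "overview": ("aspect", "dataplex-types.global.overview"),
    "definition": ("entryLink", "dataplex-types.global.definition"),
    "synonym": ("entryLink", "dataplex-types.global.synonym"),
    "related": ("entryLink", "dataplex-types.global.related"),
    "schema-join": ("entryLink", "dataplex-types.global.schema-join"),
}


def resolve_type(name: str, kind: str,
                 resource_alias: Optional[Dict[str, Dict[str, str]]] = None) -> str:
    """Resolve an alias or fully qualified name to a three-part reference."""
    custom = resource_alias or {}
    for alias, mapping in custom.items():
        if alias in BUILT_IN_ALIASES:
            raise ValueError(f"cannot redefine built-in alias '{alias}'")
        if len(mapping) != 1:
            raise ValueError(f"alias '{alias}' must map to exactly one resource")
    if name in BUILT_IN_ALIASES and BUILT_IN_ALIASES[name][0] == kind:
        resolved = BUILT_IN_ALIASES[name][1]
    elif kind in custom.get(name, {}):
        resolved = custom[name][kind]
    else:
        resolved = name  # treat as already fully qualified
    if len(resolved.split(".")) != 3:
        raise ValueError(f"'{name}' does not resolve to a three-part {kind} reference")
    return resolved
```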

## Entry and aspect type references

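Every value in a manifest type list must resolve to a Dataplex three-part reference: `<project>.<location>.<resource-id>`. Entry types are written out in full; aspect and entryLink lists also accept aliases.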

| Manifest list | Alias support | Validation |
| --- | --- | --- |
| `snapshot.entries` | No — use fully qualified names | Must split into exactly three dot-separated parts |
| `snapshot.aspects` | Yes | Resolved alias must be three-part |
| `snapshot.entryLinks` | Yes | Resolved alias must be three-part |
| `publishing.*` | Same as snapshot | Must be subset of corresponding snapshot list |

During workspace initialization, `kcmd` fetches type definitions from the Catalog API and registers required aspects for listed entry types automatically.

## entryLinks reconciliation

When `publishing.entryLinks` is declared, `push` reconciles local vs remote `EntryLink` resources per entry for those types only.

### Reconciliation algorithm

1. **Serialize local links** — `toServiceEntryLinks` reads top-level `links` and `aspects.schema.fields[].links`, filtering to types in `publishing.entryLinks`.
2. **Fetch remote links** — `lookupEntryLinks` queries the catalog for the same link types.
3. **Normalize both sides** — Comparison keys use an unwrap-and-normalize strategy:
   - Strip `@dataplex` proxy shells from target references
   - Canonicalize project segments to project ID (not number)
   - Include source path (for example `Schema.customer_id`) so column-level links stay distinct
4. **Keep matches** — Links with equal keys are preserved; no delete-and-recreate cycle.
5. **Delete orphans** — Remote links of configured types with no local match are deleted.
6. **Create new** — Local links with no remote match are created with generated link IDs.
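
The steps above amount to a keyed set difference. The following is a simplified sketch, not the `CatalogSync` code: `Link`, `link_key`, and `reconcile` are hypothetical names, project normalization uses a caller-supplied number-to-ID map, and `@dataplex` unwrapping is omitted because the wrapper format is not specified on this page.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Set, Tuple


class Link(NamedTuple):
    link_type: str
    target: str
    source_path: str = ""  # empty for entry-level links


def link_key(link: Link, project_ids: Dict[str, str]) -> str:
    """Build a comparison key with the project segment canonicalized to an ID."""
    project, sep, rest = link.target.partition(".")
    project = project_ids.get(project, project)
    return f"{link.link_type}|{project}{sep}{rest}|{link.source_path}"


def reconcile(local: Iterable[Link], remote: Iterable[Link], types: Set[str],
              project_ids: Optional[Dict[str, str]] = None
              ) -> Tuple[List[Link], List[Link], List[Link]]:
    """Return (keep, delete, create) for links of the configured types only."""
    ids = project_ids or {}
    local_by_key = {link_key(l, ids): l for l in local if l.link_type in types}
    remote_by_key = {link_key(r, ids): r for r in remote if r.link_type in types}
    keep = [r for k, r in remote_by_key.items() if k in local_by_key]
    delete = [r for k, r in remote_by_key.items() if k not in local_by_key]
    create = [l for k, l in local_by_key.items() if k not in remote_by_key]
    return keep, delete, create
```

Filtering both sides by `types` is what keeps link types outside `publishing.entryLinks` untouched: they never enter either map, so they can be neither deleted nor created.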

<AccordionGroup>
<Accordion title="Comparison key format">

Each link is keyed as:

```text
<normalized-link-type>|<normalized-target>|<source-path>
```

The `source-path` segment is empty for entry-level links.

</Accordion>
<Accordion title="Glossary target resolution on push">

Local link `target` values may be human-readable glossary references. When `id` is set (populated on `pull`), `push` prefers `id` to reconstruct the exact catalog UID. Otherwise it resolves through the workspace `scope` or `reference.scope` via `serviceName()`.

</Accordion>
<Accordion title="EntryLink aspect updates">

Updating an existing EntryLink whose only difference is in its link aspects is not yet implemented (marked `TODO` in the sync code). Links with matching keys are kept as-is.

</Accordion>
</AccordionGroup>

### Local entryLink artifact shape

```yaml
aspects:
  schema:
    fields:
      - name: customer_id
        dataType: STRING
        links:
          definition:
            - target: my-project.global.business-glossary.customer-id
              id: projects/my-project/locations/global/glossaries/biz/terms/customer-id

links:
  related:
    - target: my-other-project.us.docs-eg.runbook-page
```

## Layout selection

Layout is **not** a manifest field. `CatalogSnapshot` selects the layout from `manifest.source.layout`, which each scope type sets at initialization:

| Layout | Scope types | Entry file pattern | Unstructured text |
| --- | --- | --- | --- |
| **Standard** (YAML) | `bq-dataset`, `entryGroup`, `biglake-*`, `glossary` | `<entry-id>.yaml` plus optional `<entry-id>.<aspect>.md` sidecars | Sidecar Markdown files |
| **Documents** (Markdown) | `kb` | `<entry-id>.md` | YAML frontmatter + Markdown body (`overview.content`) |

```text
Standard layout (BigQuery)
├── catalog.yaml
└── catalog/bigquery/<project>/<dataset>/
    ├── orders.yaml
    ├── orders.ref.yaml
    └── orders.overview.md

Documents layout (Knowledge Base)
├── catalog.yaml
└── catalog/<namespace>/<project>/<location>/
    ├── page1.md
    └── playbooks/mbr.md
```

Knowledge Base entries combine metadata and content in one file:

```markdown
---
type: dataplex-types.global.entry
title: My Page Title
catalogEntry:
  id: my-page-id
---
# Page body

This Markdown body maps to the `overview.content` aspect.
```

## Validation errors

Manifest loading fails fast on structural problems:

| Error | Cause |
| --- | --- |
| `scope '<value>' is invalid` | Missing dot separator in scope string |
| `scope array cannot be empty` | Empty `scope: []` |
| `Unsupported scope type in multiple scopes` | Non-`bq-dataset` entry in scope array |
| `Invalid Entry/Aspect/EntryLink Type` | Type reference does not resolve to three parts |
| `Publishing ... is not listed in snapshot` | `publishing` entry, aspect, or entryLink not in `snapshot` |
| `Alias ... has multiple mappings` | `resourceAlias` entry contains more than one resource key |
| `Cannot define predefined alias` | Custom alias collides with a built-in default |

## Initialization vs full configuration

`kcmd init` writes a scope-only manifest:

<ResponseExample>

```yaml title="After kcmd init --bigquery-dataset my-project.my-dataset"
scope: bq-dataset.my-project.my-dataset
```

</ResponseExample>

Add `snapshot`, `publishing`, and optional `reference` before the first `pull`. For enrichment workflows, declare every aspect and link type agents may read in `snapshot`, then restrict `publishing` to the types humans or CI should deploy.

<Tabs>
<Tab title="BigQuery dataset">

```yaml
scope: bq-dataset.ecommerce-prod.ecommerce-dataset
snapshot:
  entries:
    - dataplex-types.global.bigquery-dataset
    - dataplex-types.global.bigquery-table
  aspects:
    - overview
    - schema
    - storage
  entryLinks:
    - definition
publishing:
  aspects:
    - overview
  entryLinks:
    - definition
```

</Tab>
<Tab title="Knowledge Base">

```yaml
scope: kb.ecommerce-prod.global.mbr-kb
snapshot:
  entries:
    - dataplex-types.global.document
  aspects:
    - overview
publishing:
  aspects:
    - overview
```

</Tab>
<Tab title="With reference layer">

```yaml
scope: bq-dataset.my-project.my-dataset
snapshot:
  entries:
    - dataplex-types.global.bigquery-table
  aspects:
    - overview
    - schema
  entryLinks:
    - definition
publishing:
  aspects:
    - overview
  entryLinks:
    - definition
reference:
  scope: bq-dataset.my-project.my-dataset
  snapshot:
    aspects:
      - schema
    entryLinks:
      - definition
```

</Tab>
</Tabs>

## Related pages

<CardGroup cols={2}>
<Card title="Metadata as Code" href="/metadata-as-code">
kcmd workspace model, pull/push sync, reference layers, and glossary scope overview.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize workspaces per source type, pull snapshots, check status, and push edits.
</Card>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
`init`, `pull`, `push`, `reference`, and authentication flags.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces after enrichment without modifying reference layers.
</Card>
</CardGroup>

---

## 16. kcmd MCP server reference

> MCP server startup, workspace path binding, and agent tools for pull, push, list-entries, lookup-entry, and modify-entry in agentic metadata workflows.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/16-kcmd-mcp-server-reference.md
- Generated: 2026-06-15T02:56:39.859Z

### Source Files

- `agents/mdcode/src/tool/mcp.ts`
- `agents/mdcode/src/tool/main.ts`
- `agents/mdcode/README.md`
- `toolbox/mdcode/src/tool/mcp.ts`
- `toolbox/enrichment/README.md`
- `toolbox/enrichment/src/tools/md/server.ts`

---
title: "kcmd MCP server reference"
description: "MCP server startup, workspace path binding, and agent tools for pull, push, list-entries, lookup-entry, and modify-entry in agentic metadata workflows."
---

The `kcmd mcp` subcommand starts a stdio Model Context Protocol (MCP) server inside the same `kcmd` binary that powers the Metadata as Code CLI. At startup the server binds to one workspace directory, loads `catalog.yaml`, and exposes three tools—`list-entries`, `lookup-entry`, and `modify-entry`—that read and write local catalog artifacts through `CatalogSnapshot`. Catalog synchronization (`pull` and `push`) runs through the CLI, not as MCP tools; agent workflows typically combine MCP entry editing with CLI sync steps.

```mermaid
sequenceDiagram
  participant Agent as MCP client agent
  participant MCP as kcmd MCP server
  participant Snap as CatalogSnapshot
  participant Disk as catalog/ workspace
  participant CLI as kcmd CLI
  participant API as Knowledge Catalog API

  Agent->>MCP: stdio connect (kcmd mcp --path WORKSPACE)
  MCP->>Snap: CatalogSnapshot.fromPath(WORKSPACE)
  Snap->>Disk: load catalog.yaml + index entries

  Agent->>MCP: list-entries
  MCP->>Snap: listEntries()
  Snap-->>Agent: JSON array of entry names

  Agent->>MCP: lookup-entry(name)
  MCP->>Snap: lookupEntry(name)
  Snap->>Disk: merge .yaml + .ref.yaml + sidecars
  Snap-->>Agent: JSON Entry object

  Agent->>MCP: modify-entry(name, field, updates)
  MCP->>Snap: updateEntry(...)
  Snap->>Disk: write editable layer
  Snap-->>Agent: JSON updated Entry

  Note over Agent,CLI: Sync is CLI-only today
  Agent->>CLI: kcmd pull / kcmd push
  CLI->>API: CatalogSync.pull() / .push()
  CLI->>Disk: refresh local snapshot
```

## Prerequisites

<Steps>
<Step title="Initialize a workspace">

Create a kcmd workspace with `catalog.yaml` and a pulled snapshot before starting the MCP server. See [Sync catalog metadata](/sync-catalog-metadata) for `init`, `pull`, and `reference` steps.

</Step>
<Step title="Authenticate with gcloud ADC">

The MCP server calls `gcp.ApiContext.default()` at startup to load Dataplex entry and aspect type definitions. Configure Application Default Credentials:

```bash
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
```

</Step>
<Step title="Install or build kcmd">

Use the published npm package or a local build from `agents/mdcode`:

<CodeGroup>
```bash title="npm package"
npm install -g kcmd
```

```bash title="Local build"
cd agents/mdcode
npm install && npm run build
# binary: dist/kcmd
```
</CodeGroup>

</Step>
</Steps>

## Start the server

The `mcp` command registers with the `kcmd` CLI and delegates to `startServer()` in the MCP module.

| Item | Value |
|------|-------|
| Command | `kcmd mcp` |
| Transport | `StdioServerTransport` (stdio JSON-RPC) |
| Server name | `kcmd` |
| Server version | `1.0.0` |
| SDK | `@modelcontextprotocol/sdk` |

<ParamField body="--path" type="string">
Absolute or relative path to the workspace root—the directory that contains `catalog.yaml`. Always pass an explicit absolute path in MCP client configuration to avoid ambiguity when the client cwd differs from the workspace.
</ParamField>

<RequestExample>

```bash title="Start MCP server"
kcmd mcp --path /absolute/path/to/workspace
```

</RequestExample>

<Note>
The server loads the workspace snapshot once at startup. External changes made while the server is running—such as `kcmd pull` from another terminal—are not visible until you restart the MCP process.
</Note>

### MCP client configuration

Register the server in your agent's MCP settings. The pattern works with any MCP-capable client (Gemini CLI, Cursor, Claude Desktop, custom ADK runners).

<RequestExample>

```json title="mcp.json"
{
  "mcpServers": {
    "kcmd": {
      "command": "npx",
      "args": ["-y", "kcmd", "mcp", "--path", "/absolute/path/to/workspace"]
    }
  }
}
```

</RequestExample>

For local development, point `command` at the compiled binary:

```json
{
  "mcpServers": {
    "kcmd": {
      "command": "/path/to/agents/mdcode/dist/kcmd",
      "args": ["mcp", "--path", "/absolute/path/to/workspace"]
    }
  }
}
```
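
Because the path should always be absolute, it can help to generate the client configuration instead of writing it by hand. This is a small sketch, assuming the two configuration shapes shown above; `build_mcp_config` is a hypothetical helper and the output file location depends on your MCP client.

```python
import json
import os


def build_mcp_config(workspace: str, command: str = "npx") -> str:
    """Render an mcp.json document that binds kcmd to one workspace."""
    path = os.path.abspath(workspace)  # never emit a relative --path
    if command == "npx":
        args = ["-y", "kcmd", "mcp", "--path", path]
    else:
        # Any other value is treated as the path to a compiled kcmd binary.
        args = ["mcp", "--path", path]
    config = {"mcpServers": {"kcmd": {"command": command, "args": args}}}
    return json.dumps(config, indent=2)
```

Calling `build_mcp_config("./workspace")` resolves the path against the current directory at generation time, so the result stays valid even when the client starts the server from a different cwd.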

### Debug with MCP Inspector

From `agents/mdcode`, run the inspector against the compiled binary:

```bash
npm run x:mcp
# equivalent: npx @modelcontextprotocol/inspector dist/kcmd mcp
```

Pass `--path` through the inspector UI or append it to the spawned args.

## Workspace path binding

`startServer(basePath)` constructs a `CatalogSnapshot` from the workspace root:

1. **Manifest** — reads `catalog.yaml` at `{basePath}/catalog.yaml`; throws if missing.
2. **Layout selection** — `createLayout()` picks `StandardLayout` (YAML + sidecars) or `DocumentsLayout` (Markdown frontmatter) based on the manifest `scope`.
3. **Type registry** — fetches entry and aspect type definitions from Knowledge Catalog for types declared in `snapshot.entries` and `snapshot.aspects`.
4. **Entry index** — walks `catalog/` and indexes editable files (`.yaml` or `.md`) and reference layers (`.ref.yaml`).

Entry names returned by `list-entries` are the local `name` fields inside each artifact—for example `bigquery/my-project/my-dataset/my-table-id`—not Dataplex resource paths.
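
The indexing step can be pictured as a suffix-based classification of files under `catalog/`. The sketch below is an approximation, not the `CatalogSnapshot` implementation: `classify` is a hypothetical function, and it assumes that Standard-layout sidecars are named `<entry>.<aspect>.md` (as in `orders.overview.md`) and that Documents-layout entry IDs contain no dot.

```python
from pathlib import PurePosixPath


def classify(rel_path: str) -> str:
    """Classify one workspace file by suffix."""
    name = PurePosixPath(rel_path).name
    if name.endswith(".ref.yaml"):
        return "reference"  # read-only grounding layer
    if name.endswith(".yaml"):
        return "editable"   # Standard-layout entry
    if name.endswith(".md"):
        stem = name[: -len(".md")]
        # 'orders.overview.md' is a sidecar; 'page1.md' is a document entry.
        return "sidecar" if "." in stem else "editable"
    return "other"
```

The `.ref.yaml` check has to come before the `.yaml` check, because every reference file also matches the shorter suffix.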

<Warning>
`modify-entry` writes only to the editable local layer. Reference files (`*.ref.yaml`) are read during `lookup-entry` merges but are never modified by MCP tools, and `kcmd push` skips them, so reference-layer content is never published from the workspace.
</Warning>

## MCP tools

The server registers three tools. All successful responses return MCP `content` with `type: "text"` and a pretty-printed JSON body. Errors set `isError: true` with a text message.
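
On the client side, that response convention reduces to one small unwrapping step. The following sketch assumes the standard MCP tool result shape (a `content` list of typed items plus an optional `isError` flag); `parse_tool_result` and `ToolError` are hypothetical names, not part of kcmd or the MCP SDK.

```python
import json


class ToolError(Exception):
    """Raised when an MCP tool result has isError set."""


def parse_tool_result(result: dict):
    """Return the decoded JSON payload of a tool result, or raise ToolError."""
    texts = [
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    ]
    text = "\n".join(texts)
    if result.get("isError"):
        # Error results carry a plain message, not JSON.
        raise ToolError(text)
    return json.loads(text)
```

Keeping the error path separate matters because error text such as `Error looking up entry: …` is not valid JSON and would otherwise surface as a confusing decode failure.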

### `list-entries`

Lists every indexed entry name in the bound workspace.

| Field | Value |
|-------|-------|
| Parameters | None |
| Returns | JSON string array of entry names |
| Errors | Startup failures (missing manifest, auth, type load) surface before the tool is callable |

<ResponseExample>

```json title="list-entries response"
[
  "bigquery/my-project/my-dataset/events",
  "bigquery/my-project/my-dataset/products"
]
```

</ResponseExample>

### `lookup-entry`

Returns the full metadata for one entry, including merged reference and local layers plus Markdown sidecar content.

<ParamField body="name" type="string" required>
Local entry name as returned by `list-entries`.
</ParamField>

<ResponseField name="content" type="text">
JSON-serialized `Entry` object with `name`, `type`, `resource`, optional `aspects`, and optional `links`.
</ResponseField>

For Standard layout entries, `lookup-entry` merges `.ref.yaml` under the editable `.yaml` (local overrides reference) and inlines sidecar `.md` bodies into aspect `content` fields.

<RequestExample>

```json title="lookup-entry input"
{
  "name": "bigquery/my-project/my-dataset/events"
}
```

</RequestExample>

<ResponseExample>

```json title="lookup-entry response (excerpt)"
{
  "name": "bigquery/my-project/my-dataset/events",
  "type": "dataplex-types.global.bigquery-table",
  "resource": {
    "displayName": "events",
    "description": "GA4 ecommerce events table"
  },
  "aspects": {
    "dataplex-types.global.schema": {
      "fields": [{ "name": "event_name", "dataType": "STRING" }]
    },
    "dataplex-types.global.overview": {
      "content": "# Events\n\nSession and purchase events.",
      "contentType": "MARKDOWN"
    }
  }
}
```

</ResponseExample>
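
The layering described above (reference first, local on top, sidecar Markdown inlined last) can be sketched in a few lines. This is an illustration only: `merge_layers` is a hypothetical function, and the per-aspect override granularity is an assumption, since the text above states only that local values override reference values.

```python
import copy


def merge_layers(reference: dict, local: dict, sidecars: dict) -> dict:
    """Merge a .ref.yaml mapping, a .yaml mapping, and sidecar Markdown.

    sidecars maps an aspect key to the Markdown text of its sidecar file.
    """
    entry = copy.deepcopy(reference)
    for key, value in local.items():
        if key == "aspects":
            # Assumed granularity: a local aspect replaces the whole
            # reference aspect with the same key.
            entry.setdefault("aspects", {}).update(copy.deepcopy(value))
        else:
            entry[key] = copy.deepcopy(value)
    for aspect_key, markdown in sidecars.items():
        entry.setdefault("aspects", {}).setdefault(aspect_key, {})["content"] = markdown
    return entry
```

Aspects that exist only in the reference layer, such as a read-only `schema`, pass through unchanged, which is why a looked-up entry can show more aspects than the editable file contains.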

On failure (unknown name, missing files), the tool returns `isError: true` with message `Error looking up entry: …`.

### `modify-entry`

Updates one field on a local entry and persists the change to disk. Returns the post-update entry from a fresh `lookup-entry`.

<ParamField body="name" type="string" required>
Entry name to modify.
</ParamField>

<ParamField body="field" type="string" required>
Either `resource` or an aspect key (for example `dataplex-types.global.overview` or a manifest alias like `overview`).
</ParamField>

<ParamField body="updates" type="object" required>
Structured JSON dictionary with the new values for the targeted field.
</ParamField>

#### Field-specific behavior

| `field` value | What `updates` replaces | Persistence notes |
|---------------|-------------------------|-------------------|
| `resource` | Only `resource.description` is applied from `updates` | Other resource keys in `updates` are ignored by `updateEntry` |
| Aspect key | The entire aspect object for that key | Markdown aspects with `content` may be written to `.overview.md` sidecars |

<RequestExample>

```json title="modify-entry — update overview aspect"
{
  "name": "bigquery/my-project/my-dataset/events",
  "field": "dataplex-types.global.overview",
  "updates": {
    "content": "# Events\n\nEnriched by an agent.",
    "contentType": "MARKDOWN"
  }
}
```

</RequestExample>

<RequestExample>

```json title="modify-entry — update resource description"
{
  "name": "bigquery/my-project/my-dataset/events",
  "field": "resource",
  "updates": {
    "description": "GA4 obfuscated ecommerce events, partitioned by date."
  }
}
```

</RequestExample>

#### Modification constraints

- The aspect must appear in `snapshot.aspects` inside `catalog.yaml`; otherwise `updateEntry` throws `The aspect '…' is not registered in the snapshot.`
- For ingested entry groups (BigQuery, BigLake), required aspects on the entry type cannot be modified.
- Only aspects listed under `publishing.aspects` are pushed to Knowledge Catalog on `kcmd push`.
- `entryLinks` and `links` blocks are not directly exposed as MCP tools; manage links by editing YAML or using library APIs.

On failure, the tool returns `isError: true` with message `Error modifying entry: …`.
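
The field-specific behavior and the snapshot constraint can be summarized as a pure function. The sketch below is a model of the documented semantics, not the `updateEntry` source: `apply_update` is a hypothetical name, aliases such as `overview` are assumed to be resolved to full aspect keys beforehand, and the required-aspect check for ingested entry groups is left out.

```python
import copy


def apply_update(entry: dict, field: str, updates: dict, snapshot_aspects: list) -> dict:
    """Return a copy of entry with one field updated."""
    result = copy.deepcopy(entry)
    if field == "resource":
        # Only the description is applied; other keys in updates are ignored.
        if "description" in updates:
            result.setdefault("resource", {})["description"] = updates["description"]
        return result
    if field not in snapshot_aspects:
        raise ValueError(f"The aspect '{field}' is not registered in the snapshot.")
    # An aspect update replaces the entire aspect object.
    result.setdefault("aspects", {})[field] = copy.deepcopy(updates)
    return result
```

Because aspect updates replace rather than merge, an agent that wants to change only `content` should send back the full aspect, including `contentType`.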

## Catalog sync: pull and push

`pull` and `push` are **CLI commands**, not MCP tools in the current implementation. Agentic metadata workflows use them alongside MCP entry editing:

```text
kcmd init …  →  kcmd pull  →  [agent: list / lookup / modify via MCP]  →  kcmd push
```

| Operation | CLI command | Role in agent workflows |
|-----------|-------------|-------------------------|
| Pull snapshot | `kcmd pull [--dry-run]` | Refresh local `catalog/` from Knowledge Catalog before agent work |
| Pull reference | `kcmd reference` | Add read-only `*.ref.yaml` grounding layers |
| Push changes | `kcmd push [--force] [--validate-only] [--dry-run]` | Publish MCP and manual edits back to the catalog |

<Info>
The design specification lists `pull` and `push` as planned MCP tools, but `agents/mdcode/src/tool/mcp.ts` registers only `list-entries`, `lookup-entry`, and `modify-entry`. Until sync tools ship over MCP, agents should shell out to `kcmd pull` and `kcmd push`, or embed `CatalogSnapshot` / `CatalogSync` from the `kcmd` library directly.
</Info>

See [kcmd CLI reference](/kcmd-cli-reference) for full flag documentation.
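
For agents that shell out, a thin wrapper keeps the flag combinations in one place. This is a sketch under assumptions: `kcmd_command` and `run_kcmd` are hypothetical helpers, only the flags listed in the table above are used, and running with `cwd` set to the workspace assumes that `kcmd pull` and `kcmd push` resolve the workspace from the current directory.

```python
import subprocess


def kcmd_command(operation: str, *, dry_run: bool = False,
                 validate_only: bool = False, force: bool = False) -> list:
    """Build the argv list for a kcmd sync command."""
    if operation not in ("pull", "push"):
        raise ValueError(f"unsupported operation: {operation}")
    if operation == "pull" and (validate_only or force):
        raise ValueError("--validate-only and --force apply to push only")
    argv = ["kcmd", operation]
    if dry_run:
        argv.append("--dry-run")
    if validate_only:
        argv.append("--validate-only")
    if force:
        argv.append("--force")
    return argv


def run_kcmd(workspace: str, argv: list) -> int:
    """Run kcmd inside the workspace and return its exit code.

    Assumption: kcmd locates catalog.yaml from the current directory.
    """
    return subprocess.run(argv, cwd=workspace, check=False).returncode
```

A cautious agent loop would call `kcmd_command("push", validate_only=True)` first and push for real only when that run exits cleanly.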

## Agent workflow patterns

### MCP-only editing loop

Suited for interactive agents (Gemini CLI, IDE MCP clients) that modify metadata in place:

1. `kcmd pull` to sync the latest remote state.
2. Start `kcmd mcp --path WORKSPACE` in the MCP client config.
3. Agent calls `list-entries` → `lookup-entry` per asset → `modify-entry` for aspects in `publishing.aspects`.
4. `kcmd push` to publish; use `--validate-only` first to catch schema errors.

### Library-embedded enrichment (no MCP)

The toolbox enrichment agent (`kcagent enrich`) imports `kcmd` as a library and registers a custom `update_documentation` `FunctionTool` that calls `catalog.updateEntry()` directly—same persistence path as `modify-entry`, without stdio MCP overhead. It loads additional grounding MCP servers (for example `md-fileset`) from `tools/mcp.json`. See [Toolbox enrichment demo](/toolbox-enrichment-demo).

### Composing multiple MCP servers

A typical enrichment workspace binds:

| Server | Purpose |
|--------|---------|
| `kcmd` | List, read, and write catalog entries in the mdcode workspace |
| `md-fileset` | Search and read organizational markdown for grounding |
| Custom servers | GitHub, Drive, or domain-specific context |

Provider-neutral pattern: each server is a stdio subprocess declared in `mcp.json`; the orchestrating agent merges tool sets regardless of model provider.

## Authentication and permissions

| Concern | Mechanism |
|---------|-----------|
| Token source | `gcloud auth application-default print-access-token` |
| Project / region | `gcloud config get-value project` and `gcloud config get-value compute/region` |
| Startup type loading | Read-only Dataplex API calls for entry/aspect type metadata |
| Push (CLI) | Requires catalog write permissions for the workspace `scope` |

Set `GCP_LOG=1` to enable verbose API logging from `ApiContext`.

## Troubleshooting

| Symptom | Likely cause | Verification |
|---------|--------------|--------------|
| Server exits immediately | Missing `catalog.yaml` at `--path` | Confirm `{path}/catalog.yaml` exists |
| `Unable to retrieve project, location, or token` | ADC not configured | Run `gcloud auth application-default login` |
| `The aspect '…' is not registered` | Aspect omitted from `snapshot.aspects` | Edit `catalog.yaml` and restart MCP |
| `The aspect '…' is not modifiable` | Required aspect on ingested entry | Target a non-required aspect (for example `overview`) |
| Agent sees stale entries after `kcmd pull` | Snapshot loaded at server start | Restart the MCP server process |
| Push rejects agent-written content | Aspect not in `publishing.aspects` or schema violation | Run `kcmd push --validate-only` |

See [Troubleshooting](/troubleshooting) for auth, billing, and push-conflict guidance.

## Related pages

<CardGroup>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
`init`, `pull`, `push`, `reference`, and authentication flags that complement MCP entry tools.
</Card>
<Card title="Metadata as Code" href="/metadata-as-code">
Workspace model, layouts, reference layers, and entry link semantics the MCP server operates on.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize workspaces and run pull/push sync around agent editing sessions.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push MCP-edited workspaces and reconcile aspects without touching read-only reference layers.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
End-to-end flows from source metadata through agent enrichment to catalog publication.
</Card>
<Card title="Toolbox enrichment demo" href="/toolbox-enrichment-demo">
TypeScript demo combining `kcmd` sync, `kcagent enrich`, and `md-fileset` MCP grounding.
</Card>
</CardGroup>

---

## 17. OKF enrichment-agent CLI reference

> enrichment-agent enrich and visualize subcommands, BigQuery source flags, web crawl constraints, concept scoping, model selection, and environment variables.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/17-okf-enrichment-agent-cli-reference.md
- Generated: 2026-06-15T02:56:17.577Z

### Source Files

- `okf/src/enrichment_agent/cli.py`
- `okf/pyproject.toml`
- `okf/src/enrichment_agent/sources/bigquery.py`
- `okf/src/enrichment_agent/web/fetcher.py`
- `okf/src/enrichment_agent/tools/web_tools.py`
- `okf/README.md`

---
title: OKF enrichment-agent CLI reference
description: enrichment-agent enrich and visualize subcommands, BigQuery source flags, web crawl constraints, concept scoping, model selection, and environment variables.
---

The `enrichment-agent` CLI produces [Open Knowledge Format](/open-knowledge-format) bundles from pluggable metadata sources. Today the only source is BigQuery (`--source bq`). The CLI exposes two subcommands: `enrich` runs a two-pass BQ-then-web enrichment pipeline into a bundle directory, and `visualize` renders a self-contained HTML graph viewer from an existing bundle.

Install the package from the `okf/` directory (see [Installation](/installation)), then invoke either entry point:

<CodeGroup>
```bash title="Console script"
enrichment-agent enrich --help
```

```bash title="Python module"
python -m enrichment_agent enrich --help
```
</CodeGroup>

## Enrichment pipeline

`enrich` runs in three phases, two enrichment passes followed by index regeneration:

1. **BigQuery pass** — For each concept the source advertises, an ADK agent reads metadata (and optional row samples) and writes one OKF markdown document per concept.
2. **Web pass** (optional) — When seed URLs are provided, a separate agent crawls documentation pages via the `fetch_url` tool and enriches existing concepts or mints `references/<slug>` docs.
3. **Index regeneration** — Auto-generated `index.md` files are written at each bundle directory level.

Use `--no-web` or omit all seed flags to run BQ-only. See [Enrichment workflows](/enrichment-workflows) and [Produce OKF bundles](/produce-okf-bundles) for end-to-end context.

## `enrich` subcommand

Enrich concepts from a source into an OKF bundle directory.

<RequestExample>
```bash
enrichment-agent enrich \
  --source bq \
  --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
  --web-seed-file samples/ga4_merch_store/seeds.txt \
  --out ./bundles/ga4
```
</RequestExample>

<ResponseExample>
```text
Enriched 12 concept(s) into ./bundles/ga4; web pass used 3 seed(s)
```
</ResponseExample>

On success the command prints a summary to stderr and exits `0`. With `-v` / `--verbose`, the `enrichment_agent` loggers emit DEBUG-level tool-call detail while third-party loggers (`google`, `google_genai`, `google_adk`, `urllib3`, `httpx`) stay at WARNING.

### Source flags

<ParamField body="--source" type="string" required>
Source adapter to use. Currently only `bq` (BigQuery) is supported.
</ParamField>

<ParamField body="--dataset" type="string" required>
BigQuery dataset identifier in `project.dataset` form (for example `bigquery-public-data.ga4_obfuscated_sample_ecommerce`). Required when `--source bq`; the CLI exits with status `1` if it is omitted.
</ParamField>

<ParamField body="--billing-project" type="string">
Google Cloud project ID billed for BigQuery API calls and query bytes. Defaults to the Application Default Credentials default project. Public datasets are readable, but the caller's project is billed for queries.
</ParamField>

<ParamField body="--out" type="path" required>
Bundle root directory. Created if it does not exist. Each concept becomes a `.md` file under this tree.
</ParamField>

#### BigQuery concepts

`BigQuerySource` discovers concepts automatically:

| Concept type | ID pattern | Notes |
|---|---|---|
| Dataset | `datasets/<dataset_id>` | One per `--dataset` |
| Table (singleton) | `tables/<table_id>` | Tables without a shard suffix |
| Table (sharded family) | `tables/<prefix>` | Tables matching `prefix` + 6–8 digit suffix collapse into one wildcard concept; enrichment reads the last shard as representative |
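
The sharded-family rule is easiest to see on a concrete table list. The sketch below approximates the grouping and is not the `BigQuerySource` code: `group_concepts` is a hypothetical function, a single table with a 6 to 8 digit suffix is assumed to count as a family, and "last shard" is assumed to mean the lexicographically greatest table ID.

```python
import re
from collections import defaultdict

# Prefix followed by a 6 to 8 digit shard suffix, e.g. events_20210131.
SHARD = re.compile(r"^(?P<prefix>.+?)(?P<suffix>\d{6,8})$")


def group_concepts(table_ids):
    """Map concept IDs to the table that represents each concept."""
    families = defaultdict(list)
    concepts = {}
    for table_id in table_ids:
        match = SHARD.match(table_id)
        if match:
            families[match.group("prefix")].append(table_id)
        else:
            concepts[f"tables/{table_id}"] = table_id
    for prefix, shards in families.items():
        # The greatest ID is the latest shard when suffixes share one length.
        concepts[f"tables/{prefix}"] = max(shards)
    return concepts
```

With date-sharded GA4 tables this yields a single `tables/events_` concept, which is the ID form that `--concept` expects.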

### Concept scoping

<ParamField body="--concept" type="string">
Enrich only the named concept ID. Repeatable. IDs are slash-separated path segments matching bundle layout (for example `tables/events_` for a sharded GA4 events family, or `datasets/ga4_obfuscated_sample_ecommerce`). Unknown IDs raise `ValueError` before enrichment starts.
</ParamField>

<RequestExample>
```bash
enrichment-agent enrich \
  --source bq \
  --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
  --concept tables/events_ \
  --no-web \
  --out ./bundles/ga4
```
</RequestExample>

### Model selection

<ParamField body="--model" type="string">
Gemini model ID passed to both the BQ and web ADK agents and to index regeneration. Default: `gemini-flash-latest`.
</ParamField>

Model credentials are not CLI flags. Configure one of the provider paths documented in [Installation](/installation):

| Provider | Environment variables |
|---|---|
| Google AI Studio | `GEMINI_API_KEY` |
| Vertex AI | `GOOGLE_GENAI_USE_VERTEXAI=true`, `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION` |

The BigQuery client authenticates with Application Default Credentials (set up with `gcloud auth application-default login`) and uses the ADC default project unless `--billing-project` overrides it.

### Web crawl flags

The web pass is skipped when `--no-web` is set or when no seeds are collected. Seeds come from `--web-seed` (repeatable) and/or `--web-seed-file` (repeatable; one URL per line, `#` comments allowed). Duplicate URLs are deduplicated in order.
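
Seed collection as described can be sketched as follows. The helper `collect_seeds` is hypothetical, and two details are assumptions: inline `--web-seed` values are placed before file contents, and only whole lines that begin with `#` are treated as comments (so URL fragments are preserved).

```python
def collect_seeds(inline_seeds, seed_file_texts):
    """Merge inline seeds and seed-file contents, deduplicating in order."""
    candidates = list(inline_seeds)
    for text in seed_file_texts:
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                candidates.append(line)
    seen = set()
    seeds = []
    for url in candidates:
        if url not in seen:  # keep the first occurrence only
            seen.add(url)
            seeds.append(url)
    return seeds
```

An empty result corresponds to the "no seeds are collected" case, in which the web pass is skipped.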

<ParamField body="--web-seed" type="string">
Individual seed URL. Repeatable.
</ParamField>

<ParamField body="--web-seed-file" type="path">
Path to a text file with one seed URL per line; `#` marks a comment. Repeatable; all files are merged.
</ParamField>

<ParamField body="--no-web" type="boolean">
Skip the web pass entirely. Seeds are ignored.
</ParamField>

<ParamField body="--web-max-pages" type="integer">
Hard cap on pages the web agent may fetch in one run. Default: `100`. Enforced inside the `fetch_url` tool; when the budget is spent, further fetches return `"max_pages reached"`.
</ParamField>

<ParamField body="--web-max-depth" type="integer">
Maximum hop distance from any seed URL. Seeds are depth `0`; their outbound links are depth `1`, and so on. Default: `2`. URLs beyond this depth are rejected.
</ParamField>

<ParamField body="--web-allowed-host" type="string">
Extra hostname the crawler may fetch beyond seed hostnames. Repeatable. By default, only hostnames extracted from seed URLs are allowed.
</ParamField>

<ParamField body="--web-allowed-path-prefix" type="string">
Only fetch URLs whose path starts with one of these prefixes (for example `/docs/`). Repeatable. Default: no path restriction.
</ParamField>

<ParamField body="--web-denied-path-substring" type="string">
Reject URLs whose path contains any of these substrings (for example `/login`, `/pricing`). Repeatable.
</ParamField>

#### Crawl enforcement

Constraints are enforced inside `fetch_url`, not by prompt alone. A rejected fetch returns an `error` field instead of page content; the agent should not retry the same URL.

| Constraint | Behavior |
|---|---|
| Scheme | Only `http` and `https` |
| Host | Must be in allowed-host set (seed hosts ∪ `--web-allowed-host`) |
| Path prefix | Must match at least one `--web-allowed-path-prefix` when any are set |
| Denied substring | Rejected if path contains any `--web-denied-path-substring` |
| Depth | Must be reachable from a seed via followed links and ≤ `--web-max-depth` |
| Reachability | URLs not returned as links from a fetched page are rejected ("not reachable from a seed") |
| Deduplication | Already-fetched URLs are rejected |
| Page budget | Rejected when `fetched_count >= --web-max-pages` |
| Content type | HTML only; non-HTML responses fail |
| Page size | Markdown body truncated at 40 KiB per page |
| Timeout | 10-second fetch timeout per page |
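
The URL-level rows of this table compose into a single admission check. The sketch below is a simplified model, not the `fetch_url` implementation: `CrawlPolicy` and its methods are hypothetical, the order of checks is an assumption, and apart from the two messages quoted above the rejection strings are illustrative. Content type, page size, and timeout apply after the request and are not modeled.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    allowed_hosts: set
    max_pages: int = 100
    max_depth: int = 2
    allowed_path_prefixes: tuple = ()
    denied_path_substrings: tuple = ()
    depth_by_url: dict = field(default_factory=dict)  # discovered URL -> depth
    fetched: set = field(default_factory=set)

    def add_seed(self, url: str) -> None:
        self.depth_by_url[url] = 0

    def record_fetch(self, url: str, links: list) -> None:
        """Mark url as fetched and register its outbound links."""
        self.fetched.add(url)
        depth = self.depth_by_url[url] + 1
        for link in links:
            self.depth_by_url[link] = min(depth, self.depth_by_url.get(link, depth))

    def check(self, url: str):
        """Return None when the fetch is allowed, else a rejection reason."""
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return "scheme not allowed"
        if parts.hostname not in self.allowed_hosts:
            return "host not allowed"
        prefixes = tuple(self.allowed_path_prefixes)
        if prefixes and not parts.path.startswith(prefixes):
            return "path prefix not allowed"
        if any(s in parts.path for s in self.denied_path_substrings):
            return "path denied"
        if url not in self.depth_by_url:
            return "not reachable from a seed"
        if self.depth_by_url[url] > self.max_depth:
            return "max depth exceeded"
        if url in self.fetched:
            return "already fetched"
        if len(self.fetched) >= self.max_pages:
            return "max_pages reached"
        return None
```

Tracking depth per discovered URL is what ties the reachability and depth rows together: a URL the agent invents has no recorded depth and is rejected before any request is made.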

### Logging

<ParamField body="-v, --verbose" type="boolean">
Set `enrichment_agent` loggers to DEBUG for full tool-call arguments and responses.
</ParamField>

## `visualize` subcommand

Generate a self-contained HTML graph viewer from an OKF bundle. No backend or install is required on the viewing side. See [Visualize OKF bundles](/visualize-okf-bundles) for viewer features.

<RequestExample>
```bash
enrichment-agent visualize \
  --bundle ./bundles/crypto_bitcoin \
  --out /tmp/btc.html \
  --name "Bitcoin OKF"
```
</RequestExample>

<ResponseExample>
```text
Wrote 18 concept(s), 24 edge(s), 2847192 bytes → /tmp/btc.html
```
</ResponseExample>

<ParamField body="--bundle" type="path" required>
Path to the bundle root directory. Raises `FileNotFoundError` if the directory does not exist.
</ParamField>

<ParamField body="--out" type="path">
Output HTML path. Default: `<bundle>/viz.html`.
</ParamField>

<ParamField body="--name" type="string">
Display name shown in the viewer header. Default: bundle directory name.
</ParamField>

The generator walks all `*.md` files except `index.md`, parses OKF frontmatter, extracts cross-links, and embeds the graph as JSON in a single HTML file with bundled CSS and JavaScript.
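
The graph-extraction step can be approximated in a few lines. This sketch is not the generator's code: `build_graph` is a hypothetical function that takes an in-memory mapping of bundle-relative paths to file text, and it assumes cross-links are ordinary relative Markdown links to `.md` files. The actual OKF cross-link semantics are defined in [Open Knowledge Format](/open-knowledge-format).

```python
import posixpath
import re

# Markdown links whose target ends in .md, with an optional #fragment.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s#]+\.md)(?:#[^)]*)?\)")


def build_graph(docs: dict) -> dict:
    """Build nodes and directed edges from {relative_path: markdown_text}."""
    def is_index(path: str) -> bool:
        return posixpath.basename(path) == "index.md"

    nodes = sorted(path for path in docs if not is_index(path))
    edges = set()
    for path in nodes:
        base = posixpath.dirname(path)
        for target in LINK.findall(docs[path]):
            resolved = posixpath.normpath(posixpath.join(base, target))
            # Keep only links that land on another concept document.
            if resolved in docs and resolved != path and not is_index(resolved):
                edges.add((path, resolved))
    return {"nodes": nodes, "edges": sorted(edges)}
```

Serializing the returned mapping with `json.dumps` gives the kind of payload that can be embedded in a single HTML file.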

## Quick recipes

<Steps>
<Step title="BQ-only enrichment">
Run metadata enrichment without web grounding:

```bash
enrichment-agent enrich \
  --source bq \
  --dataset <project>.<dataset> \
  --no-web \
  --out ./bundles/<name>
```
</Step>

<Step title="BQ plus web enrichment">
Provide documentation seeds for the web pass:

```bash
enrichment-agent enrich \
  --source bq \
  --dataset <project>.<dataset> \
  --web-seed-file <path/to/seeds.txt> \
  --web-max-pages 50 \
  --out ./bundles/<name>
```
</Step>

<Step title="Visualize the bundle">
After enrichment completes:

```bash
enrichment-agent visualize --bundle ./bundles/<name>
```

Open `<bundle>/viz.html` in a browser.
</Step>
</Steps>

Copy-paste dataset recipes with exact seed files and expected outputs are in [OKF bundle recipes](/okf-bundle-recipes).

## Environment variables

| Variable | Used by | Purpose |
|---|---|---|
| `GEMINI_API_KEY` | Gemini via ADK | API key for Google AI Studio |
| `GOOGLE_GENAI_USE_VERTEXAI` | Gemini via ADK | Set `true` to route through Vertex AI |
| `GOOGLE_CLOUD_PROJECT` | Vertex AI | GCP project for model calls |
| `GOOGLE_CLOUD_LOCATION` | Vertex AI | Region (for example `us-central1`) |
| ADC (via `gcloud auth application-default login`) | BigQuery client | Metadata reads and row sampling; billing project from ADC unless `--billing-project` is set |

No enrichment-agent-specific environment variables exist beyond the Gemini/Vertex configuration consumed by `google-adk` and `google-genai`.
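
A preflight check on these variables can catch a missing credential before a run starts. The sketch below is a simplification: `detect_provider` is a hypothetical helper, it accepts only the literal value `true` for `GOOGLE_GENAI_USE_VERTEXAI`, and giving Vertex AI precedence when both providers are configured is an assumption rather than documented behavior.

```python
def detect_provider(env) -> str:
    """Return 'vertex-ai' or 'ai-studio' based on a mapping of env vars."""
    if str(env.get("GOOGLE_GENAI_USE_VERTEXAI", "")).lower() == "true":
        required = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise ValueError("Vertex AI selected but missing: " + ", ".join(missing))
        return "vertex-ai"
    if env.get("GEMINI_API_KEY"):
        return "ai-studio"
    raise ValueError("No model credentials found in the environment")
```

Pass `os.environ` in real use; taking the mapping as a parameter keeps the function easy to exercise with plain dictionaries.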

## Exit codes and errors

| Condition | Result |
|---|---|
| Missing `--dataset` for `--source bq` | Exit `1` with message `--dataset is required for --source bq` |
| Unknown `--source` value | Exit `1` with `Unknown source: …` |
| Invalid `--concept` ID (unknown to source) | `ValueError: Unknown concept(s): …` |
| Invalid dataset format (not `project.dataset`) | `ValueError` from `BigQuerySource` |
| Missing bundle directory for `visualize` | `FileNotFoundError` |

## Distinction from catalog enrichment agent

This CLI (`enrichment-agent` in the `okf/` package) produces standalone OKF bundle directories from BigQuery. The separate catalog enrichment agent (`agent_runner.py` in `agents/enrichment/`) runs table, doc, and context_overlay modes against Knowledge Catalog workspaces. See [Enrichment agent flags](/enrichment-agent-flags) for that agent's flag reference.

## Related pages

<Card href="/installation" title="Installation" icon="download">
Prerequisites, package install, and credential setup for BigQuery and Gemini.
</Card>

<Card href="/produce-okf-bundles" title="Produce OKF bundles" icon="package">
End-to-end workflow for BQ-then-web enrichment into a versionable bundle.
</Card>

<Card href="/visualize-okf-bundles" title="Visualize OKF bundles" icon="network">
Graph viewer features: force-directed layout, detail panels, backlinks, and search.
</Card>

<Card href="/okf-bundle-recipes" title="OKF bundle recipes" icon="book-open">
GA4, Stack Overflow, and Bitcoin sample commands with seed files.
</Card>

<Card href="/troubleshooting" title="Troubleshooting" icon="life-buoy">
Auth, billing, web crawl cap, and model credential failures.
</Card>

---

## 18. Enrichment agent flags reference

> agent_runner.py flags by mode: table, doc, context_overlay; source inputs, usage signal, glossaries, feedback, GitHub MCP, refinement, and required Vertex project and model values.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/18-enrichment-agent-flags-reference.md
- Generated: 2026-06-15T02:56:29.547Z

### Source Files

- `agents/enrichment/src/agent_runner.py`
- `agents/enrichment/README.md`
- `agents/enrichment/src/modes/table_mode.py`
- `agents/enrichment/src/modes/doc_mode.py`
- `agents/enrichment/src/tools/feedback_tools.py`
- `agents/enrichment/src/tools/bq_usage_tools.py`
- `agents/enrichment/src/linking.py`

---
title: "Enrichment agent flags reference"
description: "agent_runner.py flags by mode: table, doc, context_overlay; source inputs, usage signal, glossaries, feedback, GitHub MCP, refinement, and required Vertex project and model values."
---

`agents/enrichment/src/agent_runner.py` is the unified CLI entrypoint for the Knowledge Catalog enrichment agent. It parses Abseil flags, configures Vertex AI from `--project`, `--location`, and `--model`, then dispatches to `doc_mode`, `table_mode`, or `context_overlay_mode`. The agent runs only commands that do not write to the catalog, `kcmd init` and `kcmd pull` (or `kcmd reference` in overlay mode); you publish the result yourself with `kcmd push`.

<Note>
Set `PYTHONPATH=agents/enrichment/src` before invoking the runner. Run `python3 agents/enrichment/src/agent_runner.py --help` for the live flag list.
</Note>

## Invocation

```bash
export PYTHONPATH=agents/enrichment/src

python3 agents/enrichment/src/agent_runner.py \
  --mode=<doc|table|context_overlay> \
  --project=<gcp_project> \
  --model=<vertex_model> \
  --output_dir=<local_dir> \
  [mode-specific flags]
```

On startup the runner sets:

| Environment variable | Source |
|---|---|
| `GOOGLE_GENAI_USE_VERTEXAI` | `True` |
| `GOOGLE_CLOUD_PROJECT` | `--project` |
| `GOOGLE_CLOUD_LOCATION` | `--location` (default `global`) |

## Mode selection

<ParamField body="mode" type="enum">
One of `doc`, `table`, or `context_overlay`. Empty string triggers inference: if `--dataset` is set, mode becomes `table`; otherwise `doc`. `context_overlay` is never inferred — pass `--mode=context_overlay` explicitly.
</ParamField>

```mermaid
flowchart TD
  A[agent_runner.py] --> B{--refine_instruction set?}
  B -->|yes| R[refine.run_one_refinement]
  B -->|no| C{--mode or inferred}
  C -->|doc| D[doc_mode.run]
  C -->|table| T[table_mode.run]
  C -->|context_overlay| O[context_overlay_mode.run]
```
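
The dispatch rules in the parameter description and flowchart above fit in one small function. This is a sketch of the decision logic only; `resolve_dispatch` is a hypothetical name and the real runner works on Abseil flag values rather than keyword arguments.

```python
def resolve_dispatch(mode: str = "", dataset: str = "",
                     refine_instruction: str = "") -> str:
    """Return which code path a given flag combination selects."""
    if refine_instruction:
        # Refinement takes precedence over any mode.
        return "refine"
    if mode:
        if mode not in ("doc", "table", "context_overlay"):
            raise ValueError(f"unknown mode: {mode}")
        return mode
    # Empty mode: infer from --dataset. context_overlay is never inferred.
    return "table" if dataset else "doc"
```

Note that passing both `--dataset` and `--entry_group` without `--mode` still infers `table`, which is why overlay runs must name their mode explicitly.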

| Mode | Target | kcmd behavior | Primary output |
|---|---|---|---|
| `table` | BigQuery dataset | `kcmd init --bigquery-dataset` + `pull` | Enriched overviews on live `@bigquery` entries |
| `doc` | Knowledge-base entry group | `kcmd init --entry-group` + `pull` | LLM-emitted generic KB entries |
| `context_overlay` | Dataset + editable entry group | `kcmd reference` (read-only 1P tables) | New overlay entries in `--entry_group` |
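
The selection rules in this section can be sketched in Python. This is an illustrative reconstruction of the documented behavior, not the runner's source: `resolve_dispatch` and its return labels are hypothetical, and raising `ValueError` for an unknown mode is an assumption.

```python
VALID_MODES = ("doc", "table", "context_overlay")


def resolve_dispatch(mode: str = "", dataset: str = "",
                     refine_instruction: str = "") -> str:
    """Return the handler a run would reach (hypothetical helper)."""
    # A refinement instruction short-circuits mode dispatch entirely.
    if refine_instruction:
        return "refine"
    if not mode:
        # Empty mode is inferred: a dataset implies table mode, else doc mode.
        # context_overlay is never inferred.
        return "table" if dataset else "doc"
    if mode not in VALID_MODES:
        # Assumption: the real runner rejects this at flag-parsing time.
        raise ValueError(f"unknown mode: {mode!r}")
    return mode
```

The order of the checks matters: refinement is tested first, so a `--mode` value passed alongside `--refine_instruction` has no effect on which handler runs.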

## Required configuration

These flags are validated in `main()` before any mode runs:

<ParamField body="project" type="string" required>
Google Cloud project that hosts the Vertex AI model. Raises `UsageError` when omitted.
</ParamField>

<ParamField body="model" type="string" required>
Vertex AI model ID for reasoning-heavy steps (for example `gemini-2.5-pro`). Raises `UsageError` when omitted. Structured sub-steps use a pinned Flash model internally.
</ParamField>

<ParamField body="location" type="string">
Vertex AI region for the model. Default: `global`. Example: `--location=us-central1`.
</ParamField>

<ParamField body="output_dir" type="string">
Local directory for the generated mdcode tree, `trajectory.json`, and `refine_session.json`. Required for enrichment runs (modes exit early without it) and mandatory with `--refine_instruction`.
</ParamField>

### Mode-specific requirements

| Flag | `doc` | `table` | `context_overlay` |
|---|---|---|---|
| `--project` | required | required | required |
| `--model` | required | required | required |
| `--output_dir` | required* | required* | required* |
| `--dataset` | — | required | required |
| `--entry_group` | required | — | required |

\*Not enforced by `UsageError` in `main()`, but each mode prints an error and returns when `output_dir` is missing.
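
A preflight check in a wrapper script can catch missing flags before the runner starts. The sketch below copies the requirements table above into a lookup; `missing_flags` is a hypothetical helper and does not reproduce how `main()` raises `UsageError`.

```python
REQUIRED_BY_MODE = {
    "doc": ["project", "model", "output_dir", "entry_group"],
    "table": ["project", "model", "output_dir", "dataset"],
    "context_overlay": ["project", "model", "output_dir", "dataset", "entry_group"],
}


def missing_flags(mode: str, flags: dict) -> list:
    """List required flags that are unset or empty for the given mode."""
    return [name for name in REQUIRED_BY_MODE[mode] if not flags.get(name)]
```

For example, `missing_flags("doc", {"project": "p", "model": "m", "output_dir": "/tmp/out"})` reports that `entry_group` is still needed.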

<ParamField body="dataset" type="string">
BigQuery dataset as `project.dataset` (for example `my-proj.analytics`). Required in `table` and `context_overlay` modes.
</ParamField>

<ParamField body="entry_group" type="string">
Entry group as `project.location.entryGroupId`. Required in `doc` and `context_overlay` modes. In **doc** mode the entry group must already exist (the agent does not create it). In **context_overlay** mode this is where new overlay entries are written.
</ParamField>
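
Both identifiers are dot-separated with a fixed number of parts. As a rough sketch of that shape check (assuming no component itself contains a dot; `split_identifier` is a hypothetical name, and the runner's own parser may differ):

```python
DATASET_PARTS = ("project", "dataset")
ENTRY_GROUP_PARTS = ("project", "location", "entryGroupId")


def split_identifier(value: str, parts: tuple) -> dict:
    """Split a dotted identifier into named parts, or raise ValueError."""
    pieces = value.split(".")
    # Reject the wrong number of components and empty components alike.
    if len(pieces) != len(parts) or not all(pieces):
        raise ValueError(f"expected {'.'.join(parts)}, got {value!r}")
    return dict(zip(parts, pieces))
```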

## Flag applicability by mode

`✓` = optional, `—` = not used by the mode. Every flag in this table is optional; required flags are listed under Mode-specific requirements.

| Flag | `doc` | `table` | `context_overlay` |
|---|---|---|---|
| `--mode` | ✓ | ✓ | ✓ |
| `--topic` | ✓ | ✓ | ✓ |
| `--folders` / `--folder` | ✓ | ✓ | ✓ |
| `--docs` | ✓ | — | ✓ |
| `--tables` | — | — | ✓ |
| `--include_usage` | — | ✓ | ✓ |
| `--usage_window_days` | — | ✓ | ✓ |
| `--usage_scope` | — | ✓ | ✓ |
| `--anonymize_users` | — | ✓ | ✓ |
| `--glossaries` | — | ✓ | — |
| `--feedback_dir` | ✓ | ✓ | ✓ |
| `--feedback_files` | ✓ | ✓ | ✓ |
| `--repo` | ✓ | ✓ | ✓ |
| `--repo_ref` | ✓ | ✓ | ✓ |
| `--repo_subdir` | ✓ | ✓ | ✓ |
| `--mcp_config` | ✓ | ✓ | ✓ |
| `--interactive` | ✓ | ✓ | ✓ |
| `--refine_instruction` | ✓ | ✓ | ✓ |

## Source context flags

<ParamField body="topic" type="string">
Free-text use case or instruction steering enrichment. Default: `Metadata enrichment`.
</ParamField>

<ParamField body="folders" type="list">
Comma-separated mixed list routed per entry: Google Drive folder URLs/IDs and/or local directories of `.md` files (read recursively). In **doc** mode, Drive folders seed depth-1 children; in **table** and **context_overlay** modes they join the relevance-router candidate pool.
</ParamField>

<ParamField body="folder" type="list">
Deprecated alias for `--folders`. Both lists are merged.
</ParamField>

<ParamField body="docs" type="list">
Comma-separated mixed list: Google Doc URLs/IDs and/or local `.md` files. **doc** mode: depth-0 spine documents. **context_overlay** mode: routed to tables alongside folder content. **Not passed to table mode** — use `--folders` there.
</ParamField>

<ParamField body="tables" type="list">
**context_overlay only.** Restrict enrichment to specific tables — short names or `proj.ds.table` FQNs. Empty means every table in `--dataset`.
</ParamField>

### Local vs Drive routing

Each entry in `--docs` or `--folders` is classified independently:

1. Starts with `http://` or `https://` → Google Drive
2. Ends in `.md` or `.markdown` → local Markdown file
3. Path-shaped (`/abs`, `./rel`, `../rel`, `~/path`, or contains `/`) → local directory or file
4. Bare name that exists on disk → local
5. Otherwise → Google Drive ID

Absolute paths are recommended; relative paths resolve from the agent's working directory.
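
The five routing rules above can be read as an ordered chain of checks. The following is a minimal sketch of that ordering, not the agent's implementation; `classify_source` is a hypothetical name, and the `exists` parameter is injected only so the disk check can be replaced in tests.

```python
import os
from typing import Callable


def classify_source(entry: str,
                    exists: Callable[[str], bool] = os.path.exists) -> str:
    """Classify one --docs/--folders entry as 'drive' or 'local'."""
    if entry.startswith(("http://", "https://")):
        return "drive"   # rule 1: URL
    if entry.endswith((".md", ".markdown")):
        return "local"   # rule 2: Markdown file
    if entry.startswith("~") or "/" in entry:
        return "local"   # rule 3: path-shaped
    if exists(entry):
        return "local"   # rule 4: bare name that exists on disk
    return "drive"       # rule 5: anything else is treated as a Drive ID
```

Because rule 4 depends on the working directory, a bare name such as `corpus` can route differently from one machine to another, which is one reason to prefer absolute paths.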

## BigQuery usage signal

Applies to `table` and `context_overlay` modes. Fetches query history from `region-<R>.INFORMATION_SCHEMA.JOBS_BY_PROJECT` (with `JOBS_BY_USER` fallback) and emits a `<table>.queries.md` sidecar conforming to the Dataplex `queries` aspect type.

<ParamField body="include_usage" type="bool">
Fetch and emit per-table usage signals. Default: `true`. Set `--include_usage=false` to skip the BigQuery scan entirely.
</ParamField>

<ParamField body="usage_window_days" type="integer">
Days of query history to aggregate. Default: `30`.
</ParamField>

<ParamField body="usage_scope" type="enum">
One of `auto`, `project`, or `user`. Default: `auto`.

- `auto` — try `JOBS_BY_PROJECT`, fall back to `JOBS_BY_USER` on permission failure
- `project` — require `JOBS_BY_PROJECT`
- `user` — only the caller's own queries (always works, narrower signal)
</ParamField>
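
The scope values amount to an ordered list of views to try. A sketch of that fallback follows, assuming a hypothetical `run_query(view)` callable that raises `PermissionError` when the caller cannot read the view (the real BigQuery client raises its own exception types):

```python
def usage_views(scope: str = "auto") -> list:
    """Return the INFORMATION_SCHEMA views to try, in order."""
    order = {
        "auto": ["JOBS_BY_PROJECT", "JOBS_BY_USER"],
        "project": ["JOBS_BY_PROJECT"],
        "user": ["JOBS_BY_USER"],
    }
    if scope not in order:
        raise ValueError(f"unknown usage scope: {scope!r}")
    return order[scope]


def fetch_usage(scope: str, run_query):
    """Try each view in turn; fall back only on a permission failure."""
    views = usage_views(scope)
    for position, view in enumerate(views):
        try:
            return view, run_query(view)
        except PermissionError:
            if position == len(views) - 1:
                raise  # no further view to fall back to
```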

<ParamField body="anonymize_users" type="bool">
Replace user emails with stable SHA hashes in the usage signal. Default: `false`.
</ParamField>
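
To illustrate what a stable hash means here: the same email always maps to the same token, so per-user query counts survive anonymization. The sketch assumes SHA-256, lowercase normalization, a `user_` prefix, and 12-character truncation; none of these details are confirmed by the source, and `anonymize_email` is a hypothetical name.

```python
import hashlib


def anonymize_email(email: str, length: int = 12) -> str:
    """Map an email to a stable, non-reversible token (assumed scheme)."""
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"user_{digest[:length]}"
```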

## Glossary column linking

**table mode only.** When `--glossaries` is set, the agent pulls glossary terms as reference, runs the LinkingAgent (`linking.apply_column_linking`), and injects field-level `links.definition` into each `<table>.yaml`.

<ParamField body="glossaries" type="list">
Comma-separated Dataplex glossaries as `project.location.glossaryId` (for example `my-proj.us.business-glossary`). Invalid entries disable linking with a warning.
</ParamField>
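
A sketch of how the flag value might be parsed is shown below. It reads "invalid entries disable linking" as "one malformed entry turns linking off for the whole run", which is an assumption about the agent's behavior; `parse_glossaries` is a hypothetical name.

```python
import warnings


def parse_glossaries(flag_value: str) -> list:
    """Parse a comma-separated glossary list into (project, location, id) tuples."""
    parsed = []
    for entry in (item.strip() for item in flag_value.split(",")):
        if not entry:
            continue  # tolerate trailing commas
        parts = entry.split(".")
        if len(parts) != 3 or not all(parts):
            warnings.warn(f"invalid glossary {entry!r}; column linking disabled")
            return []  # assumed: any invalid entry disables linking
        parsed.append(tuple(parts))
    return parsed
```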

## User feedback proposals

All three modes accept feedback. Proposals are the **highest-priority** context source and override conflicting information from Drive docs, semantic search, and `INFORMATION_SCHEMA` patterns.

<ParamField body="feedback_dir" type="string">
Directory walked recursively for `.md` and `.json` feedback files. Each file holds pure JSON shaped `{"proposals": [...]}`.
</ParamField>

<ParamField body="feedback_files" type="list">
Explicit comma-separated feedback file paths. Combinable with `--feedback_dir`.
</ParamField>

Each proposal includes `target_asset` (TABLE or COLUMN FQN), `proposed_enrichment`, and optionally `eval_candidate.golden_sql`. In table and overlay modes proposals route per table; in doc mode they apply globally. Valid `golden_sql` values become `[Source: User Feedback]` entries in the queries aspect.

<AccordionGroup>
<Accordion title="Proposal JSON shape">
```json
{
  "proposals": [
    {
      "classification": {"detection_signal": "...", "gap_type": "..."},
      "target_asset": {"type": "COLUMN", "name": "proj.ds.table.column"},
      "current_context_flaw": "what was wrong",
      "proposed_enrichment": {"action": "ADD_SYNONYM", "value": "..."},
      "evidence": {"reasoning": "...", "trajectory_quote": "..."},
      "confidence_grade": 0.9,
      "eval_candidate": {
        "is_valid_candidate": true,
        "user_query_intent": "How many orders last week?",
        "golden_sql": "SELECT COUNT(*) FROM `proj.ds.orders` ..."
      }
    }
  ]
}
```
</Accordion>
</AccordionGroup>
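
To preview what the agent will see, you can load feedback files and group proposals by table yourself. This is a sketch under stated assumptions: `load_proposals` and `group_by_table` are hypothetical helpers, and dropping the last dotted component of a COLUMN name to get its table is inferred from the FQN shape shown above.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_proposals(feedback_dir: str) -> list:
    """Collect proposals from every .md and .json file under feedback_dir."""
    proposals = []
    for path in sorted(Path(feedback_dir).rglob("*")):
        if path.is_file() and path.suffix in (".md", ".json"):
            # Each file is expected to hold pure JSON: {"proposals": [...]}.
            data = json.loads(path.read_text(encoding="utf-8"))
            proposals.extend(data.get("proposals", []))
    return proposals


def group_by_table(proposals: list) -> dict:
    """Group proposals by table FQN; COLUMN targets drop the column part."""
    grouped = defaultdict(list)
    for proposal in proposals:
        target = proposal["target_asset"]
        name = target["name"]
        if target["type"] == "COLUMN":
            name = name.rsplit(".", 1)[0]
        grouped[name].append(proposal)
    return dict(grouped)
```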

## GitHub source-code input

All modes accept an optional GitHub repository explored agentically via the GitHub MCP server. Failure is non-fatal — the run degrades to no code context.

<ParamField body="repo" type="string">
GitHub repo as `owner/name` or a GitHub URL. Empty disables code context.
</ParamField>

<ParamField body="repo_ref" type="string">
Branch, tag, or SHA. Empty uses the repo default branch.
</ParamField>

<ParamField body="repo_subdir" type="string">
Path prefix to scope exploration (for example `src/server`).
</ParamField>

<ParamField body="mcp_config" type="string">
Path to an `mcp.json` describing the GitHub MCP server. Falls back to `KC_ENRICH_MCP_CONFIG`, then the built-in remote HTTP server.
</ParamField>

| Environment variable | Purpose |
|---|---|
| `GITHUB_PERSONAL_ACCESS_TOKEN` | Token for the MCP server (default env var name) |
| `KC_ENRICH_MCP_CONFIG` | Path to `mcp.json` when `--mcp_config` is unset |
| `KC_ENRICH_GITHUB_MCP_SERVER` | Server key in `mcp.json` (default `github_remote`; use `github` for local stdio) |
| `KC_ENRICH_GITHUB_MCP_URL` | Override remote MCP URL (default `https://api.githubcopilot.com/mcp/`) |

In **doc** mode, code component cards surface as their own KB entries. In **table** and **context_overlay** modes, cards join the relevance-router pool so code referencing a table grounds that table's overview and queries aspect.

<RequestExample>
```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_...
export PYTHONPATH=agents/enrichment/src

python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=./local_md_corpus \
  --repo=my-org/etl-pipeline --repo_ref=main \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```
</RequestExample>

## Refinement flags

After the initial run, refinement reuses loaded context and never re-reads source docs or re-pulls the dataset.

<ParamField body="interactive" type="bool">
Stay in a `refine>` REPL after enrichment for free-text changes. Default: `false`. No-op on a non-TTY.
</ParamField>

<ParamField body="refine_instruction" type="string">
Apply one refinement turn to the saved session in `--output_dir`, then exit. Skips the enrichment pipeline entirely. Requires a prior run's `refine_session.json`. Used by the webapp persist-and-re-invoke flow.
</ParamField>

When `--refine_instruction` is set, only `--output_dir`, `--project`, and `--model` are needed (plus the instruction itself). The runner calls `refine.run_one_refinement` and returns without dispatching to a mode.

<RequestExample>
```bash
# Initial run
python3 agents/enrichment/src/agent_runner.py \
  --mode=doc \
  --docs="https://docs.google.com/document/d/abc123" \
  --entry_group=my-proj.us.my-kb \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out

# Single refinement turn (no pipeline re-run)
python3 agents/enrichment/src/agent_runner.py \
  --refine_instruction="make the overview more concise" \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```
</RequestExample>

## Supplementary environment variables

| Variable | Used by | Effect |
|---|---|---|
| `KCMD_BIN` | `kcmd_tools.py` | Override path to the `kcmd` binary |
| `KC_ENRICH_CACHE_MODE` | `drive_tools`, `bq_usage_tools` | `off`, `raw`, or `summary` (default `summary`) |
| `KC_ENRICH_CACHE` | `drive_tools` | Legacy alias: `off` forces cache off |
| `KC_ENRICH_USAGE_SNAPSHOT_TABLE` | `bq_usage_tools` | Reserved for a future snapshot-table fast path |
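
How the two cache variables interact can be sketched like this. The sketch assumes the legacy `KC_ENRICH_CACHE=off` takes precedence over `KC_ENRICH_CACHE_MODE` (the table says it "forces" the cache off) and that unknown values are rejected; both points are assumptions, and `resolve_cache_mode` is a hypothetical name.

```python
import os

CACHE_MODES = ("off", "raw", "summary")


def resolve_cache_mode(env=None) -> str:
    """Resolve the effective cache mode from the environment."""
    env = os.environ if env is None else env
    if env.get("KC_ENRICH_CACHE") == "off":
        return "off"  # legacy alias wins (assumed precedence)
    mode = env.get("KC_ENRICH_CACHE_MODE", "summary")
    if mode not in CACHE_MODES:
        # Assumption: the real tools may fall back to a default instead.
        raise ValueError(f"unknown cache mode: {mode!r}")
    return mode
```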

## Mode examples

<CodeGroup>
```bash title="Table mode"
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-proj.analytics \
  --folders=https://drive.google.com/drive/folders/abc,./corpus \
  --topic="Customer 360 data" \
  --glossaries=my-proj.us.business-glossary \
  --include_usage=true --usage_window_days=90 \
  --project=my-gcp-project \
  --location=us-central1 \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

```bash title="Doc mode"
python3 agents/enrichment/src/agent_runner.py \
  --mode=doc \
  --docs="https://docs.google.com/document/d/abc,./notes/spine.md" \
  --folders=./local_md_corpus \
  --topic="Order pipeline documentation" \
  --entry_group=my-proj.us.my-kb \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```

```bash title="Context overlay mode"
python3 agents/enrichment/src/agent_runner.py \
  --mode=context_overlay \
  --dataset=my-proj.analytics \
  --entry_group=my-proj.us.overlays \
  --tables=orders,customers \
  --folders=./corpus \
  --docs=./notes/table-notes.md \
  --project=my-gcp-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out
```
</CodeGroup>

## Error cases

| Condition | Result |
|---|---|
| Missing `--project` or `--model` | `UsageError` before mode dispatch |
| `context_overlay` without `--dataset` or `--entry_group` | `UsageError` |
| `doc` without `--entry_group` | `UsageError` |
| `--refine_instruction` without `--output_dir` | `UsageError` |
| Missing `--output_dir` on enrichment run | Mode prints error and exits (no mdcode written) |
| Extra positional arguments | `UsageError: Too many command-line arguments` |
| Invalid `--dataset` format | `ValueError` — must be `project.dataset` |
| Invalid `--entry_group` format | `ValueError` — must be `project.location.entryGroupId` |
| Unparseable `--repo` | `ValueError` from `github_tools.parse_repo` |
| GitHub MCP unreachable | Warning logged; run continues without code context |

<Warning>
Doc mode requires the target entry group to exist before the run. Create it with `gcloud dataplex entry-groups create` or equivalent, then pass `--entry_group`.
</Warning>

## Related pages

<CardGroup>
<Card title="Run the catalog enrichment agent" href="/run-catalog-enrichment-agent">
End-to-end workflow for table, doc, and context_overlay modes with Drive, feedback, and refinement.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How enrichment agents read metadata, ground on external sources, and hand off to kcmd push.
</Card>
<Card title="Installation" href="/installation">
Prerequisites, Python setup, ADC scopes, and Vertex AI credential configuration.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth, billing, glossary provisioning, and model credential failures.
</Card>
</CardGroup>

---

## 19. OKF bundle recipes

> Copy-paste enrichment recipes for GA4 merchandise store, Stack Overflow, and Bitcoin public datasets with seed files, exact commands, and expected bundle outputs.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/19-okf-bundle-recipes.md
- Generated: 2026-06-15T02:57:24.545Z

### Source Files

- `okf/samples/ga4_merch_store/README.md`
- `okf/samples/stackoverflow/README.md`
- `okf/samples/crypto_bitcoin/README.md`
- `okf/samples/ga4_merch_store/seeds.txt`
- `okf/samples/stackoverflow/seeds.txt`
- `okf/bundles/stackoverflow/datasets/stackoverflow.md`
- `okf/bundles/crypto_bitcoin/index.md`

---
title: "OKF bundle recipes"
description: "Copy-paste enrichment recipes for GA4 merchandise store, Stack Overflow, and Bitcoin public datasets with seed files, exact commands, and expected bundle outputs."
---

The `okf/samples/` directory pairs three copy-paste enrichment recipes with checked-in reference bundles under `okf/bundles/`. Each recipe runs `enrichment-agent enrich` against a BigQuery public dataset, optionally seeds a two-pass web crawl from a `seeds.txt` file, and writes an OKF v0.1 bundle of markdown concept documents with YAML frontmatter and auto-generated `index.md` navigation files.

<Info>
All commands below assume your working directory is `okf/` in the Knowledge Catalog repository. Install the agent with `python3.13 -m venv .venv` and `.venv/bin/pip install --index-url https://pypi.org/simple/ -e .[dev]`.
</Info>

## Recipe comparison

| Recipe | BigQuery dataset | Seed file | Output directory (recipe) | Checked-in reference bundle | Enrichment pattern |
|--------|------------------|-----------|---------------------------|----------------------------|--------------------|
| GA4 Google Merchandise Store | `bigquery-public-data.ga4_obfuscated_sample_ecommerce` | `samples/ga4_merch_store/seeds.txt` | `./bundles/ga4_merch_store` | `bundles/ga4/` | Single sharded `events_` table plus GA4 export reference docs |
| Stack Overflow | `bigquery-public-data.stackoverflow` | `samples/stackoverflow/seeds.txt` | `./bundles/stackoverflow` | `bundles/stackoverflow/` | Many entity tables; one schema page enriches multiple concepts |
| Bitcoin (crypto) | `bigquery-public-data.crypto_bitcoin` | `samples/crypto_bitcoin/seeds.txt` | `./bundles/crypto_bitcoin` | `bundles/crypto_bitcoin/` | Four tightly related fact tables with cross-table foreign-key prose |

## Shared prerequisites

<Steps>
<Step title="Install the enrichment agent">

From `okf/`:

```bash
python3.13 -m venv .venv
.venv/bin/pip install --index-url https://pypi.org/simple/ -e .[dev]
```

</Step>

<Step title="Configure BigQuery access">

Public datasets are readable, but your project is billed for query bytes:

```bash
gcloud auth application-default login
gcloud config set project <your-billing-project>
```

</Step>

<Step title="Configure model credentials">

Use either AI Studio or Vertex AI:

<Tabs>
<Tab title="AI Studio">

```bash
export GEMINI_API_KEY=<your-key>
```

</Tab>
<Tab title="Vertex AI">

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<your-project-id>
export GOOGLE_CLOUD_LOCATION=<region>
```

</Tab>
</Tabs>

</Step>
</Steps>

## GA4 Google Merchandise Store

The GA4 recipe targets the public GA4 e-commerce export from the Google Merchandise Store. The BQ pass produces dataset and table concepts; the web pass starts from the canonical GA4 BigQuery Export documentation and may mint reference docs for metrics and joins.

### Seed file

`okf/samples/ga4_merch_store/seeds.txt` lists GA4 BigQuery Export URLs. Lines starting with `#` are comments; blank lines are ignored. The web agent crawls outward from each seed URL, following links it judges relevant, restricted to seed hostnames by default.

```text title="samples/ga4_merch_store/seeds.txt"
# GA4 BigQuery Export — top-level overview and index
https://support.google.com/analytics/answer/7029846

# GA4 BigQuery Export — schema reference (events, items, params)
https://support.google.com/analytics/answer/7029846?hl=en
```
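
The seed-file rules described above (comments, blank lines, hostnames derived from seeds) can be sketched in a few lines. `parse_seeds` and `seed_hosts` are hypothetical helpers for inspecting a seed file before a run, not functions from the enrichment agent.

```python
from urllib.parse import urlparse


def parse_seeds(text: str) -> list:
    """Return seed URLs, skipping blank lines and '#' comment lines."""
    seeds = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            seeds.append(line)
    return seeds


def seed_hosts(seeds: list) -> set:
    """Hostnames the crawl is restricted to by default."""
    return {urlparse(url).hostname for url in seeds}
```

For the GA4 file above, both seeds share the hostname `support.google.com`, so the default crawl stays on that host.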

### Enrich command

<RequestExample>

```bash title="GA4 enrichment"
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
    --web-seed-file samples/ga4_merch_store/seeds.txt \
    --out ./bundles/ga4_merch_store
```

</RequestExample>

### Iteration flags

<ParamField body="--concept" type="string">
Scope enrichment to a single concept id; repeat the flag to scope to several. For GA4, use `tables/events_` to iterate on the sharded events table only.
</ParamField>

<ParamField body="--no-web" type="boolean">
Skip the web pass and emit BQ-only concepts.
</ParamField>

<ParamField body="--web-max-pages" type="integer" default="100">
Hard cap on pages the web agent may fetch in one run.
</ParamField>

Smoke-run example:

```bash title="Single-concept BQ-only"
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
    --concept tables/events_ \
    --no-web \
    --out ./bundles/ga4_merch_store
```

### Expected bundle output

The recipe writes to `./bundles/ga4_merch_store/`. The repository's checked-in reference bundle is at `bundles/ga4/` (same dataset, pre-generated output you can browse without re-running).

:::files
bundles/ga4/
├── index.md
├── datasets/
│   ├── index.md
│   └── ga4_obfuscated_sample_ecommerce.md
├── tables/
│   ├── index.md
│   └── events_.md
├── references/
│   ├── index.md
│   ├── joins/
│   └── metrics/
└── viz.html
:::

| Directory | Contents |
|-----------|----------|
| `datasets/` | One OKF doc for the dataset concept with overview, schema links, and example SQL |
| `tables/` | One doc for `events_` with field-level schema, metrics links, and query patterns |
| `references/` | Standalone docs minted from seeded GA4 pages (metrics definitions, join specs) |
| `index.md` | Auto-generated progressive-disclosure index at each directory level |

<ResponseExample>

```text title="Successful run (stderr)"
Enriched N concept(s) into bundles/ga4_merch_store; web pass used 2 seed(s)
```

</ResponseExample>

## Stack Overflow public dataset

The Stack Overflow recipe targets `bigquery-public-data.stackoverflow`, a mirror of the Stack Exchange Data Dump. Unlike GA4's single primary events table, this recipe exercises **multi-concept enrichment**: one community schema page often describes several tables (`posts_questions`, `posts_answers`, `users`), so a single fetched page may update multiple concept docs.

<Warning>
Stack Overflow tables are large. Keep `--web-max-pages` modest while iterating, and prefer `--concept` for smoke runs.
</Warning>

### Seed file

`okf/samples/stackoverflow/seeds.txt` points at canonical Stack Exchange schema references:

```text title="samples/stackoverflow/seeds.txt"
https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
https://data.stackexchange.com/help
https://archive.org/details/stackexchange
```

### Enrich command

<RequestExample>

```bash title="Stack Overflow enrichment"
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.stackoverflow \
    --web-seed-file samples/stackoverflow/seeds.txt \
    --out ./bundles/stackoverflow
```

</RequestExample>

Single-concept iteration:

```bash title="posts_questions only"
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.stackoverflow \
    --web-seed-file samples/stackoverflow/seeds.txt \
    --concept tables/posts_questions \
    --web-max-pages 20 \
    --out ./bundles/stackoverflow
```

### Expected bundle output

:::files
bundles/stackoverflow/
├── index.md
├── datasets/
│   └── stackoverflow.md
├── tables/
│   ├── posts_questions.md
│   ├── posts_answers.md
│   ├── users.md
│   ├── votes.md
│   ├── comments.md
│   ├── badges.md
│   ├── tags.md
│   └── … (additional post-type tables)
├── references/
│   ├── sede_tables.md
│   ├── post_type_ids.md
│   ├── flag_types.md
│   └── … (enumerated types and SEDE docs)
└── viz.html
:::

| Layer | Scale in reference bundle | Role |
|-------|---------------------------|------|
| `datasets/` | 1 dataset doc | Container overview, table index, common query patterns |
| `tables/` | 17 table docs | One OKF doc per BQ table with schema and cross-links |
| `references/` | 33 reference docs | Standalone concepts minted from seeded schema and SEDE pages |

Dataset docs include frontmatter fields such as `type: BigQuery Dataset`, `resource` (BigQuery API URL), `title`, `description`, `tags`, and `timestamp`, plus markdown body sections for schema, query patterns, and citations.

## Bitcoin public dataset

The Bitcoin recipe targets `bigquery-public-data.crypto_bitcoin` — blocks, transactions, inputs, and outputs produced by the open-source `bitcoin-etl` pipeline. This recipe contrasts with GA4 (single denormalized events table) and Stack Overflow (many independent entities) by surfacing **cross-table foreign-key relationships** in prose: each `transactions` row references `blocks`, `inputs`, and `outputs`.

<Warning>
The `transactions` table is hundreds of GB. Prefer `--concept` for smoke runs and keep `--web-max-pages` low while iterating.
</Warning>

### Seed file

`okf/samples/crypto_bitcoin/seeds.txt`:

```text title="samples/crypto_bitcoin/seeds.txt"
https://github.com/blockchain-etl/bitcoin-etl
https://cloud.google.com/blog/products/gcp/bitcoin-in-bigquery-blockchain-analytics-on-public-data
```

The bitcoin-etl README documents schemas that map directly onto `crypto_bitcoin` tables. The Google Cloud blog post predates the `crypto_bitcoin` dataset name (it references the older `bitcoin_blockchain` dataset) but provides authoritative blockchain-analytics context.

### Enrich command

<RequestExample>

```bash title="Bitcoin enrichment"
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.crypto_bitcoin \
    --web-seed-file samples/crypto_bitcoin/seeds.txt \
    --out ./bundles/crypto_bitcoin
```

</RequestExample>

Single-concept smoke run:

```bash title="transactions table only"
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.crypto_bitcoin \
    --web-seed-file samples/crypto_bitcoin/seeds.txt \
    --concept tables/transactions \
    --web-max-pages 10 \
    --out ./bundles/crypto_bitcoin
```

### Expected bundle output

:::files
bundles/crypto_bitcoin/
├── index.md
├── datasets/
│   └── crypto_bitcoin.md
├── tables/
│   ├── blocks.md
│   ├── transactions.md
│   ├── inputs.md
│   └── outputs.md
└── viz.html
:::

The `transactions` concept doc links to sibling table docs and describes nested `inputs` and `outputs` RECORD fields, block references (`block_hash`, `block_number`, `block_timestamp`), and partitioning by `block_timestamp_month`. The dataset doc includes example SQL for block counts and transaction volume over time.

## Enrichment lifecycle

```mermaid
sequenceDiagram
    participant CLI as enrichment_agent enrich
    participant BQ as BigQuerySource
    participant BQAgent as BQ pass agent
    participant WebAgent as Web pass agent
    participant Bundle as OKF bundle directory

    CLI->>BQ: --dataset project.dataset
    CLI->>Bundle: --out ./bundles/name
    BQAgent->>BQ: List concepts (dataset + tables)
    BQAgent->>Bundle: Write one .md per concept
    Note over Bundle: Auto-generate index.md per directory
    CLI->>WebAgent: --web-seed-file seeds.txt
    WebAgent->>WebAgent: fetch_url (cap: --web-max-pages)
    WebAgent->>Bundle: Enrich existing concepts or mint references/
```

Pass behavior:

1. **BQ pass** — Writes one OKF doc per concept the source advertises, using BigQuery metadata alone.
2. **Web pass** — Receives seed URLs, fetches pages via `fetch_url`, and for each page chooses to enrich one or more existing concepts, mint a standalone `references/<slug>.md` doc, or skip. A hard `--web-max-pages` cap (default 100) and same-domain allowed-hosts filter are enforced inside the tool.
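
The interaction of the page cap, the depth cap, and the allowed-hosts filter can be sketched as a bounded traversal. This sketch substitutes a plain breadth-first crawl for the agent's relevance judgment, so it shows only the limits, not the real link selection. `bounded_crawl` and the `fetch_links(url)` callable are hypothetical; the defaults mirror `--web-max-pages` and `--web-max-depth`.

```python
from collections import deque
from urllib.parse import urlparse


def bounded_crawl(seeds, fetch_links, max_pages=100, max_depth=2, extra_hosts=()):
    """Crawl breadth-first, limited by page count, hop depth, and hostname."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    allowed = {urlparse(url).hostname for url in unique_seeds} | set(extra_hosts)
    queue = deque((url, 0) for url in unique_seeds)  # seeds are depth 0
    seen = set(unique_seeds)
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # page is fetched but its links are not followed
        for link in fetch_links(url):
            if link not in seen and urlparse(link).hostname in allowed:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched
```

Passing a hostname through `extra_hosts` plays the role of `--web-allowed-host`: links to that host are followed even though no seed lives there.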

## Common CLI flags

<ParamField body="--source" type="string" required>
Source adapter. Currently only `bq` (BigQuery) is supported.
</ParamField>

<ParamField body="--dataset" type="string" required>
BigQuery identifier in `project.dataset` form.
</ParamField>

<ParamField body="--out" type="path" required>
Bundle root directory to write or update.
</ParamField>

<ParamField body="--web-seed-file" type="path">
Path to a file with one seed URL per line (`#` comments allowed). Repeatable.
</ParamField>

<ParamField body="--web-seed" type="string">
Inline seed URL. Repeatable. Alternative to `--web-seed-file`.
</ParamField>

<ParamField body="--web-allowed-host" type="string">
Extra hostname the web agent may fetch beyond seed hostnames. Repeatable.
</ParamField>

<ParamField body="--web-max-depth" type="integer" default="2">
Hard cap on hop distance from any seed URL. Seeds are depth 0.
</ParamField>

<ParamField body="--model" type="string" default="gemini-flash-latest">
Gemini model id for both BQ and web agents.
</ParamField>

<ParamField body="--billing-project" type="string">
Google Cloud project to bill for BigQuery queries. Defaults to ADC default project.
</ParamField>

## Verify output

<Steps>
<Step title="Confirm stderr summary">

After `enrich` completes, stderr reports the concept count and whether the web pass ran:

```text
Enriched 22 concept(s) into bundles/stackoverflow; web pass used 3 seed(s)
```

</Step>

<Step title="Inspect bundle structure">

Open the bundle root `index.md` and follow subdirectory links. Each concept file has YAML frontmatter (`type`, `resource`, `title`, `description`, `tags`, `timestamp`) and a markdown body with schema, query patterns, and cross-links.

</Step>

<Step title="Generate a graph viewer">

```bash
.venv/bin/python -m enrichment_agent visualize --bundle ./bundles/<name>
```

This writes `<bundle>/viz.html` — a self-contained force-directed graph of concepts, cross-links, and rendered markdown. Checked-in examples: `bundles/ga4/viz.html`, `bundles/stackoverflow/viz.html`, `bundles/crypto_bitcoin/viz.html`.

</Step>
</Steps>

<Check>
A successful recipe run produces a versionable directory tree you can commit, diff in pull requests, or hand to any OKF consumer (static file server, LLM context loader, or the bundled graph viewer).
</Check>
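
The frontmatter inspection can also be scripted. The sketch below uses only the standard library and a deliberately simple frontmatter reader (top-level keys only, no YAML parsing). The key list is copied from the inspection step above; skipping `index.md` files is an assumption, and reference docs may legitimately omit some keys. `frontmatter_keys` and `missing_by_file` are hypothetical helpers.

```python
from pathlib import Path

EXPECTED_KEYS = {"type", "resource", "title", "description", "tags", "timestamp"}


def frontmatter_keys(text: str) -> set:
    """Return top-level keys from a leading '---' delimited frontmatter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Top-level keys start at column 0; list items and comments are skipped.
        if line and not line[0].isspace() and not line.startswith(("-", "#")) and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter: treat as missing frontmatter


def missing_by_file(bundle_root: str) -> dict:
    """Map each concept file to the expected keys it lacks."""
    report = {}
    for path in sorted(Path(bundle_root).rglob("*.md")):
        if path.name == "index.md":
            continue  # assumed: index files are navigation, not concepts
        missing = EXPECTED_KEYS - frontmatter_keys(path.read_text(encoding="utf-8"))
        if missing:
            report[str(path)] = sorted(missing)
    return report
```

An empty result from `missing_by_file("./bundles/<name>")` means every concept file carries all six keys.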

## Related pages

<CardGroup>
<Card title="Produce OKF bundles" href="/produce-okf-bundles">
Two-pass BQ-then-web enrichment workflow, concept scoping, and bundle directory conventions.
</Card>
<Card title="OKF enrichment CLI reference" href="/okf-enrichment-cli-reference">
Full `enrich` and `visualize` subcommand flags, environment variables, and defaults.
</Card>
<Card title="Open Knowledge Format" href="/open-knowledge-format">
OKF v0.1 bundle structure, frontmatter fields, `index.md` progressive disclosure, and cross-link semantics.
</Card>
<Card title="Visualize OKF bundles" href="/visualize-okf-bundles">
Generate and share self-contained `viz.html` graph viewers from any bundle.
</Card>
</CardGroup>

---

## 20. Toolbox enrichment demo

> End-to-end TypeScript demo: kcmd init and pull, kcagent enrich with md-fileset MCP tools, fileset skills, prompt configuration, and BigQuery demo dataset setup.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/20-toolbox-enrichment-demo.md
- Generated: 2026-06-15T02:57:30.687Z

### Source Files

- `toolbox/enrichment/README.md`
- `toolbox/enrichment/src/agent/enrich/command.ts`
- `toolbox/enrichment/src/agent/enrich/agent.ts`
- `toolbox/enrichment/src/tools/md/server.ts`
- `samples/enrichment/src/tools/fileskb/README.md`
- `toolbox/enrichment/package.json`

---
title: "Toolbox enrichment demo"
description: "End-to-end TypeScript demo: kcmd init and pull, kcagent enrich with md-fileset MCP tools, fileset skills, prompt configuration, and BigQuery demo dataset setup."
---

The `toolbox/enrichment` package ships two compiled CLIs, `kcagent` and `md-fileset`, that run an ADK-based enrichment loop over a local kcmd workspace. `kcagent enrich` loads a catalog snapshot, iterates entries, grounds each asset on MCP tools and fileset skills, and writes enriched `dataplex-types.global.overview` markdown back into the workspace via an `update_documentation` function tool.

## What this demo covers

| Stage | Tool / artifact | Outcome |
| --- | --- | --- |
| BigQuery setup | `bq query` | `demo_ecommerce.events` table from GA4 public data |
| Catalog snapshot | `kcmd init` or `catalog.yaml` + `kcmd pull` | Local `catalog/` entry YAML under a kcmd workspace |
| Information sources | `md-fileset` MCP + `fileset-source` skill | Agent searches and reads markdown in `fileset/` |
| Enrichment | `kcagent enrich` | Per-entry overview documentation updated in the workspace |
| Publication (optional) | `kcmd push` | Publishes local overview edits to Knowledge Catalog |

<Note>
The Python sample under `samples/enrichment/` follows a similar pattern with `fileskb`; this page documents the TypeScript toolbox path that bundles `md-fileset` as a native MCP server.
</Note>

## Prerequisites

<Steps>
<Step title="Install dependencies">

```bash
git clone https://github.com/GoogleCloudPlatform/knowledge-catalog
cd knowledge-catalog/toolbox/enrichment
npm install
npm run build
```

The build produces `dist/kcagent`, `dist/md-fileset`, and the `kcmd` binary at `../mdcode/dist/kcmd`.

</Step>

<Step title="Authenticate and set project">

```bash
export DEMO_CLOUD_PROJECT="<your-gcp-project-id>"

gcloud auth application-default login
gcloud config set project $DEMO_CLOUD_PROJECT
gcloud config set compute/region us
```

`kcagent` resolves Vertex AI project and location from `kcmd.gcp.ApiContext.default()`, which reads gcloud ADC configuration.

</Step>
</Steps>

## Architecture

```mermaid
sequenceDiagram
  participant User
  participant kcmd
  participant kcagent
  participant ADK as ADK Runner
  participant MCP as md-fileset MCP
  participant Fileset as fileset/ markdown
  participant Catalog as catalog/ workspace

  User->>kcmd: init + pull (or manual catalog.yaml + pull)
  kcmd->>Catalog: write entry YAML + aspects
  User->>kcagent: enrich --catalog-path --tools-path --prompt-path
  kcagent->>Catalog: listEntries / lookupEntry
  loop Per catalog entry
    kcagent->>ADK: runEphemeral(asset prompt)
    ADK->>MCP: list / search / read fileset tools
    MCP->>Fileset: listContents / searchContents / readFile
    Fileset-->>MCP: markdown snippets
    MCP-->>ADK: tool results
    ADK->>Catalog: update_documentation → overview aspect
  end
```

For each entry, `enrichCommand` constructs a per-entry prompt from the entry name, schema aspect, existing overview, and the shared `prompt.md` file, then runs an ephemeral ADK session. The agent's built-in instruction (in `agent.ts`) defines documentation structure, citation requirements, and when to skip updates.
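
The prompt assembly described above can be pictured with a short Python sketch. It is an illustration only: the field labels, their order, and the `build_entry_prompt` name are hypothetical, and the real message format lives in the toolbox's TypeScript `enrichCommand`.

```python
def build_entry_prompt(entry_name: str, schema: str, overview: str, shared_prompt: str) -> str:
    """Join asset metadata, existing documentation, and the shared prompt into one message."""
    # Assumption: an entry without an overview is marked explicitly rather than left blank.
    existing = overview.strip() or "(none)"
    return "\n\n".join([
        f"Asset: {entry_name}",
        f"Schema:\n{schema.strip()}",
        f"Existing documentation:\n{existing}",
        shared_prompt.strip(),  # the shared prompt.md content goes last
    ])


message = build_entry_prompt(
    entry_name="demo_ecommerce.events",
    schema="event_date: STRING\nevent_name: STRING",
    overview="",
    shared_prompt="Enrich the documentation of the assets.\n",
)
print(message)
```

Keeping the shared prompt last mirrors the statement that the prompt file is appended to each per-entry message.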

## Demo workspace layout

:::files
demo/
├── catalog.yaml              # kcmd manifest (scope + snapshot aspects)
├── catalog/                  # pulled entry YAML (created by kcmd pull)
├── prompt.md                 # per-run enrichment instructions
├── fileset/                  # markdown knowledge base for md-fileset
└── tools/
    ├── mcp.json              # MCP server definitions
    └── skills/
        └── fileset-source/
            └── SKILL.md      # agent skill describing md-fileset usage
:::

## Set up the BigQuery demo dataset

Create a partitioned `events` table in your project from the GA4 obfuscated ecommerce public dataset:

<RequestExample>

```bash title="Create demo_ecommerce schema and table"
bq query --use_legacy_sql=false <<EOF
CREATE SCHEMA IF NOT EXISTS \`${DEMO_CLOUD_PROJECT}.demo_ecommerce\`
OPTIONS (
  location = 'US',
  labels = [('usage', 'demo')]
);

CREATE TABLE IF NOT EXISTS \`${DEMO_CLOUD_PROJECT}.demo_ecommerce.events\`
PARTITION BY event_date_dt
AS
SELECT
  *,
  PARSE_DATE('%Y%m%d', event_date) AS event_date_dt
FROM
  \`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*\`;
EOF
```

</RequestExample>

The same dataset shape is scripted in `toolbox/mdcode/demo/bq/setup.ts` for the mdcode demo harness.

## Initialize and pull the catalog snapshot

<Tabs>
<Tab title="kcmd init (recommended)">

```bash
mkdir -p demo && cd demo

../../mdcode/dist/kcmd init \
  --bigquery-dataset ${DEMO_CLOUD_PROJECT}.demo_ecommerce

../../mdcode/dist/kcmd pull
```

`kcmd init` scaffolds `catalog.yaml` with the correct `bq-dataset` scope.

</Tab>
<Tab title="Manual catalog.yaml">

```bash
mkdir -p demo && cd demo

cat <<EOF > catalog.yaml
scope: bq-dataset.${DEMO_CLOUD_PROJECT}.demo_ecommerce

snapshot:
  entries:
    - dataplex-types.global.bigquery-dataset
    - dataplex-types.global.bigquery-table
  aspects:
    - dataplex-types.global.overview
EOF

../../mdcode/dist/kcmd pull
```

</Tab>
</Tabs>

<ResponseExample>

After `kcmd pull`, the workspace contains `catalog/` YAML files for the dataset and its tables, including schema and any existing overview aspects.

</ResponseExample>

## Configure prompt, MCP tools, and skills

### Prompt file

```bash title="prompt.md"
cat <<EOF > prompt.md
Enrich the documentation of the assets using the internal organizational information.
Use the following sources:

* Fileset source
EOF
```

The prompt is appended to each per-entry message alongside asset metadata and existing documentation.

### MCP server (`tools/mcp.json`)

```json title="tools/mcp.json"
{
  "mcpServers": {
    "md-fileset": {
      "command": "../dist/md-fileset",
      "args": ["--dir", "fileset"]
    }
  }
}
```

`loadMcpTools` reads `tools/mcp.json`, expands `$VAR` / `${VAR}` environment references in `command`, `args`, and `env`, and registers each server as an ADK `MCPToolset`. Stdio servers use `command` + `args`; HTTP servers use `httpUrl`.
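
The environment expansion step can be sketched in Python as follows. This is not the toolbox's TypeScript `loadMcpTools` code: `expand_env` and `expand_server` are hypothetical helper names, and leaving an unset variable untouched is an assumption, since the behavior for missing variables is not described here.

```python
import re

# Matches $VAR or ${VAR}; the name lands in group 1 (braced) or group 2 (bare).
_ENV_REF = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def expand_env(value: str, env: dict[str, str]) -> str:
    """Replace $VAR and ${VAR} with values from env (assumption: unset names stay as-is)."""
    def substitute(match: re.Match) -> str:
        name = match.group(1) or match.group(2)
        return env.get(name, match.group(0))
    return _ENV_REF.sub(substitute, value)


def expand_server(server: dict, env: dict[str, str]) -> dict:
    """Expand references in the command, args, and env fields of one server entry."""
    expanded = dict(server)
    if "command" in server:
        expanded["command"] = expand_env(server["command"], env)
    if "args" in server:
        expanded["args"] = [expand_env(arg, env) for arg in server["args"]]
    if "env" in server:
        expanded["env"] = {key: expand_env(val, env) for key, val in server["env"].items()}
    return expanded


server = {"command": "$TOOLBOX/dist/md-fileset", "args": ["--dir", "${DOCS_DIR}"]}
result = expand_server(server, {"TOOLBOX": "..", "DOCS_DIR": "fileset"})
print(result)
```

With this kind of expansion, a single `mcp.json` can be shared across machines while paths and tokens come from the environment.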

### Fileset skill (`tools/skills/fileset-source/SKILL.md`)

```markdown title="tools/skills/fileset-source/SKILL.md"
---
name: fileset-source
description: >
  Use the fileset source to find relevant markdown documents and extract information
  about assets.
---

The `md-fileset` mcp server provides the following tools to extract relevant
information from a directory hierarchy of markdown files:

* **list_fileset_contents** - browse and navigate the directory tree
* **read_fileset_file** - read the full contents of a file
* **search_fileset_contents** - regex search with file, line, and snippet results
```

`loadSkills` loads every `SKILL.md` under `tools/skills/` via `adk.loadAllSkillsInDir` and exposes them as a `SkillToolset` with `UnsafeLocalCodeExecutor`.

### Populate the fileset directory

Copy markdown from `samples/enrichment/sample/docs/` into `demo/fileset/`. These files describe GA4 ecommerce context, query patterns, and usage notes that ground enrichment beyond raw BigQuery schema.

<Info>
Skills and MCP servers are provider-neutral extension points: swap `md-fileset` for any MCP-compatible information source (for example the Python `fileskb` server in `samples/enrichment`) without changing the `kcagent enrich` command shape.
</Info>

## Run enrichment

<Steps>
<Step title="Start the agent">

```bash
../dist/kcagent enrich \
  --catalog-path . \
  --tools-path tools \
  --prompt-path prompt.md
```

</Step>

<Step title="Observe runtime output">

The CLI prints structured ADK events:

- `[Thought]` — model reasoning (when `includeThoughts` is enabled)
- `[Tool Invoke]` / `[Tool Result]` — MCP and function tool calls
- `[Agent]` — final agent messages

Each entry is processed sequentially; `Processing: <entry-name>` marks the current asset.

</Step>

<Step title="Verify workspace changes">

Inspect updated overview aspects in `catalog/` entry YAML files. Entries where the agent found no additional information are left unchanged per the agent instruction.

</Step>
</Steps>

## `kcagent enrich` CLI reference

<ParamField body="--catalog-path" type="string" required>
Path to the kcmd workspace root containing `catalog.yaml` and the `catalog/` directory.
</ParamField>

<ParamField body="--tools-path" type="string" required>
Directory containing `mcp.json` and optional `skills/` subdirectory.
</ParamField>

<ParamField body="--prompt-path" type="string" required>
Path to a markdown or plain-text prompt file appended to each per-entry enrichment message.
</ParamField>

## `md-fileset` MCP tools

| Tool | Parameters | Behavior |
| --- | --- | --- |
| `list_fileset_contents` | `path` (optional, default `''`) | Lists files and subdirectories under a relative path |
| `read_fileset_file` | `path` (required) | Returns full file contents; rejects paths outside the fileset root |
| `search_fileset_contents` | `query` (required), `path` (optional) | Case-insensitive regex search across `.md` files; returns file, line number, and matching line |

The `md-fileset` binary requires `--dir <root>` pointing at the markdown root directory. Run it standalone via stdio transport for MCP inspector debugging:

```bash
npm run run:mdtool
```
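
The behavior listed in the tool table (root containment for reads, case-insensitive regex search over `.md` files) can be approximated in Python. This is a sketch under stated assumptions, not the `md-fileset` implementation: the function names and the shape of each hit are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def resolve_in_root(root: Path, relative: str) -> Path:
    """Resolve a relative path and refuse anything that escapes the root."""
    root = root.resolve()
    target = (root / relative).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes fileset root: {relative}")
    return target


def search_markdown(root: Path, query: str, relative: str = "") -> list[dict]:
    """Case-insensitive regex search over .md files under root/relative."""
    pattern = re.compile(query, re.IGNORECASE)
    base = resolve_in_root(root, relative)
    hits = []
    for path in sorted(base.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if pattern.search(line):
                name = path.relative_to(root.resolve()).as_posix()
                hits.append({"file": name, "line": number, "text": line})
    return hits


with tempfile.TemporaryDirectory() as tmp:
    fileset = Path(tmp)
    (fileset / "events.md").write_text("# Events\nPurchase events carry revenue.\n", encoding="utf-8")
    matches = search_markdown(fileset, r"purchase")
    try:
        resolve_in_root(fileset, "../outside.md")
        escaped = True
    except ValueError:
        escaped = False
```

Resolving both the root and the target before comparing them is what blocks `..` traversal and symlink tricks in this sketch.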

## Agent model and documentation contract

The enrichment agent (`kcagent-enrich`) uses `gemini-2.5-flash` on Vertex AI with thinking enabled. Generated documentation must:

- Follow a markdown template with summary paragraphs, **Data Details**, **Usage Details**, and **Citations** sections
- Synthesize sources rather than copy them verbatim
- Ground statements in retrieved facts; skip `update_documentation` when no new information is found
- Call `update_documentation` with markdown content, which writes `dataplex-types.global.overview` with `contentType: MARKDOWN` via `catalog.updateEntry`
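
A quick structural check of generated overviews against the template sections listed above could look like the following Python sketch. It is a hypothetical helper, not part of `kcagent`, and it assumes the section titles appear either as markdown headings or as standalone bold labels.

```python
import re

REQUIRED_SECTIONS = ("Data Details", "Usage Details", "Citations")


def missing_sections(markdown: str) -> list[str]:
    """Return required section titles that appear neither as a heading nor as a bold label."""
    missing = []
    for title in REQUIRED_SECTIONS:
        escaped = re.escape(title)
        # Accept "## Title" style headings or a standalone "**Title**" line.
        pattern = rf"^(?:#+\s*{escaped}|\*\*{escaped}\*\*)\s*$"
        if not re.search(pattern, markdown, flags=re.MULTILINE):
            missing.append(title)
    return missing


draft = "Summary paragraph.\n\n## Data Details\n...\n\n## Usage Details\n...\n"
gaps = missing_sections(draft)
print(gaps)
```

A check like this is useful before `kcmd push`, because it catches drafts that lost the **Citations** section.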

## Publish enriched metadata (optional)

After enrichment, push local overview edits back to Knowledge Catalog:

```bash
../../mdcode/dist/kcmd status
../../mdcode/dist/kcmd push
```

Use `--dry-run` on `pull` or `push` to preview sync behavior without writing remote state.

## Clean up demo resources

```bash
bq rm -r -f -d ${DEMO_CLOUD_PROJECT}:demo_ecommerce
```

<Warning>
The cleanup command in `toolbox/enrichment/README.md` references `demo-dataset`; the demo creates `demo_ecommerce`. Use the dataset name above.
</Warning>

## Troubleshooting

| Symptom | Likely cause | Check |
| --- | --- | --- |
| `catalog-path is not a directory` | Wrong `--catalog-path` | Run from the `demo/` workspace root |
| Empty MCP tool list | Missing or invalid `tools/mcp.json` | Confirm file exists; warnings print on parse failure |
| No fileset matches | Empty `fileset/` or wrong `--dir` | Verify sample docs were copied |
| Vertex AI auth errors | ADC or quota project misconfigured | Re-run `gcloud auth application-default login` |
| MCP session leaks on exit | ADK MCP lifecycle | `patchadk.ts` registers session cleanup hooks |

## Developer commands

| Command | Purpose |
| --- | --- |
| `npm run build` | Compile `kcagent`, `md-fileset`, and TypeScript library |
| `npm run compile` | Type-check without emit |
| `npm run test` | Run package tests |
| `npm run debug` | Launch ADK web UI against `agent.ts` |

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, Node.js setup, credential configuration for BigQuery and Vertex AI.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
`kcmd init`, `pull`, `status`, and `push` for BigQuery dataset workspaces.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How enrichment agents read metadata, ground on external sources, and hand off to kcmd push.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspace edits and reconcile aspects after enrichment.
</Card>
<Card title="kcmd MCP reference" href="/kcmd-mcp-reference">
Expose kcmd pull/push and entry operations as MCP tools in agentic workflows.
</Card>
</CardGroup>

---

## 21. Troubleshooting

> Common auth, billing, push conflict, web crawl cap, glossary provisioning, and model credential failures with verification signals from tests and README constraints.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/21-troubleshooting.md
- Generated: 2026-06-15T02:58:06.052Z

### Source Files

- `okf/README.md`
- `agents/enrichment/README.md`
- `agents/mdcode/README.md`
- `okf/tests/test_web_fetcher.py`
- `okf/tests/test_bigquery_source.py`
- `samples/discovery/README.md`
- `agents/enrichment/eval/__main__.py`

---
title: Troubleshooting
description: Common auth, billing, push conflict, web crawl cap, glossary provisioning, and model credential failures with verification signals from tests and README constraints.
---

When a Knowledge Catalog workflow fails, the error usually falls into one of six buckets: **credentials**, **billing project**, **`kcmd push`**, **OKF web crawl limits**, **glossary provisioning**, or **model access**. This page maps symptoms to root causes, fix steps, and how to confirm recovery using tests, eval metrics, and CLI checks.

## Symptom quick reference

| Symptom | Likely cause | First check |
| --- | --- | --- |
| `Unable to retrieve project, location, or token` | `gcloud` ADC or config incomplete | `gcloud auth application-default print-access-token` |
| BigQuery query fails on public datasets | Billing project unset or wrong | `gcloud config get-value project` or `--billing-project` |
| Drive folder appears empty during enrichment | ADC missing `drive.readonly` scope | Re-login with extended scopes |
| `Glossary '…' does not exist` on `kcmd push` | Glossary tree not provisioned in Dataplex | Create glossary/terms first, then `kcmd pull` |
| `fetch_url` returns `"max_pages reached"` | OKF web crawl budget exhausted | Narrow the crawl scope or raise `--web-max-pages` |
| Judge metrics show `n/a` in eval | Vertex auth not configured | `GOOGLE_CLOUD_PROJECT` + ADC |
| Enrichment agent exits without `trajectory.json` | Missing deps or unbuilt `kcmd` | Build `agents/mdcode` and re-run |

---

## Authentication (`gcloud` ADC)

All three surfaces — **`kcmd`**, the **catalog enrichment agent**, and the **discovery agent** — rely on Google Cloud authentication. None of them embed long-lived API keys for Dataplex or BigQuery.

### `kcmd` and MCP

`kcmd` obtains tokens by shelling out to `gcloud auth application-default print-access-token` and reads the active project/region from `gcloud config`. If any of these are missing, initialization fails immediately:

```
Unable to retrieve project, location, or token. Ensure gcloud is configured.
```

<Steps>
<Step title="Configure ADC and gcloud defaults">

```bash
gcloud auth application-default login
gcloud config set project <your-project-id>
gcloud config set compute/region <your-region>   # e.g. us-central1
```

Verify:

```bash
gcloud -q auth application-default print-access-token | head -c 20
gcloud -q config get-value project
```

</Step>
<Step title="Confirm kcmd can reach the catalog">

From a workspace directory:

```bash
kcmd pull --dry-run
```

A non-zero exit with `Error pulling catalog entries:` usually indicates insufficient Dataplex IAM permissions or an invalid `catalog.yaml` scope.

</Step>
</Steps>

### Catalog enrichment agent (Drive + Vertex)

The enrichment agent needs ADC with **both** `cloud-platform` and **`drive.readonly`** scopes. Without `drive.readonly`, Drive list calls return 403 — and the agent logs a warning that can look like an empty folder:

```
[Folder] ⚠️  Drive list failed for folder '…': 403 …
          If this is 403 insufficientPermissions, your ADC token is
          missing the drive.readonly scope — re-run: gcloud auth
          application-default login --scopes='openid,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive.readonly'
```

<ParamField body="Required ADC scopes" type="string">
`openid`, `https://www.googleapis.com/auth/cloud-platform`, `https://www.googleapis.com/auth/drive.readonly`
</ParamField>

### Discovery agent

The discovery sample requires APIs enabled (`dataplex.googleapis.com`, `aiplatform.googleapis.com`, `serviceusage.googleapis.com`) and IAM roles that include `dataplex.projects.search`, `aiplatform.endpoints.predict`, and `serviceusage.services.use`. Set:

```bash
export GOOGLE_CLOUD_PROJECT=<PROJECT_ID>
export GOOGLE_GENAI_USE_VERTEXAI=True
```

---

## Billing and BigQuery project

Public BigQuery datasets are readable, but **query bytes bill to the caller's project**. Both the OKF enrichment agent and sample recipes assume a billing project is configured.

### OKF enrichment agent

<ParamField body="--billing-project" type="string">
Google Cloud project to bill for BigQuery metadata queries. Defaults to the ADC default project when omitted.
</ParamField>

Prerequisites:

```bash
gcloud auth application-default login
gcloud config set project <your-billing-project>
```

Or pass explicitly:

```bash
python -m enrichment_agent enrich \
  --source bq \
  --dataset <project>.<dataset> \
  --billing-project <billing-project> \
  --out ./bundles/<name>
```

### Catalog enrichment agent — usage signal

Table and context-overlay modes optionally scan `INFORMATION_SCHEMA` for query history (`--include_usage`, default `true`). The `--usage_scope` flag controls fallback behavior:

| Value | Behavior |
| --- | --- |
| `auto` (default) | Try `JOBS_BY_PROJECT`; on permission denied, fall back to `JOBS_BY_USER` |
| `project` | Require project-wide job listing |
| `user` | Only your own queries (narrower, but always permitted) |

When `JOBS_BY_PROJECT` fails, the agent logs:

```
[bq_usage] JOBS_BY_PROJECT failed (…); falling back to JOBS_BY_USER
```

A mistyped region or a billing-disabled project also triggers the fallback, and the fallback may still return **empty usage** rather than a hard error. If the `queries` aspect is empty after enrichment, retry with `--usage_scope=user` or `--include_usage=false`.
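
The three `--usage_scope` values can be summarized with a small Python sketch. It is an assumption-laden illustration: `PermissionDenied`, `fetch_usage`, and the injected query callables are hypothetical, and the sketch falls back only on permission errors, which is narrower than the agent's behavior described above.

```python
from typing import Callable


class PermissionDenied(Exception):
    """Hypothetical stand-in for a BigQuery permission error."""


def fetch_usage(scope: str,
                by_project: Callable[[], list[str]],
                by_user: Callable[[], list[str]]) -> tuple[str, list[str]]:
    """Return (view_used, rows) following the scope rules in the table above."""
    if scope == "user":
        return "JOBS_BY_USER", by_user()
    if scope == "project":
        # No fallback: a permission error propagates to the caller.
        return "JOBS_BY_PROJECT", by_project()
    if scope != "auto":
        raise ValueError(f"unknown usage scope: {scope}")
    try:
        return "JOBS_BY_PROJECT", by_project()
    except PermissionDenied as err:
        print(f"[bq_usage] JOBS_BY_PROJECT failed ({err}); falling back to JOBS_BY_USER")
        return "JOBS_BY_USER", by_user()


def denied() -> list[str]:
    raise PermissionDenied("403")


view, rows = fetch_usage("auto", by_project=denied, by_user=lambda: ["SELECT 1"])
```

Note that `project` never degrades silently, which makes it the better choice when an empty usage signal would be misleading.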

### Push-time queries aspect 403

If `dataplex.entryGroups.useQueriesAspect` is missing, `kcmd push` can fail with **403 on the `queries` aspect** while `overview` still succeeds. Remove `queries` from `publishing.aspects` in `catalog.yaml`, or grant the permission, then push again.

---

## `kcmd push` failures

### Glossary resources are never auto-created

`kcmd push` **does not create** `Glossary`, `GlossaryCategory`, or `GlossaryTerm` control-plane resources. It only **updates metadata** (descriptions, labels) on resources that already exist. Missing glossary nodes produce fail-fast errors such as:

- `Glossary '…' does not exist. kcmd does not create glossary resources…`
- `Glossary term '…' does not exist. kcmd does not create glossary resources…`
- `Parent glossary '…' does not exist in <project>/<location> (required by term …)`

<Steps>
<Step title="Provision the glossary hierarchy out-of-band">

```bash
gcloud dataplex glossaries create <glossaryId> \
  --project=<project> --location=<location>

gcloud dataplex glossary-terms create <termId> \
  --glossary=<glossaryId> --project=<project> --location=<location>
```

</Step>
<Step title="Pull, edit locally, then push metadata only">

```bash
kcmd init --glossary <project>.<location>.<glossaryId>
kcmd pull
# edit descriptions/labels under catalog/glossaries/
kcmd push
```

</Step>
</Steps>

`kcmd init --glossary` also fails at init time if the glossary ID or display name cannot be resolved:

```
Glossary '<id>' not found in <project>.<location> (tried ID and Display Name).
```

EntryLinks that **reference** glossary terms (for example `definition` links from BQ columns) **are** created and reconciled by `kcmd push` — the no-create rule applies only to the glossary tree itself.

### Entry group must exist before doc-mode enrichment

Doc mode requires `--entry_group` to **already exist**. The agent runs read-only `kcmd init`/`pull` and will not create the entry group:

```bash
gcloud dataplex entry-groups create <entryGroupId> \
  --project=<project> --location=<location>
```

### Reference layers are read-only

Files ending in `.ref.yaml` are **skipped during push**. If enrichment wrote only to reference files, `kcmd push` will not publish those edits. Copy changes into the editable `.yaml` / sidecar `.md` siblings instead.
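
To see which local files are candidates for a push, a hypothetical Python helper can walk the workspace and skip reference layers. This is a sketch for inspection only; it assumes editable content lives in `.yaml` and `.md` files and does not reproduce how `kcmd` selects files.

```python
import tempfile
from pathlib import Path


def pushable_files(catalog_dir: Path) -> list[str]:
    """List YAML and sidecar Markdown files, skipping read-only *.ref.yaml layers."""
    keep = []
    for path in sorted(catalog_dir.rglob("*")):
        if not path.is_file() or path.name.endswith(".ref.yaml"):
            continue
        if path.suffix in (".yaml", ".md"):
            keep.append(path.relative_to(catalog_dir).as_posix())
    return keep


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "events.yaml").write_text("name: events\n", encoding="utf-8")
    (root / "events.md").write_text("Overview\n", encoding="utf-8")
    (root / "term.ref.yaml").write_text("name: term\n", encoding="utf-8")
    files = pushable_files(root)
    print(files)
```

If an enrichment run changed only files that this listing omits, the edits need to be copied into editable siblings before pushing.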

### Push conflicts and `--force` (current behavior)

The Metadata-as-Code **design spec** calls for checksum-based conflict detection: push should abort when remote metadata changed, require `kcmd pull` to reconcile, and offer `--force` to override.

**Current implementation status:** `kcmd push` accepts `--force` and `--validate-only`, but conflict detection, `kcmd status`, and `validate()` are **not yet implemented** (`TODO: Handle conflicts` in the sync engine; `status()` and `validate()` throw `Not yet implemented`). Today, push failures are more often **resource errors** (missing glossary, failed EntryLink create, entry-group create denied) than checksum conflicts.

Practical workflow until conflict detection ships:

1. `kcmd pull` to refresh local state before editing.
2. `kcmd push --dry-run` to preview mutations.
3. On ambiguous remote edits, pull again and diff `catalog/` manually.
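
The checksum-based detection called for by the design spec can be illustrated in Python. This sketch shows the general technique only; it is not `kcmd` behavior (which, as noted above, does not yet detect conflicts), and `checksum` and `can_push` are hypothetical names.

```python
import hashlib
import json


def checksum(metadata: dict) -> str:
    """Stable SHA-256 over a metadata dict (keys sorted so ordering does not matter)."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_push(pulled_checksum: str, remote_now: dict, force: bool = False) -> bool:
    """Allow the push when the remote is unchanged since pull, or when force is set."""
    return force or checksum(remote_now) == pulled_checksum


at_pull = {"overview": "Events table."}
saved = checksum(at_pull)  # recorded locally at pull time
edited_remotely = {"overview": "Events table, edited in the console."}

print(can_push(saved, at_pull))
print(can_push(saved, edited_remotely))
print(can_push(saved, edited_remotely, force=True))
```

Until something equivalent ships, pulling again and diffing `catalog/` by hand plays the role of the checksum comparison.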

### EntryLink reconciliation errors

When `publishing.entryLinks` is declared, push compares local vs remote links. A failed create surfaces as:

```
Failed to create EntryLink: <message>
```

Common causes: target entry or glossary term UID in local YAML does not match a live catalog resource, or `publishing.entryLinks` includes types not listed under `snapshot.entryLinks`.
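
Both checks mentioned above (local versus remote comparison, and publishing types missing from the snapshot list) reduce to set differences. The Python sketch below is illustrative: the triple shape and function names are hypothetical, and treating remote-only links as deletions is an assumption about reconciliation that is not confirmed here.

```python
def plan_entry_links(local: set[tuple[str, str, str]],
                     remote: set[tuple[str, str, str]]) -> dict[str, list]:
    """Diff (source, link_type, target) triples into creates and deletes."""
    return {
        "create": sorted(local - remote),
        "delete": sorted(remote - local),  # assumption: remote-only links are removed
    }


def unsnapshotted_types(publishing: list[str], snapshot: list[str]) -> list[str]:
    """Link types declared for publishing but absent from the snapshot list."""
    return sorted(set(publishing) - set(snapshot))


local_links = {("events.user_id", "definition", "terms/user-id")}
remote_links = {("events.event_name", "definition", "terms/event")}
plan = plan_entry_links(local_links, remote_links)
stray = unsnapshotted_types(["definition", "related"], ["definition"])
print(plan, stray)
```

A non-empty result from the second check points at a manifest mismatch rather than at a missing catalog resource.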

---

## OKF web crawl cap and fetch limits

The OKF enrichment agent runs a **web pass** after the BigQuery pass. Crawl policy is enforced inside the `fetch_url` tool — the LLM cannot bypass it.

### Hard limits (CLI defaults)

| Flag | Default | Enforced by |
| --- | --- | --- |
| `--web-max-pages` | `100` | Session page budget |
| `--web-max-depth` | `2` | Hop distance from seeds |
| `--web-allowed-host` | seed hosts only | Hostname allow-list |
| `--web-allowed-path-prefix` | none | Path prefix filter |
| `--web-denied-path-substring` | none | Path blocklist |

Use `--no-web` to skip the web pass entirely during iteration.

### `fetch_url` rejection reasons

When a fetch is rejected, the tool returns `{"error": "<reason>", …}` instead of page content. Do not retry the same URL.

<AccordionGroup>
<Accordion title="max_pages reached">

Budget exhausted. Stop crawling or re-run with a higher `--web-max-pages`. The web-ingestion prompt instructs the agent to stop when this error appears.

</Accordion>
<Accordion title="host not in allowed list">

URL hostname is outside seed hosts plus any `--web-allowed-host` values. Add the host explicitly if it is authoritative documentation.

</Accordion>
<Accordion title="path not in allowed prefixes / denied substring">

Path filters blocked the URL. Adjust `--web-allowed-path-prefix` or `--web-denied-path-substring`.

</Accordion>
<Accordion title="exceeds max_depth">

Link is too many hops from a seed. Raise `--web-max-depth` or choose closer seed URLs.

</Accordion>
<Accordion title="URL not reachable from a seed">

The agent invented a URL not returned as a link from a fetched page. Only follow links discovered in prior fetches.

</Accordion>
<Accordion title="fetch failed / non-HTML content-type">

Network errors are wrapped as `fetch failed: …`. Non-HTML responses (for example JSON or PDF) are rejected with `non-HTML content-type`. Pick a different URL.

</Accordion>
</AccordionGroup>
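
The rejection reasons above can be modeled as a single policy check. The following Python sketch is a simplified illustration, not the OKF `fetch_url` code: `CrawlPolicy` is a hypothetical class, the order of checks is an assumption, and the reason strings are taken from the headings above rather than from the tool's exact output.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    allowed_hosts: set[str]
    max_pages: int = 100
    max_depth: int = 2
    allowed_path_prefixes: tuple[str, ...] = ()
    denied_path_substrings: tuple[str, ...] = ()
    # URL -> hop distance from a seed; seeds are registered at depth 0.
    known_depth: dict[str, int] = field(default_factory=dict)
    fetched: int = 0

    def check(self, url: str) -> Optional[str]:
        """Return a rejection reason, or None when the fetch may proceed."""
        parsed = urlparse(url)
        if self.fetched >= self.max_pages:
            return "max_pages reached"
        if url not in self.known_depth:
            return "URL not reachable from a seed"
        if parsed.hostname not in self.allowed_hosts:
            return "host not in allowed list"
        if self.allowed_path_prefixes and not parsed.path.startswith(self.allowed_path_prefixes):
            return "path not in allowed prefixes"
        if any(part in parsed.path for part in self.denied_path_substrings):
            return "denied substring"
        if self.known_depth[url] > self.max_depth:
            return "exceeds max_depth"
        return None


policy = CrawlPolicy(allowed_hosts={"example.com"}, max_pages=1,
                     known_depth={"https://example.com/docs": 0})
first = policy.check("https://example.com/docs")
invented = policy.check("https://example.com/guess")
policy.fetched += 1  # one page consumed
after_budget = policy.check("https://example.com/docs")
```

Because the check runs inside the tool rather than in the prompt, the model only ever sees the resulting reason string.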

### Page size truncation

HTML pages are converted to markdown and capped at **40 KiB**. Oversized pages include `[...truncated...]` in the body, which is expected for long reference pages.
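
As a rough illustration of the cap, the Python sketch below trims converted markdown and appends the marker. It assumes the limit is measured in UTF-8 bytes and that the marker goes at the end of the body; neither detail is confirmed here, and `cap_markdown` is a hypothetical name.

```python
MAX_BYTES = 40 * 1024  # 40 KiB cap stated above
MARKER = "\n[...truncated...]"


def cap_markdown(markdown: str, limit: int = MAX_BYTES) -> str:
    """Trim markdown to at most limit UTF-8 bytes and append the truncation marker."""
    encoded = markdown.encode("utf-8")
    if len(encoded) <= limit:
        return markdown
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return encoded[:limit].decode("utf-8", errors="ignore") + MARKER


short = cap_markdown("# Title\n")
long_page = cap_markdown("a" * 50_000)
print(len(short), len(long_page))
```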

### Web-pass write guardrails

During the web pass, `write_concept_doc` rejects **schema shrinkage** and **citation shrinkage** relative to the BQ pass. If the agent logs a write `error` mentioning `missing N schema field(s)` or `Citations section had N entries`, the web pass attempted to remove BQ-grounded content — refine seeds or re-run with `--concept` for a single table.
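
The two guardrails can be expressed as a comparison between the BQ-pass document and the web-pass draft. This Python sketch is hypothetical: `check_web_pass_write` is not the `write_concept_doc` implementation, and the message wording beyond the fragments quoted above is invented for illustration.

```python
from typing import Optional


def check_web_pass_write(bq_fields: list[str], new_fields: list[str],
                         bq_citations: int, new_citations: int) -> Optional[str]:
    """Return an error string when the web-pass draft drops BQ-pass content, else None."""
    missing = [name for name in bq_fields if name not in new_fields]
    if missing:
        return f"missing {len(missing)} schema field(s): {', '.join(missing)}"
    if new_citations < bq_citations:
        return f"Citations section had {bq_citations} entries, draft has {new_citations}"
    return None


shrunk_schema = check_web_pass_write(["event_date", "user_id"], ["event_date"], 2, 3)
shrunk_citations = check_web_pass_write(["event_date"], ["event_date"], 2, 1)
accepted = check_web_pass_write(["event_date"], ["event_date", "geo"], 2, 4)
print(shrunk_schema, shrunk_citations, accepted)
```

The web pass may add fields and citations freely under this rule; it only fails when BQ-grounded content would disappear.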

---

## Glossary column linking (enrichment agent)

Table mode with `--glossaries` maps columns to Dataplex terms and injects `links.definition` into table YAML. This path does not raise an error when prerequisites are missing. It logs a warning and skips linking:

| Log message | Fix |
| --- | --- |
| `[Linking] ⚠️  No glossary entries found in workspace — skipping.` | Run `kcmd pull` with glossary reference after provisioning terms |
| `[Linking] ⚠️  No glossary terms found — skipping.` | The glossary workspace has categories but no `.ref.yaml` terms; pull the terms as reference files |
| `[Linking] ⚠️  Schema aspect not found for <table>` | Re-run `kcmd pull` so schema aspects are present |

Required setup:

1. Initialize and pull the dataset with glossary-link manifest (`entryLinks` in snapshot/publishing/reference).
2. Pull glossary terms as reference into `catalog/glossaries/.../*.ref.yaml`.
3. After enrichment, run `kcmd push` to publish the new column links.

---

## Model credential failures

Knowledge Catalog tooling is **BYOC/BYOK**: you choose the model backend and supply credentials. There is no single global API key.

### OKF enrichment agent (Gemini via ADK)

Two supported paths:

<Tabs>
<Tab title="AI Studio">

```bash
export GEMINI_API_KEY=<your-key>
```

</Tab>
<Tab title="Vertex AI">

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<id>
export GOOGLE_CLOUD_LOCATION=<region>
```

</Tab>
</Tabs>

Select the model with `--model`. Missing or invalid credentials surface as ADK/Gemini API errors at enrich time, not during `pytest` (unit tests mock network calls).

### Catalog enrichment agent (Vertex only)

`agent_runner.py` **always** sets Vertex mode from flags:

```
GOOGLE_GENAI_USE_VERTEXAI=True
GOOGLE_CLOUD_PROJECT=<--project>
GOOGLE_CLOUD_LOCATION=<--location>   # default global
```

`--project` and `--model` are **required**. Omitting them raises `UsageError` before any LLM call.
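
The flag-to-environment mapping above can be sketched in Python. This is an illustration rather than the `agent_runner.py` source: `configure_vertex` is a hypothetical function, the local `UsageError` class is a stand-in for the one raised by the agent's CLI library, and the model name is only an example value.

```python
import os
from typing import MutableMapping, Optional


class UsageError(Exception):
    """Stand-in for the UsageError named above."""


def configure_vertex(project: Optional[str], model: Optional[str],
                     location: str = "global",
                     env: MutableMapping[str, str] = os.environ) -> None:
    """Validate required flags, then export the three Vertex variables."""
    if not project or not model:
        raise UsageError("--project and --model are required")
    env["GOOGLE_GENAI_USE_VERTEXAI"] = "True"
    env["GOOGLE_CLOUD_PROJECT"] = project
    env["GOOGLE_CLOUD_LOCATION"] = location


settings: dict[str, str] = {}
configure_vertex("my-project", "gemini-2.5-flash", env=settings)
print(settings)
```

Because the variables are always overwritten from flags in this model, a `GEMINI_API_KEY` in the shell has no effect on this agent.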

### Eval judge metrics

Dynamic and golden evaluators run deterministic metrics without auth. Judge-based metrics (`hallucination_free`, `redundancy_index`, `disambiguation_efficacy`, `absence_of_contradictions`) require:

```bash
export GOOGLE_CLOUD_PROJECT=<project>
gcloud auth application-default login
```

Without auth, the scorecard shows judge metrics as **`n/a`** with a warning. The eval CLI degrades gracefully rather than crashing. In `--run` mode, missing `trajectory.json` after an agent exit means the run did not complete — check that `kcmd` is built (`cd agents/mdcode && npm run build`) and Python deps are installed.

### GitHub MCP (optional code source)

`--repo` requires `GITHUB_PERSONAL_ACCESS_TOKEN` in the MCP server environment (default remote server or local `mcp.json`). Missing tokens cause GitHub tool calls to fail during enrichment, not at startup.

---

## Verify fixes with tests and eval

### OKF package tests

From `okf/`:

```bash
.venv/bin/pytest
```

| Test module | What it validates |
| --- | --- |
| `test_web_tools.py` | Crawl depth, path filters, unregistered URL rejection |
| `test_web_fetcher.py` | HTML-only fetch, truncation, network error wrapping |
| `test_bigquery_source.py` | BQ concept listing, view vs table sampling |
| `test_bundle_tools.py` | Web-pass schema/citation preservation |

### Catalog enrichment eval

From `agents/enrichment/`:

```bash
python -m eval --output-dir <enrichment-output-dir>
```

- **`structural_validity`** — mdcode parses; entry types match mode; overviews are clean Markdown.
- **`hallucination_free`** — requires judge auth; scores claim grounding against `trajectory.json`.
- **`entry_grounding`** (golden table mode) — no invented tables.

For end-to-end case runs:

```bash
python -m eval --run --goldens eval/goldens/thelook_ecommerce.json \
  --project <project> --runs 3
```

Prereqs: ADC, built `kcmd`, agent deps. Skipped runs log `no trajectory.json — the agent run did not complete`.

### `kcmd` scenario tests

From `agents/mdcode/`:

```bash
npm run build
npm run test
```

Scenario tests under `tests/scenarios/` cover `push_bq`, `push_kb`, `push_new_entry`, and related push/pull flows against mocked catalog APIs.

---

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node setup, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud ADC.
</Card>
<Card title="Sync catalog metadata" href="/sync-catalog-metadata">
Initialize workspaces, pull snapshots, check status, and push local edits back to Knowledge Catalog.
</Card>
<Card title="kcmd CLI reference" href="/kcmd-cli-reference">
`kcmd` commands, init flags per source type, pull/push options, and authentication via gcloud ADC.
</Card>
<Card title="OKF enrichment CLI reference" href="/okf-enrichment-cli-reference">
`enrichment-agent enrich` flags, web crawl constraints, concept scoping, and environment variables.
</Card>
<Card title="Evaluate enrichment output" href="/evaluate-enrichment-output">
Score enrichment runs with structural checks, judge metrics, and golden-based eval.
</Card>
<Card title="Publish enriched metadata" href="/publish-enriched-metadata">
Push mdcode workspaces with `kcmd` and reconcile entry links without modifying read-only reference layers.
</Card>
</CardGroup>

---

## 22. Contributing

> CLA requirements, fork-and-PR workflow, style expectations, and test commands for Python pytest and TypeScript npm run test across package directories.

- Page Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/pages/22-contributing.md
- Generated: 2026-06-15T02:58:52.917Z

### Source Files

- `CONTRIBUTING.md`
- `CODE_OF_CONDUCT.md`
- `LICENSE.md`
- `okf/pyproject.toml`
- `agents/mdcode/package.json`
- `okf/README.md`

---
title: "Contributing"
description: "CLA requirements, fork-and-PR workflow, style expectations, and test commands for Python pytest and TypeScript npm run test across package directories."
---

Contributions to Google Cloud Knowledge Catalog land through GitHub pull requests against `GoogleCloudPlatform/knowledge-catalog`. The repository ships under the Apache 2.0 license; every submission must be covered by a Google Contributor License Agreement (CLA), match the style of the package you touch, and pass that package's automated tests before review.

## License and community standards

All solutions in this repository are distributed under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license. See `LICENSE.md` at the repository root for full terms.

Participation is governed by `CODE_OF_CONDUCT.md`, adapted from the Contributor Covenant. Reports go to `@googleapis/senseai-eco` or, if needed, `opensource@google.com`.

## Contributor License Agreement

Contributions must be accompanied by a Contributor License Agreement. You (or your employer) retain copyright; the CLA grants Google permission to use and redistribute your work as part of the project.

<Steps>
<Step title="Check or sign the CLA">

Visit [https://cla.developers.google.com/](https://cla.developers.google.com/) to view existing agreements or sign a new one.

</Step>
<Step title="Confirm CLA status on your PR">

You generally sign the CLA once across Google open-source projects. If you have signed before for another project, you usually do not need to sign again.

</Step>
</Steps>

<Note>
Corporate contributors may need an authorized representative to sign the CLA on behalf of their employer.
</Note>

## Fork-and-PR workflow

All submissions — including those from maintainers — go through GitHub pull requests.

<Steps>
<Step title="Fork and clone">

Fork [GoogleCloudPlatform/knowledge-catalog](https://github.com/GoogleCloudPlatform/knowledge-catalog) on GitHub, then clone your fork locally.

</Step>
<Step title="Create a feature branch">

Make changes on a branch off `main`. Scope each PR to a focused set of changes in one or more related packages.

</Step>
<Step title="Develop and test locally">

Install dependencies for the packages you modify and run their test commands (see tables below). Add or update unit tests when you change behavior.

</Step>
<Step title="Match existing style">

Keep formatting, naming, and patterns consistent with the surrounding code in each package directory.

</Step>
<Step title="Open a pull request">

Push your branch and open a PR against `main`. Address review feedback; maintainers use GitHub PR review for all merges.

</Step>
</Steps>

<Tip>
See [GitHub Help — About pull requests](https://help.github.com/articles/about-pull-requests/) for general PR mechanics.
</Tip>

## Repository layout

The monorepo groups independent packages. Touch the directory that owns the surface you are changing.

:::files
knowledge-catalog/
├── okf/                      # OKF enrichment agent (Python, pytest)
├── agents/
│   ├── mdcode/               # kcmd Metadata as Code CLI + library (TypeScript)
│   └── enrichment/           # Catalog enrichment agent (Python, eval tooling)
├── toolbox/
│   ├── mdcode/               # kcmd copy for toolbox consumers (TypeScript)
│   └── enrichment/           # kcagent TypeScript enrichment harness
└── samples/                  # Discovery and demo samples
:::

| Package path | Language | Primary artifact | Automated test entry point |
|---|---|---|---|
| `okf/` | Python ≥ 3.11 | `enrichment-agent` CLI, OKF bundles | `pytest` |
| `agents/mdcode/` | TypeScript (Bun test runner) | `kcmd` CLI, MCP server | `npm run test` |
| `toolbox/mdcode/` | TypeScript (Bun test runner) | `kcmd` CLI (toolbox layout) | `npm run test` |
| `toolbox/enrichment/` | TypeScript | `kcagent`, `md-fileset` | `npm run compile` (no `test` script) |
| `agents/enrichment/` | Python | `agent_runner.py` enrichment modes | No pytest suite; optional `python -m eval` |

<Info>
`agents/mdcode/` and `toolbox/mdcode/` are parallel copies of the kcmd package. Run tests in whichever tree you modify.
</Info>

## Style expectations

`CONTRIBUTING.md` requires that code **adheres to the existing style** in each package. The repository does not ship a root-level formatter or linter configuration; follow the conventions already present in the directory you edit.

### TypeScript (`agents/mdcode`, `toolbox/mdcode`, `toolbox/enrichment`)

TypeScript packages use `strict` compiler mode. The root `tsconfig.json` in `agents/mdcode` additionally enables `noUnusedLocals`, `noImplicitReturns`, and `noFallthroughCasesInSwitch`.
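
If you want a quick local check that these flags are still enabled after editing compiler settings, a minimal sketch could look like the following. `EXPECTED_FLAGS` and `missing_flags` are hypothetical helpers, and the function takes already-parsed `compilerOptions` because `tsconfig.json` files may contain comments that a strict JSON parser rejects.

```python
# Flags named in the paragraph above. Only agents/mdcode is confirmed to
# enable the last three; treat this list as an assumption for other packages.
EXPECTED_FLAGS = (
    "strict",
    "noUnusedLocals",
    "noImplicitReturns",
    "noFallthroughCasesInSwitch",
)


def missing_flags(compiler_options):
    """List expected flags that are absent or not set to true."""
    return [flag for flag in EXPECTED_FLAGS if compiler_options.get(flag) is not True]


if __name__ == "__main__":
    # Hypothetical compilerOptions, already parsed from a tsconfig.json.
    sample = {"strict": True, "module": "nodenext", "noUnusedLocals": True}
    print(missing_flags(sample))
```

`npm run compile` remains the authoritative check; this only reports which of the named flags a configuration omits.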

| Check | Command | When to run |
|---|---|---|
| Typecheck | `npm run compile` | Before every PR that touches TypeScript |
| Build | `npm run build` | When you change CLI entrypoints or compiled artifacts |
| Unit tests | `npm run test` | Required for `agents/mdcode` and `toolbox/mdcode` changes |

Match import style (`nodenext` modules), naming, and error-handling patterns in neighboring files. Scenario tests live under `tests/libts/scenarios.ts` and load YAML fixtures from `tests/scenarios/`.

### Python (`okf/`, `agents/enrichment/`)

The `okf` package declares `requires-python = ">=3.11"` and ships pytest configuration in `pyproject.toml` (`testpaths = ["tests"]`, `pythonpath = ["src"]`).

Python under `agents/enrichment/` follows Google-style patterns visible in source (for example, targeted `pylint: disable` comments for broad exceptions and import placement). Match indentation, docstring tone, and module layout in `src/` and `eval/`.

For OKF document content produced by agents, the enrichment prompt in `okf/src/enrichment_agent/prompts/enrichment_instruction.md` defines style rules: be concrete, do not invent metadata fields, and write valid markdown without preamble or reasoning narration.

## Running tests

### Python: `okf/` (pytest)

The OKF enrichment agent is the only Python package with a formal pytest suite. Seven test modules cover bundle tools, document parsing, indexing, BigQuery source behavior, web fetchers, and the visualization pipeline.

<Tabs>
<Tab title="Setup">

```bash
cd okf
python3 -m venv .venv
.venv/bin/pip install -e '.[dev]'
```

The `[dev]` extra installs `pytest>=7.0`.

</Tab>
<Tab title="Run all tests">

```bash
.venv/bin/pytest
```

</Tab>
<Tab title="Run a single module">

```bash
.venv/bin/pytest tests/test_document.py -v
```

</Tab>
</Tabs>

<Check>
A clean run reports all tests passed (currently 33 tests across the `okf/tests/` tree).
</Check>

### TypeScript: `agents/mdcode` and `toolbox/mdcode`

Both packages expose identical npm scripts. Tests are **Bun scenario tests** that replay YAML-defined init/pull/push workflows against in-memory mocks, so they make no live GCP calls.

<CodeGroup>

```bash title="agents/mdcode"
cd agents/mdcode
npm install
npm run build    # optional but recommended before test
npm run test
```

```bash title="toolbox/mdcode"
cd toolbox/mdcode
npm install
npm run build    # optional but recommended before test
npm run test
```

</CodeGroup>

`npm run test` delegates to `npm run test:libts`, which executes:

```bash
npx bun test ./tests/libts/scenarios.ts
```

#### Run a subset of scenarios

Set `TEST_GLOB` to filter YAML scenario files under `tests/scenarios/`:

```bash
TEST_GLOB='pull_bq*.yaml' npm run test
```

Scenario files cover manifest initialization, pull/push sync for BigQuery datasets, knowledge bases, entry groups, BigLake namespaces, reference layers, entry links, and custom entries.
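
The effect of a `TEST_GLOB` filter can be approximated with shell-style glob matching, as in this sketch. The real harness is TypeScript running under Bun and its exact matching rules are not documented here, so treat the semantics as an assumption; `select_scenarios` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_scenarios(filenames, test_glob=None):
    """Keep scenario files whose names match test_glob; keep all if unset."""
    if not test_glob:
        return sorted(filenames)
    return sorted(name for name in filenames if fnmatchcase(name, test_glob))


if __name__ == "__main__":
    # Hypothetical file names; the real contents of tests/scenarios/ may differ.
    scenarios = [
        "init_manifest.yaml",
        "pull_bq_dataset.yaml",
        "pull_bq_table.yaml",
        "push_glossary.yaml",
    ]
    print(select_scenarios(scenarios, "pull_bq*.yaml"))
```

Under these assumptions, `pull_bq*.yaml` keeps only the two `pull_bq_` files, which is why a narrow glob is useful while iterating on a single sync path.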

### TypeScript: `toolbox/enrichment`

The `toolbox/enrichment` package builds `kcagent` and `md-fileset` binaries. Its `package.json` defines `build`, `compile`, and `debug` scripts but **does not define a `test` script**, even though `toolbox/enrichment/README.md` mentions `npm run test`.

<Warning>
For `toolbox/enrichment` changes, run `npm run compile` to typecheck and `npm run build` to confirm that the `kcagent` and `md-fileset` binaries still build. Do not expect `npm run test` to succeed until a test script is added to `package.json`.
</Warning>

```bash
cd toolbox/enrichment
npm install
npm run compile
npm run build
```
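
Because the README and `package.json` disagree, it can help to check which scripts a package actually defines before running them. The sketch below is one way to do that. `has_script` is a hypothetical helper, and the sample manifest mirrors only the script names listed above, with placeholder command strings.

```python
import json


def has_script(package_json_text, script):
    """Return True if the package.json text defines the given npm script."""
    scripts = json.loads(package_json_text).get("scripts", {})
    return script in scripts


if __name__ == "__main__":
    # Hypothetical excerpt: script names come from the text above, the
    # command strings are placeholders rather than the real commands.
    sample = json.dumps(
        {"scripts": {"build": "...", "compile": "...", "debug": "..."}}
    )
    for name in ("compile", "build", "test"):
        print(name, has_script(sample, name))
```

To run the same check against the real file, read `toolbox/enrichment/package.json` and pass its contents in place of the sample.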

### Python: `agents/enrichment/` (no pytest suite)

`agents/enrichment/` has no checked-in pytest tests. Validation for enrichment output is handled by the optional eval CLI under `agents/enrichment/eval/`, which scores runs via `python -m eval` rather than unit tests.

```bash
cd agents/enrichment
pip install -r src/requirements.txt
pip install -r eval/requirements.txt

# Score an existing enrichment output directory:
python -m eval --output-dir /path/to/enrich_out
```

Use eval when you change enrichment behavior and need quality signals; it is not a substitute for the pytest or `npm run test` gates required by `CONTRIBUTING.md` in other packages.

## Pre-submission checklist

| Requirement | Verification |
|---|---|
| CLA signed | [cla.developers.google.com](https://cla.developers.google.com/) |
| Apache 2.0 compatibility | Contributions inherit repo license terms |
| Style consistency | Matches surrounding code; `npm run compile` clean for TS |
| Unit tests for behavior changes | New or updated tests in the affected package |
| All package tests pass | `pytest` in `okf/`; `npm run test` in mdcode dirs you touched |
| PR opened against `main` | GitHub pull request with review |

<AccordionGroup>
<Accordion title="What reviewers expect">

Reviewers check that changes are scoped, tested, and consistent with package conventions. TypeScript PRs should not introduce `tsc --noEmit` errors under `npm run compile`. Python PRs touching `okf/` should keep the pytest suite green.

</Accordion>
<Accordion title="When you change multiple packages">

Run tests in every package directory you modify. A change spanning `okf/` and `agents/mdcode/` requires both `pytest` and `npm run test` in their respective roots.

</Accordion>
</AccordionGroup>

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node.js setup, and credential configuration before you run package tests locally.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Auth, billing, and test-environment failures that block local verification.
</Card>
<Card title="Overview" href="/overview">
Knowledge Catalog tooling surface and how the packages under `agents/`, `okf/`, and `toolbox/` fit together.
</Card>
</CardGroup>

---