# Overview

> Knowledge Catalog tooling surface: OKF bundles, kcmd Metadata as Code workspaces, enrichment and discovery agents, and the shortest paths to produce and publish metadata context.

- Repository: GoogleCloudPlatform/knowledge-catalog
- GitHub: https://github.com/GoogleCloudPlatform/knowledge-catalog
- Human docs: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5
- Complete Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms-full.txt

## Source Files

- `README.md`
- `toolbox/README.md`
- `okf/README.md`
- `agents/enrichment/README.md`
- `samples/README.md`
- `agents/mdcode/README.md`

---


GoogleCloudPlatform/knowledge-catalog ships four cooperating surfaces for Knowledge Catalog (Dataplex): `kcmd` Metadata as Code workspaces under `toolbox/mdcode` and `agents/mdcode`; catalog enrichment agents that emit mdcode or OKF bundles; an OKF specification and reference producer under `okf/`; and ADK-based discovery and enrichment samples under `samples/`. All publish paths converge on version-controlled artifacts (YAML/Markdown mdcode trees or plain-Markdown OKF bundles) that agents and humans can read without proprietary SDKs.

## Repository layout

| Path | Role | Primary CLIs / entrypoints |
|------|------|---------------------------|
| `toolbox/mdcode/` | Metadata as Code CLI, library, MCP server | `kcmd init`, `pull`, `push`, `reference`, `mcp` |
| `toolbox/enrichment/` | TypeScript enrichment harness | `kcagent enrich` (depends on `kcmd`) |
| `agents/mdcode/` | Same `kcmd` package, vendored for Python agents | `kcmd` (built to `agents/mdcode/dist/kcmd`) |
| `agents/enrichment/` | Python catalog enrichment agent + eval | `agent_runner.py`, `python -m eval` |
| `okf/` | Open Knowledge Format v0.1 + reference producer | `python -m enrichment_agent enrich`, `visualize` |
| `samples/discovery/` | Knowledge Catalog search agent (ADK) | `adk run …` |
| `samples/enrichment/` | Python enrichment sample + publish helpers | `enrichment/enrich.py` |

<Note>
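`toolbox/` and `agents/` expose the same capabilities to TypeScript and Python callers. `agents/mdcode/` vendors the same `kcmd` package as `toolbox/mdcode/`, while enrichment is implemented as a TypeScript harness in `toolbox/enrichment/` and as a Python agent in `agents/enrichment/`. The Python enrichment agent shells out to `agents/mdcode/dist/kcmd` and never calls the Dataplex API directly.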
</Note>

## Metadata formats

Two artifact families address the same problem, making catalog knowledge agent-ready, but use different sync models.

| Format | Location | Structure | Sync target | Best for |
|--------|----------|-----------|-------------|----------|
| **Metadata as Code (mdcode)** | `catalog.yaml` + `catalog/` tree | YAML entries, Markdown sidecars (`.overview.md`), read-only `.ref.yaml` reference layers | Knowledge Catalog via `kcmd pull` / `push` | Live catalog publication, glossary links, entry groups |
| **Open Knowledge Format (OKF)** | Bundle directory (spec in `okf/SPEC.md`) | Directory of `.md` files with YAML frontmatter, optional `index.md` | Files only (tarball, git, static host); no catalog API required | Vendor-neutral exchange, offline bundles, progressive disclosure |

OKF is intentionally provider-neutral: any producer (human, agent, export pipeline) can write bundles; any consumer (LLM, static viewer, search index) can read them. The `okf/` enrichment agent is one reference producer; `enrichment_agent visualize` emits a self-contained `viz.html` graph viewer.
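To make the "any consumer can read them" point concrete, here is a minimal sketch of a bundle reader that uses only the Python standard library. It assumes each concept file opens with a `---`-delimited frontmatter block and parses only flat `key: value` pairs; a real consumer would use a YAML parser and the fields defined in `okf/SPEC.md`. The function names are illustrative and not part of the repository.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (frontmatter, body).

    Simplification: only flat `key: value` lines are parsed. Nested YAML
    would need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


def read_bundle(bundle_dir: str) -> dict[str, dict[str, str]]:
    """Map each .md file's bundle-relative path to its parsed frontmatter."""
    root = Path(bundle_dir)
    return {
        path.relative_to(root).as_posix(): split_frontmatter(
            path.read_text(encoding="utf-8")
        )[0]
        for path in sorted(root.rglob("*.md"))
    }
```

Calling `read_bundle("./bundles/my-bundle")` returns one frontmatter dictionary per Markdown file, which is enough to build an index or feed a search pipeline without any catalog API.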

mdcode is the catalog-native path: `catalog.yaml` declares `scope`, `snapshot`, `publishing`, and optional `reference` blocks that control which entries, aspects, and `entryLinks` are pulled, grounded, and pushed.
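As a quick sanity check on a manifest, the sketch below lists the top-level keys of a `catalog.yaml` without a YAML dependency and reports which of the four blocks named above are present. It assumes those blocks appear as unindented top-level mapping keys, which the source does not spell out, and the helper names are hypothetical.

```python
import re

EXPECTED_BLOCKS = ("scope", "snapshot", "publishing", "reference")
_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def top_level_keys(manifest_text: str) -> list[str]:
    """Return unindented mapping keys in document order.

    Indented lines and `#` comments never match the pattern, so they are skipped.
    """
    keys = []
    for line in manifest_text.splitlines():
        match = _TOP_LEVEL_KEY.match(line)
        if match:
            keys.append(match.group(1))
    return keys


def block_report(manifest_text: str) -> dict[str, bool]:
    """Report which expected blocks are present as top-level keys."""
    present = set(top_level_keys(manifest_text))
    return {name: name in present for name in EXPECTED_BLOCKS}
```

Because `reference` is optional, a `False` for that key is not an error; a missing `scope` or `snapshot` is more likely to indicate a manifest that `kcmd init` did not produce.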

## Architecture

```mermaid
flowchart TB
  subgraph sources [Source metadata]
    BQ[BigQuery datasets]
    Drive[Google Drive / local .md]
    GH[GitHub repos via MCP]
    Web[Web crawl seeds]
  end

  subgraph produce [Enrichment producers]
    PyAgent[agents/enrichment/agent_runner.py]
    TSAgent[toolbox/enrichment/kcagent]
    OKFAgent[okf/enrichment_agent]
  end

  subgraph artifacts [Version-controlled artifacts]
    Mdcode[catalog.yaml + catalog/ mdcode tree]
    OKFBundle[OKF bundle directory]
  end

  subgraph publish [Catalog publication]
    Kcmd[kcmd pull / push / reference]
    Viz[okf visualize → viz.html]
  end

  subgraph consume [Consumption]
    Catalog[Knowledge Catalog service]
    Agents[Downstream AI agents]
    MCP[kcmd MCP server]
  end

  BQ --> PyAgent
  Drive --> PyAgent
  GH --> PyAgent
  BQ --> OKFAgent
  Web --> OKFAgent
  BQ --> TSAgent

  PyAgent --> Kcmd
  TSAgent --> Kcmd
  Kcmd --> Mdcode
  OKFAgent --> OKFBundle

  Mdcode --> Kcmd
  Kcmd -->|push| Catalog
  OKFBundle --> Viz
  OKFBundle --> Agents
  Mdcode --> MCP
  MCP --> Agents
  Catalog --> Agents
```

## kcmd: Metadata as Code workspace

`kcmd` manages a local workspace bound to a `catalog.yaml` manifest. Initialization selects exactly one primary source type; BigQuery mode accepts multiple datasets via repeated `--bigquery-dataset`.

<ParamField body="--bigquery-dataset" type="string">
BigQuery scope as `project.dataset`. YAML layout for tables, views, schemas.
</ParamField>

<ParamField body="--kb" type="string">
Knowledge base entry group as `project.location.entryGroupId`. Markdown layout for wiki-style pages.
</ParamField>

<ParamField body="--entry-group" type="string">
Custom entry group scope. YAML layout for user-defined entries.
</ParamField>

<ParamField body="--glossary" type="string">
Business glossary scope as `project.location.glossary-id` (comma-separated or location mode). `push` updates existing glossary metadata but does not create glossary tree nodes.
</ParamField>
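The scope formats above can be checked before anything is passed to `kcmd init`. The following sketch validates only the shapes documented here (`project.dataset` for BigQuery, three dot-separated parts for `--kb` and a single `--glossary` ID). It deliberately skips `--entry-group`, whose format is not given, and the comma-separated and location glossary modes. `Scope` and `parse_scope` are hypothetical names, not part of `kcmd`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    kind: str                 # "bigquery-dataset", "kb", or "glossary"
    project: str
    location: Optional[str]   # None for BigQuery scopes
    resource_id: str


# Number of dot-separated parts each flag value is documented to have.
_PART_COUNTS = {"bigquery-dataset": 2, "kb": 3, "glossary": 3}


def parse_scope(kind: str, value: str) -> Scope:
    """Split a scope string into its parts, raising ValueError on a mismatch."""
    if kind not in _PART_COUNTS:
        raise ValueError(f"unsupported scope kind: {kind!r}")
    expected = _PART_COUNTS[kind]
    parts = value.split(".")
    if len(parts) != expected or not all(parts):
        raise ValueError(
            f"--{kind} expects {expected} dot-separated parts, got {value!r}"
        )
    if kind == "bigquery-dataset":
        return Scope(kind, parts[0], None, parts[1])
    return Scope(kind, parts[0], parts[1], parts[2])
```

Failing early on a malformed scope gives a clearer error than letting the CLI reject it, though the CLI remains the authority on what it accepts.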

Core sync commands:

| Command | Effect |
|---------|--------|
| `kcmd init` | Scaffold `catalog.yaml` for the chosen scope |
| `kcmd pull` | Download editable metadata into `catalog/` |
| `kcmd reference` | Pull read-only `.ref.yaml` grounding layers (never pushed) |
| `kcmd push` | Upload local edits; reconcile `entryLinks` when configured |
| `kcmd mcp` | Expose `list-entries`, `lookup-entry`, `modify-entry` over MCP |

Authentication uses `gcloud` Application Default Credentials for both CLI and MCP.
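For scripted use, the init, pull, and push sequence can be wrapped in a small driver. This is a sketch under the assumption that `kcmd` is on `PATH` and that the commands take no arguments beyond those shown in this document. `sync_plan` and `run_plan` are hypothetical helpers, and `dry_run` defaults to `True` so that nothing is executed by accident.

```python
import subprocess
from typing import Sequence


def sync_plan(datasets: Sequence[str], push: bool = False) -> list[list[str]]:
    """Build argv lists for init -> pull (-> push); nothing is executed here."""
    if not datasets:
        raise ValueError("at least one project.dataset scope is required")
    init = ["kcmd", "init"]
    for dataset in datasets:
        # The flag is repeated once per dataset.
        init += ["--bigquery-dataset", dataset]
    plan = [init, ["kcmd", "pull"]]
    if push:
        plan.append(["kcmd", "push"])
    return plan


def run_plan(plan: list[list[str]], workspace: str, dry_run: bool = True) -> None:
    """Print each command; execute it in `workspace` only when dry_run is False."""
    for argv in plan:
        print("$", " ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, cwd=workspace, check=True)
```

Keeping plan construction separate from execution makes the command sequence easy to review or log before it touches a live catalog.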

## Enrichment agents

Three enrichment paths read source metadata, ground on external docs or code, and emit artifacts ready for inspection or publication.

### Python catalog enrichment agent (`agents/enrichment/`)

`agent_runner.py` dispatches one of three modes. The agent reaches the catalog only through the read-only `kcmd init`, `pull`, and `reference` commands; publishing is a separate step in which you run `kcmd push` yourself.

| Mode | Required flags | Output |
|------|----------------|--------|
| `table` | `--dataset`, `--project`, `--model`, `--output_dir` | Enriched table overviews on live `@bigquery` entries; optional `queries` aspect and glossary column links |
| `doc` | `--entry_group`, `--project`, `--model`, `--output_dir` | Knowledge-base mdcode snapshot from Drive or local Markdown |
| `context_overlay` | `--dataset`, `--entry_group`, `--project`, `--model`, `--output_dir` | New overlay entries in an owned entry group; live BigQuery entries stay read-only |

Shared optional inputs across modes: `--folders` (Drive or local `.md`), `--feedback_dir` / `--feedback_files` (highest-priority proposals), `--repo` (GitHub MCP code context), `--include_usage` (BigQuery `INFORMATION_SCHEMA` query history), `--interactive` / `--refine_instruction` (post-run refinement).
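The required-flag table can be encoded as data and checked before a long-running invocation. The sketch below takes the flag sets directly from the table above. `missing_flags` is a hypothetical pre-flight helper, not something `agent_runner.py` is known to provide.

```python
REQUIRED_FLAGS = {
    "table": {"--dataset", "--project", "--model", "--output_dir"},
    "doc": {"--entry_group", "--project", "--model", "--output_dir"},
    "context_overlay": {
        "--dataset", "--entry_group", "--project", "--model", "--output_dir",
    },
}


def missing_flags(mode: str, argv: list[str]) -> list[str]:
    """Return required flags absent from argv, sorted for stable output."""
    if mode not in REQUIRED_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    # Accept both `--flag=value` and `--flag value` spellings.
    provided = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
    return sorted(REQUIRED_FLAGS[mode] - provided)
```

An empty return value means the invocation names every flag the chosen mode requires; it says nothing about whether the values themselves are valid.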

### TypeScript enrichment harness (`toolbox/enrichment/`)

`kcagent enrich` runs against an initialized mdcode workspace with MCP tools and a prompt file:

```bash
kcmd init --bigquery-dataset <projectId>.<datasetId>
kcmd pull
kcagent enrich --catalog-path . --tools-path tools --prompt-path prompt.md
```

The toolbox demo wires an `md-fileset` MCP server and fileset skills for organizational Markdown grounding.

### OKF enrichment agent (`okf/`)

`python -m enrichment_agent enrich` runs a two-pass pipeline. A BigQuery pass writes one OKF concept document for each concept the source advertises; an optional web pass then crawls the `--web-seed` / `--web-seed-file` URLs, bounded by the `--web-max-pages` and `--web-allowed-host` caps. Use `--no-web` for BigQuery-only runs. Sample recipes live under `okf/samples/`, with checked-in bundles under `okf/bundles/`.
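To illustrate how a page cap and a host allow-list bound the web pass, here is a minimal sketch of URL selection. It is not the repository's fetch tool: the function name, the fragment-stripping de-duplication, and the first-seen ordering are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def plan_fetches(
    candidates: list[str], allowed_hosts: set[str], max_pages: int
) -> list[str]:
    """Return the URLs a capped crawl would fetch, in first-seen order."""
    selected: list[str] = []
    seen: set[str] = set()
    for url in candidates:
        if len(selected) >= max_pages:
            break  # page cap reached
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https") or host not in allowed_hosts:
            continue  # off-domain or non-web URL
        normalized = url.split("#", 1)[0]  # ignore fragments when de-duplicating
        if normalized in seen:
            continue
        seen.add(normalized)
        selected.append(normalized)
    return selected
```

Applying both limits at selection time, rather than after fetching, keeps a misconfigured seed list from generating unbounded traffic.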

## Discovery agent

`samples/discovery/` implements an ADK agent that calls the Knowledge Catalog Search API. It decomposes natural-language questions, issues multiple search queries, and reranks results. Run as a root agent (`discovery_agent` → `root_agent` in `agent.py`) or embed as an `AgentTool` in a parent agent. Required APIs: `dataplex.googleapis.com`, `aiplatform.googleapis.com`, `serviceusage.googleapis.com`.
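The source does not say how the agent reranks results from its multiple queries. As a stand-in, the sketch below merges several ranked result lists with reciprocal rank fusion, a common technique for combining rankings. The entry names and the constant `k = 60` are illustrative assumptions, not values taken from the sample.

```python
from collections import defaultdict


def fuse_rankings(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of entry names with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, entry in enumerate(results, start=1):
            # Entries near the top of any list contribute more to the total.
            scores[entry] += 1.0 / (k + rank)
    # Sort by descending score; break ties by name for deterministic output.
    return sorted(scores, key=lambda entry: (-scores[entry], entry))
```

An entry that appears in several sub-query results rises above one that ranks first in only a single list, which matches the intuition behind decomposing a question into multiple searches.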

## Shortest paths to metadata context

<Steps>
<Step title="Sync existing catalog metadata">
Initialize a workspace, pull a snapshot, and inspect entries locally.

```bash
cd my-workspace
kcmd init --bigquery-dataset my-project.my-dataset
kcmd pull
```

Verify: `catalog/` contains YAML entries and `catalog.yaml` lists `scope` and `snapshot` types.
</Step>

<Step title="Produce an OKF bundle from BigQuery">
Generate a portable, git-friendly knowledge bundle without catalog push.

```bash
cd okf
python3 -m venv .venv && .venv/bin/pip install -e ".[dev]"
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset my-project.my-dataset \
  --web-seed-file seeds.txt \
  --out ./bundles/my-bundle
```

Verify: bundle directory contains concept `.md` files with frontmatter; optional `viz.html` via `enrichment_agent visualize --bundle ./bundles/my-bundle`.
</Step>

<Step title="Enrich catalog metadata with the Python agent">
Generate mdcode from BigQuery plus Drive or local Markdown, then publish.

```bash
export PYTHONPATH=agents/enrichment/src
python3 agents/enrichment/src/agent_runner.py \
  --mode=table \
  --dataset=my-project.my-dataset \
  --folders=./local_md_corpus \
  --project=my-project \
  --model=gemini-2.5-pro \
  --output_dir=/tmp/enrich_out

cd /tmp/enrich_out && kcmd push
```

Verify: `find /tmp/enrich_out -type f` lists `trajectory.json` and the files under `catalog/`, including the enriched overview sidecars.
</Step>

<Step title="Discover assets with the search agent">
Run the discovery agent locally for semantic catalog search.

```bash
# Run from the repository root so the `adk run` path resolves.
pip install -r samples/discovery/requirements.txt
export GOOGLE_CLOUD_PROJECT=my-project
export GOOGLE_GENAI_USE_VERTEXAI=True
adk run samples/discovery
```

Verify: agent responds to natural-language data-asset queries using `knowledge_catalog_search`.
</Step>
</Steps>
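The "Verify" check for the Python agent step can be automated. This sketch assumes the layout described in that step: `catalog/` and `trajectory.json` directly under the output directory, with overview sidecars using the `.overview.md` suffix. `check_enrichment_output` is a hypothetical helper name.

```python
from pathlib import Path


def check_enrichment_output(output_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    root = Path(output_dir)
    problems = []
    if not (root / "catalog").is_dir():
        problems.append("missing catalog/ directory")
    if not (root / "trajectory.json").is_file():
        problems.append("missing trajectory.json")
    # Overview sidecars may sit at any depth below the output directory.
    if not any(root.rglob("*.overview.md")):
        problems.append("no .overview.md sidecars found")
    return problems
```

Running this before `kcmd push` catches an empty or partial run early; it checks only that the files exist, not that their content is sound.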

## Evaluation and quality gates

Before publishing mdcode, score a run with the golden-free evaluator:

```bash
cd agents/enrichment
python -m eval --output-dir /tmp/enrich_out
```

Deterministic metrics (`structural_validity`, `perf`) always run. Judge-based metrics (`hallucination_free`, `redundancy_index`, `disambiguation_efficacy`, `absence_of_contradictions`) activate when Vertex AI ADC is configured. Optional `--golden` files add concept recall, fact recall, and consistency metrics across repeated runs.
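A publish gate on top of these metrics might look like the sketch below. The evaluator's output format is not described here, so the `scores` dictionary, the threshold values, and the assumption that every thresholded metric is higher-is-better are all hypothetical; a metric such as `redundancy_index` may well be lower-is-better in practice.

```python
DETERMINISTIC = ("structural_validity", "perf")
JUDGE_BASED = (
    "hallucination_free",
    "redundancy_index",
    "disambiguation_efficacy",
    "absence_of_contradictions",
)


def skipped_judges(scores: dict[str, float]) -> list[str]:
    """Judge-based metrics with no score, e.g. when no judge was configured."""
    return [name for name in JUDGE_BASED if name not in scores]


def gate(
    scores: dict[str, float], thresholds: dict[str, float]
) -> tuple[bool, list[str]]:
    """Return (passed, reasons).

    Deterministic metrics must be present; a threshold is enforced only
    for metrics that actually appear in `scores`.
    """
    reasons = []
    for name in DETERMINISTIC:
        if name not in scores:
            reasons.append(f"{name}: missing")
    for name, minimum in thresholds.items():
        if name in scores and scores[name] < minimum:
            reasons.append(f"{name}: {scores[name]} < {minimum}")
    return (not reasons, reasons)
```

Reporting skipped judge metrics separately keeps a run without Vertex AI credentials from silently looking as thoroughly checked as a fully judged one.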

## Prerequisites at a glance

| Requirement | Used by |
|-------------|---------|
| Node.js + npm | Build `kcmd` and `kcagent` (`npm run build` in package dirs) |
| Python 3.11+ (3.13 for OKF) | `agents/enrichment`, `okf/`, `samples/` |
| `gcloud` ADC | `kcmd`, BigQuery, Drive read scopes for enrichment |
| Vertex AI or Gemini credentials | Enrichment and discovery agents (`--project`, `--model`, or `GEMINI_API_KEY`) |
| GitHub PAT (optional) | `--repo` code context via GitHub MCP |

<Warning>
Public BigQuery datasets are readable, but the bytes your queries scan are billed to your project. Web enrichment enforces `--web-max-pages` and domain filters inside the fetch tool.
</Warning>

## Choosing a path

| Goal | Start here |
|------|------------|
| Bi-directional sync with Knowledge Catalog | `kcmd init` → `pull` → edit → `push` |
| Portable, vendor-neutral knowledge for agents | OKF `enrich` → git or static host |
| Rich table docs grounded on Drive, usage SQL, glossaries | `agent_runner.py --mode=table` |
| Wiki/knowledge base from docs | `agent_runner.py --mode=doc` |
| Enriched context without touching live BQ entries | `agent_runner.py --mode=context_overlay` |
| Natural-language asset discovery | `samples/discovery` ADK agent |
| Agent-driven metadata edits in CI | `kcmd mcp` with `modify-entry` |

<Info>
Model backends in this repository default to Vertex AI / Gemini, but both OKF bundles and mdcode trees are consumable by any agent that reads files, so there is no provider lock-in at the artifact layer.
</Info>

## Next

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, Python and Node.js setup, package installs, and credential configuration for BigQuery, Vertex AI or Gemini, and gcloud ADC.
</Card>
<Card title="Quickstart" href="/quickstart">
First successful runs: initialize a kcmd workspace and pull metadata, produce an OKF bundle from BigQuery, or run the catalog enrichment agent and inspect output.
</Card>
<Card title="Open Knowledge Format" href="/open-knowledge-format">
OKF v0.1 bundle structure, concept documents, frontmatter fields, and cross-link semantics.
</Card>
<Card title="Metadata as Code" href="/metadata-as-code">
kcmd workspace model: catalog.yaml manifest, pull/push sync, reference layers, and glossary scope.
</Card>
<Card title="Enrichment workflows" href="/enrichment-workflows">
How enrichment agents read source metadata, ground on external docs or code, and hand off to kcmd push.
</Card>
</CardGroup>
