# Produce OKF bundles

> Run the OKF enrichment agent against a BigQuery source with optional web crawl seeds, concept scoping, and two-pass BQ-then-web enrichment into a versionable bundle directory.

- Repository: GoogleCloudPlatform/knowledge-catalog
- GitHub: https://github.com/GoogleCloudPlatform/knowledge-catalog
- Human docs: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5
- Complete Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms-full.txt

## Source Files

- `okf/README.md`
- `okf/src/enrichment_agent/cli.py`
- `okf/src/enrichment_agent/runner.py`
- `okf/src/enrichment_agent/agent.py`
- `okf/src/enrichment_agent/sources/bigquery.py`
- `okf/src/enrichment_agent/prompts/enrichment_instruction.md`
- `okf/src/enrichment_agent/prompts/web_ingestion_instruction.md`

---


The `enrichment-agent` CLI (`python -m enrichment_agent enrich`) reads BigQuery dataset metadata through a pluggable `Source` interface, runs a Google ADK agent per concept to emit OKF markdown documents, optionally augments those docs from seeded web pages, and regenerates `index.md` files across the output bundle directory.

## What you produce

An OKF bundle is a directory of markdown files with YAML frontmatter. Each BigQuery concept becomes one document; the web pass may add `references/` docs and augment existing primary concepts. The bundle is plain files—suitable for git, static hosting, or downstream agent consumption.

:::files
bundles/<name>/
├── index.md                    # Auto-generated directory index
├── datasets/
│   ├── index.md
│   └── <dataset_id>.md
├── tables/
│   ├── index.md
│   └── <table_id>.md           # Sharded families use prefix (e.g. events_.md)
└── references/                 # Optional, from web pass
    ├── metrics/
    ├── joins/
    └── <slug>.md
:::

<Info>
OKF bundles are vendor-neutral. The enrichment agent is one producer; the format itself is defined in the OKF specification and is not tied to a model provider or serving system.
</Info>

## Two-pass enrichment

Enrichment runs in two sequential passes orchestrated by `EnrichmentRunner.enrich_all()`:

| Pass | Agent | Input | Output |
|------|-------|-------|--------|
| BQ pass | `okf_bq_enrichment_agent` | BigQuery metadata per concept | One OKF doc per advertised concept |
| Web pass | `okf_web_ingestion_agent` | Seed URLs and crawl constraints | Augmented primary docs and optional `references/` docs |

```mermaid
sequenceDiagram
    participant CLI as enrichment_agent CLI
    participant Runner as EnrichmentRunner
    participant BQ as BigQuerySource
    participant BQAgent as okf_bq_enrichment_agent
    participant WebAgent as okf_web_ingestion_agent
    participant Bundle as bundle_root/

    CLI->>Runner: enrich_all(only?)
    Runner->>BQ: list_concepts()
    loop Each concept
        Runner->>BQAgent: enrich_concept(ref)
        BQAgent->>Bundle: write_concept_doc
    end
    opt web_seeds provided
        Runner->>WebAgent: run_web_pass()
        WebAgent->>Bundle: augment / mint references
    end
    Runner->>Bundle: regenerate_indexes()
```

**BQ pass.** For each `ConceptRef` from the source, the agent calls `read_concept_raw`, optionally `sample_rows`, and writes exactly one document via `write_concept_doc`. Documents include prose, `# Schema`, `# Common query patterns`, and `# Citations`.

**Web pass.** When seed URLs are provided, a separate agent crawls outward from seeds using `fetch_url`. For each fetched page it enriches existing concepts, mints `references/<slug>` docs, or skips. Hard limits are enforced inside the tool—not by prompt alone.

Skip the web pass with `--no-web`, or omit seeds entirely.

## Prerequisites

<Steps>
<Step title="Install the OKF package">

From the `okf/` directory:

```bash
python3.13 -m venv .venv
.venv/bin/pip install --index-url https://pypi.org/simple/ -e ".[dev]"
```

The CLI entry point is `enrichment-agent`; the module form `python -m enrichment_agent` is equivalent.

</Step>

<Step title="Configure BigQuery credentials">

```bash
gcloud auth application-default login
gcloud config set project <your-billing-project>
```

Public datasets are readable, but query bytes bill against the caller's project. Override billing with `--billing-project`.

</Step>

<Step title="Configure model credentials">

Use one of:

<Tabs>
<Tab title="AI Studio">

Set `GEMINI_API_KEY`.

</Tab>
<Tab title="Vertex AI">

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<id>
export GOOGLE_CLOUD_LOCATION=<region>
```

</Tab>
</Tabs>

The default model is `gemini-flash-latest`; override it with `--model`.

</Step>
</Steps>

## Run enrichment

<RequestExample>

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
    --web-seed-file samples/ga4_merch_store/seeds.txt \
    --out ./bundles/ga4_merch_store
```

</RequestExample>

<ResponseExample>

```text
Enriched 12 concept(s) into bundles/ga4_merch_store; web pass used 3 seed(s)
```

</ResponseExample>

### Required flags

<ParamField body="--source" type="string" required>
Source adapter. Currently only `bq` (BigQuery).
</ParamField>

<ParamField body="--dataset" type="string" required>
BigQuery dataset in `project.dataset` form (for example `bigquery-public-data.ga4_obfuscated_sample_ecommerce`).
</ParamField>

<ParamField body="--out" type="path" required>
Bundle root directory. Created if missing.
</ParamField>

### Concept scoping

<ParamField body="--concept" type="string">
Enrich only the given concept id. Repeatable. Format is slash-separated segments matching the source's concept ids, for example `tables/events_` or `datasets/ga4_obfuscated_sample_ecommerce`.
</ParamField>

Use concept scoping to iterate on a single table without re-running the full dataset:

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
    --web-seed-file samples/ga4_merch_store/seeds.txt \
    --out ./bundles/ga4_merch_store \
    --concept tables/events_
```

Unknown concept ids raise `ValueError` before enrichment starts.
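
The fail-fast check can be pictured with the sketch below. It is an illustration only, not the code in `runner.py`: the helper name `select_concepts` and the plain-string concept ids are assumptions, and only the `Unknown concept(s)` message prefix is taken from the CLI's documented error.

```python
from __future__ import annotations


def select_concepts(advertised: list[str], only: list[str] | None) -> list[str]:
    """Return the concept ids to enrich, failing fast on unknown ids.

    Hypothetical helper: the real runner works with ConceptRef objects.
    """
    if not only:
        # No --concept flags were given: enrich everything the source advertises.
        return list(advertised)
    known = set(advertised)
    unknown = [cid for cid in only if cid not in known]
    if unknown:
        # Raised before any enrichment work starts.
        raise ValueError(f"Unknown concept(s): {', '.join(unknown)}")
    # Keep the source's advertised order and drop duplicate requests.
    wanted = set(only)
    return [cid for cid in advertised if cid in wanted]


advertised = ["datasets/ga4_obfuscated_sample_ecommerce", "tables/events_"]
print(select_concepts(advertised, ["tables/events_"]))
```

Validating the whole list up front means a typo in one `--concept` value costs no model calls.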

### Web crawl configuration

| Flag | Default | Purpose |
|------|---------|---------|
| `--web-seed` | — | Single seed URL; repeatable |
| `--web-seed-file` | — | File with one URL per line (`#` comments allowed); repeatable |
| `--no-web` | `false` | Skip web pass entirely |
| `--web-max-pages` | `100` | Hard cap on pages fetched per run |
| `--web-max-depth` | `2` | Max hop distance from any seed (seeds are depth 0) |
| `--web-allowed-host` | seed hosts only | Extra hostnames the crawler may fetch; repeatable |
| `--web-allowed-path-prefix` | no restriction | Only fetch URLs whose path starts with one of these prefixes; repeatable |
| `--web-denied-path-substring` | — | Reject URLs whose path contains these substrings; repeatable |

Seed files support `#` comment lines:

```text
# GA4 BigQuery Export — schema reference
https://support.google.com/analytics/answer/7029846
```
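
A seed file in this shape can be read with a few lines of Python. The sketch below is a minimal illustration, assuming that only whole-line `#` comments and blank lines are skipped; the function name `parse_seed_lines` is hypothetical, and whether the real CLI also strips trailing comments after a URL is not stated here.

```python
def parse_seed_lines(text: str) -> list[str]:
    """Return seed URLs from seed-file text, skipping blanks and '#' comment lines."""
    seeds = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        seeds.append(line)
    return seeds


sample = """\
# GA4 BigQuery Export schema reference
https://support.google.com/analytics/answer/7029846
"""
print(parse_seed_lines(sample))
```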

Allowed hosts default to the netloc of each seed URL. The `fetch_url` tool rejects URLs outside allowed hosts, over budget, beyond max depth, on denied path substrings, or not reachable from the seed link graph.
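
The rejection rules above can be sketched as a single gate function. This is a hedged illustration, not the `fetch_url` implementation: `CrawlPolicy`, `rejection_reason`, the order of the checks, and every reason string except `max_pages reached` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    """Hypothetical container mirroring the --web-* flags."""
    seeds: list
    max_pages: int = 100
    max_depth: int = 2
    extra_hosts: set = field(default_factory=set)
    allowed_path_prefixes: tuple = ()
    denied_path_substrings: tuple = ()

    def allowed_hosts(self) -> set:
        # Default: the netloc of every seed, plus any --web-allowed-host values.
        return {urlsplit(seed).netloc for seed in self.seeds} | set(self.extra_hosts)


def rejection_reason(policy, url, depth, pages_fetched, discovered):
    """Return a reason string if the URL must be rejected, else None."""
    parts = urlsplit(url)
    if pages_fetched >= policy.max_pages:
        return "max_pages reached"
    if url not in discovered:
        # Only seeds and links found on fetched pages are eligible.
        return "not reachable from seeds"
    if depth > policy.max_depth:
        return "max depth exceeded"
    if parts.netloc not in policy.allowed_hosts():
        return "host not allowed"
    prefixes = tuple(policy.allowed_path_prefixes)
    if prefixes and not parts.path.startswith(prefixes):
        return "path prefix not allowed"
    if any(sub in parts.path for sub in policy.denied_path_substrings):
        return "denied path substring"
    return None


policy = CrawlPolicy(
    seeds=["https://support.google.com/analytics/answer/7029846"],
    denied_path_substrings=("/login",),
)
discovered = set(policy.seeds) | {"https://support.google.com/login"}
print(rejection_reason(policy, policy.seeds[0], 0, 0, discovered))
print(rejection_reason(policy, "https://support.google.com/login", 1, 1, discovered))
```

Keeping the checks in tool code rather than in the prompt means the budget holds even when the model asks for a URL it should not fetch.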

<Warning>
When `fetch_url` returns `"max_pages reached"` or an `error` field, treat it as final. Do not retry rejected URLs in the same run.
</Warning>

### Other flags

<ParamField body="--billing-project" type="string">
Google Cloud project billed for BigQuery queries. Defaults to the default project of your Application Default Credentials.
</ParamField>

<ParamField body="--model" type="string">
Gemini model id. Default: `gemini-flash-latest`.
</ParamField>

<ParamField body="-v, --verbose" type="boolean">
Enable debug logging for enrichment agent events.
</ParamField>

## BigQuery concepts

`BigQuerySource` advertises one concept per dataset plus one per table. Sharded tables matching `prefix_######` (6–8 digit suffix) collapse into a single wildcard concept at `tables/<prefix>`, which keeps the trailing underscore (for example `tables/events_`), with a representative shard used for schema sampling.
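
Shard collapsing of this kind can be sketched with a regular expression. The example is an assumption-laden illustration, not the logic in `sources/bigquery.py`: the pattern, the helper name `collapse_shards`, and the choice of the lexicographically greatest shard as the representative are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: a prefix ending in "_" followed by a 6-8 digit suffix.
SHARD_RE = re.compile(r"^(?P<prefix>.+_)(?P<suffix>\d{6,8})$")


def collapse_shards(table_ids):
    """Map concept id -> representative table id, one concept per shard family."""
    families = defaultdict(list)
    concepts = {}
    for table_id in table_ids:
        match = SHARD_RE.match(table_id)
        if match:
            families[match.group("prefix")].append(table_id)
        else:
            # Ordinary table: one concept per table.
            concepts[f"tables/{table_id}"] = table_id
    for prefix, shards in families.items():
        # Assumption: use the lexicographically greatest shard for schema sampling.
        concepts[f"tables/{prefix}"] = max(shards)
    return concepts


print(collapse_shards(["events_20210101", "events_20210131", "users"]))
```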

Concept ids map to filesystem paths:

| Concept id | Document path |
|------------|---------------|
| `datasets/<dataset_id>` | `datasets/<dataset_id>.md` |
| `tables/<table_id>` | `tables/<table_id>.md` |
| `references/<slug>` | `references/<slug>.md` |
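
The mapping in the table amounts to appending `.md` to the last segment of the concept id. A minimal sketch follows; the function name `concept_doc_path` and the segment validation are assumptions added for illustration, not behavior confirmed by the source.

```python
from pathlib import Path


def concept_doc_path(bundle_root: Path, concept_id: str) -> Path:
    """Translate a slash-separated concept id into a document path under the bundle."""
    segments = concept_id.split("/")
    # Assumed safety check: reject empty segments and path traversal.
    if any(seg in ("", ".", "..") for seg in segments):
        raise ValueError(f"invalid concept id: {concept_id!r}")
    # Append ".md" rather than replacing a suffix, so ids containing dots survive.
    return bundle_root.joinpath(*segments[:-1], segments[-1] + ".md")


root = Path("bundles/ga4_merch_store")
print(concept_doc_path(root, "tables/events_").as_posix())
```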

The BQ agent tools are `list_concepts`, `read_concept_raw`, `sample_rows`, `read_existing_doc`, and `write_concept_doc`.

## Web pass behavior

The web agent augments BQ-produced docs under strict rules:

- **Augmentation, not rewrite.** Existing `#` headings, schema field listings, and citations must be preserved. The tool refuses writes that shrink `# Schema` field sets or reduce `# Citations` entry counts on `BigQuery Table` docs.
- **Reference minting.** Pages that define reusable entities, metrics, enums, or conventions may become `references/<slug>.md` docs when they pass topic-shape, citation, and reuse gates.
- **Structured extractions.** Metrics go to `references/metrics/<slug>.md`; join paths to `references/joins/<a>__<b>.md`. These bypass the reference gates described above.
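
The augmentation rule in the first bullet can be approximated by comparing the old and new documents before a write. The sketch below is hypothetical: it assumes schema fields appear as list items that start with a backticked name and that citations are list items, neither of which is confirmed by the source, and the error wording is only modeled on the `Refusing to write` message. It also ignores `# ` lines inside fenced code blocks.

```python
import re

# Assumed field format: a list item that begins with a backticked field name.
FIELD_RE = re.compile(r"^\s*[-*]\s+`([^`]+)`", re.MULTILINE)


def section(markdown: str, heading: str) -> str:
    """Return the body under a level-1 heading, up to the next level-1 heading."""
    body, inside = [], False
    for line in markdown.splitlines():
        if line.startswith("# "):
            inside = line[2:].strip() == heading
            continue
        if inside:
            body.append(line)
    return "\n".join(body)


def schema_fields(markdown: str) -> set:
    return set(FIELD_RE.findall(section(markdown, "Schema")))


def citation_count(markdown: str) -> int:
    lines = section(markdown, "Citations").splitlines()
    return sum(1 for line in lines if line.lstrip().startswith(("- ", "* ")))


def check_augmentation(old: str, new: str) -> None:
    """Raise ValueError if the new doc drops schema fields or citations."""
    missing = sorted(schema_fields(old) - schema_fields(new))
    if missing:
        raise ValueError(f"Refusing to write: missing {len(missing)} field(s): {missing}")
    if citation_count(new) < citation_count(old):
        raise ValueError("Refusing to write: fewer citation entries than the existing doc")


old_doc = "# Schema\n- `event_date`: STRING\n- `user_id`: STRING\n\n# Citations\n- BigQuery metadata\n"
new_doc = old_doc + "- https://support.google.com/analytics/answer/7029846\n"
check_augmentation(old_doc, new_doc)  # passes: nothing was removed
print("augmentation accepted")
```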

Web agent tools add `fetch_url` to the BQ tool set.

## Bundle output and indexes

After both passes, `regenerate_indexes()` writes or updates `index.md` at every directory level in the bundle. Each index groups child concepts by their `type` frontmatter field and lists each one as a link followed by its `description` one-liner.

Documents require frontmatter keys `type`, `title`, `description`, and `timestamp` (auto-filled when omitted). Recommended keys are `resource` and `tags`.
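
To make the grouping concrete, here is a rough sketch of how an index could be rendered from frontmatter. It is not `regenerate_indexes()` itself: the helper names, the flat `key: value` parser (a real implementation would use a YAML library), and the exact index layout are all assumptions.

```python
from collections import defaultdict


def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` frontmatter between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def render_index(docs: dict) -> str:
    """docs maps file name -> markdown text; returns index content grouped by type."""
    groups = defaultdict(list)
    for name, text in sorted(docs.items()):
        meta = parse_frontmatter(text)
        entry = f"- [{meta.get('title', name)}]({name}): {meta.get('description', '')}"
        groups[meta.get("type", "Unknown")].append(entry)
    out = []
    for type_name in sorted(groups):
        out.append(f"## {type_name}")
        out.extend(groups[type_name])
        out.append("")
    return "\n".join(out)


docs = {
    "events_.md": "---\ntype: BigQuery Table\ntitle: events_\ndescription: GA4 event rows\n---\nBody",
}
print(render_index(docs))
```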

<Check>
Verify a successful run by confirming concept markdown files exist under `datasets/` and `tables/`, optional `references/` content appears when seeds were used, and `index.md` files are present at the bundle root and in subdirectories.
</Check>
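
The same check can be scripted. The sketch below is one possible way to do it with `pathlib`; the function name `check_bundle` and the problem messages are hypothetical, and it tests only for the presence of files, not their contents.

```python
from pathlib import Path


def check_bundle(root: Path) -> list:
    """Return a list of layout problems; an empty list means the checks passed."""
    problems = []
    if not (root / "index.md").is_file():
        problems.append("missing index.md at bundle root")
    # Concept docs are expected under datasets/ and tables/.
    for sub in ("datasets", "tables"):
        folder = root / sub
        docs = [p for p in folder.glob("*.md") if p.name != "index.md"] if folder.is_dir() else []
        if not docs:
            problems.append(f"no concept docs under {sub}/")
    # Every subdirectory should carry its own index.md.
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        rel = folder.relative_to(root)
        if any(part.startswith(".") for part in rel.parts):
            continue  # skip hidden directories such as .git
        if not (folder / "index.md").is_file():
            problems.append(f"missing index.md in {rel.as_posix()}/")
    return problems
```

Calling `check_bundle(Path("bundles/ga4_merch_store"))` after a run returns an empty list when the layout is complete. The `references/` directory is not required by this check because it appears only when the web pass mints documents.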

## Version and iterate

Bundles are directories of plain files. Commit them to git for diff-based review, re-run with `--concept` to refine individual docs, or point `--out` at an existing bundle so `read_existing_doc` lets the agent refine rather than rewrite.

Pre-built sample bundles live under `okf/bundles/` (GA4, Stack Overflow, Bitcoin). Matching recipes with exact commands and seed files are under `okf/samples/`.

## Troubleshooting

| Symptom | Likely cause | Action |
|---------|--------------|--------|
| `--dataset is required for --source bq` | Missing dataset flag | Pass `--dataset project.dataset` |
| `dataset must be in 'project.dataset' form` | Malformed dataset id | Use two-part identifier |
| `Unknown concept(s): ...` | Invalid `--concept` id | Run without `--concept` first to see advertised ids via source listing |
| Web pass produces no references | Seeds too broad or budget exhausted | Add focused seed URLs; raise `--web-max-pages` or tighten `--web-allowed-path-prefix` |
| `Refusing to write: ... missing ... field(s)` | Web agent replaced schema | Re-run with augmentation-aware prompts; preserve existing `# Schema` |
| `max_pages reached` in logs | Crawl budget spent | Increase `--web-max-pages` or reduce seed scope |

<AccordionGroup>
<Accordion title="BQ-only enrichment">

Omit seeds or pass `--no-web`:

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset <project>.<dataset> \
    --no-web \
    --out ./bundles/<name>
```

</Accordion>

<Accordion title="Restrict crawl to documentation paths">

```bash
.venv/bin/python -m enrichment_agent enrich \
    --source bq \
    --dataset <project>.<dataset> \
    --web-seed-file seeds.txt \
    --web-allowed-path-prefix /docs/ \
    --web-denied-path-substring /login \
    --web-max-pages 50 \
    --out ./bundles/<name>
```

</Accordion>
</AccordionGroup>

## Next

<CardGroup>
<Card title="Open Knowledge Format" href="/open-knowledge-format">
OKF v0.1 bundle structure, frontmatter fields, index.md progressive disclosure, and cross-link semantics.
</Card>
<Card title="OKF bundle recipes" href="/okf-bundle-recipes">
Copy-paste commands for GA4, Stack Overflow, and Bitcoin public datasets with seed files and expected outputs.
</Card>
<Card title="Visualize OKF bundles" href="/visualize-okf-bundles">
Generate self-contained `viz.html` graph viewers from produced bundles.
</Card>
<Card title="OKF enrichment CLI reference" href="/okf-enrichment-cli-reference">
Full flag and environment variable reference for `enrich` and `visualize` subcommands.
</Card>
</CardGroup>
