# Produce OKF bundles

> Run the OKF enrichment agent against a BigQuery source with optional web crawl seeds, concept scoping, and two-pass BQ-then-web enrichment into a versionable bundle directory.

- Repository: GoogleCloudPlatform/knowledge-catalog
- GitHub: https://github.com/GoogleCloudPlatform/knowledge-catalog
- Human docs: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5
- Complete Markdown: https://www.grok-wiki.com/public/docs/googlecloudplatform-knowledge-catalog-9cee6ee3cba5/llms-full.txt

## Source Files

- `okf/README.md`
- `okf/src/enrichment_agent/cli.py`
- `okf/src/enrichment_agent/runner.py`
- `okf/src/enrichment_agent/agent.py`
- `okf/src/enrichment_agent/sources/bigquery.py`
- `okf/src/enrichment_agent/prompts/enrichment_instruction.md`
- `okf/src/enrichment_agent/prompts/web_ingestion_instruction.md`

---

The `enrichment-agent` CLI (`python -m enrichment_agent enrich`) reads BigQuery dataset metadata through a pluggable `Source` interface, runs a Google ADK agent per concept to emit OKF markdown documents, optionally augments those docs from seeded web pages, and regenerates `index.md` files across the output bundle directory.

## What you produce

An OKF bundle is a directory of markdown files with YAML frontmatter. Each BigQuery concept becomes one document; the web pass may add `references/` docs and augment existing primary concepts. The bundle is plain files, suitable for git, static hosting, or downstream agent consumption.

:::files
bundles//
├── index.md                # Auto-generated directory index
├── datasets/
│   ├── index.md
│   └── .md
├── tables/
│   ├── index.md
│   └── .md                 # Sharded families use prefix (e.g. events_.md)
└── references/             # Optional, from web pass
    ├── metrics/
    ├── joins/
    └── .md
:::

OKF bundles are vendor-neutral. The enrichment agent is one producer; the format itself is defined in the OKF specification and is not tied to a model provider or serving system.

## Two-pass enrichment

Enrichment runs in two sequential passes orchestrated by `EnrichmentRunner.enrich_all()`:

| Pass | Agent | Input | Output |
|------|-------|-------|--------|
| BQ pass | `okf_bq_enrichment_agent` | BigQuery metadata per concept | One OKF doc per advertised concept |
| Web pass | `okf_web_ingestion_agent` | Seed URLs and crawl constraints | Augmented primary docs and optional `references/` docs |

```mermaid
sequenceDiagram
    participant CLI as enrichment_agent CLI
    participant Runner as EnrichmentRunner
    participant BQ as BigQuerySource
    participant BQAgent as okf_bq_enrichment_agent
    participant WebAgent as okf_web_ingestion_agent
    participant Bundle as bundle_root/
    CLI->>Runner: enrich_all(only?)
    Runner->>BQ: list_concepts()
    loop Each concept
        Runner->>BQAgent: enrich_concept(ref)
        BQAgent->>Bundle: write_concept_doc
    end
    opt web_seeds provided
        Runner->>WebAgent: run_web_pass()
        WebAgent->>Bundle: augment / mint references
    end
    Runner->>Bundle: regenerate_indexes()
```

**BQ pass.** For each `ConceptRef` from the source, the agent calls `read_concept_raw`, optionally `sample_rows`, and writes exactly one document via `write_concept_doc`. Documents include prose, `# Schema`, `# Common query patterns`, and `# Citations`.

**Web pass.** When seed URLs are provided, a separate agent crawls outward from seeds using `fetch_url`. For each fetched page it enriches existing concepts, mints `references/` docs, or skips. Hard limits are enforced inside the tool, not by prompt alone. Skip the web pass with `--no-web`, or omit seeds entirely.
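
To make "enforced inside the tool" concrete, here is a minimal sketch of a guard that a fetch tool could run before any network request. It is not the project's implementation: `CrawlGate`, `check`, and `record_links` are hypothetical names, and every error string except `"max_pages reached"` is invented for illustration. The default budget of 100 pages and depth of 2 mirror the documented flag defaults.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlGate:
    """Hypothetical guard evaluated before a page is fetched."""

    seeds: list[str]
    max_pages: int = 100
    max_depth: int = 2
    fetched: int = 0
    depth_of: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: allowed hosts default to the netloc of each seed.
        self.allowed_hosts = {urlsplit(seed).netloc for seed in self.seeds}
        for seed in self.seeds:
            self.depth_of[seed] = 0  # Seeds sit at depth 0.

    def check(self, url: str) -> dict:
        """Return an error dict for rejected URLs, else consume one page of budget."""
        if self.fetched >= self.max_pages:
            return {"error": "max_pages reached"}
        if urlsplit(url).netloc not in self.allowed_hosts:
            return {"error": "host not allowed"}
        depth = self.depth_of.get(url)
        if depth is None:
            # Never seen as a seed or as a link on a fetched page.
            return {"error": "not reachable from seeds"}
        if depth > self.max_depth:
            return {"error": "beyond max depth"}
        self.fetched += 1
        return {"ok": True, "depth": depth}

    def record_links(self, parent: str, links: list[str]) -> None:
        """Register links found on a fetched page one hop below their parent."""
        parent_depth = self.depth_of[parent]
        for link in links:
            self.depth_of.setdefault(link, parent_depth + 1)
```

Because the checks live in the tool, an agent that ignores its instructions still cannot exceed the budget: every rejected call returns an error dict instead of page content.
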

## Prerequisites

From the `okf/` directory, create a virtual environment and install the package with its dev extras:

```bash
python3.13 -m venv .venv
.venv/bin/pip install --index-url https://pypi.org/simple/ -e ".[dev]"
```

The CLI entry point is `enrichment-agent`; the module form `python -m enrichment_agent` is equivalent.

Authenticate with Application Default Credentials and set the default project, supplying your project id as the argument to the second command:

```bash
gcloud auth application-default login
gcloud config set project 
```

Public datasets are readable, but query bytes bill against the caller's project. Override billing with `--billing-project`.

Configure model access with one of the following:

- Gemini API: set `GEMINI_API_KEY`.
- Vertex AI: export the variables below, filling in your project and location.

```bash
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=
export GOOGLE_CLOUD_LOCATION=
```

The default model is `gemini-flash-latest` (override with `--model`).

## Run enrichment

```bash
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
  --web-seed-file samples/ga4_merch_store/seeds.txt \
  --out ./bundles/ga4_merch_store
```

```text
Enriched 12 concept(s) into bundles/ga4_merch_store; web pass used 3 seed(s)
```

### Required flags

- `--source`: Source adapter. Currently only `bq` (BigQuery).
- `--dataset`: BigQuery dataset in `project.dataset` form (for example `bigquery-public-data.ga4_obfuscated_sample_ecommerce`).
- `--out`: Bundle root directory. Created if missing.

### Concept scoping

- `--concept`: Enrich only the given concept id. Repeatable. The format is slash-separated segments matching the source's concept ids, for example `tables/events_` or `datasets/ga4_obfuscated_sample_ecommerce`.

Use concept scoping to iterate on a single table without re-running the full dataset:

```bash
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce \
  --web-seed-file samples/ga4_merch_store/seeds.txt \
  --out ./bundles/ga4_merch_store \
  --concept tables/events_
```

Unknown concept ids raise `ValueError` before enrichment starts.
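
The fail-fast behavior described above can be sketched as a small filter over the ids the source advertises. This is an illustration, not the runner's code: `select_concepts` and its signature are hypothetical, and only the `Unknown concept(s): ...` message shape is taken from the troubleshooting table.

```python
from typing import Iterable, Sequence


def select_concepts(
    advertised: Sequence[str], requested: Iterable[str] = ()
) -> list[str]:
    """Filter advertised concept ids down to the requested ones.

    Hypothetical helper: the name and signature are illustrative only.
    """
    requested = list(requested)
    if not requested:
        return list(advertised)  # No scoping: enrich everything.

    known = set(advertised)
    unknown = [cid for cid in requested if cid not in known]
    if unknown:
        # Validate up front so no enrichment work runs for a bad id.
        raise ValueError("Unknown concept(s): " + ", ".join(unknown))

    wanted = set(requested)
    # Preserve the order in which the source advertised the concepts.
    return [cid for cid in advertised if cid in wanted]
```

Validating the whole request before the first agent call means a typo in one `--concept` value costs nothing in model or BigQuery usage.
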

### Web crawl configuration

| Flag | Default | Purpose |
|------|---------|---------|
| `--web-seed` | none | Single seed URL; repeatable |
| `--web-seed-file` | none | File with one URL per line (`#` comments allowed); repeatable |
| `--no-web` | `false` | Skip web pass entirely |
| `--web-max-pages` | `100` | Hard cap on pages fetched per run |
| `--web-max-depth` | `2` | Max hop distance from any seed (seeds are depth 0) |
| `--web-allowed-host` | seed hosts only | Extra hostnames the crawler may fetch; repeatable |
| `--web-allowed-path-prefix` | no restriction | Only fetch URLs whose path starts with one of these prefixes; repeatable |
| `--web-denied-path-substring` | none | Reject URLs whose path contains these substrings; repeatable |

Seed files may contain `#` comment lines:

```text
# GA4 BigQuery Export - schema reference
https://support.google.com/analytics/answer/7029846
```

Allowed hosts default to the netloc of each seed URL. The `fetch_url` tool rejects a URL when its host is not allowed, the page budget is spent, it lies beyond the max depth, its path contains a denied substring, or it is not reachable from the seed link graph.

When `fetch_url` returns `"max_pages reached"` or an `error` field, treat it as final. Do not retry rejected URLs in the same run.

### Other flags

- `--billing-project`: Google Cloud project billed for BigQuery queries. Defaults to the Application Default Credentials default project.
- `--model`: Gemini model id. Default: `gemini-flash-latest`.
- Debug logging: enables debug-level logging for enrichment agent events.

## BigQuery concepts

`BigQuerySource` advertises one concept per dataset plus one per table. Sharded tables matching `prefix_######` (6–8 digit suffix) collapse into a single wildcard concept at `tables/` with a representative shard for schema sampling.

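
The shard-collapsing rule can be sketched with a regular expression that splits a table name into a prefix ending in `_` and a 6 to 8 digit suffix. This is a sketch under assumptions, not the adapter's code: `collapse_shards` is a hypothetical name, the concept id is assumed to keep the trailing underscore (as in `tables/events_`), and picking the lexicographically greatest shard as the representative is a guess.

```python
import re
from collections import defaultdict

# A prefix ending in "_" followed by a 6-8 digit suffix, e.g. events_20210131.
SHARD_RE = re.compile(r"^(?P<prefix>.+_)(?P<suffix>\d{6,8})$")


def collapse_shards(table_names: list[str]) -> dict[str, str]:
    """Map concept id -> table used for schema sampling (hypothetical helper)."""
    families: dict[str, list[str]] = defaultdict(list)
    concepts: dict[str, str] = {}

    for name in table_names:
        match = SHARD_RE.match(name)
        if match:
            families[match.group("prefix")].append(name)
        else:
            concepts[f"tables/{name}"] = name  # Ordinary table: one concept.

    for prefix, shards in families.items():
        # Assumption: the greatest suffix (latest date) represents the family.
        concepts[f"tables/{prefix}"] = max(shards)

    return concepts
```

Given `events_20210130`, `events_20210131`, and `users`, the sketch yields two concepts: `tables/events_` sampled from `events_20210131`, and `tables/users`.
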

Concept ids map to filesystem paths:

| Concept id | Document path |
|------------|---------------|
| `datasets/` | `datasets/.md` |
| `tables/` | `tables/.md` |
| `references/` | `references/.md` |

The BQ agent tools are `list_concepts`, `read_concept_raw`, `sample_rows`, `read_existing_doc`, and `write_concept_doc`.

## Web pass behavior

The web agent augments BQ-produced docs under strict rules:

- **Augmentation, not rewrite.** Existing `#` headings, schema field listings, and citations must be preserved. The tool refuses writes that shrink `# Schema` field sets or reduce `# Citations` entry counts on `BigQuery Table` docs.
- **Reference minting.** Pages that define reusable entities, metrics, enums, or conventions may become `references/.md` docs when they pass topic-shape, citation, and reuse gates.
- **Structured extractions.** Metrics go to `references/metrics/.md`; join paths to `references/joins/__.md`. These bypass the four-gate reference test.

Web agent tools add `fetch_url` to the BQ tool set.

## Bundle output and indexes

After both passes, `regenerate_indexes()` writes or updates `index.md` at every directory level in the bundle. Each index groups child concepts by `type` frontmatter field and links to their `description` one-liner.

Documents require frontmatter keys `type`, `title`, `description`, and `timestamp` (auto-filled when omitted). Recommended keys are `resource` and `tags`.

Verify a successful run by confirming concept markdown files exist under `datasets/` and `tables/`, optional `references/` content appears when seeds were used, and `index.md` files are present at the bundle root and in subdirectories.

## Version and iterate

Bundles are directories of plain files. Commit them to git for diff-based review, re-run with `--concept` to refine individual docs, or point `--out` at an existing bundle so `read_existing_doc` lets the agent refine rather than rewrite.

Pre-built sample bundles live under `okf/bundles/` (GA4, Stack Overflow, Bitcoin).
Matching recipes with exact commands and seed files are under `okf/samples/`.

## Troubleshooting

| Symptom | Likely cause | Action |
|---------|--------------|--------|
| `--dataset is required for --source bq` | Missing dataset flag | Pass `--dataset project.dataset` |
| `dataset must be in 'project.dataset' form` | Malformed dataset id | Use two-part identifier |
| `Unknown concept(s): ...` | Invalid `--concept` id | Run without `--concept` first to see advertised ids via source listing |
| Web pass produces no references | Seeds too broad or budget exhausted | Add focused seed URLs; raise `--web-max-pages` or tighten `--web-allowed-path-prefix` |
| `Refusing to write: ... missing ... field(s)` | Web agent replaced schema | Re-run with augmentation-aware prompts; preserve existing `# Schema` |
| `max_pages reached` in logs | Crawl budget spent | Increase `--web-max-pages` or reduce seed scope |

To run the BQ pass only, omit seeds or pass `--no-web`, substituting your own `project.dataset` and bundle name:

```bash
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset . \
  --no-web \
  --out ./bundles/
```

To constrain the crawl, combine a seed file with path filters and a lower page budget:

```bash
.venv/bin/python -m enrichment_agent enrich \
  --source bq \
  --dataset . \
  --web-seed-file seeds.txt \
  --web-allowed-path-prefix /docs/ \
  --web-denied-path-substring /login \
  --web-max-pages 50 \
  --out ./bundles/
```

## Next

- OKF v0.1 bundle structure, frontmatter fields, index.md progressive disclosure, and cross-link semantics.
- Copy-paste commands for GA4, Stack Overflow, and Bitcoin public datasets with seed files and expected outputs.
- Generate self-contained `viz.html` graph viewers from produced bundles.
- Full flag and environment variable reference for `enrich` and `visualize` subcommands.