# Extract and evolve knowledge

> Run `he parse` (single file, directory of `.md`/`.txt`, or stdin), choose templates interactively or by ID, control indexing with `--no-index`, append documents with `he feed`, and rebuild indexes with `he build-index`.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://www.grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://www.grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `hyperextract/cli/cli.py`
- `hyperextract/cli/utils.py`
- `hyperextract/types/base.py`
- `hyperextract/utils/template_engine/template.py`
- `hyperextract/cli/commands/list.py`
- `hyperextract/templates/presets/finance/earnings_summary.yaml`

---

Hyper-Extract creates and grows Knowledge Abstracts (KAs) through three CLI commands (`he parse`, `he feed`, and `he build-index`), backed by `Template.create`, `BaseAutoType.feed_text`, `dump`, `load`, and `build_index` in the Python SDK. Every command validates LLM and embedder configuration. `he parse` and `he feed` then resolve a YAML preset or method template, run structured LLM extraction, and write `data.json` and `metadata.json`; the `index/` directory is optional and is written only when an index is built.
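As a rough illustration of the on-disk artifacts named above, the following sketch classifies a directory by which of them are present. `ka_state` is a hypothetical helper, not part of Hyper-Extract; it assumes only the file names listed above, and it cannot tell whether an existing index still matches `data.json`.

```python
from pathlib import Path


def ka_state(ka_dir):
    """Classify a directory as 'invalid', 'unindexed', or 'indexed'.

    Hypothetical helper for illustration; not part of hyperextract.
    """
    root = Path(ka_dir)
    # data.json and metadata.json are both expected in a usable KA.
    if not (root / "data.json").is_file() or not (root / "metadata.json").is_file():
        return "invalid"
    index_dir = root / "index"
    # Count index/ as built only when it exists and holds at least one entry.
    if index_dir.is_dir() and any(index_dir.iterdir()):
        return "indexed"
    return "unindexed"


if __name__ == "__main__":
    # The path is an example; point it at any KA output directory.
    print(ka_state("./output/"))
```

A check like this can be useful in scripts that decide whether an index build is still needed before querying.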
## Lifecycle overview

```mermaid
stateDiagram-v2
    [*] --> Empty: he parse -o ./ka/
    Empty --> Indexed: build_index (default)
    Empty --> Unindexed: --no-index
    Indexed --> StaleIndex: he feed
    Unindexed --> StaleIndex: he feed
    StaleIndex --> Indexed: he build-index
    Indexed --> Indexed: he build-index --force
    Unindexed --> Indexed: he build-index
```

| Phase | Command | Data change | Index state |
|-------|---------|-------------|-------------|
| Create | `he parse` | New `data.json` | Built by default; skipped with `--no-index` |
| Append | `he feed` | Merges into existing `data.json` | Cleared in memory; rebuild required for search/chat |
| Reindex | `he build-index` | No data change | Rebuilt from current `data.json` |

`he feed` does not call `build_index`. After feeding, run `he build-index` before `he search` or `he talk`.

## Prerequisites

LLM and embedder clients must be configured before any extraction command runs. `validate_config()` checks `~/.he/config.toml` and environment fallbacks (`OPENAI_API_KEY`, `OPENAI_BASE_URL`) on every `parse`, `feed`, and `build-index` invocation.

Run `he config init`, `he config llm`, and `he config embedder` before your first extraction. Run `he list template` to discover preset IDs such as `finance/earnings_summary` or `general/biography_graph`.

## Create a Knowledge Abstract with `he parse`

`he parse` reads input, instantiates a template via `Template.create`, extracts structured knowledge with `feed_text`, saves the KA with `dump`, and optionally builds a vector index.

### Input sources

| Input | Behavior |
|-------|----------|
| File path | Single UTF-8 file read via `read_input` |
| Directory | All `*.md` and `*.txt` files discovered by glob, concatenated with `\n\n` |
| `-` (stdin) | Full stdin buffer (`cat doc.md \| he parse - ...`) |

Directory mode exits with code 1 when no `.md` or `.txt` files are found. Only those extensions are processed.

### Template selection

Templates resolve in three ways:

1. **Preset ID**: `-t finance/earnings_summary` loads a bundled YAML preset.
2. **Method shorthand**: `-m light_rag` maps to `method/light_rag` (English-only prompts; `--lang` is ignored).
3. **Interactive**: omit both `-t` and `-m` to trigger `select_template_interactive()`, which lists all presets from `Gallery.list()` and accepts a number or keyword search.

Knowledge templates require `--lang en` or `--lang zh`. Method templates always use `lang = "en"`.

```bash
# Preset template with explicit language
he parse earnings_call.md -t finance/earnings_summary -o ./finance_kb/ -l en

# Interactive selection (omit -t and -m)
he parse document.md -o ./output/ -l en

# Extraction method (no -l required)
he parse document.md -m light_rag -o ./output/

# Directory of markdown files
he parse ./corpus/ -t general/concept_graph -o ./corpus_kb/ -l en

# Stdin
cat notes.md | he parse - -t general/biography_graph -o ./bio_kb/ -l en
```

### Flags

| Flag | Description |
|------|-------------|
| `-o` | Output directory for the new KA. Created with `mkdir(parents=True, exist_ok=True)`. |
| `-t` | Preset template ID (e.g., `general/biography_graph`, `finance/earnings_summary`). Omit for interactive selection. |
| `-m` | Method name (e.g., `light_rag`, `hyper_rag`). Sets template to `method/{name}`. |
| `-l`, `--lang` | Language code (`en` or `zh`). Required for knowledge templates; ignored for method templates. |
| `-f` | Overwrite a non-empty output directory. Without `-f`, a populated directory causes exit code 1. |
| `--no-index` | Skip `build_index` after extraction. Use for batch workflows; rebuild later with `he build-index`. |

### Parse pipeline

`validate_config()` ensures LLM and embedder API keys (or vLLM `base_url`) are present.
`Template.get(template)` validates the preset; `Template.create(template, lang)` builds the AutoType instance.
`feed_text(text)` chunks input (default 2048 chars, 256 overlap), runs structured LLM extraction, and merges results per AutoType strategy.
`dump(output_path)` writes `data.json`, `metadata.json` (template, lang, timestamps), and optionally `index/`.
Unless `--no-index` is passed, `build_index()` runs, then `dump` saves the FAISS index under `index/`.

### Output layout

:::files
./output/
├── data.json       # Structured extraction (entities, relations, model fields, etc.)
├── metadata.json   # template, lang, created_at, updated_at
└── index/          # FAISS vector store (when index is built)
    ├── index.faiss
    └── docstore.json
:::

`metadata.json` records the template ID and language so later commands (`he feed`, `he build-index`, `he search`) can reload the correct AutoType without re-specifying flags.

## Append documents with `he feed`

`he feed` loads an existing KA, extracts from new input, merges incrementally, and saves updated `data.json` and `metadata.json`.

```bash
# Initial extraction
he parse tesla_bio.md -t general/biography_graph -o ./tesla_kb/ -l en

# Append a second document (template/lang read from metadata.json)
he feed ./tesla_kb/ tesla_inventions.md

# Append from stdin
cat update.md | he feed ./tesla_kb/ -
```

### Merge behavior

`feed_text` calls `_update_data_state`, which merges incoming extraction into the current AutoType and calls `clear_index()`. Merge semantics depend on the template's AutoType:

| AutoType | Incremental merge |
|----------|-------------------|
| `model` | Field-level merge; first extraction wins for populated fields |
| `graph` / `hypergraph` | Nodes and edges added via memory-layer deduplication |
| `list` / `set` | Items appended or deduplicated per identifier rules |

Override template or language only when necessary:

```bash
he feed ./ka/ doc.md -t general/biography_graph -l en
```

When `--template` and `--lang` are omitted, their values come from `metadata.json`. If a key is missing there, `template` falls back to `general/graph` and `lang` to `zh`.

Verify growth with `he info ./ka/`: node/edge counts and `updated_at` should change after a successful feed.
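The fallback order described above for `--template` and `--lang` can be sketched as a small pure function. This is only an illustration under the assumption that explicit flags take precedence over `metadata.json`, which in turn takes precedence over the built-in defaults; `resolve_feed_settings` is a hypothetical name and is not the CLI's actual implementation.

```python
# Fallbacks stated in the text for keys missing from metadata.json.
DEFAULT_TEMPLATE = "general/graph"
DEFAULT_LANG = "zh"


def resolve_feed_settings(metadata, template=None, lang=None):
    """Return (template, lang) for a feed run.

    Hypothetical helper: `metadata` is the parsed content of metadata.json,
    `template` and `lang` stand in for the optional CLI overrides.
    """
    # Explicit overrides win, then metadata.json, then the defaults.
    resolved_template = template or metadata.get("template", DEFAULT_TEMPLATE)
    resolved_lang = lang or metadata.get("lang", DEFAULT_LANG)
    return resolved_template, resolved_lang


if __name__ == "__main__":
    stored = {"template": "general/biography_graph", "lang": "en"}
    print(resolve_feed_settings(stored))             # ('general/biography_graph', 'en')
    print(resolve_feed_settings({}))                 # ('general/graph', 'zh')
    print(resolve_feed_settings(stored, lang="zh"))  # ('general/biography_graph', 'zh')
```

One consequence worth noting: a KA whose `metadata.json` lacks a `lang` key would be fed with `zh`, so passing `-l` explicitly is the safer choice for such directories.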
## Rebuild indexes with `he build-index`

`he build-index` loads the KA, optionally clears the existing index with `--force`, embeds all indexable items, and persists FAISS files to `index/`.

```bash
# Build index for a KA parsed with --no-index
he build-index ./output/

# Force rebuild after feeding or manual data.json edits
he build-index ./output/ -f
```

| Condition | Behavior |
|-----------|----------|
| Index exists, no `--force` | Prints warning and exits 0 without rebuilding |
| Index missing or `--force` | Clears index (`clear_index`), runs `build_index`, saves via `dump` |
| `data.json` missing | Exit 1 via `validate_ka_with_data` |

`he search` and `he talk` require a non-empty `index/` directory (`validate_ka_with_index`).

## Batch workflow pattern

For multiple documents, defer indexing until all content is merged:

```bash title="CLI batch"
# Parse first doc without index
he parse doc1.md -t general/biography_graph -o ./ka/ -l en --no-index

# Append remaining docs
he feed ./ka/ doc2.md
he feed ./ka/ doc3.md

# Single index build
he build-index ./ka/

# Query
he search ./ka/ "key concept"
he talk ./ka/ -q "Summarize all documents"
```

```python title="Python batch"
from hyperextract import Template

# doc1_text, doc2_text, and doc3_text are document strings loaded by the caller
ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(doc1_text)
ka.feed_text(doc2_text)
ka.feed_text(doc3_text)
ka.dump("./finance_kb/")

ka.build_index()
ka.dump("./finance_kb/")  # persist index
```

## Python API equivalent

The CLI commands map directly to `BaseAutoType` lifecycle methods:

| CLI | Python |
|-----|--------|
| `he parse` (new KA) | `Template.create(...)` → `feed_text(text)` → `dump(path)` → `build_index()` → `dump(path)` |
| `he feed` | `Template.create(...)` → `load(path)` → `feed_text(text)` → `dump(path)` |
| `he build-index` | `Template.create(...)` → `load(path)` → `build_index()` → `dump(path)` |
| Preview without mutation | `parse(text)` returns a new instance |

```python
from hyperextract import Template

# earnings_transcript and q4_update are document strings loaded by the caller

# Create and extract (equivalent to he parse)
ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(earnings_transcript)
ka.dump("./finance_kb/")
ka.build_index()
ka.dump("./finance_kb/")

# Evolve (equivalent to he feed followed by he build-index)
ka.load("./finance_kb/")
ka.feed_text(q4_update)
ka.dump("./finance_kb/")
ka.build_index()
ka.dump("./finance_kb/")
```

`Template.create` reads LLM and embedder from global config when clients are not passed explicitly. Method templates accept extra kwargs (for example `observation_time` for temporal extractors).

## Error handling

| Error | Cause | Resolution |
|-------|-------|------------|
| `LLM API key is not configured` | Missing config before extraction | Run `he config init` or set `OPENAI_API_KEY` |
| `--lang is required for knowledge templates` | `-l` omitted on a preset template | Add `--lang en` or `--lang zh` |
| `Output directory already exists and is not empty` | Re-parse to same path | Use `-f` or choose a new `-o` path |
| `Template '...' not found` | Invalid `-t` or `-m` value | Run `he list template` or `he list method` |
| `No .txt or .md files found` | Empty or unsupported directory | Add `.md`/`.txt` files or pass a single file |
| `Index not found` on search/talk | Fed KA without rebuild | Run `he build-index ./ka/` |
| `Not a valid Knowledge Abstract directory` | Missing `metadata.json` on feed | Ensure directory was created by `he parse` |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` to trace extraction stages (`feed_text_invoked`, `knowledge_extracted`, `index_built`).

## Choosing a template

Use `he list template` to browse presets by domain, AutoType, and language. The example preset `finance/earnings_summary` is an `AutoModel` template with fields such as `company_name`, `quarter`, `reported_revenue`, and `overall_tone`, which suits earnings call transcripts in English or Chinese.
```bash
he parse transcript.md -t finance/earnings_summary -o ./earnings_kb/ -l en
he parse bio.md -t general/biography_graph -o ./bio_kb/ -l en
```

```bash
he parse paper.md -m hyper_rag -o ./paper_kb/
he list method -q light
```

- Compare YAML domain presets and algorithm-driven method templates, including language requirements.
- Deep dive into `data.json`, `metadata.json`, and `index/` layout and lifecycle methods.

## Related pages

- First successful extraction from install through `he search` and `he show`.
- Query and explore KAs after indexing with `he search`, `he talk`, and `he show`.
- Full `he` command surface, flags, defaults, and exit conditions.
- `Template.create`, `feed_text`, `dump`, `load`, and `build_index` signatures.
- Common failure modes for parse, feed, and index operations.