# Extract and evolve knowledge

> Run `he parse` (single file, directory of `.md`/`.txt`, or stdin), choose templates interactively or by ID, control indexing with `--no-index`, append documents with `he feed`, and rebuild indexes with `he build-index`.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://www.grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://www.grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `hyperextract/cli/cli.py`
- `hyperextract/cli/utils.py`
- `hyperextract/types/base.py`
- `hyperextract/utils/template_engine/template.py`
- `hyperextract/cli/commands/list.py`
- `hyperextract/templates/presets/finance/earnings_summary.yaml`

---

Hyper-Extract creates and grows Knowledge Abstracts (KAs) through three CLI commands: `he parse`, `he feed`, and `he build-index`. They are backed by `Template.create`, `BaseAutoType.feed_text`, `dump`, `load`, and `build_index` in the Python SDK. Every command validates LLM and embedder configuration and resolves a YAML preset or method template. `he parse` and `he feed` then run structured LLM extraction and write `data.json` and `metadata.json`; `he parse` (unless `--no-index` is set) and `he build-index` also write the `index/` directory.

## Lifecycle overview

```mermaid
stateDiagram-v2
    [*] --> Empty: he parse -o ./ka/
    Empty --> Indexed: build_index (default)
    Empty --> Unindexed: --no-index
    Indexed --> StaleIndex: he feed
    Unindexed --> StaleIndex: he feed
    StaleIndex --> Indexed: he build-index
    Indexed --> Indexed: he build-index --force
    Unindexed --> Indexed: he build-index
```

| Phase | Command | Data change | Index state |
|-------|---------|-------------|-------------|
| Create | `he parse` | New `data.json` | Built by default; skipped with `--no-index` |
| Append | `he feed` | Merges into existing `data.json` | Cleared in memory; rebuild required for search/chat |
| Reindex | `he build-index` | No data change | Rebuilt from current `data.json` |

<Note>
`he feed` does not call `build_index`. After feeding, run `he build-index` before `he search` or `he talk`.
</Note>
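
The diagram and table above can be read as a small state machine. The sketch below transcribes the diagram's transitions into a lookup table. The names `TRANSITIONS`, `next_state`, and `run` are hypothetical helpers rather than part of Hyper-Extract, and the `StaleIndex` self-loop for repeated feeds is an assumption, since the diagram does not show it.

```python
# Index-state transitions transcribed from the lifecycle diagram.
# Hypothetical helper for illustration; not part of hyperextract.
TRANSITIONS = {
    ("Empty", "parse"): "Indexed",
    ("Empty", "parse --no-index"): "Unindexed",
    ("Indexed", "feed"): "StaleIndex",
    ("Unindexed", "feed"): "StaleIndex",
    ("StaleIndex", "feed"): "StaleIndex",  # assumption: repeated feeds stay stale
    ("StaleIndex", "build-index"): "Indexed",
    ("Unindexed", "build-index"): "Indexed",
    ("Indexed", "build-index --force"): "Indexed",
}


def next_state(state: str, event: str) -> str:
    """Return the index state after `event`, or raise for an undocumented move."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no documented transition from {state!r} on {event!r}") from None


def run(events: list, state: str = "Empty") -> str:
    """Apply a sequence of events starting from `state`."""
    for event in events:
        state = next_state(state, event)
    return state


# Batch workflow: parse without index, feed twice, then index once.
final = run(["parse --no-index", "feed", "feed", "build-index"])
print(final)
```

Walking the batch workflow through the table ends in `Indexed`, which is the state `he search` and `he talk` need.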

## Prerequisites

LLM and embedder clients must be configured before any extraction command runs. `validate_config()` checks `~/.he/config.toml` and environment fallbacks (`OPENAI_API_KEY`, `OPENAI_BASE_URL`) on every `parse`, `feed`, and `build-index` invocation.

<CardGroup>
  <Card title="Configure providers" href="/configure-providers">
    Set up `he config init`, `he config llm`, and `he config embedder` before your first extraction.
  </Card>
  <Card title="List templates" href="/cli-reference">
    Run `he list template` to discover preset IDs such as `finance/earnings_summary` or `general/biography_graph`.
  </Card>
</CardGroup>

## Create a Knowledge Abstract with `he parse`

`he parse` reads input, instantiates a template via `Template.create`, extracts structured knowledge with `feed_text`, saves the KA with `dump`, and optionally builds a vector index.

### Input sources

| Input | Behavior |
|-------|----------|
| File path | Single UTF-8 file read via `read_input` |
| Directory | All `*.md` and `*.txt` files discovered by glob, concatenated with `\n\n` |
| `-` (stdin) | Full stdin buffer (`cat doc.md \| he parse - ...`) |

<Warning>
Directory mode exits with code 1 when no `.md` or `.txt` files are found. Files with any other extension are skipped.
</Warning>
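
The three input modes can be approximated in a few lines of Python. This is a minimal sketch, not the library's `read_input`: the function name `read_input_sketch` is hypothetical, and the sorted, non-recursive file order is an assumption, since the page only states that files are found by glob and joined with `\n\n`.

```python
import sys
import tempfile
from pathlib import Path


def read_input_sketch(source: str) -> str:
    """Return the text for a file path, a directory, or '-' (stdin)."""
    if source == "-":
        return sys.stdin.read()

    path = Path(source)
    if path.is_dir():
        # Assumption: non-recursive glob, joined in sorted order
        files = sorted([*path.glob("*.md"), *path.glob("*.txt")])
        if not files:
            # SystemExit with a message exits with status 1
            raise SystemExit("No .txt or .md files found")
        return "\n\n".join(p.read_text(encoding="utf-8") for p in files)

    return path.read_text(encoding="utf-8")


# Demo: only the .md and .txt files contribute to the combined text.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.md").write_text("alpha", encoding="utf-8")
    Path(tmp, "b.txt").write_text("beta", encoding="utf-8")
    Path(tmp, "c.pdf").write_text("ignored", encoding="utf-8")
    combined = read_input_sketch(tmp)

print(repr(combined))
```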

### Template selection

Templates resolve in three ways:

1. **Preset ID** — `-t finance/earnings_summary` loads a bundled YAML preset.
2. **Method shorthand** — `-m light_rag` maps to `method/light_rag` (English-only prompts; `--lang` is ignored).
3. **Interactive** — Omit `-t` and `-m` to trigger `select_template_interactive()`, which lists all presets from `Gallery.list()` and accepts a number or keyword search.

Knowledge templates require `--lang en` or `--lang zh`. Method templates always use `lang = "en"`.

```bash
# Preset template with explicit language
he parse earnings_call.md -t finance/earnings_summary -o ./finance_kb/ -l en

# Interactive selection (omit -t)
he parse document.md -o ./output/ -l en

# Extraction method (no -l required)
he parse document.md -m light_rag -o ./output/

# Directory of markdown files
he parse ./corpus/ -t general/concept_graph -o ./corpus_kb/ -l en

# Stdin
cat notes.md | he parse - -t general/biography_graph -o ./bio_kb/ -l en
```
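
The resolution rules above can be sketched as a small function. `resolve_template` is a hypothetical name, not the CLI's own code. Giving `-m` precedence when both flags are present is an assumption, and interactive selection is left out.

```python
from typing import Optional, Tuple


def resolve_template(
    template: Optional[str], method: Optional[str], lang: Optional[str]
) -> Tuple[str, str]:
    """Map -t / -m / -l values to a (template_id, lang) pair."""
    if method is not None:
        # Method templates are English-only, so any --lang value is ignored
        return f"method/{method}", "en"
    if template is None:
        # The real CLI falls back to interactive selection here
        raise ValueError("no template given; interactive selection is not sketched")
    if lang not in ("en", "zh"):
        raise ValueError("--lang is required for knowledge templates")
    return template, lang


print(resolve_template("finance/earnings_summary", None, "en"))
print(resolve_template(None, "light_rag", "zh"))
```

The second call shows the method shorthand overriding a supplied language, matching the note that `--lang` is ignored for method templates.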

### Flags

<ParamField body="--output / -o" type="string" required>
Output directory for the new KA. Created with `mkdir(parents=True, exist_ok=True)`.
</ParamField>

<ParamField body="--template / -t" type="string">
Preset template ID (e.g., `general/biography_graph`, `finance/earnings_summary`). Omit for interactive selection.
</ParamField>

<ParamField body="--method / -m" type="string">
Method name (e.g., `light_rag`, `hyper_rag`). Sets template to `method/{name}`.
</ParamField>

<ParamField body="--lang / -l" type="string">
Language code (`en` or `zh`). Required for knowledge templates; ignored for method templates.
</ParamField>

<ParamField body="--force / -f" type="boolean" default="false">
Overwrite a non-empty output directory. Without `-f`, a populated directory causes exit code 1.
</ParamField>

<ParamField body="--no-index" type="boolean" default="false">
Skip `build_index` after extraction. Use for batch workflows; rebuild later with `he build-index`.
</ParamField>
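
The interaction between `--output` and `--force` amounts to a guard before any extraction runs. A minimal sketch, assuming a hypothetical helper named `prepare_output_dir`; whether the real CLI clears existing files when `-f` is passed is not stated here, so the sketch leaves them in place.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(output: str, force: bool = False) -> Path:
    """Create the output directory, refusing a populated one unless forced."""
    path = Path(output)
    if path.is_dir() and any(path.iterdir()) and not force:
        raise FileExistsError("Output directory already exists and is not empty")
    # Same call the flag description mentions
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output_dir(f"{tmp}/ka/nested")
    created = target.is_dir()
    (target / "data.json").write_text("{}", encoding="utf-8")
    try:
        prepare_output_dir(str(target))
        blocked = False
    except FileExistsError:
        blocked = True
    forced = prepare_output_dir(str(target), force=True).is_dir()

print(created, blocked, forced)
```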

### Parse pipeline

<Steps>
  <Step title="Validate configuration">
    `validate_config()` ensures LLM and embedder API keys (or vLLM `base_url`) are present.
  </Step>
  <Step title="Resolve template">
    `Template.get(template)` validates the preset; `Template.create(template, lang)` builds the AutoType instance.
  </Step>
  <Step title="Extract knowledge">
    `feed_text(text)` chunks input (default 2048 chars, 256 overlap), runs structured LLM extraction, and merges results per AutoType strategy.
  </Step>
  <Step title="Persist KA">
    `dump(output_path)` writes `data.json`, `metadata.json` (template, lang, timestamps), and optionally `index/`.
  </Step>
  <Step title="Build index (default)">
    Unless `--no-index`, `build_index()` runs, then `dump` saves the FAISS index under `index/`.
  </Step>
</Steps>
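
To make the chunking defaults in the extraction step concrete, here is a fixed-width sliding window with the same numbers (2048 characters, 256 overlap). It is an illustration only: `chunk_text` is a hypothetical function, and the splitter Hyper-Extract actually uses may respect separators or sentence boundaries.

```python
from typing import List


def chunk_text(text: str, size: int = 2048, overlap: int = 256) -> List[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_text("x" * 5000)
print([len(c) for c in chunks])  # [2048, 2048, 1416]
```

Each window starts 1792 characters after the previous one, so a 5000-character input yields three chunks and each extraction call sees the tail of the preceding chunk.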

### Output layout

:::files
./output/
├── data.json           # Structured extraction (entities, relations, model fields, etc.)
├── metadata.json       # template, lang, created_at, updated_at
└── index/              # FAISS vector store (when index is built)
    ├── index.faiss
    └── docstore.json
:::

`metadata.json` records the template ID and language so later commands (`he feed`, `he build-index`, `he search`) can reload the correct AutoType without re-specifying flags.

## Append documents with `he feed`

`he feed` loads an existing KA, extracts from new input, merges incrementally, and saves updated `data.json` and `metadata.json`.

```bash
# Initial extraction
he parse tesla_bio.md -t general/biography_graph -o ./tesla_kb/ -l en

# Append a second document (template/lang read from metadata.json)
he feed ./tesla_kb/ tesla_inventions.md

# Append from stdin
cat update.md | he feed ./tesla_kb/ -
```

### Merge behavior

`feed_text` calls `_update_data_state`, which merges incoming extraction into the current AutoType and calls `clear_index()`. Merge semantics depend on the template's AutoType:

| AutoType | Incremental merge |
|----------|-------------------|
| `model` | Field-level merge; first extraction wins for populated fields |
| `graph` / `hypergraph` | Nodes and edges added via memory-layer deduplication |
| `list` / `set` | Items appended or deduplicated per identifier rules |
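
For the `model` row, "first extraction wins" can be pictured as a dictionary merge in which populated fields are never overwritten. The sketch below uses plain dicts and a hypothetical `merge_model_fields` helper. Treating `None` and the empty string as "not populated" is an assumption, and the sample values are invented for illustration.

```python
from typing import Any, Dict


def merge_model_fields(current: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Field-level merge where already-populated fields keep their first value."""
    merged = dict(current)
    for key, value in incoming.items():
        # Assumption: None and "" count as "not populated"
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged


first = {"company_name": "Acme Corp", "quarter": "Q3", "overall_tone": None}
second = {"company_name": "ACME", "quarter": "Q4", "overall_tone": "cautious"}
merged = merge_model_fields(first, second)
print(merged)
```

Only `overall_tone` changes, because it was empty after the first extraction; `company_name` and `quarter` keep their original values.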

Override template or language only when necessary:

```bash
he feed ./ka/ doc.md -t general/biography_graph -l en
```

When `--template` and `--lang` are omitted, `he feed` reads both values from `metadata.json`. If a field is missing there, `template` falls back to `general/graph` and `lang` falls back to `zh`.
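
That precedence (explicit flag, then `metadata.json`, then the documented fallback) can be sketched as follows. `resolve_feed_settings` is a hypothetical helper, and the only assumption about `metadata.json` is that it holds `template` and `lang` keys.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_feed_settings(
    ka_dir: str, template: Optional[str] = None, lang: Optional[str] = None
) -> Tuple[str, str]:
    """Pick template and lang: flag first, then metadata.json, then fallback."""
    meta_path = Path(ka_dir) / "metadata.json"
    if not meta_path.is_file():
        raise FileNotFoundError("Not a valid Knowledge Abstract directory")
    metadata = json.loads(meta_path.read_text(encoding="utf-8"))
    return (
        template or metadata.get("template", "general/graph"),
        lang or metadata.get("lang", "zh"),
    )


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "metadata.json").write_text(
        json.dumps({"template": "general/biography_graph", "lang": "en"}),
        encoding="utf-8",
    )
    from_metadata = resolve_feed_settings(tmp)
    overridden = resolve_feed_settings(tmp, lang="zh")

print(from_metadata, overridden)
```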

<Info>
Verify growth with `he info ./ka/` — node/edge counts and `updated_at` should change after a successful feed.
</Info>

## Rebuild indexes with `he build-index`

`he build-index` loads the KA, optionally clears the existing index with `--force`, embeds all indexable items, and persists FAISS files to `index/`.

```bash
# Build index for a KA parsed with --no-index
he build-index ./output/

# Force rebuild after feeding or manual data.json edits
he build-index ./output/ -f
```

| Condition | Behavior |
|-----------|----------|
| Index exists, no `--force` | Prints warning and exits 0 without rebuilding |
| Index missing or `--force` | Clears index (`clear_index`), runs `build_index`, saves via `dump` |
| `data.json` missing | Exit 1 via `validate_ka_with_data` |

`he search` and `he talk` require a non-empty `index/` directory (`validate_ka_with_index`).
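
The decision table above reduces to three outcomes. The sketch below models them with a hypothetical `plan_build_index` function, and it assumes "index exists" means a non-empty `index/` directory, mirroring the check described for `he search` and `he talk`.

```python
import tempfile
from pathlib import Path


def plan_build_index(ka_dir: str, force: bool = False) -> str:
    """Return 'error', 'skip', or 'rebuild' following the table above."""
    root = Path(ka_dir)
    if not (root / "data.json").is_file():
        return "error"  # the CLI exits with code 1
    index_dir = root / "index"
    # Assumption: an empty index/ directory counts as "no index"
    index_exists = index_dir.is_dir() and any(index_dir.iterdir())
    if index_exists and not force:
        return "skip"  # the CLI warns and exits 0
    return "rebuild"


with tempfile.TemporaryDirectory() as tmp:
    outcomes = [plan_build_index(tmp)]  # no data.json yet
    Path(tmp, "data.json").write_text("{}", encoding="utf-8")
    outcomes.append(plan_build_index(tmp))  # data present, no index
    Path(tmp, "index").mkdir()
    Path(tmp, "index", "index.faiss").write_bytes(b"")
    outcomes.append(plan_build_index(tmp))  # index present
    outcomes.append(plan_build_index(tmp, force=True))

print(outcomes)
```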

## Batch workflow pattern

For multiple documents, defer indexing until all content is merged:

<CodeGroup>
```bash title="CLI batch"
# Parse first doc without index
he parse doc1.md -t general/biography_graph -o ./ka/ -l en --no-index

# Append remaining docs
he feed ./ka/ doc2.md
he feed ./ka/ doc3.md

# Single index build
he build-index ./ka/

# Query
he search ./ka/ "key concept"
he talk ./ka/ -q "Summarize all documents"
```

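```python title="Python batch"
from pathlib import Path

from hyperextract import Template

ka = Template.create("finance/earnings_summary", "en")

# Merge every document before indexing (file names are placeholders)
for name in ("doc1.md", "doc2.md", "doc3.md"):
    ka.feed_text(Path(name).read_text(encoding="utf-8"))

ka.dump("./finance_kb/")  # persist data.json and metadata.json
ka.build_index()          # single index build after all merges
ka.dump("./finance_kb/")  # persist index
```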
</CodeGroup>

## Python API equivalent

The CLI commands map directly to `BaseAutoType` lifecycle methods:

| CLI | Python |
|-----|--------|
| `he parse` (new KA) | `Template.create(...)` → `feed_text(text)` → `dump(path)` → `build_index()` → `dump(path)` (last two steps skipped with `--no-index`) |
| `he feed` | `Template.create(...)` → `load(path)` → `feed_text(text)` → `dump(path)` |
| `he build-index` | `Template.create(...)` → `load(path)` → `build_index()` → `dump(path)` |
| Preview without mutation | `parse(text)` returns a new instance |

<RequestExample>
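```python
from pathlib import Path

from hyperextract import Template

# Placeholder inputs; substitute your own files
earnings_transcript = Path("earnings_call.md").read_text(encoding="utf-8")
q4_update = Path("q4_update.md").read_text(encoding="utf-8")

# Create and extract (equivalent to he parse)
ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(earnings_transcript)
ka.dump("./finance_kb/")
ka.build_index()
ka.dump("./finance_kb/")

# Evolve (equivalent to he feed followed by he build-index)
ka.load("./finance_kb/")
ka.feed_text(q4_update)
ka.dump("./finance_kb/")
ka.build_index()
ka.dump("./finance_kb/")
```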
</RequestExample>

`Template.create` reads LLM and embedder from global config when clients are not passed explicitly. Method templates accept extra kwargs (for example `observation_time` for temporal extractors).

## Error handling

| Error | Cause | Resolution |
|-------|-------|------------|
| `LLM API key is not configured` | Missing config before extraction | Run `he config init` or set `OPENAI_API_KEY` |
| `--lang is required for knowledge templates` | `-l` omitted on a preset template | Add `--lang en` or `--lang zh` |
| `Output directory already exists and is not empty` | Re-parse to same path | Use `-f` or choose a new `-o` path |
| `Template '...' not found` | Invalid `-t` or `-m` value | Run `he list template` or `he list method` |
| `No .txt or .md files found` | Empty or unsupported directory | Add `.md`/`.txt` files or pass a single file |
| `Index not found` on search/talk | Fed KA without rebuild | Run `he build-index ./ka/` |
| `Not a valid Knowledge Abstract directory` | Missing `metadata.json` on feed | Ensure directory was created by `he parse` |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` to trace extraction stages (`feed_text_invoked`, `knowledge_extracted`, `index_built`).

## Choosing a template

Use `he list template` to browse presets by domain, AutoType, and language. The example preset `finance/earnings_summary` is an `AutoModel` template with fields such as `company_name`, `quarter`, `reported_revenue`, and `overall_tone`, which suits earnings call transcripts in English or Chinese.

<Tabs>
  <Tab title="Domain presets">
    ```bash
    he parse transcript.md -t finance/earnings_summary -o ./earnings_kb/ -l en
    he parse bio.md -t general/biography_graph -o ./bio_kb/ -l en
    ```
  </Tab>
  <Tab title="Extraction methods">
    ```bash
    he parse paper.md -m hyper_rag -o ./paper_kb/
    he list method -q light
    ```
  </Tab>
</Tabs>

<CardGroup>
  <Card title="Templates vs methods" href="/templates-vs-methods">
    Compare YAML domain presets and algorithm-driven method templates, including language requirements.
  </Card>
  <Card title="Knowledge Abstracts" href="/knowledge-abstracts">
    Deep dive into `data.json`, `metadata.json`, and `index/` layout and lifecycle methods.
  </Card>
</CardGroup>

## Related pages

<CardGroup>
  <Card title="Quickstart" href="/quickstart">
    First successful extraction from install through `he search` and `he show`.
  </Card>
  <Card title="Search, chat, and visualize" href="/search-chat-visualize">
    Query and explore KAs after indexing with `he search`, `he talk`, and `he show`.
  </Card>
  <Card title="CLI reference" href="/cli-reference">
    Full `he` command surface, flags, defaults, and exit conditions.
  </Card>
  <Card title="Python API reference" href="/python-api-reference">
    `Template.create`, `feed_text`, `dump`, `load`, and `build_index` signatures.
  </Card>
  <Card title="Troubleshooting" href="/troubleshooting">
    Common failure modes for parse, feed, and index operations.
  </Card>
</CardGroup>
