10. Knowledge Import Pipeline

Status: Accepted Date: 2026-02-16

Context

Crocbot’s memory system provides conversation-based knowledge through sqlite-vec hybrid search. However, agents could not ingest external documents (PDFs, web pages, markdown files) into their project-scoped knowledge base. Users needed a way to import reference material for richer context during conversations. Agent Zero provides a knowledge_import.py reference with LangChain document loaders and an incremental state machine. Crocbot needed a TypeScript-native equivalent using existing infrastructure (sqlite-vec, SSRF guards, embedding providers).

Decision

Build a 6-stage import pipeline (fetch, parse, chunk, embed, dedup, store) with:

Parser Registry (strategy pattern) with priority-ordered dispatch for text, markdown, PDF, and URL/HTML formats
Heading-aware chunking that respects document structure (sections, headings) rather than naive character splitting
Two-layer dedup: content-hash first (O(1)), then vector similarity for near-duplicates
SQLite storage with knowledge_chunks, knowledge_vectors (vec0), and knowledge_meta tables
Incremental re-import via content-hash state machine (new/unchanged/changed classification)
Project scoping via separate storage directories per project
CLI interface: crocbot knowledge import|list|remove

Key library choices:

cheerio + node-html-markdown for HTML parsing (not Readability — more control over content extraction)
pdfjs-dist for PDF (already installed, lazy-loaded to avoid startup overhead)
Existing fetchWithSsrFGuard for SSRF-protected URL fetching

Consequences

External documents become searchable alongside conversation memories
Incremental re-import prevents redundant processing on unchanged sources
Parser registry is extensible for future formats (DOCX, EPUB, etc.)
Separate knowledge storage tables avoid polluting existing memory indexes
PDF parser uses dynamic import, adding no startup cost when unused
Batch import (--batch) enables scripted bulk ingestion

​10. Knowledge Import Pipeline

​Context

​Decision

​Consequences

10. Knowledge Import Pipeline

Context

Decision

Consequences