Knowledge Import

Import external documents and URLs into the project-scoped vector knowledge base. The pipeline fetches content, parses it into structured text, chunks it with semantic boundary detection, embeds the chunks via the configured embedding provider, deduplicates them against existing content, and stores the results in sqlite-vec with category metadata.

Quick Start

# Import a URL
crocbot knowledge import https://example.com/docs/getting-started

# Import a local file
crocbot knowledge import ./docs/architecture.md

# Import into a specific project
crocbot knowledge import ./api-reference.md --project my-project

# List imported sources
crocbot knowledge list

# Remove an imported source
crocbot knowledge remove https://example.com/docs/getting-started

Supported Formats

| Format | Extensions / Patterns | Parser |
| --- | --- | --- |
| URL/HTML | https://... | Fetches with SSRF guard, extracts main content via cheerio, converts to markdown |
| Markdown | .md, .mdx | Extracts YAML frontmatter, preserves heading structure |
| PDF | .pdf | Extracts text per page using pdfjs-dist |
| Plain Text | Any other file | Universal fallback, treats content as-is |
The parser is selected automatically based on the source type and file extension. URL sources always use the HTML parser. File sources are matched by extension in priority order: PDF, Markdown, then Text fallback.
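
The selection rule above can be sketched as a small function. This is an illustrative sketch, not crocbot's actual code; the function and type names are hypothetical.

```typescript
// Hypothetical sketch of the parser-selection rule described above.
type ParserKind = "html" | "pdf" | "markdown" | "text";

function pickParser(source: string): ParserKind {
  // URL sources always use the HTML parser.
  if (/^https?:\/\//i.test(source)) return "html";
  // File sources are matched by extension in priority order.
  const lower = source.toLowerCase();
  if (lower.endsWith(".pdf")) return "pdf";
  if (lower.endsWith(".md") || lower.endsWith(".mdx")) return "markdown";
  // Anything else falls through to the plain-text parser.
  return "text";
}
```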

Commands

crocbot knowledge import

Import a document into the knowledge base.
crocbot knowledge import <source> [options]
Arguments:
  • source - URL or local file path to import
Options:
| Flag | Description | Default |
| --- | --- | --- |
| --project <name> | Target project scope | Default project |
| --category <cat> | Knowledge category: docs, references, solutions | docs |
| --dry-run | Preview without importing | false |
| --force | Force re-import even if unchanged | false |
| --batch <file> | Import multiple sources from a file (one URL/path per line) | - |
Batch file format:
# Lines starting with # are comments
https://example.com/docs/page-1
https://example.com/docs/page-2
./local-docs/architecture.md
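
Reading that format reduces to filtering comment and blank lines. A minimal sketch (the function name is illustrative, not part of crocbot):

```typescript
// Sketch: turn batch-file contents into a list of sources, assuming the
// format above (one URL/path per line, lines starting with '#' are comments).
function parseBatchFile(contents: string): string[] {
  return contents
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}
```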

crocbot knowledge list

List all imported knowledge sources with their status.
crocbot knowledge list [options]
Options:
| Flag | Description |
| --- | --- |
| --project <name> | Target project scope |
| --json | Output as JSON |

crocbot knowledge remove

Remove an imported source and all its chunks from the knowledge base.
crocbot knowledge remove <source> [options]
Options:
| Flag | Description |
| --- | --- |
| --project <name> | Target project scope |

How It Works

Pipeline Stages

  1. Parse - The document is fetched (URLs) or read (files) and parsed into normalized markdown content
  2. Chunk - Content is split into overlapping chunks respecting heading boundaries (~400 tokens per chunk, ~80 token overlap)
  3. Embed - Each chunk is embedded into a vector using the configured embedding provider
  4. Dedup - Hash-exact and similarity-based deduplication removes redundant chunks
  5. Store - Unique chunks and their embeddings are stored in the project-scoped sqlite-vec database
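
The five stages compose linearly. A schematic sketch of that composition, with stage implementations injected as parameters (all interface and function names here are illustrative, not crocbot's internals; the real stages are asynchronous):

```typescript
// Schematic composition of the five pipeline stages.
interface Chunk {
  text: string;
  embedding?: number[];
}

interface Stages {
  parse: (source: string) => string; // 1: fetch/read + normalize to markdown
  chunk: (md: string) => Chunk[];    // 2: heading-aware splitting
  embed: (cs: Chunk[]) => Chunk[];   // 3: embedding provider
  dedup: (cs: Chunk[]) => Chunk[];   // 4: hash + similarity filtering
  store: (cs: Chunk[]) => number;    // 5: sqlite-vec insert, returns count stored
}

function importSource(source: string, s: Stages): number {
  return s.store(s.dedup(s.embed(s.chunk(s.parse(source)))));
}
```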

Incremental Updates

The pipeline tracks imported sources via a state file (knowledge-state.json). On re-import:
  • Unchanged content is skipped (no work done)
  • Changed content triggers removal of old chunks followed by fresh import
  • New sources are imported directly
  • Removed sources (via knowledge remove) have all chunks deleted
Use --force to re-import even when content has not changed.
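
The re-import decision can be sketched as a comparison against the recorded state. This assumes knowledge-state.json maps each source to the content hash seen at the last import; the field and function names are hypothetical.

```typescript
// Sketch of the re-import decision against knowledge-state.json.
type Action = "skip" | "reimport" | "import";

function decideAction(
  state: Record<string, { contentHash: string }>,
  source: string,
  newHash: string,
  force = false,
): Action {
  const prev = state[source];
  if (!prev) return "import";   // new source: import directly
  if (force) return "reimport"; // --force bypasses the unchanged check
  return prev.contentHash === newHash ? "skip" : "reimport";
}
```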

Deduplication

Two-phase deduplication prevents redundant content:
  1. Hash dedup (fast) - Chunks with identical text hashes are skipped
  2. Similarity dedup (fuzzy) - Chunks with cosine similarity above 0.95 against existing or in-batch embeddings are skipped
This handles both exact duplicates and near-duplicate content across overlapping documents.
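
The two phases can be sketched as a single pass that checks exact hashes first, then cosine similarity against already-accepted chunks. A minimal sketch assuming normalized per-chunk hashes and embeddings (names illustrative):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Two-phase dedup: phase 1 drops exact hash matches, phase 2 drops
// near-duplicates above the similarity threshold (default 0.95).
function dedupChunks(
  chunks: { hash: string; embedding: number[] }[],
  threshold = 0.95,
): { hash: string; embedding: number[] }[] {
  const seen = new Set<string>();
  const kept: { hash: string; embedding: number[] }[] = [];
  for (const c of chunks) {
    if (seen.has(c.hash)) continue; // phase 1: exact hash dedup
    if (kept.some((k) => cosine(k.embedding, c.embedding) > threshold)) {
      continue; // phase 2: similarity dedup
    }
    seen.add(c.hash);
    kept.push(c);
  }
  return kept;
}
```

In a real pipeline, phase 2 would also compare against embeddings already stored in the database, not only the current batch.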

Project Isolation

Knowledge is stored per-project using the project workspace isolation from Phase 14:
  • Each project has its own knowledge.db and knowledge-state.json
  • Chunks imported into one project are not visible from another
  • Use --project to target a specific project scope
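
Isolation follows from each project resolving its own storage paths. A sketch under assumptions: the directory layout below is hypothetical; only the file names knowledge.db and knowledge-state.json come from this page.

```typescript
import { join } from "node:path";

// Hypothetical per-project path resolution: each project scope gets its
// own knowledge.db and knowledge-state.json under its workspace directory.
function knowledgePaths(workspaceRoot: string, project: string) {
  const base = join(workspaceRoot, project);
  return {
    db: join(base, "knowledge.db"),
    state: join(base, "knowledge-state.json"),
  };
}
```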

Security

URL imports are protected by the SSRF guard:
  • Private and internal IP addresses (127.0.0.0/8, 10.0.0.0/8, 169.254.0.0/16, etc.) are blocked
  • DNS resolution is pinned to prevent TOCTOU attacks
  • Redirect chains are validated at each hop
  • Response body size is limited to 10 MB
  • Fetch timeout is 30 seconds
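
The address check is the simplest part of that guard to illustrate. A sketch covering only the IPv4 range test (real guards also pin DNS, validate redirects, and handle IPv6; the function name is illustrative):

```typescript
// Sketch: reject private, loopback, and link-local IPv4 addresses.
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return true; // malformed addresses are rejected outright
  }
  const [a, b] = parts;
  if (a === 127) return true;                       // 127.0.0.0/8 loopback
  if (a === 10) return true;                        // 10.0.0.0/8 private
  if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12 private
  if (a === 192 && b === 168) return true;          // 192.168.0.0/16 private
  if (a === 169 && b === 254) return true;          // 169.254.0.0/16 link-local
  if (a === 0) return true;                         // 0.0.0.0/8 "this network"
  return false;
}
```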

Categories

Imported knowledge can be categorized for better organization:
| Category | Use Case |
| --- | --- |
| docs | Documentation, guides, tutorials (default) |
| references | API references, specifications |
| solutions | Known solutions, troubleshooting guides |

Configuration

The knowledge import pipeline uses the project’s configured embedding provider. No additional configuration is required beyond the standard embedding setup.

Chunking Defaults

| Parameter | Default | Description |
| --- | --- | --- |
| Chunk size | 400 tokens | Maximum tokens per chunk (~1600 chars) |
| Overlap | 80 tokens | Overlap between adjacent chunks (~320 chars) |
| Heading-aware | true | Split on heading boundaries when possible |
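
These defaults imply a sliding window over the text. A simplified sketch using the ~4 chars/token approximation from the table (400 tokens ≈ 1600 chars, 80 tokens ≈ 320 chars); the real chunker additionally prefers heading boundaries:

```typescript
// Overlapping fixed-size chunking: each chunk starts (size - overlap)
// characters after the previous one, so adjacent chunks share `overlap` chars.
function chunkText(text: string, size = 1600, overlap = 320): string[] {
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```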

Dedup Defaults

| Parameter | Default | Description |
| --- | --- | --- |
| Hash dedup | true | Skip chunks with identical text hash |
| Similarity dedup | true | Skip chunks above similarity threshold |
| Similarity threshold | 0.95 | Cosine similarity threshold for near-duplicates |