fix(embedding): drop degenerate tiny chunks before embedding#768

Open
lfnovo wants to merge 1 commit into main from fix/embedding-degenerate-chunks

Conversation


lfnovo (Owner) commented Apr 19, 2026

Summary

Fixes #764. URL source ingestion was crashing on complex HTML pages (Wikipedia, Project Gutenberg) with:

Failed to generate embeddings: float() argument must be a string or a real number, not 'NoneType'

Two interacting bugs caused this:

  1. Our chunker emits degenerate chunks. LangChain's HTMLHeaderTextSplitter on complex pages can produce single-character or punctuation-only chunks (e.g. "."). Our existing [c.strip() for c in chunks if c and c.strip()] filter only drops empty/whitespace chunks, so these survived.
  2. Esperanto crashes on null embeddings returned by llama.cpp's OpenAI-compatible endpoint for tiny inputs in batch mode. A defensive-handling issue was filed upstream: esperanto#119 ("Embedding providers crash on null values returned by OpenAI-compatible endpoints").
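To make bug 1 concrete, here is a minimal sketch (with a hypothetical chunk list) showing why the existing filter lets punctuation-only chunks through:

```python
# Hypothetical output from a header-based splitter on a complex page.
chunks = ["Intro paragraph with real content.", ".", "  ", "", "!"]

# The existing filter only drops empty and whitespace-only chunks,
# so single-character/punctuation-only chunks survive and reach the embedder.
filtered = [c.strip() for c in chunks if c and c.strip()]
print(filtered)  # → ['Intro paragraph with real content.', '.', '!']
```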

This PR fixes our side: chunk_text() now filters chunks below OPEN_NOTEBOOK_MIN_CHUNK_SIZE tokens (default 5) before returning. The filter is bypassed if it would empty the result list, so legitimately short documents are preserved.

Changes

  • open_notebook/utils/chunking.py: new MIN_CHUNK_SIZE constant + filter step in chunk_text().
  • tests/test_chunking.py: 2 new tests covering the filter behavior and the empty-result safeguard.
  • docs/5-CONFIGURATION/environment-reference.md: documents the new env var.
  • CHANGELOG.md: Unreleased entry.

Test plan

Header-based splitters (notably HTMLHeaderTextSplitter on complex pages
like Wikipedia or Project Gutenberg) can emit single-character or
punctuation-only chunks. Some embedding providers — including
llama.cpp's OpenAI-compatible endpoint — return null vector elements
for such inputs, which then crash response parsing in Esperanto with
'TypeError: float() argument must be a string or a real number, not
NoneType'.

chunk_text() now filters chunks below OPEN_NOTEBOOK_MIN_CHUNK_SIZE
tokens (default 5) after splitting. The filter is bypassed when it
would empty the result list, so legitimately short documents are
preserved.
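The two new tests mentioned in the Changes section could look roughly like this; test names and the helper are assumptions, not the actual code in tests/test_chunking.py:

```python
# Hedged sketch of the filter under test (whitespace-split token counting
# is an assumption; the real chunk_text() may tokenize differently).
def filter_degenerate_chunks(chunks, min_tokens=5):
    kept = [c for c in chunks if len(c.split()) >= min_tokens]
    return kept if kept else chunks


def test_drops_punctuation_only_chunks():
    chunks = ["A sentence with at least five tokens here.", "."]
    assert filter_degenerate_chunks(chunks) == [
        "A sentence with at least five tokens here."
    ]


def test_short_document_bypasses_filter():
    # A document whose every chunk is tiny must survive untouched.
    assert filter_degenerate_chunks(["Hi."]) == ["Hi."]
```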

cubic-dev-ai (bot) left a comment


No issues found across 4 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.



Development

Successfully merging this pull request may close these issues.

[Bug]: URL source ingestion single-character chunks causing NoneType embedding failure
