fix(embedding): drop degenerate tiny chunks before embedding#768
Open
Conversation
Header-based splitters (notably HTMLHeaderTextSplitter on complex pages like Wikipedia or Project Gutenberg) can emit single-character or punctuation-only chunks. Some embedding providers — including llama.cpp's OpenAI-compatible endpoint — return null vector elements for such inputs, which then crash response parsing in Esperanto with 'TypeError: float() argument must be a string or a real number, not NoneType'. chunk_text() now filters chunks below OPEN_NOTEBOOK_MIN_CHUNK_SIZE tokens (default 5) after splitting. The filter is bypassed when it would empty the result list, so legitimately short documents are preserved.
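The filter-with-bypass behavior described above can be sketched as follows. This is a minimal illustration, not the actual Open Notebook code: the names `token_count()` and `filter_tiny_chunks()` are hypothetical, and the whitespace-based tokenizer is a stand-in for whatever tokenizer the real implementation uses.

```python
# Sketch of the degenerate-chunk filter; names and tokenizer are assumptions.
import os

# Default threshold of 5 tokens, overridable via the documented env var.
MIN_CHUNK_SIZE = int(os.environ.get("OPEN_NOTEBOOK_MIN_CHUNK_SIZE", "5"))

def token_count(chunk: str) -> int:
    # Stand-in tokenizer: whitespace word count. The real code presumably
    # counts tokens with the embedding model's tokenizer.
    return len(chunk.split())

def filter_tiny_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks under MIN_CHUNK_SIZE tokens, unless that drops everything."""
    kept = [c for c in chunks if token_count(c) >= MIN_CHUNK_SIZE]
    # Bypass: a legitimately short document must not end up with zero chunks.
    return kept if kept else chunks

print(filter_tiny_chunks(["a real sentence with several words", "."]))
# → ['a real sentence with several words']
print(filter_tiny_chunks(["short doc"]))
# → ['short doc']  (bypass kicks in: filtering would have emptied the list)
```

The bypass is the key design point: without it, a document whose only chunk is short would silently produce no embeddings at all.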
Summary
Fixes #764. URL source ingestion was crashing on complex HTML pages (Wikipedia, Project Gutenberg) with `TypeError: float() argument must be a string or a real number, not NoneType`.
Two interacting bugs caused this:
1. `HTMLHeaderTextSplitter` on complex pages can produce single-character or punctuation-only chunks (e.g. `"."`). Our existing `[c.strip() for c in chunks if c and c.strip()]` filter only drops empty/whitespace chunks, so these survived.
2. Some embedding providers, including llama.cpp's OpenAI-compatible endpoint, return null vector elements for such inputs, which then crash response parsing in Esperanto.

This PR fixes our side:
`chunk_text()` now filters chunks below `OPEN_NOTEBOOK_MIN_CHUNK_SIZE` tokens (default 5) before returning. The filter is bypassed if it would empty the result list, so legitimately short documents are preserved.

Changes
- `open_notebook/utils/chunking.py`: new `MIN_CHUNK_SIZE` constant + filter step in `chunk_text()`.
- `tests/test_chunking.py`: 2 new tests covering the filter behavior and the empty-result safeguard.
- `docs/5-CONFIGURATION/environment-reference.md`: documents the new env var.
- `CHANGELOG.md`: Unreleased entry.

Test plan
`uv run pytest tests/test_chunking.py tests/test_embedding.py` — 52 passed
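The two new tests mentioned above might look roughly like the following pytest sketch. The helper `filter_tiny_chunks()` here mirrors the described behavior but is an assumption; the real tests exercise `chunk_text()` itself in `tests/test_chunking.py`.

```python
# Hypothetical pytest sketch of the two new test cases; helper is a stand-in
# for the real chunk_text() filtering step.
MIN_CHUNK_SIZE = 5

def filter_tiny_chunks(chunks):
    # Assumed behavior: drop sub-threshold chunks, but never return an
    # empty list when the input was non-empty.
    kept = [c for c in chunks if len(c.split()) >= MIN_CHUNK_SIZE]
    return kept if kept else chunks

def test_drops_degenerate_chunks():
    # Punctuation-only chunks from HTMLHeaderTextSplitter should be removed.
    chunks = ["This chunk has more than five tokens total.", "."]
    assert filter_tiny_chunks(chunks) == ["This chunk has more than five tokens total."]

def test_short_document_preserved():
    # Empty-result safeguard: if filtering would drop everything,
    # the original chunks are kept.
    assert filter_tiny_chunks(["tiny doc"]) == ["tiny doc"]
```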