fix(embedding): drop degenerate tiny chunks before embedding#768
Open
Conversation
Header-based splitters (notably HTMLHeaderTextSplitter on complex pages like Wikipedia or Project Gutenberg) can emit single-character or punctuation-only chunks. Some embedding providers — including llama.cpp's OpenAI-compatible endpoint — return null vector elements for such inputs, which then crash response parsing in Esperanto with 'TypeError: float() argument must be a string or a real number, not NoneType'. chunk_text() now filters chunks below OPEN_NOTEBOOK_MIN_CHUNK_SIZE tokens (default 5) after splitting. The filter is bypassed when it would empty the result list, so legitimately short documents are preserved.
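The filter-with-bypass behavior described above can be sketched as follows. This is a minimal illustration, not the actual Open Notebook code: the names `token_count()` and `filter_tiny_chunks()` are hypothetical, and the whitespace-based tokenizer is a stand-in for whatever tokenizer the real implementation uses.

```python
# Sketch of the degenerate-chunk filter; names and tokenizer are assumptions.
import os

# Default threshold of 5 tokens, overridable via the documented env var.
MIN_CHUNK_SIZE = int(os.environ.get("OPEN_NOTEBOOK_MIN_CHUNK_SIZE", "5"))

def token_count(chunk: str) -> int:
    # Stand-in tokenizer: whitespace word count. The real code presumably
    # counts tokens with the embedding model's tokenizer.
    return len(chunk.split())

def filter_tiny_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks under MIN_CHUNK_SIZE tokens, unless that drops everything."""
    kept = [c for c in chunks if token_count(c) >= MIN_CHUNK_SIZE]
    # Bypass: a legitimately short document must not end up with zero chunks.
    return kept if kept else chunks

print(filter_tiny_chunks(["a real sentence with several words", "."]))
# → ['a real sentence with several words']
print(filter_tiny_chunks(["short doc"]))
# → ['short doc']  (bypass kicks in: filtering would have emptied the list)
```

The bypass is the key design point: without it, a document whose only chunk is short would silently produce no embeddings at all.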
Summary
Fixes #764. URL source ingestion was crashing on complex HTML pages (Wikipedia, Project Gutenberg) with `TypeError: float() argument must be a string or a real number, not NoneType`.
Two interacting bugs caused this:
1. `HTMLHeaderTextSplitter` on complex pages can produce single-character or punctuation-only chunks (e.g. `"."`). Our existing `[c.strip() for c in chunks if c and c.strip()]` filter only drops empty/whitespace chunks, so these survived.
2. Some embedding providers, including llama.cpp's OpenAI-compatible endpoint, return null vector elements for such inputs, which then crash response parsing in Esperanto.

This PR fixes our side:
`chunk_text()` now filters chunks below `OPEN_NOTEBOOK_MIN_CHUNK_SIZE` tokens (default 5) before returning. The filter is bypassed if it would empty the result list, so legitimately short documents are preserved.

Changes
- `open_notebook/utils/chunking.py`: new `MIN_CHUNK_SIZE` constant + filter step in `chunk_text()`.
- `tests/test_chunking.py`: 2 new tests covering the filter behavior and the empty-result safeguard.
- `docs/5-CONFIGURATION/environment-reference.md`: documents the new env var.
- `CHANGELOG.md`: Unreleased entry.

Test plan
`uv run pytest tests/test_chunking.py tests/test_embedding.py` — 52 passed
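The two new tests mentioned above might look roughly like the following pytest sketch. The helper `filter_tiny_chunks()` here mirrors the described behavior but is an assumption; the real tests exercise `chunk_text()` itself in `tests/test_chunking.py`.

```python
# Hypothetical pytest sketch of the two new test cases; helper is a stand-in
# for the real chunk_text() filtering step.
MIN_CHUNK_SIZE = 5

def filter_tiny_chunks(chunks):
    # Assumed behavior: drop sub-threshold chunks, but never return an
    # empty list when the input was non-empty.
    kept = [c for c in chunks if len(c.split()) >= MIN_CHUNK_SIZE]
    return kept if kept else chunks

def test_drops_degenerate_chunks():
    # Punctuation-only chunks from HTMLHeaderTextSplitter should be removed.
    chunks = ["This chunk has more than five tokens total.", "."]
    assert filter_tiny_chunks(chunks) == ["This chunk has more than five tokens total."]

def test_short_document_preserved():
    # Empty-result safeguard: if filtering would drop everything,
    # the original chunks are kept.
    assert filter_tiny_chunks(["tiny doc"]) == ["tiny doc"]
```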