Skip to content

fix(backend): preserve conversations when LLM returns empty title (#5668)#6675

Closed
Surya-Midde wants to merge 2 commits intoBasedHardware:mainfrom
Surya-Midde:fix/5668-empty-llm-title-discard
Closed

fix(backend): preserve conversations when LLM returns empty title (#5668)#6675
Surya-Midde wants to merge 2 commits intoBasedHardware:mainfrom
Surya-Midde:fix/5668-empty-llm-title-discard

Conversation

@Surya-Midde
Copy link
Copy Markdown

Summary

Fixes #5668 — P1 data-loss bug where valid conversations were silently discarded when the LLM returned an empty title.

Two-part fix, split across two commits for reviewability:

  1. fix(backend) — root-cause signal fix. _get_conversation_obj was re-deriving discarded from structured.title == ''. The authoritative discarded bool is already correctly computed by _get_structured via should_discard_conversation, and is used correctly downstream for folder assignment and analytics. It was just being dropped at the _get_conversation_obj call boundary. This commit passes it explicitly and removes the fragile title-based proxy. Default value is False so any future caller that forgets to pass it preserves the conversation rather than silently dropping it.

  2. feat(backend) — deterministic fallback title. Even with silent-discard protection, an empty title still renders as "" in the UI and breaks search indexing. When the main transcript path's LLM call returns empty title for a transcript ≥100 words, a deterministic fallback is applied (preferring overview first-sentence with a ≥2-word guard, then a locale-independent date label, then "Untitled Conversation"). No retry LLM call — intentionally cheap.

Scope

  • Fix (1) protects all 5 return paths of _get_structured from silent discard (main transcript, force-process/reprocess, external-integration audio/message/other).
  • Fix (2) is scoped to the main transcript path only. External integrations typically get titles from their source systems (Limitless, iOS Shortcuts, etc.); force-process preserves the existing title as an LLM hint. Those paths are protected from data loss by fix (1) but don't get cosmetic fallback titles — keeping the PR narrow.

Files

  • backend/utils/conversations/process_conversation.py — signature change, new helper, fallback trigger, caller update
  • backend/tests/unit/test_empty_title_no_discard.pynew — 8 pure unit tests
  • backend/test.sh — register new test file

Note on line numbers

The issue cites some stale line numbers (conversation.py:169-170, process_conversation.py:209). Since the issue was filed, Structured was extracted into its own module (models/structured.py) and the title == '' check is now at process_conversation.py:222. The bug itself is unchanged.

Implementation details worth knowing

  • Locale-independent date formatting: uses a hard-coded English month-abbreviation tuple rather than strftime('%b %d, %Y'), which is locale-sensitive (returns \"avr.\" on fr_FR.UTF-8). This makes the fallback and its tests deterministic on any developer's machine.
  • 2-word minimum on overview first sentence: avoids noisy one-word titles like \"Dr\" (from "Dr. Smith met with...") or \"Scored 3\" (from "Scored 3.5 on...").
  • Pure unit tests: all 8 tests avoid mocking _get_structured. Stubbing of heavy transitive imports (Firestore, Pinecone, RAG, etc.) follows the same pattern as test_process_conversation_usage_context.py.

Out of scope

  • Retry LLM title generation (extra cost for an edge case)
  • Making Structured.title a required Pydantic field (would break ≥6 placeholder construction sites in transcribe.py, merge_conversations.py, conversation_processing.py)
  • Fallback titles for external-integration and force-process paths

Potential follow-up (for post-merge, not this PR)

  • Add a `logger.info(...)` at the fallback trigger site for observability — would let maintainers spot LLM regressions via log volume without waiting for user reports.

Test plan

  • `pytest tests/unit/test_empty_title_no_discard.py -v` → 8/8 passing
  • `pytest tests/unit/test_conversation_model_split.py -v` → unchanged, existing assertions (`Structured().title == ''` default) still pass
  • `bash test.sh` run locally (new test registered)
  • `black --line-length 120 --skip-string-normalization` applied

🤖 Generated with Claude Code

middesurya and others added 2 commits April 15, 2026 13:34
)

The `_get_conversation_obj` helper re-derived `discarded` from
`structured.title == ''`, which silently dropped valid conversations
whenever the LLM returned an empty title. The authoritative `discarded`
signal was already being computed correctly by `_get_structured` (via
`should_discard_conversation`) but was being thrown away at the call
boundary.

This change passes the authoritative `discarded` bool explicitly to
`_get_conversation_obj` and removes the fragile title-based proxy.
Default value is `False` so any future caller that forgets to pass a
value preserves the conversation rather than silently dropping it.

Adds three regression tests covering:
  1. empty title + discarded=False preserves the conversation
  2. empty title + discarded=True legitimately marks it discarded
  3. omitted discarded param defaults to False (safe default)

Fixes #5668

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…onversations (#5668)

Complements the signal-fix commit: even with silent-discard protection,
an empty LLM title still renders as '' in the UI and breaks search
indexing. This commit adds a deterministic fallback title generator
invoked on the main transcript path when the LLM returns an empty
title for a transcript >=100 words.

Preference order for the fallback:
  1. First sentence of the LLM-generated overview (if present and <=60 chars)
  2. Date-stamped label ("Conversation on Apr 15, 2026")
  3. Static fallback "Untitled Conversation"

No retry LLM call — intentionally cheap. External integration and
force-process paths are not affected (they have their own title
sources) and stay out of scope for this PR.

Adds four helper tests (all pure).

Fixes #5668

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR fixes issue #5668 where conversations were silently discarded when the LLM returned an empty title. The fix has two parts: (1) _get_conversation_obj now accepts an explicit discarded parameter instead of re-deriving it from structured.title == '', and (2) a deterministic fallback title is applied on the main transcript path when the LLM returns an empty title for a transcript ≥100 words. Eight new unit tests are included and registered in test.sh.

Confidence Score: 5/5

Safe to merge — the data-loss bug is correctly fixed across all code paths with no new regressions introduced.

All three branches of _get_conversation_obj correctly propagate the explicit discarded parameter. The only call site in the codebase has been updated. The Structured.title field defaults to '' so .strip() is safe. The fallback title logic is deterministic and locale-independent. The single remaining finding is a P2 docstring inaccuracy about the "Scored 3" edge case — it does not affect runtime behavior.

No files require special attention; the docstring inaccuracy in process_conversation.py lines 109–111 is cosmetic only.

Important Files Changed

Filename Overview
backend/utils/conversations/process_conversation.py Root-cause fix: replaces discarded = structured.title == '' with explicit parameter; adds _fallback_title_for_empty_llm_response; one P2 docstring inaccuracy about the "Scored 3" guard
backend/tests/unit/test_empty_title_no_discard.py 8 pure unit tests covering the discard-signal fix and fallback title helper; module stubbing pattern mirrors existing tests; all cases are correct
backend/test.sh New test file registered in CI runner; no issues

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[process_conversation] --> B["_get_structured(uid, lang, conv, force_process)"]
    B --> C{conversation.source?}
    C -->|external_integration / workflow| D{text_source?}
    D -->|audio| E[get_transcript_structure]
    D -->|message| F[get_message_structure]
    D -->|other| G[summarize_experience_text]
    E & F & G --> H["return structured, discarded=False"]
    C -->|native transcript| I{force_process?}
    I -->|yes| J[get_reprocess_transcript_structure]
    J --> K["return structured, discarded=False"]
    I -->|no| L[should_discard_conversation]
    L -->|discarded=True| M["return Structured(emoji=...), True"]
    L -->|discarded=False| N[get_transcript_structure]
    N --> O{title empty AND transcript >= 100 words?}
    O -->|yes| P["_fallback_title_for_empty_llm_response()"]
    O -->|no| Q["return structured, False"]
    P --> Q
    H & K & M & Q --> R["_get_conversation_obj(uid, structured, conv, discarded=discarded)"]
    R --> S{isinstance?}
    S -->|CreateConversation| T["Conversation(discarded=discarded)"]
    S -->|ExternalIntegrationCreateConversation| U["Conversation(discarded=discarded)"]
    S -->|Conversation| V["conv.discarded = discarded"]
    T & U & V --> W[return Conversation]
Loading

Reviews (1): Last reviewed commit: "feat(backend): fallback title for empty ..." | Re-trigger Greptile

Comment on lines +109 to +111
* The 2-word minimum avoids producing noisy one-word titles when overview
begins with an abbreviation ("Dr. Smith met..." → first_sentence=="Dr")
or a decimal ("Scored 3.5..." → first_sentence=="Scored 3").
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading docstring — 2-word guard does not prevent "Scored 3"

The comment claims the 2-word minimum avoids "Scored 3" (from "Scored 3.5 on..."), but "Scored 3" is 2 words and passes the >= 2 check. Tracing the logic: "Scored 3.5 on...".split('.')['Scored 3', '5 on...'], so first_sentence = 'Scored 3' which has len(...split()) == 2 and len == 8 <= 60 — it is returned as the title. The guard only prevents 1-word extractions (e.g. "Dr" from "Dr. Smith..."). The decimal-split case should either be handled explicitly or removed from the docstring to avoid misleading future maintainers.

Suggested change
* The 2-word minimum avoids producing noisy one-word titles when overview
begins with an abbreviation ("Dr. Smith met..."first_sentence=="Dr")
or a decimal ("Scored 3.5..."first_sentence=="Scored 3").
* The 2-word minimum avoids producing noisy one-word titles when overview
begins with an abbreviation ("Dr. Smith met..."first_sentence=="Dr").
Note: decimal-split cases ("Scored 3.5..."first_sentence=="Scored 3")
still produce a 2-word fallback title; they are not filtered by this guard.

@Surya-Midde
Copy link
Copy Markdown
Author

Closing in favor of #6676 — same commits, pushed from my primary GitHub account (@middesurya). Sorry for the noise!

@github-actions
Copy link
Copy Markdown
Contributor

Hey @Surya-Midde 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Empty LLM title silently discards valid conversations

2 participants