Skip to content

[TAN-7519] Add error handling and retry logic for PDF import jobs#13615

Open
AchrafGoVocal wants to merge 19 commits intoTAN-7380-formsync-better-progress-indicatorfrom
TAN-7519-error-handling-and-retry-formsync
Open

[TAN-7519] Add error handling and retry logic for PDF import jobs#13615
AchrafGoVocal wants to merge 19 commits intoTAN-7380-formsync-better-progress-indicatorfrom
TAN-7519-error-handling-and-retry-formsync

Conversation

@AchrafGoVocal
Copy link
Copy Markdown
Contributor

@AchrafGoVocal AchrafGoVocal commented Apr 10, 2026

Summary

Makes PDF imports resilient to transient LLM failures and gives admins a clearer view of what happened when things go wrong. Builds on top of the tracker-based progress work from TAN-7380.

Changes

Retry transient LLM errors (IdeaPdfImportJob)

  • perform_retries true with a RETRYABLE_ERRORS allowlist: RubyLLM::RateLimitError, OverloadedError, ServerError, ServiceUnavailableError. Everything else falls through to the non-retry path.
  • handle_error — retries allowlisted errors up to Que's maximum_retry_count, then stops. When retries are exhausted (or the error is non-retryable), the job:
    • Counts unprocessed files (unprocessed_files_count) and pushes them through track_progress_and_complete!(remaining, remaining) so the tracker hits total and the frontend stops spinning.
    • Reports the error via ErrorReporter with the phase_id in the extras.
    • Calls expire so Que stops retrying.
  • Idempotent retries — added next if IdeaImport.exists?(file_id: file.id) at the top of the per-file loop so that when Que retries a batch, files that already succeeded on a previous attempt are skipped.
  • after_success only when work happened — skips the bulk_import_succeeded activity when the batch produced zero ideas (e.g. all files were already imported on a previous retry attempt).

Misc backend fixes

  • IdeaXlsxImportJob — fixed track_progress_and_complete! being called with no arguments on the error path, so the failed chunk now properly advances the tracker by 1 with an error count of 1 (was silently stuck before).

Importer UI — per-job error counts

  • ImportStatus component — now takes an errorCount prop and, when the import has errors, shows "Imported N of M inputs. K could not be imported due to errors." instead of the generic "Errors occurred during the import". errorCount is wired up from latestJob?.attributes.error_count in ReviewSection.
  • Copy — existing errorImporting message is kept as a fallback; added a new errorImportingWithCounts message for the counted variant.

Changelog

Added

  • [TAN-7519] Added retry handling for transient LLM errors during PDF imports and surfaced per-job error counts in the importer UI.

Fixed

  • [TAN-7519] Fixed XLSX import failures not advancing the progress tracker.

For translators

@notion-workspace
Copy link
Copy Markdown

@cl-dev-bot
Copy link
Copy Markdown
Collaborator

cl-dev-bot commented Apr 10, 2026

Messages
📖 Changelog provided 🎉
📖 Notion issue: TAN-7519
📖

Run the e2e tests

📖 Check translation progress

Generated by 🚫 dangerJS against e5fac3a

@AchrafGoVocal AchrafGoVocal changed the title [TAN-7519] Add import progress tracking to the frontend [TAN-7519] Add error handling and retry logic for PDF import jobs Apr 10, 2026
@AchrafGoVocal AchrafGoVocal added this to the FormSync milestone Apr 13, 2026
…com:CitizenLabDotCo/citizenlab into TAN-7519-error-handling-and-retry-formsync
…com:CitizenLabDotCo/citizenlab into TAN-7519-error-handling-and-retry-formsync
Copy link
Copy Markdown
Contributor

@amanda-anderson amanda-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FE looks good to me 👍

Copy link
Copy Markdown
Contributor

@adessy adessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there! 💪

Comment on lines +132 to +133
QueJob
.where('id = :id OR data @> :data', id: root_job_id, data: { root_job_id: root_job_id }.to_json)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could extract this as a private jobs method.

def job_errors
QueJob
.where('id = :id OR data @> :data', id: root_job_id, data: { root_job_id: root_job_id }.to_json)
.where(finished_at: nil) # Filter out jobs that are not finished yet since they might be retried and the error might be resolved in a retry.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expired_at is set on failing jobs that they run out of retries.

Suggested change
.where(finished_at: nil) # Filter out jobs that are not finished yet since they might be retried and the error might be resolved in a retry.
.where.not(expired_at: nil)

QueJob
.where('id = :id OR data @> :data', id: root_job_id, data: { root_job_id: root_job_id }.to_json)
.where(finished_at: nil) # Filter out jobs that are not finished yet since they might be retried and the error might be resolved in a retry.
.where.not(last_error_message: nil)
Copy link
Copy Markdown
Contributor

@adessy adessy Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant with the filter in filter_map. In any case, if they are failed, there should be an error message.

Suggested change
.where.not(last_error_message: nil)

)

attribute :job_type, &:root_job_type
attribute :errors, &:job_errors
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand how those error messages get localized by the front-end, since they're essentially just the string representation of the Ruby error (e.to_s or similar, most likely).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and it's something like: "#{e.class}: #{e.message}".slice(0, 500)


self.priority = 60
perform_retries false
perform_retries true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed. It's true by default.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the code we drafted during our call this morning. It's not perfect, but I think it's about as good as we can get without reworking parts of the job system.

    private

    def handle_error(error)
      case error
      when *RETRYABLE_ERRORS then super
      else expire
      end
    end

    def expire
      error = que_target.que_error
      finalize_foobar(idea_import_files, import_user, phase, error)
      super
    end

    def finalize_foobar(idea_files, user, phase, error)
      SideFxBulkImportService.new.after_failure(user, phase, 'idea', 'pdf', error.to_s)

      remaining = count_missing_imports(idea_files)
      track_progress(remaining, remaining) if remaining > 0
      complete_if_done!
    end

    def count_missing_imports(idea_files)
      file_ids = idea_files.map(&:id)
      file_ids.size - IdeaImport.where(file_id: file_ids).count
    end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants