[TAN-7519] Add error handling and retry logic for PDF import jobs#13615
[TAN-7519] Add error handling and retry logic for PDF import jobs#13615AchrafGoVocal wants to merge 19 commits intoTAN-7380-formsync-better-progress-indicatorfrom
Conversation
|
amanda-anderson
left a comment
There was a problem hiding this comment.
FE looks good to me 👍
| QueJob | ||
| .where('id = :id OR data @> :data', id: root_job_id, data: { root_job_id: root_job_id }.to_json) |
There was a problem hiding this comment.
We could extract this as a private jobs method.
| def job_errors | ||
| QueJob | ||
| .where('id = :id OR data @> :data', id: root_job_id, data: { root_job_id: root_job_id }.to_json) | ||
| .where(finished_at: nil) # Filter out jobs that are not finished yet since they might be retried and the error might be resolved in a retry. |
There was a problem hiding this comment.
expired_at is set on failing jobs that they run out of retries.
| .where(finished_at: nil) # Filter out jobs that are not finished yet since they might be retried and the error might be resolved in a retry. | |
| .where.not(expired_at: nil) |
| QueJob | ||
| .where('id = :id OR data @> :data', id: root_job_id, data: { root_job_id: root_job_id }.to_json) | ||
| .where(finished_at: nil) # Filter out jobs that are not finished yet since they might be retried and the error might be resolved in a retry. | ||
| .where.not(last_error_message: nil) |
There was a problem hiding this comment.
This is redundant with the filter in filter_map. In any case, if they are failed, there should be an error message.
| .where.not(last_error_message: nil) |
| ) | ||
|
|
||
| attribute :job_type, &:root_job_type | ||
| attribute :errors, &:job_errors |
There was a problem hiding this comment.
I still don't understand how those error messages get localized by the front-end, since they're essentially just the string representation of the Ruby error (e.to_s or similar, most likely).
There was a problem hiding this comment.
I checked and it's something like: "#{e.class}: #{e.message}".slice(0, 500)
|
|
||
| self.priority = 60 | ||
| perform_retries false | ||
| perform_retries true |
There was a problem hiding this comment.
This can be removed. It's true by default.
There was a problem hiding this comment.
Here's the code we drafted during our call this morning. It's not perfect, but I think it's about as good as we can get without reworking parts of the job system.
private
def handle_error(error)
case error
when *RETRYABLE_ERRORS then super
else expire
end
end
def expire
error = que_target.que_error
finalize_foobar(idea_import_files, import_user, phase, error)
super
end
def finalize_foobar(idea_files, user, phase, error)
SideFxBulkImportService.new.after_failure(user, phase, 'idea', 'pdf', error.to_s)
remaining = count_missing_imports(idea_files)
track_progress(remaining, remaining) if remaining > 0
complete_if_done!
end
def count_missing_imports(idea_files)
file_ids = idea_files.map(&:id)
file_ids.size - IdeaImport.where(file_id: file_ids).count
end
Summary
Makes PDF imports resilient to transient LLM failures and gives admins a clearer view of what happened when things go wrong. Builds on top of the tracker-based progress work from TAN-7380.
Changes
Retry transient LLM errors (
IdeaPdfImportJob)perform_retries truewith aRETRYABLE_ERRORSallowlist:RubyLLM::RateLimitError,OverloadedError,ServerError,ServiceUnavailableError. Everything else falls through to the non-retry path.handle_error— retries allowlisted errors up to Que'smaximum_retry_count, then stops. When retries are exhausted (or the error is non-retryable), the job:unprocessed_files_count) and pushes them throughtrack_progress_and_complete!(remaining, remaining)so the tracker hitstotaland the frontend stops spinning.ErrorReporterwith thephase_idin the extras.expireso Que stops retrying.next if IdeaImport.exists?(file_id: file.id)at the top of the per-file loop so that when Que retries a batch, files that already succeeded on a previous attempt are skipped.after_successonly when work happened — skips thebulk_import_succeededactivity when the batch produced zero ideas (e.g. all files were already imported on a previous retry attempt).Misc backend fixes
IdeaXlsxImportJob— fixedtrack_progress_and_complete!being called with no arguments on the error path, so the failed chunk now properly advances the tracker by 1 with an error count of 1 (was silently stuck before).Importer UI — per-job error counts
ImportStatuscomponent — now takes anerrorCountprop and, when the import has errors, shows"Imported N of M inputs. K could not be imported due to errors."instead of the generic "Errors occurred during the import".errorCountis wired up fromlatestJob?.attributes.error_countinReviewSection.errorImportingmessage is kept as a fallback; added a newerrorImportingWithCountsmessage for the counted variant.Changelog
Added
Fixed
For translators