@dosu hi :) What is the optimal number of CPU cores to use when working with a simple PDF pipeline on a CPU-only machine? I am working with a single digital PDF with around 300 pages on average. pipeline_options = PdfPipelineOptions(
Replies: 2 comments 5 replies
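Hi! 👋 For your setup (digital PDFs, no OCR, ~300 pages), setting `num_threads` equal to your CPU core count is a reasonable starting point, so `multiprocessing.cpu_count()` is a sensible default. However, there are a couple of things to keep in mind: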
Your current config looks fine. If you want to experiment, try starting with half your core count and work up to see if there's any noticeable improvement, but don't be surprised if doubling threads doesn't double performance.
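The "start at half your cores and work up" experiment can be sketched as a small timing harness. This is a minimal sketch, not the library's own tooling: `candidate_thread_counts`, `benchmark`, and `fake_convert` are hypothetical names, and `fake_convert` is a stand-in that you would replace with a function that builds the converter with the given thread count and converts your PDF.

```python
import os
import time
from typing import Callable, Dict, List


def candidate_thread_counts(cores: int) -> List[int]:
    """Thread counts to try: start at half the cores and step up to all of them."""
    start = max(1, cores // 2)
    return list(range(start, cores + 1))


def benchmark(run: Callable[[int], None], counts: List[int], repeats: int = 3) -> Dict[int, float]:
    """Return the best wall-clock time in seconds observed for each thread count."""
    results: Dict[int, float] = {}
    for n in counts:
        timings = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            run(n)
            timings.append(time.perf_counter() - t0)
        # The minimum is less sensitive to background noise than the mean.
        results[n] = min(timings)
    return results


def fake_convert(num_threads: int) -> None:
    """Hypothetical stand-in for a real conversion that uses num_threads threads."""
    time.sleep(0.001)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    for n, seconds in benchmark(fake_convert, candidate_thread_counts(cores)).items():
        print(f"{n} threads: {seconds:.3f}s")
```

Timing a short representative sample of the document (rather than all ~300 pages) keeps each run quick while still showing where the curve flattens.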
@dosu I have experimented and found that using 10 cores yields the best speed. Here is the profiling result: However, I wonder whether 25 seconds for a 5-page PDF is a good speed on my CPU-only machine. Could you please advise how I can achieve faster speeds for my pipeline? I was thinking of splitting the PDF into multiple chunks and then creating multiple processes that convert those chunks in parallel. Thank you
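The chunk-and-parallelize idea described above could look roughly like the following. It is a sketch under stated assumptions: `split_page_ranges` and `convert_in_parallel` are hypothetical helpers, and `convert_range` is a placeholder for whatever function converts one page range of the PDF (it has to be a top-level function so it can be pickled and sent to worker processes).

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Tuple

PageRange = Tuple[int, int]  # 1-based, inclusive (first_page, last_page)


def split_page_ranges(total_pages: int, n_chunks: int) -> List[PageRange]:
    """Split pages 1..total_pages into at most n_chunks contiguous, near-equal ranges."""
    if total_pages < 1 or n_chunks < 1:
        raise ValueError("total_pages and n_chunks must be positive")
    n_chunks = min(n_chunks, total_pages)
    base, extra = divmod(total_pages, n_chunks)
    ranges: List[PageRange] = []
    start = 1
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def convert_in_parallel(
    convert_range: Callable[[PageRange], str],
    ranges: List[PageRange],
    max_workers: int,
) -> List[str]:
    """Run convert_range on each page range in separate processes, preserving order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_range, ranges))
```

For example, `split_page_ranges(300, 4)` yields four 75-page ranges. When combining processes with per-process threads, keeping processes × threads at or below the core count avoids oversubscribing the CPU, and each worker pays its own startup cost, so very small chunks may not pay off.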
Hi! 👋
For your setup (digital PDFs, no OCR, ~300 pages), setting `num_threads` equal to your CPU core count is a reasonable starting point. So `multiprocessing.cpu_count()` is a sensible default.

However, there are a couple of things to keep in mind:
- Diminishing returns: Testing on modern machines shows that increasing CPU cores doesn't significantly improve throughput due to Python's concurrency limitations. The threading helps with parallel internal tasks, but gains may be modest.
- Also set `OMP_NUM_THREADS`: For a more consistent setup, set the `OMP_NUM_THREADS` environment variable to match your thread count. For example, on a 4-core machine: C…
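One way to apply the `OMP_NUM_THREADS` advice from inside a Python script is sketched below. `pin_thread_env` is a hypothetical helper, and the value 4 simply mirrors the 4-core example mentioned above. The variable is typically read when the OpenMP runtime initializes, so this should run before importing any numeric or ML libraries.

```python
import os


def pin_thread_env(num_threads: int) -> None:
    """Set OMP_NUM_THREADS for this process and any child processes it starts."""
    if num_threads < 1:
        raise ValueError("num_threads must be positive")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


# Call this at the very top of the script, before heavy imports.
pin_thread_env(4)
```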