Hi! 👋

For your setup (digital PDFs, no OCR, ~300 pages), setting num_threads equal to your CPU core count is a reasonable starting point, so multiprocessing.cpu_count() is a sensible default.

However, there are a couple of things to keep in mind:

  1. Diminishing returns: Testing on modern machines shows that adding CPU cores doesn't significantly improve throughput, because of Python's concurrency limitations. Threading helps with parallel internal tasks, but the gains may be modest.

  2. Also set OMP_NUM_THREADS: For a more consistent setup, set the OMP_NUM_THREADS environment variable to match your thread count. For example, on a 4-core machine:

    import os
    # Set this before importing libraries that read OMP_NUM_THREADS at startup.
    os.environ["OMP_NUM_THREADS"] = "4"
  3. C…
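
To tie points 1 and 2 together, here is a minimal sketch that derives the thread count from the machine instead of hard-coding "4". The helper name `configure_thread_count` and its `max_threads` parameter are made up for illustration and are not part of any library; how you then pass the value as num_threads depends on your pipeline configuration and is not shown here.

```python
import multiprocessing
import os


def configure_thread_count(max_threads=None):
    """Pick a thread count from the CPU core count and export it as OMP_NUM_THREADS.

    Hypothetical helper for illustration only, not a library API.
    """
    cores = multiprocessing.cpu_count()  # number of logical CPUs
    # Optionally cap the count, since gains tend to flatten out with more threads.
    threads = cores if max_threads is None else max(1, min(cores, max_threads))
    # Call this before importing libraries that read OMP_NUM_THREADS at startup.
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads


num_threads = configure_thread_count()
print(f"num_threads={num_threads}, OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
```

Given the diminishing returns mentioned above, it may be worth timing a run with a smaller cap (for example `configure_thread_count(max_threads=4)`) against the full core count before settling on a value.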

Answer selected by ofcat