Excellent Utilities: OCRmyPDF – add OCR text layer to scanned PDFs

Last Updated on May 22, 2022

In Operation

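OCRmyPDF doesn’t offer a graphical front-end. Instead, you run the program from the command line, giving it the input file followed by the output file (the file names here are placeholders), with a command such as:

ocrmypdf input.pdf output.pdf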

We ran a lot of tests on scanned documents, most of them single pages. The process was fast, and each file was processed successfully with no fuss or bother.

To get a better idea of the time taken to complete the process, we took a 395-page PDF. This PDF already has a text layer, so by default OCRmyPDF does not apply OCR. But it’s possible to force the process with the --force-ocr flag. This is useful if the file has been OCRed with an earlier version of Tesseract or other OCR software.
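If you want to script this rather than type it each time, a minimal Python sketch along the following lines should work. It only relies on the ocrmypdf command and its --force-ocr flag; the function names and the file names in the usage note are hypothetical, chosen for illustration.

```python
import shutil
import subprocess


def ocr_command(src, dst, force_ocr=False):
    """Build the argument list for one ocrmypdf run (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if force_ocr:
        # Ask OCRmyPDF to OCR pages even if they already carry a text layer.
        cmd.append("--force-ocr")
    cmd += [str(src), str(dst)]
    return cmd


def run_ocr(src, dst, force_ocr=False):
    """Run ocrmypdf; raises CalledProcessError if the program reports failure."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf was not found on PATH")
    subprocess.run(ocr_command(src, dst, force_ocr), check=True)
```

Calling run_ocr("book.pdf", "book-ocr.pdf", force_ocr=True) would then redo the OCR on a file that already has a text layer. Building the command as a list, rather than a single string, avoids shell quoting problems with file names that contain spaces.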

There are several stages. The first is a scanning stage. On the 395-page PDF, this proceeded at approximately 1.9 seconds a page on a quad-core Intel i5 processor. The scanning phase is single-threaded. The next stage applies the OCR by calling a popular OCR tool, Tesseract. For this part of the process, multiple copies of Tesseract are launched, making much better use of the multi-core processor. This part of the process took, on average, 2.7 seconds per page. There’s also lossless image optimization, courtesy of Ghostscript. Again, that’s a single-core affair, as is the final stage of the process which, like the first stage, calls the ocrmypdf process. The whole process on that 395-page PDF took a whopping 43 minutes.
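To see how those figures relate to the 43-minute total, here is a rough back-of-the-envelope calculation. It assumes the per-page figures are wall-clock averages over the whole document, which the measurements above don’t state explicitly, so treat the split as an estimate rather than a profile.

```python
PAGES = 395
SCAN_SECONDS_PER_PAGE = 1.9   # scanning stage, single-threaded
OCR_SECONDS_PER_PAGE = 2.7    # Tesseract stage, averaged per page
TOTAL_MINUTES = 43            # whole run, as measured


def stage_minutes(pages, seconds_per_page):
    """Convert a per-page time into total minutes for the document."""
    return pages * seconds_per_page / 60


scan_min = stage_minutes(PAGES, SCAN_SECONDS_PER_PAGE)
ocr_min = stage_minutes(PAGES, OCR_SECONDS_PER_PAGE)
# Whatever is left is attributed to optimization and the final stage.
remaining_min = TOTAL_MINUTES - scan_min - ocr_min

print(f"scanning:  {scan_min:.1f} min")
print(f"OCR:       {ocr_min:.1f} min")
print(f"remainder: {remaining_min:.1f} min")
```

Under that assumption, scanning and OCR together account for roughly 30 minutes, leaving a little under 13 minutes for the optimization and final stages.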

Of course, documents you’ll want to OCR will typically be much shorter than 395 pages.

OCRmyPDF doesn’t only apply an OCR layer to PDFs. It can also take an image file as an input. When given an image, the software will try to convert the image to a PDF before processing. This pre-stage uses the Python package img2pdf.

In the video below, we take a sample scanned JPEG file with a size of 2,887,137 bytes. The program recognizes that we have submitted a JPEG image. It checks the validity of the image, and then proceeds to convert it to PDF. After conversion, it calls Tesseract to perform the OCR function. It then checks whether there’s any benefit from image optimization.

As the video indicates, adding the OCR layer increases the file size to 2,902,363 bytes. That’s an increase of 15,226 bytes. Put another way, that’s an increase of a mere 0.53% over the original file.
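The arithmetic is easy to check. The short sketch below uses the two byte counts quoted above; the variable names are only for illustration. Note that the percentage depends on the baseline: measured against the original file the growth is about 0.53%, while measured against the output file it is about 0.52%.

```python
original_bytes = 2_887_137   # sample JPEG before processing
ocr_bytes = 2_902_363        # resulting file with the OCR layer

added_bytes = ocr_bytes - original_bytes
pct_of_original = added_bytes / original_bytes * 100
pct_of_output = added_bytes / ocr_bytes * 100

print(f"added: {added_bytes:,} bytes")
print(f"relative to original: {pct_of_original:.2f}%")
print(f"relative to output:   {pct_of_output:.2f}%")
```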

Features of the program:

  • Generates a searchable PDF/A file from a regular PDF. PDF/A is an ISO-standardized subset of the full PDF specification that is designed for archiving (the ‘A’ stands for Archive). OCRmyPDF generates PDF/A-2b by default.
  • Places OCR text accurately below the image to ease copy / paste.
  • Retains the exact resolution of the original embedded images.
  • When possible, inserts OCR information as a “lossless” operation without disrupting any other content.
  • Optimizes PDF images.
  • If requested, the program deskews and/or cleans the image before performing OCR.
  • Validates input and output files.
  • Distributes work across all available CPU cores. This only applies to the OCR phase of the process unless you use a program like GNU Parallel.
  • Scales well to handle files with thousands of pages.
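Several of these features are switched on from the command line. As a hedged sketch, the helper below maps a few of them to ocrmypdf options (--deskew, --clean, --jobs, --optimize and --output-type). The helper itself and its parameter names are hypothetical, and --clean depends on the separate unpaper program being installed.

```python
def feature_flags(deskew=False, clean=False, jobs=None, optimize=None,
                  output_type=None):
    """Translate feature choices into ocrmypdf options (hypothetical helper)."""
    flags = []
    if deskew:
        flags.append("--deskew")          # straighten skewed pages before OCR
    if clean:
        flags.append("--clean")           # clean pages before OCR (needs unpaper)
    if jobs is not None:
        flags += ["--jobs", str(jobs)]    # limit the number of CPU cores used
    if optimize is not None:
        if optimize not in (0, 1, 2, 3):
            raise ValueError("optimize must be between 0 and 3")
        flags += ["--optimize", str(optimize)]
    if output_type is not None:
        flags += ["--output-type", output_type]  # e.g. "pdfa" or "pdf"
    return flags


# Example: deskew the pages and restrict OCR to two cores.
command = ["ocrmypdf", *feature_flags(deskew=True, jobs=2),
           "scan.pdf", "scan-ocr.pdf"]
print(" ".join(command))
```

Leaving an option at its default value simply omits the flag, so OCRmyPDF’s own defaults (such as PDF/A output) stay in effect.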
