Excellent Utilities: OCRmyPDF - add OCR text layer to scanned PDFs

Last Updated on May 22, 2022

This is a series highlighting best-of-breed utilities. We cover a wide range of utilities including tools that boost your productivity, help you manage your workflow, and lots more besides. There’s a complete list of the tools in this series in the Summary section.

Optical Character Recognition (OCR) is a visual recognition process that turns printed or written text into an electronic character-based file. This makes the document searchable and offers the ability to copy-paste its contents.

PDF is generally considered to be an excellent format for storing and exchanging scanned documents. Unfortunately, PDFs aren’t trivial to modify. OCRmyPDF makes it easy to apply image processing and OCR to existing PDFs. The program adds an OCR text layer to scanned PDF files. It’s a command-line only affair.

Let’s get an important distinction out of the way. If you create a PDF document from an electronic source, it will already contain a text layer, so no OCR is needed. Native PDF files have an internal structure that can be read and interpreted. These “generated” PDF documents already contain characters that have an electronic character designation. The most popular office suite for Linux is LibreOffice. That suite automatically includes the text in documents exported to the PDF format. For this scenario, you don’t need OCRmyPDF.

PDF documents are also created by scanning a paper document into an electronic format. Typically, this is done with a flatbed scanner. The scanner takes a “snapshot” of the paper document. This snapshot is turned into a PDF (or another format such as JPG or TIFF). This is a “scanned” PDF document, which often won’t have a text layer. Want to add that text layer? Step forward OCRmyPDF.
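If you’re not sure which kind of PDF you have, one rough test is to try extracting text from it. The sketch below assumes pdftotext (part of Poppler) is installed; that tool isn’t mentioned above and isn’t part of OCRmyPDF, and the function names are just illustrative. It’s a heuristic only: a scanned PDF with a poor existing text layer will still count as having text.

```shell
# Succeeds (exit status 0) if any letters or digits can be extracted from
# the PDF, which suggests a text layer is already present.
# Assumption: pdftotext from Poppler is available on the PATH.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Print a one-line verdict for the given PDF (hypothetical helper).
describe_pdf() {
    if has_text_layer "$1"; then
        echo "$1: text layer found, OCR probably not needed"
    else
        echo "$1: no text found, candidate for OCRmyPDF"
    fi
}
```

Running describe_pdf scan.pdf (where scan.pdf is a placeholder filename) prints one of the two messages.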

Installation

The installation procedure depends on the Linux distro you’re using. On my Arch-based system, installation is trivial, as there’s a package in the Arch User Repository.

Installing the package pulls in a number of other programs including tesseract, img2pdf, pngquant, unpaper, and various Python packages.
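To confirm those helper programs really did land on your system, a small check like the following may be handy. It’s a minimal sketch: missing_tools is a made-up helper name, and it assumes each program is installed under an executable of the same name.

```shell
# Print the name of each given command that cannot be found on the PATH.
# Prints nothing if every command is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (output depends on what is installed on your system):
# missing_tools ocrmypdf tesseract img2pdf pngquant unpaper
```

Any name it prints is a program your package manager still needs to install.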

You’ll also need a language pack.

$ sudo pacman -S tesseract-data-eng

I’m using the English language pack for Tesseract, but Tesseract supports more than 100 languages. Just install the relevant language pack(s) for your requirements. There’s also support for multilingual documents.
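For multilingual documents, OCRmyPDF’s -l option takes several Tesseract language codes joined with a plus sign, such as eng+deu. Here’s a rough sketch of a wrapper that builds that argument. The helper names and filenames are hypothetical, and it assumes the matching language packs are already installed (tesseract --list-langs shows what’s available).

```shell
# Join language codes with "+" (eng deu -> eng+deu), the form accepted
# by the -l option for multilingual documents.
join_langs() {
    ( IFS=+; echo "$*" )
}

# Hypothetical helper: ocr_multilang INPUT OUTPUT LANG...
# Runs OCRmyPDF on INPUT, writing OUTPUT, using all the given languages.
ocr_multilang() {
    input=$1
    output=$2
    shift 2
    ocrmypdf -l "$(join_langs "$@")" "$input" "$output"
}
```

For example, ocr_multilang scan.pdf scan-ocr.pdf eng deu is equivalent to running ocrmypdf -l eng+deu scan.pdf scan-ocr.pdf.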

Pages in this article:
Page 1 – Introduction / Installation
Page 2 – In Operation
Page 3 – Summary

Complete list of articles in this series:

Excellent Utilities
AES Crypt	Encrypt files using the Advanced Encryption Standard
Ananicy	Shell daemon created to manage processes’ IO and CPU priorities
broot	Next gen tree explorer and customizable launcher
Cerebro	Fast application launcher
cheat.sh	Community driven unified cheat sheet
CopyQ	Advanced clipboard manager
croc	Securely transfer files and folders from the command-line
Deskreen	Live streaming your desktop to a web browser
duf	Disk usage utility with more polished presentation than the classic df
eza	A turbo-charged alternative to the venerable ls command
Extension Manager	Browse, install and manage GNOME Shell Extensions
fd	Wonderful alternative to the venerable find
fkill	Kill processes quick and easy
fontpreview	Quickly search and preview fonts
horcrux	File splitter with encryption and redundancy
Kooha	Simple screen recorder
KOReader	Document viewer for a wide variety of file formats
Imagine	A simple yet effective image optimization tool
LanguageTool	Style and grammar checker for 30+ languages
Liquid Prompt	Adaptive prompt for Bash & Zsh
lnav	Advanced log file viewer for the small-scale; great for troubleshooting
lsd	Like eza, lsd is a turbo-charged alternative to ls
Mark Text	Simple and elegant Markdown editor
McFly	Navigate through your bash shell history
mdless	Formatted and highlighted view of Markdown files
navi	Interactive cheatsheet tool
noti	Monitors a command or process and triggers a notification
Nushell	Flexible cross-platform shell with a modern feel
nvitop	GPU process management for NVIDIA graphics cards
OCRmyPDF	Add OCR text layer to scanned PDFs
Oh My Zsh	Framework to manage your Zsh configuration
Paperwork	Designed to simplify the management of your paperwork
pastel	Generate, analyze, convert and manipulate colors
PDF Mix Tool	Perform common editing operations on PDF files
peco	Simple interactive filtering tool that's remarkably useful
ripgrep	Recursively search directories for a regex pattern
Rnote	Sketch and take handwritten notes
scrcpy	Display and control Android devices
Sticky	Simulates the traditional “sticky note” style stationery on your desktop
tldr	Simplified and community-driven man pages
tmux	A terminal multiplexer that offers a massive boost to your workflow
Tusk	An unofficial Evernote client with bags of potential
Ulauncher	Sublime application launcher
Watson	Track the time spent on projects
Whoogle Search	Self-hosted and privacy-focused metasearch engine
Zellij	Terminal workspace with batteries included
