hocr-tools - manipulate and evaluate hOCR format

hocr-tools is a set of tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

The tools comprise:

hocr-check — performs consistency checks on the hOCR file.
hocr-combine — combine pages in multiple hOCR files into a single document. The document metadata is taken from the first file.
hocr-cut — cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated cleanly, e.g. when cutting double pages or double columns.
hocr-eval — evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors. It works by aligning segmentation components geometrically and, for each component that can be aligned, computing the string edit distance of the text it contains.
hocr-eval-geom — compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
hocr-eval-lines — evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in hocr-actual.html agree (most ASCII output from OCR systems satisfies this requirement).
hocr-extract-g1000 — extract lines from Google 1000 book sample.
hocr-extract-images — extract the images and texts within all the ocr_line elements.
hocr-lines — extract the text within all the ocr_line elements.
hocr-merge-dc — merges the Dublin Core metadata into the hOCR file by encoding the data in its header.
hocr-pdf — create a searchable PDF from a set of hOCR and JPEG files. Each JPEG and its corresponding hOCR file must share the same base name and differ only in their file extension.
hocr-split — split an hOCR file into individual pages. The output pattern should be something like “base-%03d.html”.
hocr-wordfreq — calculate word frequency in an hOCR file. By default, the first 10 words are shown, but any number can be requested with -n.
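
The string edit distance that hocr-eval reports for aligned components can be illustrated with a short sketch. This is not the tool's own code: it is a plain Levenshtein distance with unit costs, and the function names `edit_distance` and `count_text_errors` are hypothetical, chosen only for this example.

```python
def edit_distance(truth: str, actual: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    # prev[j] holds the distance between the processed prefix of `truth`
    # and the first j characters of `actual`.
    prev = list(range(len(actual) + 1))
    for i, t_char in enumerate(truth, start=1):
        curr = [i]
        for j, a_char in enumerate(actual, start=1):
            substitution = prev[j - 1] + (0 if t_char == a_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def count_text_errors(aligned_pairs) -> int:
    """Sum the edit distances over (ground truth, OCR output) text pairs.

    Assumes the pairs were already aligned geometrically; that step is
    not shown here.
    """
    return sum(edit_distance(truth, actual) for truth, actual in aligned_pairs)
```

For instance, `count_text_errors([("hello", "hallo"), ("world", "world")])` counts a single substitution across the two lines.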

Each command-line program is self-contained.
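
The “base-%03d.html” pattern given for hocr-split reads as a printf-style template. The sketch below shows how such a pattern expands under Python's `%` string formatting. The helper `page_filenames` is hypothetical, and numbering pages from 1 is an assumption made for illustration, not something the description above states.

```python
def page_filenames(pattern: str, page_count: int) -> list:
    """Expand a printf-style pattern into one file name per page.

    Assumption: pages are numbered starting at 1.
    """
    return [pattern % page_number for page_number in range(1, page_count + 1)]


if __name__ == "__main__":
    # %03d pads the page number with zeros to three digits.
    for name in page_filenames("base-%03d.html", 3):
        print(name)
```

Zero-padding keeps the generated file names in page order when they are sorted alphabetically.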

Website: github.com/ocropus/hocr-tools
Support:
Developer: Thomas M. Breuel
License: Apache License, Version 2.0

hocr-tools is written in Python.

Related Software

OCR Tools
OCRmyPDF	Adds an OCR text layer to scanned PDFs using the unpaper utility
Paperwork	Simplify the management of your paperwork
OCRFeeder	Desktop OCR suite featuring a complete GTK graphical user interface
gImageReader	Simple Gtk/Qt front-end to Tesseract
gscan2pdf	GUI to produce PDFs or DjVus from scanned documents
lios	linux-intelligent-ocr-solution for converting print into text
hocr-tools	Manipulate and evaluate hOCR format
Skanpage	Simple scanning application optimized for multi-page document scanning
GOCR	Reads images in many formats
QuickSnip	OCR and Google Lens search
ocropy	Open source document analysis and OCR system
