hocr-tools – manipulate and evaluate hOCR format

hocr-tools is a set of tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

The tools comprise:

hocr-check — performs consistency checks on the hOCR file.
hocr-combine — combine pages in multiple hOCR files into a single document. The document metadata is taken from the first file.
hocr-cut — cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated cleanly, e.g. when cutting double pages or double columns.
hocr-eval — evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.
It works by aligning segmentation components geometrically and, for each component that can be aligned, computing the string edit distance of the text that component contains.
hocr-eval-geom — compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
hocr-eval-lines — evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in the ground-truth text file (here true-lines.txt) agree with the ocr_line elements in the hOCR file (here hocr-actual.html). Most ASCII output from OCR systems satisfies this requirement.
hocr-extract-g1000 — extract lines from Google 1000 book sample.
hocr-extract-images — extract the images and texts within all the ocr_line elements.
hocr-lines — extract the text within all the ocr_line elements.
hocr-merge-dc — merges the Dublin Core metadata into the hOCR file by encoding the data in its header.
hocr-pdf — create a searchable PDF from a set of hOCR and JPEG files. Each JPEG and its corresponding hOCR file must share the same base name and differ only in their file extension.
hocr-split — split an hOCR file into individual pages. The output file name pattern should be something like “base-%03d.html”.
hocr-wordfreq — calculate word frequency in an hOCR file. By default, the first 10 words are shown, but any number can be requested with -n.
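
The hocr-eval description above (geometric alignment followed by string edit distance) can be illustrated with a minimal sketch. This is not the tool's actual code: the `Line` type, the `overlap_fraction` helper, the `evaluate` function, and the 0.9 overlap threshold are all assumptions made for illustration, and lines are assumed to have already been parsed into bounding boxes and text.

```python
from typing import List, NamedTuple, Tuple


class Line(NamedTuple):
    """Hypothetical container for one segmentation component."""
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1), like an hOCR bbox
    text: str


def overlap_fraction(a, b):
    """Area of the intersection of boxes a and b, divided by the area of a."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (w * h) / area_a if area_a > 0 else 0.0


def edit_distance(s, t):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cs != ct))) # substitution or match
        prev = cur
    return prev[-1]


def evaluate(truth: List[Line], actual: List[Line], threshold=0.9):
    """Return (segmentation_errors, character_errors).

    A ground-truth line counts as aligned when exactly one OCR line
    overlaps it by at least `threshold` in both directions (an assumed
    criterion); otherwise it counts as a segmentation error.
    """
    seg_errors = 0
    char_errors = 0
    for t in truth:
        matches = [a for a in actual
                   if overlap_fraction(t.bbox, a.bbox) >= threshold
                   and overlap_fraction(a.bbox, t.bbox) >= threshold]
        if len(matches) != 1:
            seg_errors += 1
        else:
            char_errors += edit_distance(t.text, matches[0].text)
    return seg_errors, char_errors
```

Separating the two counts in this way mirrors the distinction drawn in the description: a line that cannot be aligned is charged to segmentation, while text differences inside an aligned line are charged to character recognition.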

Each command-line program is self-contained.
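
To give a feel for what tools such as hocr-lines and hocr-wordfreq do, here is a rough standard-library sketch that collects the text of every ocr_line element and counts words. It is not taken from hocr-tools itself: the `LineTextParser` class and `word_frequencies` function are hypothetical names, and the sketch assumes well-formed XHTML in which every opened tag is closed.

```python
from collections import Counter
from html.parser import HTMLParser


class LineTextParser(HTMLParser):
    """Collect the text inside each element whose class includes ocr_line.

    Assumes well-formed XHTML: an unclosed tag inside a line (such as a
    bare <br>) would throw off the nesting count.
    """

    def __init__(self):
        super().__init__()
        self.lines = []   # text of each finished ocr_line
        self._depth = 0   # nesting depth inside the current line, 0 = outside
        self._parts = []  # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif "ocr_line" in classes:
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Normalise runs of whitespace to single spaces.
                self.lines.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def word_frequencies(hocr, n=10):
    """Return (lines, the n most common whitespace-separated words)."""
    parser = LineTextParser()
    parser.feed(hocr)
    parser.close()
    counts = Counter(word for line in parser.lines for word in line.split())
    return parser.lines, counts.most_common(n)


SAMPLE = """<div class='ocr_page'>
<span class='ocr_line' title='bbox 0 0 100 20'><span class='ocrx_word'>the</span> <span class='ocrx_word'>quick</span></span>
<span class='ocr_line' title='bbox 0 30 100 50'><span class='ocrx_word'>the</span> <span class='ocrx_word'>fox</span></span>
</div>"""

lines, top_words = word_frequencies(SAMPLE)
```

The default of `n=10` follows the hocr-wordfreq description above; splitting on whitespace without case folding is a simplification chosen for this sketch.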

Website: github.com/ocropus/hocr-tools
Support:
Developer: Thomas M. Breuel
License: Apache License, Version 2.0

hocr-tools is written in Python.


Related Software

OCR Tools
OCRmyPDF: Adds an OCR text layer to scanned PDFs using the unpaper utility
Paperwork: Simplify the management of your paperwork
OCRFeeder: Desktop OCR suite featuring a complete GTK graphical user interface
gImageReader: Simple Gtk/Qt front-end to Tesseract
gscan2pdf: GUI to produce PDFs or DjVus from scanned documents
lios: linux-intelligent-ocr-solution for converting print into text
hocr-tools: Manipulate and evaluate hOCR format
Skanpage: Simple scanning application optimized for multi-page document scanning
GOCR: Reads images in many formats
QuickSnip: OCR and Google Lens search
ocropy: Open source document analysis and OCR system


