hocr-tools is a set of tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.
The tools comprise:
hocr-check — performs consistency checks on the hOCR file.
hocr-combine — combine pages in multiple hOCR files into a single document. The document metadata is taken from the first file.
hocr-cut — cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated cleanly, e.g. for cutting double pages or double columns.
hocr-eval — evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.
It works by aligning segmentation components geometrically and, for each component that can be aligned, computing the string edit distance of the text it contains.
hocr-eval-geom — compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
hocr-eval-lines — evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in hocr-actual.html agree (most ASCII output from OCR systems satisfies this requirement).
hocr-extract-g1000 — extract lines from Google 1000 book sample.
hocr-extract-images — extract the images and texts within all the ocr_line elements.
hocr-lines — extract the text within all the ocr_line elements.
hocr-merge-dc — merges the Dublin Core metadata into the hOCR file by encoding the data in its header.
hocr-pdf — create a searchable PDF from a pile of hOCR and JPEG files. Corresponding JPEG and hOCR files must share the same base name and differ only in their file extension.
hocr-split — split an hOCR file into individual pages. The output pattern should be something like “base-%03d.html”.
hocr-wordfreq — calculate word frequency in an hOCR file. By default, the first 10 words are shown, but any number can be requested with -n.
Each command-line program is self-contained.
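To make the descriptions of hocr-lines and hocr-wordfreq more concrete, here is a minimal sketch of the underlying idea: collect the text inside every element whose class includes ocr_line, then count word frequencies. It uses only the Python standard library and is not the tools' own code. The names `OcrLineParser`, `extract_lines`, `word_frequencies`, and the `SAMPLE` markup are hypothetical and introduced only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class OcrLineParser(HTMLParser):
    """Collect the text of every element whose class list contains ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished lines, whitespace-normalised
        self._tag = None    # tag name of the ocr_line element being read
        self._depth = 0     # nesting depth of same-named tags inside it
        self._parts = []    # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if self._tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if "ocr_line" in classes:
                self._tag, self._depth, self._parts = tag, 1, []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace left over from the markup.
                self.lines.append(" ".join("".join(self._parts).split()))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)


def extract_lines(hocr_html):
    """Return the text of each ocr_line element, in document order."""
    parser = OcrLineParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.lines


def word_frequencies(lines, n=10):
    """Return the n most common lower-cased words as (word, count) pairs."""
    counts = Counter(word.lower() for line in lines for word in line.split())
    return counts.most_common(n)


# Hypothetical hOCR fragment used only to exercise the sketch.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <span class='ocr_line' title='bbox 10 10 500 40'>
    <span class='ocrx_word'>the</span> <span class='ocrx_word'>quick</span>
  </span>
  <span class='ocr_line' title='bbox 10 50 500 80'>
    <span class='ocrx_word'>the</span> <span class='ocrx_word'>fox</span>
  </span>
</div>
"""

if __name__ == "__main__":
    lines = extract_lines(SAMPLE)
    print(lines)
    print(word_frequencies(lines, n=1))
```

The default of 10 words mirrors the behaviour described for hocr-wordfreq. How the real tool tokenises and normalises words is not stated here, so splitting on whitespace and lower-casing are assumptions.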
Website: github.com/ocropus/hocr-tools
Support:
Developer: Thomas M. Breuel
License: Apache License, Version 2.0
hocr-tools is written in Python.
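The string edit distance mentioned in the description of hocr-eval can be sketched in a few lines of Python. The version below is the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1). It is an assumption that this is close to what hocr-eval computes, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Compare a ground-truth line with a hypothetical OCR result.
    print(edit_distance("recognition", "recogniton"))
```

Summing this distance over all geometrically aligned components would give one plausible count of character recognition errors, while components that cannot be aligned would be counted as segmentation errors.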
Related Software
| OCR Tools | |
|---|---|
| OCRmyPDF | Adds an OCR text layer to scanned PDFs using the Tesseract OCR engine |
| Paperwork | Simplify the management of your paperwork |
| OCRFeeder | Desktop OCR suite featuring a complete GTK graphical user interface |
| gImageReader | Simple Gtk/Qt front-end to Tesseract |
| gscan2pdf | GUI to produce PDFs or DjVus from scanned documents |
| lios | linux-intelligent-ocr-solution for converting print into text |
| hocr-tools | Manipulate and evaluate hOCR format |
| Skanpage | Simple scanning application optimized for multi-page document scanning |
| GOCR | Reads images in many formats |
| QuickSnip | OCR and Google Lens search |
| ocropy | Open source document analysis and OCR system |

