Austin Acton Linux OCR: A review of free optical character recognition software.

Source of information: http://www.austinacton.com/

I've used Linux as my full-time desktop for seven years now. I have almost no reason to use Windows (other than stupid ExamSoft), and even when I do, I don't have much Windows software available. The one "hole" in my workflow has been OCR. For years, people have been able to scan a document and have it converted into real text. One of my old printers even came with OCR software included - for Windows of course. But whenever I've really needed OCR, I've just assumed that there were no high-quality packages available for Linux.

Recently I decided to find out for myself (a complete OCR virgin) what is available, how to use it, and what the results are like. I installed every free OCR package I could find, and systematically tested them. They all work very differently, so I tried to design a simple test for my specific needs.

As a test subject, I entered the following line into a word processor:

The quick brown Métis jumped over the fluffy Finance Manager.

I specifically included an accented character, some capital letters, and the "fl" and "ff" combos which tend to overlap in serif fonts. I then printed the sentence sixteen times: in each of Times New Roman, Vera Serif, Arial, and Vera SansSerif, at 10, 12, 14, and 16 point size.
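
For anyone repeating the experiment, the sixteen prints form a simple four-by-four grid of typeface and point size. A minimal shell sketch that just enumerates that grid might look like this (the function name test_matrix and the "typeface|size" output format are illustrative assumptions, not part of the original test):

```shell
# Print one "typeface|size" line for each of the sixteen test prints.
# The typefaces and sizes are the ones listed in the text above.
test_matrix() {
    for face in "Times New Roman" "Vera Serif" "Arial" "Vera SansSerif"; do
        for size in 10 12 14 16; do
            printf '%s|%s\n' "$face" "$size"
        done
    done
}
```

Calling test_matrix gives a checklist that can be ticked off as each sheet is printed and scanned.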

…

TESSERACT

Recently, HP (one of Free software's good friends, when convenient) released Tesseract, OCR code it developed between 1985 and 1994, under the Apache license. A group of volunteers has created a new home for it at Google Code, and it is now under active development again. According to the bundled self-promotion, it was one of the best-performing OCR engines of its day, so if development continues, it may become a great gift to the community.

As Tesseract is too new to have been included with Mandriva 2007, I had to compile it myself. Other than tiff-devel, it required very few development libraries.

I downloaded the tarball from the Google Code site, and was pleased to see that the build is based on GNU autotools. (Not that I love autotools, but many old packages just have an ugly, edit-it-yourself makefile.) A simple `./configure; make` had it building. The build failed at several points, seemingly caused by the very recent version of gcc we are using, but patches were available, easily applied, and the build continued without problems.

Tesseract is an engine and framework for other software to build upon, so it only supports a single column of very horizontal text. It does however seem to include some extra code for viewing, training, and spell checking the interpreted text.

Usage is a bit rough, but understandably so for such young and active software. There is no man page, no [-h] or [--help] option, and it crashes if launched without proper arguments. Again, the Google Code website provided much help. It seems to support only TIFF at the moment, so I had to use ImageMagick to convert all of my scanned PNM files to TIFF, with:

$ for i in *.pnm; do convert "$i" "${i%.pnm}.tiff"; done

Running the software required a simple:

$ /usr/local/bin/tesseract g400.tiff g400.txt
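
With a directory full of scans, the same call can be wrapped in a loop. The following is only a sketch: the function name ocr_all and the OCR_CMD variable are made up for illustration, and the loop simply repeats the invocation shown above for every .tiff file it finds.

```shell
# OCR_CMD is a hypothetical variable; it defaults to the binary path used above.
OCR_CMD=${OCR_CMD:-/usr/local/bin/tesseract}

# ocr_all [DIR]: run the OCR command on every .tiff file in DIR (default: .).
ocr_all() {
    dir=${1:-.}
    for img in "$dir"/*.tiff; do
        # Skip the unexpanded pattern when the directory has no .tiff files.
        [ -e "$img" ] || continue
        # Same argument form as the single command above. Newer Tesseract
        # releases treat the second argument as a base name and append
        # ".txt" themselves, so adjust this line if that applies to you.
        "$OCR_CMD" "$img" "${img%.tiff}.txt"
    done
}
```

Running `ocr_all .` in the scan directory would then process every converted file in one go.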

Much to my surprise, the output was near perfect. There were a few errors at font size 10 and 16. In no case was the é interpreted properly. However, disregarding the é, font sizes 12 and 14 were perfectly interpreted in all four typefaces. Very impressive. I'm not sure if the unaccented e is a result of English-only spell checking or not.
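
One rough way to put a number on results like these is to compare the OCR output with the known test sentence word by word. The sketch below is an assumption about how such a check could be done, not the method behind the accuracy figures in this review; the function name word_accuracy is invented for the example, and the comparison is deliberately crude (a single inserted or dropped word shifts everything after it).

```shell
# word_accuracy EXPECTED ACTUAL
# Print the percentage (integer, rounded down) of whitespace-separated
# words in EXPECTED that appear unchanged at the same position in ACTUAL.
word_accuracy() {
    expected=$1
    actual=$2
    total=0
    correct=0
    set -f              # no filename globbing while we split on whitespace
    set -- $actual      # ACTUAL's words become the positional parameters
    for word in $expected; do
        total=$((total + 1))
        if [ "$#" -gt 0 ]; then
            if [ "$word" = "$1" ]; then
                correct=$((correct + 1))
            fi
            shift       # move on to the next word of ACTUAL
        fi
    done
    set +f              # re-enable globbing
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((correct * 100 / total))
    fi
}
```

For example, `word_accuracy "$(cat expected.txt)" "$(cat g400.txt)"` would score one output file against a reference file (expected.txt being a hypothetical file holding the test sentence). A character-level measure would be finer-grained, but is harder to do portably in plain shell once accented characters are involved.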

While Tesseract is the least user-friendly of the command line applications, it is by far the most accurate, the most active, and the most promising.

Name:	Tesseract
Location:	http://code.google.com/p/tesseract-ocr/
Version:	1.04b
Input Format:	TIFF
Accuracy:	99%
Ease of Use:	2/5

OCROPUS

Ocropus is the mother lode of Free OCR. It began as a combination of a handwriting analysis engine and a layout analysis engine. Tesseract has been integrated as the OCR engine, but it allows for other OCR engines to be plugged in as well. The project is now sponsored by Google.

According to the Google Code site:

OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

Anticipated future functionalities include:

statistical, trainable layout analysis
efficient OCR and layout correction, active learning
a web services interface
PDF, camera, and screen OCR
support for additional languages
integration with Beagle, Spotlight, Google Desktop Search
GUI frontends for Gnome, Windows, Macintosh
packaging for Ubuntu, Fedora, and other platforms.

Wow. Those are some lofty ambitions! Although it is in a pre-alpha state currently, I couldn't help but try it out.

As a tarball is not available, I checked out the development tree:

$ svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus

In addition to tesseract, a few development libraries had to be installed, but this is a breeze with urpmi:

# urpmi jam aspell-devel libtiff-devel libpng-devel libjpeg-devel

Compiling was easy:

$ ./configure; jam

There is very little documentation bundled with the project. The only test I performed was a direct OCR of my 400 DPI greyscale standard. The default output of ocropus is HTML, with some embedded OCR-specific tags. As expected, the results were the same quality as Tesseract, with the é being the only incorrectly interpreted character. I've posted the resulting HTML file in case you're interested.
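
If all you want is the recognized text, the markup can be stripped with a one-line filter. This is a minimal sketch under stated assumptions: the function name strip_tags is invented here, it assumes every tag opens and closes on the same line, and it does not decode HTML entities.

```shell
# strip_tags: read HTML on standard input, write it back without tags.
# Assumes each <...> tag sits on a single line; entities are left as-is.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}
```

Something like `strip_tags < output.html > output.txt` (both file names hypothetical) would leave plain text that is easier to compare against the Tesseract results.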

Ocropus is easily the most interesting and promising project here, and with Google's interest in OCR, their sponsorship may be very helpful.

Name:	Ocropus
Location:	http://code.google.com/p/ocropus/
Version:	svn (20070523)
Input Format:	many
Accuracy:	99%
Ease of Use:	1/5

…

IN CLOSING

The good news is that there are solutions available on Linux right now which interpret documents at up to 99% accuracy. The bad news is that 99% is not 100%, and that accuracy drops off very quickly for anything other than a high-quality 400-600 DPI scan of 12-14 point text. The combination of Tesseract and Ocropus is clearly the project we can most rely on to provide the missing elements of a full-featured Free OCR suite.

I learned a lot about OCR in writing this. I realize it's not a perfectly scientific review, but I hope it can be useful to other Linux OCR newbies as well as reminding people that Free software alternatives are available and need your help.