The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google and is probably one of the most accurate open-source OCR engines available.
Combined with the Leptonica Image Processing Library, it can read a wide variety of image formats and convert them to text in over 40 languages.
Command Line usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
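As an illustration of the usage line above, here is a minimal Python sketch that assembles the same argument list and runs it. The helper names build_tesseract_command and run_ocr are hypothetical, not part of Tesseract; the sketch assumes the executable is installed as "tesseract" on the PATH and that the default plain-text output is written to outputbase.txt.

```python
import shutil
import subprocess


def build_tesseract_command(image, outputbase, lang=None, psm=None, configs=()):
    """Assemble: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]"""
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang is not None:
        cmd += ["-l", lang]          # language code, e.g. "eng"
    if psm is not None:
        cmd += ["-psm", str(psm)]    # page segmentation mode, as spelled in this release
    cmd += list(configs)             # optional config file names come last
    return cmd


def run_ocr(image, outputbase, **options):
    """Run tesseract and return the path of the text file it is assumed to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)
    return f"{outputbase}.txt"       # assumption: default text output, no output config
```

For example, run_ocr("page.png", "page", lang="eng") would invoke "tesseract page.png page -l eng" and, under the assumptions above, leave the recognized text in page.txt.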
What's New in This Release:
· Moved ResultIterator/PageIterator to ccmain.
· Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
· Added paragraph detection in layout analysis/post OCR.
· Fixed inconsistent xheight during training and over-chopping.
· Added simultaneous multi-language capability.
· Refactored top-level word recognition module.
· Added experimental equation detector.
· Improved handling of resolution from input images.
· Blamer module added for error analysis.
· Cleaned up externally used namespace by removing includes from baseapi.h.
· Removed dead memory management code.
· Tidied up constraints on control parameters.
· Added support for ShapeTable in classifier and training.
· Refactored class pruner.
· Fixed training leaks and randomness.
· Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, and better tabstop finding.
· Improved line detection and removal.
· Added fixed pitch ch...