Text recognition
Optical character recognition (OCR) is a method of extracting text from images and drawings and converting it into a machine-readable format. OCR is often used for faxes or scanned documents. i-net PDFC uses this technology as the basis for several pre-installed filters. Each scenario is described in detail in the documentation of the respective filter.
Tesseract OCR
i-net PDFC's basic OCR plug-in uses the open-source software Tesseract as its default text recognition engine. The current version of Tesseract uses trained neural networks for recognition and thus offers a very high recognition rate for printed text. Handwritten text is not supported.
The configuration specifies which variant of Tesseract is used and which languages are available for text recognition.
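To see which languages a local Tesseract installation offers before entering them in the configuration, the Tesseract command line can be queried directly. The following is a minimal sketch and not part of i-net PDFC: it assumes the `tesseract` binary is on the PATH and relies on its `--list-langs` option. The function names are illustrative only.

```python
import shutil
import subprocess


def parse_lang_list(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages(binary: str = "tesseract") -> list[str]:
    """Return the languages of a local Tesseract install, or [] if it is not found."""
    if shutil.which(binary) is None:
        return []
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: some older releases print the list to stderr instead of stdout.
    return parse_lang_list(result.stdout or result.stderr)


if __name__ == "__main__":
    print(installed_languages() or "Tesseract not found on PATH")
```

An empty result means either that the binary could not be found or that no language data is installed; in both cases the installation should be checked before OCR is enabled.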
Prerequisites
For Tesseract to deliver the best possible results, the following requirements should be met.
- Tesseract must be installed and operational. The functionality can be tested via configuration or recovery.
- A language must be specified. The LanguageDetection plug-in does this automatically, provided that the document contains actual text. If the language is detected incorrectly, it can be set manually.
- Only English is included by default; additional languages must be added manually. If a language is missing, English is used.
- The image resolution must be at least 300 DPI. A resolution of 300 DPI is reached when the lowercase letter x has a height of about 10 pixels.
- The background should be a single uniform colour. Noise in the image should be avoided.
- Text should be aligned horizontally.
- The font should not be exotic. Fonts that work well are included in this list.
- The text should not be handwritten.
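Two of the prerequisites above lend themselves to a quick pre-flight check: the 300 DPI minimum and the fallback to English. The sketch below is illustrative only and does not reflect how i-net PDFC implements these checks. It assumes the physical page width of the scan is known and that languages are named with Tesseract-style codes such as `eng` and `deu`; all function names are hypothetical.

```python
MM_PER_INCH = 25.4
MIN_DPI = 300  # minimum resolution named in the prerequisites


def effective_dpi(pixels: int, physical_mm: float) -> float:
    """Resolution of a scan, from its pixel width and the paper width it covers."""
    return pixels / (physical_mm / MM_PER_INCH)


def meets_min_dpi(pixels: int, physical_mm: float, min_dpi: int = MIN_DPI) -> bool:
    """True if the scan reaches the minimum resolution."""
    # Round to the nearest integer so that 299.96 DPI counts as 300 DPI.
    return round(effective_dpi(pixels, physical_mm)) >= min_dpi


def pick_language(requested: str, installed: list[str], fallback: str = "eng") -> str:
    """Use the requested language if it is installed, otherwise fall back to English."""
    return requested if requested in installed else fallback


if __name__ == "__main__":
    # An A4 page is 210 mm wide; 2480 pixels across that width is about 300 DPI.
    print(meets_min_dpi(2480, 210))
    print(pick_language("fra", ["eng"]))
```

A scan that fails the resolution check is better rescanned at a higher setting than upscaled, since upscaling does not restore detail lost at capture time.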