Tesseract

Optical character recognition (OCR) extracts text from images and drawings and converts it into a machine-readable format. OCR is often used for scanned documents. i-net PDFC uses this technology as the basis for several pre-installed filters. The individual scenarios are described in the documentation of the respective filters.

The OCR in i-net PDFC is based on Tesseract and requires at least version 4. The OCR configuration depends on the operating system on which i-net PDFC is installed. This configuration page mainly displays information about the availability of Tesseract:

Current State - current Tesseract availability
Settings - additional settings, depending on operating system and state

Note: Tesseract 4 and 5 are supported. These must not be alpha or beta versions.
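
The version rule in the note above can be checked by hand before configuring an installation. The following is a minimal sketch and not part of i-net PDFC: the function name is_supported_tesseract_version is hypothetical, and it assumes a version string such as 5.3.0 or v4.0.0-beta.1, as typically printed on the first line of tesseract --version.

```shell
#!/bin/sh
# Sketch: decide whether a Tesseract version string is acceptable.
# Rule taken from the note above: major version 4 or 5, no alpha or beta builds.
# The function name is hypothetical; it is not provided by i-net PDFC or Tesseract.
is_supported_tesseract_version() {
    version=${1#v}                     # drop a leading "v", as in "v5.0.0"
    case "$version" in
        *alpha*|*beta*) return 1 ;;    # pre-release builds are rejected
    esac
    major=${version%%.*}               # text before the first dot
    case "$major" in
        4|5) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage (requires Tesseract to be installed):
#   installed=$(tesseract --version 2>&1 | head -n 1 | awk '{print $2}')
#   is_supported_tesseract_version "$installed" && echo "supported"
```

The check is deliberately strict: any version string containing alpha or beta is rejected, even if the major version is 4 or 5.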

Current State

The Current State section reflects information from the backend system and indicates whether Tesseract is functional.

Tesseract variant: the variant used to provide Tesseract functions. Can be either Windows or Custom Installation. A custom installation is required on all non-Windows operating systems.

Status: OK if Tesseract can be used without issues. Otherwise an error is displayed.

Version: the Tesseract version detected by the plugin. Otherwise an error is displayed.

Available Languages: the list of languages detected from the plugin settings. Otherwise an error is displayed.

Tesseract variant: Windows

The Visual C++ Redistributable 2015 package has to be installed on the Windows system, which can be done in one of the following ways:

automatically: choco install vcredist2015
manually: download and install from Microsoft

Tesseract variant: Custom Installation

For custom installations, please check Install Tesseract for installation details on Linux and Windows systems. macOS users can usually use one of the following commands to install Tesseract 5 via the package manager MacPorts or Homebrew:

sudo port install tesseract
 
# or
 
brew install tesseract

Additional languages

To support languages other than English, the corresponding language files must be installed manually: download the matching *.traineddata files and move them into the <installation>/lang/tessdata folder or the customized path. Afterwards, restart the i-net PDFC server.
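
The copy step can be scripted. The following is a minimal sketch under stated assumptions: the function name install_language_file is hypothetical, the language file has already been downloaded, and the target folder (for example <installation>/lang/tessdata) already exists. The server restart is not covered and still has to be done afterwards.

```shell
#!/bin/sh
# Sketch: copy one downloaded language file into the tessdata folder.
# install_language_file is a hypothetical helper, not an i-net PDFC command.
#   $1 = path to a downloaded *.traineddata file (e.g. deu.traineddata for German)
#   $2 = tessdata folder used by i-net PDFC
install_language_file() {
    src=$1
    tessdata_dir=$2
    case "$src" in
        *.traineddata) ;;              # expected file extension
        *) echo "not a .traineddata file: $src" >&2; return 1 ;;
    esac
    [ -f "$src" ] || { echo "file not found: $src" >&2; return 1; }
    [ -d "$tessdata_dir" ] || { echo "folder not found: $tessdata_dir" >&2; return 1; }
    cp "$src" "$tessdata_dir"/
}

# Usage (paths are placeholders):
#   install_language_file ./deu.traineddata "/path/to/installation/lang/tessdata"
```

The helper refuses to create a missing target folder so that a mistyped path surfaces as an error instead of silently producing a folder that i-net PDFC never reads.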

Settings

Tesseract executable: the path and file name of the Tesseract main binary. If the binary is on the PATH, the binary name tesseract alone should suffice. This entry is only shown for the Custom Installation Tesseract variant.
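
To see what a given value for this setting would resolve to on the command line, a lookup along the following lines can help. This is a sketch that mirrors how a POSIX shell resolves commands, which is an assumption and not necessarily how i-net PDFC performs the lookup; the function name resolve_tesseract_executable is hypothetical.

```shell
#!/bin/sh
# Sketch: resolve a configured executable value the way a POSIX shell would.
# resolve_tesseract_executable is a hypothetical helper for manual checks.
#   $1 = configured value; defaults to "tesseract" when empty
resolve_tesseract_executable() {
    configured=${1:-tesseract}
    case "$configured" in
        */*) # contains a slash: treat it as a path and require an executable file
             [ -x "$configured" ] && printf '%s\n' "$configured" ;;
        *)   # bare name: look it up via the PATH environment
             command -v "$configured" ;;
    esac
}

# Usage:
#   resolve_tesseract_executable tesseract || echo "tesseract is not on the PATH"
```

The function prints the resolved location and returns a non-zero status if nothing usable is found.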

Path to '.traineddata' language files: the path to the Tesseract language files. If left empty, the default folder lang/tessdata is used.
- Note: The English language file is always required.
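
The content of a language folder can be inspected before entering its path here. The following is a minimal sketch with hypothetical function names: it lists the language codes found in a folder by file name and checks that the English file, eng.traineddata, is present. It only looks at file names and does not verify that the files are valid.

```shell
#!/bin/sh
# Sketch: list language codes in a tessdata folder and check for English.
# Both function names are hypothetical helpers for manual checks.

# Print one language code per *.traineddata file found in folder $1.
list_traineddata_languages() {
    dir=$1
    for file in "$dir"/*.traineddata; do
        [ -e "$file" ] || continue     # skip the unexpanded pattern if nothing matched
        name=${file##*/}               # strip the folder part
        printf '%s\n' "${name%.traineddata}"
    done
}

# Succeed only if folder $1 contains eng.traineddata.
has_english() {
    list_traineddata_languages "$1" | grep -qx 'eng'
}

# Usage (path is a placeholder):
#   list_traineddata_languages "/path/to/installation/lang/tessdata"
#   has_english "/path/to/installation/lang/tessdata" || echo "eng.traineddata is missing"
```

With Tesseract installed, the output can be compared against tesseract --list-langs, which reports the languages Tesseract itself detects.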