Comparison Profile
A Compare Profile for PDFC contains parameters and settings for the comparison of documents. Different Compare Profiles can lead to very different comparison results, so it may be necessary to adjust or optimize them for certain comparison scenarios.
Compare Profiles provided by i-net PDFC or shared by the administrator cannot be changed by the user. These profiles can, however, be duplicated and then customized to the user's own needs. With administrative permissions (the "configuration" permission), a Compare Profile can be shared with other users on the server.
Note: A shared configuration cannot be unshared; it has to be deleted.
Manage Compare Profiles
The Compare Profiles provided by i-net PDFC can be activated or deactivated in the configuration under Comparison > Profiles.
The footer of the Compare Profile window lets you manage Compare Profiles. The currently open profile can be duplicated, exported, published and deleted. In addition, settings for a Compare Profile can be imported.
Note: In this way, for example, Compare Profiles can be exported from the i-net PDFC GUI and imported into the server, and vice versa. Exported Compare Profiles can also be edited outside the application, which makes it possible to use settings that cannot be set in the configuration interface, for example headers and footers that are more than 100 pixels high. The possible settings can be found in the respective tables.
The following settings are optional and only used for managing the profiles. These properties have no effect on the comparison.
Property Name | Description |
---|---|
PROFIL_NAME | A unique name to differentiate the profiles. |
PROFIL_DESCRIPTION | A descriptive text about the profile. |
Import/Export of profiles
Each profile can be stored in an external file by clicking Export at the bottom of the profile panel. The export files are portable and can be used with any type of i-net PDFC installation: GUI, API and Server.
To import a profile, create a custom profile and select it as the active one. The Import label will then be available at the bottom of the profile configuration page. Click this label to select a file to import. Alternatively, drag and drop a profile XML file into the panel to load the settings.
The imported profile will replace all settings of the current profile.
Profile settings
The behavior and the precision of i-net PDFC can be specified using the configuration properties. These configuration properties of i-net PDFC are included in the file config.xml.
After the installation, the installation folder contains the config.xml with default values. You can change the default values by editing this file. If it does not exist, i-net PDFC uses the default values.
Note: The graphical user interface uses the file .pdfc in the current user's home directory instead of the config.xml file in the installation directory. You can edit this file in the same way to perform some special fine-tuning, but this is not necessary in most cases.
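As a rough illustration of editing such a settings file outside the application, the following sketch reads and rewrites key/value settings. It assumes the file uses a Java-style XML properties layout (`<properties>` with `<entry key="...">` elements); the text above does not confirm this layout, so check an actual exported profile before relying on it. The helper names `read_profile` and `write_profile` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a Java-style XML properties layout. The real profile file
# may be structured differently; inspect an exported profile first.
SAMPLE = """<properties>
  <entry key="CONTINUOUS_COMPARE">CONTINUOUS</entry>
  <entry key="FILTERS">HEADERFOOTER,MULTICOLUMN</entry>
</properties>"""


def read_profile(xml_text: str) -> dict[str, str]:
    """Collect all <entry key="..."> elements into a dictionary."""
    root = ET.fromstring(xml_text)
    return {e.get("key"): (e.text or "") for e in root.iter("entry")}


def write_profile(settings: dict[str, str]) -> str:
    """Serialize the dictionary back into the assumed XML layout."""
    root = ET.Element("properties")
    for key, value in settings.items():
        ET.SubElement(root, "entry", key=key).text = value
    return ET.tostring(root, encoding="unicode")


profile = read_profile(SAMPLE)
profile["CONTINUOUS_DETECT_PAGES"] = "10"  # property name taken from this page
updated = write_profile(profile)
print(updated)
```

Keeping the settings in a plain dictionary makes it easy to compare two profiles or to apply the same change to several exported files.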
A profile basically contains the settings for comparison mode, element comparison types and filters to be used. Each filter or comparison type may have additional options to fine-tune the feature.
Comparison mode
Property Name | Description |
---|---|
CONTINUOUS_COMPARE | Defines which comparison engine is used. Currently available are 'CONTINUOUS' and 'STRICT'. The default value is: CONTINUOUS. |
Continuous mode
The following property only works in the comparison mode CONTINUOUS.
Property Name | Description | Default | Range |
---|---|---|---|
CONTINUOUS_DETECT_PAGES | Specifies whether the continuous comparison can be split instead of comparing all content at once. If set to a value greater than zero, this specifies how many pages may be added or inserted before the comparison fails to match the content. The larger this value, the more precise the comparison will be. On the downside, a large value will increase the memory consumption. If this value is set to zero, all content will be compared at once. This gives the optimum result at the cost of maximum memory consumption. | 5 | 0 - 2147483647 |
Strict mode
The following properties only work in the comparison mode STRICT.
Property Name | Description | Default | Range |
---|---|---|---|
TOLERANCE_PAGE_LEFTCORNER | Specifies the maximum number of pixels by which the left or top margin of a page (that is, the upper left corner of all elements) can differ before it is viewed as a difference. | 3 | 0 - 100 |
TOLERANCE_PAGE_RATIO | Specifies the tolerance for the aspect ratio of the PDF page. | 0.01 | 0 - 1 |
TOLERANCE_PAGE_SIZE | Specifies the maximum number of pixels by which the width or height of a page can differ before it is viewed as a difference. | 2 | 0 - 100 |
Logging and Results
Deprecated: these properties will be removed in a future version.
The following properties configure the output of i-net PDFC and the logging. Use the Settings or the CommandLine arguments instead.
Property Name | Description |
---|---|
CREATE_DIFFIMAGES | Specifies whether a PNG image with the marked differences will be created for each pair of pages that contains differences. Possible values are false, first, second and true, which create the difference image for none, the first, the second or both files. The default value is: false. |
CREATE_ORIGIMAGES | Specifies whether a PNG image with the original content will be created for each compared page. The default value is: false. |
CREATE_XORIMAGES | Creates a (negated) XOR image for any pair of pages with differences. The image will be stored as a PNG in the differences directory of the current comparison. If CREATE_DIFFIMAGES is enabled as well, the XOR image will be drawn onto the image created by CREATE_DIFFIMAGES between the two actual page images. The default value is: false. |
IMAGE_SCALE_FACTOR | Defines a scale factor for the generated images (original and difference images). The default value is: 1, i.e. no scaling. |
LOG_FILE | Specifies the file where logged information is to be stored. If a file is specified, the logging is written to the file, otherwise the logging is written to the console. The default value is empty, i.e. logging to the console. |
LOG_LEVEL | Specifies the logging level. Available values: OFF (switches the output off completely), ERROR (logs error messages), WARN (contains all messages of the ERROR level and additionally reports irregularities during the execution), INFO (default; contains all messages of the WARN level and additionally describes settings and environment attributes), ALL (displays the maximum information during the PDFC execution, including any debug info). |
MAX_ERRORS_PER_FILE | Sets the maximum number of differences for the console or log output. All further differences will be counted but not shown in detail. The default value is: 100. Use -1 for unlimited. |
EXPORT_PDF_ALWAYS | Specifies whether the PDF export function should create a file for any comparison (value true) or only in case of differences (value false). The default value is: false. |
Filters and optimizations
Filters are an optional feature for the continuous mode. They help to remove redundant elements from the comparison and to overcome the issue that PDFs may not contain any information about the original text layout. Please note that these filters may not be exactly correct in every single case. Detecting the original layout of a document depends heavily on its content. The chance of correctly detecting a header rises with the number of pages available. It is therefore recommended to use the desktop or web application of i-net PDFC when activating filters, since they allow you to review the result of each filter.
Filters can be activated by adding them to the FILTERS property:
Property Name | Description |
---|---|
FILTERS | Specifies a comma-separated list of filters that will be executed before the actual comparison. |
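Since FILTERS is a single comma-separated value, enabling or disabling one filter means editing that list. The sketch below shows one way to do this without creating duplicates; the helper names are hypothetical, and whether PDFC tolerates whitespace or depends on the order of the keys is not stated here, so the sketch simply strips whitespace and keeps the existing order.

```python
def enable_filter(filters: str, key: str) -> str:
    """Return the FILTERS value with `key` added (no duplicates)."""
    keys = [k.strip() for k in filters.split(",") if k.strip()]
    if key not in keys:
        keys.append(key)
    return ",".join(keys)


def disable_filter(filters: str, key: str) -> str:
    """Return the FILTERS value with `key` removed."""
    keys = [k.strip() for k in filters.split(",") if k.strip()]
    return ",".join(k for k in keys if k != key)


# Filter keys such as HEADERFOOTER and MULTICOLUMN are described on this page.
value = enable_filter("", "HEADERFOOTER")
value = enable_filter(value, "MULTICOLUMN")
value = enable_filter(value, "HEADERFOOTER")  # already present, unchanged
print(value)
print(disable_filter(value, "HEADERFOOTER"))
```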
If all available filter plugins are installed and activated in the configuration (server only), the following filter keys are available:
Area for Comparison
With this filter it is possible to filter an area on all pages for the comparison. The areas to filter are specified as a semicolon-separated list. If no value is specified, no areas will be filtered. All elements that lie completely inside these areas will be filtered.
An area is defined by 4 integer parameters (x, y, width, height), separated by commas. An empty parameter is equal to 0. The values are in 'pixels' at a resolution of 72 DPI. This resolution is used to calculate the default rendering size of a page as well. For example, a US Letter sized page in portrait orientation has 612 x 792 pixels. The values are relative to the pages as they would be displayed at 72 DPI. A different screen resolution has no effect on the area filter.
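The 72 DPI convention means that one 'pixel' equals one typographic point, so physical page measurements convert directly. A small sketch of the arithmetic (the function names are illustrative only):

```python
DPI = 72  # resolution used by the area filter, as stated above


def inches_to_pixels(inches: float) -> float:
    """Convert a length in inches to pixels at 72 DPI."""
    return inches * DPI


def mm_to_pixels(mm: float) -> float:
    """Convert a length in millimetres to pixels at 72 DPI."""
    return mm / 25.4 * DPI


# US Letter is 8.5 x 11 inches, which gives the 612 x 792 quoted above.
letter = (inches_to_pixels(8.5), inches_to_pixels(11))
print(letter)

# A4 is 210 x 297 mm, roughly 595 x 842 pixels.
a4 = (round(mm_to_pixels(210)), round(mm_to_pixels(297)))
print(a4)
```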
Multiple areas are supported. Use ; to separate the area definitions.
Optional parameters specify the page number and/or the document for which the area will be filtered. Valid page numbers are 1 to the number of pages in the document. If no page is given, the area is filtered on all pages. Valid document values are 'F' for the first document and 'S' for the second document. If no document is given, the area is filtered in both documents.
Samples:
- 0,0,100,100
- ,,100,100
- 5,5,10,10;50,10,50,50 (two areas)
- 0,0,100,100,1 (only valid for the first page)
- 50,50,100,100,F (area for all pages but only for the first document)
- 100,10,200,200,3,S (area for the third page of the second document)
Property
Name | Description |
---|---|
FILTERS | Add AREA to the comma-separated list to enable. The default value is disabled. |
AREAFILTER | The rectangle(s) to remove. Valid characters are 0-9, S, F, '-' and ','. For further examples see above. |
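To make the area syntax concrete, here is a minimal parsing sketch based only on the rules and samples above: four comma-separated integers (empty means 0), optionally followed by a page number and/or 'F' or 'S', with several areas separated by ';'. The `Area` class and function names are hypothetical, and PDFC's own parser may be stricter or more lenient.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Area:
    x: int
    y: int
    width: int
    height: int
    page: Optional[int] = None      # None means all pages
    document: Optional[str] = None  # "F", "S", or None for both documents


def parse_area(text: str) -> Area:
    """Parse one area definition such as '100,10,200,200,3,S'."""
    parts = [p.strip() for p in text.split(",")]
    if not 4 <= len(parts) <= 6:
        raise ValueError(f"expected 4 to 6 parameters: {text!r}")
    # An empty parameter counts as 0.
    x, y, width, height = (int(p) if p else 0 for p in parts[:4])
    page, document = None, None
    for extra in parts[4:]:
        if extra.upper() in ("F", "S"):
            document = extra.upper()
        elif extra:
            page = int(extra)
    return Area(x, y, width, height, page, document)


def parse_areafilter(value: str) -> list[Area]:
    """Parse a full AREAFILTER value with ';' separated areas."""
    return [parse_area(a) for a in value.split(";") if a.strip()]


for area in parse_areafilter("5,5,10,10;,,100,100;100,10,200,200,3,S"):
    print(area)
```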
Pages to Compare
This filter allows you to select pages and page ranges that should be compared. Multiple pages can be selected using a comma-separated list. All pages will be used for the comparison if no value is given.
The filter can be applied to each document using the fields "Comparison range document 1" and "Comparison range document 2".
To filter pages from the end of the document there are two additional fields, "Last Page(s) filter document 1" and "Last Page(s) filter document 2". Multiple pages and ranges can be selected here as well. Positive numbers are used, with 1 being the last page. A value of 0 (default) means no filtering.
Examples for page and page range definition:
- 1
- 1-4
- 4-7,11-32
- 1,5,7-21
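The page and page range notation above can be expanded into explicit page numbers as in the following sketch. It also shows how an end-relative value from the "Last Page(s)" fields (1 being the last page) maps to absolute page numbers. The function names are hypothetical, and how PDFC combines the two fields internally is not described here.

```python
def parse_page_ranges(value: str) -> list[int]:
    """Expand a definition such as '1,5,7-21' into sorted page numbers."""
    pages: set[int] = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def resolve_last_pages(value: str, page_count: int) -> list[int]:
    """Map end-relative numbers (1 = last page) to absolute page numbers.

    A value of 0 selects nothing, matching the 'no filtering' default.
    """
    relative = [n for n in parse_page_ranges(value) if 0 < n <= page_count]
    return sorted(page_count - n + 1 for n in relative)


print(parse_page_ranges("4-7,11-32"))
print(resolve_last_pages("1-2", page_count=10))
```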
Page for Comparison
Property
Name | Description |
---|---|
FILTERS | Add PAGERANGE to the comma-separated list to enable. The default value is disabled. |
PAGERANGE_DOCUMENT1 | Page selection for the first document. Valid characters are 0-9, ',' and '-'. For further examples see above. |
PAGERANGEEND1 | Removes the last pages from the comparison. Allowed characters are 0-9. |
PAGERANGEEND2 | Removes the last pages from the comparison. Allowed characters are 0-9. |
Basic Table Optimization
This filter can be used to optimize the comparison of tables with visual borders. The filter will detect the original structure of the table and rearrange the content so that the content will be compared by cell.
Requirement: This filter will only detect a table, if
- The table has a visible border
- Each cell has a visible border
- There is no cell spacing
- The table has at least two rows and two columns
Filter repeating headers
In case a table does not fit onto a single page it is common to repeat the table header after the page break. Usually i-net PDFC tends to mark such repeated headers as differences since it's content that does not belong to the table data. With this option enabled, the filter will exclude any table header from the comparison that is identical to the header of the last table on the previous page.
Property
Property Name | Description |
---|---|
FILTERS | Add BASELINETABLE to the comma-separated list to enable the filter. The default value is disabled. |
IGNORE_REPEATING_HEADER | Enables or disables the filtering of repeated table headers. The default value is false. |
TABLE_HEADER_DIFF_RATIO | If IGNORE_REPEATING_HEADER is active, this value defines the number of differences that will be tolerated while scanning for the repeated table header. It is recommended to leave this value at the default value. The default value is 0. |
Multi-column layout
This filter should be used if the content is arranged in several columns. A typical example is the layout of daily newspapers.
Note: The filter is not suitable for tables!
Property
Property Name | Description |
---|---|
FILTERS | Add MULTICOLUMN to the comma-separated list to enable. Optimizes the text recognition for a multi-column layout. The default value is false. |
Headers and Footers
This filter can be used to exclude headers and footers from the comparison, which reduces the number of repeating differences. Automatic detection is only possible in non-strict mode. Three options are available:
- Do not recognize: headers and footers are not recognized.
- Automatically detect: headers and footers are automatically recognized and treated by PDFC.
- Manually set: the header and footer sizes can be set precisely in pixels if the areas cannot be detected automatically.
Property
Name | Description |
---|---|
FILTERS | Add HEADERFOOTER to the comma-separated list to enable. The default value is disabled. |
FIXED_HEADER_SIZE | Specifies the size of the header in pixels. Set the value to -1 to automatically detect the header. The default value is -1. |
FIXED_FOOTER_SIZE | Specifies the size of the footer in pixels. Set the value to -1 to automatically detect the footer. The default value is -1. |
Compare original PDF text order
This filter option toggles how the layout detection of i-net PDFC reacts to PDF files. By default, i-net PDFC will try to detect the layout of the document pages to some extent. With this option activated, the original print order of the file will be used instead. It is recommended to use this option only if the document is a PDF that was generated by a text processor.
Further details can be found on the page of this parser extension.
Name | Description |
---|---|
FILTERS | Add ORIGINALORDER to the comma-separated list to enable. The default value is disabled. |
USE_PDF_STRUCTURE | Boolean value to enable the usage of the structure tree, if available. This has no effect if there is no logical structure. The default value is true. |
Filter content
You can specify patterns for the content filter. There are two types of filters: plain-text filters and regular expressions. These patterns can be disabled without deleting them from the configuration.
Property
Property Name | Description |
---|---|
FILTERS | Add REGEXP to the comma-separated list to enable. The default value is disabled. |
FILTER_PATTERNS | Defines a list of filters. Each filter is defined by one pattern/string, e.g. <pattern or string>|(regexp|text)|(active|inactive) |
REGEX_MATCH_WORDS | If enabled, words will be filtered completely even if the pattern matches only a part of the word. When disabled, only the matched characters will be filtered. The default value is enabled. |
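The shape of a single filter entry, `<pattern or string>|(regexp|text)|(active|inactive)`, can be illustrated with the sketch below. It splits from the right so that a pattern containing '|' stays intact. How several entries are separated inside FILTER_PATTERNS is not specified here, so only one entry is handled. The `ContentFilter` class is hypothetical, Python's `re` module stands in for whatever regular expression engine PDFC uses (the syntax may differ), and the whole-word behavior of REGEX_MATCH_WORDS is not modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class ContentFilter:
    pattern: str
    is_regex: bool
    active: bool

    def apply(self, text: str) -> str:
        """Remove the matched characters; inactive filters change nothing."""
        if not self.active:
            return text
        if self.is_regex:
            return re.sub(self.pattern, "", text)
        return text.replace(self.pattern, "")


def parse_filter_entry(entry: str) -> ContentFilter:
    """Parse '<pattern>|(regexp|text)|(active|inactive)'."""
    # Split from the right: the pattern itself may contain '|'.
    pattern, kind, state = entry.rsplit("|", 2)
    if kind not in ("regexp", "text") or state not in ("active", "inactive"):
        raise ValueError(f"malformed filter entry: {entry!r}")
    return ContentFilter(pattern, kind == "regexp", state == "active")


date_filter = parse_filter_entry(r"\d{4}-\d{2}|regexp|active")
print(date_filter.apply("Report 2024-05 final"))
```

Inactive entries stay in the configuration but are skipped, which mirrors the statement that patterns can be disabled without deleting them.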
Text recognition (OCR)
This filter uses the optical character recognition plugin to extract text content from images and drawings. As a requirement, the OCR plugin has to be active and the required language files have to be installed. For further details, please refer to the OCR plugin.
Error tolerance
Optical character recognition often produces recognition errors due to small fonts, poor scanning, noise from background images or ambiguous characters. To overcome these errors, a tolerance level can be defined.
- None - compare all characters as recognized (not recommended)
- Similar characters only - tolerate errors on characters with the same appearance, like a Latin 'a' and a Russian 'а'. A full list of tolerated characters can be found here: http://www.unicode.org/reports/tr36/confusables.txt
- Common recognition errors - tolerate errors in characters with similar appearance, especially on noisy backgrounds. This tolerance is based on experience and testing, as there is no public recommendation. An example would be the German sharp s 'ß' and the upper case letter 'B', which are very similar in some fonts.
- Common recognition errors caused by distortion - same as 'Common recognition errors' but extended for slightly rotated or distorted images. Such distortions usually happen when scanning pages.
Property
Name | Description |
---|---|
FILTERS | Add 'OCR' to the comma-separated list to enable. The default value is disabled. |
NORMALIZATION_LEVEL | 0 - None - compare all characters as recognized (not recommended) |
 | 1 - Similar characters only - tolerate errors on characters with the same appearance, like a Latin 'a' and a Russian 'а'. A full list of tolerated characters can be found here: http://www.unicode.org/reports/tr36/confusables.txt |
 | 2 - Common recognition errors - tolerate errors in characters with similar appearance, especially on noisy backgrounds. This tolerance is based on experience and testing, as there is no public recommendation. An example would be the German sharp s 'ß' and the upper case letter 'B', which are very similar in some fonts. |
 | 3 - Common recognition errors caused by distortion - same as 'Common recognition errors' but extended for slightly rotated or distorted images. Such distortions usually happen when scanning pages. |
Text recovery
Some PDF files have a missing character mapping. Such a mapping is required to translate from the visual text to machine-readable text. The usual effect is a correctly displayed document with apparently corrupt text in the comparison result. Furthermore, the text is corrupted as well when copying and pasting from this document (with any reader application!).
As a solution, this filter rebuilds and corrects the character mapping by using optical character recognition. The accuracy depends on the amount of text, with more text providing higher accuracy.
Property
Name | Description |
---|---|
FILTERS | Add CMAPPATCH to the comma-separated list to enable. The default value is disabled. |
Compared Types
The continuous compare mode distinguishes between four types of content: text words, lines / shapes, images and annotations. Each of these types can be excluded from the comparison.
Compared types can be included or excluded using the COMPARE_TYPES property:
Property Name | Description |
---|---|
COMPARE_TYPES | Specifies a comma-separated list of types that will be included in the comparison. The default value is 'TEXT, LINE, IMAGE, ANNOTATION'. |
Text comparison
Property
Includes all text elements like words, numbers, punctuation and list items. The text comparison can be modified using the following properties:
Property Name | Description |
---|---|
DOCUMENT_LANGUAGE | Defines the language for all text recognition plugins. If the configured language does not match the actual language of the document, the recognition errors will increase significantly. If the required language is not available, please have a look at the OCR help page on how to install further languages. The default value is 'auto-detect', in which case i-net PDFC will try to detect the language from the native text elements in the document, if any. In case the language cannot be detected, the client language or English will be used. |
TEXT_ALIGN_RATIO | The text tolerance value sets the allowed vertical deviation for the text line identification. It is relative to the text height of the respective line. This value can be used to compensate for rounding errors of different PDF generators. If the documents are laid out very accurately, a lower value will lead to a more precise comparison. The default value is 0.15. |
COMPARE_TEXT_STYLES | A comma-separated list defining which text properties of matched words to compare. Available values are SIZE, COLOR, FONT, STYLES, ROTATION. The default value is 'true', which compares all properties. |
TOLERANCE_TEXT_SIZE | Defines the tolerated difference in the text size as a ratio. It is only relevant in case COMPARE_TEXT_STYLES is set to true. The default value is 0.05. |
TOLERANCE_COLOR | Defines the maximum color difference per RGB or HSB channel for all paints. The value is the absolute difference for HSB and absolute * 255 for RGB. This value is used for both the text and the line comparison. The default value is: 0.01, which is 1%. |
COMPARE_TEXT_CASE_SENSITIVE | Toggles the case sensitivity of the text comparison. If set to 'false', all text elements will be compared as lower case. This causes the comparison to run slightly slower and to take some more memory. The conversion to lower case will be performed using the default localization of the runtime. The default value is 'true'. |
TOLERANCE_UNDERLINE_LENGTH | Specifies the maximum difference in percent by which the length of underlines may differ before it is viewed as a difference. The default value is: 0.1. The range is 0.0 - 10.0. This value is only used in the STRICT comparison mode. |
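As a worked example of TOLERANCE_COLOR, one reading of "absolute * 255 for RGB" is that the default of 0.01 allows each 8-bit RGB channel to differ by up to 0.01 * 255 = 2.55. The sketch below follows that reading; the function name is hypothetical, and whether PDFC treats the limit as inclusive is an assumption.

```python
def rgb_within_tolerance(c1, c2, tolerance=0.01):
    """Check two (r, g, b) colors with 0-255 channels against the tolerance.

    ASSUMPTION: the per-channel limit is tolerance * 255 and is inclusive.
    """
    limit = tolerance * 255
    return all(abs(a - b) <= limit for a, b in zip(c1, c2))


# A difference of 2 in one channel is inside the default tolerance,
# a difference of 3 is outside it.
print(rgb_within_tolerance((200, 100, 50), (202, 100, 50)))
print(rgb_within_tolerance((200, 100, 50), (203, 100, 50)))
```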
Ignore rotated text
Excludes rotated text from the comparison. This setting is particularly suitable for watermarks and print marks.
Property Name | Description |
---|---|
INVISIBLEELEMENTS_HIDE_ROTATION | Excludes rotated text from the comparison. The default value is true. |
FILTERS | Add 'HIDEROTATEDTEXT' to the comma-separated list to enable. The default value is disabled. |
Decompose complex characters
Activate this option to decompose complex or special characters into basic characters. Complex characters are, for instance, ligatures like 'ﬁ', which will be decomposed into 'fi'. Furthermore, special characters like long or short hyphens will be normalized to their base character.
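The effect of such a decomposition can be approximated with standard Unicode tooling. The sketch below uses compatibility normalization (NFKC) for ligatures and an explicit mapping for dash variants; this is a stand-in to illustrate the idea, not PDFC's actual transformation, and the set of dashes in `DASHES` is an illustrative choice.

```python
import unicodedata

# Illustrative selection of hyphen and dash variants mapped to '-'.
DASHES = {
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}


def decompose(text: str) -> str:
    """Split ligatures into base letters and unify dash variants."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(DASHES.get(ch, ch) for ch in normalized)


print(decompose("\ufb01nal"))         # ligature 'fi' followed by 'nal'
print(decompose("2010\u20132020"))    # en dash between the years
```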
Property
Property Name | Description |
---|---|
TRANSFORM_OPERATIONS | Add REPLACE_IDENTICAL to the comma separated list to enable. The default value is enabled |
FILTERS | Add TEXTTRANSFORM to the comma separated list to enable. The default value is enabled |
Equalize character recognition mistakes
Activate to correct typical text recognition mistakes. An example of a common ambiguity in text recognition is the character 'm' and the syllable 'rn', which appear very similar depending on print quality and font.
Property
Property Name | Description |
---|---|
TRANSFORM_OPERATIONS | Add REPLACE_CONFUSABLES to the comma separated list to enable. The default value is disabled |
FILTERS | Add TEXTTRANSFORM to the comma separated list to enable. The default value is enabled |
Ignore invisible elements
The purpose of this filter is to ignore meaningless elements generated by certain PDF renderers, e.g. text outside of the visible area (or page) or transparent borders of tables. The filter is designed to efficiently remove:
- transparent text
- tiny shapes which are not visible at 100% scale
- transparent or white filled shapes
Clipping calculation
Several document formats, such as PDF, are actually vector graphics formats. These documents contain commands which tell the viewer application what to draw and where. The commands are not exclusive and may cause overlapping or nonsensical drawing operations, such as text that is hidden behind an opaque shape or a white line on a white background. Recognizing such scenarios, where certain graphical elements are hidden or clipped, requires calculating the actual visibility of each element. Due to the performance impact of this computation, it has to be activated with the switch COMPUTE_CLIPPING.
With this feature active, i-net PDFC will check for every element whether it is occluded, whether it can be merged with similar elements (for shapes and images), whether it is clipped in some way and whether it has the same color as its background. As a result, only the visible part of each element will be compared, or the element is ignored if it has no impact on the visual appearance of the document.
Property
Property Name | Description |
---|---|
FILTERS | Add INVISIBLEELEMENTS to the comma-separated list to enable. Potentially invisible elements such as white or transparent lines are not compared. |
COMPUTE_CLIPPING | Enables the calculation of the actual visibility of each element in the document. This feature may have a considerable performance impact, thus the default value is false. |
Line and shapes comparison
Deviation tolerance for lines
This comparison type includes all graphical elements except images. The line and shape comparison can be modified using the following properties:
Property Name | Description | Default | Range |
---|---|---|---|
COMPARE_LINE_STYLES | If set to 'true', the styles of all matched lines and shapes will be checked as well. This will compare the color, stroke and thickness of all lines. | 'true' | |
TOLERANCE_LINE_POSITION | Specifies the maximum number of pixels by which the position of a line or curve can differ per axis before it is viewed as a difference. | 3 | 0 - 100 |
TOLERANCE_LINE_SIZE | Specifies the maximum number of pixels by which the length of a line can differ in total before it is viewed as a difference. | 2 | 0 - 100 |
TOLERANCE_LINE_THICKNESS | Specifies the maximum difference in stroke thickness of two lines or curves (measured in pt) before it is viewed as a difference. | 1 | 100 |
TOLERANCE_COLOR | Defines the maximum color difference per RGB or HSB channel for all paints. The value is the absolute difference for HSB and absolute * 255 for RGB. This value is used for both the text and the line comparison. | 0.01 (1%) | 0.0 - 1.0 |
TOLERANCE_BOX_ROUND_EDGES | Specifies the maximum number of pixels (1 pixel is approximately 0.265mm) by which a control point of a quadratic Bézier curve may differ in total before it is viewed as a difference. | 3 | 0 - 10 |
Image comparison
Deviation tolerance for images
This comparison type includes all images. Note that comparing images may have a notable impact on performance. The image comparison can be modified using the following properties:
Property Name | Description | Default | Range |
---|---|---|---|
TOLERANCE_IMAGE_DISTANCE | Specifies the maximum number of pixels by which the position of an image can differ before it is viewed as a difference. | 3 | 0 - 10 |
TOLERANCE_IMAGE_PIXEL_VALUE | Specifies the maximum allowed discrepancy of pixel values (Double) before it is viewed as a difference. | 0.05 | 0.0 - 1.0 |
TOLERANCE_IMAGE_SIZE | Specifies the maximum difference in percent by which the area spanned by an image may differ before it is viewed as a difference. | 0.1 | 0.0 - 1.0 |
USE_PIXEL_MEDIUM_VALUE | Specifies whether i-net PDFC should compare the medium values instead of single-pixel values. | 'true' | |
Annotation Comparison
Property Name | Description |
---|---|
COMPARE_ANNOTATIONS_DETAILED | Specifies whether differences per annotation are summarized (false) or fully detailed like any other difference in the document (true). The default value is 'false'. |