Comparison Profile

A Compare Profile for i-net PDFC contains the parameters and settings for comparing documents. Different Compare Profiles can lead to very different comparison results, so it may be necessary to adjust or optimize a profile for a particular comparison scenario.

Compare Profiles that are provided by i-net PDFC or shared by the administrator cannot be changed by the user. These profiles can, however, be duplicated and then customized for the user's own needs. With administrative permissions (the "configuration" permission), a Compare Profile can be shared with other users on the server.

Note: A shared Compare Profile cannot be unshared; it has to be deleted instead.

Manage Compare Profiles

The Compare Profiles provided by i-net PDFC can be activated or deactivated in the configuration under Comparison > Profiles.

The footer of the Compare Profile window contains the actions for managing Compare Profiles. The currently open profile can be duplicated, exported, published and deleted. In addition, the settings for a Compare Profile can be imported.

Note: In this way, for example, Compare Profiles can be exported from the i-net PDFC GUI and imported into the server. This also works in the other direction. Exported Compare Profiles can be edited outside the application, which makes it possible to set values that cannot be set in the configuration interface, for example headers and footers that are more than 100 pixels high. The possible settings are listed in the respective tables.

The following settings are optional and are only used for managing the profiles. These properties have no effect on the comparison.

Property Name Description
PROFIL_NAME A unique name to differentiate the profiles.
PROFIL_DESCRIPTION A descriptive text about the profile

Import/Export of profiles

Each profile can be saved to an external file by clicking Export at the bottom of the profile panel. The export files are portable and can be used with any type of i-net PDFC installation: GUI, API and server.

To import a profile, create a custom profile and select it as the active one. The Import label then becomes available at the bottom of the profile configuration page. Click this label to select a file to import. Alternatively, drag and drop a profile XML file into the panel to load the settings.

The selected imported profile will replace all settings of the current profile.

Profile settings

The behavior and the precision of i-net PDFC can be specified using the configuration properties. These configuration properties of i-net PDFC are included in the file config.xml.

After the installation, the installation folder contains the file config.xml with default values. You can change the default values by editing this file. If the file does not exist, i-net PDFC uses the default values.

Note: The graphical user interface uses the file .pdfc in the current user's home directory instead of the config.xml file in the installation directory. You can edit this file in the same way for special fine-tuning, but this is not necessary in most cases.

A profile basically contains the settings for comparison mode, element comparison types and filters to be used. Each filter or comparison type may have additional options to fine-tune the feature.

Comparison mode

Property Name Description
CONTINUOUS_COMPARE Defines which comparison engine is used. The available values are 'CONTINUOUS' and 'STRICT'. The default value is CONTINUOUS.

Continuous mode

The following property only takes effect in the comparison mode CONTINUOUS.

Property Name Description
CONTINUOUS_DETECT_PAGES Specifies whether the continuous comparison can be split instead of comparing all content at once. A value greater than zero specifies how many pages may be added or inserted before the comparison fails to match the content. The larger this value, the more precise the comparison, but a large value also increases the memory consumption. If this value is set to zero, all content is compared at once. This gives the optimum result at the cost of the maximum memory requirement. Default: 5. Range: 0 - 2147483647.

Strict mode

The following properties only take effect in the comparison mode STRICT.

Property Name Description
TOLERANCE_PAGE_LEFTCORNER Specifies the maximum number of pixels by which the left or top margin of a page (the upper left corner of all elements) can differ before it is viewed as a difference. Default: 3. Range: 0 - 100.
TOLERANCE_PAGE_RATIO Specifies the tolerance for the aspect ratio of the PDF page. Default: 0.01. Range: 0 - 1.
TOLERANCE_PAGE_SIZE Specifies the maximum number of pixels by which the width or height of a page can differ before it is viewed as a difference. Default: 2. Range: 0 - 100.

Logging and Results

Deprecated, these properties will be removed in a future version.

The following properties configure the output of i-net PDFC and the logging. Use the Settings or the command line arguments instead.

Property Name Description
CREATE_DIFFIMAGES Specifies whether a PNG image with the marked differences is created for each pair of pages that contains differences. Possible values are false, first, second and true, which create the difference image for none, the first, the second or both files. The default value is false.
CREATE_ORIGIMAGES Specifies whether a PNG image with the original content is created for each compared page. The default value is false.
CREATE_XORIMAGES Creates a (negated) XOR image for each pair of pages with differences. The image is stored as a PNG in the differences directory of the current comparison. If CREATE_DIFFIMAGES is enabled as well, the XOR image is drawn onto the image created by CREATE_DIFFIMAGES, between the two actual page images. The default value is false.
IMAGE_SCALE_FACTOR Defines a scale factor for the generated images (original and difference images). The default value is 1, i.e. no scaling.
LOG_FILE Specifies the file in which logged information is stored. If a file is specified, the log is written to that file; otherwise it is written to the console. The default value is empty (logging to the console).
LOG_LEVEL Specifies the logging level. Available values:
  • OFF: switches the output off completely.
  • ERROR: logs error messages.
  • WARN: contains all messages of the ERROR level and additionally reports irregularities during the execution.
  • INFO (default): contains all messages of the WARN level and additionally describes settings and environment attributes.
  • ALL: displays the maximum information during the PDFC execution, including any debug information.
MAX_ERRORS_PER_FILE Sets the maximum number of differences for the console or log output. All further differences are counted but not shown in detail. The default value is 100. Use -1 for unlimited.
EXPORT_PDF_ALWAYS Specifies whether the PDF export function creates a file for every comparison (true) or only if there are differences (false). The default value is false.

Filters and optimizations

Filters are an optional feature for the continuous mode. They help to remove redundant elements from the comparison and to overcome the issue that PDFs may not contain any information about the original text layout. Please note that these filters may not be exactly correct in every single case. Finding the original layout of a document depends heavily on the content of the document. For example, the chance of correctly detecting a header rises with the number of pages available. It is therefore recommended to use the desktop or web application of i-net PDFC when activating filters, since they allow you to review the result of each filter.

Filters can be activated by adding them to the FILTERS property:

Property Name Description
FILTERS Specifies a comma-separated list of filters that will be executed before the actual comparison.
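
As a small illustration of how a FILTERS value is composed, the following Python sketch joins filter keys into the comma-separated form and rejects unknown keys. The key set is collected from the sections of this page. The helper name build_filters_value is hypothetical and is not part of i-net PDFC.

```python
# Filter keys as named in the sections of this help page.
KNOWN_FILTER_KEYS = {
    "AREA", "PAGERANGE", "BASELINETABLE", "MULTICOLUMN", "HEADERFOOTER",
    "ORIGINALORDER", "REGEXP", "OCR", "CMAPPATCH", "HIDEROTATEDTEXT",
    "TEXTTRANSFORM", "INVISIBLEELEMENTS",
}


def build_filters_value(keys):
    """Return the comma-separated FILTERS value (hypothetical helper)."""
    unknown = [key for key in keys if key not in KNOWN_FILTER_KEYS]
    if unknown:
        raise ValueError(f"unknown filter keys: {unknown}")
    return ",".join(keys)


print(build_filters_value(["HEADERFOOTER", "PAGERANGE", "REGEXP"]))
```

Whether the order of the keys matters to i-net PDFC is not stated on this page, so the sketch simply keeps the order it is given.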

If all available filter plugins are installed and activated in the configuration (server only), the following filter keys are available:

Area for Comparison

This filter applies an area filter to all pages of the comparison. Multiple areas are specified as a semicolon-separated list. If no value is specified, no area is filtered. All elements that lie completely inside these areas are filtered.

An area is defined by four integer parameters (x, y, width, height), separated by commas. An empty parameter is equal to 0. The values are in 'pixels' at a resolution of 72 DPI. This resolution is also used to calculate the default rendering size of a page. For example, a US Letter page in portrait orientation measures 612 x 792 pixels. The values are relative to the page as it would be displayed at 72 DPI; a different screen resolution has no effect on the area filter.

Multiple areas are supported. Use ; to separate the area definitions.

Optional parameters specify the page number and/or the document to which the area filter applies. Valid page numbers range from 1 to the number of pages in the document. If no page number is given, the area is filtered on all pages. Valid document values are 'F' for the first document and 'S' for the second document. If no document is given, the area is filtered in both documents.

Samples:

  • 0,0,100,100
  • ,,100,100
  • 5,5,10,10;50,10,50,50 (two areas)
  • 0,0,100,100,1 (only valid for the first page)
  • 50,50,100,100,F (area for all pages but only for the first document)
  • 100,10,200,200,3,S (area on page 3 of the second document)
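
The sample definitions above can be read mechanically. The following Python sketch parses an area filter value under the rules described in this section (four comma-separated integers, empty values equal to 0, an optional page number, an optional 'F' or 'S', and ';' between areas). The names Area and parse_areas are hypothetical; this is not the parser used by i-net PDFC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Area:
    x: int
    y: int
    width: int
    height: int
    page: Optional[int] = None      # None: all pages
    document: Optional[str] = None  # None: both documents, "F" or "S" otherwise


def parse_areas(value: str) -> List[Area]:
    """Parse a semicolon-separated list of area definitions (illustrative only)."""
    areas = []
    for definition in value.split(";"):
        if not definition.strip():
            continue
        parts = [part.strip() for part in definition.split(",")]
        if len(parts) < 4:
            raise ValueError(f"an area needs x, y, width and height: {definition!r}")
        # An empty parameter is equal to 0.
        x, y, width, height = (int(part) if part else 0 for part in parts[:4])
        page = None
        document = None
        for extra in parts[4:]:
            if extra in ("F", "S"):
                document = extra
            elif extra:
                page = int(extra)
        areas.append(Area(x, y, width, height, page, document))
    return areas


for area in parse_areas("5,5,10,10;100,10,200,200,3,S"):
    print(area)
```

Treating the fifth and sixth values as interchangeable (page number or document letter) is an assumption derived from the samples, which show both "0,0,100,100,1" and "50,50,100,100,F".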

Property

Name Description
FILTERS Add AREA to the comma separated list to enable. The default value is disabled.
AREAFILTER The rectangle(s) to remove. Valid characters are 0-9, S, F, - and the comma. For examples, see above.

Pages to Compare

This filter selects the pages and page ranges that should be compared. Multiple pages can be selected using a comma-separated list. If no value is given, all pages are used for the comparison.

The filter can be applied to each document using the fields "Comparison range document 1" and "Comparison range document 2".

To filter pages from the end of the document, there are two additional fields, "Last Page(s) filter document 1" and "Last Page(s) filter document 2". Multiple pages and ranges can be selected here as well. Positive numbers are used, with 1 denoting the last page. A value of 0 (default) means no filtering.

Examples for page and page range definition:

  • 1
  • 1-4
  • 4-7,11-32
  • 1,5,7-21
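
To make the notation concrete, the following Python sketch expands a page definition such as the examples above into individual page numbers and translates "last page" numbers (1 being the last page) into absolute page numbers. Both function names are hypothetical, and the sketch only illustrates the notation; it does not reproduce how i-net PDFC evaluates the values.

```python
def parse_page_ranges(value):
    """Expand a definition such as '1,5,7-21' into a sorted list of page numbers."""
    pages = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(number) for number in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def from_end_to_absolute(pages_from_end, page_count):
    """Translate numbers counted from the end (1 = last page) into page numbers."""
    return sorted(page_count - number + 1
                  for number in pages_from_end
                  if 1 <= number <= page_count)


print(parse_page_ranges("4-7,11-13"))
print(from_end_to_absolute(parse_page_ranges("1-2"), page_count=10))
```

A value of 0 falls outside the range 1 to page_count and is therefore ignored by from_end_to_absolute, which matches the "no filtering" meaning given for the default.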

Page for Comparison

Property

Name Description
FILTERS Add PAGERANGE to the comma separated list to enable. The default value is disabled
PAGERANGE_DOCUMENT1 Filters the pages of the first document. Valid characters are 0-9, the comma and the hyphen. For examples, see above.
PAGERANGEEND1 Removes the last pages from the comparison. Allowed characters are 0-9.
PAGERANGEEND2 Removes the last pages from the comparison. Allowed characters are 0-9.

Basic Table Optimization

This filter can be used to optimize the comparison of tables with visual borders. The filter will detect the original structure of the table and rearrange the content so that the content will be compared by cell.

Requirements: This filter only detects a table if

  • The table has a visible border
  • Each cell has a visible border
  • There is no cell spacing
  • The table has at least two rows and two columns

Filter repeating headers

In case a table does not fit onto a single page it is common to repeat the table header after the page break. Usually i-net PDFC tends to mark such repeated headers as differences since it's content that does not belong to the table data. With this option enabled, the filter will exclude any table header from the comparison that is identical to the header of the last table on the previous page.

Property

Property Name Description
FILTERS Add BASELINETABLE to the comma separated list to enable the filter. The default value is disabled
IGNORE_REPEATING_HEADER Enables or disables the filtering of repeated table headers. The default value is false
TABLE_HEADER_DIFF_RATIO If IGNORE_REPEATING_HEADER is active this value defines the number of differences that will be tolerated while scanning for the repeated table header. It is recommended to leave this value at the default value. The default value is 0

Multi-column layout

This filter should be used if the content is arranged in several columns. A typical example is the layout of daily newspapers.

Note: The filter is not suitable for tables!

Property

Property Name Description
FILTERS Add MULTICOLUMN to the comma separated list to enable. Optimizes the text recognition for a multi-column layout. The default value is false

Headers and Footers

This filter excludes headers and footers from the comparison, which reduces the number of repeated differences. Automatic detection is only possible in non-strict mode. Three options are available.

  • Do not recognize: Headers and footers are not recognized.
  • Automatically Detect: Headers and footers are automatically recognized and treated by PDFC.
  • Manually set: The header and footer sizes can be set precisely in pixels if the areas cannot be detected automatically.

Property

Name Description
FILTERS Add HEADERFOOTER to the comma separated list to enable. The default value is disabled
FIXED_HEADER_SIZE Specifies the size of the header in pixels. Set the value to -1 to automatically detect the header. The default value is -1
FIXED_FOOTER_SIZE Specifies the size of the footer in pixels. Set the value to -1 to automatically recognize the footer. The default value is -1

Compare original PDF text order

This filter option toggles how the layout detection of i-net PDFC reacts to PDF files. By default, i-net PDFC tries to detect the layout of the document pages to some extent. With this option activated, the original print order of the file is used instead. It is recommended to use this option only if the document is a PDF that was generated by a text processor.

Further details can be found on the page of this parser extension.

Name Description
FILTERS Add ORIGINALORDER to the comma separated list to enable. The default value is disabled
USE_PDF_STRUCTURE Boolean value that enables the use of the structure tree, if available. This has no effect if there is no logical structure. The default value is true.

Filter content

You can specify patterns for the content filter. There are two types of filters: plain-text filters and regular expressions. These patterns can be disabled without deleting them from the configuration.

Property

Property Name Description
FILTERS Add REGEXP to the comma separated list to enable. The default value is disabled
FILTER_PATTERNS Defines a list of filters. Each filter is defined by one pattern or string, its type and its state, e.g. <pattern or string>|(regexp|text)|(active|inactive)
REGEX_MATCH_WORDS If enabled, words are filtered completely even if the pattern matches only a part of the word. If disabled, only the matched characters are filtered. The default is enabled.
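
The entry format and the effect of REGEX_MATCH_WORDS can be sketched in Python as follows. This is an interpretation of the descriptions above, not the implementation used by i-net PDFC: the function names are hypothetical, the sketch handles a single entry only (how several entries are separated inside FILTER_PATTERNS is not stated here), and the entry is split from the right so that a regular expression may itself contain '|'.

```python
import re


def parse_filter_entry(entry):
    """Split '<pattern or string>|(regexp|text)|(active|inactive)' into its parts."""
    pattern, kind, state = entry.rsplit("|", 2)
    if kind not in ("regexp", "text") or state not in ("active", "inactive"):
        raise ValueError(f"malformed filter entry: {entry!r}")
    return pattern, kind, state == "active"


def apply_filter(words, entry, match_words=True):
    """Filter a list of words with one entry (assumed semantics, illustrative only)."""
    pattern, kind, active = parse_filter_entry(entry)
    if not active:
        return list(words)  # inactive patterns are kept but not applied
    regex = re.compile(pattern if kind == "regexp" else re.escape(pattern))
    result = []
    for word in words:
        if match_words and regex.search(word):
            continue  # a partial match removes the whole word
        remaining = regex.sub("", word)  # otherwise only matched characters go
        if remaining:
            result.append(remaining)
    return result


words = ["Page", "12", "of", "v1.2"]
print(apply_filter(words, r"\d+|regexp|active"))
print(apply_filter(words, r"\d+|regexp|active", match_words=False))
```

The second call shows the difference the switch makes: with whole-word matching disabled, "v1.2" is reduced to its non-digit characters instead of being dropped.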

Text recognition (OCR)

This filter uses the optical character recognition plugin to extract text content from images and drawings. The OCR plugin has to be active and the required language files have to be installed. For further details, please refer to the OCR plugin.

Error tolerance

Optical character recognition often produces recognition errors due to small fonts, poor scanning, noise from background images or ambiguous characters. To overcome these errors, a tolerance level can be defined.

  • None - compare all characters as recognized (not recommended)
  • Similar characters only - tolerate errors on characters with the same appearance, like a Latin 'a' and a Russian 'а'. A full list of tolerated characters can be found at http://www.unicode.org/reports/tr36/confusables.txt
  • Common recognition errors - tolerate errors in characters with similar appearance, especially on noisy background. This tolerance is based on experience and testing as there is no public recommendation. An example would be the German sharp s 'ß' and the upper case letter 'B' that are very similar in some fonts.
  • Common recognition errors caused by distortion - same as 'Common recognition errors' but extended for slightly rotated or distorted images. Such distortions usually occur when scanning pages.

Property

Name Description
FILTERS Add 'OCR' to the comma separated list to enable. The default value is disabled.
NORMALIZATION_LEVEL The error tolerance level as described in the list above: 0 - None, 1 - Similar characters only, 2 - Common recognition errors, 3 - Common recognition errors caused by distortion.

Text recovery

Some PDF files lack a character mapping. Such a mapping is required to translate from the visual text to machine-readable text. The usual effect is a correctly displayed document with apparently corrupt text in the comparison result. Furthermore, the text is corrupted as well when copying and pasting from this document (with any reader application!).

As a solution, this filter rebuilds and corrects the character mapping by using optical character recognition. The accuracy depends on the amount of text, with more text providing higher accuracy.

Property

Name Description
FILTERS Add CMAPPATCH to the comma separated list to enable. The default value is disabled

Compared Types

The continuous compare mode distinguishes between four types of content: text words, lines/shapes, images and annotations. Each of these types can be excluded from the comparison.

Compared types can be included or excluded using the COMPARE_TYPES property:

Property Name Description
COMPARE_TYPES Specifies a comma-separated list of types that will be included in the comparison. Default is 'TEXT, LINE, IMAGE, ANNOTATION'

Text comparison

Includes all text elements like words, numbers, punctuation and list items. The text comparison can be modified using the following properties:

Property

Property Name Description
DOCUMENT_LANGUAGE This value defines the language for all text recognition plugins. If the configured language doesn't match the actual language of the document, the recognition errors will increase significantly. If the required language is not available, please have a look at the OCR help page on how to install further languages. The default value is 'auto-detect' in which case i-net PDFC will try to detect the language from the native text elements in the document, if any. In case the language cannot be detected, the client language or English will be used.
TEXT_ALIGN_RATIO The text tolerance value sets the allowed vertical deviation for the text line identification. It is relative to the text height of the respective line. This value can be used to compensate for rounding errors of different PDF generators. If the documents are laid out very accurately, a lower value leads to a more precise comparison. The default value is 0.15.
COMPARE_TEXT_STYLES A comma separated list defining which text properties of matched words to compare. Available values are SIZE, COLOR, FONT, STYLES, ROTATION. The default value is 'true', which compares all properties.
TOLERANCE_TEXT_SIZE This property defines the tolerated difference in the text size as a ratio. It is only relevant if COMPARE_TEXT_STYLES is set to true. The default value is 0.05.
TOLERANCE_COLOR Defines the maximum color difference per RGB or HSB channel for all paints. The value is the absolute difference for HSB and absolute * 255 for RGB. This value is used for both the text and the line comparison. The default value is 0.01, which is 1%.
COMPARE_TEXT_CASE_SENSITIVE This switch toggles the case sensitivity of the text comparison. If set to 'false', all text elements are compared as lower case. This causes the comparison to run slightly slower and to use somewhat more memory. The conversion to lower case is performed using the default localization of the runtime. The default value is 'true'.
TOLERANCE_UNDERLINE_LENGTH Specifies the maximum difference in percent by which the length of underlines may differ before it is viewed as a difference. The default value is 0.1. The range is 0.0 - 10.0. This value is only used in the STRICT comparison mode.
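
As a worked example for TOLERANCE_COLOR: with the default of 0.01, the tolerated difference per RGB channel is 0.01 * 255, i.e. about 2.55 on the usual 0 to 255 scale. The Python sketch below applies that limit per channel. The function name is hypothetical, and treating the limit as inclusive is an assumption; the sketch does not claim to reproduce the comparison performed by i-net PDFC.

```python
TOLERANCE_COLOR = 0.01  # default value from the table above


def rgb_within_tolerance(first, second, tolerance=TOLERANCE_COLOR):
    """Check two (r, g, b) tuples with channel values from 0 to 255."""
    limit = tolerance * 255  # about 2.55 for the default tolerance
    return all(abs(a - b) <= limit for a, b in zip(first, second))


print(rgb_within_tolerance((100, 100, 100), (102, 100, 100)))
print(rgb_within_tolerance((100, 100, 100), (103, 100, 100)))
```

With integer channel values, the default therefore tolerates a deviation of up to 2 per channel, while a deviation of 3 exceeds the limit.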

Ignore rotated text

Excludes rotated text from the comparison. This setting is particularly suitable for watermarks and print marks.

Property Name Description
INVISIBLEELEMENTS_HIDE_ROTATION Excludes rotated text from the comparison. The default value is true
FILTERS Add 'HIDEROTATEDTEXT' to the comma separated list to enable. The default value is disabled

Decompose complex characters

Activate to decompose complex or special characters into basic characters. Complex characters are, for instance, ligatures like 'ﬁ', which is decomposed into 'fi'. Furthermore, special characters such as long or short hyphens are normalized to their base character.

Property
Property Name Description
TRANSFORM_OPERATIONS Add REPALCE_IDENTICAL to the comma separated list to enable. The default value is enabled
FILTERS Add TEXTTRANSFORM to the comma separated list to enable. The default value is enabled
Property Name Description
TRANSFORM_OPERATIONS Add REPLACE_CONFUSABLES to the comma separated list to enable. The default value is disabled
FILTERS Add TEXTTRANSFORM to the comma separated list to enable. The default value is enabled
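
The ligature decomposition described in this section can be illustrated with Unicode compatibility normalization (NFKC) from the Python standard library. This is a stand-in technique chosen for the example; whether i-net PDFC uses the same mapping is not stated here, and NFKC does not cover every case mentioned above (for example, it does not map all dash characters to a plain hyphen). The function name is hypothetical.

```python
import unicodedata


def decompose_ligatures(text):
    """Replace compatibility characters such as U+FB01 with their base characters."""
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-character ligature for "fi".
print(decompose_ligatures("\ufb01nal of\ufb01ce"))
```

After this step, a word typeset with the ligature and the same word typeset with two separate letters compare as equal strings.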

Equalize character recognition mistakes

Activate to correct typical text recognition mistakes. An example of a common ambiguity in text recognition is the character 'm' and the letter pair 'rn', which look very similar depending on print quality and font.
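
The idea can be sketched as follows: both texts are mapped to a common form before they are compared, so that an ambiguous sequence and its look-alike no longer count as a difference. The mapping below contains only the example from the text, and all names are hypothetical; the actual list of equalized mistakes in i-net PDFC is not given here.

```python
# Only the example named in the text; a real list would be longer.
AMBIGUOUS_SEQUENCES = {"rn": "m"}


def equalize(text):
    """Map ambiguous character sequences to one common form."""
    for sequence, replacement in AMBIGUOUS_SEQUENCES.items():
        text = text.replace(sequence, replacement)
    return text


def equal_after_equalizing(first, second):
    return equalize(first) == equalize(second)


print(equal_after_equalizing("modern", "modem"))
```

The example also shows the trade-off of such a tolerance: two genuinely different words can become indistinguishable once the ambiguity is equalized.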

Ignore invisible elements

The purpose of this filter is to ignore meaningless elements generated by certain PDF renderers, e.g. text outside of the visible area (or page) or transparent borders of tables. The filter is designed to efficiently remove:

  • transparent text
  • tiny shapes which are not visible at 100% scale
  • transparent or white filled shapes

Clipping calculation

Several document formats, such as PDF, are actually vector graphics formats. These documents contain commands which tell the viewer application what to draw and where. The commands are not exclusive and may cause overlapping or nonsensical drawing operations, such as text that is hidden behind an opaque shape or a white line on a white background. Recognizing such scenarios, where certain graphical elements are hidden or clipped, requires calculating the actual visibility of each such element. Due to the performance impact of this computation, it has to be activated with the switch COMPUTE_CLIPPING.

With this feature active, i-net PDFC checks for each element whether it is occluded, whether it can be merged with similar elements (for shapes and images), whether it is clipped in some way and whether it has the same color as its background. As a result, only the visible part of each element is compared, or the element is ignored if it has no impact on the visual appearance of the document.

Property
Property Name Description
FILTERS Add INVISIBLEELEMENTS to the comma separated list to enable. Potentially invisible elements such as white or transparent lines are not compared.
COMPUTE_CLIPPING Enables the calculation of the actual visibility of each element in the document. This feature can be computationally expensive, thus the default value is false.

Line and shapes comparison

Deviation tolerance for lines

This type includes all graphical elements except images. The line and shape comparison can be modified using the following properties:

Property Name Description
COMPARE_LINE_STYLES If set to 'true', the styles of all matched lines and shapes are checked as well. This compares the color, stroke and thickness of all lines. Default: 'true'.
TOLERANCE_LINE_POSITION Specifies the maximum number of pixels by which the position of a line or curve can differ per axis before it is viewed as a difference. Default: 3. Range: 0 - 100.
TOLERANCE_LINE_SIZE Specifies the maximum number of pixels by which the length of a line can differ in total before it is viewed as a difference. Default: 2. Range: 0 - 100.
TOLERANCE_LINE_THICKNESS Specifies the maximum difference in stroke thickness of two lines or curves (measured in pt) before it is viewed as a difference. Default: 1. Range: 100.
TOLERANCE_COLOR Defines the maximum color difference per RGB or HSB channel for all paints. The value is the absolute difference for HSB and absolute * 255 for RGB. This value is used for both the text and the line comparison. Default: 0.01 (1%). Range: 0.0 - 1.0.
TOLERANCE_BOX_ROUND_EDGES Specifies the maximum number of pixels (1 pixel is approximately 0.265mm) by which a control point of a quadratic Bézier curve may differ in total before it is viewed as a difference. Default: 3. Range: 0 - 10.
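
The phrase "per axis" for TOLERANCE_LINE_POSITION can be illustrated with a short Python sketch. It assumes that the position of a line is represented by a single (x, y) point and that the limit is inclusive; both are assumptions made for the example, and the function name is hypothetical.

```python
TOLERANCE_LINE_POSITION = 3  # default value from the table above, in pixels


def position_within_tolerance(first, second, tolerance=TOLERANCE_LINE_POSITION):
    """Compare two (x, y) positions axis by axis."""
    dx = abs(first[0] - second[0])
    dy = abs(first[1] - second[1])
    return dx <= tolerance and dy <= tolerance


print(position_within_tolerance((10, 10), (13, 13)))
print(position_within_tolerance((10, 10), (14, 10)))
```

Because each axis is checked separately, a shift of 3 pixels in both directions is tolerated even though the diagonal distance is larger than 3, while a shift of 4 pixels along a single axis is not.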

Image comparison

Deviation tolerance for images

This type includes all images. Note that comparing images may have a notable impact on performance. The image comparison can be modified using the following properties:

Property Name Description
TOLERANCE_IMAGE_DISTANCE Specifies the maximum number of pixels by which the position of an image can differ before it is viewed as a difference. Default: 3. Range: 0 - 10.
TOLERANCE_IMAGE_PIXEL_VALUE Specifies the maximum allowed discrepancy of pixel values (Double) before it is viewed as a difference. Default: 0.05. Range: 0.0 - 1.0.
TOLERANCE_IMAGE_SIZE Specifies the maximum difference in percent by which the area spanned by an image may differ before it is viewed as a difference. Default: 0.1. Range: 0.0 - 1.0.
USE_PIXEL_MEDIUM_VALUE Specifies whether i-net PDFC compares the medium values instead of single-pixel values in the image comparison. Default: 'true'.

Annotation Comparison

Property Name Description
COMPARE_ANNOTATIONS_DETAILED Specifies whether differences per annotation are summarized (false) or will be fully detailed like any other difference in the document (true). Default is 'false'