
Comparison Profile

A Compare Profile for PDFC contains parameters and settings for the comparison of documents. Different Compare Profiles can lead to very different comparison results. It may therefore be necessary to adjust or optimize a profile for certain comparison scenarios.

Manage Compare Profiles

The footer of the Compare Profile window lets you manage Compare Profiles. The currently open profile can be duplicated, exported, published/unpublished and deleted. In addition, settings for a Compare Profile can be imported.

Note: In this way, for example, Compare Profiles can be exported from the i-net PDFC GUI and imported into the server. This also works in the other direction. Exported Compare Profiles can be edited outside the application, which makes it possible to set values that the configuration interface does not allow, for example headers and footers that are more than 100 pixels high. The possible settings can be found in the respective tables.

Default Profiles and Publishing

The default compare profiles provided by i-net PDFC can be activated or deactivated in the server configuration under Comparison > Profiles.

Any user with administrative permissions or the permission 'Manage Users and Groups' can publish a user profile either for all users or for selected users or groups. Once a profile is published, it appears alongside the default profiles for every user who has access to it. The publishing state of a profile is displayed in the list of available profiles unless the profile is shared with all users. This mechanism can be used, for instance, to augment the default profiles of the server.

Published profiles can be unpublished by any user with administrative permissions or the permission 'Manage Users and Groups'. They can only be modified by the owner or an administrator.

To customize a default profile or a write-protected shared profile, duplicate it to create a new writable profile.

Import/Export of profiles

Each profile can be stored to an external file by clicking on Export at the bottom of the profile panel. The export files are portable and can be used for any type of i-net PDFC installation - GUI, API and Server.

To import a profile, create a custom profile and select it as the active one. The Import label will then be available at the bottom of the profile configuration page. Click on this label to select a file to import. Alternatively, drag and drop a profile XML file into the panel to load the settings.

The imported profile replaces all settings of the current profile.

Profile settings

A profile basically contains the settings for comparison mode, element comparison types and filters to be used. Each filter or comparison type may have additional options to fine-tune the feature.

Comparison mode

This option has the biggest impact on the comparison. Here are the differences between the default comparison mode and the "strict mode":

Default mode:

  • Allows for parts of the document to be matched even when one part is located further down the document due to an inserted paragraph or element.
  • Places an emphasis on the continuous flow of content being the same, as opposed to the look or location on each individual page.

Strict mode:

  • Each part of the document must be lined up in the same position in both documents in order to be seen as matching. This means that if a paragraph is inserted in one document, all content underneath this paragraph will be seen as different from the other document, since it will have moved.
  • Places an emphasis on both the location of elements and their content.

Filters and optimizations

i-net PDFC offers various specialized optimizations for comparing content of specific kinds. You can turn these optimizations on and off at will.

Combine differences

Large replacements can cause common words to be marked as equal even though the context is completely different. To reduce such false negatives, the option 'Combine large text differences' can be used.

Area for Comparison

With this filter it is possible to filter an area on all pages for the comparison. The area(s) to filter can be specified as a semicolon-separated list. If no value is specified, no areas will be filtered. All elements located completely inside these areas will be filtered.

An area is defined by 4 integer parameters (x, y, width, height). The parameters are separated by commas. An empty parameter is equal to 0. The values are in 'pixels' with a resolution of 72 DPI. This resolution is used to calculate the default rendering size of a page as well. For example, a US Letter sized page in portrait orientation has 612 x 792 pixels. The values are relative to the pages as they would be displayed at 72 DPI. A different screen resolution has no effect on the area filter.
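As a small illustration of the 72 DPI convention, the following sketch converts physical page dimensions into the units used by the area filter. The helper names are hypothetical and are not part of i-net PDFC.

```python
UNITS_PER_INCH = 72   # resolution the area filter is based on
MM_PER_INCH = 25.4


def inches_to_units(inches: float) -> int:
    """Convert a length in inches to 72 DPI 'pixels'."""
    return round(inches * UNITS_PER_INCH)


def mm_to_units(mm: float) -> int:
    """Convert a length in millimetres to 72 DPI 'pixels'."""
    return round(mm / MM_PER_INCH * UNITS_PER_INCH)


# US Letter in portrait orientation is 8.5 x 11 inches.
letter = (inches_to_units(8.5), inches_to_units(11))  # (612, 792)

# A4 in portrait orientation is 210 x 297 mm.
a4 = (mm_to_units(210), mm_to_units(297))
```

Such a conversion can help when an area is measured on paper or in a design tool and then has to be entered into the filter field.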

Multiple areas are supported. Use ; to separate the area definitions.

Optional parameters specify the page number and/or the document to which the area filter applies. Valid page numbers range from 1 to the number of pages in the document. If no page number is given, the area will be filtered on all pages. Valid values for the document are 'F' for the first document and 'S' for the second document. If no document is given, the area will be filtered in both documents.

Samples:

  • 0,0,100,100
  • ,,100,100
  • 5,5,10,10;50,10,50,50 (two areas)
  • 0,0,100,100,1 (only valid for the first page)
  • 50,50,100,100,F (area for all pages but only for the first document)
  • 100,10,200,200,3,S (area on page 3 of the second document)
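The syntax shown in the samples can be sketched as a small parser. This is only an illustration of the described format, not the parser used by i-net PDFC; the `Area` class and `parse_areas` function are hypothetical names, and the strict handling of the optional parameters is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Area:
    x: int
    y: int
    width: int
    height: int
    page: Optional[int] = None      # None: applies to all pages
    document: Optional[str] = None  # None: both documents, else 'F' or 'S'


def parse_areas(value: str) -> List[Area]:
    """Parse a semicolon-separated list of area definitions."""
    areas = []
    for definition in value.split(";"):
        if not definition.strip():
            continue  # no value: nothing to filter
        parts = [p.strip() for p in definition.split(",")]
        if not 4 <= len(parts) <= 6:
            raise ValueError(f"invalid area definition: {definition!r}")
        # An empty parameter is equal to 0.
        x, y, width, height = (int(p) if p else 0 for p in parts[:4])
        page, document = None, None
        for extra in parts[4:]:
            if extra in ("F", "S"):
                document = extra
            elif extra:
                page = int(extra)
        areas.append(Area(x, y, width, height, page, document))
    return areas
```

For example, `parse_areas("100,10,200,200,3,S")` yields one area restricted to page 3 of the second document, while `parse_areas(",,100,100")` yields an area starting at 0,0 that applies to all pages of both documents.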

Pages to Compare

This filter selects the pages and page ranges that should be compared. Multiple pages can be selected using a comma-separated list. All pages will be used for the comparison if no value is given.

The filter can be applied to each document using the fields "Comparison range document 1" and "Comparison range document 2".

To filter pages from the end of the document, there are two additional fields: "Last Page(s) filter document 1" and "Last Page(s) filter document 2". Multiple pages and ranges can be selected here as well. Pages are counted with positive numbers from the end of the document, so 1 is the last page. A value of 0 (default) means no filtering.

Examples for page and page range definition:

  • 1
  • 1-4
  • 4-7,11-32
  • 1,5,7-21
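The page and range notation above can be expanded into explicit page numbers as in the following sketch. The function names are hypothetical and the validation rules are assumptions; the special value 0 of the "Last Page(s)" fields, which means no filtering, is not handled here and would have to be checked by the caller.

```python
from typing import List


def parse_page_ranges(value: str) -> List[int]:
    """Expand a definition such as '1,5,7-21' into a sorted list of pages."""
    pages = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # no value given
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


def from_end_to_absolute(pages_from_end: List[int], page_count: int) -> List[int]:
    """Map numbers counted from the end (1 = last page) to absolute page numbers."""
    return sorted(page_count - n + 1 for n in pages_from_end if 1 <= n <= page_count)
```

With a 10-page document, `from_end_to_absolute(parse_page_ranges("1-2"), 10)` refers to pages 9 and 10.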

Basic Table Optimization

This filter can be used to optimize the comparison of tables with visible borders. The filter will detect the original structure of the table and rearrange the content so that it is compared cell by cell.

Requirements: This filter will only detect a table if:

  • The table has a visible border
  • Each cell has a visible border
  • There is no cell spacing
  • The table has at least two rows and two columns

Filter repeating headers

In case a table does not fit onto a single page it is common to repeat the table header after the page break. Usually i-net PDFC tends to mark such repeated headers as differences since it's content that does not belong to the table data. With this option enabled, the filter will exclude any table header from the comparison that is identical to the header of the last table on the previous page.

Property

  • Filter repeating headers: Enables or disables the filtering of repeated table headers. The default value is false.

Multi-column layout

This filter should be used if the content is arranged in several columns. A typical example is the layout of daily newspapers.

Note: The filter is not suitable for tables!

Property

  • Multi Column Layout: Optimizes the text recognition for a multi-column layout. The default value is false.

Headers and Footers

This filter can be used to exclude headers and footers from the comparison, which reduces the number of repeated differences. Automatic detection is only possible in non-strict mode. Three options are available:

  • Do not recognize: Headers and footers are not recognized.
  • Automatically detect: Headers and footers are automatically recognized and treated by PDFC.
  • Manually set: The header and footer sizes can be set precisely in pixels if the areas cannot be detected automatically.

Property

  • Header Size: Specifies the size of the header in pixels. Set the value to -1 to automatically detect the header. The default value is -1.
  • Footer Size: Specifies the size of the footer in pixels. Set the value to -1 to automatically detect the footer. The default value is -1.

Group content

This filter option toggles how the layout detection of i-net PDFC will react to PDF files. By default, i-net PDFC will try to detect the layout of the document pages to some extent. With this option you can modify how i-net PDFC retrieves the layout information.

PDFC Standard

The layout is detected by the filters of i-net PDFC. All layout filters will be applied.

Compare original PDF text order

The content will be compared in the order it was printed to the PDF document. This approach assumes that the print order reflects the reading order. It can yield better results for very complex layouts.

Use PDF structure tree

This option advises i-net PDFC to use the optional meta information about the structure of the document. Usually this includes information about, for instance, paragraphs, tables and figures. If the structure data is present and accurate, it will be used to improve the result. If it is not present, the original PDF text order will be used.

A more detailed explanation can be found on the page of this parser extension.

Deactivate Font CMAP

With this option, the PDF parser will drop the mapping from character numbers to readable text. This often solves issues with intentionally obfuscated PDF files, which don't have this mapping in the first place. On the downside, it may make the difference messages unreadable, and it won't work if the CMAPs of the two documents differ. So it's not a general solution, but it often works for PDFs generated by the same application.

This option can also be combined with the "Text recovery by OCR" filter plugin. This filter uses optical character recognition to restore readable text. By default, this recognition is only performed for fonts that do not have a character mapping table. With the "Deactivate Font CMAP" option, however, the recognition is performed for all fonts in the document.

Filter content

You can specify patterns for the content filter. There are two types of filters: plain-text filters and regular expressions. These patterns can be disabled without deleting them from the configuration.

Property

  • Exclude whole words: If enabled, words will be filtered completely even if the pattern matches only a part of the word. When disabled, only the matched characters will be filtered. Enabled by default.
  • Filter Patterns: Defines a list of filters. Each filter is defined by one pattern or string, its type and its state, in the form <pattern or string>|(regexp|text)|(active|inactive).
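The filter syntax and the effect of 'Exclude whole words' can be illustrated with the following sketch. It is not the implementation used by i-net PDFC: the names `FilterPattern`, `parse_filter` and `apply_filter` are hypothetical, splitting from the right (so that a regular expression may itself contain '|') is an assumption, and words are simply taken to be whitespace-separated tokens.

```python
import re
from dataclasses import dataclass


@dataclass
class FilterPattern:
    pattern: str
    is_regexp: bool
    active: bool


def parse_filter(entry: str) -> FilterPattern:
    """Parse '<pattern or string>|(regexp|text)|(active|inactive)'."""
    # Split from the right so the pattern itself may contain '|'.
    pattern, kind, state = entry.rsplit("|", 2)
    if kind not in ("regexp", "text") or state not in ("active", "inactive"):
        raise ValueError(f"invalid filter definition: {entry!r}")
    return FilterPattern(pattern, kind == "regexp", state == "active")


def apply_filter(text: str, flt: FilterPattern, whole_words: bool = True) -> str:
    """Remove filtered content from text; inactive filters change nothing."""
    if not flt.active:
        return text
    regex = re.compile(flt.pattern if flt.is_regexp else re.escape(flt.pattern))
    if whole_words:
        # Drop every word that contains a match, even a partial one.
        return " ".join(w for w in text.split() if not regex.search(w))
    # Otherwise remove only the matched characters.
    return regex.sub("", text)
```

With the filter `Draft|text|active`, the text "Invoice Draft-2024 final" becomes "Invoice final" when whole words are excluded and "Invoice -2024 final" when only the matched characters are removed.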

Text recognition (OCR)

This filter uses the optical character recognition plugin to extract text content from images and drawings. The OCR plugin has to be active and the required language files have to be installed. For further details, please refer to the OCR help page.

Error tolerance

Optical character recognition often produces recognition errors due to small fonts, poor scanning, noise from background images or ambiguous characters. To overcome these errors, a tolerance level can be defined.

  • None - compare all characters as recognized (not recommended)
  • Similar characters only - tolerate errors on characters with the same appearance, like a Latin 'a' and a Russian 'а'. A full list of tolerated characters can be found at http://www.unicode.org/reports/tr36/confusables.txt
  • Common recognition errors - tolerate errors in characters with similar appearance, especially on noisy backgrounds. This tolerance is based on experience and testing, as there is no public recommendation. An example would be the German sharp s 'ß' and the upper case letter 'B', which are very similar in some fonts.
  • Common recognition errors caused by distortion - same as 'Common recognition errors' but extended for slightly rotated or distorted images. Such distortions usually occur when scanning pages.

Text recovery

Some PDF files have a missing character mapping. Such a mapping is required to translate from the visual text to machine-readable text. The usual effect is a correctly displayed document with apparently corrupt text in the comparison result. Furthermore, the text is corrupted as well when copying and pasting from this document (with any reader application!).

As a solution, this filter will rebuild and correct the character mapping by using optical character recognition. The accuracy depends on the amount of text: more text provides higher accuracy.

Compared Types

Text comparison

The text comparison includes all text elements such as words, numbers, punctuation and list elements. i-net PDFC will determine these elements as required and according to the rules of the system language. Such text will be compared by element, so even if only a single character is changed, the whole word will be marked as different. This is because a slight change could be a typo, or it could completely change the meaning of the word. Textual content is compared in natural reading order. This order may differ from the word order specified in the document, since some generators (especially for PDF) do not produce a meaningful word order.

Deviation tolerance for text

The deviation tolerance for text sets the maximum allowed y-jitter for the text line identification. It is relative to the text height of the respective line. This value can be used to compensate rounding errors of different PDF generators.
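A relative y-jitter tolerance of this kind can be pictured as follows. This is only a sketch of the idea under stated assumptions: the function name, the use of baseline coordinates and the choice of a single text height per line are hypothetical and do not describe the actual line detection of i-net PDFC.

```python
def same_line(baseline_a: float, baseline_b: float,
              text_height: float, tolerance: float) -> bool:
    """Treat two text elements as one line if their vertical offset
    stays within the tolerance, expressed relative to the text height."""
    return abs(baseline_a - baseline_b) <= tolerance * text_height
```

With a text height of 10 and a tolerance of 0.1, elements whose baselines differ by less than one unit are still treated as part of the same line, while an offset of two units starts a new line.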

This property defines the tolerated difference in the text size as a ratio. It's only relevant in case COMPARE_TEXT_STYLES is set to true and only if the strict comparison mode is being used.

Case sensitive comparison

If set to false, all text elements will be compared as lower case. This causes the comparison to run slightly slower and to use more memory. The conversion to lower case will be performed using the default localization of the runtime. The default value is 'true'.

Check text size

Verifies that the text size is the same in both documents.

Check text color

Verifies that the text color is the same in both documents.

Check font names

Verifies that the font names are the same in both documents.

Check text styles

Verifies that the text styles are the same in both documents.

Check non-semantic white spaces

Check for changes in white spaces and line breaks that are not semantically relevant. A common example is the removal of a white space between a word and the adjacent comma. Such changes are merely stylistic and do not change the meaning of the content. Thus these changes belong to the category 'Modified Styles'.

Language

If you are going to use an optical character recognition filter like 'Extract Text', i-net PDFC needs to know the language of the document. If the language analyzer plugin is available, you may choose 'Auto-detect' to let the analyzer detect the language automatically. If there is no such plugin, or if there are no native text elements in the document, you have to set the language explicitly. If the selected or detected language doesn't match the document language, the text recognition rate will be very poor.

If the language of the document is missing in the selection, please manually install this language. Further details can be found on the OCR help page.

Ignore rotated text

Excludes rotated text from the comparison. This setting is particularly suitable for watermarks and print marks.

  • Ignore rotated text: Excludes rotated text from the comparison. The default value is true.

Decompose complex characters

Activate to decompose complex or special characters into basic characters. Complex characters are, for instance, ligatures like 'ﬁ' (U+FB01), which will be decomposed into the two characters 'fi'. Furthermore, special characters such as long or short hyphens will be normalized to their base character.
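The effect of such a decomposition can be approximated with Unicode compatibility normalization, as in the sketch below. This is an analogy using Python's standard library, not the algorithm used by i-net PDFC; the `decompose` function and the hyphen table are hypothetical, and the choice of characters in the table is an assumption.

```python
import unicodedata

# Hyphen-like characters mapped to the plain ASCII hyphen-minus (assumed set).
HYPHEN_LIKE = {
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}
_HYPHEN_TABLE = str.maketrans(HYPHEN_LIKE)


def decompose(text: str) -> str:
    """Replace ligatures and hyphen variants by their basic characters."""
    # NFKC resolves compatibility characters such as the ligature U+FB01
    # while keeping accented letters in their composed form.
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_HYPHEN_TABLE)
```

After this step, a word typeset with the ligature U+FB01 compares equal to the same word written with the separate letters 'f' and 'i'.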

Equalize character recognition mistakes

Activate to correct typical text recognition mistakes. An example of a common ambiguity in text recognition is the character 'm' and the syllable 'rn', which appear very similar depending on print quality and font.
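One way to picture this equalization is to map ambiguous character sequences to a common form before comparing. The following sketch is purely illustrative: the table contains only the 'rn'/'m' pair named above, and the names are hypothetical rather than part of i-net PDFC.

```python
# Ambiguous sequences and the form they are equalized to (illustrative only).
CONFUSABLE_SEQUENCES = {"rn": "m"}


def equalize(text: str) -> str:
    """Map sequences that OCR commonly confuses to a single form."""
    for sequence, replacement in CONFUSABLE_SEQUENCES.items():
        text = text.replace(sequence, replacement)
    return text


def equal_after_equalizing(a: str, b: str) -> bool:
    return equalize(a) == equalize(b)
```

The trade-off is visible in the example pair "modern" and "modem": after equalizing they compare as equal, so a genuine difference between such words would no longer be reported.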

Ignore invisible elements

The purpose of this filter is to ignore meaningless elements generated by certain PDF renderers, for example text outside of the visible area (or page) or transparent borders of tables. The filter is designed to efficiently remove:

  • transparent text
  • tiny shapes which are not visible at 100% scale
  • transparent or white filled shapes

Compute actual visibility

Several document formats, such as PDF, are actually vector graphics formats. These documents contain commands which advise the viewer application what to draw and where. The commands are not exclusive and may cause overlapping or nonsensical drawing operations, such as text that is hidden behind an opaque shape or a white line on a white background. Recognizing such scenarios, where certain graphical elements are hidden or clipped, requires calculating the actual visibility of each element. Due to the performance impact of this computation, it has to be activated with the switch COMPUTE_CLIPPING.

With this feature active, i-net PDFC will check for every element whether it is occluded, whether it can be merged with similar elements (for shapes and images), whether it is clipped in some way and whether it has the same color as its background. As a result, only the visible part of each element will be compared, or the element is ignored if it has no impact on the visual appearance of the document.

Property

  • Ignore invisible elements: Potentially invisible elements such as white or transparent lines are not compared.
  • Compute actual visibility: Enables the calculation of the actual visibility of each element in the document. This feature can have a considerable performance cost, so it is inactive by default.

Line and shapes comparison

Lines and shapes can be compared as well. With this option enabled, every line in the document is compared for differences. It is recommended to leave this option off unless necessary, since small movements and extra space can cause lines to be placed at different positions, leading to a multitude of detected differences. You can additionally decide whether line styles (such as dashed vs. dotted lines) are to be compared, and define the tolerance level for differences in line sizes (length/thickness). The tolerance levels are measured in pixels. For example, a tolerance of "20 pixels" for the size would cause a line which is 50 pixels wide to be seen as identical to a line which is 30 pixels wide, but as different from a line which is 29 pixels wide.
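The tolerance rule from this example amounts to an inclusive comparison of the absolute size difference. A minimal sketch, with a hypothetical function name:

```python
def sizes_match(size_a: float, size_b: float, tolerance: float) -> bool:
    """Two sizes count as identical if they differ by at most the tolerance."""
    return abs(size_a - size_b) <= tolerance
```

With a tolerance of 20 pixels, `sizes_match(50, 30, 20)` is true, whereas `sizes_match(50, 29, 20)` is false because the difference is 21 pixels.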

Deviation tolerance for lines

In graphical interfaces the tolerance slider will adjust the size, location (strict mode only) and thickness tolerance.

Image comparison

The image comparison of i-net PDFC will compare all images of a document according to their visual appearance. The comparison can be configured to tolerate color differences to some extent. For overlapping, connected or clipped images the comparer will only take the visible pixels into account.

You should note that the image comparison may have a notable impact on the comparison performance.

Deviation tolerance for images

In graphical interfaces the tolerance slider adjusts the tolerance for color, size and location (strict mode only).

  • Image metadata comparison: Compares the metadata of an image, if it can be read. Image metadata includes the DPI, image format (JPG, PNG, etc.), color model (RGB, black/white, CMYK), and whether an alpha mask is present.
  • Detailed View: Specifies whether i-net PDFC should compare images in blocks and show the differing blocks in the result if the difference is under 50%. This option increases resource consumption. The default value is false.

Annotation comparison

Annotations are optional and often editable content, usually in PDF documents. Since annotations are not part of the primary content of the document, they are ignored by the comparison by default. With this option you can choose to compare annotations as well.

  • Detailed View: Differences in annotations will by default be summarized into one marker per different annotation. To get distinct markers for each difference in a comparison, please activate this option.
  • Alternative Text comparison: Differences in the alternative text of tag elements will be compared. Alternative text is usually present in accessible documents like PDF/UA. This option is independent of the annotation comparison.
