Skip to content

Celonis Data Deduplication

Check the example jupyter notebooks below to learn about the basic usage of Celonis Data Deduplication. You can directly try these examples with your data! To get a copy of these notebooks in you current working directory, run the following command in any notebook:

from pycelonis.notebooks import data_deduplication

The folder data_deduplication with the example notebooks will appear in your current working directory.

Tutorials

Matching Algorithms

See below a list of algorithms / matching logic that are available.


exact

matches fields that are completely identical. This can be applied to fields of any data type.


different

this matches fields that are NOT identical. This can be useful if you are trying to find matches of different combinations of fields, in order to create distinct patterns. E.g. to create these three distinct patterns:

pattern 1: {"col1":"exact", "col2":"exact"}
pattern 2: {"col1":"different", "col2":"exact"}
pattern 3: {"col1":"exact", "col2":"different"}

CompanyNameFuzzy

this algorithm is built to deduplicate company names (vendors, customers etc.). The algorithm works as follows:

  1. removing all special characters, white spaces etc besides: [^a-zA-ZА-я\d]. E.g. "A R-AMC1234" <-> "AR AMC 1234"

  2. removing all "company key words" like "corp", "llc" etc.

  3. checking for matches with the given string similarity metric and the given threshold.

Advanced tipp: The used metric and threshold can be configured manually by using: "comparers.CompanyNameComparer(column, column,label=column, threshold=0.9,method='jarowinkler')"

The similarity is 1 if the given string similarity metric is bigger than the threshold.


DateFuzzy

this algorithm is built to deduplicate dates, which can be helpful e.g. if there are two invoices with similar dates. The algorithm matches the following:

  1. Exact same date. E.g. “2020-11-09” <-> “2020-11-09”

  2. Month and day swapped. E.g. “2020-01-02” <-> “2020-02-01”

  3. Month swapped common error. E.g. “2020-07-02” <-> “2020-06-02”

  4. Distance between two dates less than 7 days. E.g. | “2020-07-02” - “2020-07-08” | < 7 days.


InvoiceReferenceFuzzy

this algorithm is built to deduplicate invoice references. These are usually a combination of characters and numbers and are thus not fit for standard string comparison algorithms. The algorithm works as follows for two strings: 1. checking whether they are exactly equal besides special characters: [^a-zA-ZА-я\d]. E.g. "A R-AMC1234" <-> "AR AMC 1234" 2. checking whether they are exactly equal except for 0-3 extra characters in one of the two records. E.g. "AR-AMC1234" <-> "A-AMC1234" 3. checking for common scanning errors like "8" <-> "B".E.g. "AR-AMC1238" <-> "RA-AMC123B" 4. checking for turners in characters. E.g. "AR-AMC1234" <-> "RA-AMC1234"

The similarity is 1 in case of agreement and 0 otherwise.


InvoiceValueFuzzy

Currently using the standard record linkage Numeric comparer from Record Linkage Comparers:

"recordlinkage.compare.Numeric(left_on, right_on, method='linear', offset=0.0, scale=40.0, origin=0.0, missing_value=0.0, label=None)."


MaterialFuzzy

this algorithm is built to deduplicate material descriptions. These are usually a combination of characters and numbers and are thus not fit for standard string comparison algorithms. The algorithm works as follows for two strings: 1. checking whether they are exactly equal besides special characters: [^a-zA-ZА-я\d]. E.g. "A R-AMC1234" <-> "AR AMC 1234" 2. checking whether they are exactly equal except for 0-3 extra characters in one of the two records. E.g. "AR-AMC1234" <-> "A-AMC1234" AND The differing characters are NOT digits. 3. checking for common scanning errors like "8" <-> "B".E.g. "AR-AMC1238" <-> "RA-AMC123B" 4. checking for turners in characters. E.g. "AR-AMC1234" <-> "RA-AMC1234"

The similarity is 1 in case of agreement and 0 otherwise.


ZipCodeComparer

This algorithm is built to match zip codes that are close to each other. It will match two numbers if they are equal or if the first two digits are equal.

E.g. "80888" <-> "80651" but NOT "77777" <-> "78777".

custom Logic

You can also use one of the matching algorithms from the Record Linkage Comparers. Simply parse them as a string e.g. : "recordlinkage.compare.String(column, column, method='jarowinkler', threshold=0.85, missing_value=0.0, label=column)"



The results explained

The results table contains the same rows and columns as the input table, plus three additional columns, which contain the main results of the duplicate checking.

GROUP_ID

This is the most important column of the results table and contains the matching information. The rows that were found to be duplicates of each other have the same GROUP_ID and thus can be matched through this column. The GROUP_ID is contains the unique_ids of the matched cases.


PATTERN

The column PATTERN contains the name of the pattern that was used to match the rows of each group.


MATCH_FOUND

The column MATCH_FOUND indicated whether the corresponding row was found to be a duplicate of another. If 'YES' a duplicate was found, if 'NOT' the row was already checked, but no duplicate was found. This column can be used to check whether a case in the Data Model was already checked for duplicates or not. Simply use the PQL "ISNULL(DUPLICATE_TABLE.MATCH_FOUND)=1" to filter on cases that were not yet checked.


Back to top