Duplicate Checker Basic Usage Tutorial¶
import pandas as pd
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker
Step 1: Create DuplicateChecker object
dc = DuplicateChecker()
1. Find exact matches¶
Example Data:
Below you see an example data set containing the unique identifier of each invoice (_CASE_KEY) and the columns that are supposed to be checked for duplicate entries. In this example only the last two invoices (_CASE_KEY = 5, 6) are exact matches, i.e. rows where all fields besides the unique identifier are the same.
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["ABCDEF123", "ABCEF123", "asdas", "1234", "1234", "1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-09-01", "2020-09-01", "2020-08-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "6", "6"],
"VALUE": [1000, 1000, 1000, 4400, 220, 220],
}
df = pd.DataFrame(data=d)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
5 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
Search Option 1: Search for rows where all fields (besides the unique id columns) match exactly
search_pattern1 = {
"All Exact Matches":{
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "exact",
"VALUE": "exact",
"_VENDOR_ID": "exact"
}
}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=search_pattern1)
2021-02-14 16:27:43 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:27:43 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
1 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
2 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | NaN | NaN | NO |
3 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | NaN | NaN | NO |
4 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
5 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
Search Option 2: Search for rows where a subset of fields (besides the unique id columns) match exactly
search_pattern2 = {
"All Exact ignore REFERENCE":{
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "ignore",
"VALUE": "exact",
"_VENDOR_ID": "exact"
}
}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=search_pattern2)
2021-02-14 16:08:56 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:08:56 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
2 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact ignore REFERENCE | YES |
3 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact ignore REFERENCE | YES |
4 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
5 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
Search Option 3: Search for rows where all fields (besides the unique id columns) match exactly, except REFERENCE, where entries must be DIFFERENT to create a match.
search_pattern3 = {
"All Exact but REFERENCE different":{
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "different",
"VALUE": "exact",
"_VENDOR_ID": "exact"
}
}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=search_pattern3)
2021-02-14 16:09:02 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:02 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact but REFERENCE different | YES |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact but REFERENCE different | YES |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | NaN | NaN | NO |
5 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | NaN | NaN | NO |
Search Options Combined: Apply multiple search patterns at once
combined_patterns = {}
combined_patterns.update(search_pattern1)
combined_patterns.update(search_pattern2)
combined_patterns.update(search_pattern3)
combined_patterns
{'All Exact Matches': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'exact', 'VALUE': 'exact', '_VENDOR_ID': 'exact'}, 'All Exact ignore REFERENCE': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'ignore', 'VALUE': 'exact', '_VENDOR_ID': 'exact'}, 'All Exact but REFERENCE different': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'different', 'VALUE': 'exact', '_VENDOR_ID': 'exact'}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=combined_patterns)
2021-02-14 16:09:05 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:06 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
2 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
3 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
4 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
5 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
2. Find fuzzy matches¶
Match Fuzzy Reference Numbers
Search for rows where all fields (besides the unique id columns) match exactly and one column (REFERENCE) is a fuzzy match.
This comparer was developed for finding duplicates in invoice references. It works by:
1) removing everything that matches [^a-zA-ZА-я\d], i.e. all special characters and white space.
E.g. "A R-AMC1234" <-> "AR AMC\ 1234"
2) checking whether the two records are exactly equal except for 0-3 extra characters
in one of them. E.g. "AR-AMC1234" <-> "A-AMC1234"
3) checking for common scanning errors like "8" <-> "B".
E.g. "AR-AMC1238" <-> "RA-AMC123B"
4) checking for transposed characters. E.g. "AR-AMC1234" <-> "RA-AMC1234"
The similarity is 1 in case of agreement and 0 otherwise.
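The normalization and transposition checks from the list above can be sketched as follows. This is a minimal illustration, not the library's actual implementation; `normalize_reference` and `is_transposition` are hypothetical helper names:

```python
import re

def normalize_reference(ref: str) -> str:
    """Step 1: keep only letters (incl. Cyrillic) and digits, uppercased."""
    return re.sub(r"[^a-zA-ZА-я\d]", "", ref).upper()

def is_transposition(a: str, b: str) -> bool:
    """Step 4: True if a and b differ only by one swap of adjacent characters."""
    if len(a) != len(b) or a == b:
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and a[diff[0]] == b[diff[1]]
        and a[diff[1]] == b[diff[0]]
    )
```

With this sketch, `"A R-AMC1234"` and `"AR AMC\ 1234"` normalize to the same string, and the normalized forms of `"AR-AMC1234"` and `"RA-AMC1234"` are recognized as a transposition.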
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking 1/23", "Boking1/23", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-09-01", "2020-09-01", "2020-08-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "6", "6"],
"VALUE": [1000, 1000, 1000, 4400, 220, 220],
}
df = pd.DataFrame(data=d)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking 1/23 | Celonis SE | 2020-09-01 | 1 | 1000 |
1 | 2 | Boking1/23 | Celonis SE | 2020-09-01 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
fuzzyReferencePattern = {
"REFERENCE_fuzzy_match": {
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "InvoiceReferenceFuzzy",
"VALUE": "exact",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyReferencePattern)
2021-02-14 16:09:12 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:12 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 11.36it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking 1/23 | Celonis SE | 2020-09-01 | 1 | 1000 | REFERENCE_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Boking1/23 | Celonis SE | 2020-09-01 | 1 | 1000 | REFERENCE_fuzzy_match | YES |
2 | 5 | IDs:(5,6) | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | REFERENCE_fuzzy_match | YES |
3 | 6 | IDs:(5,6) | AR-1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | REFERENCE_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NO |
5 | 4 | NaN | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NO |
Match Fuzzy Company Names
Search for rows where all fields (besides the unique id columns) match exactly and one column (VENDOR_NAME) is a fuzzy match.
This comparer was developed for finding duplicates in company names. It works by:
1) removing everything that matches [^a-zA-ZА-я\d], i.e. all special characters and white space.
E.g. "A R-AMC1234" <-> "AR AMC\ 1234"
2) removing common company key words like "corp", "llc" etc.
3) checking for matches with the given string similarity metric and the given threshold.
The similarity is 1 if the given string similarity metric exceeds the threshold, and 0 otherwise.
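The steps above can be sketched with the standard library. This is an illustration only, not the library's code: the `LEGAL_FORMS` keyword list is an assumption, and `difflib.SequenceMatcher` stands in for whichever string similarity metric is configured:

```python
import re
from difflib import SequenceMatcher

# Assumed set of legal-form keywords to strip (the real list may differ).
LEGAL_FORMS = {"corp", "corporation", "llc", "inc", "se", "gmbh", "ag", "ltd"}

def normalize_company(name: str) -> str:
    """Steps 1 and 2: strip special characters and legal-form keywords."""
    tokens = re.sub(r"[^a-zA-ZА-я\d ]", " ", name).lower().split()
    return "".join(t for t in tokens if t not in LEGAL_FORMS)

def company_similarity(a: str, b: str) -> float:
    """Step 3: similarity of the normalized names, here via difflib."""
    return SequenceMatcher(None, normalize_company(a), normalize_company(b)).ratio()
```

Under this sketch, "uipath" and "Uipath Corp." normalize to the same string (similarity 1.0), matching the grouping in the output below.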
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking123", "Booking123", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celones SE", "Celonis SE", "uipath", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-09-01", "2020-09-01", "2020-08-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "4", "4"],
"VALUE": [1000, 1000, 1000, 220, 220, 220],
}
df = pd.DataFrame(data=d)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-09-01 | 1 | 1000 |
1 | 2 | Booking123 | Celones SE | 2020-09-01 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | uipath | 2020-10-01 | 4 | 220 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
fuzzyVendorPattern = {
"Vendor_fuzzy_match": {
"VENDOR_NAME": "CompanyNameFuzzy",
"INVOICE_DATE": "exact",
"REFERENCE": "exact",
"VALUE": "exact",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyVendorPattern)
2021-02-14 16:09:20 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:20 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 13.14it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-09-01 | 1 | 1000 | Vendor_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celones SE | 2020-09-01 | 1 | 1000 | Vendor_fuzzy_match | YES |
2 | 4 | IDs:(4,5) | 1234 | uipath | 2020-10-01 | 4 | 220 | Vendor_fuzzy_match | YES |
3 | 5 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | Vendor_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NO |
5 | 6 | NaN | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | NaN | NO |
Match Fuzzy Dates
Search for rows where all fields (besides the unique id columns) match exactly and one column (INVOICE_DATE) is a fuzzy match.
This comparer was developed for finding duplicates in dates. The following are flagged as duplicates:
1) The exact same date. E.g. "2020-11-09" <-> "2020-11-09"
2) Month and day swapped. E.g. "2020-01-02" <-> "2020-02-01"
3) Month off by one, a common entry error. E.g. "2020-07-02" <-> "2020-06-02"
4) A distance of less than 7 days.
I.e. | "2020-07-02" - "2020-07-08" | < 7 days.
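The four rules above can be sketched as a single predicate. This is an illustration of the stated rules, not the library's DateFuzzy implementation; `dates_fuzzy_match` is a hypothetical helper name:

```python
from datetime import date

def dates_fuzzy_match(d1: date, d2: date) -> bool:
    """Flag two dates as potential duplicates per the rules above."""
    if d1 == d2:                                    # rule 1: identical
        return True
    try:                                            # rule 2: day/month swapped
        if d1 == date(d2.year, d2.day, d2.month):
            return True
    except ValueError:                              # swapped value is no valid date
        pass
    if d1.year == d2.year and d1.day == d2.day and abs(d1.month - d2.month) == 1:
        return True                                 # rule 3: month off by one
    return abs((d1 - d2).days) < 7                  # rule 4: within 7 days
```

Under these rules, the pairs (2020-02-01, 2020-01-02) and (2020-10-01, 2020-10-04) from the example data both match, consistent with the output below.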
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking123", "Booking123", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-02-01", "2020-01-02", "2020-08-01", "2020-10-01", "2020-10-04", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "4", "4"],
"VALUE": [1000, 1000, 1000, 220, 220, 220],
}
df = pd.DataFrame(data=d)
df.INVOICE_DATE = pd.to_datetime(df.INVOICE_DATE)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-02-01 | 1 | 1000 |
1 | 2 | Booking123 | Celonis SE | 2020-01-02 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-04 | 4 | 220 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
fuzzyDatePattern = {
"INVOICE_DATE_fuzzy_match": {
"VENDOR_NAME": "exact",
"INVOICE_DATE": "DateFuzzy",
"REFERENCE": "exact",
"VALUE": "exact",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyDatePattern)
2021-02-14 16:09:23 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:24 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 12.02it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-02-01 | 1 | 1000 | INVOICE_DATE_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-02 | 1 | 1000 | INVOICE_DATE_fuzzy_match | YES |
2 | 4 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | INVOICE_DATE_fuzzy_match | YES |
3 | 5 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-04 | 4 | 220 | INVOICE_DATE_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NO |
5 | 6 | NaN | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | NaN | NO |
Match Fuzzy Numbers
Search for rows where all fields (besides the unique id columns) match exactly and one column (VALUE) is a fuzzy match.
Docs: TODO
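Since the documentation for this comparer is still marked TODO, the following is purely an illustration and an assumption, not the actual InvoiceValueFuzzy logic. The example output is consistent with a relative-tolerance rule of roughly this shape:

```python
def values_fuzzy_match(v1: float, v2: float, rel_tol: float = 0.05) -> bool:
    """Assumed rule: values within rel_tol relative difference match.

    Both the rule and the 5% default are guesses for illustration only.
    """
    return abs(v1 - v2) <= rel_tol * max(abs(v1), abs(v2))
```

With a 5% tolerance, 1000.1 vs 1049.2 and 220.0 vs 220.1 would both match, as they do in the output below.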
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking123", "Booking123", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-01-01", "2020-01-01", "2020-10-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "4", "4"],
"VALUE": [1000.1, 1049.2, 1000, 220, 220.1, 220],
}
df = pd.DataFrame(data=d)
df.INVOICE_DATE = pd.to_datetime(df.INVOICE_DATE)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.1 |
1 | 2 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1049.2 |
2 | 3 | asdas | Celonis SE | 2020-10-01 | 1 | 1000.0 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.1 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
fuzzyValuePattern = {
"INVOICE_Value_fuzzy_match": {
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "exact",
"VALUE": "InvoiceValueFuzzy",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyValuePattern)
2021-02-14 16:09:27 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:27 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 15.25it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.1 | INVOICE_Value_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1049.2 | INVOICE_Value_fuzzy_match | YES |
2 | 4 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | INVOICE_Value_fuzzy_match | YES |
3 | 5 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.1 | INVOICE_Value_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-10-01 | 1 | 1000.0 | NaN | NO |
5 | 6 | NaN | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | NaN | NO |
Search Options Combined
Apply multiple fuzzy matching search patterns at once.
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6","7","8"],
"REFERENCE": ["Booking123", "Booking123", "ABC", "ABC", "1234", "AR-1234","123","123"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celones", "Celonis SE", "Uipath Corp.", "Uipath Corp.","Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-01-01", "2020-01-01", "2020-10-01", "2020-10-01", "2020-10-01", "2020-10-01","2020-10-03", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "4", "4", "4", "4","4","4"],
"VALUE": [1000, 1010.1, 1000, 1000, 220, 220,220,220],
}
df = pd.DataFrame(data=d)
df.INVOICE_DATE = pd.to_datetime(df.INVOICE_DATE)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.0 |
1 | 2 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1010.1 |
2 | 3 | ABC | Celones | 2020-10-01 | 4 | 1000.0 |
3 | 4 | ABC | Celonis SE | 2020-10-01 | 4 | 1000.0 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
6 | 7 | 123 | Uipath Corp. | 2020-10-03 | 4 | 220.0 |
7 | 8 | 123 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
combined_patterns = {}
combined_patterns.update(fuzzyVendorPattern)
combined_patterns.update(fuzzyDatePattern)
combined_patterns.update(fuzzyValuePattern)
combined_patterns.update(fuzzyReferencePattern)
combined_patterns
{'Vendor_fuzzy_match': {'VENDOR_NAME': 'CompanyNameFuzzy', 'INVOICE_DATE': 'exact', 'REFERENCE': 'exact', 'VALUE': 'exact'}, 'INVOICE_DATE_fuzzy_match': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'DateFuzzy', 'REFERENCE': 'exact', 'VALUE': 'exact'}, 'INVOICE_Value_fuzzy_match': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'exact', 'VALUE': 'InvoiceValueFuzzy'}, 'REFERENCE_fuzzy_match': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'InvoiceReferenceFuzzy', 'VALUE': 'exact'}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=combined_patterns)
2021-02-14 16:09:32 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:32 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 4/4 [00:00<00:00, 16.29it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.0 | INVOICE_Value_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1010.1 | INVOICE_Value_fuzzy_match | YES |
2 | 3 | IDs:(3,4) | ABC | Celones | 2020-10-01 | 4 | 1000.0 | Vendor_fuzzy_match | YES |
3 | 4 | IDs:(3,4) | ABC | Celonis SE | 2020-10-01 | 4 | 1000.0 | Vendor_fuzzy_match | YES |
4 | 5 | IDs:(5,6) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | REFERENCE_fuzzy_match | YES |
5 | 6 | IDs:(5,6) | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | REFERENCE_fuzzy_match | YES |
6 | 7 | IDs:(7,8) | 123 | Uipath Corp. | 2020-10-03 | 4 | 220.0 | INVOICE_DATE_fuzzy_match | YES |
7 | 8 | IDs:(7,8) | 123 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | INVOICE_DATE_fuzzy_match | YES |
Match Fuzzy Material Descriptions
Search for rows where all fields (besides the unique id columns) match exactly and one column (MATERIAL_TEXT) is a fuzzy match.
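The MaterialFuzzy comparer is not documented here, so as a rough illustration of this kind of text matching (an assumption, not the actual algorithm), a punctuation- and case-insensitive string similarity can be computed with the standard library:

```python
import re
from difflib import SequenceMatcher

def material_similarity(a: str, b: str) -> float:
    """Compare material texts after stripping punctuation, spaces, and case."""
    na = re.sub(r"[^a-zA-Z\d]", "", a).lower()
    nb = re.sub(r"[^a-zA-Z\d]", "", b).lower()
    return SequenceMatcher(None, na, nb).ratio()
```

Under this sketch, "Material-123" and "Material 123" score 1.0, and "Abdeckfolie Nr.3842146901" vs "Abdeckfolie 3842146901" score well above 0.9, consistent with the groups in the output below.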
d = {
"MATNR": ["1", "2", "3", "4", "5", "6","7","8","9","0"],
"MATERIAL_TEXT": ["BOX/TOILE EMERI LARG.40MM.LONG.25M",
"BOX1234 155 cm",
"BOX/TOILE EMERI LARG.40MM.LG.25M",
"Abdeckfolie Nr.3842146901",
"Material-123",
"AA Grease MU EP1 18kg bal. KN:1234",
"Abdeckfolie 3842146901",
"AA Grease MU EP1 18kg KN:1234","Material 123", "BOX1234 150 cm"]
}
df = pd.DataFrame(data=d)
df
MATNR | MATERIAL_TEXT | |
---|---|---|
0 | 1 | BOX/TOILE EMERI LARG.40MM.LONG.25M |
1 | 2 | BOX1234 155 cm |
2 | 3 | BOX/TOILE EMERI LARG.40MM.LG.25M |
3 | 4 | Abdeckfolie Nr.3842146901 |
4 | 5 | Material-123 |
5 | 6 | AA Grease MU EP1 18kg bal. KN:1234 |
6 | 7 | Abdeckfolie 3842146901 |
7 | 8 | AA Grease MU EP1 18kg KN:1234 |
8 | 9 | Material 123 |
9 | 0 | BOX1234 150 cm |
fuzzyMaterialPattern = {
"Material_fuzzy_match": {
"MATERIAL_TEXT": "MaterialFuzzy",
}}
dc.apply(df, unique_id_columns="MATNR", search_patterns=fuzzyMaterialPattern)
2021-02-14 16:09:51 - Celonis Duplicate Checker: Preprocessing DataFrame
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
2021-02-14 16:09:51 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 12.43it/s]
GROUP_ID | MATNR | MATERIAL_TEXT | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|
0 | IDs:(1,3) | 1 | BOX/TOILE EMERI LARG.40MM.LONG.25M | Material_fuzzy_match | YES |
1 | IDs:(1,3) | 3 | BOX/TOILE EMERI LARG.40MM.LG.25M | Material_fuzzy_match | YES |
2 | IDs:(4,7) | 4 | Abdeckfolie Nr.3842146901 | Material_fuzzy_match | YES |
3 | IDs:(4,7) | 7 | Abdeckfolie 3842146901 | Material_fuzzy_match | YES |
4 | IDs:(5,9) | 5 | Material-123 | Material_fuzzy_match | YES |
5 | IDs:(5,9) | 9 | Material 123 | Material_fuzzy_match | YES |
6 | IDs:(6,8) | 6 | AA Grease MU EP1 18kg bal. KN:1234 | Material_fuzzy_match | YES |
7 | IDs:(6,8) | 8 | AA Grease MU EP1 18kg KN:1234 | Material_fuzzy_match | YES |
8 | NaN | 2 | BOX1234 155 cm | NaN | NO |
9 | NaN | 0 | BOX1234 150 cm | NaN | NO |
Create and use a custom Matching Algorithm
Build your own comparison algorithm as a BaseCompareFeature of the recordlinkage package and use it in the duplicate checker. To see how to build such a pattern, see the example below or go to: https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#user-defined-algorithms
# create data frame
d = {
"CUSTOMER": ["1", "2", "3", "4", "5","6","7"],
"ZIP_CODE": ["54757", "54837", "80802", "80333", "12345","80666","13345"]
}
df = pd.DataFrame(data=d)
df
CUSTOMER | ZIP_CODE | |
---|---|---|
0 | 1 | 54757 |
1 | 2 | 54837 |
2 | 3 | 80802 |
3 | 4 | 80333 |
4 | 5 | 12345 |
5 | 6 | 80666 |
6 | 7 | 13345 |
# create custom class (BaseCompareFeature comes from the recordlinkage package)
from recordlinkage.base import BaseCompareFeature

class CompareZipCodes(BaseCompareFeature):
    def _compute_vectorized(self, s1, s2):
        """Compare zip codes.

        If the zip codes in both records are identical, the similarity
        is 1. If the codes are not identical but the first two digits
        agree, the similarity is 0.5. Otherwise, the similarity is 0.
        """
        # exact matches get similarity 1.0, everything else 0.0
        sim = (s1 == s2).astype(float)
        # partial credit where the first two digits of non-identical codes agree
        sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5
        return sim
# create pattern
zip_code_pattern = {
"zip_code_pattern": {
"ZIP_CODE": CompareZipCodes("ZIP_CODE", "ZIP_CODE", label="ZIP_CODE"),
}}
# run duplicate checking
dc.apply(df, unique_id_columns="CUSTOMER", search_patterns=zip_code_pattern)
2021-04-23 15:18:16 - Celonis Duplicate Checker: Preprocessing DataFrame
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
2021-04-23 15:18:16 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 23.59it/s]
GROUP_ID | CUSTOMER | ZIP_CODE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|
0 | IDs:(1,2) | 1 | 54757 | zip_code_pattern | YES |
1 | IDs:(1,2) | 2 | 54837 | zip_code_pattern | YES |
2 | IDs:(3,4,6) | 3 | 80802 | zip_code_pattern | YES |
3 | IDs:(3,4,6) | 4 | 80333 | zip_code_pattern | YES |
4 | IDs:(3,4,6) | 6 | 80666 | zip_code_pattern | YES |
5 | NaN | 5 | 12345 | NaN | NO |
6 | NaN | 7 | 13345 | NaN | NO |