Duplicate Checker Basic Usage Tutorial¶
import pandas as pd
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker
Step 1: Create DuplicateChecker object
dc = DuplicateChecker()
1. Find exact matches¶
Example Data:
Below you see an example data set containing the unique identifier of each invoice (_CASE_KEY) and the columns that are supposed to be checked for duplicate entries. In this example only the last two invoices (_CASE_KEY = 5, 6) are exact matches, i.e. rows where all fields besides the unique identifier are the same.
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["ABCDEF123", "ABCEF123", "asdas", "1234", "1234", "1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-09-01", "2020-09-01", "2020-08-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "6", "6"],
"VALUE": [1000, 1000, 1000, 4400, 220, 220],
}
df = pd.DataFrame(data=d)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
5 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
Search Option 1: Search for rows where all fields (besides the unique id columns) match exactly
search_pattern1 = {
"All Exact Matches":{
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "exact",
"VALUE": "exact",
"_VENDOR_ID": "exact"
}
}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=search_pattern1)
2021-02-14 16:27:43 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:27:43 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
1 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
2 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | NaN | NaN | NO |
3 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | NaN | NaN | NO |
4 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
5 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
Search Option 2: Search for rows where a subset of fields (besides the unique id columns) match exactly
search_pattern2 = {
"All Exact ignore REFERENCE":{
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "ignore",
"VALUE": "exact",
"_VENDOR_ID": "exact"
}
}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=search_pattern2)
2021-02-14 16:08:56 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:08:56 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
2 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact ignore REFERENCE | YES |
3 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact ignore REFERENCE | YES |
4 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
5 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
Search Option 3: Search for rows where all fields (besides the unique id columns) match exactly, except REFERENCE, where entries must be DIFFERENT to create a match.
search_pattern3 = {
"All Exact but REFERENCE different":{
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "different",
"VALUE": "exact",
"_VENDOR_ID": "exact"
}
}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=search_pattern3)
2021-02-14 16:09:02 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:02 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact but REFERENCE different | YES |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact but REFERENCE different | YES |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | NaN | NaN | NO |
5 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | NaN | NaN | NO |
Search Options Combined: Apply multiple search patterns at once
combined_patterns = {}
combined_patterns.update(search_pattern1)
combined_patterns.update(search_pattern2)
combined_patterns.update(search_pattern3)
combined_patterns
{'All Exact Matches': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'exact', 'VALUE': 'exact', '_VENDOR_ID': 'exact'}, 'All Exact ignore REFERENCE': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'ignore', 'VALUE': 'exact', '_VENDOR_ID': 'exact'}, 'All Exact but REFERENCE different': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'different', 'VALUE': 'exact', '_VENDOR_ID': 'exact'}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=combined_patterns)
2021-02-14 16:09:05 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:06 - Celonis Duplicate Checker: Searching for fuzzy matches ...
0it [00:00, ?it/s]
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | GROUP_ID | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCDEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
1 | 2 | ABCEF123 | Celonis SE | 2020-09-01 | 1 | 1000 | IDs:(1,2) | All Exact ignore REFERENCE | YES |
2 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
3 | 6 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | IDs:(5,6) | All Exact Matches | YES |
4 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NaN | NO |
5 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NaN | NO |
2. Find fuzzy matches¶
Match Fuzzy Reference Numbers
Search for rows where all fields (besides the unique id columns) match exactly and one column (REFERENCE) is a fuzzy match.
This comparer was developed for finding duplicates in invoice references. It works by:
1) removing everything that matches [^a-zA-ZА-я\d], i.e. all special characters and white space.
E.g. "A R-AMC1234" <-> "AR AMC\ 1234"
2) checking whether the two records are exactly equal except for 0-3 extra characters
in one of them. E.g. "AR-AMC1234" <-> "A-AMC1234"
3) checking for common scanning errors like "8" <-> "B".
E.g. "AR-AMC1238" <-> "RA-AMC123B"
4) checking for transposed characters. E.g. "AR-AMC1234" <-> "RA-AMC1234"
The similarity is 1 in case of agreement and 0 otherwise.
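The normalization and transposition checks from the list above can be sketched as follows. This is a minimal illustration, not the library's actual implementation; `normalize_reference` and `is_transposition` are hypothetical helper names:

```python
import re

def normalize_reference(ref: str) -> str:
    """Step 1: keep only letters (incl. Cyrillic) and digits, uppercased."""
    return re.sub(r"[^a-zA-ZА-я\d]", "", ref).upper()

def is_transposition(a: str, b: str) -> bool:
    """Step 4: True if a and b differ only by one swap of adjacent characters."""
    if len(a) != len(b) or a == b:
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and a[diff[0]] == b[diff[1]]
        and a[diff[1]] == b[diff[0]]
    )
```

With this sketch, `"A R-AMC1234"` and `"AR AMC\ 1234"` normalize to the same string, and the normalized forms of `"AR-AMC1234"` and `"RA-AMC1234"` are recognized as a transposition.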
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking 1/23", "Boking1/23", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-09-01", "2020-09-01", "2020-08-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "6", "6"],
"VALUE": [1000, 1000, 1000, 4400, 220, 220],
}
df = pd.DataFrame(data=d)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking 1/23 | Celonis SE | 2020-09-01 | 1 | 1000 |
1 | 2 | Boking1/23 | Celonis SE | 2020-09-01 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 6 | 220 |
fuzzyReferencePattern = {
"REFERENCE_fuzzy_match": {
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "InvoiceReferenceFuzzy",
"VALUE": "exact",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyReferencePattern)
2021-02-14 16:09:12 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:12 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 11.36it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking 1/23 | Celonis SE | 2020-09-01 | 1 | 1000 | REFERENCE_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Boking1/23 | Celonis SE | 2020-09-01 | 1 | 1000 | REFERENCE_fuzzy_match | YES |
2 | 5 | IDs:(5,6) | 1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | REFERENCE_fuzzy_match | YES |
3 | 6 | IDs:(5,6) | AR-1234 | Uipath Corp. | 2020-10-01 | 6 | 220 | REFERENCE_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NO |
5 | 4 | NaN | 1234 | Uipath Corp. | 2020-10-01 | 4 | 4400 | NaN | NO |
Match Fuzzy Company Names
Search for rows where all fields (besides the unique id columns) match exactly and one column (VENDOR_NAME) is a fuzzy match.
This comparer was developed for finding duplicates in company names. It works by:
1) removing everything that matches [^a-zA-ZА-я\d], i.e. all special characters and white space.
E.g. "A R-AMC1234" <-> "AR AMC\ 1234"
2) removing common company key words like "corp", "llc" etc.
3) checking for matches with the given string similarity metric and the given threshold.
The similarity is 1 if the given string similarity metric exceeds the threshold, and 0 otherwise.
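The steps above can be sketched with the standard library. This is an illustration only, not the library's code: the `LEGAL_FORMS` keyword list is an assumption, and `difflib.SequenceMatcher` stands in for whichever string similarity metric is configured:

```python
import re
from difflib import SequenceMatcher

# Assumed set of legal-form keywords to strip (the real list may differ).
LEGAL_FORMS = {"corp", "corporation", "llc", "inc", "se", "gmbh", "ag", "ltd"}

def normalize_company(name: str) -> str:
    """Steps 1 and 2: strip special characters and legal-form keywords."""
    tokens = re.sub(r"[^a-zA-ZА-я\d ]", " ", name).lower().split()
    return "".join(t for t in tokens if t not in LEGAL_FORMS)

def company_similarity(a: str, b: str) -> float:
    """Step 3: similarity of the normalized names, here via difflib."""
    return SequenceMatcher(None, normalize_company(a), normalize_company(b)).ratio()
```

Under this sketch, "uipath" and "Uipath Corp." normalize to the same string (similarity 1.0), matching the grouping in the output below.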
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking123", "Booking123", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celones SE", "Celonis SE", "uipath", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-09-01", "2020-09-01", "2020-08-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "4", "4"],
"VALUE": [1000, 1000, 1000, 220, 220, 220],
}
df = pd.DataFrame(data=d)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-09-01 | 1 | 1000 |
1 | 2 | Booking123 | Celones SE | 2020-09-01 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | uipath | 2020-10-01 | 4 | 220 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
fuzzyVendorPattern = {
"Vendor_fuzzy_match": {
"VENDOR_NAME": "CompanyNameFuzzy",
"INVOICE_DATE": "exact",
"REFERENCE": "exact",
"VALUE": "exact",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyVendorPattern)
2021-02-14 16:09:20 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:20 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 13.14it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-09-01 | 1 | 1000 | Vendor_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celones SE | 2020-09-01 | 1 | 1000 | Vendor_fuzzy_match | YES |
2 | 4 | IDs:(4,5) | 1234 | uipath | 2020-10-01 | 4 | 220 | Vendor_fuzzy_match | YES |
3 | 5 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | Vendor_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NO |
5 | 6 | NaN | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | NaN | NO |
Match Fuzzy Dates
Search for rows where all fields (besides the unique id columns) match exactly and one column (INVOICE_DATE) is a fuzzy match.
This comparer was developed for finding duplicates in dates. The following are flagged as duplicates:
1) The exact same date. E.g. "2020-11-09" <-> "2020-11-09"
2) Month and day swapped. E.g. "2020-01-02" <-> "2020-02-01"
3) Month off by one, a common entry error. E.g. "2020-07-02" <-> "2020-06-02"
4) A distance of less than 7 days.
I.e. | "2020-07-02" - "2020-07-08" | < 7 days.
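The four rules above can be sketched as a single predicate. This is an illustration of the stated rules, not the library's DateFuzzy implementation; `dates_fuzzy_match` is a hypothetical helper name:

```python
from datetime import date

def dates_fuzzy_match(d1: date, d2: date) -> bool:
    """Flag two dates as potential duplicates per the rules above."""
    if d1 == d2:                                    # rule 1: identical
        return True
    try:                                            # rule 2: day/month swapped
        if d1 == date(d2.year, d2.day, d2.month):
            return True
    except ValueError:                              # swapped value is no valid date
        pass
    if d1.year == d2.year and d1.day == d2.day and abs(d1.month - d2.month) == 1:
        return True                                 # rule 3: month off by one
    return abs((d1 - d2).days) < 7                  # rule 4: within 7 days
```

Under these rules, the pairs (2020-02-01, 2020-01-02) and (2020-10-01, 2020-10-04) from the example data both match, consistent with the output below.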
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking123", "Booking123", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-02-01", "2020-01-02", "2020-08-01", "2020-10-01", "2020-10-04", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "4", "4"],
"VALUE": [1000, 1000, 1000, 220, 220, 220],
}
df = pd.DataFrame(data=d)
df.INVOICE_DATE = pd.to_datetime(df.INVOICE_DATE)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-02-01 | 1 | 1000 |
1 | 2 | Booking123 | Celonis SE | 2020-01-02 | 1 | 1000 |
2 | 3 | asdas | Celonis SE | 2020-08-01 | 1 | 1000 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-04 | 4 | 220 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 |
fuzzyDatePattern = {
"INVOICE_DATE_fuzzy_match": {
"VENDOR_NAME": "exact",
"INVOICE_DATE": "DateFuzzy",
"REFERENCE": "exact",
"VALUE": "exact",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyDatePattern)
2021-02-14 16:09:23 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:24 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 12.02it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-02-01 | 1 | 1000 | INVOICE_DATE_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-02 | 1 | 1000 | INVOICE_DATE_fuzzy_match | YES |
2 | 4 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | INVOICE_DATE_fuzzy_match | YES |
3 | 5 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-04 | 4 | 220 | INVOICE_DATE_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-08-01 | 1 | 1000 | NaN | NO |
5 | 6 | NaN | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220 | NaN | NO |
Match Fuzzy Numbers
Search for rows where all fields (besides the unique id columns) match exactly and one column (VALUE) is a fuzzy match.
Docs: TODO
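Since the documentation for this comparer is still marked TODO, the following is purely an illustration and an assumption, not the actual InvoiceValueFuzzy logic. The example output is consistent with a relative-tolerance rule of roughly this shape:

```python
def values_fuzzy_match(v1: float, v2: float, rel_tol: float = 0.05) -> bool:
    """Assumed rule: values within rel_tol relative difference match.

    Both the rule and the 5% default are guesses for illustration only.
    """
    return abs(v1 - v2) <= rel_tol * max(abs(v1), abs(v2))
```

With a 5% tolerance, 1000.1 vs 1049.2 and 220.0 vs 220.1 would both match, as they do in the output below.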
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6"],
"REFERENCE": ["Booking123", "Booking123", "asdas", "1234", "1234", "AR-1234"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celonis SE", "Uipath Corp.", "Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-01-01", "2020-01-01", "2020-10-01", "2020-10-01", "2020-10-01", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "1", "4", "4", "4"],
"VALUE": [1000.1, 1049.2, 1000, 220, 220.1, 220],
}
df = pd.DataFrame(data=d)
df.INVOICE_DATE = pd.to_datetime(df.INVOICE_DATE)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.1 |
1 | 2 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1049.2 |
2 | 3 | asdas | Celonis SE | 2020-10-01 | 1 | 1000.0 |
3 | 4 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.1 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
fuzzyValuePattern = {
"INVOICE_Value_fuzzy_match": {
"VENDOR_NAME": "exact",
"INVOICE_DATE": "exact",
"REFERENCE": "exact",
"VALUE": "InvoiceValueFuzzy",
}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=fuzzyValuePattern)
2021-02-14 16:09:27 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:27 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 15.25it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.1 | INVOICE_Value_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1049.2 | INVOICE_Value_fuzzy_match | YES |
2 | 4 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | INVOICE_Value_fuzzy_match | YES |
3 | 5 | IDs:(4,5) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.1 | INVOICE_Value_fuzzy_match | YES |
4 | 3 | NaN | asdas | Celonis SE | 2020-10-01 | 1 | 1000.0 | NaN | NO |
5 | 6 | NaN | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | NaN | NO |
Search Options Combined
Apply multiple fuzzy matching search patterns at once.
d = {
"_CASE_KEY": ["1", "2", "3", "4", "5", "6","7","8"],
"REFERENCE": ["Booking123", "Booking123", "ABC", "ABC", "1234", "AR-1234","123","123"],
"VENDOR_NAME": ["Celonis SE", "Celonis SE", "Celones", "Celonis SE", "Uipath Corp.", "Uipath Corp.","Uipath Corp.", "Uipath Corp."],
"INVOICE_DATE": ["2020-01-01", "2020-01-01", "2020-10-01", "2020-10-01", "2020-10-01", "2020-10-01","2020-10-03", "2020-10-01"],
"_VENDOR_ID": ["1", "1", "4", "4", "4", "4","4","4"],
"VALUE": [1000, 1010.1, 1000, 1000, 220, 220,220,220],
}
df = pd.DataFrame(data=d)
df.INVOICE_DATE = pd.to_datetime(df.INVOICE_DATE)
df
_CASE_KEY | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | |
---|---|---|---|---|---|---|
0 | 1 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.0 |
1 | 2 | Booking123 | Celonis SE | 2020-01-01 | 1 | 1010.1 |
2 | 3 | ABC | Celones | 2020-10-01 | 4 | 1000.0 |
3 | 4 | ABC | Celonis SE | 2020-10-01 | 4 | 1000.0 |
4 | 5 | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
5 | 6 | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
6 | 7 | 123 | Uipath Corp. | 2020-10-03 | 4 | 220.0 |
7 | 8 | 123 | Uipath Corp. | 2020-10-01 | 4 | 220.0 |
combined_patterns = {}
combined_patterns.update(fuzzyVendorPattern)
combined_patterns.update(fuzzyDatePattern)
combined_patterns.update(fuzzyValuePattern)
combined_patterns.update(fuzzyReferencePattern)
combined_patterns
{'Vendor_fuzzy_match': {'VENDOR_NAME': 'CompanyNameFuzzy', 'INVOICE_DATE': 'exact', 'REFERENCE': 'exact', 'VALUE': 'exact'}, 'INVOICE_DATE_fuzzy_match': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'DateFuzzy', 'REFERENCE': 'exact', 'VALUE': 'exact'}, 'INVOICE_Value_fuzzy_match': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'exact', 'VALUE': 'InvoiceValueFuzzy'}, 'REFERENCE_fuzzy_match': {'VENDOR_NAME': 'exact', 'INVOICE_DATE': 'exact', 'REFERENCE': 'InvoiceReferenceFuzzy', 'VALUE': 'exact'}}
dc.apply(df, unique_id_columns="_CASE_KEY", search_patterns=combined_patterns)
2021-02-14 16:09:32 - Celonis Duplicate Checker: Preprocessing DataFrame
2021-02-14 16:09:32 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 4/4 [00:00<00:00, 16.29it/s]
_CASE_KEY | GROUP_ID | REFERENCE | VENDOR_NAME | INVOICE_DATE | _VENDOR_ID | VALUE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1000.0 | INVOICE_Value_fuzzy_match | YES |
1 | 2 | IDs:(1,2) | Booking123 | Celonis SE | 2020-01-01 | 1 | 1010.1 | INVOICE_Value_fuzzy_match | YES |
2 | 3 | IDs:(3,4) | ABC | Celones | 2020-10-01 | 4 | 1000.0 | Vendor_fuzzy_match | YES |
3 | 4 | IDs:(3,4) | ABC | Celonis SE | 2020-10-01 | 4 | 1000.0 | Vendor_fuzzy_match | YES |
4 | 5 | IDs:(5,6) | 1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | REFERENCE_fuzzy_match | YES |
5 | 6 | IDs:(5,6) | AR-1234 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | REFERENCE_fuzzy_match | YES |
6 | 7 | IDs:(7,8) | 123 | Uipath Corp. | 2020-10-03 | 4 | 220.0 | INVOICE_DATE_fuzzy_match | YES |
7 | 8 | IDs:(7,8) | 123 | Uipath Corp. | 2020-10-01 | 4 | 220.0 | INVOICE_DATE_fuzzy_match | YES |
Match Fuzzy Material Descriptions
Search for rows where all fields (besides the unique id columns) match exactly and one column (MATERIAL_TEXT) is a fuzzy match.
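The MaterialFuzzy comparer is not documented here, so as a rough illustration of this kind of text matching (an assumption, not the actual algorithm), a punctuation- and case-insensitive string similarity can be computed with the standard library:

```python
import re
from difflib import SequenceMatcher

def material_similarity(a: str, b: str) -> float:
    """Compare material texts after stripping punctuation, spaces, and case."""
    na = re.sub(r"[^a-zA-Z\d]", "", a).lower()
    nb = re.sub(r"[^a-zA-Z\d]", "", b).lower()
    return SequenceMatcher(None, na, nb).ratio()
```

Under this sketch, "Material-123" and "Material 123" score 1.0, and "Abdeckfolie Nr.3842146901" vs "Abdeckfolie 3842146901" score well above 0.9, consistent with the groups in the output below.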
d = {
"MATNR": ["1", "2", "3", "4", "5", "6","7","8","9","0"],
"MATERIAL_TEXT": ["BOX/TOILE EMERI LARG.40MM.LONG.25M",
"BOX1234 155 cm",
"BOX/TOILE EMERI LARG.40MM.LG.25M",
"Abdeckfolie Nr.3842146901",
"Material-123",
"AA Grease MU EP1 18kg bal. KN:1234",
"Abdeckfolie 3842146901",
"AA Grease MU EP1 18kg KN:1234","Material 123", "BOX1234 150 cm"]
}
df = pd.DataFrame(data=d)
df
MATNR | MATERIAL_TEXT | |
---|---|---|
0 | 1 | BOX/TOILE EMERI LARG.40MM.LONG.25M |
1 | 2 | BOX1234 155 cm |
2 | 3 | BOX/TOILE EMERI LARG.40MM.LG.25M |
3 | 4 | Abdeckfolie Nr.3842146901 |
4 | 5 | Material-123 |
5 | 6 | AA Grease MU EP1 18kg bal. KN:1234 |
6 | 7 | Abdeckfolie 3842146901 |
7 | 8 | AA Grease MU EP1 18kg KN:1234 |
8 | 9 | Material 123 |
9 | 0 | BOX1234 150 cm |
fuzzyMaterialPattern = {
"Material_fuzzy_match": {
"MATERIAL_TEXT": "MaterialFuzzy",
}}
dc.apply(df, unique_id_columns="MATNR", search_patterns=fuzzyMaterialPattern)
2021-02-14 16:09:51 - Celonis Duplicate Checker: Preprocessing DataFrame
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
2021-02-14 16:09:51 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 12.43it/s]
GROUP_ID | MATNR | MATERIAL_TEXT | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|
0 | IDs:(1,3) | 1 | BOX/TOILE EMERI LARG.40MM.LONG.25M | Material_fuzzy_match | YES |
1 | IDs:(1,3) | 3 | BOX/TOILE EMERI LARG.40MM.LG.25M | Material_fuzzy_match | YES |
2 | IDs:(4,7) | 4 | Abdeckfolie Nr.3842146901 | Material_fuzzy_match | YES |
3 | IDs:(4,7) | 7 | Abdeckfolie 3842146901 | Material_fuzzy_match | YES |
4 | IDs:(5,9) | 5 | Material-123 | Material_fuzzy_match | YES |
5 | IDs:(5,9) | 9 | Material 123 | Material_fuzzy_match | YES |
6 | IDs:(6,8) | 6 | AA Grease MU EP1 18kg bal. KN:1234 | Material_fuzzy_match | YES |
7 | IDs:(6,8) | 8 | AA Grease MU EP1 18kg KN:1234 | Material_fuzzy_match | YES |
8 | NaN | 2 | BOX1234 155 cm | NaN | NO |
9 | NaN | 0 | BOX1234 150 cm | NaN | NO |
Create and use a custom Matching Algorithm
Build your own comparison algorithm as a BaseCompareFeature of the recordlinkage package and use it in the duplicate checker. To see how to build such a pattern, see the example below or go to: https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#user-defined-algorithms
# create data frame
d = {
"CUSTOMER": ["1", "2", "3", "4", "5","6","7"],
"ZIP_CODE": ["54757", "54837", "80802", "80333", "12345","80666","13345"]
}
df = pd.DataFrame(data=d)
df
CUSTOMER | ZIP_CODE | |
---|---|---|
0 | 1 | 54757 |
1 | 2 | 54837 |
2 | 3 | 80802 |
3 | 4 | 80333 |
4 | 5 | 12345 |
5 | 6 | 80666 |
6 | 7 | 13345 |
# create custom class (BaseCompareFeature comes from the recordlinkage package)
from recordlinkage.base import BaseCompareFeature

class CompareZipCodes(BaseCompareFeature):
    def _compute_vectorized(self, s1, s2):
        """Compare zip codes.

        If the zip codes in both records are identical, the similarity
        is 1. If the codes are not identical but the first two digits
        agree, the similarity is 0.5. Otherwise, the similarity is 0.
        """
        # exact matches get similarity 1.0, everything else 0.0
        sim = (s1 == s2).astype(float)
        # partial credit where the first two digits of non-identical codes agree
        sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5
        return sim
# create pattern
zip_code_pattern = {
"zip_code_pattern": {
"ZIP_CODE": CompareZipCodes("ZIP_CODE", "ZIP_CODE", label="ZIP_CODE"),
}}
# run duplicate checking
dc.apply(df, unique_id_columns="CUSTOMER", search_patterns=zip_code_pattern)
2021-04-23 15:18:16 - Celonis Duplicate Checker: Preprocessing DataFrame
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
2021-04-23 15:18:16 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 1/1 [00:00<00:00, 23.59it/s]
GROUP_ID | CUSTOMER | ZIP_CODE | PATTERN | MATCH_FOUND | |
---|---|---|---|---|---|
0 | IDs:(1,2) | 1 | 54757 | zip_code_pattern | YES |
1 | IDs:(1,2) | 2 | 54837 | zip_code_pattern | YES |
2 | IDs:(3,4,6) | 3 | 80802 | zip_code_pattern | YES |
3 | IDs:(3,4,6) | 4 | 80333 | zip_code_pattern | YES |
4 | IDs:(3,4,6) | 6 | 80666 | zip_code_pattern | YES |
5 | NaN | 5 | 12345 | NaN | NO |
6 | NaN | 7 | 13345 | NaN | NO |