Use Case: Vendor Deduplication¶
This notebook shows how to use the Celonis Duplicate Checker to clean your Vendor (or Customer) master table.
import pandas as pd
import numpy as np
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker
from pycelonis import get_celonis
from pycelonis import pql
0. Connect to Celonis
celonis = get_celonis()
2021-02-01 08:27:04 - pycelonis: Login successful! Hello SimonBrand
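Inside a Celonis Machine Learning workbench, get_celonis() picks up the login automatically. Outside of one you would typically pass the connection details explicitly; a minimal sketch with placeholder values, assuming a pycelonis 1.x environment:
# Hypothetical connection details, shown for illustration only.
celonis = get_celonis(url="https://your-team.celonis.cloud",
                      api_token="YOUR_API_TOKEN")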
1. Connect to Data Model
datamodel_id = "3f6ef2f6-6635-49de-aa8c-7090188bb78b" # edit
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name # do not edit this line
'SAP ECC - Purchase to Pay Data Model'
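If you do not know the datamodel id, the collection can also be searched by name; a small sketch (the name below is the one printed above):
# Alternative lookup by name instead of id.
datamodel = celonis.datamodels.find("SAP ECC - Purchase to Pay Data Model")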
2. Define Vendor Id and Vendor Text fields
To deduplicate your vendors, we recommend using the vendor name (text) field. In addition, you need to provide the fields that uniquely identify each entry in the vendor table.
query = pql.PQL()
query += pql.PQLColumn("LFA1.MANDT", "MANDT")
query += pql.PQLColumn("LFA1.LIFNR", "LIFNR")
query += pql.PQLColumn("LFA1.NAME1", "NAME1")
query += pql.PQLColumn("LFA1.STRAS", "STRAS")
df = datamodel.get_data_frame(query)
print(df.shape)
df.head()
(1953, 4)
|   | MANDT | LIFNR | NAME1 | STRAS |
|---|---|---|---|---|
| 0 | 800 | T-K500R22 | AluCast AG&Co.KG Gr.22 | Liebigstrasse 34 |
| 1 | 800 | T-K515A02 | Sapsota Company Limited | Industriepark 48 |
| 2 | 800 | T-K515A04 | Sapsota Company Limited | Industriepark 48 |
| 3 | 800 | T-K515A12 | Sapsota Company Limited | Industriepark 48 |
| 4 | 800 | T-K515A13 | Sapsota Company Limited | Industriepark 48 |
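If the vendor table is large, you can optionally narrow the extraction with a PQL filter before pulling the data. A minimal sketch; the filter condition is illustrative, adjust it to your data:
# Optional: restrict the extraction, e.g. to a single client.
query += pql.PQLFilter("FILTER LFA1.MANDT = '800';")
df = datamodel.get_data_frame(query)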
Define the fields which uniquely identify each row of the table.
unique_id_columns = ["MANDT", "LIFNR"]
3. Define the Search Pattern
For vendor deduplication we recommend the CompanyNameComparer comparison algorithm. The threshold steers how similar two entries need to be in order to be matched (0 = no similarity required, 1 = identical); a quick intuition is sketched after the pattern definition below.
# Fuzzy matching algorithm
threshold = 0.9 # float between 0 and 1
search_pattern = {
"fuzzy":{
"NAME1": ("comparers.CompanyNameComparer(column, column,label=column,"
f"threshold={str(threshold)},method='jarowinkler')"),
#"STRAS" : "exact", # Optionally take into consideration also the adress
}
}
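To build intuition for the threshold, compare a few name pairs with a string-similarity function. The sketch below uses difflib from the standard library as a stand-in; it is not the Jaro-Winkler metric that CompanyNameComparer uses, but it illustrates the semantics: a score of 1.0 means identical, lower scores mean looser matches.
# Illustration only: difflib's ratio stands in for Jaro-Winkler.
from difflib import SequenceMatcher

pairs = [
    ("Sapsota Company Limited", "Sapsota Company Limited"),  # identical -> 1.0
    ("Sapsota Company Limited", "Sapsota Company Ltd."),     # near match
    ("Sapsota Company Limited", "AluCast AG&Co.KG Gr.22"),   # unrelated
]
for a, b in pairs:
    print(f"{SequenceMatcher(None, a, b).ratio():.2f}  {a!r} vs {b!r}")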
4. Set Max Chunk Size
Adjust the chunk size to the size of your workbench. If you are using a standard workbench, you should not go over 500. A smaller chunk size makes the run a little slower but uses less RAM.
max_chunk_size = 500 # adjust chunk size to the size of your workbench
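The reason chunking helps: a full index over n rows yields n·(n-1)/2 candidate record pairs (this is what the recordlinkage performance warning below refers to). Assuming candidate pairs are generated per chunk, the peak number of pairs in memory drops sharply; a quick back-of-the-envelope check:
# Rough illustration of why smaller chunks cap memory usage.
n_rows = len(df)  # 1953 in this example
full_pairs = n_rows * (n_rows - 1) // 2
chunk_pairs = max_chunk_size * (max_chunk_size - 1) // 2
print(f"full index: {full_pairs:,} pairs; per 500-row chunk: {chunk_pairs:,} pairs")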
5. Run Duplicate Checker
dc = DuplicateChecker(max_chunk_size=max_chunk_size)
res = dc.apply(df, unique_id_columns=unique_id_columns, search_patterns=search_pattern, fast_mode=True)
# CREATE UNIQUE VENDOR NAME PER DUPLICATE GROUP
real_names = (res.groupby("GROUP_ID")["NAME1"].max()
                 .reset_index()
                 .rename(columns={"NAME1": "REAL_NAME"}))
res = res.merge(real_names, on="GROUP_ID", how="left")
2021-02-01 08:53:20 - Celonis Duplicate Checker: Preprocessing DataFrame
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
2021-02-01 08:53:21 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 27/27 [00:07<00:00, 3.62it/s]
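The merge above assigns every member of a duplicate group the same canonical name: the lexicographic maximum of the group's NAME1 values. A toy example of the same pattern:
# Toy illustration of the groupby-max merge used above.
toy = pd.DataFrame({"GROUP_ID": ["g1", "g1", "g2", "g2"],
                    "NAME1": ["omar", "omar2", "isaac", "isaac1"]})
canonical = (toy.groupby("GROUP_ID")["NAME1"].max()
                .reset_index()
                .rename(columns={"NAME1": "REAL_NAME"}))
print(toy.merge(canonical, on="GROUP_ID", how="left"))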
6. Inspect Results
pd.set_option('display.max_rows', 200)
res.head(200)
|   | GROUP_ID | MANDT | LIFNR | NAME1 | STRAS | PATTERN | REAL_NAME |
|---|---|---|---|---|---|---|---|
| 0 | IDs:(800-0000000004,800-0000000030) | 800 | 0000000004 | omar |  | NAME1 | omar2 |
| 1 | IDs:(800-0000000004,800-0000000030) | 800 | 0000000030 | omar2 |  | NAME1 | omar2 |
| 2 | IDs:(800-0000000007,800-0000000032) | 800 | 0000000032 | isaac |  | NAME1 | isaac1 |
| 3 | IDs:(800-0000000007,800-0000000032) | 800 | 0000000007 | isaac1 |  | NAME1 | isaac1 |
| 4 | IDs:(800-0000000011,800-0000000022) | 800 | 0000000022 | sara | cardenal | NAME1 | sara |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-00 | Aramingo-00 | 3456 West Chester Pike | NAME1 | Aramingo-30 |
| 196 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-24 | Aramingo-24 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
| 197 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-01 | Aramingo-01 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
| 198 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-02 | Aramingo-02 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
| 199 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-07 | Aramingo-07 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
200 rows × 7 columns
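Beyond eyeballing the first rows, it often helps to review the largest duplicate groups first, for example:
# Show the largest duplicate groups first.
group_sizes = res.groupby("GROUP_ID").size().sort_values(ascending=False)
print(group_sizes.head(10))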
7. Push Results to Data Pool
datamodel.pool.push_table(res, 'DUPLICATE_VENDOR', reload_datamodel=False, if_exists='replace')
2021-01-21 09:16:34 - pycelonis: Execution of Data Job running
... (message repeated while the data job runs) ...
2021-01-21 09:16:45 - pycelonis: Data push job running, status: {'id': '8b850cf2-53c2-4f82-a737-6905be3218eb', 'targetName': 'res_total', 'lastModified': 1611220602084, 'lastPing': None, 'status': 'RUNNING', 'type': 'REPLACE', 'fileType': None, 'targetSchema': 'a9f8217f-90e2-4114-a100-675909dc6879', 'upsertStrategy': 'UPSERT_WITH_UNCHANGED_METADATA', 'fallbackVarcharLength': None, 'dataPoolId': 'a9f8217f-90e2-4114-a100-675909dc6879', 'connectionId': None, 'keys': [], 'tableSchema': None, 'csvParsingOptions': None, 'logs': ['2021-01-21T09:16:42 - Starting execution, status set to RUNNING'], 'mirrorTargetNames': [], 'optionalTenantId': None}
... (status message repeated while the push job runs) ...
<DatamodelTable, id 60beaefa-1660-413c-b833-383553e3986d, name res_total>
8. Add Table to Data Model
datamodel.add_table_from_pool(table_name="DUPLICATE_VENDOR",
                              alias="DUPLICATE_VENDOR",
                              added_table_join_type="source",
                              new_foreign_key_to_table="...",  # edit
                              foreign_key_columns=["..."],  # edit
                              reload="PARTIAL_ON_TABLE")