Use Case: Vendor Deduplication¶
This notebook shows how to use the Celonis Duplicate Checker to clean your Vendor (or Customer) master table.
import pandas as pd
import numpy as np
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker
from pycelonis import get_celonis
from pycelonis import pql
0. Connect to Celonis
celonis = get_celonis()
2021-02-01 08:27:04 - pycelonis: Login successful! Hello SimonBrand
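Inside a Celonis Machine Learning workbench, get_celonis() picks up the login automatically. Outside of one you would typically pass the connection details explicitly; a minimal sketch with placeholder values, assuming a pycelonis 1.x environment:
# Hypothetical connection details, shown for illustration only.
celonis = get_celonis(url="https://your-team.celonis.cloud",
                      api_token="YOUR_API_TOKEN")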
1. Connect to Data Model
datamodel_id = "3f6ef2f6-6635-49de-aa8c-7090188bb78b" # edit
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name # do not edit this line
'SAP ECC - Purchase to Pay Data Model'
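If you do not know the datamodel id, the collection can also be searched by name; a small sketch (the name below is the one printed above):
# Alternative lookup by name instead of id.
datamodel = celonis.datamodels.find("SAP ECC - Purchase to Pay Data Model")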
2. Define Vendor Id and Vendor Text fields
To deduplicate your vendors, we recommend using the vendor name (text) field. In addition, you need to provide the fields that uniquely identify each entry in the vendor table.
query = pql.PQL()
query += pql.PQLColumn("LFA1.MANDT", "MANDT")
query += pql.PQLColumn("LFA1.LIFNR", "LIFNR")
query += pql.PQLColumn("LFA1.NAME1", "NAME1")
query += pql.PQLColumn("LFA1.STRAS", "STRAS")
df = datamodel.get_data_frame(query)
print(df.shape)
df.head()
(1953, 4)
|   | MANDT | LIFNR | NAME1 | STRAS |
|---|---|---|---|---|
| 0 | 800 | T-K500R22 | AluCast AG&Co.KG Gr.22 | Liebigstrasse 34 |
| 1 | 800 | T-K515A02 | Sapsota Company Limited | Industriepark 48 |
| 2 | 800 | T-K515A04 | Sapsota Company Limited | Industriepark 48 |
| 3 | 800 | T-K515A12 | Sapsota Company Limited | Industriepark 48 |
| 4 | 800 | T-K515A13 | Sapsota Company Limited | Industriepark 48 |
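If the vendor table is large, you can optionally narrow the extraction with a PQL filter before pulling the data. A minimal sketch; the filter condition is illustrative, adjust it to your data:
# Optional: restrict the extraction, e.g. to a single client.
query += pql.PQLFilter("FILTER LFA1.MANDT = '800';")
df = datamodel.get_data_frame(query)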
Define the fields which uniquely identify each row of the table.
unique_id_columns = ["MANDT", "LIFNR"]
3. Define the Search Pattern
For vendor deduplication we recommend the CompanyNameComparer comparison algorithm. The threshold steers how similar two entries need to be in order to be matched (0 = no similarity required, 1 = identical); a quick intuition is sketched after the pattern definition below.
# Fuzzy matching algorithm
threshold = 0.9 # float between 0 and 1
search_pattern = {
"fuzzy":{
"NAME1": ("comparers.CompanyNameComparer(column, column,label=column,"
f"threshold={str(threshold)},method='jarowinkler')"),
#"STRAS" : "exact", # Optionally take into consideration also the adress
}
}
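To build intuition for the threshold, compare a few name pairs with a string-similarity function. The sketch below uses difflib from the standard library as a stand-in; it is not the Jaro-Winkler metric that CompanyNameComparer uses, but it illustrates the semantics: a score of 1.0 means identical, lower scores mean looser matches.
# Illustration only: difflib's ratio stands in for Jaro-Winkler.
from difflib import SequenceMatcher

pairs = [
    ("Sapsota Company Limited", "Sapsota Company Limited"),  # identical -> 1.0
    ("Sapsota Company Limited", "Sapsota Company Ltd."),     # near match
    ("Sapsota Company Limited", "AluCast AG&Co.KG Gr.22"),   # unrelated
]
for a, b in pairs:
    print(f"{SequenceMatcher(None, a, b).ratio():.2f}  {a!r} vs {b!r}")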
4. Set Max Chunk Size
Adjust the chunk size to the size of your workbench. If you are using a standard workbench, you should not go over 500. A smaller chunk size makes the run a little slower but uses less RAM.
max_chunk_size = 500 # adjust chunk size to the size of your workbench
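The reason chunking helps: a full index over n rows yields n·(n-1)/2 candidate record pairs (this is what the recordlinkage performance warning below refers to). Assuming candidate pairs are generated per chunk, the peak number of pairs in memory drops sharply; a quick back-of-the-envelope check:
# Rough illustration of why smaller chunks cap memory usage.
n_rows = len(df)  # 1953 in this example
full_pairs = n_rows * (n_rows - 1) // 2
chunk_pairs = max_chunk_size * (max_chunk_size - 1) // 2
print(f"full index: {full_pairs:,} pairs; per 500-row chunk: {chunk_pairs:,} pairs")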
5. Run Duplicate Checker
dc = DuplicateChecker(max_chunk_size=max_chunk_size)
res = dc.apply(df, unique_id_columns=unique_id_columns, search_patterns=search_pattern, fast_mode=True)
# CREATE UNIQUE VENDOR NAME PER DUPLICATE GROUP
real_names = (res.groupby("GROUP_ID")["NAME1"].max()
                 .reset_index()
                 .rename(columns={"NAME1": "REAL_NAME"}))
res = res.merge(real_names, on="GROUP_ID", how="left")
2021-02-01 08:53:20 - Celonis Duplicate Checker: Preprocessing DataFrame
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
2021-02-01 08:53:21 - Celonis Duplicate Checker: Searching for fuzzy matches ...
100%|██████████| 27/27 [00:07<00:00, 3.62it/s]
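The merge above assigns every member of a duplicate group the same canonical name: the lexicographic maximum of the group's NAME1 values. A toy example of the same pattern:
# Toy illustration of the groupby-max merge used above.
toy = pd.DataFrame({"GROUP_ID": ["g1", "g1", "g2", "g2"],
                    "NAME1": ["omar", "omar2", "isaac", "isaac1"]})
canonical = (toy.groupby("GROUP_ID")["NAME1"].max()
                .reset_index()
                .rename(columns={"NAME1": "REAL_NAME"}))
print(toy.merge(canonical, on="GROUP_ID", how="left"))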
6. Inspect Results
pd.set_option('display.max_rows', 200)
res.head(200)
|   | GROUP_ID | MANDT | LIFNR | NAME1 | STRAS | PATTERN | REAL_NAME |
|---|---|---|---|---|---|---|---|
| 0 | IDs:(800-0000000004,800-0000000030) | 800 | 0000000004 | omar |  | NAME1 | omar2 |
| 1 | IDs:(800-0000000004,800-0000000030) | 800 | 0000000030 | omar2 |  | NAME1 | omar2 |
| 2 | IDs:(800-0000000007,800-0000000032) | 800 | 0000000032 | isaac |  | NAME1 | isaac1 |
| 3 | IDs:(800-0000000007,800-0000000032) | 800 | 0000000007 | isaac1 |  | NAME1 | isaac1 |
| 4 | IDs:(800-0000000011,800-0000000022) | 800 | 0000000022 | sara | cardenal | NAME1 | sara |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-00 | Aramingo-00 | 3456 West Chester Pike | NAME1 | Aramingo-30 |
| 196 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-24 | Aramingo-24 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
| 197 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-01 | Aramingo-01 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
| 198 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-02 | Aramingo-02 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
| 199 | IDs:(800-ARAMI-00,800-ARAMI-01,800-ARAMI-02,80... | 800 | ARAMI-07 | Aramingo-07 | 1234 West Chester Pike | NAME1 | Aramingo-30 |
200 rows × 7 columns
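Beyond eyeballing the first rows, it often helps to review the largest duplicate groups first, for example:
# Show the largest duplicate groups first.
group_sizes = res.groupby("GROUP_ID").size().sort_values(ascending=False)
print(group_sizes.head(10))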
7. Push Results to Data Pool
datamodel.pool.push_table(res, 'DUPLICATE_VENDOR', reload_datamodel=False, if_exists='replace')
2021-01-21 09:16:34 - pycelonis: Execution of Data Job running
... (message repeated while the data job runs) ...
2021-01-21 09:16:45 - pycelonis: Data push job running, status: {'id': '8b850cf2-53c2-4f82-a737-6905be3218eb', 'targetName': 'res_total', 'lastModified': 1611220602084, 'lastPing': None, 'status': 'RUNNING', 'type': 'REPLACE', 'fileType': None, 'targetSchema': 'a9f8217f-90e2-4114-a100-675909dc6879', 'upsertStrategy': 'UPSERT_WITH_UNCHANGED_METADATA', 'fallbackVarcharLength': None, 'dataPoolId': 'a9f8217f-90e2-4114-a100-675909dc6879', 'connectionId': None, 'keys': [], 'tableSchema': None, 'csvParsingOptions': None, 'logs': ['2021-01-21T09:16:42 - Starting execution, status set to RUNNING'], 'mirrorTargetNames': [], 'optionalTenantId': None}
... (status message repeated while the push job runs) ...
<DatamodelTable, id 60beaefa-1660-413c-b833-383553e3986d, name res_total>
8. Add Table to Data Model
datamodel.add_table_from_pool(table_name="DUPLICATE_VENDOR",
                              alias="DUPLICATE_VENDOR",
                              added_table_join_type="source",
                              new_foreign_key_to_table="...",  # edit
                              foreign_key_columns=["..."],  # edit
                              reload="PARTIAL_ON_TABLE")