Use Case: Material Deduplication¶
This notebook shows how to use the Celonis Duplicate Checker to clean up your material master table.
In [1]:
import pandas as pd
import numpy as np
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker
from pycelonis import get_celonis
from pycelonis import pql
0. Connect to Celonis
In [6]:
celonis = get_celonis()
2021-01-21 12:09:18 - pycelonis: Login successful! Hello s.brand@celonis.com
2021-01-21 12:09:18 - pycelonis: Your PyCelonis Version 1.4.1 is outdated (Newest Version: 1.5). Please upgrade the package via: pip install --extra-index-url=https://pypi.celonis.cloud/ pycelonis --upgrade
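If you run this notebook outside the Celonis Machine Learning Workbench, get_celonis() needs to be told where and how to authenticate. The sketch below shows an explicit login; the keyword names used here (url, api_token) are assumptions and may differ between PyCelonis versions, so check help(get_celonis) for your installed version.

# Hypothetical explicit login; parameter names depend on your PyCelonis version
celonis = get_celonis(
    url="https://<your-team>.celonis.cloud",  # assumption: your team URL
    api_token="<your API token>",             # assumption: a key with data model access
)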
1. Connect to Data Model
In [7]:
datamodel_id = "f62b5b7b-8fc0-4e41-a106-2664af6ab61a" # edit
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name # do not edit this line
datamodel_id = "f62b5b7b-8fc0-4e41-a106-2664af6ab61a" # edit
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name # do not edit this line
Out[7]:
'Source to Pay'
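If you prefer not to hard-code the ID, find typically also resolves a data model by its name (the 'Source to Pay' string above). Treat this as an optional convenience and keep the ID if several models share a name.

# Alternative lookup by name instead of ID (assumes the name is unique in your team)
datamodel = celonis.datamodels.find("Source to Pay")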
2. Define Material Number and Material Text fields
To deduplicate your materials, we recommend using the material text field. In addition, you need to provide the fields that uniquely identify each entry of the material table.
In [8]:
query = pql.PQL()
query += pql.PQLColumn("MARA_MASTER.MATNR", "MATNR")  # material number
query += pql.PQLColumn("MARA_MASTER.MANDT", "MANDT")  # client
query += pql.PQLColumn("MARA_MASTER.MAKTX", "MAKTX")  # material description text
df = datamodel.get_data_frame(query)
df.head()
Out[8]:
| | MATNR | MANDT | MAKTX |
|---|---|---|---|
| 0 | 000000000080550582 | 100 | Drahtseilklemme 5 mm DIN 1142 verz. |
| 1 | 000007200510003000 | 111 | Ethernet Switch 5-Port SPIDER 5TX |
| 2 | 000009170000397500 | 111 | Frischware Dowclene 1621 im Safetainer |
| 3 | 000000000007370542 | 100 | Nitril-Handschuh Gr.8,5-9* L |
| 4 | 000007200538950000 | 111 | PCU50 1,2GHZ, WIN XP |
In [9]:
unique_id_columns=["MANDT","MATNR"]
material_text_column = "MAKTX"
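Optionally, you can lightly normalize the material text before matching. This is plain pandas preprocessing, not part of the Duplicate Checker itself, and whether it helps depends on how noisy your material texts are.

# Optional preprocessing: trim leading/trailing blanks and collapse repeated whitespace
df[material_text_column] = (
    df[material_text_column]
    .astype(str)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)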
3. Choose Search Pattern
In [11]:
search_pattern_exact = {
    "exact": {
        "MAKTX": "exact"
    }
}  # Fast

search_pattern_fuzzy = {
    "fuzzy": {
        "MAKTX": "MaterialFuzzy"
    }
}  # Slow
In [12]:
search_pattern = search_pattern_fuzzy
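To build intuition for the difference between the two patterns, here is a small standalone sketch that compares exact string equality with a fuzzy similarity ratio. It uses only Python's difflib and is not part of the Duplicate Checker API.

from difflib import SequenceMatcher

a = "Ethernet Switch 5-Port SPIDER 5TX"
b = "Ethernet Switch 5 Port Spider 5TX"

exact_match = a == b  # False: the strings differ in punctuation and casing
fuzzy_score = SequenceMatcher(None, a.lower(), b.lower()).ratio()  # close to 1.0
print(exact_match, round(fuzzy_score, 2))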
4. Set Max Chunk Size
Adjust the chunk size to the size of your workbench. If you have little RAM, use a small chunk size to prevent memory errors.
In [13]:
max_chunk_size = 1000  # adjust chunk size to the size of your workbench
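As a rough, unofficial rule of thumb, you can check how much memory the data frame occupies and how many chunks a given max_chunk_size implies before running the checker.

import math

mem_mb = df.memory_usage(deep=True).sum() / 1e6  # in-memory size of the frame in MB
n_chunks = math.ceil(len(df) / max_chunk_size)   # number of chunks of at most max_chunk_size rows
print(f"{mem_mb:.1f} MB in {n_chunks} chunks of up to {max_chunk_size} rows")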
5. Run Duplicate Checker
In [ ]:
dc = DuplicateChecker()
res = dc.apply(df, unique_id_columns=["MANDT", "MATNR"], search_patterns=search_pattern, fast_mode=True)
# CREATE UNIQUE MATERIAL TEXT: pick one representative text per duplicate group (GROUP_ID)
real_material = (res.groupby("GROUP_ID")[material_text_column].max().reset_index()
                 .rename(columns={material_text_column: "REAL_MATERIAL"}))
res = res.merge(real_material, on="GROUP_ID", how="left")
In [ ]:
# Groups containing more than one distinct material text were found by fuzzy matching
pattern = res.groupby("GROUP_ID")[material_text_column].nunique()
res["PATTERN"] = np.where(res.GROUP_ID.isin(pattern[pattern > 1].index), "FUZZY_MATCH", "EXACT_MATCH")
6. Inspect Results
In [16]:
pd.set_option('display.max_rows', 200)
In [ ]:
res.head(200)
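Beyond scanning the first rows, a quick aggregation with plain pandas summarizes how many duplicate groups were found and how they split across exact and fuzzy matches.

# Number and size of the flagged duplicate groups
print(res["GROUP_ID"].nunique(), "duplicate groups")
print(res.groupby("GROUP_ID").size().describe())

# Split between exact and fuzzy matches
print(res["PATTERN"].value_counts())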
7. Push Results to Data Pool
In [114]:
datamodel.pool.push_table(res,'DUPLICATE_MATERIAL',reload_datamodel=False,if_exists = 'replace')
2021-01-21 09:16:34 - pycelonis: Execution of Data Job running
2021-01-21 09:16:35 - pycelonis: Execution of Data Job running
...
2021-01-21 09:16:45 - pycelonis: Data push job running, status: {'id': '8b850cf2-53c2-4f82-a737-6905be3218eb', 'targetName': 'res_total', 'status': 'RUNNING', 'type': 'REPLACE', 'dataPoolId': 'a9f8217f-90e2-4114-a100-675909dc6879', 'upsertStrategy': 'UPSERT_WITH_UNCHANGED_METADATA', 'logs': ['2021-01-21T09:16:42 - Starting execution, status set to RUNNING'], ...}
...
Out[114]:
<DatamodelTable, id 60beaefa-1660-413c-b833-383553e3986d, name res_total>
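For manual review it can be convenient to push only the fuzzy matches into a separate table, since exact matches are usually unambiguous. This is an optional variation of the same push_table call; the table name DUPLICATE_MATERIAL_FUZZY is just an example.

# Optional: push only the fuzzy matches for manual review
fuzzy_only = res[res["PATTERN"] == "FUZZY_MATCH"]
datamodel.pool.push_table(fuzzy_only, 'DUPLICATE_MATERIAL_FUZZY', reload_datamodel=False, if_exists='replace')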
8. Add Table to Data Model
In [ ]:
datamodel.add_table_from_pool(table_name="DUPLICATE_MATERIAL",
                              alias="DUPLICATE_MATERIAL",
                              added_table_join_type="source",
                              new_foreign_key_to_table="...",
                              foreign_key_columns=["..."],
                              reload="PARTIAL_ON_TABLE")