Use Case: Material Deduplication¶
This notebook shows how to use the Celonis Duplicate Checker to clean up your material master table.
In [1]:
import pandas as pd
import numpy as np
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker
from pycelonis import get_celonis
from pycelonis import pql
0. Connect to Celonis
In [6]:
celonis = get_celonis()
2021-01-21 12:09:18 - pycelonis: Login successful! Hello s.brand@celonis.com
2021-01-21 12:09:18 - pycelonis: Your PyCelonis Version 1.4.1 is outdated (Newest Version: 1.5). Please upgrade the package via: pip install --extra-index-url=https://pypi.celonis.cloud/ pycelonis --upgrade
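If you run this notebook outside the Celonis Machine Learning Workbench, get_celonis() needs to be told where and how to authenticate. The sketch below shows an explicit login; the keyword names used here (url, api_token) are assumptions and may differ between PyCelonis versions, so check help(get_celonis) for your installed version.

# Hypothetical explicit login; parameter names depend on your PyCelonis version
celonis = get_celonis(
    url="https://<your-team>.celonis.cloud",  # assumption: your team URL
    api_token="<your API token>",             # assumption: a key with data model access
)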
1. Connect to Data Model
In [7]:
datamodel_id = "f62b5b7b-8fc0-4e41-a106-2664af6ab61a" # edit
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name # do not edit this line
datamodel_id = "f62b5b7b-8fc0-4e41-a106-2664af6ab61a" # edit
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name # do not edit this line
Out[7]:
'Source to Pay'
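If you prefer not to hard-code the ID, find typically also resolves a data model by its name (the 'Source to Pay' string above). Treat this as an optional convenience and keep the ID if several models share a name.

# Alternative lookup by name instead of ID (assumes the name is unique in your team)
datamodel = celonis.datamodels.find("Source to Pay")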
2. Define Material Number and Material Text fields
To deduplicate your materials, we recommend using the material text field. In addition, you need to provide the fields that uniquely identify each entry of the material table.
In [8]:
query = pql.PQL()
query += pql.PQLColumn("MARA_MASTER.MATNR", "MATNR")  # material number
query += pql.PQLColumn("MARA_MASTER.MANDT", "MANDT")  # client
query += pql.PQLColumn("MARA_MASTER.MAKTX", "MAKTX")  # material description text
df = datamodel.get_data_frame(query)
df.head()
Out[8]:
| | MATNR | MANDT | MAKTX |
|---|---|---|---|
| 0 | 000000000080550582 | 100 | Drahtseilklemme 5 mm DIN 1142 verz. |
| 1 | 000007200510003000 | 111 | Ethernet Switch 5-Port SPIDER 5TX |
| 2 | 000009170000397500 | 111 | Frischware Dowclene 1621 im Safetainer |
| 3 | 000000000007370542 | 100 | Nitril-Handschuh Gr.8,5-9* L |
| 4 | 000007200538950000 | 111 | PCU50 1,2GHZ, WIN XP |
In [9]:
unique_id_columns=["MANDT","MATNR"]
material_text_column = "MAKTX"
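Optionally, you can lightly normalize the material text before matching. This is plain pandas preprocessing, not part of the Duplicate Checker itself, and whether it helps depends on how noisy your material texts are.

# Optional preprocessing: trim leading/trailing blanks and collapse repeated whitespace
df[material_text_column] = (
    df[material_text_column]
    .astype(str)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)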
3. Choose Search Pattern
In [11]:
search_pattern_exact = {
    "exact": {
        "MAKTX": "exact"
    }
}  # Fast

search_pattern_fuzzy = {
    "fuzzy": {
        "MAKTX": "MaterialFuzzy"
    }
}  # Slow
In [12]:
search_pattern = search_pattern_fuzzy
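To build intuition for the difference between the two patterns, here is a small standalone sketch that compares exact string equality with a fuzzy similarity ratio. It uses only Python's difflib and is not part of the Duplicate Checker API.

from difflib import SequenceMatcher

a = "Ethernet Switch 5-Port SPIDER 5TX"
b = "Ethernet Switch 5 Port Spider 5TX"

exact_match = a == b  # False: the strings differ in punctuation and casing
fuzzy_score = SequenceMatcher(None, a.lower(), b.lower()).ratio()  # close to 1.0
print(exact_match, round(fuzzy_score, 2))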
4. Set Max Chunk Size
Adjust the chunk size to the size of your workbench. If you have little RAM, use a small chunk size to prevent memory errors.
In [13]:
max_chunk_size = 1000  # adjust chunk size to the size of your workbench
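As a rough, unofficial rule of thumb, you can check how much memory the data frame occupies and how many chunks a given max_chunk_size implies before running the checker.

import math

mem_mb = df.memory_usage(deep=True).sum() / 1e6  # in-memory size of the frame in MB
n_chunks = math.ceil(len(df) / max_chunk_size)   # number of chunks of at most max_chunk_size rows
print(f"{mem_mb:.1f} MB in {n_chunks} chunks of up to {max_chunk_size} rows")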
5. Run Duplicate Checker
In [ ]:
dc = DuplicateChecker()
res = dc.apply(df, unique_id_columns=["MANDT", "MATNR"], search_patterns=search_pattern, fast_mode=True)
# CREATE UNIQUE MATERIAL TEXT: pick one representative text per duplicate group (GROUP_ID)
real_material = (res.groupby("GROUP_ID")[material_text_column].max().reset_index()
                 .rename(columns={material_text_column: "REAL_MATERIAL"}))
res = res.merge(real_material, on="GROUP_ID", how="left")
In [ ]:
# Groups containing more than one distinct material text were found by fuzzy matching
pattern = res.groupby("GROUP_ID")[material_text_column].nunique()
res["PATTERN"] = np.where(res.GROUP_ID.isin(pattern[pattern > 1].index), "FUZZY_MATCH", "EXACT_MATCH")
6. Inspect Results
In [16]:
pd.set_option('display.max_rows', 200)
In [ ]:
res.head(200)
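Beyond scanning the first rows, a quick aggregation with plain pandas summarizes how many duplicate groups were found and how they split across exact and fuzzy matches.

# Number and size of the flagged duplicate groups
print(res["GROUP_ID"].nunique(), "duplicate groups")
print(res.groupby("GROUP_ID").size().describe())

# Split between exact and fuzzy matches
print(res["PATTERN"].value_counts())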
7. Push Results to Data Pool
In [114]:
datamodel.pool.push_table(res,'DUPLICATE_MATERIAL',reload_datamodel=False,if_exists = 'replace')
2021-01-21 09:16:34 - pycelonis: Execution of Data Job running
2021-01-21 09:16:35 - pycelonis: Execution of Data Job running
...
2021-01-21 09:16:45 - pycelonis: Data push job running, status: {'id': '8b850cf2-53c2-4f82-a737-6905be3218eb', 'targetName': 'res_total', 'status': 'RUNNING', 'type': 'REPLACE', 'dataPoolId': 'a9f8217f-90e2-4114-a100-675909dc6879', 'upsertStrategy': 'UPSERT_WITH_UNCHANGED_METADATA', 'logs': ['2021-01-21T09:16:42 - Starting execution, status set to RUNNING'], ...}
...
Out[114]:
<DatamodelTable, id 60beaefa-1660-413c-b833-383553e3986d, name res_total>
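For manual review it can be convenient to push only the fuzzy matches into a separate table, since exact matches are usually unambiguous. This is an optional variation of the same push_table call; the table name DUPLICATE_MATERIAL_FUZZY is just an example.

# Optional: push only the fuzzy matches for manual review
fuzzy_only = res[res["PATTERN"] == "FUZZY_MATCH"]
datamodel.pool.push_table(fuzzy_only, 'DUPLICATE_MATERIAL_FUZZY', reload_datamodel=False, if_exists='replace')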
8. Add Table to Data Model
In [ ]:
datamodel.add_table_from_pool(table_name="DUPLICATE_MATERIAL",
                              alias="DUPLICATE_MATERIAL",
                              added_table_join_type="source",
                              new_foreign_key_to_table="...",
                              foreign_key_columns=["..."],
                              reload="PARTIAL_ON_TABLE")