Use Case: Duplicate Invoice Checker¶

This notebook can be used to configure the Celonis Duplicate Invoice Checker and deploy it in the Event Collection Pipeline for continuous duplicate invoice checking.

            **MAKE SURE TO EXECUTE THIS NOTEBOOK CELL BY CELL FROM TOP TO BOTTOM **

In [ ]:

            
                Copied!
                
                    
                    
                
                

        
# Module Imports
import json
import os
import numpy as np
import pandas as pd
from pycelonis import get_celonis
from pycelonis.data_deduplication.duplicate_invoice_checker import DuplicateInvoiceChecker
# Module Imports
import json
import os
import numpy as np
import pandas as pd
from pycelonis import get_celonis
from pycelonis.data_deduplication.duplicate_invoice_checker import DuplicateInvoiceChecker

Step 0: Connect to Celonis¶

With celonis = get_celonis() you can connect to your Celonis Team, using the Application Key of this Workbench. By default this might not have the necessary permissions. You can see how to change the permissions under:

[YOUR_CELONIS_TEAM_URL]/help/display/CIBC/Getting+started+with+the+ML+workbench

In [ ]:

            
                Copied!
                
celonis = get_celonis()
celonis = get_celonis()

Step 1: Connect to Knoledge Model and Data Model.¶

Select the Data Model that you want to do duplicate invoice checking on.

In [ ]:

            
                Copied!
                
# Set to False if you are not using knowledgemodel
use_knowledge_model = True
# Set to False if you are not using knowledgemodel
use_knowledge_model = True

In [ ]:

            
                Copied!
                
# USE KNOWLEDGEMODEL
if use_knowledge_model:
    package = celonis.packages.find("PACKAGE_NAME")
    km = package.knowledge_models.find("KNOWLEDGE_MODEL_NAME")
    datamodel_id = km.content.get("dataModelId")
# USE KNOWLEDGEMODEL
if use_knowledge_model:
    package = celonis.packages.find("PACKAGE_NAME")
    km = package.knowledge_models.find("KNOWLEDGE_MODEL_NAME")
    datamodel_id = km.content.get("dataModelId")

In [ ]:

            
                Copied!
                
# ELSE: MANUAL DATAMODEL CONNECTION
if not use_knowledge_model:
    datamodel_id = "..."
# ELSE: MANUAL DATAMODEL CONNECTION
if not use_knowledge_model:
    datamodel_id = "..."

In [ ]:

            
                Copied!
                
# Create Connection to Data Model
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name                                    # do not edit this line
# Create Connection to Data Model
datamodel = celonis.datamodels.find(datamodel_id) # do not edit this line
datamodel.name                                    # do not edit this line

Step 2: Create Duplicate Invoice Checker¶

In this step you create the 'dc_invoice' object which contains all the necessary functions for duplicate invoice checking.

In [ ]:

            
                Copied!
                
dc_invoice = DuplicateInvoiceChecker(datamodel=datamodel,
                                      celonis_table_name="DUPLICATE_INVOICES")
dc_invoice = DuplicateInvoiceChecker(datamodel=datamodel,
                                      celonis_table_name="DUPLICATE_INVOICES")

Step 2: Define an invoice and the relevant information.¶

1/4: Please specify the PQL that uniquely defines an invoice.

In SAP Accounts Payable the standard on LINE level is:

invoice_id_record = {
     'id': 'INVOICE_ID', # do not edit this line
     'displayName': 'Invoice ID',
     'pql': '"BSEG"."MANDT" || "BSEG"."BUKRS" || "BSEG"."GJAHR" || "BSEG"."BELNR" || "BSEG"."BUZEI"',
     }

In SAP Accounts Payable the standard on HEADER level is:

invoice_id_record = {
     'id': 'INVOICE_ID', # do not edit this line
     'displayName': 'Invoice ID',
     'pql': '"BKPF"."MANDT" || "BKPF"."BUKRS" || "BKPF"."GJAHR" || "BKPF"."BELNR"',
     }

In [ ]:

            
                Copied!
                
invoice_id_record = {
     'id': '',
     'displayName': '',
     'pql': '',
     }
invoice_id_record = {
     'id': '',
     'displayName': '',
     'pql': '',
     }

2/4: Specify the datamodel columns that you want to use These are the columns that can be used for comparing and finding duplicates, but also simply as addition information. Make sure to include any columns that you want to use to join the results to a table (e.g. the case table) of the datamodel later.

In SAP Accounts Payable the standard on LINE level is:

duplicate_relevant_dimensions = [
        # columns for duplicate checking
        {"id": "VENDOR_NAME", "displayName": "Vendor Name", "pql": '"LFA1"."NAME1"'},
        {"id": "_VENDOR_ID", "displayName": "Vendor ID", "pql": '"LFA1"."LIFNR"'},
        {"id": "VALUE", "displayName": "Value (WRBTR)", "pql": '"BSEG"."WRBTR"'},
        {"id": "REFERENCE", "displayName": "Reference", "pql":  '"BKPF"."XBLNR"'},
        {"id": "INVOICE_DATE", "displayName": "BLDAT", "pql": '"BKPF"."TS_BLDAT"'},
        # columns for joining the results to the case table.
        {'id': 'MANDT', 'displayName': 'MANDT', 'pql': '"BSEG"."MANDT"'},
        {'id': 'BUKRS', 'displayName': 'BUKRS', 'pql': '"BSEG"."BUKRS"'},
        {'id': 'GJAHR', 'displayName': 'GJAHR', 'pql': '"BSEG"."GJAHR"'},
        {'id': 'BELNR', 'displayName': 'BELNR', 'pql': ' "BSEG"."BELNR"'},
        {'id': 'BUZEI', 'displayName': 'BUZEI', 'pql': '"BSEG"."BUZEI"'},
    ]

In SAP Accounts Payable the standard on HEADER level is:

duplicate_relevant_dimensions = [
        # columns for duplicate checking
        {"id": "VENDOR_NAME", "displayName": "Vendor Name", "pql": '"LFA1"."NAME1"'},
        {"id": "_VENDOR_ID", "displayName": "Vendor ID", "pql": '"LFA1"."LIFNR"'},
        {"id": "VALUE", "displayName": "Value (WRBTR)", "pql": 'SUM("BSEG"."WRBTR")'},
        {"id": "REFERENCE", "displayName": "Reference", "pql":  '"BKPF"."XBLNR"'},
        {"id": "INVOICE_DATE", "displayName": "BLDAT", "pql": '"BKPF"."TS_BLDAT"'},
        # columns for joining the results to the case table.
        {'id': 'MANDT', 'displayName': 'MANDT', 'pql': '"BKPF"."MANDT"'},
        {'id': 'BUKRS', 'displayName': 'BUKRS', 'pql': '"BKPF"."BUKRS"'},
        {'id': 'GJAHR', 'displayName': 'GJAHR', 'pql': '"BKPF"."GJAHR"'},
        {'id': 'BELNR', 'displayName': 'BELNR', 'pql': ' "BKPF"."BELNR"'},
    ]

In [ ]:

            
                Copied!
                
duplicate_relevant_dimensions = [
        
    ]
duplicate_relevant_dimensions = [
        
    ]

3/4 (optional): Specify any filter that you want to apply before checking for duplicates These are pql filters that will be applied when extracting the data from the datamodel. They can be for example filters to only use external invoices, use only invoice with a certain value or exclude recurrent invoices etc.

In SAP Accounts Payable on both HEADER AND LINE level the standard is:

duplicate_relevant_filters=[
    """FILTER MATCH_ACTIVITIES(EXCLUDING['Create Credit Memo', 'Reverse Invoice'] ) = 1;""",
    """FILTER PU_COUNT(DOMAIN_TABLE("BKPF"."XBLNR"),
        "BSEG"."BELNR"||"BSEG"."BUKRS"||"BSEG"."GJAHR"||"BSEG"."MANDT") < 10;""",
    """FILTER "BKPF"."XBLNR" NOT LIKE 'Leasing';""",
    """FILTER "BKPF"."XBLNR" NOT LIKE 'Miete';""",
    """FILTER "BKPF"."XBLNR" NOT LIKE 'Rent';""",
    """FILTER LEN("BKPF"."XBLNR") > 4;""",
    """FILTER ISNULL("LFA1"."VBUND") = 1;""",
    ]

In [ ]:

            
                Copied!
                
duplicate_relevant_filters=[
    
    ]
duplicate_relevant_filters=[
    
    ]

4/4 (optional but very important): Specify how to identify reversals One typical challenge in duplicate invoice checking is only to filter out duplicates that were not reversed yet, to filter out all irrelevant duplicates in the data. To do this, you need to specify how to identify invoice reversals in the Celonis Data Model. This is done by a CASE WHEN statement of the following format:

In SAP Accounts Payable the standard on LINE level is:

invoice_reversed_flag_pql= """ 

    CASE WHEN BSEG.SHKZG  = 'S' THEN 1 ELSE 0 END

"""

In SAP Accounts Payable the standard on HEADER level is:


invoice_reversed_flag_pql= """ 

    CASE WHEN PU_FIRST(BKPF,BSEG.SHKZG) = 'S' THEN 1 ELSE 0 END


"""

In [ ]:

            
                Copied!
                
# insert your PQL statement here 
# If you dont have a reversal_flag execute cell with:invoice_reversed_flag_pql = ""
invoice_reversed_flag_pql = ""
# insert your PQL statement here 
# If you dont have a reversal_flag execute cell with:invoice_reversed_flag_pql = ""
invoice_reversed_flag_pql = ""

In [ ]:

            
                Copied!
                
                    
                    
                
                

        
# Summing all data extraction configuration up. 
# do not edit this cell
dc_invoice.data_extraction_config = {"datamodel_id": datamodel_id,
                          "invoice_id_record": invoice_id_record, 
                          "duplicate_relevant_dimensions": duplicate_relevant_dimensions, 
                          "duplicate_relevant_filters": duplicate_relevant_filters, 
                          "invoice_reversed_flag_pql": invoice_reversed_flag_pql,
                            }
# Summing all data extraction configuration up. 
# do not edit this cell
dc_invoice.data_extraction_config = {"datamodel_id": datamodel_id,
                          "invoice_id_record": invoice_id_record, 
                          "duplicate_relevant_dimensions": duplicate_relevant_dimensions, 
                          "duplicate_relevant_filters": duplicate_relevant_filters, 
                          "invoice_reversed_flag_pql": invoice_reversed_flag_pql,
                            }
 

Step 2.1: Adding invoices from another Data Model

(Execute one of the two options below) If your setup is such that you have two Data Models: One with high reload frequency and containing only data for execution apps and in addition one Data Model containing the process history for analytics, then fill and execute the next cell with the data extraction parameters for this datamodel.

----------------------------------------- Option 1: Add second Data Model! -----------------------------------------

In [ ]:

            
                Copied!
                
                    
                    
                
                

        
# ONLY EXECUTE IF 2 Datamodels used 
# EDIT START CURRENT SETTINGS ON ITEM LEVEL: 
analytical_datamodel_id = ""
analytical_invoice_id_record = {
     'id': 'INVOICE_ID',
     'displayName': 'Invoice ID',
     'pql': '...',
     }
analytical_duplicate_relevant_dimensions = [
        {"id": "VENDOR_NAME", "displayName": "Vendor Name", "pql": '"LFA1"."NAME1"'},
        {"id": "_VENDOR_ID", "displayName": "Vendor ID", "pql": '"LFA1"."LIFNR"'},
        {"id": "VALUE", "displayName": "Value (WRBTR)", "pql": '"BSEG"."WRBTR"'},
        {"id": "REFERENCE", "displayName": "Reference", "pql":  '"BKPF"."XBLNR"'},
        {"id": "INVOICE_DATE", "displayName": "BLDAT", "pql": '"BKPF"."TS_BLDAT"'},
    ]
analytical_duplicate_relevant_filters=[
    ]
analytical_invoice_reversed_flag_pql = ""
##### EDIT END.

dc_invoice.analytical_data_extraction_config = {
     "datamodel_id": analytical_datamodel_id,
     "invoice_id_record": analytical_invoice_id_record, 
     "duplicate_relevant_dimensions": analytical_duplicate_relevant_dimensions, 
     "duplicate_relevant_filters": analytical_duplicate_relevant_filters, 
     "invoice_reversed_flag_pql": analytical_invoice_reversed_flag_pql,
      }
# ONLY EXECUTE IF 2 Datamodels used 
# EDIT START CURRENT SETTINGS ON ITEM LEVEL: 
analytical_datamodel_id = ""
analytical_invoice_id_record = {
     'id': 'INVOICE_ID',
     'displayName': 'Invoice ID',
     'pql': '...',
     }
analytical_duplicate_relevant_dimensions = [
        {"id": "VENDOR_NAME", "displayName": "Vendor Name", "pql": '"LFA1"."NAME1"'},
        {"id": "_VENDOR_ID", "displayName": "Vendor ID", "pql": '"LFA1"."LIFNR"'},
        {"id": "VALUE", "displayName": "Value (WRBTR)", "pql": '"BSEG"."WRBTR"'},
        {"id": "REFERENCE", "displayName": "Reference", "pql":  '"BKPF"."XBLNR"'},
        {"id": "INVOICE_DATE", "displayName": "BLDAT", "pql": '"BKPF"."TS_BLDAT"'},
    ]
analytical_duplicate_relevant_filters=[
    ]
analytical_invoice_reversed_flag_pql = ""
##### EDIT END.

dc_invoice.analytical_data_extraction_config = {
     "datamodel_id": analytical_datamodel_id,
     "invoice_id_record": analytical_invoice_id_record, 
     "duplicate_relevant_dimensions": analytical_duplicate_relevant_dimensions, 
     "duplicate_relevant_filters": analytical_duplicate_relevant_filters, 
     "invoice_reversed_flag_pql": analytical_invoice_reversed_flag_pql,
      }

----------------------------------------- Option 2: NO second Data Model! -----------------------------------------

In [ ]:

            
                Copied!
                
dc_invoice.analytical_data_extraction_config = None
dc_invoice.analytical_data_extraction_config = None

Step 3: Load data from the datamodel¶

By executing the next cell you extract the data from the Data Model.

In [ ]:

            
                Copied!
                
df, df_reference = dc_invoice.load_data()
df.head()
df, df_reference = dc_invoice.load_data()
df.head()

Step 4: Create Search Patterns¶

If you have successfully loaded the data, you can now create search patterns which to apply in order to find duplicate invoices. If you are not sure how to do this, take a look at the "00_Tutorial_Duplicate_Checking.ipynb" Notebook provided in the same folder.

In SAP Accounts Payable the standard is:

search_patterns = {   
    'Vendor_fuzzy_match': {
        'VENDOR_NAME': 'CompanyNameFuzzy',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'exact'},
    'INVOICE_DATE_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'DateFuzzy',
        'REFERENCE': 'exact',
        'VALUE': 'exact'},
    'INVOICE_Value_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'InvoiceValueFuzzy'},
    'REFERENCE_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'InvoiceReferenceFuzzy',
        'VALUE': 'exact'},
    'Exact': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'exact'}
        }

In [ ]:

            
                Copied!
                
                    
                    
                
                

        
# edit here
dc_invoice.search_patterns = {   
    'Vendor_fuzzy_match': {
        'VENDOR_NAME': 'CompanyNameFuzzy',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'exact'},
    'INVOICE_DATE_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'DateFuzzy',
        'REFERENCE': 'exact',
        'VALUE': 'exact'},
    'INVOICE_Value_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'InvoiceValueFuzzy'},
    'REFERENCE_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'InvoiceReferenceFuzzy',
        'VALUE': 'exact'},
    'Exact': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'exact'}
        }
# edit here
dc_invoice.search_patterns = {   
    'Vendor_fuzzy_match': {
        'VENDOR_NAME': 'CompanyNameFuzzy',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'exact'},
    'INVOICE_DATE_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'DateFuzzy',
        'REFERENCE': 'exact',
        'VALUE': 'exact'},
    'INVOICE_Value_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'InvoiceValueFuzzy'},
    'REFERENCE_fuzzy_match': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'InvoiceReferenceFuzzy',
        'VALUE': 'exact'},
    'Exact': {
        'VENDOR_NAME': 'exact',
        'INVOICE_DATE': 'exact',
        'REFERENCE': 'exact',
        'VALUE': 'exact'}
        }

Step 5: Computing the duplicate invoices¶

By executing the next cell you start the actual duplicate checking computations.

In [ ]:

            
                Copied!
                
results = dc_invoice.apply(df=df,
                            search_patterns=dc_invoice.search_patterns,
                            df_reference=df_reference)
results
results = dc_invoice.apply(df=df,
                            search_patterns=dc_invoice.search_patterns,
                            df_reference=df_reference)
results

In [ ]:

            
                Copied!
                
results.PATTERN.value_counts()
results.PATTERN.value_counts()

Step 6: Add a mactches column for executional app representation (optional)¶

For a better representation in the execution app you can add columns that for each invoice contain only the respective field values of the matched invoice, separated by commas.

In [ ]:

            
                Copied!
                
dc_invoice.matching_display_columns = ["INVOICE_ID"]
dc_invoice.matching_display_columns = ["INVOICE_ID"]

In [ ]:

            
                Copied!
                
results = dc_invoice.add_matching_display_columns(results, dc_invoice.matching_display_columns)
results
results = dc_invoice.add_matching_display_columns(results, dc_invoice.matching_display_columns)
results

Final Step 8: Deploy the duplicate invoice checking into the pipeline¶

Step 1 / 3: Add the duplicate checker results to the datamodel.

In [ ]:

            
                Copied!
                
                    
                    
                
                

        
# Add table to Pool
table = datamodel.pool.create_table(results,
                                    dc_invoice.celonis_table_name,
                                    column_config=[{"columnName":"GROUP_ID",
                                                    "fieldLength":int(results.GROUP_ID.str.len().max()*1.5),
                                                    "columnType":"STRING"}])
# Add table to Pool
table = datamodel.pool.create_table(results,
                                    dc_invoice.celonis_table_name,
                                    column_config=[{"columnName":"GROUP_ID",
                                                    "fieldLength":int(results.GROUP_ID.str.len().max()*1.5),
                                                    "columnType":"STRING"}])

Specify the table you want to join the results to

In SAP Accounts Payable the standard on LINE level is:

dc_invoice.join_dm_target_table = "BSEG"

In SAP Accounts Payable the standard on HEADER level is:

dc_invoice.join_dm_target_table = "BKPF"

In [ ]:

            
                Copied!
                
#table to join results to
dc_invoice.join_dm_target_table = ".."
#table to join results to
dc_invoice.join_dm_target_table = ".."

Specify the foreign key colums you want to join the results to the dm table with

In SAP Accounts Payable the standard on LINE level is:

dc_invoice.foreign_keys = [('MANDT','MANDT'),
                           ('BUKRS','BUKRS'),
                           ('GJAHR','GJAHR'),
                           ('BELNR','BELNR'),
                           ('BUZEI','BUZEI')]

In SAP Accounts Payable the standard on HEADER level is:

dc_invoice.foreign_keys = [('MANDT','MANDT'),
                           ('BUKRS','BUKRS'),
                           ('GJAHR','GJAHR'),
                           ('BELNR','BELNR')]

In [ ]:

            
                Copied!
                
# columns to join through
dc_invoice.foreign_keys = []
# columns to join through
dc_invoice.foreign_keys = []

Add table to datamodel

In [ ]:

            
                Copied!
                
# Add table to datamodel
dm_table = datamodel.add_table_from_pool(table_name=dc_invoice.celonis_table_name,
                                         alias=dc_invoice.celonis_table_name)
# Add table to datamodel
dm_table = datamodel.add_table_from_pool(table_name=dc_invoice.celonis_table_name,
                                         alias=dc_invoice.celonis_table_name)

In [ ]:

            
                Copied!
                
# Create Foreign Key
keys = datamodel.create_foreign_key(target_table=dc_invoice.join_dm_target_table,
                                    source_table=dc_invoice.celonis_table_name,
                                    columns=dc_invoice.foreign_keys)
# Create Foreign Key
keys = datamodel.create_foreign_key(target_table=dc_invoice.join_dm_target_table,
                                    source_table=dc_invoice.celonis_table_name,
                                    columns=dc_invoice.foreign_keys)

In [ ]:

            
                Copied!
                
# Partial Reload of datamodel
datamodel.reload(tables=dc_invoice.celonis_table_name)
# Partial Reload of datamodel
datamodel.reload(tables=dc_invoice.celonis_table_name)

Step 2 / 3: Create a customObject with the duplicate checker configurations

In [ ]:

            
                Copied!
                
config_object = dc_invoice.save_settings_to_config()
config_object = dc_invoice.save_settings_to_config()

In [ ]:

            
                Copied!
                
config_object
config_object

Step 3 / 3 IF USING KNOWLEDGE MODEL: Add the config as custom object to the knowledge model and publish it

The duplicate checker will be executed everytime the datamodel reload or the package is published

In [ ]:

            
                Copied!
                
if use_knowledge_model:
    km.add_custom_object(**config_object)
if use_knowledge_model:
    km.add_custom_object(**config_object)

Publish package containing Knowledge Model

In [ ]:

            
                Copied!
                
if use_knowledge_model:
    package.publish()
if use_knowledge_model:
    package.publish()

Step 3 / 3 IF MANUAL SCHEDULING:

Go to MY_TEAM_URL/machine-learning/ui/schedule and create a "NEW SCHEDULE" Pointing at the notebook Schedule_Duplicate_Invoice_Checking.ipynb in the SAME Folder as this script.

Monitoring

For monitoring if everything is working you can go to the /logs folder in the home directory of you workbench. There every execution of the duplicate invoice checker is logged.

IF MANUAL SCHEDULING: Also you can track the triggered executions here : MY_TEAM_URL/machine-learning/ui/schedule

IF USE KNOWLEDGE MODEL: Also you can track the triggered executions here : MY_TEAM_URL/machine-learning/ui/triggered

In [ ]: