
Deprecations

Deprecation Warnings

To keep PyCelonis up to date and guarantee support, some outdated modules are marked as deprecated and will be removed from PyCelonis in version 2.0:
- Data Deduplication: please contact Service Desk to migrate to the official Duplicate Invoice Checker App.
- Root Cause Analysis: will be moved into a new package called PyCelonis Apps (further details will follow).

DuplicateChecker

Class that checks data for duplicates.

apply(self, df, search_patterns, unique_id_columns, df_reference=None, max_chunk_size=5000, return_preprocessed_df=False, allow_group_intersection=False, fast_mode=False, disable_tqdm=False)

Computes the duplicates in the table df, based on the search patterns specified in the init of the DuplicateChecker object. If df_reference is given, the duplicates between df and df_reference are computed as well.

Parameters:

Name: df
Type: pd.DataFrame
Description: DataFrame containing the unique_id_columns and the columns that are to be compared against each other to find duplicates.
Default: required

Name: search_patterns
Type: Dict
Description: Optional if search_patterns was already set before. Dict containing key-value pairs of the form pattern_name: pattern, where pattern_name is a string and pattern is a dict whose keys are columns of df and whose values are the matching logic to apply. E.g.: search_patterns={ "Some Pattern Name":{ "VENDOR_NAME": "exact", "INVOICE_DATE": "exact", "REFERENCE": "different", "VALUE": "exact", "_VENDOR_ID": "exact"},...} The last used search patterns are always stored in the DuplicateChecker object under .search_patterns.
Default: required

Name: unique_id_columns
Type: str or List[str]
Description: The column or list of columns to be used as a unique identifier.
Default: required

Name: df_reference
Type: pd.DataFrame, optional
Description: DataFrame of the same structure containing the already processed items. All items of df are checked against each other and against all items of df_reference. The items of df_reference are NOT checked against each other.
Default: None

Name: max_chunk_size
Type: int, optional
Description: Size of the chunks compared at a time. Decrease it if memory problems occur, increase it for speed.
Default: 5000
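To make the roles of df_reference and max_chunk_size more concrete, here is a minimal pure-Python sketch. It is not PyCelonis code: it only assumes that rows are processed in consecutive chunks of at most max_chunk_size rows, and it enumerates which chunk pairs would be compared under the rule stated for df_reference (df against itself and against df_reference, never df_reference against itself). The helper names chunk_ranges and comparison_plan are hypothetical.

```python
from itertools import combinations_with_replacement, product


def chunk_ranges(n_rows, max_chunk_size):
    """Split row positions 0..n_rows-1 into consecutive (start, stop) chunks."""
    return [
        (start, min(start + max_chunk_size, n_rows))
        for start in range(0, n_rows, max_chunk_size)
    ]


def comparison_plan(n_df, n_reference=0, max_chunk_size=5000):
    """List the chunk pairs to compare (illustrative assumption, not library logic)."""
    df_chunks = chunk_ranges(n_df, max_chunk_size)
    ref_chunks = chunk_ranges(n_reference, max_chunk_size)

    plan = []
    # df against itself: each unordered chunk pair once, including a chunk with itself.
    for left, right in combinations_with_replacement(df_chunks, 2):
        plan.append(("df", left, "df", right))
    # df against df_reference: every df chunk with every reference chunk.
    for left, right in product(df_chunks, ref_chunks):
        plan.append(("df", left, "df_reference", right))
    # df_reference against itself is deliberately skipped.
    return plan


plan = comparison_plan(n_df=12000, n_reference=6000, max_chunk_size=5000)
```

Under this assumption, smaller chunks reduce the amount of data held per comparison but increase the number of chunk pairs, which is consistent with the memory and speed trade-off described for max_chunk_size.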

Returns:

Type Description
pd.DataFrame

DataFrame containing the duplicate rows of df plus 2 additional columns:
- GROUP_ID: Column that uniquely identifies and maps together those rows that are duplicates of each other.
- PATTERN: Name of the search pattern that maps the items of a group.
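The relationship between a search pattern and the GROUP_ID and PATTERN columns can be illustrated with a small pure-Python sketch on a list of dicts. It is not the PyCelonis implementation. It assumes a simplified reading of the matching logic: columns marked "exact" must be equal across a group, and each column marked "different" must hold at least two distinct values within the group. The function name find_duplicate_groups and the sample data are hypothetical.

```python
from collections import defaultdict


def find_duplicate_groups(rows, search_patterns):
    """Return copies of matching rows with GROUP_ID and PATTERN added.

    Assumed semantics (a simplification): "exact" columns are equal across
    a group, and each "different" column has at least two distinct values.
    """
    results = []
    group_id = 0
    for pattern_name, pattern in search_patterns.items():
        exact_cols = [col for col, rule in pattern.items() if rule == "exact"]
        different_cols = [col for col, rule in pattern.items() if rule == "different"]

        # Bucket rows by the values of all "exact" columns.
        buckets = defaultdict(list)
        for row in rows:
            buckets[tuple(row[col] for col in exact_cols)].append(row)

        for bucket in buckets.values():
            if len(bucket) < 2:
                continue  # a single row cannot be a duplicate
            if any(len({row[col] for row in bucket}) < 2 for col in different_cols):
                continue  # a "different" column is identical across the bucket
            group_id += 1
            for row in bucket:
                results.append({**row, "GROUP_ID": group_id, "PATTERN": pattern_name})
    return results


# Hypothetical sample data: invoices 1 and 2 share vendor and value but differ in reference.
invoices = [
    {"ID": 1, "VENDOR_NAME": "ACME", "VALUE": 100, "REFERENCE": "A-1"},
    {"ID": 2, "VENDOR_NAME": "ACME", "VALUE": 100, "REFERENCE": "A-2"},
    {"ID": 3, "VENDOR_NAME": "Globex", "VALUE": 250, "REFERENCE": "G-7"},
]
patterns = {
    "Same vendor and value": {
        "VENDOR_NAME": "exact",
        "VALUE": "exact",
        "REFERENCE": "different",
    }
}
duplicates = find_duplicate_groups(invoices, patterns)
```

In this sketch, only invoices 1 and 2 are returned, both carrying the same GROUP_ID and the name of the pattern that matched them.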

RCA

This automated root cause analysis function searches all dimensions (first-level root causes) of the selected tables to find single dimensions that perform badly with respect to the different KPIs. For example, it would find a certain vendor, plant, country, city, customer, etc. for which the KPI is particularly low.
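The idea of scanning single dimensions for poor KPI performance can be sketched in plain Python. This is an illustration only, not the RCA implementation: it assumes each case already carries a KPI flag in the 0/1 format required for the kpis parameter (0 for wanted, 1 for unwanted behaviour, None for cases to ignore) and ranks every dimension value by its share of unwanted cases. The function name rank_dimension_values and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_dimension_values(rows, dimensions, kpi_column):
    """Rank (dimension, value) pairs by their share of unwanted cases.

    kpi_column holds 0 (wanted), 1 (unwanted) or None (ignored).
    """
    findings = []
    for dimension in dimensions:
        flags_by_value = defaultdict(list)
        for row in rows:
            flag = row[kpi_column]
            if flag is not None:
                flags_by_value[row[dimension]].append(flag)
        for value, flags in flags_by_value.items():
            findings.append(
                {
                    "dimension": dimension,
                    "value": value,
                    "cases": len(flags),
                    "unwanted_rate": sum(flags) / len(flags),
                }
            )
    # Worst performers first.
    return sorted(findings, key=lambda f: f["unwanted_rate"], reverse=True)


# Hypothetical sample data: LATE is the KPI flag.
cases = [
    {"VENDOR": "A", "COUNTRY": "DE", "LATE": 1},
    {"VENDOR": "A", "COUNTRY": "FR", "LATE": 1},
    {"VENDOR": "B", "COUNTRY": "DE", "LATE": 0},
    {"VENDOR": "B", "COUNTRY": "FR", "LATE": None},
    {"VENDOR": "B", "COUNTRY": "DE", "LATE": 0},
]
ranking = rank_dimension_values(cases, ["VENDOR", "COUNTRY"], "LATE")
```

A real analysis would also need to weigh the number of cases behind each rate, since a value with very few cases can rank high by chance.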

Parameters:

Name: kpis
Type: dict
Description: Dictionary of KPIs. Each key should be the name of a KPI and each value the corresponding PQL query. Important: the PQL query needs to be of the following format: CASE WHEN "WANTED_BEHAVIOUR" THEN 0 WHEN "UNWANTED_BEHAVIOUR" THEN 1 ELSE NULL END
Default: {}

Name: celonis_filter
Type: str/list
Description: Filter string or list of filter strings, used if you want to limit the search to, e.g., a specific plant.
Default: None

Name: datamodel
Type: Datamodel
Description: Datamodel to query, obtained e.g. via dm = celonis.datamodels.find('ID_OF_MY_DM')
Default: None

Name: selected_tables
Type: list
Description: List of names (strings) of the datamodel tables you want to include in the search.
Default: []

Name: chunk_size
Type: int
Description: Size of the chunks to be extracted. Bigger chunks might be faster but can CRASH the Datamodel, so it is better to leave this at 20.
Default: 20
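Because every KPI query has to follow the same CASE WHEN format, a small helper can keep the kpis dictionary consistent. The sketch below is plain string handling and does not call PyCelonis. The helper names build_kpi_query and normalize_filters, the table and column names, and the exact filter syntax are assumptions made for illustration.

```python
def build_kpi_query(wanted_condition, unwanted_condition):
    """Build a PQL string in the 0/1/NULL CASE WHEN format described for kpis."""
    return (
        f"CASE WHEN {wanted_condition} THEN 0 "
        f"WHEN {unwanted_condition} THEN 1 "
        "ELSE NULL END"
    )


def normalize_filters(celonis_filter):
    """Accept None, a single filter string, or a list of strings; return a list."""
    if celonis_filter is None:
        return []
    if isinstance(celonis_filter, str):
        return [celonis_filter]
    return list(celonis_filter)


# Hypothetical table and column names.
kpis = {
    "Late delivery": build_kpi_query(
        wanted_condition='"ORDERS"."ON_TIME" = 1',
        unwanted_condition='"ORDERS"."ON_TIME" = 0',
    )
}
filters = normalize_filters('FILTER "ORDERS"."PLANT" = \'1000\';')
```

The resulting kpis dictionary maps each KPI name to a query string of the required form, and filters is always a list regardless of whether one filter string or several were supplied.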

Returns:

Type Description
pd.DataFrame

DataFrame containing the results of the search for all the KPIs.
