Deprecations
Deprecation Warnings
To keep PyCelonis up to date and guarantee support, some outdated modules are marked as
deprecated and will be removed from PyCelonis in version 2.0:
- Data Deduplication: please contact the Service Desk to migrate to the official Duplicate Invoice Checker App.
- Root Cause Analysis: will be moved into a new package called PyCelonis Apps (further details will follow).
DuplicateChecker
¶
Class to detect duplicate rows in a DataFrame.
apply(self, df, search_patterns, unique_id_columns, df_reference=None, max_chunk_size=5000, return_preprocessed_df=False, allow_group_intersection=False, fast_mode=False, disable_tqdm=False)
¶
Computes the duplicates in the table df, based on the search patterns specified in the init of the DuplicateChecker object. If df_reference is given, the duplicates between df and df_reference are computed in addition.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | pd.DataFrame | DataFrame containing the unique_id_columns and the columns that are to be compared against each other to find duplicates. | required
search_patterns | Dict | Optional if search patterns were already set before. Dict containing key-value pairs of the form pattern_name: pattern, where pattern_name is a string and pattern is a dict whose keys are columns of df and whose values are the matching logic to apply, e.g. search_patterns={"Some Pattern Name": {"VENDOR_NAME": "exact", "INVOICE_DATE": "exact", "REFERENCE": "different", "VALUE": "exact", "_VENDOR_ID": "exact"}, ...}. The last used search patterns are always stored in the DuplicateChecker object under .search_patterns. | required
unique_id_columns | str or List[str] | The column or list of columns to be used as a unique identifier. | required
df_reference | pd.DataFrame, optional | DataFrame of the same structure containing the already processed items. All items of df are checked against each other and against all of df_reference; the items of df_reference are NOT checked against each other. By default None. | None
max_chunk_size | int, optional | Size of the chunks compared at a time; decrease if memory problems occur, increase for speed. | 5000
Returns:

Type | Description
---|---
pd.DataFrame | DataFrame containing the duplicate rows of df plus two additional columns: GROUP_ID, which uniquely identifies and groups together the rows that are duplicates of each other, and PATTERN, the name of the search pattern that matched the items of a group.
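The matching logic above ("exact" columns must agree, "different" columns must all differ within a group) can be sketched in plain pandas. This is an illustrative re-implementation of a single search pattern, not the PyCelonis internals; the function name find_duplicates and the sample invoice data are hypothetical:

```python
import pandas as pd

def find_duplicates(df, exact_cols, different_cols, pattern_name):
    """Find groups of rows where all `exact_cols` agree and every
    `different_cols` value is distinct within the group."""
    groups = []
    group_id = 0
    for _, group in df.groupby(exact_cols):
        if len(group) < 2:
            continue  # a single row cannot form a duplicate group
        # "different" logic: all values in these columns must be distinct
        if all(group[c].nunique() == len(group) for c in different_cols):
            matched = group.copy()
            matched["GROUP_ID"] = group_id
            matched["PATTERN"] = pattern_name
            groups.append(matched)
            group_id += 1
    if not groups:
        return df.iloc[0:0].assign(GROUP_ID=pd.NA, PATTERN=pd.NA)
    return pd.concat(groups, ignore_index=True)

df = pd.DataFrame({
    "INVOICE_ID": [1, 2, 3, 4],
    "VENDOR_NAME": ["ACME", "ACME", "ACME", "Globex"],
    "VALUE": [100.0, 100.0, 250.0, 100.0],
    "REFERENCE": ["A-1", "A-2", "A-3", "B-1"],
})

# Invoices 1 and 2: same vendor and value, but different reference numbers.
dupes = find_duplicates(
    df,
    exact_cols=["VENDOR_NAME", "VALUE"],
    different_cols=["REFERENCE"],
    pattern_name="Same vendor and value, different reference",
)
```

The real apply() additionally chunks the comparison (max_chunk_size) and supports checking df against a reference table, but the shape of the result (original rows plus GROUP_ID and PATTERN) is the same.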
RCA
¶
This automated root cause analysis function searches all dimensions (first-level root causes) of the selected tables to find single dimensions that perform badly with respect to the different KPIs. For example, it would find whether there is a certain vendor, plant, country, city, customer, etc. where the KPI is particularly poor.
Parameters:

Name | Type | Description | Default
---|---|---|---
kpis | dict | Dictionary of KPIs. The key should be the name of the KPI and the value the PQL query. Important: the PQL query needs to be of the following format: CASE WHEN "WANTED_BEHAVIOUR" THEN 0 WHEN "UNWANTED_BEHAVIOUR" THEN 1 ELSE NULL END | {}
celonis_filter | str or list | Filter string or list of filter strings, used if you want to limit the search to e.g. a specific plant. | None
datamodel | Datamodel | Datamodel to query, obtained e.g. via dm = celonis.datamodels.find('ID_OF_MY_DM'). | None
selected_tables | list | List of names (strings) of the datamodel tables you want to include in the search. | []
chunk_size | int | Size of the chunks to be extracted. Bigger chunks might be faster but can crash the datamodel; it is better to leave this at 20. | 20
Returns:

Type | Description
---|---
pd.DataFrame | DataFrame containing the results of the search for all KPIs.
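The first-level scan described above (check every dimension on its own for values where a 0/1 KPI is unusually bad) can be sketched in plain pandas on a local table. This is a simplified illustration, not the PyCelonis/PQL implementation; the function first_level_root_causes, its column names, and the sample case data are hypothetical. The LATE column plays the role of a KPI produced by the CASE WHEN ... THEN 0 ... THEN 1 query (1 = unwanted behaviour):

```python
import pandas as pd

def first_level_root_causes(df, kpi_col, dimension_cols, min_cases=2):
    """For each dimension column, compute the KPI violation rate
    (mean of a 0/1 KPI) per dimension value and rank the worst ones."""
    overall = df[kpi_col].mean()
    results = []
    for dim in dimension_cols:
        rates = (df.groupby(dim)[kpi_col]
                   .agg(["mean", "count"])
                   .reset_index()
                   .rename(columns={dim: "VALUE"}))
        rates["DIMENSION"] = dim
        # ignore dimension values backed by too few cases
        results.append(rates[rates["count"] >= min_cases])
    out = pd.concat(results, ignore_index=True)
    out["LIFT"] = out["mean"] / overall  # > 1 means worse than average
    return out.sort_values("mean", ascending=False, ignore_index=True)

cases = pd.DataFrame({
    "VENDOR":  ["A", "A", "B", "B", "B", "C"],
    "COUNTRY": ["DE", "DE", "US", "US", "DE", "US"],
    "LATE":    [1, 1, 0, 0, 1, 0],  # 1 = unwanted behaviour, 0 = wanted
})
rc = first_level_root_causes(cases, "LATE", ["VENDOR", "COUNTRY"])
# Vendor "A" and country "DE" surface as first-level root causes.
```

The real function pulls each dimension from the datamodel in chunks (chunk_size) instead of holding everything in one DataFrame, which is why large chunk sizes can overload the datamodel.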