Deprecations

Deprecation Warnings

To keep PyCelonis up to date and guarantee support, some outdated modules are marked as deprecated and will be removed in PyCelonis version 2.0:
- Data Deduplication: please contact Service Desk to migrate to the official Duplicate Invoice Checker App.

DuplicateChecker

Class to check data for duplicates.

apply(self, df, search_patterns, unique_id_columns, df_reference=None, max_chunk_size=5000, return_preprocessed_df=False, allow_group_intersection=False, fast_mode=False, disable_tqdm=False)

Computes the duplicates in the table df, based on the search patterns specified when the DuplicateChecker object was initialized. If df_reference is given, the duplicates between df and df_reference are computed as well.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pd.DataFrame | DataFrame containing the unique_id_columns and the columns that are to be compared against each other to find duplicates. | required |
| search_patterns | Dict | Optional if search_patterns was already set before. Dict containing key-value pairs of the form pattern_name: pattern, where pattern_name is a string and pattern is a dict whose keys are columns of df and whose values are the matching logic to apply, e.g. search_patterns={"Some Pattern Name": {"VENDOR_NAME": "exact", "INVOICE_DATE": "exact", "REFERENCE": "different", "VALUE": "exact", "_VENDOR_ID": "exact"}, ...}. The last used search patterns are always stored in the DuplicateChecker object under .search_patterns. A runnable version of this example is sketched after the table. | required |
| unique_id_columns | str or List[str] | The column or list of columns to be used as a unique identifier. | required |
| df_reference | pd.DataFrame, optional | DataFrame of the same structure containing the already processed items. All items of df are checked against each other and against all items of df_reference; the items of df_reference are NOT checked against each other. By default None. | None |
| max_chunk_size | int, optional | Size of the chunks compared at a time; decrease if memory problems occur, increase for speed. | 5000 |
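
A minimal sketch of how a search_patterns dictionary can be written out as Python, based on the example above. The pattern name is a free-form string, and the column names are illustrative; they must match columns present in your df. This assumes "exact" requires identical values and "different" requires differing values, as suggested by the example.

```python
# Illustrative search_patterns dict; adapt the pattern name and column names to your data.
search_patterns = {
    "Some Pattern Name": {
        "VENDOR_NAME": "exact",    # values must match exactly
        "INVOICE_DATE": "exact",
        "REFERENCE": "different",  # values must differ
        "VALUE": "exact",
        "_VENDOR_ID": "exact",
    },
    # further patterns can be added as additional pattern_name: pattern pairs
}
```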

Returns:

| Type | Description |
| --- | --- |
| pd.DataFrame | DataFrame containing the duplicate rows of df plus 2 additional columns: GROUP_ID, which uniquely identifies and maps together the rows that are duplicates of each other, and PATTERN, the name of the search pattern that maps the items of a group. |
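
A hedged end-to-end sketch of how apply might be used. The import path, the no-argument constructor, and the sample data are assumptions not taken from this page; adapt them to your installed PyCelonis version and your own table.

```python
import pandas as pd

# Assumed import path for PyCelonis 1.x; adjust to your installation.
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker

# Illustrative invoice table; INVOICE_ID serves as the unique identifier.
df = pd.DataFrame({
    "INVOICE_ID": ["A1", "A2", "A3"],
    "VENDOR_NAME": ["ACME", "ACME", "Globex"],
    "INVOICE_DATE": ["2021-01-05", "2021-01-05", "2021-02-01"],
    "REFERENCE": ["R-001", "R-002", "R-100"],
    "VALUE": [100.0, 100.0, 250.0],
    "_VENDOR_ID": ["V1", "V1", "V2"],
})

search_patterns = {
    "Exact value, different reference": {
        "VENDOR_NAME": "exact",
        "INVOICE_DATE": "exact",
        "REFERENCE": "different",
        "VALUE": "exact",
        "_VENDOR_ID": "exact",
    }
}

# Assumes the checker can be instantiated without arguments and the
# search patterns passed directly to apply().
dc = DuplicateChecker()
result = dc.apply(
    df,
    search_patterns=search_patterns,
    unique_id_columns="INVOICE_ID",
)

# The result contains the duplicate rows of df plus GROUP_ID and PATTERN;
# rows sharing a GROUP_ID are duplicates of each other under the named pattern.
print(result[["INVOICE_ID", "GROUP_ID", "PATTERN"]])
```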