Deprecations
Deprecation Warnings
To keep PyCelonis up to date and guarantee support, some outdated modules are marked as
deprecated and will be removed from PyCelonis in version 2.0:
    - Data Deduplication: please contact Service Desk to migrate to the official Duplicate Invoice Checker App.   
        
DuplicateChecker        
    Class to check data for duplicates.
apply(self, df, search_patterns, unique_id_columns, df_reference=None, max_chunk_size=5000, return_preprocessed_df=False, allow_group_intersection=False, fast_mode=False, disable_tqdm=False)
    Computes the duplicates on the table df, based on the search patterns specified in the init of the DuplicateChecker object. If df_reference is given, the duplicates between df and df_reference are also computed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | DataFrame containing the unique_id_columns and the columns that are to be compared against each other to find duplicates. | required |
| search_patterns | Dict | Dict containing key-value pairs of the form pattern_name: pattern, where pattern_name is a string and pattern is a dict whose keys are columns of df and whose values are the matching logic to apply, e.g. search_patterns={"Some Pattern Name": {"VENDOR_NAME": "exact", "INVOICE_DATE": "exact", "REFERENCE": "different", "VALUE": "exact", "_VENDOR_ID": "exact"}, ...}. Optional if search_patterns was already set before; the last used search patterns are always stored in the DuplicateChecker object under .search_patterns. | required |
| unique_id_columns | str or List[str] | The column or list of columns to be used as a unique identifier. | required |
| df_reference | pd.DataFrame, optional | DataFrame of the same structure containing the already processed items. All items of df are checked against each other and against all items of df_reference; the items of df_reference are NOT checked against each other. By default None. | None |
| max_chunk_size | int, optional | Size of the chunks compared at a time; decrease if memory problems occur, increase for speed. | 5000 |
Returns:
| Type | Description | 
|---|---|
| pd.DataFrame | DataFrame containing the duplicate rows of df plus two additional columns: GROUP_ID, which uniquely identifies and maps together the rows that are duplicates of each other, and PATTERN, the name of the search pattern that maps the items of a group. |
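Below is a minimal usage sketch of DuplicateChecker.apply. The import path, the way the DuplicateChecker instance is constructed, and the column and pattern names are assumptions for illustration only; adapt them to your data and PyCelonis version.

```python
import pandas as pd

# NOTE: the import path below is an assumption for illustration; check the
# module layout of your PyCelonis version before use.
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker

# Small invoice table; INVOICE_ID serves as the unique identifier.
df = pd.DataFrame(
    {
        "INVOICE_ID": ["A1", "A2", "A3"],
        "VENDOR_NAME": ["ACME Corp", "ACME Corp", "Globex"],
        "INVOICE_DATE": ["2021-01-01", "2021-01-01", "2021-02-01"],
        "REFERENCE": ["R-100", "R-200", "R-900"],
        "VALUE": [100.0, 100.0, 50.0],
    }
)

# One search pattern: same vendor, date and value, but a different reference.
search_patterns = {
    "Exact value, different reference": {
        "VENDOR_NAME": "exact",
        "INVOICE_DATE": "exact",
        "REFERENCE": "different",
        "VALUE": "exact",
    }
}

checker = DuplicateChecker()  # construction may require additional arguments
duplicates = checker.apply(
    df,
    search_patterns=search_patterns,
    unique_id_columns="INVOICE_ID",
    max_chunk_size=5000,
)

# The result contains only the duplicate rows of df, plus GROUP_ID and PATTERN;
# rows that are duplicates of each other share the same GROUP_ID.
print(duplicates[["INVOICE_ID", "GROUP_ID", "PATTERN"]])
```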