Deprecations
Deprecation Warnings
To keep PyCelonis up-to-date and guarantee support, some outdated modules are marked as
deprecated and will be removed from PyCelonis in version 2.0:
- Data Deduplication: please contact the Service Desk to migrate to the official Duplicate Invoice Checker App.
DuplicateChecker
Class to check for data deduplication.
apply(self, df, search_patterns, unique_id_columns, df_reference=None, max_chunk_size=5000, return_preprocessed_df=False, allow_group_intersection=False, fast_mode=False, disable_tqdm=False)
Computes the duplicates in the table df, based on the search patterns specified in the init of the DuplicateChecker object. If df_reference is given, the duplicates between df and df_reference are additionally computed.
Parameters:
Name | Type | Description | Default
---|---|---|---
df | pd.DataFrame | DataFrame containing the unique_id_columns and the columns that are to be compared against each other to find duplicates. | required
search_patterns | Dict | Optional if search_patterns was already set before. Dict containing key-value pairs of the form pattern_name: pattern, where pattern_name is a string and pattern is a dict whose keys are columns of df and whose values are the matching logic to apply. E.g.: search_patterns={"Some Pattern Name": {"VENDOR_NAME": "exact", "INVOICE_DATE": "exact", "REFERENCE": "different", "VALUE": "exact", "_VENDOR_ID": "exact"}, ...}. The last used search patterns are always stored in the DuplicateChecker object under .search_patterns. | required
unique_id_columns | str or List[str] | The column or list of columns to be used as a unique identifier. | required
df_reference | pd.DataFrame, optional | DataFrame of the same structure containing the already processed items. All items of df are checked against each other and against all of df_reference; the items of df_reference are NOT checked against each other. By default None. | None
max_chunk_size | int, optional | Size of the chunks compared at a time. Decrease if memory problems occur, increase for speed. | 5000
Returns:
Type | Description
---|---
pd.DataFrame | DataFrame containing the duplicate rows of df plus 2 additional columns: GROUP_ID, which uniquely identifies and maps together those rows that are duplicates of each other, and PATTERN, the name of the search pattern that maps the items of a group.
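To illustrate what a search pattern expresses, the matching logic of one pattern can be sketched in plain pandas. This is a minimal, hypothetical re-implementation for illustration only, not PyCelonis code: the DataFrame contents are invented, and the real apply() additionally handles chunking, reference frames, and grouping into GROUP_ID/PATTERN columns. The pattern shown requires VENDOR_NAME and VALUE to match exactly while REFERENCE must differ:

```python
import pandas as pd

# Hypothetical invoice data; column names follow the search-pattern example above.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "VENDOR_NAME": ["ACME", "ACME", "Globex"],
    "REFERENCE": ["INV-1", "INV-2", "INV-9"],
    "VALUE": [100.0, 100.0, 50.0],
})

# Self-join on the "exact" columns, then keep pairs where the "different"
# column actually differs. ID < ID_Y avoids self-pairs and mirrored pairs.
pairs = df.merge(df, on=["VENDOR_NAME", "VALUE"], suffixes=("", "_Y"))
dupes = pairs[(pairs["ID"] < pairs["ID_Y"]) & (pairs["REFERENCE"] != pairs["REFERENCE_Y"])]
# dupes now pairs invoice 1 with invoice 2: same vendor and value, different reference.
```

In the real apply(), such a pair would be assigned a shared GROUP_ID and the name of the matching search pattern in the PATTERN column.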