Deprecations

Deprecation Warnings

To keep PyCelonis up to date and guarantee support, some outdated modules are marked as deprecated and will be removed in PyCelonis version 2.0:
- Data Deduplication: please contact Service Desk to migrate to the official Duplicate Invoice Checker App.

DuplicateChecker

Class to check data for duplicates.

apply(self, df, search_patterns, unique_id_columns, df_reference=None, max_chunk_size=5000, return_preprocessed_df=False, allow_group_intersection=False, fast_mode=False, disable_tqdm=False)

Computes the duplicates in the table df, based on the search patterns specified when the DuplicateChecker object was initialized. If df_reference is given, the duplicates between df and df_reference are computed as well.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pd.DataFrame | DataFrame containing the unique_id_columns and the columns that are to be compared against each other to find duplicates. | required |
| search_patterns | Dict | Optional if search_patterns was already set before. Dict containing key-value pairs of the form pattern_name: pattern, where pattern_name is a string and pattern is a dict whose keys are columns of df and whose values are the matching logic to apply, e.g. search_patterns={"Some Pattern Name": {"VENDOR_NAME": "exact", "INVOICE_DATE": "exact", "REFERENCE": "different", "VALUE": "exact", "_VENDOR_ID": "exact"}, ...}. The last used search patterns are always stored in the DuplicateChecker object under .search_patterns. A runnable version of this example is sketched after the table. | required |
| unique_id_columns | str or List[str] | The column or list of columns to be used as a unique identifier. | required |
| df_reference | pd.DataFrame, optional | DataFrame of the same structure containing the already processed items. All items of df are checked against each other and against all items of df_reference; the items of df_reference are NOT checked against each other. By default None. | None |
| max_chunk_size | int, optional | Size of the chunks compared at a time; decrease if memory problems occur, increase for speed. | 5000 |
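
A minimal sketch of how a search_patterns dictionary can be written out as Python, based on the example above. The pattern name is a free-form string, and the column names are illustrative; they must match columns present in your df. This assumes "exact" requires identical values and "different" requires differing values, as suggested by the example.

```python
# Illustrative search_patterns dict; adapt the pattern name and column names to your data.
search_patterns = {
    "Some Pattern Name": {
        "VENDOR_NAME": "exact",    # values must match exactly
        "INVOICE_DATE": "exact",
        "REFERENCE": "different",  # values must differ
        "VALUE": "exact",
        "_VENDOR_ID": "exact",
    },
    # further patterns can be added as additional pattern_name: pattern pairs
}
```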

Returns:

| Type | Description |
| --- | --- |
| pd.DataFrame | DataFrame containing the duplicate rows of df plus 2 additional columns: GROUP_ID, which uniquely identifies and maps together the rows that are duplicates of each other, and PATTERN, the name of the search pattern that maps the items of a group. |
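
A hedged end-to-end sketch of how apply might be used. The import path, the no-argument constructor, and the sample data are assumptions not taken from this page; adapt them to your installed PyCelonis version and your own table.

```python
import pandas as pd

# Assumed import path for PyCelonis 1.x; adjust to your installation.
from pycelonis.data_deduplication.duplicate_checker import DuplicateChecker

# Illustrative invoice table; INVOICE_ID serves as the unique identifier.
df = pd.DataFrame({
    "INVOICE_ID": ["A1", "A2", "A3"],
    "VENDOR_NAME": ["ACME", "ACME", "Globex"],
    "INVOICE_DATE": ["2021-01-05", "2021-01-05", "2021-02-01"],
    "REFERENCE": ["R-001", "R-002", "R-100"],
    "VALUE": [100.0, 100.0, 250.0],
    "_VENDOR_ID": ["V1", "V1", "V2"],
})

search_patterns = {
    "Exact value, different reference": {
        "VENDOR_NAME": "exact",
        "INVOICE_DATE": "exact",
        "REFERENCE": "different",
        "VALUE": "exact",
        "_VENDOR_ID": "exact",
    }
}

# Assumes the checker can be instantiated without arguments and the
# search patterns passed directly to apply().
dc = DuplicateChecker()
result = dc.apply(
    df,
    search_patterns=search_patterns,
    unique_id_columns="INVOICE_ID",
)

# The result contains the duplicate rows of df plus GROUP_ID and PATTERN;
# rows sharing a GROUP_ID are duplicates of each other under the named pattern.
print(result[["INVOICE_ID", "GROUP_ID", "PATTERN"]])
```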