Filtering¶
In this tutorial, you will learn:
- How to filter your data to reduce memory usage
- How to apply advanced filter conditions
- How to chain filters to include multiple conditions
Prerequisites:¶
To follow this tutorial, you need to have a Celonis EMS team set up and be familiar with the basic functionalities of PyCelonis.
Tutorial¶
1. Connect to your data model¶
To get started with SaolaPy, we first need to connect to our data model:
from pycelonis import get_celonis
import pycelonis.pql as pql
import pandas as pd
c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("<YOUR_DATA_POOL>")
data_model = data_pool.get_data_models().find("<YOUR_DATA_MODEL>")
To avoid always having to specify the data model we set it as our global default data model:
from pycelonis.config import Config
Config.DEFAULT_DATA_MODEL = data_model
For convenience, we extract the columns of tables we want to use:
ekpo_columns = data_model.get_tables().find("EKPO").get_columns()
2. Filtering Basics¶
Let's first create two series to work with that contain price data and country codes:
net_price_series = pql.Series(ekpo_columns.find("NETPR"))
country_code_series = pql.Series(ekpo_columns.find("BUKRS"))
If we now want to get the mean price for a specific country code we can just filter by this country code and then compute the mean:
net_price_series[country_code_series == "GB03"].mean()
780.3879177267938
It is also possible to apply multiple filters. We can for example in addition to the country code filter exclude outliers:
NOTE: When combining filters, always put the single conditions in brackets!
net_price_series[(country_code_series == "GB03") & (net_price_series < 10000)].mean()
204.51182691653986
To filter for multiple country codes we can combine them with the OR
operator:
net_price_series[(country_code_series == "GB03") | (country_code_series == "CN01")].mean()
5193.763160074998
Lastly it is also possible to negate filters using the inversion operator:
net_price_series[~(country_code_series == "GB03")].mean()
6812.319775473179
Other conditional operators that can be used for filtering are:
# Lower than
net_price_series[country_code_series < "GB03"].mean()
# Lower equal
net_price_series[country_code_series <= "GB03"].mean()
# Equal
net_price_series[country_code_series == "GB03"].mean()
# Not equal
net_price_series[country_code_series != "GB03"].mean()
# Greater equal
net_price_series[country_code_series >= "GB03"].mean()
# Greater than
net_price_series[country_code_series > "GB03"].mean()
4498.652512344358
3. Chaining Filters¶
SaolaPy also allows to chain filters by applying them in multiple steps. In the following we filter for one country code and exclude outliers:
filtered_net_price_series = net_price_series[country_code_series == "GB03"]
filtered_net_price_series = filtered_net_price_series[filtered_net_price_series < 10000]
filtered_net_price_series.mean()
204.51182691653986
NOTE: It is only possible to combine series that have the same shape. Therefore, if a series has different filters it won't be possible to e.g. add it to a given series:
filtered_net_price_series = net_price_series[country_code_series == "GB03"] + net_price_series
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In [11], line 1 ----> 1 filtered_net_price_series = net_price_series[country_code_series == "GB03"] + net_price_series File ~/Code/saolapy/saolapy/pandas/series.py:328, in Series.__add__(self, other) 327 def __add__(self, other: Union["Series", NumericValue]) -> "Series": --> 328 return self.apply_binary_operator_dunder(other, AddOperator) File ~/Code/saolapy/saolapy/pandas/series.py:970, in Series.apply_binary_operator_dunder(self, other, operator, reverse) 968 """Combines series with the other for dunder methods.""" 969 try: --> 970 return self.apply_binary_operator(other, operator, reverse=reverse) 971 except TypeError: 972 return NotImplemented File ~/Code/saolapy/saolapy/pandas/series.py:957, in Series.apply_binary_operator(self, other, operator, reverse, **kwargs) 954 raise TypeError(f"Data type {type(other)} not supported.") 956 if isinstance(other, Series): --> 957 verify_compatibility(self.index, other.index, self.filters, other.filters) 959 other_data = other.data if isinstance(other, Series) else other 961 if reverse: File ~/Code/saolapy/saolapy/pandas/util.py:14, in verify_compatibility(index1, index2, filters1, filters2) 11 raise ValueError(f"Indices do not match: {index1} != {index2}.") 13 if filters1 != filters2: ---> 14 raise ValueError(f"Filters do not match: {filters1} != {filters2}.") ValueError: Filters do not match: [PQLFilter(query='FILTER ( "EKPO"."BUKRS" = \'GB03\' );')] != [].
To remove filters from a series you can simply call the reset_filters
method:
unfiltered_series = filtered_net_price_series.reset_filters()
unfiltered_series
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
4. Advanced Filter Conditions¶
SaolaPy comes with a range of advanced filter conditions to use.
First, one can filter for values that occur in a list:
country_code_series[country_code_series.isin(["GB03", "GB99", "HU01"])]
Series(data='"EKPO"."BUKRS"', index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."BUKRS" IN ( \'GB03\', \'GB99\', \'H...'], order_by_columns=[], name='BUKRS')
It is also possible to filter for a specific range of values:
net_price_series[net_price_series.between(0, 100)]
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."NETPR" BETWEEN 0 AND 100 );'], order_by_columns=[], name='NETPR')
Lastly, it's possible to filter for null values:
net_price_series[net_price_series.isnull()]
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."NETPR" IS NULL );'], order_by_columns=[], name='NETPR')
5. Filtering DataFrames¶
DataFrames can be filtered in a similar way to series:
df = pql.DataFrame({
"MENGE": ekpo_columns.find("MENGE"),
"NETPR": ekpo_columns.find("NETPR"),
})
filtered_df = df[net_price_series < 10000]
filtered_df
DataFrame(data={'MENGE': '"EKPO"."MENGE"', 'NETPR': '"EKPO"."NETPR"'}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."NETPR" < 10000 );'], order_by_columns=[])
If the filter is also a DataFrame, the condition will be applied to all columns. In the following example we filter for rows with MENGE
and NETPR
greater than 2:
filtered_df = df[df > 2]
filtered_df
DataFrame(data={'MENGE': '"EKPO"."MENGE"', 'NETPR': '"EKPO"."NETPR"'}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."MENGE" > 2 );', 'FILTER ( "EKPO"."NETPR" > 2 );'], order_by_columns=[])
Lastly, it's also possible to use pandas Series to apply different thresholds. For example to filter for MENGE<3
and NETPR<10000
:
filtered_df = df[df < pd.Series([3, 10000], index=["MENGE", "NETPR"])]
filtered_df
DataFrame(data={'MENGE': '"EKPO"."MENGE"', 'NETPR': '"EKPO"."NETPR"'}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."MENGE" < 3 );', 'FILTER ( "EKPO"."NETPR" < 10000 );'], order_by_columns=[])
This can be especially useful when for example filtering for data smaller than the median:
df[df < df.median()]
DataFrame(data={'MENGE': '"EKPO"."MENGE"', 'NETPR': '"EKPO"."NETPR"'}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "EKPO"."MENGE" < 4.0 );', 'FILTER ( "EKPO"."NETPR" < 63.54 );'], order_by_columns=[])
Conclusion¶
Congratulations! You have learned how to filter SaolaPy Series and DataFrames.