Quickstart¶
SaolaPy is a DataFrame library utilizing the power of PQL to transform and aggregate data from your data models. It was created to simplify your workflow, offering a Pandas-like interface for direct interactions. Our goal with SaolaPy is to make your work easier and more efficient. Here's why we believe SaolaPy will be helpful for you:
Familiarity: SaolaPy offers the ease and accessibility of Pandas, a tool that many data scientists and machine learning engineers already use. No need to learn PQL; jump right in with Python and get to work.
Optimized Performance: With SaolaPy, your computations run directly in SaolaDB, where they benefit from caching and query optimization. This ensures minimal memory usage and optimized performance. It eliminates the need to export full tables for transformations which often lead to a high memory consumption and in the worst case out of memory errors.
Error Prevention: SaolaPy offers a user-friendly environment where you can write and test your code. With Python's direct feedback and code completion features, you don't need to write PQL queries in the ML Workbench by hand anymore.
In this tutorial, you will learn:
- How to create SaolaPy Series and DataFrames
- Perform basic arithmetic, aggregation, and string operations to transform your data
Prerequisites:¶
To follow this tutorial, you need to have a Celonis EMS team set up and be familiar with the basic functionalities of PyCelonis.
Tutorial¶
1. Connect to your data model¶
To get started with SaolaPy, you first need to connect to our data model:
from pycelonis import get_celonis
import pycelonis.pql as pql
c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("<YOUR_DATA_POOL>")
data_model = data_pool.get_data_models().find("<YOUR_DATA_MODEL>")
To avoid always having to specify the data model you can set it as the global default data model:
from pycelonis.config import Config
Config.DEFAULT_DATA_MODEL = data_model
For convenience we extract the columns of tables we want to use:
ekpo_columns = data_model.get_tables().find("EKPO").get_columns()
2. SaolaPy Series¶
Now it's time to start working with SaolaPy Series:
2.1. Basics¶
First, we have to initialize a SaolaPy Series for the net price column (we are using the default data model specified above here):
net_price_series = pql.Series(ekpo_columns.find("NETPR"))
net_price_series
Alternatively, we can specify the data model explicitly:
net_price_series = pql.Series(ekpo_columns.find("NETPR"), data_model=data_model)
net_price_series
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
It is also possible to instantiate a series using a PQL query string:
net_price_series = pql.Series('"EKPO"."NETPR"', name="NETPR")
net_price_series
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
We can then look at some general information of this series similar to how we would using pandas. First, we look at the shape of the data:
net_price_series.shape
(325493,)
Afterwards, we export the first few rows of the series:
net_price_series.head()
Index 0 165.60 1 111.01 2 149.50 3 372.60 4 265.80 Name: NETPR, dtype: float64
2.2. Arithmetic Operations¶
Similarly to how we do it in pandas we can perform basic arithmetic operations:
net_price_series = net_price_series + 5 * 10
net_price_series
Series(data='( "EKPO"."NETPR" + 50 )', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
It's important to note here that the data stays in SaolaDB and is not exported. Only once we call to_pandas
the computation is executed and the data exported to the ML Workbench:
net_price_series.to_pandas()
Index 0 215.60 1 161.01 2 199.50 3 422.60 4 315.80 ... 325488 86.94 325489 1167.50 325490 164.00 325491 269.51 325492 545.81 Name: NETPR, Length: 325493, dtype: float64
It is also possible to aggregate the given series:
print("Min price:", net_price_series.min())
print("Mean price:", net_price_series.mean())
print("Max price:", net_price_series.max())
Min price: 50.0 Mean price: 6039.272307177108 Max price: 76417050.0
Calling these aggregation functions, always applies a PQL operator. To find more about which operator is applied you can take a look at the docstrings which contain a link to the respective PQL function:
net_price_series.max?
Signature: net_price_series.max() -> Union[str, int, float, datetime.datetime] Docstring: Return the max of the values. Applies [MAX operator](https://docs.celonis.com/en/max.html) to column. Returns: Max of series values. File: ~/Code/saolapy/saolapy/pandas/series.py Type: method
2.3. Categorical Data¶
For categorical data we can look at the value counts, as is done here for company codes:
bukrs_series = pql.Series(ekpo_columns.find("BUKRS"))
bukrs_series.value_counts()
BUKRS 5100 90984 SG01 69590 CN01 57458 GB03 44413 4575 28862 MY01 25939 HU01 7537 FR01 394 GB99 219 AT01 96 SE01 1 Name: COUNT_BUKRS, dtype: int64
Furthermore, it's possible to perform basic string operations:
bukrs_series = bukrs_series.str.replace("GB", "UK")
bukrs_series.head()
Index 0 UK03 1 UK03 2 5100 3 5100 4 5100 Name: BUKRS, dtype: object
2.4 Filtering Data¶
Often it is necessary to filter data. For this we can use the same syntax as pandas to filter for e.g. all purchasing documents from the UK:
filtered_bukrs_series = bukrs_series[bukrs_series.str.startswith("UK")]
filtered_bukrs_series.head()
Index 0 UK03 1 UK03 2 UK03 3 UK03 4 UK03 Name: BUKRS, dtype: object
Filters can also be combined and applied to different series. E.g. to get the mean net price of all UK03
purchasing documents with a value higher than 1000
one can execute the following.
NOTE: It's important to use brackets around each condition!
net_price_series[(bukrs_series.str.startswith("UK")) & (net_price_series > 1000)].mean()
12571.304588096225
NOTE: It's only possible to combine series with the same filters (as they need to have the same shape).
df = pql.DataFrame({
"NETPR": net_price_series,
"BUKRS": bukrs_series
})
df
DataFrame(data={'NETPR': '( "EKPO"."NETPR" + 50 )', 'BUKRS': 'REPLACE( "EKPO"."BUKRS", \'GB\', \'UK\' )'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])
You can access the dimensions of the data frame:
df.shape
(325493, 2)
As well as, get the first few rows:
df.head()
NETPR | BUKRS | |
---|---|---|
Index | ||
0 | 215.60 | UK03 |
1 | 161.01 | UK03 |
2 | 199.50 | 5100 |
3 | 422.60 | 5100 |
4 | 315.80 | 5100 |
Columns can be accessed and assigned with the same notation as with pandas, df.NETPR
or df["NETPR"]
:
df.NETPR
Series(data='( "EKPO"."NETPR" + 50 )', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
df["NETPR"]
Series(data='( "EKPO"."NETPR" + 50 )', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
df["NEW_COLUMN"] = df.NETPR * 5
It is also possible to select a subset of columns:
df[["BUKRS", "NEW_COLUMN"]].head()
BUKRS | NEW_COLUMN | |
---|---|---|
Index | ||
0 | UK03 | 1078.00 |
1 | UK03 | 805.05 |
2 | 5100 | 997.50 |
3 | 5100 | 2113.00 |
4 | 5100 | 1579.00 |
3.2. Arithmetic Operations¶
As with series, you can apply arithmetic operations:
transformed_df = df[["NETPR", "NEW_COLUMN"]] * 2 / 5
transformed_df.head()
NETPR | NEW_COLUMN | |
---|---|---|
Index | ||
0 | 86.240 | 431.20 |
1 | 64.404 | 322.02 |
2 | 79.800 | 399.00 |
3 | 169.040 | 845.20 |
4 | 126.320 | 631.60 |
You can also apply aggregations:
transformed_df.mean()
NETPR 2415.708923 NEW_COLUMN 12078.544614 Name: 0, dtype: float64
Conclusion¶
Congratulations! You have learned how to use the basic functionalities of SaolaPy to transform and aggregate data from your data model.