Quickstart¶

SaolaPy is a DataFrame library utilizing the power of PQL to transform and aggregate data from your data models. It was created to simplify your workflow, offering a Pandas-like interface for direct interactions. Our goal with SaolaPy is to make your work easier and more efficient. Here's why we believe SaolaPy will be helpful for you:

Familiarity: SaolaPy offers the ease and accessibility of Pandas, a tool that many data scientists and machine learning engineers already use. No need to learn PQL; jump right in with Python and get to work.
Optimized Performance: With SaolaPy, your computations run directly in SaolaDB, where they benefit from caching and query optimization. This ensures minimal memory usage and optimized performance. It eliminates the need to export full tables for transformations which often lead to a high memory consumption and in the worst case out of memory errors.
Error Prevention: SaolaPy offers a user-friendly environment where you can write and test your code. With Python's direct feedback and code completion features, you don't need to write PQL queries in the ML Workbench by hand anymore.

In this tutorial, you will learn:

How to create SaolaPy Series and DataFrames
Perform basic arithmetic, aggregation, and string operations to transform your data

Prerequisites:¶

To follow this tutorial, you need to have a Celonis EMS team set up and be familiar with the basic functionalities of PyCelonis.

Tutorial¶

1. Connect to your data model¶

To get started with SaolaPy, you first need to connect to our data model:

In [1]:

Copied!





from pycelonis import get_celonis
import pycelonis.pql as pql

c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("<YOUR_DATA_POOL>")
data_model = data_pool.get_data_models().find("<YOUR_DATA_MODEL>")
from pycelonis import get_celonis
import pycelonis.pql as pql

c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("")
data_model = data_pool.get_data_models().find("")

To avoid always having to specify the data model you can set it as the global default data model:

In [2]:

Copied!

from pycelonis.config import Config

Config.DEFAULT_DATA_MODEL = data_model
from pycelonis.config import Config

Config.DEFAULT_DATA_MODEL = data_model

For convenience we extract the columns of tables we want to use:

In [3]:

Copied!

ekpo_columns = data_model.get_tables().find("EKPO").get_columns()
ekpo_columns = data_model.get_tables().find("EKPO").get_columns()

2. SaolaPy Series¶

Now it's time to start working with SaolaPy Series:

2.1. Basics¶

First, we have to initialize a SaolaPy Series for the net price column (we are using the default data model specified above here):

In [4]:

Copied!

net_price_series = pql.Series(ekpo_columns.find("NETPR"))
net_price_series
net_price_series = pql.Series(ekpo_columns.find("NETPR"))
net_price_series

Alternatively, we can specify the data model explicitly:

In [5]:

Copied!

net_price_series = pql.Series(ekpo_columns.find("NETPR"), data_model=data_model)
net_price_series
net_price_series = pql.Series(ekpo_columns.find("NETPR"), data_model=data_model)
net_price_series

Out[5]:

Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

It is also possible to instantiate a series using a PQL query string:

In [6]:

Copied!

net_price_series = pql.Series('"EKPO"."NETPR"', name="NETPR")
net_price_series
net_price_series = pql.Series('"EKPO"."NETPR"', name="NETPR")
net_price_series

Out[6]:

Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

We can then look at some general information of this series similar to how we would using pandas. First, we look at the shape of the data:

In [7]:

Copied!

net_price_series.shape
net_price_series.shape

Out[7]:

(325493,)

Afterwards, we export the first few rows of the series:

In [8]:

Copied!

net_price_series.head()
net_price_series.head()

Out[8]:

Index
0    165.60
1    111.01
2    149.50
3    372.60
4    265.80
Name: NETPR, dtype: float64

2.2. Arithmetic Operations¶

Similarly to how we do it in pandas we can perform basic arithmetic operations:

In [9]:

Copied!

net_price_series = net_price_series + 5 * 10
net_price_series
net_price_series = net_price_series + 5 * 10
net_price_series

Out[9]:

Series(data='( "EKPO"."NETPR" + 50 )', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

It's important to note here that the data stays in SaolaDB and is not exported. Only once we call to_pandas the computation is executed and the data exported to the ML Workbench:

In [10]:

Copied!

net_price_series.to_pandas()
net_price_series.to_pandas()

Out[10]:

Index
0          215.60
1          161.01
2          199.50
3          422.60
4          315.80
           ...   
325488      86.94
325489    1167.50
325490     164.00
325491     269.51
325492     545.81
Name: NETPR, Length: 325493, dtype: float64

It is also possible to aggregate the given series:

In [11]:

Copied!

print("Min price:", net_price_series.min())
print("Mean price:", net_price_series.mean())
print("Max price:", net_price_series.max())
print("Min price:", net_price_series.min())
print("Mean price:", net_price_series.mean())
print("Max price:", net_price_series.max())

Min price: 50.0
Mean price: 6039.272307177108
Max price: 76417050.0

Calling these aggregation functions, always applies a PQL operator. To find more about which operator is applied you can take a look at the docstrings which contain a link to the respective PQL function:

In [12]:

Copied!

net_price_series.max?
net_price_series.max?

Signature: net_price_series.max() -> Union[str, int, float, datetime.datetime]
Docstring:
Return the max of the values.

Applies [MAX operator](https://docs.celonis.com/en/max.html) to column.

Returns:
    Max of series values.
File:      ~/Code/saolapy/saolapy/pandas/series.py
Type:      method

2.3. Categorical Data¶

For categorical data we can look at the value counts, as is done here for company codes:

In [13]:

Copied!

bukrs_series = pql.Series(ekpo_columns.find("BUKRS"))
bukrs_series.value_counts()
bukrs_series = pql.Series(ekpo_columns.find("BUKRS"))
bukrs_series.value_counts()

Out[13]:

BUKRS
5100    90984
SG01    69590
CN01    57458
GB03    44413
4575    28862
MY01    25939
HU01     7537
FR01      394
GB99      219
AT01       96
SE01        1
Name: COUNT_BUKRS, dtype: int64

Furthermore, it's possible to perform basic string operations:

In [14]:

Copied!

bukrs_series = bukrs_series.str.replace("GB", "UK")
bukrs_series.head()
bukrs_series = bukrs_series.str.replace("GB", "UK")
bukrs_series.head()

Out[14]:

Index
0    UK03
1    UK03
2    5100
3    5100
4    5100
Name: BUKRS, dtype: object

2.4 Filtering Data¶

Often it is necessary to filter data. For this we can use the same syntax as pandas to filter for e.g. all purchasing documents from the UK:

In [15]:

Copied!

filtered_bukrs_series = bukrs_series[bukrs_series.str.startswith("UK")]
filtered_bukrs_series.head()
filtered_bukrs_series = bukrs_series[bukrs_series.str.startswith("UK")]
filtered_bukrs_series.head()

Out[15]:

Index
0    UK03
1    UK03
2    UK03
3    UK03
4    UK03
Name: BUKRS, dtype: object

Filters can also be combined and applied to different series. E.g. to get the mean net price of all UK03 purchasing documents with a value higher than 1000 one can execute the following.

NOTE: It's important to use brackets around each condition!

In [16]:

Copied!

net_price_series[(bukrs_series.str.startswith("UK")) & (net_price_series > 1000)].mean()
net_price_series[(bukrs_series.str.startswith("UK")) & (net_price_series > 1000)].mean()

Out[16]:

12571.304588096225

NOTE: It's only possible to combine series with the same filters (as they need to have the same shape).

3. SaolaPy DataFrames¶

3.1. Basics¶

Similar to series, you can instantiate a data frame with multiple columns by passing them as a dictionary:

In [17]:

Copied!





df = pql.DataFrame({
    "NETPR": net_price_series,
    "BUKRS": bukrs_series
})
df
df = pql.DataFrame({
    "NETPR": net_price_series,
    "BUKRS": bukrs_series
})
df

Out[17]:

DataFrame(data={'NETPR': '( "EKPO"."NETPR" + 50 )', 'BUKRS': 'REPLACE( "EKPO"."BUKRS", \'GB\', \'UK\' )'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])

You can access the dimensions of the data frame:

In [18]:

Copied!

df.shape
df.shape

Out[18]:

(325493, 2)

As well as, get the first few rows:

In [19]:

Copied!

df.head()
df.head()

Out[19]:

	NETPR	BUKRS
Index
0	215.60	UK03
1	161.01	UK03
2	199.50	5100
3	422.60	5100
4	315.80	5100

Columns can be accessed and assigned with the same notation as with pandas, df.NETPR or df["NETPR"]:

In [20]:

Copied!

df.NETPR
df.NETPR

Out[20]:

Series(data='( "EKPO"."NETPR" + 50 )', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

In [26]:

Copied!

df["NETPR"]
df["NETPR"]

Out[26]:

Series(data='( "EKPO"."NETPR" + 50 )', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

In [21]:

Copied!

df["NEW_COLUMN"] = df.NETPR * 5
df["NEW_COLUMN"] = df.NETPR * 5

It is also possible to select a subset of columns:

In [22]:

Copied!

df[["BUKRS", "NEW_COLUMN"]].head()
df[["BUKRS", "NEW_COLUMN"]].head()

Out[22]:

	BUKRS	NEW_COLUMN
Index
0	UK03	1078.00
1	UK03	805.05
2	5100	997.50
3	5100	2113.00
4	5100	1579.00

3.2. Arithmetic Operations¶

As with series, you can apply arithmetic operations:

In [23]:

Copied!

transformed_df = df[["NETPR", "NEW_COLUMN"]] * 2 / 5
transformed_df.head()
transformed_df = df[["NETPR", "NEW_COLUMN"]] * 2 / 5
transformed_df.head()

Out[23]:

	NETPR	NEW_COLUMN
Index
0	86.240	431.20
1	64.404	322.02
2	79.800	399.00
3	169.040	845.20
4	126.320	631.60

You can also apply aggregations:

In [24]:

Copied!

transformed_df.mean()
transformed_df.mean()

Out[24]:

NETPR          2415.708923
NEW_COLUMN    12078.544614
Name: 0, dtype: float64

Conclusion¶

Congratulations! You have learned how to use the basic functionalities of SaolaPy to transform and aggregate data from your data model.