Data Export¶
In this tutorial, you will learn how to export data from the Celonis EMS into your local Python project. This allows you to analyze your Celonis data with tools such as Pandas or to run machine learning algorithms on it. More specifically, you will learn:
- Where to find the data that can be exported
- How to define which data to retrieve using PQL
- How to export the data as a Pandas dataframe
Prerequisites¶
To follow this tutorial, you should have created a data pool and a data model and uploaded data into them. As we continue working with the SAP Purchase-to-Pay (P2P) tables from the Data Push tutorial, it is recommended to complete that tutorial before embarking on this one.
Tutorial¶
1. Import PyCelonis and connect to Celonis API¶
from pycelonis import get_celonis
celonis = get_celonis(permissions=False)
[2024-11-12 15:04:56,174] INFO: No `base_url` given. Using environment variable 'CELONIS_URL'
[2024-11-12 15:04:56,174] INFO: No `api_token` given. Using environment variable 'CELONIS_API_TOKEN'
[2024-11-12 15:04:56,175] INFO: No `key_type` given. Using environment variable 'CELONIS_KEY_TYPE'
[2024-11-12 15:04:56,210] INFO: Initial connect successful! PyCelonis Version: 2.11.1
2. Find the data model from which data will be exported¶
The first step in exporting data from the EMS is to find the location where the data is stored, i.e. the corresponding data pool and data model.
data_pool = celonis.data_integration.get_data_pools().find("PyCelonis Tutorial Data Pool")
data_pool
DataPool(id='be065bab-bc94-4f2e-81d3-f4df1e126e2a', name='PyCelonis Tutorial Data Pool')
data_model = data_pool.get_data_models().find("PyCelonis Tutorial Data Model")
data_model
DataModel(id='68682a56-5bc4-4bfb-be4e-2e588335549c', name='PyCelonis Tutorial Data Model', pool_id='be065bab-bc94-4f2e-81d3-f4df1e126e2a')
Important:
During data export, data is retrieved from the data model, not from the data pool. This differs from the data push, where data is first inserted into the data pool and then loaded into a data model. This design was chosen because data models are specifically built to support fast querying of process data via Celonis' custom Process Query Language (PQL).
3. Define PQL query and export result as Pandas dataframe¶
Data is retrieved from the EMS via Celonis' custom querying language, PQL. Hence, in order to export data, we first need to specify a PQL query that defines which data to retrieve; we can then export the resulting table as a Pandas dataframe. For this, we will use the SaolaPy Series and DataFrame implementations. For more information on how to use SaolaPy, visit the SaolaPy Tutorial.
import pycelonis.pql as pql
First, we will specify the default data model to use with SaolaPy:
from pycelonis.config import Config
Config.DEFAULT_DATA_MODEL = data_model
3.1 Selecting columns¶
activities = data_model.get_tables().find("ACTIVITIES")
activity_columns = activities.get_columns()
Then, we will create a DataFrame representing an OLAP table by specifying the columns:
df = pql.DataFrame(
    {
        "_CASE_KEY": activity_columns.find("_CASE_KEY"),
        "ACTIVITY_EN": activity_columns.find("ACTIVITY_EN"),
        "EVENTTIME": activity_columns.find("EVENTTIME"),
        # It is also possible to write a PQL query string directly:
        "_SORTING": """ "ACTIVITIES"."_SORTING" """,
    }
)
We can take a look at the generated PQL query by executing:
df.query
PQL(columns=[PQLColumn(name='Index', query='0 - 1 + RUNNING_TOTAL(1)'), PQLColumn(name='_CASE_KEY', query='"ACTIVITIES"."_CASE_KEY"'), PQLColumn(name='ACTIVITY_EN', query='"ACTIVITIES"."ACTIVITY_EN"'), PQLColumn(name='EVENTTIME', query='"ACTIVITIES"."EVENTTIME"'), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ')], filters=[], order_by_columns=[], distinct=False, limit=None, offset=None)
To get the dimensions of the DataFrame simply run:
df.shape
[2024-11-12 15:04:56,332] INFO: Successfully created data export with id '2f5fd8a3-9b98-4eb7-98fb-94411ad31b27'
[2024-11-12 15:04:56,333] INFO: Wait for execution of data export with id '2f5fd8a3-9b98-4eb7-98fb-94411ad31b27'
[2024-11-12 15:04:56,340] INFO: Export result chunks for data export with id '2f5fd8a3-9b98-4eb7-98fb-94411ad31b27'
(60, 4)
Then, we can use SaolaPy to perform additional operations on the table, such as filtering, sorting, or grouping. For more information on available operations visit the SaolaPy Tutorial.
3.2 Filtering¶
To filter the data, simply apply a boolean mask, just like in Pandas:
df = df[df._CASE_KEY == "800000000006800001"]
By displaying df, we can see that the filter has been added to the object:
df
DataFrame(data={'_CASE_KEY': '"ACTIVITIES"."_CASE_KEY"', 'ACTIVITY_EN': '"ACTIVITIES"."ACTIVITY_EN"', 'EVENTTIME': '"ACTIVITIES"."EVENTTIME"', '_SORTING': ' "ACTIVITIES"."_SORTING" '}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "ACTIVITIES"."_CASE_KEY" = \'8000000000...'], order_by_columns=[])
Also, the shape now reflects the added filter:
df.shape
[2024-11-12 15:04:56,456] INFO: Successfully created data export with id 'f3a42712-1ad6-4ec2-b663-55147c8c9196'
[2024-11-12 15:04:56,458] INFO: Wait for execution of data export with id 'f3a42712-1ad6-4ec2-b663-55147c8c9196'
[2024-11-12 15:04:56,462] INFO: Export result chunks for data export with id 'f3a42712-1ad6-4ec2-b663-55147c8c9196'
(6, 4)
3.3 Sorting the data¶
To sort the data, simply use the sort_values function:
df = df.sort_values(by=["EVENTTIME", "_SORTING"])
By displaying df, we can see that the two order-by columns have been added:
df
DataFrame(data={'_CASE_KEY': '"ACTIVITIES"."_CASE_KEY"', 'ACTIVITY_EN': '"ACTIVITIES"."ACTIVITY_EN"', 'EVENTTIME': '"ACTIVITIES"."EVENTTIME"', '_SORTING': ' "ACTIVITIES"."_SORTING" '}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "ACTIVITIES"."_CASE_KEY" = \'8000000000...'], order_by_columns=['"ACTIVITIES"."EVENTTIME" ASC', ' "ACTIVITIES"."_SORTING" ASC'])
df.head()
[2024-11-12 15:04:56,495] INFO: Successfully created data export with id 'a04f6fb6-810d-4e7f-847d-468e08a141e9'
[2024-11-12 15:04:56,497] INFO: Wait for execution of data export with id 'a04f6fb6-810d-4e7f-847d-468e08a141e9'
[2024-11-12 15:04:56,501] INFO: Export result chunks for data export with id 'a04f6fb6-810d-4e7f-847d-468e08a141e9'
Index | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING
---|---|---|---|---
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN
3 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN
3.4 SaolaPy Operations¶
We can then apply further SaolaPy transformations, for example arithmetic and string operations:
df._SORTING = df._SORTING + 5
df.ACTIVITY_EN = df.ACTIVITY_EN.str.replace("Receive Goods", "Goods Received")
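The SaolaPy operations above deliberately mirror Pandas semantics. As a rough local sanity check (this is plain Pandas on small in-memory data, not SaolaPy, and nothing here runs in the PQL engine), the same two operations behave as follows:

```python
import pandas as pd

# Local preview of the two operations above: NaN values propagate through
# the arithmetic, and str.replace substitutes the matching activity name.
sorting = pd.Series([0.0, 10.0, None, 30.0])
activities = pd.Series(["Create Purchase Requisition Item", "Receive Goods"])

shifted = sorting + 5
renamed = activities.str.replace("Receive Goods", "Goods Received")

print(shifted.tolist())   # [5.0, 15.0, nan, 35.0]
print(renamed.tolist())   # ['Create Purchase Requisition Item', 'Goods Received']
```

Note how the NaN entries of _SORTING stay NaN after the addition; the same holds for the exported result below.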
3.5 Data Export¶
Finally, we can export the data as a Pandas dataframe to get our final result table. This step should only be done after all aggregations and filters have been applied, to keep memory consumption low and to execute as much of the computation as possible inside the PQL engine instead of in Python:
pandas_df = df.to_pandas()
pandas_df
[2024-11-12 15:04:56,549] INFO: Successfully created data export with id '6b85c891-67b3-48bc-b44c-2cd3039a96f0'
[2024-11-12 15:04:56,550] INFO: Wait for execution of data export with id '6b85c891-67b3-48bc-b44c-2cd3039a96f0'
[2024-11-12 15:04:56,556] INFO: Export result chunks for data export with id '6b85c891-67b3-48bc-b44c-2cd3039a96f0'
Index | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING
---|---|---|---|---
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 5.0
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 15.0
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN
3 | 800000000006800001 | Goods Received | 2009-01-12 07:44:05 | 35.0
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN
5 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN
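Once exported, pandas_df is an ordinary Pandas dataframe, so any standard analysis applies. As a minimal sketch (the frame below is constructed locally to mirror the result table above; in practice you would use the pandas_df returned by to_pandas()), we can compute the time elapsed between consecutive activities of the case:

```python
import pandas as pd

# Small frame mirroring the exported result above (constructed locally).
pandas_df = pd.DataFrame({
    "ACTIVITY_EN": [
        "Create Purchase Requisition Item",
        "Create Purchase Order Item",
        "Print and Send Purchase Order",
        "Goods Received",
        "Scan Invoice",
        "Book Invoice",
    ],
    "EVENTTIME": pd.to_datetime([
        "2008-12-31 07:44:05",
        "2009-01-02 07:44:05",
        "2009-01-05 07:44:05",
        "2009-01-12 07:44:05",
        "2009-01-20 07:44:05",
        "2009-01-30 07:44:05",
    ]),
})

# Days between each activity and its predecessor (NaN for the first event).
pandas_df["DURATION_DAYS"] = pandas_df["EVENTTIME"].diff().dt.days
print(pandas_df[["ACTIVITY_EN", "DURATION_DAYS"]])
```

This kind of follow-up computation is exactly where the exported dataframe shines; the heavy filtering and sorting stayed in the PQL engine, while the lightweight per-case analysis runs locally.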
We can also set the parameters distinct, offset, and limit of the PQL base object.

Let us specify a value for the limit property. By exporting the resulting dataframe, we can see that our result table has been reduced to only 3 rows:
df.to_pandas(limit=3)
[2024-11-12 15:04:56,584] INFO: Successfully created data export with id '6ee18a00-f1d9-4cfd-8621-d014b5f4f2a2'
[2024-11-12 15:04:56,585] INFO: Wait for execution of data export with id '6ee18a00-f1d9-4cfd-8621-d014b5f4f2a2'
[2024-11-12 15:04:56,592] INFO: Export result chunks for data export with id '6ee18a00-f1d9-4cfd-8621-d014b5f4f2a2'
Index | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING
---|---|---|---|---
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 5.0
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 15.0
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN
Next, we can specify an offset, meaning that a certain number of rows will be skipped. By exporting the resulting dataframe, we can see that the result table still contains 3 rows, but this time the first 3 rows are skipped and only the last 3 rows are returned:
df.to_pandas(limit=3, offset=3)
[2024-11-12 15:04:56,621] INFO: Successfully created data export with id 'eb0f31eb-0b0e-4f10-9ac9-08e168c684e2'
[2024-11-12 15:04:56,621] INFO: Wait for execution of data export with id 'eb0f31eb-0b0e-4f10-9ac9-08e168c684e2'
[2024-11-12 15:04:56,628] INFO: Export result chunks for data export with id 'eb0f31eb-0b0e-4f10-9ac9-08e168c684e2'
Index | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING
---|---|---|---|---
3 | 800000000006800001 | Goods Received | 2009-01-12 07:44:05 | 35.0
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN
5 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN
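For intuition, limit and offset behave like positional slicing in Pandas. A minimal plain-Pandas sketch (local data, not a Celonis export) of the same semantics:

```python
import pandas as pd

# to_pandas(limit=n, offset=m) corresponds to the positional slice
# iloc[m:m+n] on the full result: skip m rows, then take up to n rows.
full = pd.DataFrame({"row": [0, 1, 2, 3, 4, 5]})

limit, offset = 3, 3
page = full.iloc[offset:offset + limit]
print(page["row"].tolist())  # [3, 4, 5]
```

Combining limit and offset this way is also a common pattern for retrieving a large result table page by page instead of in one export.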
Conclusion¶
Congratulations! You have learned how to create PQL queries in PyCelonis in order to export data from the EMS using SaolaPy. In the next tutorial Data Model - Advanced, we will cover more advanced topics in data models, such as foreign keys, process configurations, name mappings, and different reload modes.