Data Export¶
In this tutorial, you will learn how to export data from the Celonis EMS into your local Python project. This allows you to perform analyses on your Celonis data using tools such as Pandas, or to run machine learning algorithms on it. More specifically, you will learn:
- Where to find the data that can be exported
- How to define which data to retrieve using PQL
- How to export the data as a Pandas dataframe
Prerequisites¶
To follow this tutorial, you should have created a data pool and a data model, and uploaded data into them. As we continue working with the SAP Purchase-to-Pay (P2P) tables from the Data Upload tutorial, it is recommended to complete the Data Push tutorial before starting this one.
Tutorial¶
1. Import PyCelonis and connect to Celonis API¶
from pycelonis import get_celonis
celonis = get_celonis(permissions=False)
[2023-01-19 10:22:55,408] INFO: No `base_url` given. Using environment variable 'CELONIS_URL'
[2023-01-19 10:22:55,409] INFO: No `api_token` given. Using environment variable 'CELONIS_API_TOKEN'
[2023-01-19 10:22:55,481] WARNING: KeyType is not set. Defaulted to 'APP_KEY'.
[2023-01-19 10:22:55,484] INFO: Initial connect successful! PyCelonis Version: 2.0.1
2. Find the data to be exported from the EMS¶
The first step in exporting data from the EMS is to find the location where the data is stored, i.e. the corresponding data pool and data model.
data_pool = celonis.data_integration.get_data_pools().find("PyCelonis Tutorial Data Pool")
data_pool
DataPool(id='4fb45b3c-f80a-4c23-9bc7-6c5ee3e2b4de', name='PyCelonis Tutorial Data Pool')
data_model = data_pool.get_data_models().find("PyCelonis Tutorial Data Model")
data_model
DataModel(id='30bb734f-edbd-4fa2-b90d-dfbe1e6168e9', name='PyCelonis Tutorial Data Model', pool_id='4fb45b3c-f80a-4c23-9bc7-6c5ee3e2b4de')
Important:
In the data export, data is retrieved from the data model, not from the data pool. This differs from the data push, where data is first inserted into the data pool and then loaded into a data model. This design choice was made because data models are specifically designed to support fast querying of process data via Celonis' custom Process Query Language (PQL).
3. Define PQL query and export result as Pandas dataframe¶
Data from the EMS is retrieved via Celonis' custom querying language PQL. Hence, in order to export data, we first need to specify a PQL query that defines which data to retrieve; we can then export the resulting table as a Pandas dataframe.
PyCelonis supports four types of objects to construct such a PQL query:

PQL Object | Description
---|---
`PQL` | Base object, which is used to add elements, such as columns and filters, to the query.
`PQLColumn` | Specifies a column of a data model table in the format `"table"."column"` that should be retrieved. It is possible to add columns from different tables to the query. These columns will be joined into a single table using PQL's implicit join functionality. However, for this, we first need to define relationships between the data model tables using foreign keys. For a guide on how to do this, refer to the Data Model - Advanced tutorial.
`PQLFilter` | Specifies a filter that should be applied to the resulting table. It is possible to add multiple filters. These filters will be connected via AND logic.
`OrderByColumn` | Specifies the column according to which to sort the resulting table. It is possible to add multiple order-columns. In this case, the first added order-column is the most important, the second added order-column the second-most important, and so on.
All these objects can be imported from `pycelonis.pql`:
from pycelonis.pql import PQL, PQLColumn, PQLFilter, OrderByColumn
3.1 PQL Base Object¶
The first step in creating a PQL query is to instantiate the PQL base object, which takes as input arguments:
Name | Type | Description | Default
---|---|---|---
`distinct` | `bool` | If true, all duplicate rows are removed. Note: It is recommended to enable this functionality only if really needed, as data deduplication significantly reduces query performance! | `False`
`limit` | `int` | Limits the result to a certain number of rows. | `None`
`offset` | `int` | Skips a certain number of rows in the result. | `None`
query = PQL(distinct=False, limit=None, offset=None)
This object will create an empty PQL query in the following format:
<PQLFilters>
TABLE( [DISTINCT]
<PQLColumns>
)
ORDER BY <OrderByColumns>
LIMIT ...
OFFSET ...
By printing the query, we can see that it is currently empty:
query
PQL(columns=[], filters=[], order_by_columns=[], distinct=False, limit=None, offset=None)
We can now add more elements to it in order to fill the placeholders for `<PQLFilters>`, `<PQLColumns>`, and `<OrderByColumns>`.

New elements can be added either via `query.add(<element>)` or via `query += <element>`, as illustrated in the sketch below.
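Both forms are equivalent. A minimal sketch, using a throwaway query so that the tutorial's `query` object stays untouched (the column choices here are just for illustration):

```python
from pycelonis.pql import PQL, PQLColumn

# Throwaway query for illustration only; not used in the rest of the tutorial.
demo_query = PQL()
demo_query.add(PQLColumn(name="_CASE_KEY", query=""" "ACTIVITIES"."_CASE_KEY" """))    # via .add()
demo_query += PQLColumn(name="ACTIVITY_EN", query=""" "ACTIVITIES"."ACTIVITY_EN" """)  # via +=
```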
3.2 PQLColumn Objects¶
First, we need to specify which columns from which tables to retrieve. As we haven't defined foreign key relationships in the data model yet, we cannot query across multiple tables. Hence, we only query columns from the activity table here.
Columns can be added to the result table by adding a `PQLColumn` object to the query, which takes as input arguments:
Name | Type | Description | Default
---|---|---|---
`name` | `str` | Alias for the column. | Required
`query` | `str` | PQL query that specifies which column from which table to retrieve, in the format `"Table"."Column"`. Note: It is recommended to use triple quotes `""" """` to avoid escaping the quote characters when defining the query. | Required
query += PQLColumn(name="_CASE_KEY", query=""" "ACTIVITIES"."_CASE_KEY" """)
query += PQLColumn(name="ACTIVITY_EN", query=""" "ACTIVITIES"."ACTIVITY_EN" """)
query += PQLColumn(name="EVENTTIME", query=""" "ACTIVITIES"."EVENTTIME" """)
query += PQLColumn(name="_SORTING", query=""" "ACTIVITIES"."_SORTING" """)
This object will create a PQL query in the format `"Table"."Column" AS name`. By printing the query, we can see that the columns have been added to the query object:
query
PQL(columns=[ PQLColumn(name='_CASE_KEY', query=' "ACTIVITIES"."_CASE_KEY" '), PQLColumn(name='ACTIVITY_EN', query=' "ACTIVITIES"."ACTIVITY_EN" '), PQLColumn(name='EVENTTIME', query=' "ACTIVITIES"."EVENTTIME" '), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ') ], filters=[], order_by_columns=[], distinct=False, limit=None, offset=None)
We can now export the table resulting from our PQL query as a Pandas dataframe by calling the `export_data_frame()` method and passing the `query` as input argument:
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df.head()
[2023-01-19 10:22:55,568] INFO: Successfully created data export with id '09ef7089-389b-4254-97d8-dbcf83ff88a1'
[2023-01-19 10:22:55,568] INFO: Wait for execution of data export with id '09ef7089-389b-4254-97d8-dbcf83ff88a1'
[2023-01-19 10:22:55,588] INFO: Export result chunks for data export with id '09ef7089-389b-4254-97d8-dbcf83ff88a1'
(60, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
1 | 800000000006800002 | Book Invoice | 2009-01-28 07:56:04 | NaN |
2 | 800000000006800002 | Create Purchase Requisition Item | 2008-12-31 07:56:04 | 0.0 |
3 | 800000000006800003 | Create Purchase Requisition Item | 2008-12-31 07:56:04 | 0.0 |
4 | 800000000006800004 | Create Purchase Requisition Item | 2008-12-31 07:56:04 | 0.0 |
We can see that the resulting dataframe contains only the columns `_CASE_KEY`, `ACTIVITY_EN`, `EVENTTIME`, and `_SORTING` specified in our PQL query.
Note: In new teams, data export is not enabled by default. To enable it, contact Celonis customer support.
3.3 PQLFilter Objects¶
Next, we can specify if the rows of the result table should be filtered. Filters can be applied by adding a `PQLFilter` object to the query, which takes as input arguments:
Name | Type | Description | Default
---|---|---|---
`query` | `str` | PQL query to filter the data. Must be in the format: `FILTER <condition>;` | Required
Let's suppose we want to filter the activity table only for a certain case (i.e. purchase order item):
query += PQLFilter(query=""" FILTER "ACTIVITIES"."_CASE_KEY" = '800000000006800001'; """)
By printing the query, we can see that the filter has been added to the query object:
query
PQL(columns=[ PQLColumn(name='_CASE_KEY', query=' "ACTIVITIES"."_CASE_KEY" '), PQLColumn(name='ACTIVITY_EN', query=' "ACTIVITIES"."ACTIVITY_EN" '), PQLColumn(name='EVENTTIME', query=' "ACTIVITIES"."EVENTTIME" '), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ') ], filters=[ PQLFilter(query=' FILTER "ACTIVITIES"."_CASE_KEY" = \'800000000006800001\'; ') ], order_by_columns=[], distinct=False, limit=None, offset=None)
By exporting the table as a Pandas dataframe, we can see that the resulting table has been filtered to only include the purchase order item specified in our PQL query:
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2023-01-19 10:22:55,643] INFO: Successfully created data export with id '74942a3f-72bc-4757-9df5-49876cace32a'
[2023-01-19 10:22:55,644] INFO: Wait for execution of data export with id '74942a3f-72bc-4757-9df5-49876cace32a'
[2023-01-19 10:22:55,662] INFO: Export result chunks for data export with id '74942a3f-72bc-4757-9df5-49876cace32a'
(6, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
1 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
2 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
3 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
4 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
5 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
3.4 OrderByColumn Objects¶
Next, we can specify if the result table should be sorted according to one or more columns. For this, we need to add an `OrderByColumn` object to the query, which takes as input arguments:
Name | Type | Description | Default
---|---|---|---
`query` | `str` | PQL query that specifies the column according to which to sort, in the format `"Table"."Column"`. | Required
`ascending` | `bool` | If true, results are sorted in ascending order. If false, results are sorted in descending order. | `True`
Let's suppose we want to sort the result according to the timestamp and sorting-column of the activity table:
query += OrderByColumn(query=""" "ACTIVITIES"."EVENTTIME" """)
query += OrderByColumn(query=""" "ACTIVITIES"."_SORTING" """)
By printing the query, we can see that the two order-columns have been added to the query:
query
PQL(columns=[ PQLColumn(name='_CASE_KEY', query=' "ACTIVITIES"."_CASE_KEY" '), PQLColumn(name='ACTIVITY_EN', query=' "ACTIVITIES"."ACTIVITY_EN" '), PQLColumn(name='EVENTTIME', query=' "ACTIVITIES"."EVENTTIME" '), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ') ], filters=[ PQLFilter(query=' FILTER "ACTIVITIES"."_CASE_KEY" = \'800000000006800001\'; ') ], order_by_columns=[ OrderByColumn(query=' "ACTIVITIES"."EVENTTIME" ', ascending=True), OrderByColumn(query=' "ACTIVITIES"."_SORTING" ', ascending=True) ], distinct=False, limit=None, offset=None)
By exporting the table as a Pandas dataframe, we can see that the resulting table is now sorted according to the two columns `EVENTTIME` and `_SORTING` specified in our PQL query:
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2023-01-19 10:22:55,700] INFO: Successfully created data export with id '3aa3f882-c25b-4b93-b079-09fe5a144be2'
[2023-01-19 10:22:55,701] INFO: Wait for execution of data export with id '3aa3f882-c25b-4b93-b079-09fe5a144be2'
[2023-01-19 10:22:55,717] INFO: Export result chunks for data export with id '3aa3f882-c25b-4b93-b079-09fe5a144be2'
(6, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
3 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
5 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
3.5 Offset, Limit, and Distinct¶
Lastly, we can specify the input arguments `distinct`, `offset`, and `limit` of the PQL base object. This can be done either during its instantiation via the command `query = PQL(distinct=..., limit=..., offset=...)` or afterwards by modifying the PQL object's properties via the command `query.<property> = <value>`.
Let us specify a value for the `limit` property of the PQL query, which limits the result table to a certain number of rows. By exporting the resulting dataframe, we can see that our result table has been reduced to only include 3 rows:
query.limit = 3
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2023-01-19 10:22:55,743] INFO: Successfully created data export with id '014db170-bdca-4880-99bf-e0730a321619'
[2023-01-19 10:22:55,744] INFO: Wait for execution of data export with id '014db170-bdca-4880-99bf-e0730a321619'
[2023-01-19 10:22:55,761] INFO: Export result chunks for data export with id '014db170-bdca-4880-99bf-e0730a321619'
(3, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
Next, we can specify an `offset`, meaning that a certain number of rows will be skipped. By exporting the resulting dataframe, we can see that the result table still contains 3 rows, but this time the first 3 rows are skipped and only the last 3 rows are returned:
query.offset = 3
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2023-01-19 10:22:55,788] INFO: Successfully created data export with id 'f12d5b7a-a1c9-43ac-87a5-5fa0bda2eb83'
[2023-01-19 10:22:55,789] INFO: Wait for execution of data export with id 'f12d5b7a-a1c9-43ac-87a5-5fa0bda2eb83'
[2023-01-19 10:22:55,806] INFO: Export result chunks for data export with id 'f12d5b7a-a1c9-43ac-87a5-5fa0bda2eb83'
(3, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
1 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
2 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
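The `distinct` flag works the same way. A minimal sketch, reusing the query built above (deduplication has no visible effect here, since the rows are already unique; it is shown only to illustrate the property):

```python
# Remove duplicate rows from the result (note the performance caveat above).
query.distinct = True
result_df = data_model.export_data_frame(query)
```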
4. Advanced PQL Queries¶
While this tutorial covered only basic usage of PQL to retrieve data from the EMS, the four PyCelonis objects `PQL`, `PQLColumn`, `PQLFilter`, and `OrderByColumn` can also be used to build arbitrarily complex PQL queries. For instance, it is possible to use:
- Data flow functions, such as `CASE WHEN` or `REMAP_VALUES`
- Aggregations, such as `COUNT`, `AVG`, or `STDEV`
- Process-specific functions, such as `PROCESS EQUALS`, `CALC_REWORK`, or `VARIANT`
- Process-index functions, such as `INDEX_ACTIVITY_ORDER` or `INDEX_ACTIVITY_LOOP`
- PU-functions, such as `PU_COUNT` or `PU_FIRST`
For an overview of how to create more advanced PQL queries, refer to the following documentation. A minimal example is sketched below.
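As an illustration, here is a minimal sketch that combines a `CASE WHEN` data flow function with a `COUNT` aggregation on the `ACTIVITIES` table used above (the alias names `ACTIVITY_GROUP` and `EVENT_COUNT` are made up for this example):

```python
from pycelonis.pql import PQL, PQLColumn

# PQL groups implicitly by the non-aggregated (dimension) column, so this
# query counts events per activity bucket.
advanced_query = PQL()
advanced_query += PQLColumn(
    name="ACTIVITY_GROUP",
    query=""" CASE WHEN "ACTIVITIES"."ACTIVITY_EN" = 'Book Invoice'
              THEN 'Invoicing' ELSE 'Other' END """,
)
advanced_query += PQLColumn(
    name="EVENT_COUNT",
    query=""" COUNT("ACTIVITIES"."ACTIVITY_EN") """,
)
grouped_df = data_model.export_data_frame(advanced_query)
```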
Conclusion¶
Congratulations! You have learned how to create PQL queries in PyCelonis in order to export data from the EMS. In the next tutorial Data Model - Advanced, we will cover more advanced topics in data models, such as foreign keys, process configurations, name mappings, and different reload modes.