Data Export¶
In this tutorial, you will learn how to export data from the Celonis EMS into your local Python project. This allows you to perform analyses on your Celonis data using tools such as Pandas, or to run machine learning algorithms on it. More specifically, you will learn:
- Where to find the data that can be exported
- How to define which data to retrieve using PQL
- How to export the data as a Pandas dataframe
Prerequisites¶
To follow this tutorial, you should have created a data pool as well as a data model and should have uploaded data into them. As we continue working with the SAP Purchase-to-Pay (P2P) tables from the Data Upload tutorial, it is recommended to complete the Data Push tutorial first.
Tutorial¶
1. Import PyCelonis and connect to Celonis API¶
from pycelonis import get_celonis
celonis = get_celonis(permissions=False)
[2022-11-28 14:08:52,772] INFO: No `base_url` given. Using environment variable 'CELONIS_URL'
[2022-11-28 14:08:52,773] INFO: No `api_token` given. Using environment variable 'CELONIS_API_TOKEN'
[2022-11-28 14:08:52,855] WARNING: KeyType is not set. Defaulted to 'APP_KEY'.
[2022-11-28 14:08:52,857] INFO: Initial connect successful! PyCelonis Version: 2.0.0
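If you prefer not to rely on environment variables, the connection details can also be passed explicitly. A minimal sketch (the URL and token below are placeholders, not real credentials):

```python
from pycelonis import get_celonis

# Placeholder values -- replace with your own team URL and API token.
celonis = get_celonis(
    base_url="https://my-team.celonis.cloud",
    api_token="<your-api-token>",
    permissions=False,
)
```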
2. Find the data to be exported from the EMS¶
The first step in exporting data from the EMS is to find the location where the data is stored, i.e. the corresponding data pool and data model.
data_pool = celonis.data_integration.get_data_pools().find("PyCelonis Tutorial Data Pool")
data_pool
DataPool(id='4fb45b3c-f80a-4c23-9bc7-6c5ee3e2b4de', name='PyCelonis Tutorial Data Pool')
data_model = data_pool.get_data_models().find("PyCelonis Tutorial Data Model")
data_model
DataModel(id='30bb734f-edbd-4fa2-b90d-dfbe1e6168e9', name='PyCelonis Tutorial Data Model', pool_id='4fb45b3c-f80a-4c23-9bc7-6c5ee3e2b4de')
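If you are unsure of the exact names, you can also list all data pools and data models you have access to first. A minimal sketch, assuming the collections returned by PyCelonis can be iterated directly:

```python
# Print the names of all available data pools and their data models.
for pool in celonis.data_integration.get_data_pools():
    print(pool.name)
    for model in pool.get_data_models():
        print("  ", model.name)
```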
Important:
During the data export, data is retrieved from the data model, not from the data pool. This differs from the data push, where the data is first inserted into the data pool and then loaded into a data model. This design choice was made because data models are specifically built to support fast querying of process data via Celonis' custom Process Query Language (PQL).
3. Define PQL query and export result as Pandas dataframe¶
Data from the EMS is retrieved via Celonis' custom querying language PQL. Hence, in order to export data, we first need to specify a PQL query that defines which data to retrieve; the resulting table can then be exported as a Pandas dataframe.
PyCelonis supports four types of objects to construct such a PQL query:

PQL Object | Description |
---|---|
`PQL` | Base object, which is used to add elements, such as columns and filters, to the query. |
`PQLColumn` | Specifies a column of a data model table in the format `"table"."column"` that should be retrieved. It is possible to add columns from different tables to the query. These columns will be aggregated into a single table using PQL's implicit join functionality. However, for this, we first need to define relationships between the data model tables using foreign keys. For a guide on how to do this, refer to the Data Model - Advanced tutorial. |
`PQLFilter` | Specifies a filter that should be applied to the resulting table. It is possible to add multiple filters. These filters will be connected via AND logic. |
`OrderByColumn` | Specifies the column according to which to sort the resulting table. It is possible to add multiple order-columns. In this case, the first added order-column is the most important, the second added order-column the second-most important, and so on. |
All these objects can be imported from `pycelonis.pql.pql`:
from pycelonis.pql import PQL, PQLColumn, PQLFilter, OrderByColumn
3.1 PQL Base Object¶
The first step in creating a PQL query is to instantiate the PQL base object, which takes as input arguments:
Name | Type | Description | Default |
---|---|---|---|
`distinct` | `bool` | If true, all duplicate rows are removed. Note: It is recommended to enable this functionality only if really needed, as data deduplication significantly reduces query performance! | `False` |
`limit` | `int` | Limits the result to a certain number of rows | `None` |
`offset` | `int` | Skips a certain number of rows of the result | `None` |
query = PQL(distinct=False, limit=None, offset=None)
This object will create an empty PQL query in the following format:
<PQLFilters>
TABLE( [DISTINCT]
<PQLColumns>
)
ORDER BY <OrderByColumns>
LIMIT ...
OFFSET ...
By printing the query, we can see that it is currently empty:
query
PQL(columns=[], filters=[], order_by_columns=[], distinct=False, limit=None, offset=None)
We can now add more elements to it in order to fill the placeholders for `<PQLFilters>`, `<PQLColumns>`, and `<OrderByColumns>`. New elements can be added either via `query.add(<element>)` or via `query += <element>`.
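For instance, the two forms below are equivalent ways of appending elements. A minimal sketch using a throwaway `demo_query` object, so the `query` built in this tutorial stays untouched:

```python
from pycelonis.pql import PQL, PQLColumn

demo_query = PQL()
# Both statements append an element to the query:
demo_query.add(PQLColumn(name="_CASE_KEY", query=""" "ACTIVITIES"."_CASE_KEY" """))
demo_query += PQLColumn(name="ACTIVITY_EN", query=""" "ACTIVITIES"."ACTIVITY_EN" """)
```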
3.2 PQLColumn Objects¶
First, we need to specify which columns to retrieve from which tables. As we haven't defined foreign key relationships in the data model yet, we cannot query across multiple tables. Hence, we only query columns from the activity table here.

Columns can be added to the result table by adding a `PQLColumn` object to the query, which takes as input arguments:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | Alias for the column | Required |
`query` | `str` | PQL query that specifies which column from which table to retrieve in the format `"Table"."Column"`. Note: It is recommended to use triple quotes `""" """` to avoid escaping the quote characters when defining the query. | Required |
query += PQLColumn(name="_CASE_KEY", query=""" "ACTIVITIES"."_CASE_KEY" """)
query += PQLColumn(name="ACTIVITY_EN", query=""" "ACTIVITIES"."ACTIVITY_EN" """)
query += PQLColumn(name="EVENTTIME", query=""" "ACTIVITIES"."EVENTTIME" """)
query += PQLColumn(name="_SORTING", query=""" "ACTIVITIES"."_SORTING" """)
This object will create a PQL query in the format `"Table"."Column" AS name`. By printing the query, we can see that the columns have been added to the query object:
query
PQL(columns=[ PQLColumn(name='_CASE_KEY', query=' "ACTIVITIES"."_CASE_KEY" '), PQLColumn(name='ACTIVITY_EN', query=' "ACTIVITIES"."ACTIVITY_EN" '), PQLColumn(name='EVENTTIME', query=' "ACTIVITIES"."EVENTTIME" '), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ') ], filters=[], order_by_columns=[], distinct=False, limit=None, offset=None)
We can now export the table resulting from our PQL query as a Pandas dataframe by calling the `export_data_frame()` method and passing the `query` as input argument:
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df.head()
[2022-11-28 14:08:52,959] INFO: Successfully created data export with id '09ef7089-389b-4254-97d8-dbcf83ff88a1'
[2022-11-28 14:08:52,960] INFO: Wait for execution of data export with id '09ef7089-389b-4254-97d8-dbcf83ff88a1'
[2022-11-28 14:08:52,996] INFO: Export result chunks for data export with id '09ef7089-389b-4254-97d8-dbcf83ff88a1'
(60, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
1 | 800000000006800002 | Book Invoice | 2009-01-28 07:56:04 | NaN |
2 | 800000000006800002 | Create Purchase Requisition Item | 2008-12-31 07:56:04 | 0.0 |
3 | 800000000006800003 | Create Purchase Requisition Item | 2008-12-31 07:56:04 | 0.0 |
4 | 800000000006800004 | Create Purchase Requisition Item | 2008-12-31 07:56:04 | 0.0 |
We can see that the resulting dataframe contains only the columns `_CASE_KEY`, `ACTIVITY_EN`, `EVENTTIME`, and `_SORTING` specified in our PQL query.
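Once the data is available as a Pandas dataframe, all the usual analysis tools apply. For example, a quick way to count how often each activity occurs in the exported `result_df`:

```python
# Count the occurrences of each activity in the exported dataframe.
activity_counts = result_df["ACTIVITY_EN"].value_counts()
print(activity_counts)
```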
Note: In new teams, data export is not enabled by default. To enable it, contact Celonis customer support.
3.3 PQLFilter Objects¶
Next, we can specify whether the rows of the result table should be filtered. Filters can be applied by adding a `PQLFilter` object to the query, which takes as input arguments:
Name | Type | Description | Default |
---|---|---|---|
`query` | `str` | PQL query to filter the data. Must be in the format: `FILTER <condition>;` | Required |
Let's suppose we want to filter the activity table only for a certain case (i.e. purchase order item):
query += PQLFilter(query=""" FILTER "ACTIVITIES"."_CASE_KEY" = '800000000006800001'; """)
By printing the query, we can see that the filter has been added to the query object:
query
PQL(columns=[ PQLColumn(name='_CASE_KEY', query=' "ACTIVITIES"."_CASE_KEY" '), PQLColumn(name='ACTIVITY_EN', query=' "ACTIVITIES"."ACTIVITY_EN" '), PQLColumn(name='EVENTTIME', query=' "ACTIVITIES"."EVENTTIME" '), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ') ], filters=[ PQLFilter(query=' FILTER "ACTIVITIES"."_CASE_KEY" = \'800000000006800001\'; ') ], order_by_columns=[], distinct=False, limit=None, offset=None)
By exporting the table as a Pandas dataframe, we can see that the resulting table has been filtered to only include the purchase order item specified in our PQL query:
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2022-11-28 14:08:53,054] INFO: Successfully created data export with id '74942a3f-72bc-4757-9df5-49876cace32a'
[2022-11-28 14:08:53,055] INFO: Wait for execution of data export with id '74942a3f-72bc-4757-9df5-49876cace32a'
[2022-11-28 14:08:53,085] INFO: Export result chunks for data export with id '74942a3f-72bc-4757-9df5-49876cace32a'
(6, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
1 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
2 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
3 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
4 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
5 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
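As noted above, multiple filters are connected via AND logic. A minimal sketch with two conditions, using a separate `query_multi` object so the tutorial's query stays unchanged:

```python
# Both filter conditions must hold (AND logic).
query_multi = PQL()
query_multi += PQLColumn(name="ACTIVITY_EN", query=""" "ACTIVITIES"."ACTIVITY_EN" """)
query_multi += PQLFilter(query=""" FILTER "ACTIVITIES"."_CASE_KEY" = '800000000006800001'; """)
query_multi += PQLFilter(query=""" FILTER "ACTIVITIES"."ACTIVITY_EN" = 'Book Invoice'; """)
```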
3.4 OrderByColumn Objects¶
Next, we can specify whether the result table should be sorted according to one or more columns. For this, we need to add the `OrderByColumn` object to the query, which takes as input arguments:
Name | Type | Description | Default |
---|---|---|---|
`query` | `str` | PQL query that specifies the column according to which to sort in the format `"Table"."Column"` | Required |
`ascending` | `bool` | If true, results are sorted in ascending order. If false, results are sorted in descending order. | `True` |
Let's suppose we want to sort the result according to the timestamp and sorting-column of the activity table:
query += OrderByColumn(query=""" "ACTIVITIES"."EVENTTIME" """)
query += OrderByColumn(query=""" "ACTIVITIES"."_SORTING" """)
By printing the query, we can see that the two order-columns have been added to the query:
query
PQL(columns=[ PQLColumn(name='_CASE_KEY', query=' "ACTIVITIES"."_CASE_KEY" '), PQLColumn(name='ACTIVITY_EN', query=' "ACTIVITIES"."ACTIVITY_EN" '), PQLColumn(name='EVENTTIME', query=' "ACTIVITIES"."EVENTTIME" '), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ') ], filters=[ PQLFilter(query=' FILTER "ACTIVITIES"."_CASE_KEY" = \'800000000006800001\'; ') ], order_by_columns=[ OrderByColumn(query=' "ACTIVITIES"."EVENTTIME" ', ascending=True), OrderByColumn(query=' "ACTIVITIES"."_SORTING" ', ascending=True) ], distinct=False, limit=None, offset=None)
By exporting the table as a Pandas dataframe, we can see that the resulting table is now sorted according to the two columns `EVENTTIME` and `_SORTING` specified in our PQL query:
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2022-11-28 14:08:53,126] INFO: Successfully created data export with id '3aa3f882-c25b-4b93-b079-09fe5a144be2'
[2022-11-28 14:08:53,127] INFO: Wait for execution of data export with id '3aa3f882-c25b-4b93-b079-09fe5a144be2'
[2022-11-28 14:08:53,157] INFO: Export result chunks for data export with id '3aa3f882-c25b-4b93-b079-09fe5a144be2'
(6, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
3 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
5 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
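To sort in descending order instead, set `ascending=False`. A minimal sketch on a fresh query object:

```python
# Sort the newest events first by setting ascending=False.
query_desc = PQL()
query_desc += PQLColumn(name="EVENTTIME", query=""" "ACTIVITIES"."EVENTTIME" """)
query_desc += OrderByColumn(query=""" "ACTIVITIES"."EVENTTIME" """, ascending=False)
```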
3.5 Offset, Limit, and Distinct¶
Lastly, we can specify the input arguments `distinct`, `offset`, and `limit` of the PQL base object. This can be done either during its instantiation via the command `query = PQL(distinct=..., limit=..., offset=...)` or afterwards by modifying the PQL object's data properties via the command `query.<property> = <value>`.
Let us specify a value for the `limit` property of the PQL query, which limits the result table to a certain number of rows. By exporting the resulting dataframe, we can see that our result table has been reduced to only include 3 rows:
query.limit = 3
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2022-11-28 14:08:53,186] INFO: Successfully created data export with id '014db170-bdca-4880-99bf-e0730a321619'
[2022-11-28 14:08:53,187] INFO: Wait for execution of data export with id '014db170-bdca-4880-99bf-e0730a321619'
[2022-11-28 14:08:53,219] INFO: Export result chunks for data export with id '014db170-bdca-4880-99bf-e0730a321619'
(3, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
Next, we can specify an `offset`, meaning that a certain number of rows will be skipped. By exporting the resulting dataframe, we can see that the result table still contains 3 rows, but this time the first 3 rows are skipped and only the last 3 rows are returned:
query.offset = 3
result_df = data_model.export_data_frame(query)
print(result_df.shape)
result_df
[2022-11-28 14:08:53,248] INFO: Successfully created data export with id 'f12d5b7a-a1c9-43ac-87a5-5fa0bda2eb83'
[2022-11-28 14:08:53,249] INFO: Wait for execution of data export with id 'f12d5b7a-a1c9-43ac-87a5-5fa0bda2eb83'
[2022-11-28 14:08:53,280] INFO: Export result chunks for data export with id 'f12d5b7a-a1c9-43ac-87a5-5fa0bda2eb83'
(3, 4)
 | _CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING |
---|---|---|---|---|
0 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
1 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
2 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
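Together, `limit` and `offset` can be used to page through a large table in fixed-size chunks. A minimal sketch, assuming the `query` and `data_model` objects from above (the chunk size of 1000 is an arbitrary example value):

```python
# Export the query result in chunks of 1000 rows.
chunk_size = 1000
offset = 0
chunks = []
while True:
    query.limit = chunk_size
    query.offset = offset
    chunk_df = data_model.export_data_frame(query)
    if chunk_df.empty:
        break
    chunks.append(chunk_df)
    offset += chunk_size
```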
4. Advanced PQL Queries¶
While this tutorial covered only basic usage of PQL to retrieve data from the EMS, the four PyCelonis objects `PQL`, `PQLColumn`, `PQLFilter`, and `OrderByColumn` can also be used to build arbitrarily complex PQL queries (see the aggregation sketch after this list). For instance, it is possible to use:
- Data flow functions, such as `CASE WHEN` or `REMAP_VALUES`
- Aggregations, such as `COUNT`, `AVG`, or `STDEV`
- Process-specific functions, such as `PROCESS EQUALS`, `CALC_REWORK`, or `VARIANT`
- Process-index functions, such as `INDEX_ACTIVITY_ORDER` or `INDEX_ACTIVITY_LOOP`
- PU-functions, such as `PU_COUNT` or `PU_FIRST`
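As one example, an aggregation can be combined with a regular column, in which case PQL implicitly groups the aggregation by the other selected columns. A minimal sketch counting events per case, assuming the ACTIVITIES table and `data_model` from above:

```python
# COUNT is implicitly grouped by the dimension column _CASE_KEY.
adv_query = PQL()
adv_query += PQLColumn(name="CASE_KEY", query=""" "ACTIVITIES"."_CASE_KEY" """)
adv_query += PQLColumn(name="EVENT_COUNT", query=""" COUNT("ACTIVITIES"."ACTIVITY_EN") """)
events_per_case = data_model.export_data_frame(adv_query)
```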
For an overview of how to create more advanced PQL queries, refer to the PQL documentation.
Conclusion¶
Congratulations! You have learned how to create PQL queries in PyCelonis in order to export data from the EMS. In the next tutorial Data Model - Advanced, we will cover more advanced topics in data models, such as foreign keys, process configurations, name mappings, and different reload modes.