Data Export¶
In this tutorial, you will learn how to export data from the Celonis EMS into your local Python project. This allows you to perform analyses on your Celonis data using tools, such as Pandas, or run machine learning algorithms on it. More specifically, you will learn:
- Where to find the data that can be exported
- How to define which data to retrieve using PQL
- How to export the data as a Pandas dataframe
Prerequisites¶
To follow this tutorial, you should have created a data pool as well as a data model and should have uploaded data into it. As we continue working with the SAP Purchase-to-Pay (P2P) tables from the Data Upload tutorial, it is recommended to complete the Data Push tutorial first before embarking on this tutorial.
Tutorial¶
1. Import PyCelonis and connect to Celonis API¶
from pycelonis import get_celonis
celonis = get_celonis(permissions=False)
[2024-09-05 14:14:40,467] INFO: No `base_url` given. Using environment variable 'CELONIS_URL'
[2024-09-05 14:14:40,468] INFO: No `api_token` given. Using environment variable 'CELONIS_API_TOKEN'
[2024-09-05 14:14:40,505] WARNING: KeyType is not set. Defaulted to 'APP_KEY'.
[2024-09-05 14:14:40,506] INFO: Initial connect successful! PyCelonis Version: 2.10.2
No access key provided. Please set the CELONIS_MLOPS_ACCESS_KEY environment variable.
2. Find the data model from which data will be exported¶
The first step in exporting data from the EMS is to find the location where the data is stored, i.e. the corresponding data pool and data model.
data_pool = celonis.data_integration.get_data_pools().find("PyCelonis Tutorial Data Pool")
data_pool
DataPool(id='b76b688a-d5f3-4fd4-9b03-41df01281bc3', name='PyCelonis Tutorial Data Pool')
data_model = data_pool.get_data_models().find("PyCelonis Tutorial Data Model")
data_model
DataModel(id='bca917ef-6777-4742-8eb9-013ee804d3e5', name='PyCelonis Tutorial Data Model', pool_id='b76b688a-d5f3-4fd4-9b03-41df01281bc3')
Important:
In the data export, data is retrieved from the data model and not from the data pool. This is different from the data push, where the data is first inserted into the data pool and then loaded into a data model. This design choice has been implemented, as data models are specifically designed to support fast querying of process data via Celonis' custom Process Query Language (PQL).
3. Define PQL query and export result as Pandas dataframe¶
Data from the EMS is retrieved via Celonis' custom querying language PQL. Hence, in order to export data, we first need to specify a PQL query that defines which data to retrieve and can then export the resulting table as a Pandas dataframe. For this we will be using the SaolaPy Series and DataFrame implementation. For more information on how to use SaolaPy visit the SaolaPy Tutorial.
import pycelonis.pql as pql
First, we will specify the default data model to use with SaolaPy:
from pycelonis.config import Config
Config.DEFAULT_DATA_MODEL = data_model
3.1 Selecting columns¶
activities = data_model.get_tables().find("ACTIVITIES")
activity_columns = activities.get_columns()
Then, we will create a DataFrame representing an OLAP table by specifying the columns:
df = pql.DataFrame(
{
"_CASE_KEY": activity_columns.find("_CASE_KEY"),
"ACTIVITY_EN": activity_columns.find("ACTIVITY_EN"),
"EVENTTIME": activity_columns.find("EVENTTIME"),
# It is also possible to write a PQL query string directly:
"_SORTING": """ "ACTIVITIES"."_SORTING" """,
}
)
We can take a look at the generated PQL query by executing:
df.query
PQL(columns=[PQLColumn(name='Index', query='0 - 1 + RUNNING_TOTAL(1)'), PQLColumn(name='_CASE_KEY', query='"ACTIVITIES"."_CASE_KEY"'), PQLColumn(name='ACTIVITY_EN', query='"ACTIVITIES"."ACTIVITY_EN"'), PQLColumn(name='EVENTTIME', query='"ACTIVITIES"."EVENTTIME"'), PQLColumn(name='_SORTING', query=' "ACTIVITIES"."_SORTING" ')], filters=[], order_by_columns=[], distinct=False, limit=None, offset=None)
To get the dimensions of the DataFrame simply run:
df.shape
[2024-09-05 14:14:40,638] INFO: Successfully created data export with id '999e8001-6f6b-4f2c-9899-526f48152ebc'
[2024-09-05 14:14:40,640] INFO: Wait for execution of data export with id '999e8001-6f6b-4f2c-9899-526f48152ebc'
[2024-09-05 14:14:40,647] INFO: Export result chunks for data export with id '999e8001-6f6b-4f2c-9899-526f48152ebc'
(60, 4)
Then, we can use SaolaPy to perform additional operations on the table, such as filtering, sorting, or grouping. For more information on available operations visit the SaolaPy Tutorial.
3.2 Filtering¶
To filter the data, simply apply:
df = df[df._CASE_KEY == "800000000006800001"]
By plotting the df, we can see that the filter has been added to the df object:
df
DataFrame(data={'_CASE_KEY': '"ACTIVITIES"."_CASE_KEY"', 'ACTIVITY_EN': '"ACTIVITIES"."ACTIVITY_EN"', 'EVENTTIME': '"ACTIVITIES"."EVENTTIME"', '_SORTING': ' "ACTIVITIES"."_SORTING" '}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "ACTIVITIES"."_CASE_KEY" = \'8000000000...'], order_by_columns=[])
Also, the shape now reflects the added filter:
df.shape
[2024-09-05 14:14:40,765] INFO: Successfully created data export with id '7aa7ddd0-7da0-46a6-8c65-5d7231aaa317'
[2024-09-05 14:14:40,766] INFO: Wait for execution of data export with id '7aa7ddd0-7da0-46a6-8c65-5d7231aaa317'
[2024-09-05 14:14:40,771] INFO: Export result chunks for data export with id '7aa7ddd0-7da0-46a6-8c65-5d7231aaa317'
(6, 4)
3.3 Sorting the data¶
To sort the data, simply use the sort_values
function:
df = df.sort_values(by=["EVENTTIME", "_SORTING"])
By plotting the df, we can see that the two order-columns have been added to the df:
df
DataFrame(data={'_CASE_KEY': '"ACTIVITIES"."_CASE_KEY"', 'ACTIVITY_EN': '"ACTIVITIES"."ACTIVITY_EN"', 'EVENTTIME': '"ACTIVITIES"."EVENTTIME"', '_SORTING': ' "ACTIVITIES"."_SORTING" '}, index=RangeIndex(name='Index', start=0, step=1), filters=['FILTER ( "ACTIVITIES"."_CASE_KEY" = \'8000000000...'], order_by_columns=['"ACTIVITIES"."EVENTTIME" ASC', ' "ACTIVITIES"."_SORTING" ASC'])
df.head()
[2024-09-05 14:14:40,807] INFO: Successfully created data export with id 'af6bdd3d-9e82-4fba-98b7-773b6c484c08'
[2024-09-05 14:14:40,808] INFO: Wait for execution of data export with id 'af6bdd3d-9e82-4fba-98b7-773b6c484c08'
[2024-09-05 14:14:40,814] INFO: Export result chunks for data export with id 'af6bdd3d-9e82-4fba-98b7-773b6c484c08'
_CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING | |
---|---|---|---|---|
Index | ||||
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 0.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 10.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
3 | 800000000006800001 | Receive Goods | 2009-01-12 07:44:05 | 30.0 |
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
3.4 SaolaPy Operations¶
We can then apply further SaolaPy transformations, for example arithmetic and string operations:
df._SORTING = df._SORTING + 5
df.ACTIVITY_EN = df.ACTIVITY_EN.str.replace("Receive Goods", "Goods Received")
3.4 Data Export¶
Finally, we can export the data as a Pandas dataframe to get our final result table. This step should be done only after all aggregations and filters are applied to ensure minimal memory consumption and to execute most computations inside the PQL engine instead of in Python:
pandas_df = df.to_pandas()
pandas_df
[2024-09-05 14:14:40,859] INFO: Successfully created data export with id '822002d5-0cec-4ad3-85ba-1f75bc768345'
[2024-09-05 14:14:40,861] INFO: Wait for execution of data export with id '822002d5-0cec-4ad3-85ba-1f75bc768345'
[2024-09-05 14:14:40,867] INFO: Export result chunks for data export with id '822002d5-0cec-4ad3-85ba-1f75bc768345'
_CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING | |
---|---|---|---|---|
Index | ||||
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 5.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 15.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
3 | 800000000006800001 | Goods Received | 2009-01-12 07:44:05 | 35.0 |
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
5 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
We can also set the parameters distinct
, offset
, and limit
of the PQL base object:
Let us specify a value for the limit
property. By exporting the resulting dataframe, we can see that our result table has been reduced to only include 3 rows:
df.to_pandas(limit=3)
[2024-09-05 14:14:40,894] INFO: Successfully created data export with id '78056f4e-1eef-4483-90b5-56439b39385a'
[2024-09-05 14:14:40,896] INFO: Wait for execution of data export with id '78056f4e-1eef-4483-90b5-56439b39385a'
[2024-09-05 14:14:40,902] INFO: Export result chunks for data export with id '78056f4e-1eef-4483-90b5-56439b39385a'
_CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING | |
---|---|---|---|---|
Index | ||||
0 | 800000000006800001 | Create Purchase Requisition Item | 2008-12-31 07:44:05 | 5.0 |
1 | 800000000006800001 | Create Purchase Order Item | 2009-01-02 07:44:05 | 15.0 |
2 | 800000000006800001 | Print and Send Purchase Order | 2009-01-05 07:44:05 | NaN |
Next, we can specify an offset
, meaning that a certain number of rows will be skipped. By exporting the resulting dataframe, we can see that the result table still contains 3 rows but this time the first 3 rows are skipped and only the last 3 rows are returned:
df.to_pandas(limit=3, offset=3)
[2024-09-05 14:14:40,932] INFO: Successfully created data export with id 'eeef750d-751b-494f-9ec2-e0f397c8feb9'
[2024-09-05 14:14:40,933] INFO: Wait for execution of data export with id 'eeef750d-751b-494f-9ec2-e0f397c8feb9'
[2024-09-05 14:14:40,940] INFO: Export result chunks for data export with id 'eeef750d-751b-494f-9ec2-e0f397c8feb9'
_CASE_KEY | ACTIVITY_EN | EVENTTIME | _SORTING | |
---|---|---|---|---|
Index | ||||
3 | 800000000006800001 | Goods Received | 2009-01-12 07:44:05 | 35.0 |
4 | 800000000006800001 | Scan Invoice | 2009-01-20 07:44:05 | NaN |
5 | 800000000006800001 | Book Invoice | 2009-01-30 07:44:05 | NaN |
Conclusion¶
Congratulations! You have learned how to create PQL queries in PyCelonis in order to export data from the EMS using SaolaPy. In the next tutorial Data Model - Advanced, we will cover more advanced topics in data models, such as foreign keys, process configurations, name mappings, and different reload modes.