Data Upload & Export Advanced¶

In this tutorial, you will dive deeper into more advanced topics of uploading data into the EMS and exporting data into your local project. More specifically, you will learn:

How to upload tables with a custom column configuration into the EMS
How to upload tables as Parquet files
How to perform a data upload with chunking
How to export data via manual data exports

Prerequisites¶

To follow this tutorial, you should have created a data model and should have uploaded data into it. As we continue working with the SAP Purchase-to-Pay (P2P) tables from the Data Push tutorial, it is recommended to complete the Data Push tutorial first. Further, it is recommended to complete the Data Export tutorial to have a basic understanding how data is retrieved from a data model via PQL.

Tutorial¶

1. Import PyCelonis and connect to Celonis API¶

In [3]:

Copied!

from pycelonis import get_celonis
celonis = get_celonis(permissions=False)
from pycelonis import get_celonis
celonis = get_celonis(permissions=False)

[2024-11-12 15:05:05,131] INFO: No `base_url` given. Using environment variable 'CELONIS_URL'

[2024-11-12 15:05:05,132] INFO: No `api_token` given. Using environment variable 'CELONIS_API_TOKEN'

[2024-11-12 15:05:05,133] INFO: No `key_type` given. Using environment variable 'CELONIS_KEY_TYPE'

[2024-11-12 15:05:05,169] INFO: Initial connect successful! PyCelonis Version: 2.11.1

2. Find data pool and model to use for data upload and export¶

In [4]:

Copied!

data_pool = celonis.data_integration.get_data_pools().find("PyCelonis Tutorial Data Pool")
data_pool
data_pool = celonis.data_integration.get_data_pools().find("PyCelonis Tutorial Data Pool")
data_pool

Out[4]:

DataPool(id='be065bab-bc94-4f2e-81d3-f4df1e126e2a', name='PyCelonis Tutorial Data Pool')

In [5]:

Copied!

data_model = data_pool.get_data_models().find("PyCelonis Tutorial Data Model")
data_model
data_model = data_pool.get_data_models().find("PyCelonis Tutorial Data Model")
data_model

Out[5]:

DataModel(id='68682a56-5bc4-4bfb-be4e-2e588335549c', name='PyCelonis Tutorial Data Model', pool_id='be065bab-bc94-4f2e-81d3-f4df1e126e2a')

In [6]:

Copied!

activities_columns = data_model.get_tables().find("ACTIVITIES").get_columns()
activities_columns = data_model.get_tables().find("ACTIVITIES").get_columns()

3. Advanced Data Upload¶

3.1 Upload table with custom column configuration¶

When uploading data into the EMS via create_table(), we have, besides the standard input arguments df, table_name, and drop_if_exists, the option to give a column_config as input parameter, which specifies properties, such as name and data type, for each column of the table to be pushed into the EMS. Column configurations are used to make properties of tables explicit and to enforce a consistent data push (e.g. when replacing tables), which is especially crucial in production environments. Please note that column configurations do not support data type conversions (eg. Datetime to String) and can only be used to set the field length for String columns. We suggest using the Pandas library for any dtype conversion.

A column_config is essentially a list of ColumnTransport objects, where each object specifies the properties of a single column. To create these objects, we have to import ColumnTansport and ColumnType from PyCelonis:

In [7]:

Copied!

from pycelonis.ems import ColumnTransport, ColumnType
from pycelonis.ems import ColumnTransport, ColumnType

A ColumnTransport object now takes the following input arguments:

Name	Type	Description	Default
`column_name`	`str`	Name of the column	`None`
`column_type`	`ColumnType`	Data type of the column in the format: `ColumnType.<Type>` (supported data types: `INTEGER`, `DATE`, `TIME`, `DATETIME`, `FLOAT`, `BOOLEAN`, `STRING`)	`None`
`field_length`	`int`	Max. number of characters per field (Required for the data type `STRING`). By default, the max. number of characters is 80. Hence, when working with longer-text columns, they may get cut off when the table is uploaded into the EMS. To avoid this, we can specify a greater `field_length`. Note: The number specified in `field_length` is multiplied by 4 to obtain the max. number of characters.	`None`

Let's start by pushing a new activity table with a custom column configuration into the EMS. For this, we import our P2P activity table:

In [8]:

Copied!





import pandas as pd
activity_df = pd.read_parquet("../../../assets/_CEL_P2P_ACTIVITIES_EN.parquet", engine="pyarrow")
print(activity_df.shape)
activity_df.head()
import pandas as pd
activity_df = pd.read_parquet("../../../assets/_CEL_P2P_ACTIVITIES_EN.parquet", engine="pyarrow")
print(activity_df.shape)
activity_df.head()

(60, 13)

Out[8]:

	_CASE_KEY	ACTIVITY_EN	ACTIVITY_DE	EVENTTIME	_SORTING	USER_TYPE	CHANGED_TABLE	CHANGED_FIELD	CHANGED_FROM	CHANGED_TO	CHANGED_FROM_FLOAT	CHANGED_TO_FLOAT	CHANGE_NUMBER
0	800000000006800001	Create Purchase Requisition Item	Lege BANF Position an	2008-12-31 07:44:05	0.0	B	None	None	None	None	NaN	NaN	None
1	800000000006800001	Create Purchase Order Item	Lege Bestellposition an	2009-01-02 07:44:05	10.0	B	None	None	None	None	NaN	NaN	None
2	800000000006800001	Print and Send Purchase Order	Sende Bestellung	2009-01-05 07:44:05	NaN	B	None	None	None	None	NaN	NaN	None
3	800000000006800001	Receive Goods	Wareneingang	2009-01-12 07:44:05	30.0	A	None	None	None	None	NaN	NaN	None
4	800000000006800001	Scan Invoice	Scanne Rechnung	2009-01-20 07:44:05	NaN	A	None	None	None	None	NaN	NaN	None

We now create a column_config by creating for each column of the activity table a ColumnTransport object and wrapping them in a list. Hereby, the input arguments depend on the data type of the column:

In [9]:

Copied!





column_config = [
    ColumnTransport(column_name="_CASE_KEY", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="ACTIVITY_EN", column_type=ColumnType.STRING, field_length=100),
    ColumnTransport(column_name="ACTIVITY_DE", column_type=ColumnType.STRING, field_length=100),
    ColumnTransport(column_name="EVENTTIME", column_type=ColumnType.DATETIME),
    ColumnTransport(column_name="_SORTING", column_type=ColumnType.FLOAT),
    ColumnTransport(column_name="USER_TYPE", column_type=ColumnType.STRING, field_length=1),
    ColumnTransport(column_name="CHANGED_TABLE", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_FIELD", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_FROM", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_TO", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_FROM_FLOAT", column_type=ColumnType.FLOAT),
    ColumnTransport(column_name="CHANGED_TO_FLOAT", column_type=ColumnType.FLOAT),
    ColumnTransport(column_name="CHANGE_NUMBER", column_type=ColumnType.STRING, field_length=20)
]
column_config = [
    ColumnTransport(column_name="_CASE_KEY", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="ACTIVITY_EN", column_type=ColumnType.STRING, field_length=100),
    ColumnTransport(column_name="ACTIVITY_DE", column_type=ColumnType.STRING, field_length=100),
    ColumnTransport(column_name="EVENTTIME", column_type=ColumnType.DATETIME),
    ColumnTransport(column_name="_SORTING", column_type=ColumnType.FLOAT),
    ColumnTransport(column_name="USER_TYPE", column_type=ColumnType.STRING, field_length=1),
    ColumnTransport(column_name="CHANGED_TABLE", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_FIELD", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_FROM", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_TO", column_type=ColumnType.STRING, field_length=20),
    ColumnTransport(column_name="CHANGED_FROM_FLOAT", column_type=ColumnType.FLOAT),
    ColumnTransport(column_name="CHANGED_TO_FLOAT", column_type=ColumnType.FLOAT),
    ColumnTransport(column_name="CHANGE_NUMBER", column_type=ColumnType.STRING, field_length=20)
]

We can now push the activity table as dataframe along with our custom column_config:

In [10]:

Copied!

table = data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True, column_config=column_config)
table = data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True, column_config=column_config)

[2024-11-12 15:05:05,297] INFO: Successfully created data push job with id 'd860f175-7a83-49b0-b3da-8015747bd083'

[2024-11-12 15:05:05,298] INFO: Add data frame as file chunks to data push job with id 'd860f175-7a83-49b0-b3da-8015747bd083'

[2024-11-12 15:05:05,309] INFO: Successfully upserted file chunk to data push job with id 'd860f175-7a83-49b0-b3da-8015747bd083'

[2024-11-12 15:05:05,315] INFO: Successfully triggered execution for data push job with id 'd860f175-7a83-49b0-b3da-8015747bd083'

[2024-11-12 15:05:05,315] INFO: Wait for execution of data push job with id 'd860f175-7a83-49b0-b3da-8015747bd083'

[2024-11-12 15:05:05,327] INFO: Successfully created table 'ACTIVITIES_2' in data pool

[2024-11-12 15:05:05,330] INFO: Successfully deleted data push job with id 'd860f175-7a83-49b0-b3da-8015747bd083'

To verify the column configuration, we can print the column properties of a table with get_columns():

In [11]:

Copied!

table.get_columns()
table.get_columns()

[2024-11-12 15:05:05,340] WARNING: For string columns, the field length is automatically multiplied by 4. For creating new tables based on the given columns, make sure to divide the field length for each column by 4 again in the column config before pushing.

Out[11]:

[
	PoolColumn(name='_CASE_KEY', length=80, type_='STRING'),
	PoolColumn(name='ACTIVITY_EN', length=33, type_='STRING'),
	PoolColumn(name='ACTIVITY_DE', length=23, type_='STRING'),
	PoolColumn(name='EVENTTIME', length=26, type_='DATE'),
	PoolColumn(name='_SORTING', length=15, type_='FLOAT'),
	PoolColumn(name='USER_TYPE', length=4, type_='STRING'),
	PoolColumn(name='CHANGED_TABLE', length=19, type_='INTEGER'),
	PoolColumn(name='CHANGED_FIELD', length=19, type_='INTEGER'),
	PoolColumn(name='CHANGED_FROM', length=19, type_='INTEGER'),
	PoolColumn(name='CHANGED_TO', length=19, type_='INTEGER'),
	PoolColumn(name='CHANGED_FROM_FLOAT', length=15, type_='FLOAT'),
	PoolColumn(name='CHANGED_TO_FLOAT', length=15, type_='FLOAT'),
	PoolColumn(name='CHANGE_NUMBER', length=19, type_='INTEGER'),
	PoolColumn(name='_CELONIS_CHANGE_DATE', length=23, type_='DATE')
]

We can see that certain columns, such as ACTIVITY_EN now have a higher max. number of characters than the default of 80 (namely: field_length of 100 * 4 = 400).

When pushing a table with a custom column configuration into the EMS, it is not possible to replace this table with create_table(drop_if_exists=True) anymore, unless the new table also has a column_config specified. This is to ensure that productive tables, on whose columns other Celonis objects rely are not replaced:

In [12]:

raises-exception

Copied!

data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True)
data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True)

---------------------------------------------------------------------------
PyCelonisValueError                       Traceback (most recent call last)
Cell In[12], line 1
----> 1 data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True)

File ~/work/pycelonis/pycelonis/pycelonis/pycelonis/ems/data_integration/data_pool.py:381, in DataPool.create_table(self, df, table_name, drop_if_exists, column_config, chunk_size, force, data_source_id, index, **kwargs)
    379         raise PyCelonisTableAlreadyExistsError(table_name)
    380     if column_config is None and not force:
--> 381         raise PyCelonisValueError(
    382             f"Replacing table '{table_name}' without specifying column_config resets table schema to default. "
    383             f"Especially STRING columns are affected where the default data type is VARCHAR(80) which causes"
    384             f" longer strings to be cut at 80 characters. Either specify column_config or set `force=True`."
    385         )
    387 if column_config is None:
    388     logger.warning(
    389         "STRING columns are by default stored as VARCHAR(80) and therefore cut after 80 characters. You can "
    390         "specify a custom field length for each column using the `column_config` parameter."
    391     )

PyCelonisValueError: Replacing table 'ACTIVITIES_2' without specifying column_config resets table schema to default. Especially STRING columns are affected where the default data type is VARCHAR(80) which causes longer strings to be cut at 80 characters. Either specify column_config or set `force=True`.

If we want to disable this functionality and replace a table without specifying a custom column_config, we can set the input argument force=True:

In [13]:

Copied!

data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True, force=True)
data_pool.create_table(df=activity_df, table_name="ACTIVITIES_2", drop_if_exists=True, force=True)

[2024-11-12 15:05:05,652] WARNING: STRING columns are by default stored as VARCHAR(80) and therefore cut after 80 characters. You can specify a custom field length for each column using the `column_config` parameter.

[2024-11-12 15:05:05,657] INFO: Successfully created data push job with id '984ef0a1-91f1-49a1-9b89-c402327e717e'

[2024-11-12 15:05:05,658] INFO: Add data frame as file chunks to data push job with id '984ef0a1-91f1-49a1-9b89-c402327e717e'

[2024-11-12 15:05:05,667] INFO: Successfully upserted file chunk to data push job with id '984ef0a1-91f1-49a1-9b89-c402327e717e'

[2024-11-12 15:05:05,675] INFO: Successfully triggered execution for data push job with id '984ef0a1-91f1-49a1-9b89-c402327e717e'

[2024-11-12 15:05:05,676] INFO: Wait for execution of data push job with id '984ef0a1-91f1-49a1-9b89-c402327e717e'

[2024-11-12 15:05:05,692] INFO: Successfully created table 'ACTIVITIES_2' in data pool

[2024-11-12 15:05:05,697] INFO: Successfully deleted data push job with id '984ef0a1-91f1-49a1-9b89-c402327e717e'

Out[13]:

DataPoolTable(name='ACTIVITIES_2', data_source_id=None, columns=[], schema_name='be065bab-bc94-4f2e-81d3-f4df1e126e2a', data_pool_id='be065bab-bc94-4f2e-81d3-f4df1e126e2a')

3.2 Push table in specific data connection¶

To push a table within a specific data connection, simply pass the data_source_id parameter:

In [ ]:

skip-execution

Copied!

data_connection = data_pool.get_data_connection("<DATA_CONNECTION_ID>")
data_pool.create_table(df=activity_df, table_name="ACTIVITIES_IN_CONNECTION", data_source_id=data_connection.id)
data_connection = data_pool.get_data_connection("")
data_pool.create_table(df=activity_df, table_name="ACTIVITIES_IN_CONNECTION", data_source_id=data_connection.id)

3.3 Push table as Parquet file¶

Besides pushing data as Pandas dataframes into the EMS, it is also possible to push data as Parquet files. For this use case, we need to create a manual data push job.

In general, pushing data as Parquet files into the EMS is done via the following workflow:

Create a manual data push job
Add file chunks into the data push job
Execute the data push job

3.3.1 Create a manual data push job¶

A manual data push job can be created with the method create_data_push_job() inside a data pool. The method takes the following input arguments:

Name	Type	Description	Default
`target_name`	`str`	Name of data pool table, into which the data should be pushed	Required
`type_`	`JobType`	Type of data push job in format `JobType.<type>` (Supported types: `REPLACE`= create table, `DELTA` = append or upsert table)	`None`
`column_config`	`List[ColumnTransport]`	Custom column configuration to specify column names and types	`None`
`keys`	`List[str]`	List of columns that are used in an upsert operation to check for equality (not relevant for create and append)	`None`

Let's start by pushing our activity table from the Data Push tutorial once again into the EMS but this time as a Parquet file. For this, we first have to create a manual data push job. We specify JobType.REPLACE to create a new table and use the custom column configuration, which we created in the previous section:

In [14]:

Copied!





from pycelonis.ems import JobType
data_push_job = data_pool.create_data_push_job(target_name="ACTIVITIES_3",
                                               type_=JobType.REPLACE,
                                               column_config=column_config)
from pycelonis.ems import JobType
data_push_job = data_pool.create_data_push_job(target_name="ACTIVITIES_3",
                                               type_=JobType.REPLACE,
                                               column_config=column_config)

[2024-11-12 15:05:05,716] INFO: Successfully created data push job with id '86c1d0ca-d51b-4325-a841-fa4dc9a31598'

We can verify that the data push job was created inside our data pool by calling the get_data_push_jobs() method:

In [15]:

Copied!

data_pool.get_data_push_jobs()
data_pool.get_data_push_jobs()

Out[15]:

[
	DataPushJob(id='86c1d0ca-d51b-4325-a841-fa4dc9a31598', target_name='ACTIVITIES_3', status='NEW', keys=[])
]

3.3.2 Add file chunks into the data push job¶

The newly-created data push job currently does not contain any data to be pushed. In order to add data to the data push job, we need to call the add_file_chunk() method. This method takes as input argument the Parquet file as a BytesIO byte-stream. Hence, we first need to convert our file into a byte stream with open(file, "rb)" and can then call the add_file_chunk() method:

In [16]:

Copied!

with open("../../../assets/_CEL_P2P_ACTIVITIES_EN.parquet", "rb") as file:
    data_push_job.add_file_chunk(file)
with open("../../../assets/_CEL_P2P_ACTIVITIES_EN.parquet", "rb") as file:
    data_push_job.add_file_chunk(file)

[2024-11-12 15:05:05,740] INFO: Successfully upserted file chunk to data push job with id '86c1d0ca-d51b-4325-a841-fa4dc9a31598'

To verify that the data has been added to the data push job, we can call the get_chunks() method:

In [17]:

Copied!

data_push_job.get_chunks()
data_push_job.get_chunks()

Out[17]:

[
	DataPushChunk(id='91432605-4088-4501-819d-c8e2a3f231c5', tenant_id=None, creation_date=datetime.datetime(2024, 9, 16, 13, 2, 0, 581000, tzinfo=datetime.timezone.utc), type_='UPSERT', push_job_id='86c1d0ca-d51b-4325-a841-fa4dc9a31598', checksum='0FF41563EF11CB58E1B8E59FF3AE84FEEAB94D49910DFAE1D5E703A7D2945208', optional_tenant_id=None)
]

3.3.3 Execute the data push job¶

Lastly, we can execute the data push job in order to push the file into the EMS. Further, we can specify with wait whether we want to wait for the data push job to be successfully executed:

If wait=True, the method waits for the data push job and raises an error if the data push job fails
If wait=False, the method does not wait for the data push job and does not raise an error

Note:
A data push job can only be executed one single time!

In [18]:

Copied!

data_push_job.execute(wait=True)
data_push_job.execute(wait=True)

[2024-11-12 15:05:05,767] INFO: Successfully triggered execution for data push job with id '86c1d0ca-d51b-4325-a841-fa4dc9a31598'

[2024-11-12 15:05:05,768] INFO: Wait for execution of data push job with id '86c1d0ca-d51b-4325-a841-fa4dc9a31598'

We can verify that the pushed table exists in the data pool by calling the get_tables().find() method:

In [19]:

Copied!

data_pool.get_tables().find("ACTIVITIES_3")
data_pool.get_tables().find("ACTIVITIES_3")

Out[19]:

DataPoolTable(name='ACTIVITIES_3', data_source_id=None, columns=[], schema_name='be065bab-bc94-4f2e-81d3-f4df1e126e2a', data_pool_id='be065bab-bc94-4f2e-81d3-f4df1e126e2a')

3.4 Perform data push with chunking¶

3.4.1 Chunking for Parquet files¶

Celonis only supports Parquet files up to 1GB to be pushed in one file chunk into the EMS. However, a data job can contain multiple chunks, which can be pushed and are later combined into one single table inside the EMS. Hence, if we want to push data bigger than 1GB, we have to split up our data into multiple Parquet files (i.e. chunks), add these chunks into our data job, and can then execute the data push job.

Note:
Even though 1GB is the upper limit, it is recommended to stay significantly below that limit to ensure a smooth data job execution.

Let's suppose we want to push our P2P activity table again into the EMS. However, to reduce the data load, we have split up our activity table into 3 equal Parquet files. To push these chunks into the EMS, we first have to create a manual data push job:

In [20]:

Copied!

data_push_job = data_pool.create_data_push_job(target_name="ACTIVITIES_4",
                                               type_=JobType.REPLACE,
                                               column_config=column_config)
data_push_job = data_pool.create_data_push_job(target_name="ACTIVITIES_4",
                                               type_=JobType.REPLACE,
                                               column_config=column_config)

[2024-11-12 15:05:05,825] INFO: Successfully created data push job with id '2ed0299f-4fab-4e78-b8c6-3f11af43cd0a'

Now, we can iterate over the different Parquet files and add each file as a separate chunk into the data job by calling the add_file_chunk() method:

In [21]:

Copied!





file_path = "../../../assets/chunked_files"
file_names = ["ACTIVITIES_PUSH_CHUNKED_1", "ACTIVITIES_PUSH_CHUNKED_2", "ACTIVITIES_PUSH_CHUNKED_3"]
for file_name in file_names:
    with open(f"{file_path}/{file_name}.parquet", "rb") as f:
        data_push_job.add_file_chunk(f)
file_path = "../../../assets/chunked_files"
file_names = ["ACTIVITIES_PUSH_CHUNKED_1", "ACTIVITIES_PUSH_CHUNKED_2", "ACTIVITIES_PUSH_CHUNKED_3"]
for file_name in file_names:
    with open(f"{file_path}/{file_name}.parquet", "rb") as f:
        data_push_job.add_file_chunk(f)

[2024-11-12 15:05:05,839] INFO: Successfully upserted file chunk to data push job with id '2ed0299f-4fab-4e78-b8c6-3f11af43cd0a'

[2024-11-12 15:05:05,846] INFO: Successfully upserted file chunk to data push job with id '2ed0299f-4fab-4e78-b8c6-3f11af43cd0a'

[2024-11-12 15:05:05,853] INFO: Successfully upserted file chunk to data push job with id '2ed0299f-4fab-4e78-b8c6-3f11af43cd0a'

We can verify that all chunks are in our data push job by calling the get_chunks() method:

In [22]:

Copied!

data_push_job.get_chunks()
data_push_job.get_chunks()

Out[22]:

[
	DataPushChunk(id='414fa9bb-5fba-4cdc-9cd8-9ded2d02002a', tenant_id=None, creation_date=datetime.datetime(2024, 9, 16, 13, 2, 8, 312000, tzinfo=datetime.timezone.utc), type_='UPSERT', push_job_id='2ed0299f-4fab-4e78-b8c6-3f11af43cd0a', checksum='424CE595644627F9AB5D72EDE86F0AC12D28438F0F1659FEE016912BF77F53EE', optional_tenant_id=None),
	DataPushChunk(id='fbca9902-477c-47f2-b250-0d9dd1fbf95c', tenant_id=None, creation_date=datetime.datetime(2024, 9, 16, 13, 2, 8, 75000, tzinfo=datetime.timezone.utc), type_='UPSERT', push_job_id='2ed0299f-4fab-4e78-b8c6-3f11af43cd0a', checksum='50C8E93D569F9F8043854C46F79FE3420C9853E7E44618E6BDA27A21AD9E846E', optional_tenant_id=None),
	DataPushChunk(id='018fddc6-ad9b-4e6a-bff4-ca6b47f5cedb', tenant_id=None, creation_date=datetime.datetime(2024, 9, 16, 13, 2, 8, 564000, tzinfo=datetime.timezone.utc), type_='UPSERT', push_job_id='2ed0299f-4fab-4e78-b8c6-3f11af43cd0a', checksum='F3F8BE5F1B7A2CCB154AB418D3C909D40D57D9AB13D79FBA699D9DD920D0862F', optional_tenant_id=None)
]

Lastly, we can execute the data push job:

In [23]:

Copied!

data_push_job.execute(wait=True)
data_push_job.execute(wait=True)

[2024-11-12 15:05:05,885] INFO: Successfully triggered execution for data push job with id '2ed0299f-4fab-4e78-b8c6-3f11af43cd0a'

[2024-11-12 15:05:05,886] INFO: Wait for execution of data push job with id '2ed0299f-4fab-4e78-b8c6-3f11af43cd0a'

The EMS will take the different chunks and construct a single table out of them. We can call the get_tables().find_all() method inside our data pool to verify that, indeed, only a single table with the name ACTIVITIES_4 was created:

In [24]:

Copied!

data_pool.get_tables().find_all("ACTIVITIES_4")
data_pool.get_tables().find_all("ACTIVITIES_4")

Out[24]:

[
	DataPoolTable(name='ACTIVITIES_4', data_source_id=None, columns=[], schema_name='be065bab-bc94-4f2e-81d3-f4df1e126e2a', data_pool_id='be065bab-bc94-4f2e-81d3-f4df1e126e2a')
]

3.4.2 Chunking for Pandas dataframes¶

Splitting data into multiple chunks can also be performed when pushing Pandas dataframes into the EMS. This is done by specifying the input argument chunk_size in the methods create_table(), append(), or upsert(). The chunk_size argument specifies, how many rows of the data frame are pushed into the EMS in one chunk.

By default, this value is set to 100,000 rows. However, the optimal value for chunk_size depends on the number of columns and type of data in our table. For instance, if our table only contains few columns and data with small memory requirements, such as boolean or integer, we can also increase the chunk_size for a faster data push. If our table contains many columns and data with high memory requirements, such as long strings, it might be a good idea to decrease the chunk_size to avoid reaching the data push limit of Celonis.

Let's push our P2P activity table as Pandas dataframe but limit the chunk_size to 10,000 rows:

In [25]:

Copied!

data_pool.create_table(df=activity_df, table_name="ACTIVITIES_5", drop_if_exists=True, chunk_size=10000)
data_pool.create_table(df=activity_df, table_name="ACTIVITIES_5", drop_if_exists=True, chunk_size=10000)

[2024-11-12 15:05:05,950] WARNING: STRING columns are by default stored as VARCHAR(80) and therefore cut after 80 characters. You can specify a custom field length for each column using the `column_config` parameter.

[2024-11-12 15:05:05,958] INFO: Successfully created data push job with id '7e3d3c80-78a8-44d9-a65b-0b0e317365c0'

[2024-11-12 15:05:05,960] INFO: Add data frame as file chunks to data push job with id '7e3d3c80-78a8-44d9-a65b-0b0e317365c0'

[2024-11-12 15:05:05,972] INFO: Successfully upserted file chunk to data push job with id '7e3d3c80-78a8-44d9-a65b-0b0e317365c0'

[2024-11-12 15:05:05,988] INFO: Successfully triggered execution for data push job with id '7e3d3c80-78a8-44d9-a65b-0b0e317365c0'

[2024-11-12 15:05:05,989] INFO: Wait for execution of data push job with id '7e3d3c80-78a8-44d9-a65b-0b0e317365c0'

[2024-11-12 15:05:06,021] INFO: Successfully created table 'ACTIVITIES_5' in data pool

[2024-11-12 15:05:06,030] INFO: Successfully deleted data push job with id '7e3d3c80-78a8-44d9-a65b-0b0e317365c0'

Out[25]:

DataPoolTable(name='ACTIVITIES_5', data_source_id=None, columns=[], schema_name='be065bab-bc94-4f2e-81d3-f4df1e126e2a', data_pool_id='be065bab-bc94-4f2e-81d3-f4df1e126e2a')

Internally, PyCelonis will iterate over the dataframe using a sliding window of chunk_size rows, convert these rows into Parquet files, add these files as chunks into a data job, and execute the data job.

4. Advanced Data Export¶

4.1 Export data in alternative file formats¶

Besides retrieving data as Pandas dataframe via the export_data_frame() method, it is also possible to export data in alternative file formats, such as .csv, .xlsx, or .parquet files, from the EMS. Similarly to the manual data push job, we have to create a manual data export in order to achieve this.

In general, exporting data in alternative file formats from the EMS is done via the following workflow:

Specify the data to be exported via PQL
Create a manual data export with the PQL query as argument
Write chunks into files

4.1.1 Specify data to be exported via PQL¶

First, we need to specify which data to export by constructing a PQL query. Here, we will use the query from the Data Export tutorial. However, instead of exporting the result table as a Pandas dataframe, we will export it as Parquet files.

In [26]:

Copied!





import pycelonis.pql as pql

df = pql.DataFrame(
    {
        "_CASE_KEY": activities_columns.find("_CASE_KEY"),
        "ACTIVITY_EN": activities_columns.find("ACTIVITY_EN"),
        "EVENTTIME": activities_columns.find("EVENTTIME"),
        "_SORTING": activities_columns.find("_SORTING"),
    },
    data_model=data_model,
)

df = df[df._CASE_KEY == "800000000006800001"].sort_values(by=["EVENTTIME", "_SORTING"])
import pycelonis.pql as pql

df = pql.DataFrame(
    {
        "_CASE_KEY": activities_columns.find("_CASE_KEY"),
        "ACTIVITY_EN": activities_columns.find("ACTIVITY_EN"),
        "EVENTTIME": activities_columns.find("EVENTTIME"),
        "_SORTING": activities_columns.find("_SORTING"),
    },
    data_model=data_model,
)

df = df[df._CASE_KEY == "800000000006800001"].sort_values(by=["EVENTTIME", "_SORTING"])

4.1.2 Create manual data export¶

Next, we need to put the PQL query into a manual data export by calling the create_data_export() method. The method takes the following input arguments:

Name	Type	Description	Default
`query`	`PQL`	PQL query that specifies which data to retrieve	Required
`export_type`	`ExportType`	File format in which data should be retrieved (Format: `ExportType.<type>`, Supported types: `PARQUET`, `EXCEL`, `CSV`)	Required

We can get the raw PQL query of a DataFrame by calling df.query:

In [27]:

Copied!

from pycelonis.ems import ExportType
data_export = data_model.create_data_export(query=df.query, export_type=ExportType.PARQUET)
from pycelonis.ems import ExportType
data_export = data_model.create_data_export(query=df.query, export_type=ExportType.PARQUET)

[2024-11-12 15:05:06,118] INFO: Successfully created data export with id 'bc78f05f-922e-40e1-840b-6e8dfb1838c4'

The method will export the data specified in the PQL query from the EMS and store it in one or multiple chunks (depending on the data size) inside a DataExport object.

To access the chunks inside the DataExport object, we first have to make sure that the data export is completed. To achieve this, we call the wait_for_execution() method, which shows the current status of the data export as a progress bar. Once the method returns the status DONE, we are able to retrieve the chunks from the DataExport object.

In [28]:

Copied!

data_export.wait_for_execution()
data_export.wait_for_execution()

[2024-11-12 15:05:06,125] INFO: Wait for execution of data export with id 'bc78f05f-922e-40e1-840b-6e8dfb1838c4'

4.1.3 Write chunks into files¶

Lastly, we retrieve the chunks inside the DataExport object by calling the get_chunks() method. This method returns the data chunks as a Generator, over which can be iterated in a for-loop:

In [29]:

Copied!

chunks = data_export.get_chunks()
chunks
chunks = data_export.get_chunks()
chunks

Out[29]:

<generator object DataExport.get_chunks at 0x7fc3ccb7c430>

Now, we can read the content of each chunk and write it into a Parquet file:

In [30]:

Copied!





file_path = "../../../assets/chunked_files"
file_name = "ACTIVITIES_PULL_CHUNKED"
file_counter = 1

for chunk in chunks:
    with open(f"{file_path}/{file_name}_{file_counter}.parquet", "wb") as f:
        f.write(chunk.read())
        file_counter += 1
file_path = "../../../assets/chunked_files"
file_name = "ACTIVITIES_PULL_CHUNKED"
file_counter = 1

for chunk in chunks:
    with open(f"{file_path}/{file_name}_{file_counter}.parquet", "wb") as f:
        f.write(chunk.read())
        file_counter += 1

[2024-11-12 15:05:06,148] INFO: Export result chunks for data export with id 'bc78f05f-922e-40e1-840b-6e8dfb1838c4'

Conclusion¶

Congratulations! You have successfully mastered the more advanced topics of pushing and exporting data via PyCelonis. More specifically, you have learned how to specify custom column configurations, how to use alternative file formats, and how to perform chunking. In the next tutorial Data Jobs, you will learn how to use PyCelonis to interact with data jobs and tasks in order to modify data pool tables.