DataFrame¶

DataFrame are a core concept of Pandas and SaolaPy and are used to work with two-dimensional datasets.

In this tutorial, you will learn:

How to create a SaolaPy DataFrame
How to apply basic functions to a SaolaPy DataFrame

Prerequisites:¶

To follow this tutorial, you need to have a Celonis EMS team set up and be familiar with the basic functionalities of PyCelonis.

Tutorial¶

1. Connect to your data model¶

To get started with SaolaPy, we first need to connect to our data model:

In [1]:

Copied!





from pycelonis import get_celonis
import pycelonis.pql as pql

c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("<YOUR_DATA_POOL>")
data_model = data_pool.get_data_models().find("<YOUR_DATA_MODEL>")
from pycelonis import get_celonis
import pycelonis.pql as pql

c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("")
data_model = data_pool.get_data_models().find("")

To avoid always having to specify the data model we set it as our global default data model:

In [2]:

Copied!

from pycelonis.config import Config

Config.DEFAULT_DATA_MODEL = data_model
from pycelonis.config import Config

Config.DEFAULT_DATA_MODEL = data_model

For convenience, we extract the columns of tables we want to use:

In [3]:

Copied!

ekpo_columns = data_model.get_tables().find("EKPO").get_columns()
ekpo_columns = data_model.get_tables().find("EKPO").get_columns()

2. SaolaPy DataFrame Initialization¶

Now it's time to start working with SaolaPy DataFrames. There are several ways to instantiate a SaolaPy DataFrame.

First, we can directly pass columns from the default data model in a dictionary where the key is the column name and the value is the column:

In [4]:

Copied!





ekpo_df = pql.DataFrame({
    "NETPR": ekpo_columns.find("NETPR"),
    "BUKRS": ekpo_columns.find("BUKRS"),
})
ekpo_df
ekpo_df = pql.DataFrame({
    "NETPR": ekpo_columns.find("NETPR"),
    "BUKRS": ekpo_columns.find("BUKRS"),
})
ekpo_df

Out[4]:

DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])

This will initialize a DataFrame with the given PQL columns and a regular RangeIndex.

Alternatively, we can specify the data model explicitly (e.g. in case we want to use a different data model than the default):

In [5]:

Copied!





ekpo_df = pql.DataFrame(
    {
        "NETPR": ekpo_columns.find("NETPR"),
        "BUKRS": ekpo_columns.find("BUKRS"),
    }, 
    data_model=data_model
)
ekpo_df
ekpo_df = pql.DataFrame(
    {
        "NETPR": ekpo_columns.find("NETPR"),
        "BUKRS": ekpo_columns.find("BUKRS"),
    }, 
    data_model=data_model
)
ekpo_df

Out[5]:

DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])

It is also possible to instantiate a DataFrame using a PQL query string (generally all PQL query strings can be used but there might be some restrictions based on the PQL language):

In [6]:

Copied!





ekpo_df = pql.DataFrame({
    "NETPR": '"EKPO"."NETPR"',
    "BUKRS": '"EKPO"."BUKRS"',
})
ekpo_df
ekpo_df = pql.DataFrame({
    "NETPR": '"EKPO"."NETPR"',
    "BUKRS": '"EKPO"."BUKRS"',
})
ekpo_df

Out[6]:

DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])

3. DataFrame Columns¶

Now let's look at how to work with DataFrame columns. First, let's find out which columns our DataFrame has:

In [7]:

Copied!

ekpo_df.columns
ekpo_df.columns

Out[7]:

['NETPR', 'BUKRS']

Next, we can access and assign new columns. To access a new column use the same notation as pandas:

In [8]:

Copied!

ekpo_df["NETPR"]
ekpo_df["NETPR"]

Out[8]:

Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

You can also access columns as properties:

In [9]:

Copied!

ekpo_df.NETPR
ekpo_df.NETPR

Out[9]:

Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')

To access a subset of columns just use a list as index:

In [10]:

Copied!

ekpo_df[["BUKRS", "NETPR"]]
ekpo_df[["BUKRS", "NETPR"]]

Out[10]:

DataFrame(data={'BUKRS': '"EKPO"."BUKRS"', 'NETPR': '"EKPO"."NETPR"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])

In case you want to assign a new column simply use the pandas' notation:

In [11]:

Copied!

ekpo_df["MENGE"] = ekpo_columns.find("MENGE")
ekpo_df["MENGE"] = ekpo_columns.find("MENGE")

To assign multiple columns at once, simply pass the values as a list:

In [12]:

Copied!





ekpo_df[["EBELN", "EBELP"]] = [
    ekpo_columns.find("EBELN"),
    ekpo_columns.find("EBELP"),
]
ekpo_df
ekpo_df[["EBELN", "EBELP"]] = [
    ekpo_columns.find("EBELN"),
    ekpo_columns.find("EBELP"),
]
ekpo_df

Out[12]:

DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"', 'MENGE': '"EKPO"."MENGE"', 'EBELN': '"EKPO"."EBELN"', 'EBELP': '"EKPO"."EBELP"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])

Finally, it is also possible to drop one or multiple columns:

In [13]:

Copied!

ekpo_df = ekpo_df.drop(["EBELN", "EBELP", "MENGE"])
ekpo_df.columns
ekpo_df = ekpo_df.drop(["EBELN", "EBELP", "MENGE"])
ekpo_df.columns

Out[13]:

['NETPR', 'BUKRS']

4. Basic DataFrame Functions¶

To get the number of rows and columns we can call the shape function:

In [14]:

Copied!

ekpo_df.shape
ekpo_df.shape

Out[14]:

(325493, 2)

For a sample of the data we can get the first rows using the head function:

In [15]:

Copied!

ekpo_df.head()
ekpo_df.head()

Out[15]:

	NETPR	BUKRS
Index
0	165.60	GB03
1	111.01	GB03
2	149.50	5100
3	372.60	5100
4	265.80	5100

Or we can export the full DataFrame using the to_pandas function:

In [16]:

Copied!

ekpo_df.to_pandas()
ekpo_df.to_pandas()

Out[16]:

	NETPR	BUKRS
Index
0	165.60	GB03
1	111.01	GB03
2	149.50	5100
3	372.60	5100
4	265.80	5100
...	...	...
325488	36.94	5100
325489	1117.50	5100
325490	114.00	5100
325491	219.51	MY01
325492	495.81	5100

325493 rows × 2 columns

We can also sort the data using the sort_values function by specifying the columns to sort by:

In [19]:

Copied!

ekpo_df.sort_values(["BUKRS", "NETPR"], ascending=False).head()
ekpo_df.sort_values(["BUKRS", "NETPR"], ascending=False).head()

Out[19]:

	NETPR	BUKRS
Index
0	492390.59	SG01
1	492390.59	SG01
2	492390.59	SG01
3	492390.59	SG01
4	484845.57	SG01

In the data science workflow it is common to exclude none values which can be achieved using the dropna function:

In [18]:

Copied!

ekpo_df = ekpo_df.dropna()
ekpo_df = ekpo_df.dropna()

Lastly, you can convert your DataFrame to a different data type:

ekpo_df.astype(str)

Note: The conversion uses the PQL operators TO_STRING, TO_INT, and TO_FLOAT and is therefore only possible for the allowed input data types.

Conclusion¶

Congratulations! You have learned how to use the basic functionalities of a SaolaPy DataFrame.