DataFrame¶
DataFrame are a core concept of Pandas and SaolaPy and are used to work with two-dimensional datasets.
In this tutorial, you will learn:
- How to create a SaolaPy DataFrame
- How to apply basic functions to a SaolaPy DataFrame
Prerequisites:¶
To follow this tutorial, you need to have a Celonis EMS team set up and be familiar with the basic functionalities of PyCelonis.
Tutorial¶
1. Connect to your data model¶
To get started with SaolaPy, we first need to connect to our data model:
from pycelonis import get_celonis
import pycelonis.pql as pql
c = get_celonis()
data_pool = c.data_integration.get_data_pools().find("<YOUR_DATA_POOL>")
data_model = data_pool.get_data_models().find("<YOUR_DATA_MODEL>")
To avoid always having to specify the data model we set it as our global default data model:
from pycelonis.config import Config
Config.DEFAULT_DATA_MODEL = data_model
For convenience, we extract the columns of tables we want to use:
ekpo_columns = data_model.get_tables().find("EKPO").get_columns()
2. SaolaPy DataFrame Initialization¶
Now it's time to start working with SaolaPy DataFrames. There are several ways to instantiate a SaolaPy DataFrame.
First, we can directly pass columns from the default data model in a dictionary where the key is the column name and the value is the column:
ekpo_df = pql.DataFrame({
"NETPR": ekpo_columns.find("NETPR"),
"BUKRS": ekpo_columns.find("BUKRS"),
})
ekpo_df
DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])
This will initialize a DataFrame with the given PQL columns and a regular RangeIndex
.
Alternatively, we can specify the data model explicitly (e.g. in case we want to use a different data model than the default):
ekpo_df = pql.DataFrame(
{
"NETPR": ekpo_columns.find("NETPR"),
"BUKRS": ekpo_columns.find("BUKRS"),
},
data_model=data_model
)
ekpo_df
DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])
It is also possible to instantiate a DataFrame using a PQL query string (generally all PQL query strings can be used but there might be some restrictions based on the PQL language):
ekpo_df = pql.DataFrame({
"NETPR": '"EKPO"."NETPR"',
"BUKRS": '"EKPO"."BUKRS"',
})
ekpo_df
DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])
3. DataFrame Columns¶
Now let's look at how to work with DataFrame columns. First, let's find out which columns our DataFrame has:
ekpo_df.columns
['NETPR', 'BUKRS']
Next, we can access and assign new columns. To access a new column use the same notation as pandas:
ekpo_df["NETPR"]
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
You can also access columns as properties:
ekpo_df.NETPR
Series(data='"EKPO"."NETPR"', index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[], name='NETPR')
To access a subset of columns just use a list as index:
ekpo_df[["BUKRS", "NETPR"]]
DataFrame(data={'BUKRS': '"EKPO"."BUKRS"', 'NETPR': '"EKPO"."NETPR"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])
In case you want to assign a new column simply use the pandas' notation:
ekpo_df["MENGE"] = ekpo_columns.find("MENGE")
To assign multiple columns at once, simply pass the values as a list:
ekpo_df[["EBELN", "EBELP"]] = [
ekpo_columns.find("EBELN"),
ekpo_columns.find("EBELP"),
]
ekpo_df
DataFrame(data={'NETPR': '"EKPO"."NETPR"', 'BUKRS': '"EKPO"."BUKRS"', 'MENGE': '"EKPO"."MENGE"', 'EBELN': '"EKPO"."EBELN"', 'EBELP': '"EKPO"."EBELP"'}, index=RangeIndex(name='Index', start=0, step=1), filters=[], order_by_columns=[])
Finally, it is also possible to drop one or multiple columns:
ekpo_df = ekpo_df.drop(["EBELN", "EBELP", "MENGE"])
ekpo_df.columns
['NETPR', 'BUKRS']
4. Basic DataFrame Functions¶
To get the number of rows and columns we can call the shape
function:
ekpo_df.shape
(325493, 2)
For a sample of the data we can get the first rows using the head
function:
ekpo_df.head()
NETPR | BUKRS | |
---|---|---|
Index | ||
0 | 165.60 | GB03 |
1 | 111.01 | GB03 |
2 | 149.50 | 5100 |
3 | 372.60 | 5100 |
4 | 265.80 | 5100 |
Or we can export the full DataFrame using the to_pandas
function:
ekpo_df.to_pandas()
NETPR | BUKRS | |
---|---|---|
Index | ||
0 | 165.60 | GB03 |
1 | 111.01 | GB03 |
2 | 149.50 | 5100 |
3 | 372.60 | 5100 |
4 | 265.80 | 5100 |
... | ... | ... |
325488 | 36.94 | 5100 |
325489 | 1117.50 | 5100 |
325490 | 114.00 | 5100 |
325491 | 219.51 | MY01 |
325492 | 495.81 | 5100 |
325493 rows × 2 columns
We can also sort the data using the sort_values
function by specifying the columns to sort by:
ekpo_df.sort_values(["BUKRS", "NETPR"], ascending=False).head()
NETPR | BUKRS | |
---|---|---|
Index | ||
0 | 492390.59 | SG01 |
1 | 492390.59 | SG01 |
2 | 492390.59 | SG01 |
3 | 492390.59 | SG01 |
4 | 484845.57 | SG01 |
In the data science workflow it is common to exclude none values which can be achieved using the dropna
function:
ekpo_df = ekpo_df.dropna()
Lastly, you can convert your DataFrame to a different data type:
ekpo_df.astype(str)
Note: The conversion uses the PQL operators TO_STRING
, TO_INT
, and TO_FLOAT
and is therefore only possible for the allowed input data types.
Conclusion¶
Congratulations! You have learned how to use the basic functionalities of a SaolaPy DataFrame.