Dataflows in SAP Data Warehouse Cloud
SAP Data Warehouse Cloud (DWC) is a cloud-based data warehousing solution that combines efficient data management with advanced analytics.
Dataflows have been introduced in SAP DWC as an easy-to-use data modeling experience for ETL requirements. They allow us to load and combine structured and semi-structured data from different data sources (SAP and non-SAP), such as cloud file storage, database management systems (DBMS), or SAP S/4HANA, and they provide standard data transformation capabilities along with scripting for advanced requirements.
Dataflow builder architecture
The Dataflow builder leverages parts of SAP Data Intelligence Cloud (DIC), which is powered by SAP HANA Cloud. SAP Data Warehouse Cloud is built on top of SAP HANA Cloud, and SAP Data Intelligence Cloud is embedded into DWC in the form of the Dataflow builder, where it offers ETL functionality. When the Dataflow builder is used in DWC, it draws on a dedicated subset of the Data Intelligence Cloud functionality: triggering a Dataflow execution generates a Data Intelligence pipeline in a side-by-side Data Intelligence cluster.
Figure 1: SAP HANA Cloud services
Data Views vs Dataflows
How does a Dataflow differ from a Data view? The table below summarizes the differences.
| DATA VIEW BUILDER | DATA FLOW |
| --- | --- |
| The main aim of the Data view builder is data federation | The main aim of Dataflows is to persist data |
| Data outside DWC is made accessible as one integrated dataset | Enables working with large data sources like data lakes, where federation would cause slow response times |
| Supports Graphical and SQL builder views and a standard set of data transformations | Supports a Graphical view and a standard set of transformations; also provides Python scripting functionality |
| Supports connections that in turn support federation, real-time replication, or momentary data snapshots | Draws from a richer network of connections, including non-SAP sources, cloud file storage, and APIs |
| Single output structure, in an inherited form; the target is also federated | Multiple, definable output structures: you can add to or replace the data in an existing table, or create a new output; the target is persisted |
One strategy is to use the Data view builder and Dataflows so that they complement each other: Dataflows move data from multiple sources into DWC, and the view builder then builds quick insights on top of the persisted data.
Data operations in Dataflows
Dataflows offer several standard data operations (similar to those available in the Data view builder) that can be used to model data, such as Unions, Joins, Projections, Filters, and Aggregations. One major advantage of Dataflows in DWC is the 'Script' operator, which can be used to perform more advanced transformations in Python.
Figure 2: Data operators in DWC – marked in red (left to right – Join, Union, Projection, Aggregation, and Script)
Python scripting in Dataflows
The Script operator currently runs on Python 3.6.x. It allows for data manipulations and vector operations in Python by providing support for the NumPy and Pandas modules. NumPy and Pandas functions can be referenced through the aliases np and pd directly within the transform function, without any explicit imports.
The incoming data is fed into the data parameter of the transform function of the Script node and is accessible within the function as a pandas DataFrame for further transformations. The return value of this function is sent to the output.
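As a minimal sketch of this mechanism (not an official template), a transform function might look like the snippet below. The column names QUANTITY, PRICE, REVENUE, and HIGH_VALUE are hypothetical, and the import lines are included only so the sketch runs outside DWC; inside the Script operator, np and pd are pre-bound and explicit imports are restricted.

```python
# Standalone sketch only: inside the Script operator, np and pd are
# pre-bound aliases and explicit imports are restricted, so the two
# import lines below would be omitted there.
import numpy as np
import pandas as pd

def transform(data):
    # 'data' is the incoming pandas DataFrame.
    # Hypothetical input columns: QUANTITY, PRICE.
    data["REVENUE"] = data["QUANTITY"] * data["PRICE"]
    # Vectorized conditional via the np alias.
    data["HIGH_VALUE"] = np.where(data["REVENUE"] > 1000, "Y", "N")
    # Whatever is returned here is sent to the operator's output.
    return data
```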
It is important to note that the DataFrame returned from the transform function must have the same column names and types as specified in the output schema of the operator; otherwise, the execution fails.
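One defensive pattern against such failures (a sketch, assuming a hypothetical output schema with SALES_ID as an integer column and REVENUE as a float column) is to build the returned DataFrame explicitly, so its names and types mirror the declared schema:

```python
def transform(data):
    # Hypothetical output schema declared on the Script operator:
    #   SALES_ID (integer), REVENUE (float)
    # pd is the pre-bound Pandas alias inside the Script operator.
    out = pd.DataFrame({
        "SALES_ID": data["ID"].astype("int64"),  # rename ID and cast
        "REVENUE": (data["QUANTITY"] * data["PRICE"]).astype("float64"),
    })
    # Names, order, and dtypes now match the output schema exactly.
    return out
```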
NOTE: The operator is executed in sandbox mode; accessing the file system or network and importing other Python modules are restricted, as are defining classes and using coroutines. Restricted Pandas and NumPy functions are listed in the help section (in the Properties pane of the Script node). Any updates to the Python scripting documentation are also added there.
Dataflow execution with Python scripting will be discussed in detail in an upcoming blog.
Further reading:
- The Architecture of the Data Flow Builder within SAP Data Warehouse Cloud
- The Data Flow: a Quick Overview
- Data View Builder vs. Data Flow
- Using Standard Operations in the Data Flow Builder