
Dataflows in SAP Data Warehouse Cloud

Published on 9 March 2021
Sanjana Nair
AI Consultant

Sanjana Nair is an AI Consultant at Applexus. She has 4+ years of experience in the IT industry, centred around the development of AI/ML solutions for Customer Retail. She also has experience as a developer in the eCommerce sector and is a certified SAP Hybris developer.

SAP Data Warehouse Cloud (DWC) is a cloud-based data warehousing solution that combines both efficient data management and advanced analytics.

Dataflows have been introduced in SAP DWC as an easy-to-use data modeling experience for ETL requirements. They allow us to load and combine structured and semi-structured data from different SAP and non-SAP data sources, such as cloud file storage, database management systems (DBMS), or SAP S/4HANA, and they provide standard data transformation capabilities plus scripting for advanced requirements.

Dataflow builder architecture

The Dataflow builder leverages parts of SAP Data Intelligence Cloud (DIC), which, like DWC, is powered by SAP HANA Cloud. DWC is built on top of HANA Cloud, and Data Intelligence Cloud is embedded into DWC in the form of the Dataflow builder, where it provides the ETL functionality. When the Dataflow builder is used in DWC, it relies on a dedicated subset of the Data Intelligence Cloud functionality: triggering a Dataflow execution generates a Data Intelligence pipeline in a side-by-side Data Intelligence cluster.

Figure 1: SAP HANA Cloud services (refer link)

Data Views vs Dataflows

How is a Dataflow different from a Data view? The comparison below breaks it down.

Main aim
- Data View Builder: data federation.
- Dataflow: persisting data.

Data access
- Data View Builder: data outside DWC is made accessible as one integrated dataset.
- Dataflow: enables working with large data sources such as data lakes, where federation would cause slow response times.

Modeling and transformations
- Data View Builder: supports graphical and SQL builder views and a standard set of data transformations.
- Dataflow: supports a graphical view and the standard set of transformations, and additionally provides Python scripting.

Connections
- Data View Builder: supports connections that in turn support federation, real-time replication, or momentary data snapshots.
- Dataflow: draws from a richer network of connections, including non-SAP sources, cloud file storage, and APIs.

Output
- Data View Builder: a single output structure in an inherited form; the target is also federated.
- Dataflow: multiple, definable output structures; you can add to or replace the data in an existing table or create a new output, and the target is persisted.

One strategy is to use the Data View Builder and Dataflows so that they complement each other: Dataflows move data from multiple sources into DWC, and the view builder then builds quick insights on top of that data.


Data operations in Dataflows

Dataflows offer several standard data operations (similar to those available in the Data View Builder) that can be used to model data, such as unions, joins, projections, filters, and aggregations. One major advantage of Dataflows in DWC is the ‘Script’ operator, which can be used to perform more advanced transformations in Python.

Figure 2: Data operators in DWC, marked in red (left to right: Join, Union, Projection, Aggregation, and Script)
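For readers who think in code, the pandas sketch below mimics, purely conceptually, what a projection, filter, union, join, and aggregation do. The DataFrames and column names are invented for illustration; this is not how DWC executes the graphical operators internally.

```python
import pandas as pd

# Hypothetical source data standing in for two DWC source tables.
orders = pd.DataFrame({
    "ORDER_ID": [1, 2, 3],
    "CUSTOMER_ID": [10, 10, 20],
    "AMOUNT": [100.0, 250.0, 80.0],
})
customers = pd.DataFrame({
    "CUSTOMER_ID": [10, 20],
    "REGION": ["EMEA", "APAC"],
})

# Union: stack rows from a second, identically structured source.
more_orders = pd.DataFrame({"ORDER_ID": [4], "CUSTOMER_ID": [20], "AMOUNT": [300.0]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)

# Projection: keep only the columns needed downstream.
projected = all_orders[["CUSTOMER_ID", "AMOUNT"]]

# Filter: restrict the rows, e.g. to larger orders.
filtered = projected[projected["AMOUNT"] > 90]

# Join: combine with the customer master data on the key column.
joined = filtered.merge(customers, on="CUSTOMER_ID", how="inner")

# Aggregation: total amount per region.
aggregated = joined.groupby("REGION", as_index=False)["AMOUNT"].sum()
print(aggregated)
```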

Python scripting in Dataflows

The Script operator currently runs on Python 3.6.x. It allows for data manipulations and vector operations in Python by providing support for NumPy and Pandas modules. NumPy and Pandas functions can be referenced by aliases np and pd directly within the transform function without any explicit imports.

The incoming data is fed into the data parameter of the transform function of the Script node. It is accessible within the function as a pandas DataFrame for further data transformations. The DataFrame returned by this function is sent to the output.

It is important to note that the DataFrame returned by the transform function must have the same column names and types as specified in the output schema of the operator. Otherwise, the execution results in a failure.

NOTE: The operator is executed in sandbox mode; accessing the file system or network and importing other Python modules are restricted, as are building classes and using coroutines. The restricted Pandas and NumPy functions are listed in the help section (in the Properties pane of the Script node). Any updates to the Python scripting documentation are also added there.
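As a rough illustration, the body of a Script operator might look like the sketch below. The column names (SALES_AMOUNT, TAX_RATE, GROSS_AMOUNT) are hypothetical, and the node's output schema is assumed to contain exactly those three columns; pd and np are the pre-defined aliases mentioned above, so no imports appear inside the operator.

```python
def transform(data):
    # 'data' arrives as a pandas DataFrame holding the operator's input columns
    # (hypothetical columns here: SALES_AMOUNT and TAX_RATE).

    # Clean up the tax rate: coerce to numeric and replace missing values with 0.
    data["TAX_RATE"] = pd.to_numeric(data["TAX_RATE"], errors="coerce").fillna(0)

    # Vectorised calculation using the pre-defined pd/np aliases; outside DWC you
    # would need to import pandas as pd and numpy as np yourself.
    data["GROSS_AMOUNT"] = np.round(data["SALES_AMOUNT"] * (1 + data["TAX_RATE"]), 2)

    # The returned DataFrame must match the column names and types defined in the
    # output schema of the Script node, otherwise the Dataflow run fails.
    return data[["SALES_AMOUNT", "TAX_RATE", "GROSS_AMOUNT"]]
```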

Dataflow execution with Python scripting will be discussed in detail in an upcoming blog.
