Dataflows in SAP Data Warehouse Cloud
SAP Data Warehouse Cloud (DWC) is a cloud-based data warehousing solution that combines efficient data management with advanced analytics.
Dataflows have been introduced in SAP DWC as an easy-to-use data modeling experience for ETL requirements. They allow us to load and combine structured and semi-structured data from different data sources (SAP and non-SAP), such as cloud file storage, database management systems (DBMS), or SAP S/4HANA, and they provide standard data transformation capabilities along with scripting for advanced requirements.
Dataflow builder architecture
The Dataflow builder leverages parts of SAP Data Intelligence Cloud (DIC), which is powered by SAP HANA Cloud. SAP Data Warehouse Cloud is built on top of SAP HANA Cloud, and SAP Data Intelligence Cloud is embedded into DWC in the form of the Dataflow builder, where it offers ETL functionality. When the Dataflow builder is used in DWC, it draws on a dedicated subset of the Data Intelligence Cloud functionality: triggering a Dataflow execution generates a Data Intelligence pipeline in a side-by-side Data Intelligence cluster.
Figure 1: SAP HANA Cloud services
Data Views vs Dataflows
How does a Dataflow differ from a Data view? The table below summarizes the differences.
| DATA VIEW BUILDER | DATA FLOW |
| --- | --- |
| The main aim of the Data view builder is data federation | The main aim of Dataflows is to persist data |
| Data outside DWC is made accessible as one integrated dataset | Enables working with large data sources like data lakes, where federation would cause slow response times |
| Supports Graphical and SQL builder views and a standard set of data transformations | Supports a Graphical view and a standard set of transformations; also provides Python scripting functionality |
| Supports connections that in turn support federation, real-time replication, or momentary data snapshots | Draws from a richer network of connections, including non-SAP sources, cloud file storage, and APIs |
| Single output structure, in an inherited form; the target is also federated | Multiple, definable output structures: you can add to or replace the data in an existing table, or create a new output; the target is persisted |
One strategy is to use the Data view builder and Dataflows so that they complement each other: Dataflows move data from multiple sources into DWC, and the view builder then builds quick insights on top of the persisted data.
Data operations in Dataflows
Dataflows offer several standard data operations (similar to those available in the Data view builder) that can be used to model data, such as Unions, Joins, Projections, Filters, and Aggregations. One major advantage of Dataflows in DWC is the 'Script' operator, which can be used to perform more advanced transformations in Python.
Figure 2: Data operators in DWC – marked in red (left to right – Join, Union, Projection, Aggregation, and Script)
Python scripting in Dataflows
The Script operator currently runs on Python 3.6.x. It allows for data manipulations and vector operations in Python by providing support for the NumPy and Pandas modules. NumPy and Pandas functions can be referenced through the aliases np and pd directly within the transform function, without any explicit imports.
The incoming data is fed into the data parameter of the transform function of the Script node and is accessible within the function as a pandas DataFrame for further transformations. The return value of this function is sent to the output.
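As a minimal sketch of this mechanism (not an official template), a transform function might look like the snippet below. The column names QUANTITY, PRICE, REVENUE, and HIGH_VALUE are hypothetical, and the import lines are included only so the sketch runs outside DWC; inside the Script operator, np and pd are pre-bound and explicit imports are restricted.

```python
# Standalone sketch only: inside the Script operator, np and pd are
# pre-bound aliases and explicit imports are restricted, so the two
# import lines below would be omitted there.
import numpy as np
import pandas as pd

def transform(data):
    # 'data' is the incoming pandas DataFrame.
    # Hypothetical input columns: QUANTITY, PRICE.
    data["REVENUE"] = data["QUANTITY"] * data["PRICE"]
    # Vectorized conditional via the np alias.
    data["HIGH_VALUE"] = np.where(data["REVENUE"] > 1000, "Y", "N")
    # Whatever is returned here is sent to the operator's output.
    return data
```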
It is important to note that the DataFrame returned from the transform function must have the same column names and types as specified in the output schema of the operator; otherwise, the execution fails.
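One defensive pattern against such failures (a sketch, assuming a hypothetical output schema with SALES_ID as an integer column and REVENUE as a float column) is to build the returned DataFrame explicitly, so its names and types mirror the declared schema:

```python
def transform(data):
    # Hypothetical output schema declared on the Script operator:
    #   SALES_ID (integer), REVENUE (float)
    # pd is the pre-bound Pandas alias inside the Script operator.
    out = pd.DataFrame({
        "SALES_ID": data["ID"].astype("int64"),  # rename ID and cast
        "REVENUE": (data["QUANTITY"] * data["PRICE"]).astype("float64"),
    })
    # Names, order, and dtypes now match the output schema exactly.
    return out
```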
NOTE: The operator is executed in sandbox mode; accessing the file system or network and importing other Python modules are restricted, as are defining classes and using coroutines. Restricted Pandas and NumPy functions are listed in the help section (in the Properties pane of the Script node). Any updates to the Python scripting documentation are also added there.
Dataflow execution with Python scripting will be discussed in detail in an upcoming blog.
Further reading:
- The Architecture of the Data Flow Builder within SAP Data Warehouse Cloud
- The Data Flow: a Quick Overview
- Data View Builder vs. Data Flow
- Using Standard Operations in the Data Flow Builder