Performance Optimization in BODS
SAP BusinessObjects Data Services delivers a powerful, enterprise-class data integration and data quality solution. It provides one development, run-time, and management environment that allows you to integrate, transform, improve, and deliver data anywhere, at any frequency.
The comprehensive features and functions of SAP BusinessObjects Data Services enable:
Agility – Leverage features that can accelerate time to market, encourage reuse and collaboration, increase developer productivity, and create unified skill sets.
Trusted information – Improve the reliability of your data with functions for data governance – including transparency, change management, security, and controls – global data support, and end-to-end metadata management.
Operational excellence – Gain a single data repository, platform, and environment, reducing the costs associated with supporting multiple applications; leverage multi-platform support for 64-bit UNIX, Linux, and Windows.
Enterprise-class – Benefit from highly scalable performance with support for parallelization and grid computing, enterprise connectivity, and support for service-oriented architecture.
It is recommended that BODS jobs be tuned so that they complete in less time and consume fewer resources.
This blog covers the different performance optimization techniques needed to fine-tune BODS jobs, enabling them to complete in less time with lower failure rates.
What is Join Rank?
You can use join rank to control the order in which sources (tables or files) are joined in a data flow. The highest-ranked source is accessed first to construct the join.
Best Practices for Join Ranks:
- Define the join rank in the Query editor.
- For an inner join between two tables, in the Query editor assign a higher join rank value to the larger table and, if possible, cache the smaller table.
Default, Max, and Min values in Join Rank:
The default value for join rank is 0. The maximum and minimum values can be any non-negative number.
Suppose you have tables T1, T2, and T3 with join ranks of 10, 20, and 30. Table T3 has the highest join rank, so T3 acts as the driving table.
Controlling join order can often have a huge effect on the performance of producing the join result. Join ordering is relevant only in cases where the Data Services engine performs the join. In cases where the code is pushed down to the database, the database server determines how a join is performed.
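As a rough illustration, the "highest rank drives the join" behavior can be sketched in Python. The dictionaries and values below are illustrative only, not the actual DS Optimizer implementation:

```python
# Hypothetical sketch: sources are considered from the highest join rank
# (the driving table) down to the lowest. Table names and rank values
# mirror the T1/T2/T3 example above.
sources = [
    {"name": "T1", "join_rank": 10},
    {"name": "T2", "join_rank": 20},
    {"name": "T3", "join_rank": 30},  # highest rank -> driving table
]

# Order sources so the highest-ranked one is accessed first.
join_order = sorted(sources, key=lambda s: s["join_rank"], reverse=True)
print([s["name"] for s in join_order])  # ['T3', 'T2', 'T1']
```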
Where should Join Rank be used?
When the code is not fully pushed down and the sources contain a large number of records, join rank should be considered. The Data Services Optimizer considers the join rank and uses the source with the highest join rank as the left source. Join rank is very useful in cases where the DS Optimizer cannot resolve the most efficient execution plan automatically. A higher join rank value means that the table drives the join.
You can print a trace message to the Monitor log file which allows you to see the order in which the Data Services Optimizer performs the joins. This information may help you to identify ways to improve performance. To add the trace, select Optimized Data Flow in the Trace tab of the “Execution Properties” dialog.
Rows per commit:
For the best performance, BODS recommends setting the Rows per commit value between 500 and 2000. The default value for regular loading is 1000. The optimal value depends on the number of columns in the target table. While adjusting the Rows per commit value on the target table, keep the following in mind: the default value is 1000, the maximum value is 5000, and negative or non-numeric values are not allowed. Rows per commit is a target-side performance tuning technique.
The formula for calculating the Rows per commit value is: max_IO_size / row size (in bytes)
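A worked example of this formula; the 64 KB max I/O size and 40-byte row size are assumed values for illustration:

```python
# Rows per commit = max_IO_size / row size, clamped to the valid
# range for the target table editor (1..5000, per the text above).
max_io_size = 64 * 1024   # bytes per I/O operation (assumed)
row_size = 40             # bytes per row in the target table (assumed)

rows_per_commit = max_io_size // row_size
rows_per_commit = max(1, min(rows_per_commit, 5000))
print(rows_per_commit)  # 1638
```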
The following scenarios show the impact of Rows per commit on a BODS job:
In the first scenario, Rows per commit is 200 and the source row count is 18,116,224 records. It takes 229 seconds (3 min, 49 sec) to load the data into the target table.
In the second scenario, Rows per commit is 2000 and the source row count is the same as above (18,116,224 records). It takes 136 seconds (2 min, 16 sec) to load the data into the target table.
Hence, in the second scenario, the job execution time is reduced because the Rows per commit value is set appropriately for the data load.
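The speedup reported in these two scenarios works out as follows (plain arithmetic on the figures above):

```python
# Throughput comparison for the two Rows per commit scenarios.
rows = 18116224
t_rpc_200 = 229    # seconds with Rows per commit = 200
t_rpc_2000 = 136   # seconds with Rows per commit = 2000

print(round(rows / t_rpc_200))           # 79110 rows/sec
print(round(rows / t_rpc_2000))          # 133208 rows/sec
print(round(t_rpc_200 / t_rpc_2000, 2))  # 1.68x faster
```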
Note: Performance becomes worse if you set Rows per commit too high for your environment. Also, if an overflow file is used or the initial transaction fails, the Rows per commit value behaves as if it were set to 1.
Array fetch size:
It indicates the number of rows returned in a single fetch call to a source table; the default value is 1000. A larger Array fetch size reduces the number of round trips to the database and hence improves performance for table reads. Array fetch size is a source-side performance tuning technique.
For example: if you enter 50 and your query retrieves 500 rows, the connection executes 10 fetches to retrieve the 500 rows.
- If you enter 1 for Array fetch size, data is retrieved row by row; the array fetch is effectively deactivated, which slows down performance.
- If you set Array fetch size too high, it can cause a "fetch out of sequence" error; in that scenario the packet at the receiving end is corrupted and has to be re-transmitted.
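The relationship between Array fetch size and the number of database round trips is a simple ceiling division, sketched here with the 500-row example above:

```python
import math

def fetch_calls(total_rows: int, array_fetch_size: int) -> int:
    """Number of fetch round trips needed to retrieve a result set."""
    return math.ceil(total_rows / array_fetch_size)

print(fetch_calls(500, 50))    # 10 round trips
print(fetch_calls(500, 1))     # 500 round trips (row-by-row, slowest)
print(fetch_calls(500, 1000))  # 1 round trip
```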
The following scenarios show the impact of Array fetch size on a BODS job:
In the first scenario, Array fetch size is 100 and the source row count is 18,116,224 records. It takes 292 seconds (4 min, 52 sec) to load the data into the target table.
In the second scenario, Array fetch size is 1000 and the source row count is the same as above (18,116,224 records). It takes 259 seconds (4 min, 19 sec) to load the data into the target table.
If the Array fetch size value is set properly for a large source data load in a complex ETL job, performance improves further in terms of job execution time.
Hence, in the second scenario, because the Array fetch size value (1000) is greater than in the first, the job execution time is reduced.
DS generates SQL statements for selecting, inserting, updating, and deleting data records.
When processing database data, DS can leverage the power of the database engine, which can be very important for performance. The mechanism applied is called SQL pushdown: part (or all) of the transformation logic is pushed down to the database in the form of generated SQL statements. Although DS itself is a very powerful tool, databases are often able to process data much faster. On top of that, internal processing within the database layer avoids, or significantly reduces, costly and time-consuming data transfers between database server memory and DS memory and vice versa.
In many cases, the DS engine is smart enough to make the right decisions at this level, but a good dataflow (DF) design will help. The overall principle is to minimize the processing capacity and memory used by the DS engine. The following are the most important factors influencing the performance of a DS dataflow:
- Maximize the number of operations that can be performed by the database
- Minimize the number of records processed by the DS engine
- Minimize the number of columns processed by the DS engine (a bit less important, because it often has lower impact).
During the development of a DS dataflow, it is always possible to view the code as it will be executed by the DS engine at runtime. In particular, when reading from a relational database, one can always see the SQL that will be generated from the dataflow. When a dataflow is open in the DS Designer, select Validation -> Display Optimized SQL from the menu.
It will show the SQL code that will be generated and pushed down by the DS engine:
Make sure that the dataflow has not been modified after it was last saved to the repository. If the dataflow has been modified, save it before displaying the generated SQL. The Optimized SQL popup window always shows the code corresponding to the saved version, not the one currently displayed in DS Designer.
When all sources and targets in a flow are relational database tables, the complete operation will be pushed to the database under the following conditions:
- All tables exist in the same database or linked databases.
- The dataflow contains Query transforms only.
- For every DS function used, there is an equivalent function at the database level. This must also hold for any implicitly generated functions. For instance, when the data types of source and target columns differ, DS includes a conversion function, for which an equivalent function may not exist at the database level!
- There are no substitution parameters in the where-clause (replace them with global variables if necessary).
- Bulk loading is not enabled.
- The source sets are distinct for every target.
By making use of these techniques, you can increase the performance of jobs designed in BusinessObjects Data Services. Make sure you apply them while designing a job, and see the difference.