Executive Summary
The realization of the importance of data warehouses coupled with ever decreasing price of hardware and advanced computing options has resulted in datawarehouses of thousands of gigabytes over the last few years. This in turn has paved way for OLAP solutions that store detail and aggregate data for optimally answering complex business queries related to multi-dimensional data. However, with increasing data volumes and complexity of business requirements, OLAP solutions are also challenged to meet the processing and reporting SLA’s defined by business. Parallel Processing of OLAP cubes is a very powerful option for meeting such performance and timeline demands. This white paper is an initiative to understand the benefits Parallel Processing Utility provides in processing large Analysis Services 2000 cubes in high volume business scenarios. The paper also explains how improvements in Analysis Services 2005 ensure that Parallel Processing is inbuilt in the upgraded version and there is no requirement for Parallel Processing Utility in 2005.
Introduction
This white paper explains the benefits derived by use of Parallel Processing utility particularly where large volumes of data need to be processed as part of the overnight batch process. It explains the underlying data processing and storage needs in large businesses across domains with the implications of impact on overnight batch loads. The complex data processing requirements in terms of data volumes and business logic have implications on the OLAP and relational database design and in the use of optimal processing methods whereby over night batch processes can meet the SLA(service level agreement) and also address the business needs for large complex data sets. This white paper addresses how Parallel Processing Utility in Analysis Services plays a stellar role in addressing such requirements.
This white paper applies to SQL Server 2000 SP4 and assumes that the reader has a basic understanding of online analytical processing (OLAP) and Analysis Services 2000 in particular. This white paper is divided into 2 parts –
The first part talks of the business problem, the data storage needs and batch process requirements, the implications of these requirements on the OLAP design.
The second half of the paper deals with the performance problems faced with sequential processing and the performance benefits derived by the use of Parallel Processing Utility
(PPU) and the optimal processing of partitions in parallel with a given set of hardware and some recommendations on the use of Parallel Processing Utility.
The available hardware environment is also explained to give the readers an understanding of the test environment and the metrics mentioned in the paper are based on the given test environment. The metrics collected are based on a common set of business requirements across domains like Retail and Manufacturing and provide real life scenarios that can be addressed using the given technology suite.
The last part of the paper explains how these problems are overcome in Analysis Services 2005 and how parallel processing is in built in the software.
Business Requirements and their Implications on storage and processing
A common business requirement across domains is to store historic information to aid managers to analyze business trends based on historic data. For instance sales volume information over the years was used by managers to understand the seasonal fluctuation of sales and analyze the impact on sales of promotions. Also waste related information is key to understand vendor performance trends over time and in ranking vendors. Similarly raw material spends over time for a manufacturer helps in understanding cost price trends over time and the impact of finished goods costs.
- A typical requirement for any manufacturing, telecomm or retail business user is to store current quarter transaction data and historic 2 years transaction data. Hence at any given point in time the requirement is to store at least 9 quarters ( 9*3 = 27 months data).
- Usually for large businesses across domains transaction data volumes touch 35-40 million rows per month. In data volume terms this means storing data for 27 months with fact table volume at 945-1080 million rows.
- The large data volumes are further complicated by complex requirements like handling backdated updates. For instance Account Payment Vouchers data often have the backdated updates extending to 2 weeks. This implies that the overnight load needs to handle 15 days data instead of 1. This implies processing a larger volume of records in the overnight batch process. This also means that Incremental Processing of cubes is not possible. The performance implication of this requirement is shown in the figure (Incremental Processing) below –
In case of backdated updates the entire partition would need to be processed thereby increasing the batch load time.
- Handling such large volumes means the use of partitioning in relational database as well as OLAP databases. This is needed to address batch load performance as well as query performance.
- The reports need to be available early in the morning at the start of the business day. Which effectively makes the batch load window limited to anything between 2-3 hours.
The obvious benefits of partitioning being the following
- Improve the efficiency of the backing up of the data during downtime – in case of daily runs just the back up can be limited to the current partition and previous partition based on date ranges.
- Better query response times when date ranges are clearly defined – having multiple partitions results in lesser number of rows per partition. Hence the query response times would be better as the number of rows to traverse would be lesser.
- Ease of archiving of data – archiving data becomes a lot more manageable with partitions. Data can be archived and purged partition wise.
Considering the data volume of 900+ million rows in our case, the transaction related facts need to be partitioned by month. The idea behind monthly partitions being not to have very large number of partitions as well ensure that number of rows in a single partition were not too large.
Cube Design Considerations for handling large data volumes:
Dimension Storage and Processing –
The dimension design has a direct impact on processing performance. The design guidelines to be followed are as follows:
- Use MOLAP as storage mode for dimensions – MOLAP dimensions use a muti-dimensional storage area separate from the underlying dimension data while ROLAP dimensions use the underlying dimension data. MOLAP is the preferred storage mode as the dimensions involved would not have large number of members.
- Use of Unique Member Keys – The proper use of unique members , especially at the lowest level of the dimension ensures that unnecessary joins when processing a dimension are or processing a cube partition using a dimension can be avoided.
Cube Design –
Cubes are designed based on shared dimensions and MOLAP storage. Large cubes related to sales data in manufacturing and retail or call information in telecomm industry is usually partitioned by month to ensure optimal batch load processing and query performance.
Some basic design guidelines that have to be followed while designing the cubes in Analysis Services are:
- Each cube corresponds to a fact table in the underlying SQL Server relational database. Hence each cube in turn corresponds to specific subject area.
- The naming convention used for the cubes is very similar to that used for fact tables and is the list of subject areas of measures in a cube. For instance Sales cube would store all measures related to Sales.
- All cubes contain the measures stored in the underlying fact tables ( classified as Base Measures) in the cubes and calculated measures which are based on a calculation on set of Base Measures.
- All cubes are designed with 20% aggregations using the Storage Design Wizard. It Is also recommended that 6 months after the application goes live a usage pattern analysis of queries is performed to understand the effectiveness of existing aggregations and whether any additional aggregations are needed.
- Cubes are referenced by Shared Dimensions hence there is no additional overhead of processing dimensions while processing the cubes as the dimensions would already have been processed. Moreover the build time for cubes would be reduced due to the use of Shared dimensions.
- The maximum number of dimensions referenced by a cube is restricted to 4.
- Cubes based on large fact tables (like Receipts, Payments, Sales) are partitioned by month to ensure that query and processing performance of cubes is optimal.
- The storage mode for the cubes is MOLAP as MOLAP gives the best performance in terms of query response times and storage requirements.
- Use of Optimize Schema while doing cube design to ensure that unnecessary joins between facts and dimensions can be avoided resulting in better cube processing performance.