Introduction to Data Mining with SQL Server

Can I Mine My Data?

The key to unlocking the potential of data mining using SQL Server is to begin the process of analyzing the current data as well as the storage methodology for the enterprise’s data. An analysis should begin with current reporting data stores, techniques and distribution of operational and analysis reports. By analyzing how the enterprise receives information regarding its business processes, the business decision maker can determine what type of information is available to data mine. Simultaneously, from a business process perspective, analysts and decision makers should begin active discovery of existing OLAP data to uncover the types of questions that will be satisfied by a data mining initiative. This exercise is beneficial to the data mining development process, because business users

Physical Data Sources for Data Mining

Physically, the data for most enterprises can facilitate data mining. Many contemporary enterprises have constructed OLAP resources, including data marts and data warehouses for their analysts and decision makers to use in analyzing aggregate historical data. It is these OLAP resources that must exist and are optimal sources for a data mining effort to begin.

Sources which are not indicative to data mining exist as any OLTP or transactional database. Although an enterprise may have a series of operational reports based in an OLTP structure, OLAP processes, reporting, and data mining are not designed to function based on a transactional database. The structure of a transactional database is designed for the efficient and typically short transactions, characterized by reads, inserts, updates and deletions. Extended, read intensive, processes will be harmful to both the OLAP attempt and the OLTP processes.

The Enterprise Data Warehouse as a Data Mining Source

An enterprise data warehouse is an excellent source for locating data mining data. Because of the nature of a data warehouse, most pertinent data that has been selected by analysts and business users should be located within the warehouse structure. In addition, this data is organized and stored for the explicit purpose of reporting. Through the data warehouse, further processing of OLAP data can occur. This processing can take the form of additional aggregations into multidimensional cubes (i.e., SQL Server 2000 Analysis Services Cubes) or undergo further segregation into organizational data marts.

The data mining process will utilize the data in the enterprise data warehouse, based on user selection and location of pertinent data, to test and validate a data mining model. It is important that the data be granular enough to analyze. Data that is characterized by significant aggregations beyond the original grain of the data will not produce significant results when used to create or test against a mining model.

An enterprise data warehouse is a prime source for data mining data because the data housed within the warehouse has already undergone significant data additions, modifications and cleansing based on business rules and processes. Refined Extraction Transformation and Loading (ETL) processes are required for reliable OLAP and enterprise data warehousing reporting. It is the ETL process which is responsible for cleansing bad data from the OLTP source, reclassifying or aggregating granular transactions from the operational system, and enriching the data with more readable and comprehensible data as opposed to operational codes and abbreviations used in an OLTP system. Once the data has been sufficiently cleansed and refined, it is ripe for data mining.

Spotlight on Data Transformation for Data Mining

Typically, data warehouses contain granular data representing cleansed and enriched transactions from the source OLTP system. The granularity of this data may exist in its finest form, and then be aggregated at later stages in the enterprise OLAP reporting solution (i.e., multidimensional OLAP cubes or physical aggregated data marts). Often, it is important to modify or enrich the data to include parameters and scopes which will be more indicative to data mining.

Depending on the architecture of the enterprise data warehouse solution, transactions transformed by an existing ETL process can be duplicated and modified to feed an additional data mining-specific data mart. In essence, the goal is to develop a functional version of the data warehouse solely for the purpose of deriving data mining data. For the purpose of this discussion, this new data mining derived portion of the enterprise data warehouse is truly a data mart, in that it exists as a specialized subset of the larger enterprise warehouse.

This new data mining data mart should be architected with a rudimentary understanding of how to format data to increase its propensity for functional mining. For example, the data mining process works best with data that is not overly aggregated. Because of high aggregations which are typically built into enterprise data warehouse subject areas, or into data marts, a “smearing” of the underlying trends of the data may occur, because the mining model will not be able to include the granular details.

Such a concern may not come to light if a specialized data mining data mart is designed to accommodate for the specific needs of a data mining model. Another example which characterizes the need for a specialized data mining data mart is the level of data enrichment that should occur in ETL processing during the nightly data mart population.

Traditionally, granular data warehouses contain the lowest transactional level in base star schemas. During ETL for data mining, the goal is to load a custom data mart by enriching the data with ranges. Data mining works best on data that has a low level of selectivity. Figure 1.3 presents sample records from a customer table that originates from the OLTP source and becomes an enriched, data mining ripe source of customer information.

Figure 1.3
Data Demonstration – ETL Data Enrichment — Substitutes ranges for granular data


Leave a comment

Your email address will not be published.