Introduction to Data Mining with SQL Server
Analysis Services Data Mining Capabilities
SQL Server 2000 Includes Analysis Services with data mining technology which examines data, in relational data warehouse or data mart star schemas, as well as SQL Server 2000 Analysis Services OLAP cubes to uncover areas of interest to business decision makers and other analysts. Housed within this new segment of Analysis Services are data mining algorithms which enable the creation and analysis of data mining models. These models are exposed through a new OLE DB for Data Mining provider to calling applications and reporting tools.
OLAP Data Mining with SQL Server allows architects to reuse the results from data mining and incorporate the information into an OLAP Cube dimension for further analysis. Users can now browse and investigate data while satisfying the answers to questions that only data mining can answer. PivotTable Services supports data mining through emulating a similar interface to that of OLAP pivot tables. In essence, users can utilize a specially designed Excel spreadsheet to interface with data mining services and run prediction queries against a data mining model, just as they do for multidimensional analysis of OLAP data cubes.
In addition, the Decision Support Objects (DSO) library has been extended in order to accommodate direct programmatic access to the data mining functionality present within OLAP services, as well as continued programmatic access to the OLAP cube resources. DSO now includes the MiningModel object as well as other extended support objects. Coupled with the new OLE DB for Data Mining provider, organizations wishing to implement custom applications which provide predictive data mining data to meet business processes, or those organizations that wish to access data mining data for the purpose of creating an operational data mining application, can now do so using the DSO library.
What is a Data Mining Model?
Building and using data mining models can be accomplished through several methods in SQL Server 2000. The Analysis Server toolset provides several interfaces to achieve the creation of a data mining model. Essentially, a data mining model exists as a “sifter” to process existing OLAP or relational data to discover patterns and rules as well as make comprehensive predictions. Physically, the data mining model is defined as an object that stores data in a series of row sets. Utilizing various methods to create and access this model allows for system and solution architects to make this powerful decision support tool available to a wide range of users and applications.
The process of creating the data mining model is directly dependent on the methodology used to feed the entire data mining process. In essence, the method used to make data available to be mined governs the process used to create the data model. If a solutions architect designed a specialized OLAP data cube in Analysis Services to serve as the primary source of data mining data, then an OLAP data mining model would be created, as opposed to a relational data mining model.
Because data mining models provide users and applications the ability to execute an extremely sophisticated and intricate analysis of relational or OLAP data, a series of algorithms exists to facilitate the training and persistence of data into the mining model. In addition, similar to OLAP cubes, the mining model has a security structure that allows for authentication for access to data mining resources. It is conceivable to maintain several data mining models, in order to meet different organizational goals or identify different types of occurrences in business processes.
Using tools such as the Mining Model Wizard, the Data Mining Model Browser, as well as programmatic tools such as manipulation of the DSO library and the specific data mining TSQL predicates, developers and architects can train models, view and analyze data, retrieve predication and investigative record sets and implement security.
SQL Server Data Mining Algorithms
Data mining algorithms supply the methodology that the model construction is based on. In essence a data mining algorithm is principle in determining which decision making capabilities apply to a given set or case of data. These algorithms scour the case and provide the logical processing to determine the best method to classify, segregate and group data that will provide predictive or trend information.
In many instances, the type of algorithm chosen depends on the type of data mining desired. This is best determined by the end users and is often satisfied when the end users provide answers to the questions like “What type of questions do I want to ask my data?” If the users are interested in gaining predictive insight into current business processes, an appropriate algorithm must be utilized.
Each algorithm exists as a data mining provider, and as a result is an OLE DB provider which supports the OLE DB for Data Mining specification. Each algorithm can be accessed as a provider via an application and used in an appropriate scenario, depending on the algorithm provider’s capabilities. The following list classifies and describes the data mining algorithms available with SQL Server and provides examples to demonstrate their functionality.
Microsoft Decision Trees
The Microsoft Decision Trees algorithm provider is extremely useful during classification attempts for data mining case set data. In essence, the Decision Tree is akin to the balanced tree (b-tree ) structure of table indexes, in that like the b-tree, the Microsoft Decision Tree is navigated by examining chosen properties associated with the case set and sorting or classifying the results accordingly while transcending levels based on the data’s classification in previous levels. The Microsoft Decision Tree model differs from the typical b-tree structure, in that it does not evaluate data elements that fall into a range, but classifies them based on a true/false evaluation.
The Microsoft Decision Tree algorithm attempts to create a statistical image of business processes to facilitate predictions. Through the use of classification analysis, training case sets, and evaluation sets, this statistical image is persisted in the form of the finished data mining model. In Figure 1.4, the hardware sales organization is attempting to use the Microsoft Decision Tree algorithm to make predictions about purchased support and maintenance contracts and contract length, depending on contract type and product purchased.