Introduction to Data Mining with SQL Server
Figure 1.5 below shows a diagram of the Decision Tree algorithm. The Microsoft Decision Trees algorithm analyzes this case set in the following manner.
The decision tree determines which customers purchased which products
Figure 1.6 below shows the Purchase Product 1 = True node and explores the case data further by examining the customers’ support purchasing behavior.
The decision tree then determines which of the customers who purchased Product n (in this example, Product 1) also purchased Support type X (in this example, Premier Support)
Depending on the number of customers who purchased Premier Support and the number of years of that support type they purchased, the mining model will indicate to users that customers who purchase Product 1 are likely to purchase Premier Support. In addition, it can be derived from the mining model that these customers are likely to purchase 2 years of Premier Support.
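The way such a conclusion falls out of the case counts can be sketched in a few lines of Python. This is only an illustration of counting cases at each node, not the Microsoft Decision Trees algorithm itself; the case set, the customer names and the majority helper are hypothetical and were invented for this example.

```python
from collections import Counter

# Hypothetical case set: (customer, bought_product_1, support_type, support_years)
cases = [
    ("A", True, "Premier", 2),
    ("B", True, "Premier", 2),
    ("C", True, "Standard", 1),
    ("D", False, None, 0),
    ("E", True, "Premier", 2),
]

def majority(values):
    """Return the most common value and its share of the cases in the node."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / len(values)

# First split: keep only the "Purchase Product 1 = True" node.
product_1_node = [c for c in cases if c[1]]

# Second split: which support type dominates within that node?
support, support_share = majority([c[2] for c in product_1_node])

# Third split: among buyers of that support type, which term dominates?
years, years_share = majority([c[3] for c in product_1_node if c[2] == support])

print(f"Product 1 buyers: {len(product_1_node)}")
print(f"Most common support type: {support} ({support_share:.0%})")
print(f"Most common term: {years} years ({years_share:.0%})")
```

The shares printed alongside each value indicate how strongly the cases in a node support the conclusion drawn from it.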
In the above example, because the training case set is so small, the mining model suffers from overfitting. Because the data supplied to the model is too small to be representative of the full data set, the model can form a node for each distinct case. As a result, conclusions like the previous one may be oversimplified and may not reflect the customers’ actual purchasing patterns. In essence, the decision tree fits this data set too closely, and predictions based on it would be unreliable. However, this example illustrates how the Microsoft Decision Trees algorithm, when supplied with enough data in a training case set, can create statistical representations of historical data to allow for predictions.
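A minimal sketch of this effect, assuming a hypothetical case set keyed by a made-up region attribute: a model that forms one leaf per distinct training case reproduces the training data perfectly but does poorly on cases it has not seen. The data and the lookup-table model are illustrative assumptions, not output from Analysis Services.

```python
from collections import Counter

# Hypothetical cases: (region, years_of_support); years_of_support is the target.
training = [("East", 2), ("West", 1), ("North", 2)]
holdout = [("East", 1), ("West", 2), ("South", 1), ("North", 2)]

# An overfit "tree": one leaf per distinct training case.
leaves = {region: years for region, years in training}

# Fallback for regions never seen in training: the most common training target.
fallback = Counter(years for _, years in training).most_common(1)[0][0]

def predict(region):
    """Look up the memorized leaf, or fall back to the overall majority."""
    return leaves.get(region, fallback)

def accuracy(cases):
    """Fraction of cases whose target matches the prediction."""
    hits = sum(predict(region) == years for region, years in cases)
    return hits / len(cases)

print("Training accuracy:", accuracy(training))
print("Holdout accuracy:", accuracy(holdout))
```

The gap between the two accuracy figures is the symptom to look for: a model that merely memorizes a small case set scores well on that set and poorly elsewhere.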
The Microsoft Clustering algorithm is based on grouping, or clustering, business data and events together in order to reveal logical groupings of related segments of data. These groupings often add insight into how a business sells products, views customers and arranges service offerings to consumers. The algorithm uses the similarity of attributes in the training case set to group the data into logical clusters.
Through an iterative process employing weighted calculations, Microsoft Clustering continually refines the grouping of data attributes into clusters, revealing hidden relationships in the data. This type of data mining algorithm is best used when business users or analysts need to mine a large body of transactional or organizational data for patterns and groupings that offer insight into tailoring the business focus to new opportunities.
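The general idea of iterative refinement can be sketched with plain k-means, used here only as a stand-in: it is not the Microsoft Clustering algorithm and it does not apply the weighted calculations mentioned above. The customer attributes (orders per year, average order value) and the starting centers are hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Squared Euclidean distance from the point to each current center.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                # New center is the per-attribute mean of the cluster's points.
                new_centers.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centers.append(old)  # leave the center of an empty cluster unchanged
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, clusters

# Hypothetical customers: (orders per year, average order value)
points = [(1, 20), (2, 25), (1, 30), (10, 200), (12, 220), (11, 210)]
centers, clusters = kmeans(points, [points[0], points[-1]])

for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

Each pass reassigns cases and moves the centers until nothing changes, which is the sense in which the groupings are refined iteratively. In practice the attributes would usually be scaled first so that no single attribute dominates the distance.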
The Training Case
A training case set is a set of data that is representative of the majority of the OLAP data to be mined. It is used to establish the rules and patterns that form the foundation of the data mining model. Once the data mining model has been constructed from this set of information, users will be able to browse its predictions and clusters and gain insight into business processes. Essentially, the training case set is used by the model to construct a statistical representation of the data. Based on the rules and patterns extrapolated from the case set, the data mining model will sift through the remaining data and attempt to apply these rules to reveal classifications, groupings and predictions.
There are a number of approaches to evaluate when determining the best data to compose a training case set. Ideally, the training case set will embody the best representation of your data. At this juncture, knowledge of sampling techniques and the grain and domain of the data is required to produce a mining model with a strong statistical foundation.
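As one example of a sampling technique, a simple random sample can be drawn to form the training case set. The sketch below is a minimal illustration under stated assumptions: the function name split_training_cases, the 20 percent fraction, the fixed seed and the integer case IDs are all hypothetical. Where the data has important subgroups, a stratified sample may give a more representative set.

```python
import random

def split_training_cases(case_ids, fraction=0.2, seed=42):
    """Draw a simple random sample of case IDs as the training case set.

    Returns (training_ids, remaining_ids). A fixed seed keeps the split
    reproducible between runs.
    """
    rng = random.Random(seed)
    k = max(1, round(len(case_ids) * fraction))
    training = set(rng.sample(case_ids, k))
    remaining = [c for c in case_ids if c not in training]
    return sorted(training), remaining

# Hypothetical case IDs standing in for rows of the source data.
case_ids = list(range(1, 101))
training_ids, remaining_ids = split_training_cases(case_ids)

print(len(training_ids), "training cases,", len(remaining_ids), "remaining cases")
```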
A training case set can be identified in multiple ways. Physically separating the OLAP or relational data into a separate data structure is an excellent method to cleanly identify the training case set. Once the data is contained in a separate physical database or OLAP cube schema, a training query can be issued against that staging structure. The alternative is to query the larger data set directly without separating the training set data. The first approach is recommended because it allows the training case set to be redefined more dynamically without requiring a change to the training case set query. The following recommendations outline the optimum selection methods for training cases.