Introduction to Data Mining with SQL Server
Training Case Set Recommendations
The best method for assembling the training case set depends on the goal of the data mining effort. If the goal is to examine or reveal patterns, logical groupings, or categories, it is best to gather the largest amount of data possible. By sampling from this large body of data, a developer can be reasonably sure that the training case set is representative of the whole. In theory, random slices of records drawn from the entire data set form the ideal composition for the training case set. The data mining model will then strive to unearth patterns and groupings from this random sample, and because the records are chosen at random, the model is not supplied with preexisting groupings or logical relationships.
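As a rough illustration of the random sampling described above, the following Python sketch draws records uniformly at random, without replacement, from a larger data set. The function name, the record layout, and the sample size are hypothetical and chosen only for the example; in practice the sampling would more likely happen in the query that feeds the training case set.

```python
import random


def random_training_sample(records, sample_size, seed=None):
    """Return sample_size records drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(records, sample_size)


# Hypothetical source data: 1,000 customer records with an id and a region.
records = [{"id": i, "region": "north" if i % 2 else "south"} for i in range(1000)]

# Draw a 10 percent random sample to serve as the training case set.
training_set = random_training_sample(records, 100, seed=42)
```

Because every record has the same chance of being chosen, no grouping is imposed on the sample before the mining algorithm sees it.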
If the goal of the data mining effort is to produce information or predictions about a specific set of business process occurrences, it is wiser to select a smaller training case set. In fact, this smaller set must not be random, because it has to include the specific series of occurrences to be mined. By oversampling or filtering the source data to provide a smaller set containing the desired occurrences or situations, the data mining model will create a statistical representation of these desired events instead of random data.
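One possible way to build such an oversampled set is sketched below in Python: every record that contains the occurrence of interest is kept, and only a limited random selection of the remaining records is added. The function name, the `returned` flag, and the one-to-one ratio are assumptions made for this example, not part of any SQL Server tool.

```python
import random


def oversample(records, is_target, non_target_per_target=1, seed=None):
    """Keep every target record plus a limited random selection of the rest."""
    rng = random.Random(seed)
    targets = [r for r in records if is_target(r)]
    others = [r for r in records if not is_target(r)]
    # Never ask for more non-target records than actually exist.
    k = min(len(others), non_target_per_target * len(targets))
    return targets + rng.sample(others, k)


# Hypothetical source data: 1,000 orders, of which 20 were returned.
orders = [{"id": i, "returned": i % 50 == 0} for i in range(1000)]

# Keep all returned orders and an equal number of ordinary orders.
training_set = oversample(orders, lambda r: r["returned"], non_target_per_target=1, seed=7)
```

In this example returned orders make up 2 percent of the source data but half of the training case set, which is the kind of imbalance that later analysis has to account for.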
When creating an oversampled data mining model, it is important to realize that the training case set is based on a small portion of the data. As a result, corrections or allowances made during analysis must take into account the ratio between the data on which the model is based and the whole. Depending on the approach and purpose of the data mining effort, a middle position between the two recommendations above may provide users and analysts with relevant predictions or groupings.
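The kind of correction mentioned above can be illustrated with a standard prior adjustment based on Bayes' rule. This is one common technique, not necessarily the one the author has in mind, and the function and parameter names are hypothetical. Given a probability produced by a model trained on oversampled data, the share of target cases in the training set, and the share of target cases in the whole data set, it rescales the probability to the true proportions.

```python
def correct_for_oversampling(p_sample, sample_rate, true_rate):
    """Rescale a probability from an oversampled model to the true target rate.

    p_sample    -- probability reported by the model trained on oversampled data
    sample_rate -- fraction of target cases in the training case set
    true_rate   -- fraction of target cases in the whole data set
    """
    target = p_sample * true_rate / sample_rate
    other = (1.0 - p_sample) * (1.0 - true_rate) / (1.0 - sample_rate)
    return target / (target + other)


# Training set was half target cases; the whole data set has only 2 percent.
corrected = correct_for_oversampling(0.5, sample_rate=0.5, true_rate=0.02)
print(round(corrected, 4))  # prints 0.02
```

A prediction of 0.5 from the oversampled model carries no information beyond the base rate, so it maps back to the true rate of 0.02. When the two rates are equal, the probability is returned unchanged.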
Creating and Training the Data Mining Model
Once a selection of organizational data has been prepared for data mining, a mining algorithm has been chosen based on the type of data mining desired, and a training case set has been identified, architects are ready to create the data mining model and then train it using a series of training cases. The data mining model can be created in a number of ways. Developers and architects can create Analysis Server-side or client-side data mining models using the Analysis Services Mining Model Wizard, Decision Support Objects (DSO), and other applications. In addition, through new programming statements in the PivotTable Service’s library, developers can create session-level data mining models as well as permanent ones.
Once the physical data mining model has been created, the training case set is applied to the newly constructed model to begin its “training”. The training process allows the model to conform to the slice of data represented in the training case set and to develop into a statistical model of that data. After the model is created, whether with a CREATE MINING MODEL statement or with DSO by way of the Analysis Services Mining Model Wizard, the training case set is applied using an INSERT INTO statement. Essentially, the training case set is inserted into the physical structure of the mining model, which loosely resembles a table. Executing this insert query causes the data mining algorithm to examine the data set and generate statistical data, and this statistical data forms the foundation of the data mining model.
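To make the two-step sequence above more concrete, the following Python sketch assembles the text of a CREATE MINING MODEL statement and the matching INSERT INTO statement. It only builds strings and does not connect to Analysis Services. The model name, column names, column definitions, source table, and connection placeholder are all hypothetical, and the statement layout reflects an assumed reading of the OLE DB for Data Mining syntax that should be checked against the product documentation before use.

```python
def build_create_statement(model, columns, algorithm):
    """Assemble CREATE MINING MODEL text from (name, definition) column pairs."""
    column_defs = ",\n".join(f"    [{name}] {definition}" for name, definition in columns)
    return f"CREATE MINING MODEL [{model}]\n(\n{column_defs}\n)\nUSING {algorithm}"


def build_training_statement(model, columns, provider, connection, source_query):
    """Assemble the INSERT INTO text that feeds the training case set to the model."""
    column_list = ", ".join(f"[{name}]" for name, _ in columns)
    return (
        f"INSERT INTO [{model}] ({column_list})\n"
        f"OPENROWSET('{provider}', '{connection}', '{source_query}')"
    )


# Hypothetical model definition: predict a member card type from customer data.
columns = [
    ("Customer ID", "LONG KEY"),
    ("Gender", "TEXT DISCRETE"),
    ("Member Card", "TEXT DISCRETE PREDICT"),
]
create_text = build_create_statement(
    "Member Card Prediction", columns, "Microsoft_Decision_Trees"
)
train_text = build_training_statement(
    "Member Card Prediction",
    columns,
    "SQLOLEDB",
    "<connection string>",  # placeholder, not a real connection string
    "SELECT customer_id, gender, member_card FROM customer",
)
print(create_text)
print(train_text)
```

The first statement defines the table-like structure of the model and names the algorithm. The second supplies the training cases, at which point the algorithm processes them and the model is populated.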
When the Analysis Services Mining Model Wizard is used, this training query is created automatically from the inputs supplied to the wizard. The training case set is inserted into the model and analyzed by the chosen algorithm, and statistical data based on the training case set is generated. The training case insert statement can also be executed programmatically, using an MDX INSERT INTO statement and DSO. The Data Mining Sample Application supplied with SQL Server can be used to create MDX mining model training queries. Using DSO programmatically, multiple training queries can be applied to the data mining model, thereby increasing the effectiveness of the statistical component of the model itself.