Claytons Data Mining (Part 1)
Now that the table (data) is defined and the add-in can connect to Analysis Services, we are ready to perform simple data mining tasks. The tasks examined in this article are:
Key Influencers: determines which fields, and which of their values, are most important in determining the value of a target column.
Detect Categories: identifies rows that are similar and groups them into buckets. These buckets are referred to as clusters because the rows (and the data within them) are considered similar.
Fill From Example: predicts what an unknown value for a column should be based on how previous values have been determined.
Forecasting: predicts future values in a time series based on trends derived from the underlying data.
Highlighting Exceptions: identifies rows in the data that have unexpected values in them. These rows are considered outliers and are in some way abnormal when compared to other (similar) rows in the same category.
When exploring data, we often seek to identify predictor data that points to a cause and effect relationship. An example of such a relationship can be found in the statement ‘On a very hot day, there is likely to be an increase in drink sales’. The mining add-in allows the user to conduct this analysis easily, simply by identifying the column to be explained (the target). The add-in then determines which other columns (and which values in those columns) are most likely to result in the target outcome.
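To make the hot day example concrete, the following minimal Python sketch compares how often drink sales are high on hot days against how often they are high overall. The data, the `rate` helper and the variable names are all hypothetical, and this is only an illustration of the idea of a predictor, not the calculation the add-in performs.

```python
# Hypothetical daily observations: (weather, drink sales level).
days = [
    ("hot", "high"), ("hot", "high"), ("hot", "high"), ("hot", "low"),
    ("mild", "high"), ("mild", "low"), ("mild", "low"), ("mild", "low"),
]


def rate(rows, outcome):
    """Return the share of rows whose sales level equals `outcome`."""
    return sum(1 for _, level in rows if level == outcome) / len(rows)


# Share of all days with high sales.
baseline = rate(days, "high")

# Share of hot days with high sales.
hot_days = [day for day in days if day[0] == "hot"]
hot_rate = rate(hot_days, "high")

# A ratio above 1 suggests that 'hot' is a useful predictor of high sales.
lift = hot_rate / baseline

print(f"baseline={baseline:.2f} hot={hot_rate:.2f} lift={lift:.2f}")
```

In this made-up data, high sales occur on 3 of the 4 hot days but on only 4 of the 8 days overall, so a hot day makes high sales more likely than usual.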
In the first example, we use the sheet ‘Table Analysis Tools Sample’ to find what is driving bike sales. The question is ‘What type of people are likely to buy a bike?’. The data in the spreadsheet shows a list of customers along with attributes about them, including whether or not they have purchased a bike. The column ‘Purchased Bike’ becomes our target (what we are trying to understand). Simply:
- Click the ‘Analyze Key Influencers’ button and set the analyze column to ‘Purchased Bike’ (Figure 7)
- Run the algorithm (click run)
- The ‘Key Influencers Output’ is produced
- Close the report dialog (Figure 8)
Figure 7 – Set the target Column
Figure 8 – Close the Report Generation Dialog
The output from the Key Influencers algorithm is quite easy to understand. The ‘Favors’ column shows the target outcome (whether someone would purchase a bike). The ‘Relative Impact’ column shows the strength of the column and value combination. For example, if a person has 2 cars, they are not likely to purchase, and if they have 0 cars, they are likely to purchase. Also note that having 2 cars is much more important (more than twice as important) than the next most important factor (being married).
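The shape of this report can be approximated with a short Python sketch. It is a simplified stand-in and not the add-in's actual algorithm: for each column and value combination it measures how far the share of each outcome moves away from the overall share, reports the outcome that gains the most as ‘Favors’, and scales the scores so the strongest is 100. The customer rows, the `key_influencers` function and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical customer rows (not the sample workbook's data).
customers = [
    {"Cars": 0, "Marital Status": "Single", "Purchased Bike": "Yes"},
    {"Cars": 0, "Marital Status": "Married", "Purchased Bike": "Yes"},
    {"Cars": 0, "Marital Status": "Single", "Purchased Bike": "Yes"},
    {"Cars": 1, "Marital Status": "Married", "Purchased Bike": "Yes"},
    {"Cars": 1, "Marital Status": "Single", "Purchased Bike": "No"},
    {"Cars": 2, "Marital Status": "Married", "Purchased Bike": "No"},
    {"Cars": 2, "Marital Status": "Married", "Purchased Bike": "No"},
    {"Cars": 2, "Marital Status": "Married", "Purchased Bike": "No"},
]


def key_influencers(rows, target):
    """Rank (column, value) pairs by how far they shift the target outcome."""
    total = len(rows)
    baseline = Counter(row[target] for row in rows)

    # Count the target outcomes seen with every (column, value) pair.
    counts = defaultdict(Counter)
    for row in rows:
        for column, value in row.items():
            if column != target:
                counts[(column, value)][row[target]] += 1

    scored = []
    for (column, value), outcomes in counts.items():
        n = sum(outcomes.values())
        # Pick the outcome whose share rises most above its overall share.
        favors, shift = max(
            ((o, outcomes[o] / n - baseline[o] / total) for o in baseline),
            key=lambda pair: pair[1],
        )
        if shift > 0:  # ignore pairs that do not move the outcome at all
            scored.append((column, value, favors, shift))

    if not scored:
        return []

    # Scale so that the strongest influencer has a relative impact of 100.
    top = max(item[3] for item in scored)
    ranked = [(c, v, f, round(100 * s / top)) for c, v, f, s in scored]
    return sorted(ranked, key=lambda r: (-r[3], r[0], str(r[1])))


for column, value, favors, impact in key_influencers(customers, "Purchased Bike"):
    print(f"{column:15} {value!s:8} favors {favors:3} impact {impact}")
```

With this made-up data, having 0 cars favors ‘Yes’ and having 2 cars favors ‘No’, both with the highest relative impact, while marital status appears lower in the ranking. Owning 1 car does not appear at all because it does not move the outcome away from the overall rate.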