Claytons Data Mining (Part 2)
In the ‘Detect Categories’ operation, we group together rows that are considered similar (even though we haven’t actually specified how they are deemed to be similar). These groups are referred to as clusters and each row was assigned to a cluster. We accept that the assignment is formed on the basis of best fit, that is, each row is put into the cluster that it is most like. This ‘law of averages’ must exist (otherwise we would have a cluster for each row) but we are faced with the possibility that sometimes, a row is assigned a cluster and exists on the cluster ‘fringe’ or possibly, a row is assigned to a cluster and it has a column value that is significantly different from the cluster average. These types of exceptions are often used to identify fraud because they deviate from the expected norm.
The data mining add in allows the user to apply these techniques to an excel table and determine which rows are out of the ordinary. In this example, we use the sheet ‘Table Analysis Tools Sample’. Click the Highlight Exceptions button and ensure that all fields are selected (except for column ID) and then ‘Run’ (see Figure 19). Note that this step (the columns checked) is used to define which columns are used to compare row similarity.
Figure 19 – Highlight Exceptions
There are two outputs formed by this operation.
Firstly an ‘Outliers’ sheet is added to the workbook (see Figure 20). This sheet is prefixed with the table name and the word ‘Outliers’. It shows the number of rows that are outliers based on an exception threshold. Figure 16 shows that, with a threshold of 75 (which is essentially an arbitrary number) there are 24 rows out of the ordinary (cell B19) and the column ‘Commute Distance’ is responsible for 5 of these exceptions. In order to increase or decrease the number of exceptions captured by the model, the threshold can be decreased or increased.
Figure 20 – Exceptions Report
The second output derived from the operation identifies the rows in the original data that are considered outliers. In Figure 21, we can see that row 18 has been highlighted and the Region column is shaded lighter than the rest of the row. This indicates that the current region (pacific) is not expected given the other column values. If we were to change the value to ‘North America’ (as is shown in the second row 18) the yellow highlighting would remove. This demonstrates that the identification is a real time interactive process. If we were to change the threshold on the outlier sheet, we would find the number of highlighted rows on the original data sheet would also change (to reflect the total number of outliers).
Figure 21 – Outlier Identification
The applications of data mining demonstrated in this article employ a clustering algorithm to produce outputs. The clusters derived can be susceptible to mixes of numeric and non numeric data. For example, if we changed values of the income column to ‘Low, Medium or High’ we would get different results than if we had used the current format. It is always a good idea to consider alternate formats for the data.
This article has shown how easy it can be to determine trends and gain knowledge from data without the need for heavy user interaction or technical expertise. Unlike traditional methods of data investigation, where the user must think of a question and then answer it, data mining provides a method of discovery that targets an outcome and searches for ways to achieve it. The interaction of Office and Analysis Service aides this discovery by exposing a powerful tool that allows every day users access to data mining techniques.