Claytons Data Mining (Part 2)
This article is the second installment of Claytons data mining.
Each row in the spreadsheet is considered to be a discrete observation. This is easy to understand when we look at the data in the sheet ‘Table Analysis Tools Sample’ because each row represents a customer. We also know particular attributes about the customer, and these are listed in the columns. For example, given a customer number, we know their marital status, gender, education and so on.
Often, when we are faced with data like this, we wish to group the data in to similar buckets. The name given to this type of separation is called classification. If you only classify by only one column, the breakup can be as simple as splitting the data by into distinct values. For example, if we classified by ‘Bike Buyer’ we would have two categories of customers being those that bought a bike and those that did not. However, simply separating data of the basis of the values in the columns can raise several issues, most noticeably the number of buckets that are formed. For example, in the sheet ‘Table Analysis Tools Sample’ there is five types of education and five types of occupation. Categorizing on these two fields would give us 25 buckets! Additionally, in performing such an action, we are only listing customers that had a specific type of occupation for a specific type of education. Imagine if we had 3 sales reps, how would we allocate the 25 buckets to the reps?
The concept of clustering addresses all the issues discussed above. This simple concept starts with identification and grouping of similar rows (or observations) until a desired number of buckets (or clusters) has been achieved. Discussing how these rows are considered similar is outside the scope of this article however, it is important to note that the similarity is determined by columns that the user has chosen.
To cluster the data in the worksheet ‘Table Analysis Tools Sample’, simply
- Click the ‘Detect Categories’ button
- Chose (check) the columns that we wish to include in the similarity calculations (Figure 9)
- Specify the number of categories (or leave as <auto-detect>) (Figure 9)
Figure 9 – Category Detection
There are two outputs that are generated by the operation. Firstly, the original data has been appended with an additional ‘Category’ column. This identifies the category (or cluster) that each row was assigned. Figure 9 (above) shows that this is an optional output from the operation. Un-checking the ‘Append a Category column…’ will not produce this output, however, it is suggested that you leave it checked so that you can partition you data at a later date.
The second output is the insertion of a new sheet ‘Categories Report’ which summarises the categories formed and includes a chart for visualization. This is shown below in Figure 10 and Figure 11. We can easily see, how many rows have been assigned to each category (Figure 10, rows 5-11) and what attributes were important in determining that categories constitution (Figure 10 rows 16 to 28). Note that the ‘Category Characteristics’ table is filtered for ‘Category 1’ and can be applied to any category (or all of them).
Figure 10 – Category Summary Output
Figure 11 – Category Visualisation
The chart added as output (Figure 11) defaults to a first alphabetical category (or column name) in the list of fields chosen. While this can show relative proportions of the attribute chosen (eg Age), it is often more convenient to change the chart to a stacked chart (rather than 100% stacked) so that you can easily see the number of people in each attribute. Additional columns can easily be added to the x Axis through chart tools.