Data Mining Cluster Analysis in SQL Server

Grouping is something we naturally do in our day-to-day lives: we group foods by taste, and we group friends by their different attributes.

Clustering is an algorithm that finds natural groupings in your data when those groupings are not obvious. It uncovers the hidden variables that accurately classify your data.

How Clustering Works in SSAS

As usual, you need to start by creating an SSAS project and adding a data source and a data source view. I have used the vTargetMail view of the AdventureWorksDW database as the data source view.

Then create a mining structure by right-clicking Mining Structures in Solution Explorer and selecting New Mining Structure.

After selecting Microsoft Clustering as the data mining algorithm, you can select the attributes you think are most appropriate for the case.
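For reference, the same model can be defined directly in DMX rather than through the wizard. Below is a minimal sketch, assuming a handful of vTargetMail columns and a hypothetical model name; adjust the column list and content types to match your own attribute selection.

    -- Minimal DMX sketch of a clustering model over vTargetMail columns.
    -- The model name and column selection are assumptions; train the model
    -- afterwards with an INSERT INTO statement or by processing it in SSAS.
    CREATE MINING MODEL [TargetMailClustering]
    (
        CustomerKey   LONG KEY,
        Age           LONG CONTINUOUS,
        TotalChildren LONG DISCRETE,
        MaritalStatus TEXT DISCRETE,
        YearlyIncome  DOUBLE CONTINUOUS
    )
    USING Microsoft_Clustering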

After creating the mining structure and processing it, you can view the clusters and their relationships as shown in the image below.

In this case, 10 clusters were created. The clusters are joined by lines that indicate the relationship between each pair of clusters: the thicker the line, the stronger the relationship, meaning those two clusters are close to each other.

Also, using the slider available on the left side, you can view only the stronger links, as shown in the image below.

The image above shows that Cluster 8 is isolated from the others, while Cluster 2 and Cluster 5 are strongly related to each other. It also indicates that Cluster 3 and Cluster 6 have formed separate groups.

You then have the option of selecting a variable and, for the selected variable, choosing a state. The clusters will be shaded accordingly, as shown in the following figure. This figure is for Total Children, and five views are shown for the following states:

1. Very Low (<= 0)

2. Low (0-1)

3. Average (1-3)

4. High (3-4)

5. Very High (>= 4)

In the diagram above, it can be seen that Cluster 1 is dominated by a low number of children and Cluster 6 by a high number of children. Similarly, any shading variable can be selected to verify which clusters are characterized by it.

From the Cluster Profiles tab, you can compare all the clusters on one screen. Continuous variables are displayed differently from discrete variables.

From the Cluster Characteristics tab, you can view the properties of the selected cluster. As shown for Cluster 2, the customers range in age from 60 to 85 and have a Marital Status of Married. The other attributes and states are not very significant for Cluster 2.

The next important option is cluster comparison, which can be done using the Cluster Discrimination tab.

In addition, you can compare a cluster with its complement. This helps identify the features of the given cluster versus all the other clusters.

By looking at these properties, you can label the clusters, as sketched in the query below.
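If you prefer to read the cluster properties programmatically rather than through the viewer, the model content can be queried in DMX. A minimal sketch, reusing the hypothetical model name from above; in Microsoft Clustering model content, NODE_TYPE = 5 identifies cluster nodes.

    -- Sketch: list each cluster's caption, case count, and characteristic
    -- description so the clusters can be labeled (model name is an assumption).
    SELECT NODE_CAPTION, NODE_SUPPORT, NODE_DESCRIPTION
    FROM [TargetMailClustering].CONTENT
    WHERE NODE_TYPE = 5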

Prediction

An important feature in data mining is prediction, which can be done from the Mining Model Prediction tab using the following configuration.

After executing it, you will get the following output, which indicates which cluster each customer belongs to.
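The same prediction can also be expressed as a DMX query. Below is a sketch, assuming the hypothetical model name from earlier and an Adventure Works DW data source; the joined columns must match those used to train the model.

    -- Sketch: assign each vTargetMail case to a cluster with a confidence score.
    -- Model, data source, and column names are assumptions; adjust to your project.
    SELECT
        t.CustomerKey,
        Cluster()            AS ClusterName,  -- predicted cluster for the case
        ClusterProbability() AS Probability   -- probability of that assignment
    FROM [TargetMailClustering]
    PREDICTION JOIN
        OPENQUERY([Adventure Works DW],
            'SELECT CustomerKey, Age, TotalChildren, MaritalStatus, YearlyIncome
             FROM dbo.vTargetMail') AS t
    ON  [TargetMailClustering].Age           = t.Age
    AND [TargetMailClustering].TotalChildren = t.TotalChildren
    AND [TargetMailClustering].MaritalStatus = t.MaritalStatus
    AND [TargetMailClustering].YearlyIncome  = t.YearlyIncome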

Parameters

There are a few parameters that need to be set depending on the business requirement.

CLUSTER_COUNT

In business cases, you need to know how many clusters are required before running the clustering algorithm. The number of clusters to build can be specified with CLUSTER_COUNT. The default value is 10. If this is set to 0, the algorithm will determine the appropriate number of clusters itself.

CLUSTER_SEED

CLUSTER_SEED specifies the seed used to randomly generate clusters in the initial stage of model building. This is available only in the Enterprise Edition.

CLUSTERING_METHOD

Microsoft Clustering uses two clustering techniques: the K-means (KM) algorithm and the Expectation Maximization (EM) algorithm.

"K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster." (http://en.wikipedia.org/wiki/K-means_clustering)

KM is known as hard clustering because each object is assigned to one and exactly one cluster; KM clusters are disjoint and do not overlap. The EM cluster assignment method uses a probabilistic measure to determine which objects belong to which clusters. In EM clustering, any object can belong to multiple clusters, with an assigned probability for each. This technique is considered soft clustering because it allows clusters to overlap.

The possible values for the CLUSTERING_METHOD parameter are 1-4:

1 – Scalable EM

2 – Vanilla (Non-scalable) EM

3 – Scalable K-means

4 – Vanilla (Non-scalable) K-means

MAXIMUM_INPUT_ATTRIBUTES

This parameter defines the number of input attributes that the algorithm can handle before it invokes feature selection. The default value is 255; setting it to 0 turns off feature selection.

MAXIMUM_OUTPUT_ATTRIBUTES

This defines the number of output attributes that the algorithm can handle before it invokes feature selection. The default value is 255; setting it to 0 turns off feature selection.

MAXIMUM_STATES

This parameter specifies the maximum number of attribute states that the algorithm supports. If an attribute has more states than the maximum, the algorithm uses the attribute's most popular states and treats the remaining states as missing. The default value is 100. This is an Enterprise Edition feature.

MINIMUM_SUPPORT

This parameter specifies the minimum number of cases in each cluster. The default is 1.

MODELLING_CARDINALITY

This parameter specifies the number of sample models constructed during the clustering process (default is 10).

SAMPLE_SIZE

This parameter specifies the number of cases that the algorithm uses on each pass if the CLUSTERING_METHOD parameter is set to one of the scalable clustering methods. Setting SAMPLE_SIZE to 0 causes the entire dataset to be clustered in a single pass, which can cause memory and performance issues. This is an Enterprise Edition feature.

STOPPING_TOLERANCE

This parameter specifies the value used to determine when convergence is reached and the algorithm has finished building the model. Convergence is reached when the overall change in cluster probabilities is less than STOPPING_TOLERANCE divided by the size of the model.

All of these parameters can be set using the Algorithm Parameters option, via the dialog shown below.
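For reference, the same parameters can also be supplied in the USING clause when a model is created in DMX. A sketch, reusing the hypothetical column list from earlier with a new model name; the parameter values here are illustrative only.

    -- Sketch: clustering model with explicit algorithm parameters.
    -- Model/column names and parameter values are illustrative assumptions.
    CREATE MINING MODEL [TargetMailClustering_EM]
    (
        CustomerKey   LONG KEY,
        Age           LONG CONTINUOUS,
        TotalChildren LONG DISCRETE,
        MaritalStatus TEXT DISCRETE,
        YearlyIncome  DOUBLE CONTINUOUS
    )
    USING Microsoft_Clustering
    (
        CLUSTER_COUNT = 0,      -- let the algorithm choose the number of clusters
        CLUSTERING_METHOD = 1,  -- scalable EM
        MINIMUM_SUPPORT = 10,   -- require at least 10 cases per cluster
        STOPPING_TOLERANCE = 10
    )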

Hard Clustering vs. Soft Clustering

As noted under CLUSTERING_METHOD, Microsoft Clustering offers two cluster assignment methods: K-means and Expectation Maximization. K-means is hard clustering: each object is assigned to exactly one cluster, and the clusters are disjoint and do not overlap. Expectation Maximization is soft clustering: it uses a probabilistic measure, so any object can belong to multiple clusters with an assigned probability for each, and clusters can overlap. Though for simplicity many data analysts prefer K-means, in real-world scenarios any object can belong to multiple clusters, so Expectation Maximization is ideally the preferred clustering mechanism. The query sketch below shows how the soft assignments can be inspected.
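With an EM model, the full membership distribution can be retrieved in DMX with PredictHistogram, which returns one row per cluster with its probability instead of only the top assignment. A sketch, reusing the hypothetical names from the earlier examples:

    -- Sketch: full cluster-membership distribution per case (soft clustering).
    -- PredictHistogram(Cluster()) returns a nested table with a $Probability
    -- column for every cluster. Names are assumptions from earlier examples.
    SELECT
        t.CustomerKey,
        PredictHistogram(Cluster()) AS ClusterDistribution
    FROM [TargetMailClustering_EM]
    PREDICTION JOIN
        OPENQUERY([Adventure Works DW],
            'SELECT CustomerKey, Age, TotalChildren, MaritalStatus, YearlyIncome
             FROM dbo.vTargetMail') AS t
    ON  [TargetMailClustering_EM].Age           = t.Age
    AND [TargetMailClustering_EM].TotalChildren = t.TotalChildren
    AND [TargetMailClustering_EM].MaritalStatus = t.MaritalStatus
    AND [TargetMailClustering_EM].YearlyIncome  = t.YearlyIncome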

Dynamic Data Collection

It is important to note that the Microsoft Clustering algorithm does not support dynamic data collection. This means the entire dataset must be collected before applying the clustering technique.

References

http://www.mssqltips.com/sqlservertip/2993/data-mining-clustering-example-in-sql-server-analysis-services-ssas/

http://msdn.microsoft.com/en-us/library/cc645761.aspx

http://msdn.microsoft.com/en-us/library/ms175595.aspx

http://msdn.microsoft.com/en-us/library/ms174879.aspx

https://www.youtube.com/watch?v=WvR_0Vs1U8w

https://projectbotticelli.com/knowledge/why-cluster-and-segment-data-video-tutorial
