Data Mining Cluster Analysis in SQL Server

Grouping is something we do naturally in our day-to-day life. We group foods by taste, and we group friends by their different attributes.

Clustering is an algorithm that finds natural groupings inside your data when those groupings are not obvious. It finds the hidden variable that accurately classifies your data.

How Clustering Works in SSAS

As usual, you need to start by creating an SSAS project and adding a data source and a data source view. I have used the vTargetMail view of the AdventureWorksDW database as the data source view.

Then create a mining structure by right-clicking Mining Structures in Solution Explorer.

After selecting Clustering as the data mining algorithm, you can select the attributes you think are most appropriate for the case.

After creating the data mining structure and processing it, you can see the clusters and their relationships, as shown in the image below.
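If you prefer scripting to the wizard, an equivalent model can be created and trained with DMX. The following is a minimal sketch, not the exact wizard output: the model name [TM Clustering], the column list, and the [Adventure Works DW] data source name are assumptions to adjust for your project.

-- Create a clustering model over a few vTargetMail attributes.
-- The column choice is illustrative; pick the attributes relevant to your case.
CREATE MINING MODEL [TM Clustering]
(
    [Customer Key]   LONG KEY,
    [Age]            LONG CONTINUOUS,
    [Marital Status] TEXT DISCRETE,
    [Total Children] LONG DISCRETE
)
USING Microsoft_Clustering

-- Train (process) the model from the relational view.
INSERT INTO [TM Clustering]
    ([Customer Key], [Age], [Marital Status], [Total Children])
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, MaritalStatus, TotalChildren FROM dbo.vTargetMail')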

In this case, 10 clusters were created. The clusters are joined by lines that indicate the relationship between each pair of clusters. A thicker line means a stronger relationship, i.e. the two clusters are close to each other.

Using the slider available on the left side, you can view only the stronger links, as shown in the image below.

The above image shows that Cluster 8 is isolated from the others, while Cluster 2 and Cluster 5 are strongly related to each other. It also indicates that Cluster 3 and Cluster 6 form separate groups.

You then have the option of selecting a shading variable and, for the selected variable, a state. The clusters will be shaded accordingly, as shown in the following figure. This figure is for Total Children, and five figures are shown for the following states:

1. Very Low (<= 0)
2. Low (0 - 1)
3. Average (1 - 3)
4. High (3 - 4)
5. Very High (>= 4)

In the above diagram, it can be seen that Cluster 1 is dominated by customers with a low number of children and Cluster 6 by customers with a high number of children. Similarly, you can select any shading variable and state and verify which clusters it dominates.

From the Cluster Profiles tab, you can compare all the clusters on one screen. Continuous variables are displayed differently from discrete variables.

From the Cluster Characteristics tab, you can view the properties of the selected cluster. For example, Cluster 2 consists of customers aged 60 to 85 with a Marital Status of Married; other attributes and states are not very important for Cluster 2.
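The same characteristics can also be read programmatically with a DMX content query. A minimal sketch, again assuming the hypothetical [TM Clustering] model from earlier:

-- List each cluster with its size and a description of its dominant attributes.
-- NODE_TYPE = 5 identifies cluster nodes in the mining model content schema.
SELECT NODE_CAPTION, NODE_SUPPORT, NODE_DESCRIPTION
FROM [TM Clustering].CONTENT
WHERE NODE_TYPE = 5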

The next important option is cluster comparison, which can be done using the Cluster Discrimination tab.

In addition, you can compare a cluster with its complement. This helps to identify the features of the given cluster versus all the other clusters.

By looking at these properties, you can label the clusters meaningfully.

Prediction

An important feature in data mining is prediction, which can be done from the Mining Model Prediction tab using the following configuration.

After executing it, you will get the following output, which indicates which customer belongs to which cluster.
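The same prediction can be expressed directly as a DMX query. A sketch under the same assumed names as before ([TM Clustering] model, [Adventure Works DW] data source):

-- Assign each customer to its most likely cluster.
SELECT
    t.CustomerKey,
    Cluster() AS AssignedCluster,           -- most likely cluster for the case
    ClusterProbability() AS AssignmentProb  -- probability of that assignment
FROM [TM Clustering]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW],
        'SELECT CustomerKey, Age, MaritalStatus, TotalChildren FROM dbo.vTargetMail') AS t
ON  [TM Clustering].[Age] = t.Age
AND [TM Clustering].[Marital Status] = t.MaritalStatus
AND [TM Clustering].[Total Children] = t.TotalChildren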


Parameters

There are a few parameters that need to be set depending on the customer's requirements.

CLUSTER_COUNT

In business cases, before running the clustering algorithm you need to know how many clusters are required. The number of clusters to be built can be specified with CLUSTER_COUNT. The default value is 10. If this is set to 0, the algorithm will determine the appropriate number of clusters itself.

CLUSTER_SEED

CLUSTER_SEED specifies the seed number used to randomly generate clusters for the initial stage of model building. This is available only in the Enterprise Edition.

CLUSTERING_METHOD

Microsoft Clustering supports two clustering techniques: K-means (KM) and Expectation Maximization (EM).

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster (see http://en.wikipedia.org/wiki/K-means_clustering).
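Formally, K-means chooses cluster sets $S_1, \dots, S_k$ (with cluster means $\mu_i$) to minimize the within-cluster sum of squares:

$$\underset{S}{\operatorname{arg\,min}} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$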

KM is known as hard clustering because each object is assigned to one and exactly one cluster; KM clusters are disjoint and do not overlap. The EM cluster assignment method uses a probabilistic measure to determine which objects belong to which clusters. In EM clustering, any object can belong to multiple clusters, with an assigned probability for each. This technique is considered soft clustering because it allows clusters to overlap.

So for the CLUSTERING_METHOD parameter the possible values are 1-4, i.e.:

1 - Scalable EM
2 - Non-scalable (vanilla) EM
3 - Scalable K-means
4 - Non-scalable (vanilla) K-means

MAXIMUM_INPUT_ATTRIBUTES

This parameter defines the number of input attributes that the algorithm can handle before it invokes feature selection. The default value is 255; setting it to 0 turns off feature selection.

MAXIMUM_OUTPUT_ATTRIBUTES

This defines the number of output attributes that the algorithm can handle before it invokes feature selection. The default is 255; setting it to 0 turns off feature selection.

MAXIMUM_STATES

This parameter specifies the maximum number of attribute states that the algorithm supports. If an attribute has more states than this maximum, the algorithm uses the attribute's most popular states and treats the remaining states as missing. The default value is 100. This is an Enterprise Edition feature.

MINIMUM_SUPPORT

This parameter specifies the minimum number
of cases in each cluster. The default is 1.

MODELLING_CARDINALITY

This parameter specifies the number of
sample models constructed during the clustering process (default is 10).

SAMPLE_SIZE

This parameter specifies the number of cases that the algorithm uses on each pass if the CLUSTERING_METHOD parameter is set to one of the scalable clustering methods. Setting SAMPLE_SIZE to 0 causes the entire dataset to be clustered in a single pass, which can cause memory and performance issues. This is an Enterprise Edition feature.

STOPPING_TOLERANCE

This parameter specifies the value used to determine when convergence is reached and the algorithm has finished building the model. Convergence is reached when the overall change in cluster probabilities is less than STOPPING_TOLERANCE divided by the size of the model.
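In other words, training stops once

$$\Delta P < \frac{\text{STOPPING\_TOLERANCE}}{\text{model size}}$$

where $\Delta P$ denotes the overall change in cluster probabilities between passes.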

All of these parameters can be set using the Algorithm Parameters option, via the dialog shown below.
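The parameters can also be supplied in DMX when a model is added to a structure. A minimal sketch, assuming a hypothetical mining structure named [Target Mail] whose columns are already defined; the names and values are illustrative only:

-- Add a clustering model to an existing structure, overriding the defaults.
ALTER MINING STRUCTURE [Target Mail]
ADD MINING MODEL [TM Clustering EM]
(
    [Customer Key],
    [Age],
    [Marital Status],
    [Total Children]
)
USING Microsoft_Clustering
(
    CLUSTER_COUNT = 0,       -- 0 = let the algorithm choose the number of clusters
    CLUSTERING_METHOD = 1,   -- 1 = scalable EM
    MINIMUM_SUPPORT = 10,    -- require at least 10 cases per cluster
    SAMPLE_SIZE = 50000      -- cases per pass for the scalable methods
)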

Hard Clustering vs. Soft Clustering

Microsoft Clustering has two methods of cluster assignment: K-means and Expectation Maximization. As described under CLUSTERING_METHOD, K-means is hard clustering: each object is assigned to one and exactly one cluster, and the clusters are disjoint and do not overlap. Expectation Maximization is soft clustering: it uses a probabilistic measure to determine which objects belong to which clusters, so any object can belong to multiple clusters, with an assigned probability for each, and clusters can overlap. Though for simplicity many data analysts prefer K-means, in real-world scenarios an object can genuinely belong to multiple clusters. Hence, ideally, Expectation Maximization is the preferred clustering mechanism.
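The difference shows up directly in DMX: PredictHistogram(Cluster()) returns the full cluster-membership distribution for each case. A sketch under the same assumed names as the earlier queries:

-- Show the full cluster-membership distribution for each case.
-- Under EM (soft clustering) several clusters can carry real probability;
-- under K-means (hard clustering) one cluster takes essentially all of it.
SELECT
    t.CustomerKey,
    PredictHistogram(Cluster()) AS ClusterDistribution
FROM [TM Clustering]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW],
        'SELECT CustomerKey, Age, MaritalStatus, TotalChildren FROM dbo.vTargetMail') AS t
ON  [TM Clustering].[Age] = t.Age
AND [TM Clustering].[Marital Status] = t.MaritalStatus
AND [TM Clustering].[Total Children] = t.TotalChildren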



