Using Decision Trees in SQL Server

As a practical introduction to decision trees, imagine you are assessing the risk of a PC being infected with a virus. The data below, on factors affecting that risk, is available on MSDN.

Shares files   Uses scanner   Infected before   Risk
------------   ------------   ---------------   ------
Yes            Yes            No                High
Yes            No             No                High
No             No             Yes               Medium
Yes            Yes            Yes               Low
Yes            Yes            No                High
No             Yes            No                Low
Yes            No             Yes               High

A convenient way to visualize this is to draw a tree
diagram as shown below.

If you look at the above diagram, when files are shared (Shares Files = Yes) and no scanner is used, whether the PC was infected before is irrelevant: the risk is high. On the other hand, if files are not shared (Shares Files = No) and the PC was not infected before, the risk is low whatever the scanner setting is. This example illustrates the usefulness of decision trees for making appropriate decisions.
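
Purely as an illustration, a trained tree is nothing more than a set of nested rules. Here is a minimal T-SQL sketch encoding the branches above as a CASE expression; the PCProfile table and its column names are hypothetical.

    -- Hypothetical table of PC attributes; the CASE encodes the tree above.
    SELECT pc_id,
           CASE
               WHEN SharesFiles = 1 AND UsesScanner = 0 THEN 'High'    -- shared, no scanner: high either way
               WHEN SharesFiles = 1 AND InfectedBefore = 1 THEN 'Low'  -- shared, scanned, infected before
               WHEN SharesFiles = 1 THEN 'High'                        -- shared, scanned, never infected
               WHEN InfectedBefore = 1 THEN 'Medium'                   -- not shared, infected before
               ELSE 'Low'                                              -- not shared, never infected
           END AS Risk
    FROM dbo.PCProfile;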

Decision Trees in SQL Server

The Microsoft Decision Trees algorithm was introduced in SQL Server 2000, along with Microsoft's clustering technique.

As an example, let's examine the vTargetMail view in the AdventureWorksDW2012 database.

After creating the SSAS project, AdventureWorksDW2012 was added as the data source and vTargetMail was added to the Data Source Views. Since those steps are fairly simple, I won't go into their details; if you are new to SSAS, search for an introductory guide on getting started with SSAS. After doing this, Solution Explorer should look as below.

Then create a mining structure by right-clicking the Mining Structures node in Solution Explorer; Microsoft Decision Trees is the default algorithm.

After selecting the required algorithm, we need to select the case table. In this scenario there is only one view that can be selected as the case table, which is vTargetMail.

Next is selecting the Key, Input and Predict columns. CustomerKey will be selected by default as the Key column. Since we are going to predict the bike buyer, let us select BikeBuyer as the Predict column. The next task is to select the input columns; if you have sound knowledge of your data, you will already have some idea of what your input columns should be.

However, if you are not sure about the data's properties, you have the option of getting suggestions by clicking the Suggest button. A score is displayed for each column, and you should select the columns with the highest scores.

After selecting these columns, they will be taken as the inputs for the mining structure, as shown below.

Next, there is a minor modification to be made, as shown below. Typically you would accept the default values, but there is an exception in this scenario.

By default, Bike Buyer will have the Continuous content type and Long as the data type. This is because the Bike Buyer column is an integer column, and by default integer columns are treated as continuous. Since we are looking at Bike Buyer as Yes or No, its content type has to be Discrete and its data type Boolean.

Next is to provide a test sample. Here you have two options. The first is to set aside a percentage of the data for testing: when it is set to 30, 30% of the data is used to test the model and the other 70% is used to build it. Alternatively, you can specify the maximum number of cases to use for testing.
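
For reference, the wizard's selections up to this point correspond roughly to the following DMX. This is a minimal sketch: the structure name and the subset of vTargetMail columns are my assumptions, not the exact statement the wizard generates.

    CREATE MINING STRUCTURE [Targeted Mailing]
    (
        [Customer Key]       LONG KEY,
        [Age]                LONG CONTINUOUS,
        [Number Cars Owned]  LONG DISCRETE,
        [Total Children]     LONG DISCRETE,
        [English Education]  TEXT DISCRETE,
        [English Occupation] TEXT DISCRETE,
        [Yearly Income]      DOUBLE CONTINUOUS,
        [Bike Buyer]         BOOLEAN DISCRETE  -- Discrete/Boolean instead of the Continuous/Long default
    )
    WITH HOLDOUT (30 PERCENT)                  -- reserve 30% of the cases for testing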

Unlike Weka, where several testing methods are available, SQL Server's data mining tools offer only those two options.

After creating the data mining model and processing it, you can view the tree model as shown in the image below.
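
Continuing the hypothetical DMX above, adding the decision tree model and then processing (training) the structure could look like this; the data source name in OPENQUERY is assumed.

    -- Add a decision tree model to the structure.
    ALTER MINING STRUCTURE [Targeted Mailing]
    ADD MINING MODEL [TM Decision Tree]
    (
        [Customer Key],
        [Age],
        [Number Cars Owned],
        [Total Children],
        [English Education],
        [English Occupation],
        [Yearly Income],
        [Bike Buyer] PREDICT               -- the predictable column
    )
    USING Microsoft_Decision_Trees;

    -- Process the structure by binding it to the source view.
    INSERT INTO MINING STRUCTURE [Targeted Mailing]
    (
        [Customer Key], [Age], [Number Cars Owned], [Total Children],
        [English Education], [English Occupation], [Yearly Income], [Bike Buyer]
    )
    OPENQUERY([Adventure Works DW],
        'SELECT CustomerKey, Age, NumberCarsOwned, TotalChildren,
                EnglishEducation, EnglishOccupation, YearlyIncome, BikeBuyer
         FROM dbo.vTargetMail');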

The tree diagram should be easy to navigate. For example, you can start at the first node, where the split is made on Number Cars Owned. (There are different algorithms for finding the most suitable split attribute among the available columns.) Let us assume that a new customer has more than two cars; then the next attribute to consider is Yearly Income. Let us assume that it is more than 90,000, and navigate down that path. The next attribute to consider is Total Children. At each node you are given the probability of the outcome. For example, when you click a node, the Mining Legend appears on its right-hand side, as shown in the screen below.

The above is the screen for no cars owned, age above 49, and yearly income between 26,000 and 58,000. In this category, there is a 72% probability that the person will be a bike buyer.
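
The node statistics behind the Mining Legend can also be read with a DMX content query; a small sketch against the model name assumed earlier:

    -- Browse the caption, case count and probability of every node.
    SELECT NODE_CAPTION, NODE_SUPPORT, NODE_PROBABILITY
    FROM [TM Decision Tree].CONTENT;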

If you are in the marketing function, what you want are the attributes that lead toward purchasing a bike. Selecting the value True for Bike Buyer, as shown in the image below, highlights the paths of the tree where there is a high probability of a bike buyer.

Below is the image of the dependency network. This gives you a clear idea of which attributes are the most important factors in customers purchasing a bike.

On the left-hand side of the screen there is a slider, which is at the top by default. In that position all links are shown; dragging the slider down removes the weaker links. At the lowest point of the slider, only Number Cars Owned remains, so the number of cars owned is the most important factor. That is why it was selected as the first split attribute in the decision tree.

Predictions

Predictions are one of the most important aspects of data mining. The Mining Model Prediction tab will help us predict the probable bike buyers. The ProspectiveBuyer table in the AdventureWorksDW database contains the prospective buyers, and our target is to identify the likely buyers among them using the data mining model we developed.

You can select the table by clicking the Select Case Table… button. After selecting the ProspectiveBuyer table, some columns will be mapped automatically and others need to be mapped manually. For example, English Education in the data mining model needs to be mapped to Education in ProspectiveBuyer, and English Occupation to Occupation.

Next is to select output columns.

As this is a report, you will obviously need customer details, as shown above. Next comes whether the prospect is likely to buy a bike or not; for that, the Bike Buyer column from vTargetMail is selected. We can also retrieve the probability that the customer buys a bike by using the PredictProbability function, which here takes two arguments: the predictable column and the value whose probability we want.
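
Expressed as DMX, the prediction query the designer builds looks roughly like the following; the model and data source names are the ones assumed in the earlier sketches, and the ON clause carries the manual mappings described above.

    SELECT
        t.FirstName,
        t.LastName,
        Predict([TM Decision Tree].[Bike Buyer])               AS [Predicted Buyer],
        PredictProbability([TM Decision Tree].[Bike Buyer])    AS [Prediction Probability],
        PredictProbability([TM Decision Tree].[Bike Buyer], 1) AS [Probability Of Buying]
    FROM [TM Decision Tree]
    PREDICTION JOIN
        OPENQUERY([Adventure Works DW],
            'SELECT FirstName, LastName, Education, Occupation,
                    NumberCarsOwned, TotalChildren, YearlyIncome
             FROM dbo.ProspectiveBuyer') AS t
    ON  [TM Decision Tree].[English Education]  = t.Education
    AND [TM Decision Tree].[English Occupation] = t.Occupation
    AND [TM Decision Tree].[Number Cars Owned]  = t.NumberCarsOwned
    AND [TM Decision Tree].[Total Children]     = t.TotalChildren
    AND [TM Decision Tree].[Yearly Income]      = t.YearlyIncome;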

Clicking the Results button generates the final results of the analysis.

This will be the list of prospective
customers.

PredictProbability([v Target Mail].[Bike Buyer]) gives you the probability of the predicted Bike Buyer value, while PredictProbability([v Target Mail].[Bike Buyer], 1) gives you the probability that this prospect is a bike buyer. If the predicted value is true (1), the two figures are equal; if it is false (0), they sum to 1. For example, if a prospect is predicted not to buy with probability 0.6, PredictProbability([Bike Buyer], 1) returns 0.4.

There is also an option of predicting for individual customers using the Singleton Query input option. Here you provide a value for each input column and the query returns the required values.
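
In DMX, a singleton query supplies the case inline through a NATURAL PREDICTION JOIN, which maps columns by name; again a sketch against the hypothetical model, with example input values:

    -- Predict for a single, hand-entered case.
    SELECT
        Predict([Bike Buyer])               AS [Predicted Buyer],
        PredictProbability([Bike Buyer], 1) AS [Probability Of Buying]
    FROM [TM Decision Tree]
    NATURAL PREDICTION JOIN
    (
        SELECT 2     AS [Number Cars Owned],
               45    AS [Age],
               1     AS [Total Children],
               90000 AS [Yearly Income]
    ) AS t;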

Parameters

As you know, one data set is different from another. Therefore, you may need to tune the algorithm to suit your data environment using its parameter options.

MAXIMUM_INPUT_ATTRIBUTES

This defines the number of input attributes that the algorithm can handle before it invokes feature selection. Setting this value to 0 turns the feature off. The default value is 255. This is an Enterprise Edition feature.

MAXIMUM_OUTPUT_ATTRIBUTES

This parameter defines the number of output attributes that the algorithm can handle before it invokes feature selection. To turn the feature off, set this value to 0. The default value is 255. This is also an Enterprise Edition feature.

SCORE_METHOD 

This determines the method that is used to calculate
the split score.

Available options are:

  • Entropy (1)
  • Bayesian with K2 Prior (3)
  • Bayesian Dirichlet Equivalent (BDE) Prior (4)

The default is 4.

SPLIT_METHOD

Determines the method that is used to split
the node.

Available options:

  • Binary (1)
  • Complete (2)
  • Both (3)

The default is 3.

MINIMUM_SUPPORT

This determines the minimum number of leaf cases required to generate a split in the decision tree. The default is 10. If this is set to a very low number, the tree will be more complex; on large data sets, increasing it helps avoid overtraining.

COMPLEXITY_PENALTY

This controls the growth of the decision
tree.  A low value increases the number of splits, and a high value decreases
the number of splits.  When this value is close to 0, you may see a large tree.
The default value is based on the number of attributes for a particular model,
as described in the following list:

  • For 1 through 9 attributes, the default is
    0.5.
  • For 10 through 99 attributes, the default
    is 0.9.
  • For 100 or more attributes, the default is
    0.99.

FORCED_REGRESSOR 

This parameter forces the algorithm to use the indicated columns as regressors, regardless of the importance of the columns as calculated by the algorithm. It is only used for decision trees that predict a continuous attribute. This is an Enterprise Edition feature.
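
In DMX, these parameters go in the USING clause when the model is added (the Algorithm Parameters dialog in the designer does the same thing). A hedged sketch, reusing the hypothetical names from above; the chosen values are examples only:

    -- Re-create the model with tuned algorithm parameters.
    ALTER MINING STRUCTURE [Targeted Mailing]
    ADD MINING MODEL [TM Decision Tree Tuned]
    (
        [Customer Key],
        [Number Cars Owned],
        [Yearly Income],
        [Bike Buyer] PREDICT
    )
    USING Microsoft_Decision_Trees
    (
        COMPLEXITY_PENALTY = 0.9,  -- fewer splits than the 0.5 default for under 10 attributes
        MINIMUM_SUPPORT    = 20,   -- require at least 20 cases in a leaf before splitting
        SPLIT_METHOD       = 3     -- allow both binary and complete splits
    );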



