Using Decision Trees in SQL Server

As a practical introduction to decision trees, imagine you are assessing the risk of a PC being infected with a virus. The data below, on factors affecting that risk, is available on MSDN.

Shares files | Uses scanner | Infected before | Risk
------------ | ------------ | --------------- | ------
Yes          | Yes          | No              | High
Yes          | No           | No              | High
No           | No           | Yes             | Medium
Yes          | Yes          | Yes             | Low
Yes          | Yes          | No              | High
No           | Yes          | No              | Low
Yes          | No           | Yes             | High

A convenient way to visualize this is to draw a tree diagram as shown below.

If you look at the above diagram, when files are shared (Shares Files = Yes) and the PC was not infected before, the risk is high, whatever the Uses Scanner value is. On the other hand, when files are not shared (Shares Files = No), the risk is never high. This example shows how a decision tree turns raw data into rules you can use to make appropriate decisions.

Decision Trees in SQL Server

Decision trees were introduced in SQL Server 2000, along with Microsoft's clustering technique.

As an example, let's examine the vTargetMail view in the AdventureWorksDW2012 database.

After creating the SSAS project, AdventureWorksDW2012 was created as the data source and vTargetMail was added to the Data Source Views. Since these steps are straightforward, I won't go into their details; if you are new to SSAS, search for an introductory tutorial on getting started with it. After doing this, the Solution Explorer should look as below.

Then create a mining structure by right-clicking the Mining Structures node in the Solution Explorer. Microsoft Decision Trees is the default algorithm for a new mining structure.

After selecting the required algorithm, we need to select the case table. In this scenario there is only one view that can be selected as the case table, which is vTargetMail.

Next is selecting the Key, Input and Predict columns. CustomerKey will be selected by default as the Key column. Since we are going to predict the bike buyer, select BikeBuyer as the Predict column. The next task is to select the input columns. If you have sound knowledge of your data, you will have some idea of what the input columns should be.

However, if you are not sure about the data's properties, you can get suggestions by clicking the Suggest button. A score will be displayed for each column, and you should select the columns with the highest scores.

After selecting these parameters, they will be taken as the inputs for the selected mining structure as shown below.

Next, there is a minor modification to be made, as shown below. Typically you would accept the default values, but there is an exception in this scenario.

By default, the Bike Buyer column will have Continuous as its Content Type and Long as its Data Type. This is because BikeBuyer is an integer column, and integer columns are assigned the Continuous content type by default. Since we are treating Bike Buyer as Yes or No, the content type has to be Discrete and the data type needs to be Boolean.
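If you prefer scripting over the wizard, the same choices can be expressed in DMX. Below is a minimal sketch, assuming a reduced input-column list and a hypothetical model name, TargetMailDT; note that [Bike Buyer] is declared as BOOLEAN DISCRETE rather than LONG CONTINUOUS.

    CREATE MINING MODEL [TargetMailDT]
    (
        [Customer Key]      LONG    KEY,          -- case key
        [Number Cars Owned] LONG    DISCRETE,     -- input
        [Total Children]    LONG    DISCRETE,     -- input
        [Age]               LONG    CONTINUOUS,   -- input
        [Yearly Income]     DOUBLE  CONTINUOUS,   -- input
        [Bike Buyer]        BOOLEAN DISCRETE PREDICT  -- Yes/No predictable column
    )
    USING Microsoft_Decision_Trees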

Next is to provide a test sample. Here you have two options. The first is to provide a percentage of the data for testing: when it is set to 30, 30% of the data is used to test the model and the other 70% is used to build the model. The second option is to specify the number of cases to be used for testing.

Unlike Weka, where several testing methods are available, the SQL Server data mining tools offer only those two options.
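In DMX, the same holdout can be declared when creating the mining structure. A sketch, assuming a hypothetical structure name and the same reduced column list as before:

    CREATE MINING STRUCTURE [TargetMail]
    (
        [Customer Key]      LONG    KEY,
        [Number Cars Owned] LONG    DISCRETE,
        [Yearly Income]     DOUBLE  CONTINUOUS,
        [Bike Buyer]        BOOLEAN DISCRETE
    )
    WITH HOLDOUT (30 PERCENT)   -- or, for a fixed count: WITH HOLDOUT (1000 CASES)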

After creating the data mining model and processing it, you can view the tree model as shown in the image below.

The tree diagram should be easy to navigate. For example, you can start at the first node, where the split is made on Number Cars Owned. (There are different algorithms for finding the most suitable split attribute from the list of available attributes.) Let us assume a new customer has more than two cars; then the next attribute to consider is Yearly Income. Let us assume it is more than 90,000, so you navigate down that path. The next attribute to consider is Total Children. At each node you are given the probability of the outcome. For example, when you click a node, the Mining Legend appears on its right-hand side, as shown in the screen below.

The above is the screen for no cars owned, age above 49 and yearly income between 26,000 and 58,000. In this category, there is a 72% probability that the person will be a bike buyer.
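The same node statistics can be queried from the model content in DMX. A sketch, assuming the mining model is named [v Target Mail], as in the prediction queries later in this article:

    SELECT
        NODE_CAPTION,      -- e.g. 'Number Cars Owned = 0'
        NODE_DESCRIPTION,  -- the full path condition leading to the node
        NODE_SUPPORT,      -- number of training cases in the node
        NODE_PROBABILITY   -- probability assigned to the node
    FROM [v Target Mail].CONTENT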

If you are in the marketing function, what you want are the attribute values that lead towards purchasing a bike. By selecting true as the Bike Buyer value, as shown in the image below, the paths of the tree with a high probability of bike buyers will be highlighted.

Below is the image of the dependency network. This gives you a clear idea of which attributes are the most important factors in a customer's decision to purchase a bike.

On the left-hand side of the screen there is a slider, which by default is at the top. In that position, all links are shown. Dragging the slider down removes the weaker links. At the lowest point of the slider, you will see that only Number Cars Owned remains; thus the number of cars owned is the most important factor, which is why it was selected as the first split attribute in the decision tree.

Predictions

Predictions are one of the most important aspects of data mining.  The Mining Model Prediction tab will help us to predict the probable bike buyers. The ProspectiveBuyer table in the AdventureWorksDW database contains the prospective buyers. Now our target is to predict the prospective buyers from the developed data mining model.

You can select the table by clicking the Select Case Table… button. After selecting the ProspectiveBuyer table, some columns will be mapped automatically and others need to be mapped manually. For example, English Education in the data mining model needs to be mapped to Education in ProspectiveBuyer, and English Occupation to Occupation.

Next is to select output columns.

As this is a report, you will obviously need customer details, as shown above. Next will be whether this prospect is likely to buy a bike or not; for that, the Bike Buyer column from vTargetMail is selected. However, we also want the probability that this customer will buy a bike, which can be retrieved with PredictProbability. This function can be called with the column alone, or with the column and a specific state as a second argument, which is why the probability appears twice below.

Clicking the Results button generates the final results of the analysis.

This will be the list of prospective customers.

PredictProbability([v Target Mail].[Bike Buyer]) gives you the probability of the predicted Bike Buyer value. The other criterion, PredictProbability([v Target Mail].[Bike Buyer], 1), gives you the probability that this prospect is a bike buyer. If Bike Buyer is predicted as true (or 1), these two values should be equal; if it is predicted as false (or 0), the two values should sum to 1.
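Put together, the query behind this tab looks roughly like the DMX sketch below. The model name [v Target Mail] is taken from the legend above; the data source name [Adventure Works DW] and the exact mapping list are assumptions for illustration.

    SELECT
        t.[FirstName],
        t.[LastName],
        Predict([v Target Mail].[Bike Buyer])               AS [Bike Buyer],
        PredictProbability([v Target Mail].[Bike Buyer])    AS [Prediction Probability],
        PredictProbability([v Target Mail].[Bike Buyer], 1) AS [Buyer Probability]
    FROM [v Target Mail]
    PREDICTION JOIN
        OPENQUERY([Adventure Works DW],
                  'SELECT * FROM dbo.ProspectiveBuyer') AS t
    ON  [v Target Mail].[English Education]  = t.[Education]
    AND [v Target Mail].[English Occupation] = t.[Occupation]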

There is also the option of predicting for a single customer using the Singleton Query input. Here you provide a value for each attribute, and the query returns the required values.
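In DMX, a singleton query supplies the case inline instead of joining to an input table. A sketch, again assuming the model name from above; the attribute values are made up for illustration.

    SELECT
        Predict([v Target Mail].[Bike Buyer])               AS [Bike Buyer],
        PredictProbability([v Target Mail].[Bike Buyer], 1) AS [Buyer Probability]
    FROM [v Target Mail]
    NATURAL PREDICTION JOIN
    (
        SELECT 0     AS [Number Cars Owned],
               2     AS [Total Children],
               45    AS [Age],
               60000 AS [Yearly Income]
    ) AS t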

Parameters

As you know, one data set is different from another. Therefore, you may need to tune the algorithm to suit your data environment, using the parameter options below.

MAXIMUM_INPUT_ATTRIBUTES

This defines the number of input attributes that the algorithm can handle before it invokes feature selection. Setting this value to 0 turns feature selection off. The default value is 255. This is an Enterprise Edition feature.

MAXIMUM_OUTPUT_ATTRIBUTES

This parameter defines the number of output attributes that the algorithm can handle before it invokes feature selection. To turn feature selection off, set this value to 0. The default value is 255. This is also an Enterprise Edition feature.

SCORE_METHOD 

This determines the method that is used to calculate the split score.

Available options:

  • Entropy (1)
  • Bayesian with K2 Prior (3)
  • Bayesian Dirichlet Equivalent (BDE) Prior (4)

The default is 4.

SPLIT_METHOD

Determines the method that is used to split the node.

Available options:

  • Binary (1)
  • Complete (2)
  • Both (3)

The default is 3.

MINIMUM_SUPPORT

This determines the minimum number of leaf cases required to generate a split in the decision tree. The default is 10. If this value is set very low, the tree can become overly complex.

COMPLEXITY_PENALTY

This controls the growth of the decision tree.  A low value increases the number of splits, and a high value decreases the number of splits.  When this value is close to 0, you may see a large tree. The default value is based on the number of attributes for a particular model, as described in the following list:

  • For 1 through 9 attributes, the default is 0.5.
  • For 10 through 99 attributes, the default is 0.9.
  • For 100 or more attributes, the default is 0.99.

FORCED_REGRESSOR 

This parameter forces the algorithm to use the indicated columns as regressors, regardless of the importance of the columns as calculated by the algorithm. It is only used for decision trees that predict a continuous attribute. This is an Enterprise Edition feature.
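These parameters can be set from the Mining Models tab (right-click the model and choose Set Algorithm Parameters), or supplied directly in DMX. A sketch, reusing the hypothetical model from earlier; the values shown are the defaults discussed above, except for COMPLEXITY_PENALTY, which is lowered to favour a larger tree.

    CREATE MINING MODEL [TargetMailDT]
    (
        [Customer Key]      LONG    KEY,
        [Number Cars Owned] LONG    DISCRETE,
        [Yearly Income]     DOUBLE  CONTINUOUS,
        [Bike Buyer]        BOOLEAN DISCRETE PREDICT
    )
    USING Microsoft_Decision_Trees
        (COMPLEXITY_PENALTY = 0.3,   -- lower than the default 0.5, so more splits
         MINIMUM_SUPPORT    = 10,    -- default minimum leaf cases for a split
         SPLIT_METHOD       = 3)     -- allow both binary and complete splits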
