Claytons Data Mining (Part 1)
The success of OLAP technologies is driven by the ability to provide end users with fast access to data without the need for technical interaction. The sophistication of todays reporting tools and multidimensional databases allow users to define their own reports, from content to presentation and delivery. The tools allow a user to think of a question and answer it in real time without IT resources. They can form hypothesis and validate their theories in real time.
It was not that long ago (and is still the norm in some organizations) that retrieving any information from the system is a technical burden. Uses need to book a technical resource, explain what they want to see, have it developed, validate it and then find some way to introduce it into their business cycle. This type of information retrieval and discovery is clearly not acceptable in a modern organization.
Even with these benefits, one drawback of information discovery in this manner is that the user must think of the question they wish to answer before they answer it. If we think of a sales scenario in which we want to determine what is driving sales, the user must ask ‘Who’s purchasing the product?’, ‘Which area is selling the product?’, ‘Which Salesman is selling the product?’, ‘Which Product is being sold?’ and make a decision based on their findings. Having a drill through function is a great place to start, however, the order by which the user starts to drill into the data will often determine the answer they get. Therefore, the knowledge acquired is a function of the method of discovery.
Enter Data Mining
Unlike this method of ‘think and find’ data investigation, data mining does not require the user to form such a structured question in the process of knowledge discovery. Rather, a data mining algorithm is applied (to data) and it investigates and learns based on that data. The question posed in a data mining operation is more akin to ‘what are the determining factors for low sale outlets?’ not ‘which sales men have the lowest sales?’’ The process of allowing the mining algorithm to determining these relationships is referred to as training.
Once the data mining algorithm understands the data, it forms a set of rules about what it has learned. These rules can be used to predict what outcome is likely given a new set of inputs. This type of prediction is called generalized as heuristic because the estimated outcome is based on a prior knowledge or past experience.
This two part article examines a powerful addin available for Excel 2007 that exposes data mining capabilities to even the most modest user. Part 1 examines how to setup and configure the addin and Part 2 explores some simple data mining operations.
Excel and MSAS Interaction
Microsoft SQL Server Analysis Server (MSAS) extends the ability to analyze data beyond OLAP storage by providing data mining functionality within the analysis service engine. The engine is responsible for
- Maintaining the data mining metadata (that is, the structure and role of data being fed as input data)
- Storing various data mining algorithms (and input parameters) within the context of the metadata (above). Note that the same metadata structure can have various algorithms applied to it (and multiple instances of the same algorithm)
- Running the algorithm and storing the results (or rules) derived from training the algorithm (and
- Managing queries to the algorithms regarding output
Despite this functionality, the server application is limited by the ability of a user to understand and apply the data mining algorithm. Firstly, a user must have knowledge of how to implement the mining operation and secondly, they must understand how to interperate the results and what they mean. To overcome this knowledge gap, a data mining addin for Excel 2007 gives users the ability to apply the server data mining techniques against tables in excel. This simple to use interface can deliver real insights about their data.
The rest of this article focuses on how to analyze data using these simple tools. Firstly we examine how to install and use the addin and then we examine how apply it to our own data.
Obtaining and Installing the Addin
The addin installer can be obtained from Microsoft. In addition to the addin, the installer installs a sample data workbook. This can be found ‘Microsoft SQL Server 2005 DM Addins’ Program group.
Overview of Operations
The (simple) data mining operations discussed in this article are found as table analysis techniques (see <Table Tools><Analyze>). To apply them, the data must be formatted as an excel table. To do this we simply format a range as a table (Button Format as Table). The diagram(s) below (Figure 1, Figure 2, Figure 3) show the method for this. Note that the data represented is the ‘value only’ version of the worksheet ‘Table Analysis Tools Sample’ in the addin sample data. All work sheets in the sample data are formatted as tables but the tools are only exposed if the data is formatted as a table.
Figure 1 – Format Data Range as a Table