This is the fourth article on data mining series. The below are the previous articles in this series.
This article focuses Time Series Algorithms which are a forecasting technique. One of the most common algorithms used in industry are time series algorithms which can be used to answer questions on the future values such sales volume for the next season, or petrol prices in winter. Most of the cases, time series algorithms are limited to prices and quantities. However, using the same theories and techniques, they have the capabilities of predicting trajectory or a moving object and next video frame.
Time series data is typically correlated with time. Also, there is often a seasonality factor. For example, hotels they have a seasons for their sales. So when running predictions, you will need to consider seasonal activities.
Requirements for Time Series Algorithm
Let us look at requirements for a Time Series Algorithm. Let us use the vTimeSeries view in the AdventureWorksDW2014 database. Here is the output of the vTimeSeries view.
A single key time column
This defines the time slices that the model will use. In the above example, TimeIndex is the time column. This needs to be a continuous value as there shouldn’t be any gaps. In case of missing values there are few techniques can be adopted which will be discussed later.
A predictable column
This is the column which you will predict. In the above example, Amount and Quantity are the predictable column.
Series key column
This is an optional column to identify a series. In case of sales, there can be several regions so that you can predict sales for region wise. In the above example, ModelRegion is the series key column.
Let us do a sample with SQL Server 2014 using the sample database. We will create a SQL Server Analysis Service (SSAS) project. After the project creation is completed, the data source should be created with the AdventureWorksDW2014 database. Then vTimeSeries view needs to be added to the data source views.
After that your solution explorer of the project looks like the below.
The next task is to create a data mining model. By right clicking the mining structures nodes and select new mining structure. In this wizard driven approach, you need to select Microsoft Time Series and then vTimeSeries as the case table.
After setting a few default settings, you will end up with option of selecting the key, predictable and series columns.
The TimeIndex column is the key column. ‘Amount’ is selected as the Input and Predict column. In data mining a predication column can be Predict or Predict Only. Predict means, for the prediction of next value previous predicted values will be used. Predict Only means that predicted values will not be used for prediction.
The ModelRegion column is selected as the optional column. After creation of the mining model, at the properties following can be observed.
Data types can be observed from the following screen.
The data types can be changed from the above screen. Most of the time, default data types are used.
Finally after the building model and processing it, users can see the predictions from the mining model viewer.
There are a few options in this screen, where you can show the deviations and the number of prediction steps.
Also, rather than selecting all the regions, you have the options of selecting them. By selecting required group, only that group can be viewed.
Important activity in the data mining is predicting. Predicting can be done from the Mining Model Predicting tab which is shown below.
After the above configuration, the results will be shown below.
Data mining algorithm doesn’t understand that after 201312, it should be 201401. Instead it will show 201313.
DMX queries can be executed to retrieve data.
The following query is to retrieve above result which will present next 4.
SELECT [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4) FROM [v Time Series]
The FLATTENED command will present the result in a tabular format.
SELECT FLATTENED [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4) FROM [v Time Series]
The output will be the following.
DMX has the option of giving the range for the prediction. The following DMX query will return the next 4-10 predictions ignoring 1st, 2nd and 3rd.
SELECT FLATTENED [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4,10) FROM [v Time Series]
A Filtering option is also available with the DMX.
SELECT FLATTENED [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4) FROM [v Time Series] Where [Model Region] ='M200 Europe'
Standard deviation and variance are important attributes in time series which the following query will provide.
SELECT FLATTENED [Model Region], (SELECT *, PredictStdev(Amount) AS std, PredictVariance(Amount) AS Vari FROM PredictTimeSeries(Amount, 5) ) AS Forecast FROM [v Time Series]
Since gaps in data are not allowed in time series, missing value substitution is a major concern. In case of cross prediction ( discussed below) with multiple values, all series should have the same end points. With SQL Server Analysis Service there are three options which are Previous, Mean and Constant.
· Previous – > Repeats the value from the previous time slice.
· Mean -> Uses moving average of time slices used in training.
· Numeric Constant -> Uses the specified number to replace all missing values.
Also, in some cases data mining algorithms are used. For example, if the 10th value is not available, it will be predicted using previous nine values.
Periodicity or seasonality is a repeating pattern found within the data. For example, seasonal temperature fluctuations, weekly or daily traffic flow fluctuations, sales in holiday seasons. Both ARTxP and ARIMA (discussed later) support detection of seasonality or periodicity. Fast Fourier transformations are used to detect seasonality before training. You can control seasonality detection by using AUTODETECT_SEASONALITY and PERIODICITY_HINT parameters.
Cross prediction means predicting the outcome of one series based on the behavior of another series. Typically, every time series does not behave only based on it’s previous values. For example, petrol prices depends on the political stability and other factors.
There are few parameters available for users to tweak the time series algorithm.
Specifies a numerical value between 0 and 1 used to detect periodicity. Setting this value closer to 1 favors discovery of many near-periodic patterns and automatic generation of periodicity hints. Dealing with a large number of periodicity hints will likely lead to significantly longer model training times. If the value is closer to 0, periodicity is detected only for strongly periodic data.
Inhibits the growth of the decision tree. Decreasing this value increases the likelihood of a split, while increasing this value decreases the likelihood. This is available only Enterprise Edition.
Specifies the forecasting algorithm to use. There are two types available, ARTxp and ARIMA. ARTxp based on Microsoft Decision Trees. This is optimized for predicting the next likely value in a series and this supports cross-prediction. According to Microsoft this is a unique feature available with Microsoft tools. On the other hand ARIMA is well known and understood in the industry which does not supports cross-prediction.
Select MIXED to create models for both ARTXP and ARIMA time series algorithms and combine their results during prediction.
In Standard Edition, the models are combined using an automatic ratio that favors ARTXP for near-term and ARIMA for long-term prediction. In Enterprise Edition, the models are combined and weighted according to the value set for PREDICTION_SMOOTHING.
If FORECAST_METHOD is set to ARTXP or ARIMA, the value for PREDICTION_SMOOTHING is ignored.
This specifies the number of historic models that will be built. This is an Enterprise Edition feature
This specifies the upper constraint for any time series prediction. Predicted values will never be greater than this constraint. This is an Enterprise Edition feature.
As discussed before, this specifies the method used to fill the gaps in historical data. By default, irregular gaps or ragged edges in data are not allowed. The methods available to fill in irregular gaps or edges are: by Previous value, by Mean value, or by specific numerical constant. This parameter is used only for incomplete rows of input data.
This controls the threshold at which the algorithm detects instability in variances in the ARTXP time series algorithm and stops making predictions. The specified value is only valid when FORECAST_METHOD is set to ARTXP. The default value of 1 turns on instability detection. This setting optimizes the reliability of predictions, but might limit the number of prediction steps. A value of 0 turns off instability detection. This setting extends predictions indefinitely, but those predictions might become statistically invalid. This is an Enterprise Edition feature.