Time Series Algorithms in SQL Server
This is the fourth article on data mining
series. The below are the previous articles in this series.
This article focuses Time Series Algorithms which
are a forecasting technique. One of the most common algorithms used in
industry are time series algorithms which can be used to answer questions on the future values such sales volume for the next season,
or petrol prices in winter. Most of the cases, time series algorithms are limited to prices
and quantities. However, using the same theories and techniques, they have the
capabilities of predicting trajectory or a moving object and next video frame.
Time series data is typically correlated
with time. Also, there is often a seasonality factor. For example, hotels they have a seasons for
their sales. So when running predictions, you will need to consider seasonal activities.
Requirements for Time Series Algorithm
Let us look at requirements for a Time
Series Algorithm. Let us use the vTimeSeries view in the AdventureWorksDW2014
database. Here is the output of the vTimeSeries view.
A single key time column
This defines the time
slices that the model will use. In the above example, TimeIndex is the time
column. This needs to be a continuous value as there shouldn’t be any gaps. In case of missing values there are few techniques can be
adopted which will be discussed later.
A predictable column
This is the column which
you will predict. In the above example, Amount and Quantity are the
Series key column
This is an optional column
to identify a series. In case of sales, there can be several regions so that
you can predict sales for region wise. In the above example, ModelRegion is
the series key column.
Let us do a sample with SQL Server 2014
using the sample database. We will create a SQL Server Analysis Service (SSAS)
project. After the project creation is completed, the data source should be created
with the AdventureWorksDW2014 database. Then vTimeSeries view needs to be added
to the data source views.
After that your solution explorer of the
project looks like the below.
The next task is to create a data mining model.
By right clicking the mining structures nodes and select new mining structure.
In this wizard driven approach, you need to select Microsoft Time Series and
then vTimeSeries as the case table.
After setting a few default settings, you will end up
with option of selecting the key, predictable and series columns.
The TimeIndex column is the key column. ‘Amount’ is
selected as the Input and Predict column. In data mining a predication column
can be Predict or Predict Only. Predict means, for the prediction of next value
previous predicted values will be used. Predict Only means that predicted
values will not be used for prediction.
The ModelRegion column is selected as the
optional column. After creation of the mining model, at the properties
following can be observed.
Data types can be observed from the
The data types can be changed from the above
screen. Most of the time, default data types are used.
Finally after the building model and
processing it, users can see the predictions from the mining model viewer.
There are a few options in this screen, where
you can show the deviations and the number of prediction steps.
Also, rather than selecting all the
regions, you have the options of selecting them. By selecting required group,
only that group can be viewed.
Important activity in the data mining is
predicting. Predicting can be done from the Mining Model Predicting tab
which is shown below.
After the above configuration, the results will
be shown below.
Data mining algorithm doesn’t understand
that after 201312, it should be 201401. Instead it will show 201313.
DMX queries can be executed to retrieve
The following query is to retrieve above result which will present next
SELECT [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4) FROM [v Time Series]
The FLATTENED command will present the
result in a tabular format.
SELECT FLATTENED [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4) FROM [v Time Series]
The output will be the following.
DMX has the option of giving the range for the prediction. The following DMX
query will return the next 4-10 predictions ignoring 1st, 2nd
SELECT FLATTENED [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4,10) FROM [v Time Series]
A Filtering option is also available with the DMX.
SELECT FLATTENED [v Time Series].[Model Region], PredictTimeSeries([v Time Series].[Amount],4) FROM [v Time Series] Where [Model Region] ='M200 Europe'
Standard deviation and variance are important attributes in time
series which the following query will provide.
SELECT FLATTENED [Model Region], (SELECT *, PredictStdev(Amount) AS std, PredictVariance(Amount) AS Vari FROM PredictTimeSeries(Amount, 5) ) AS Forecast FROM [v Time Series]
Since gaps in data are not allowed in time
series, missing value substitution is a major concern. In case of cross
prediction ( discussed below) with multiple values, all series should have
the same end points. With SQL Server Analysis Service there are three
options which are Previous, Mean and Constant.
Previous – > Repeats the value from the
previous time slice.
Mean -> Uses moving average of time slices
used in training.
Numeric Constant -> Uses the specified number
to replace all missing values.
Also, in some cases data mining algorithms
are used. For example, if the 10th value is not available, it will
be predicted using previous nine values.
Periodicity or seasonality is a repeating
pattern found within the data. For example, seasonal temperature fluctuations,
weekly or daily traffic flow fluctuations, sales in holiday seasons. Both
ARTxP and ARIMA (discussed later) support detection of seasonality or
periodicity. Fast Fourier transformations are used to detect seasonality
before training. You can control seasonality detection by using AUTODETECT_SEASONALITY
and PERIODICITY_HINT parameters.
Cross prediction means predicting the
outcome of one series based on the behavior of another series. Typically, every
time series does not behave only based on it’s previous values. For example, petrol
prices depends on the political stability and other factors.
There are few parameters available for
users to tweak the time series algorithm.
Specifies a numerical value between 0 and 1
used to detect periodicity. Setting this value closer to 1 favors discovery of
many near-periodic patterns and automatic generation of periodicity hints.
Dealing with a large number of periodicity hints will likely lead to
significantly longer model training times. If the value is closer to 0,
periodicity is detected only for strongly periodic data.
Inhibits the growth of the decision tree.
Decreasing this value increases the likelihood of a split, while increasing
this value decreases the likelihood. This is available only Enterprise
Specifies the forecasting algorithm to use.
There are two types available, ARTxp and ARIMA. ARTxp based on Microsoft
Decision Trees. This is optimized for predicting the next likely value in a
series and this supports cross-prediction. According to Microsoft this is a
unique feature available with Microsoft tools. On the other hand ARIMA is well
known and understood in the industry which does not supports cross-prediction.
Select MIXED to create models for both
ARTXP and ARIMA time series algorithms and combine their results during
In Standard Edition, the models are
combined using an automatic ratio that favors ARTXP for near-term and ARIMA for
long-term prediction. In Enterprise Edition, the models are combined and
weighted according to the value set for PREDICTION_SMOOTHING.
If FORECAST_METHOD is set to ARTXP or
ARIMA, the value for PREDICTION_SMOOTHING is ignored.
This specifies the number of historic
models that will be built. This is an Enterprise Edition feature
This specifies the upper constraint for any
time series prediction. Predicted values will never be greater than this
constraint. This is an Enterprise Edition feature.
As discussed before, this specifies the
method used to fill the gaps in historical data. By default, irregular gaps or
ragged edges in data are not allowed. The methods available to fill in
irregular gaps or edges are: by Previous value, by Mean value, or by specific
numerical constant. This parameter is used only for incomplete rows of input
This controls the threshold at which the
algorithm detects instability in variances in the ARTXP time series algorithm
and stops making predictions. The specified value is only valid when
FORECAST_METHOD is set to ARTXP. The default value of 1 turns on instability
detection. This setting optimizes the reliability of predictions, but might
limit the number of prediction steps. A value of 0 turns off instability
detection. This setting extends predictions indefinitely, but those predictions
might become statistically invalid. This is an Enterprise Edition feature.