Time Series Algorithms in SQL Server

This is the fourth article on data mining
series. The below are the previous articles in this series.


Shopping
Basket Analysis in SQL Server


Using
Decision Trees in SQL Server


Data
Mining Cluster Analysis in SQL Server

This article focuses Time Series Algorithms which
are a forecasting technique. One of the most common algorithms used in
industry are time series algorithms which can be used to answer questions on the future values such sales volume for the next season,
or petrol prices in winter. Most of the cases, time series algorithms are limited to prices
and quantities. However, using the same theories and techniques, they have the
capabilities of predicting trajectory or a moving object and next video frame.

Time series data is typically correlated
with time. Also, there is often a seasonality factor. For example, hotels they have a seasons for
their sales. So when running predictions, you will need to consider seasonal activities.

Requirements for Time Series Algorithm

Let us look at requirements for a Time
Series Algorithm.  Let us use the vTimeSeries view in the AdventureWorksDW2014
database. Here is the output of the vTimeSeries view.


A single key time column

This defines the time
slices that the model will use.  In the above example, TimeIndex is the time
column. This needs to be a continuous value as there shouldn’t be any gaps. In case of missing values there are few techniques can be
adopted which will be discussed later. 


A predictable column

This is the column which
you will predict.  In the above example, Amount and Quantity are the
predictable column.


Series key column

This is an optional column
to identify a series. In case of sales, there can be several regions so that
you can predict sales for region wise. In the above example, ModelRegion is
the series key column.   

Sample

Let us do a sample with SQL Server 2014
using the sample database.  We will create a SQL Server Analysis Service (SSAS)
project. After the project creation is completed, the data source should be created
with the AdventureWorksDW2014 database. Then vTimeSeries view needs to be added
to the data source views.

After that your solution explorer of the
project looks like the below.

The next task is to create a data mining model.
By right clicking the mining structures nodes and select new mining structure. 
In this wizard driven approach, you need to select Microsoft Time Series and
then vTimeSeries as the case table.

After setting a few default settings, you will end up
with option of selecting the key, predictable and series columns.

The TimeIndex column is the key column. ‘Amount’ is
selected as the Input and Predict column.  In data mining a predication column
can be Predict or Predict Only. Predict means, for the prediction of next value
previous predicted values will be used.  Predict Only means that predicted
values will not be used for prediction.

The ModelRegion column is selected as the
optional column. After creation of the mining model, at the properties
following can be observed.

Data types can be observed from the
following screen.

The data types can be changed from the above
screen. Most of the time, default data types are used.

Finally after the building model and
processing it, users can see the predictions from the mining model viewer.  

There are a few options in this screen, where
you can show the deviations and the number of prediction steps.  

Also, rather than selecting all the
regions, you have the options of selecting them. By selecting required group,
only that group can be viewed.

Prediction

Important activity in the data mining is
predicting. Predicting can be done from the Mining Model Predicting tab
which is shown below.


After the above configuration, the results will
be shown below.

Data mining algorithm doesn’t understand
that after 201312, it should be 201401. Instead it will show 201313.

DMX Query

DMX queries can be executed to retrieve
data.

The following query is to retrieve above result which will present next
4.

       SELECT
  [v Time Series].[Model Region],
PredictTimeSeries([v Time Series].[Amount],4)
FROM
  [v Time Series]

The FLATTENED command will present the
result in a tabular format. 

SELECT FLATTENED [v Time Series].[Model Region],
  PredictTimeSeries([v Time Series].[Amount],4)
FROM
  [v Time Series]

   The output will be the following.

DMX has the option of giving the range for the prediction. The following DMX
query will return the next 4-10 predictions ignoring 1st, 2nd
and 3rd.

SELECT FLATTENED [v Time Series].[Model Region],
PredictTimeSeries([v Time Series].[Amount],4,10)
FROM
  [v Time Series]

A Filtering option is also available with the DMX.

        SELECT FLATTENED [v Time Series].[Model Region],
PredictTimeSeries([v Time Series].[Amount],4)
FROM
  [v Time Series]
  Where [Model Region] ='M200 Europe'

Standard deviation and variance are important attributes in time
series which the following query will provide.

SELECT FLATTENED [Model Region],
    (SELECT *,
    PredictStdev(Amount) AS std,
    PredictVariance(Amount) AS Vari
FROM PredictTimeSeries(Amount, 5)
    ) AS Forecast
FROM [v Time Series]

Missing Values

Since gaps in data are not allowed in time
series, missing value substitution is a major concern. In case of cross
prediction ( discussed below) with multiple values, all series should have
the same end points. With SQL Server Analysis Service there are three
options which are Previous, Mean and Constant.

·
Previous – > Repeats the value from the
previous time slice.

·
Mean -> Uses moving average of time slices
used in training.

·
Numeric Constant -> Uses the specified number
to replace all missing values.

Also, in some cases data mining algorithms
are used.  For example, if the 10th value is not available, it will
be predicted using previous nine values.

Periodicity

Periodicity or seasonality is a repeating
pattern found within the data. For example, seasonal temperature fluctuations,
weekly or daily traffic flow fluctuations, sales in holiday seasons.  Both
ARTxP and ARIMA (discussed later) support detection of seasonality or
periodicity.  Fast Fourier transformations are used to detect seasonality
before training. You can control seasonality detection by using AUTODETECT_SEASONALITY
and PERIODICITY_HINT parameters.

Cross Prediction

Cross prediction means predicting the
outcome of one series based on the behavior of another series. Typically, every
time series does not behave only based on it’s previous values. For example, petrol
prices depends on the political stability and other factors.

Parameters

There are few parameters available for
users to tweak the time series algorithm. 

AUTO_DETECT_PERIODICITY

Specifies a numerical value between 0 and 1
used to detect periodicity. Setting this value closer to 1 favors discovery of
many near-periodic patterns and automatic generation of periodicity hints.
Dealing with a large number of periodicity hints will likely lead to
significantly longer model training times. If the value is closer to 0,
periodicity is detected only for strongly periodic data.

COMPLEXITY_PENALTY

Inhibits the growth of the decision tree.
Decreasing this value increases the likelihood of a split, while increasing
this value decreases the likelihood.  This is available only Enterprise
Edition.

FORECAST_METHOD

Specifies the forecasting algorithm to use.
There are two types available, ARTxp and ARIMA. ARTxp based on Microsoft
Decision Trees. This is optimized for predicting the next likely value in a
series and this supports cross-prediction. According to Microsoft this is a
unique feature available with Microsoft tools.  On the other hand ARIMA is well
known and understood in the industry which does not supports cross-prediction.

Select MIXED to create models for both
ARTXP and ARIMA time series algorithms and combine their results during
prediction.

In Standard Edition, the models are
combined using an automatic ratio that favors ARTXP for near-term and ARIMA for
long-term prediction. In Enterprise Edition, the models are combined and
weighted according to the value set for PREDICTION_SMOOTHING.

If FORECAST_METHOD is set to ARTXP or
ARIMA, the value for PREDICTION_SMOOTHING is ignored.

HISTORIC_MODEL_COUNT

This specifies the number of historic
models that will be built. This is an Enterprise Edition feature

MAXIMUM_SERIES_VALUE

This specifies the upper constraint for any
time series prediction. Predicted values will never be greater than this
constraint.  This is an Enterprise Edition feature.

MISSING_VALUE_SUBSTITUTION

As discussed before, this specifies the
method used to fill the gaps in historical data. By default, irregular gaps or
ragged edges in data are not allowed. The methods available to fill in
irregular gaps or edges are: by Previous value, by Mean value, or by specific
numerical constant.  This parameter is used only for incomplete rows of input
data.

INSTABILITY_SENSITIVITY

This controls the threshold at which the
algorithm detects instability in variances in the ARTXP time series algorithm
and stops making predictions. The specified value is only valid when
FORECAST_METHOD is set to ARTXP. The default value of 1 turns on instability
detection. This setting optimizes the reliability of predictions, but might
limit the number of prediction steps. A value of 0 turns off instability
detection. This setting extends predictions indefinitely, but those predictions
might become statistically invalid.  This is an Enterprise Edition feature.




Array

No comments yet... Be the first to leave a reply!