Prediction Models - FAQs

How much training data do I need?

Determining the amount of data required for using a certain statistical method is not a simple problem, and there are no universally correct answers. Some rules of thumb exist, but any such rule is likely to be an oversimplification. This is because the amount of data required depends on the characteristics of the data in question.

The most important factor is the signal-to-noise ratio in the data. If there exists a clear relationship between the input predictors and the target, which is not obscured by noise, a model may be able to identify the relationship with relatively little data. On the other hand, when there is much noise, you may require large amounts of data.

Another important factor is the model you choose, and in particular its degrees of freedom. For simple linear regression with a single input predictor, a handful of points can give good results. Each input predictor you add to such a model increases the degrees of freedom by one, requiring more data points to avoid overfitting. For ARIMA, most rules of thumb recommend something closer to 30 data points, and if you include seasonal components as well, we might recommend 50 or even 100 data points as a minimum.

It also depends on the purpose of the analysis. If your aim is to quantify the relationships between the time series well enough to predict future data points, you will typically need more data. An extreme example is the task of predicting stock prices. In an efficient market, the available knowledge you could potentially put into the model has, to a large extent, already been priced in, and it is likely to be difficult to extract whatever signal remains in the data, as it is swamped in noise. In such cases you may need an extraordinary amount of data.

Generally, more data is better, but it is not always better to include more data. If your time series are stationary, meaning that their properties do not depend on the time at which they are observed, it never hurts to add more data. However, many time series you encounter in the real world are not stationary. If you go sufficiently far back in time, the situation may have been markedly different from what it is today, and relations that held a few years ago may no longer be relevant to understanding the current market. A stark example of this is the financial crisis of 2007 and 2008. Many macroeconomic indicators and other economic time series show conspicuous behaviour during the crisis, and for many data sets you cannot expect behaviour from this period to be informative about the present. Worse, the magnitude of the movements during such chaotic periods is often large compared to other periods, leading many statistical models to give them disproportionate weight. You should therefore consider leaving aberrant time periods out of the training data.
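As a minimal sketch of leaving out an aberrant period (assuming the training data is a date-indexed pandas DataFrame; the column name and cut-off dates are hypothetical and should be chosen by inspecting your own series):

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly training data indexed by date.
idx = pd.date_range("2000-03-31", "2015-12-31", freq="Q")
df = pd.DataFrame({"target": np.random.default_rng(0).normal(size=len(idx))}, index=idx)

# Drop the most turbulent crisis quarters before fitting; the boundaries here
# are an assumption, not a recommendation.
train = df.loc[(df.index < "2008-01-01") | (df.index > "2009-12-31")]
```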

How is the data preprocessed by the system?

We perform some preprocessing of the input data in order to make it more amenable to statistical analysis. The exact procedure depends on the type of model and other factors, but in general it looks as follows:

  1. Shift the mean of the time series to zero

This is performed by subtracting the mean of the time series. It is not always desirable to shift the mean to zero for all the time series involved. In particular, the target time series are only shifted when we use the machine learning models, not when we use the time series models.

For the predictor time series we will try both variants (both shifting the mean to zero and leaving the time series unchanged) and choose the method that gives the best results.

  2. Normalise by the mean absolute deviation

This is performed by dividing the time series by the mean absolute deviation.

This step is always performed on all the time series. The reason we do this is that some time series models would otherwise accord too high a weight to a certain predictor just due to the scaling of the predictor. (In other words, whether a time series is measured in meters or feet should not have an effect on the statistical analysis.)

  3. [optional] Squash the data to reduce the effect of outliers

This is done by applying a function to the data which leaves small values relatively untouched but substantially compresses larger values towards zero.

The particular formula we use replaces an input value x by log(1 + x) if x is positive and by -log(1 - x) if x is negative.

For prediction models we try both variants, i.e. squashing and not squashing, and choose the one that gives the best results.

The rationale for doing this is that many statistical models give outliers considerable weight, increasing with their distance from the mean, with the consequence that one or a few outliers may influence the results unduly.
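To make the procedure concrete, here is a minimal sketch of the three steps on a single series using numpy. It illustrates the logic described above rather than the system's actual implementation; the function name and defaults are assumptions.

```python
import numpy as np

def preprocess(series, shift_mean=True, squash=False):
    """Illustrative preprocessing of one time series (hypothetical helper)."""
    x = np.asarray(series, dtype=float)

    # 1. optionally shift the mean of the series to zero
    if shift_mean:
        x = x - x.mean()

    # 2. normalise by the mean absolute deviation (always applied)
    mad = np.mean(np.abs(x - x.mean()))
    if mad > 0:
        x = x / mad

    # 3. optionally squash to dampen outliers:
    #    log(1 + x) for positive values, -log(1 - x) for negative values
    if squash:
        x = np.sign(x) * np.log1p(np.abs(x))

    return x
```

For the optional steps the system tries both variants and keeps whichever gives the better results, e.g. comparing `preprocess(x)` against `preprocess(x, squash=True)`.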

How is (automatic) resampling of input predictors performed?

The models are built with targets and input predictors having the same resolution. Thus, if your data have different resolutions, they need to be resampled. How you should resample the data depends on what kind of data you have.

Typically, the target metric, such as revenue numbers, has a low resolution such as quarterly, while the input data may be higher resolution alternative data, such as daily credit card spend numbers or web traffic. This means that the input data needs to be downsampled. In many cases the most natural choice is to downsample by taking the average or the sum of all data points within the period, e.g. summing up all the daily values within the quarter. Another possibility is to take the last value in every time period, which is useful for e.g. a share price (technically, for “integrated time series”, representing levels rather than transactions).

By default, if the provided input signals have a higher resolution than the target signal, the input signals will be downsampled by taking the average. If a different behaviour is desired, you should downsample the input data explicitly by creating a new signal using a DSL expression that does the desired downsampling, and then choosing this new signal as the input in the modeller tool.
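Outside the modeller tool, the same downsampling choices can be illustrated with pandas (the DSL itself is not shown here; the series below is synthetic):

```python
import pandas as pd

# Synthetic daily input signal, e.g. credit card spend or web traffic.
daily = pd.Series(
    range(365), index=pd.date_range("2023-01-01", periods=365, freq="D")
)

quarterly_mean = daily.resample("Q").mean()  # default: average within each quarter
quarterly_sum = daily.resample("Q").sum()    # flow data: sum the daily values
quarterly_last = daily.resample("Q").last()  # level ("integrated") data, e.g. a share price
```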

When should I use panel vs. single-company modelling?

Performing a panel model analysis on a set of companies means that we fit a single statistical model across all the companies, giving one set of coefficients that applies to every company in the panel.

Arguments for using panel modelling are the following:

  1. You believe that the companies react in a similar way to the predictors.

For example, it is reasonable to expect that web traffic has a positive effect on the performance of a company, and for companies in the same sector, one can furthermore expect the effect to be comparable. But for companies in different sectors, this may no longer hold: for example, web traffic is probably much more important for online retail shops than it is for petroleum exporters.

  2. You don’t have enough data to perform an analysis on a single company.

If you have little data for a given company, it may be too little to determine any meaningful statistical effects for that company alone, and the model may report effects that are not reflected in reality (overfitting).

This is a particular concern for low-resolution data. For indicators such as revenue, for which you may only have quarterly observations, you are unlikely to have a particularly long time series. And if you do, it may cover such a long time span that the earliest observations are no longer relevant for understanding the market today.

What should I consider if an input signal is a percentage or a difference between two numbers?

Time series consisting of differences, ratios, or percentage changes work well, and in fact it is often desirable to use such series rather than absolute values, because differenced series are generally more stationary.

When choosing between absolute time series and differenced time series, the most natural choice is to stick to one or the other: either all the input variables are absolute time series, or they are all differenced time series.

This is not an absolute rule, and it may make sense to build a model that mixes the two kinds of time series, but if you do so, you should carefully consider whether it makes sense in your case.

For prediction models we generate several features from the input data, which generally include the previous value of the target time series. For this reason, it can make sense to let the target be a time series of absolute values while the predictors are difference series, because the model can then infer that the next target value is the previous target value plus some combination of the input predictors.
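A minimal sketch of this setup with pandas (column names and data are hypothetical): the predictors are differenced or converted to percentage changes, while the target is kept in absolute values, with its previous value available as a feature.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-03-31", periods=24, freq="Q")
rng = np.random.default_rng(1)
data = pd.DataFrame(
    {
        "revenue": 100 + np.cumsum(rng.normal(1.0, 2.0, 24)),    # target, in levels
        "card_spend": 50 + np.cumsum(rng.normal(0.5, 1.0, 24)),  # raw predictor
    },
    index=idx,
)

features = pd.DataFrame(
    {
        "revenue_prev": data["revenue"].shift(1),           # previous target value
        "card_spend_diff": data["card_spend"].diff(),       # differenced predictor
        "card_spend_pct": data["card_spend"].pct_change(),  # or percentage change
    }
).dropna()
```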

Can I build a model with no input predictors? If so, what would it do?

It is possible to specify a model without any external input predictors. For such models we extract extra predictors calculated from the target time series itself. For example, we may add autoregressive terms, which means that we include one or more of the most recent observations as predictors in the model. This makes it possible to capture certain patterns in the time series, such as trend or seasonality, even in the absence of external input predictors.
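As an illustration, autoregressive terms can be thought of as lagged copies of the target used as predictors; a small sketch with pandas (data and lag choices are hypothetical):

```python
import pandas as pd

target = pd.Series(
    [10, 12, 11, 13, 14, 13, 15, 16],
    index=pd.period_range("2022Q1", periods=8, freq="Q"),
)

# Lagged copies of the target serve as autoregressive predictors.
features = pd.DataFrame(
    {"lag_1": target.shift(1), "lag_2": target.shift(2)}
).dropna()
```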

What is auto modelling / "hyperopt"?

Our system offers a number of statistical algorithms, and it may be difficult for someone without statistical training to determine which algorithms are suitable for a certain modelling problem.

By choosing auto modelling you let the system choose the modelling technique, so you don’t have to make any choices about it yourself. The only input required from you is the target variable you would like to model and the input variables you would like to use as predictors.

Currently, the system makes the selection by running several algorithms on the data you provide and then selecting the algorithm that gives the best backtesting results. The system does this model search in a smart way, learning which settings give the best results and focusing the effort on the most promising modelling techniques and their parameters, which input signals to include, and which derived features to use (such as lagged input variables, autoregressive features, linear trends, and seasonality). This process is called “hyperparameter optimization”.
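Conceptually, and greatly simplified relative to the actual adaptive search, the selection amounts to backtesting a set of candidate configurations and keeping the best one. The sketch below uses scikit-learn ridge regressions as stand-in candidates and a simple expanding-window backtest; all names, the candidate set, and the synthetic data are assumptions, not the system's real search space.

```python
import numpy as np
from sklearn.linear_model import Ridge

def backtest_error(model_factory, y, X, n_test=4):
    """Refit on an expanding window and average the one-step-ahead absolute errors."""
    errors = []
    for i in range(len(y) - n_test, len(y)):
        model = model_factory()
        model.fit(X[:i], y[:i])
        errors.append(abs(model.predict(X[i:i + 1])[0] - y[i]))
    return float(np.mean(errors))

# Hypothetical candidate configurations (a real hyperparameter optimizer samples
# these adaptively instead of trying a fixed grid).
candidates = {f"ridge_alpha_{a}": (lambda a=a: Ridge(alpha=a)) for a in (0.1, 1.0, 10.0)}

# Synthetic data standing in for the target and its derived features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=40)

scores = {name: backtest_error(factory, y, X) for name, factory in candidates.items()}
best = min(scores, key=scores.get)
```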

What is "Optimize for high-value alternative data"?

When you choose this config option, a different type of model is constructed, suitable for when the input signals are proportional to the target. The prototypical example would be a credit card spend signal used to predict the revenue of a consumer company, where you expect that a 10% change in the credit card spend corresponds to a 10% change in the revenue of the company.

If the input time series is proportional to the target time series, you can use a linear regression, which would express that target = k * input. The problem is that the proportionality constant k typically varies over time, and that cannot be captured with an ordinary linear regression. When this config option is enabled, the model instead treats k as a time-varying ratio and builds a model for k(t), which is then used to predict the target. The model we use for k(t) is an Unobserved Components model that takes into account the seasonality and growth trend of this ratio (as the ratio will tend to drift over time, and the input time series may have a different seasonality pattern than the target time series).

The steps in this calculation are as follows:

  1. calculate the ratio k(t) between the target time series and the input time series
  2. model the resulting time series (the ratio) with an Unobserved Components model
  3. predict what the ratio will be for the next quarter to be reported
  4. multiply the predicted ratio with the input signal value for the next quarter to arrive at a prediction for the target signal for the next quarter
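A minimal sketch of these four steps using the UnobservedComponents model from statsmodels, on synthetic data (the data, frequency, and component choices are assumptions; the production model may be configured differently):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic quarterly data: 20 reported quarters of revenue (the target) and 21
# quarters of an aggregated credit-card-spend signal (one not-yet-reported quarter).
quarters = pd.period_range("2019Q1", periods=21, freq="Q")
rng = np.random.default_rng(0)
spend = pd.Series(100 + 2.0 * np.arange(21) + rng.normal(0, 1, 21), index=quarters)
revenue = spend.iloc[:20] * (1.5 + 0.01 * np.arange(20)) + rng.normal(0, 1, 20)

# 1. the ratio k(t) = target / input over the reported history
k = revenue / spend.iloc[:20]

# 2. model the ratio with an Unobserved Components model (trend + quarterly seasonality)
res = sm.tsa.UnobservedComponents(k, level="local linear trend", seasonal=4).fit(disp=False)

# 3. forecast the ratio for the next quarter to be reported
k_next = res.forecast(steps=1).iloc[0]

# 4. multiply the forecast ratio by the input signal value for that quarter
revenue_forecast = k_next * spend.iloc[-1]
```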