Prediction Models - Evaluation

Once a model run is complete, the prediction error gives an overview of how the model performed across all companies. We can drill down into each company to see which model was ultimately chosen (All companies + Insights tab displays this for a panel model, Company page + Insights tab for an individual model). If the input data for the next quarter is available, the model will also provide the next prediction for the target metric. The Signals tab shows what the model would look like if run on each input metric individually.

Evaluation metrics

There are three different metrics provided for KPI prediction models:

Relative prediction error is the Weighted Absolute Percentage Error (WAPE). This is calculated for each company by taking the average of the absolute deviation (i.e. the absolute value of the difference between the actual and the predicted value) and dividing it by the average of the actual values. Then these values are averaged across all the companies to arrive at the overall WAPE score for the model. Lower values are better (0% would be a perfect prediction).
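
As an illustration only, the calculation described above could be sketched like this in Python (the function names and data layout are assumptions for the example, not the product's actual implementation):

    import numpy as np

    def company_wape(actuals, predictions):
        # Average absolute deviation divided by the average of the actual values.
        actuals = np.asarray(actuals, dtype=float)
        predictions = np.asarray(predictions, dtype=float)
        return np.mean(np.abs(actuals - predictions)) / np.mean(actuals)

    def overall_wape(per_company_results):
        # per_company_results: list of (actuals, predictions) pairs, one per company.
        return np.mean([company_wape(a, p) for a, p in per_company_results])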

Absolute prediction error is the Mean Absolute Error (MAE). 0.0 would be a perfect prediction. This metric should only be used when the target is on the same scale across all companies, for example when the target is a percentage.
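
Continuing the same illustrative sketch (again with assumed function names):

    import numpy as np

    def company_mae(actuals, predictions):
        # Average absolute deviation, expressed in the target's own units.
        actuals = np.asarray(actuals, dtype=float)
        predictions = np.asarray(predictions, dtype=float)
        return np.mean(np.abs(actuals - predictions))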

Model fit is the R^2. The R^2 is calculated against a baseline of predicting that the next number will be the same as the previous one (note that this gives quite different numeric results than using the average target value as the baseline, which is perhaps the more common way to calculate R^2). Higher numbers are better: 100% would be a perfect prediction, whereas a model whose predictions were simply the previously observed value would have a model fit of 0%.
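
A minimal sketch of this model-fit calculation, assuming the baseline prediction for each period is simply the previously observed value:

    import numpy as np

    def model_fit_r2(actuals, predictions, previous_values):
        # R^2 measured against a "repeat the previous value" baseline:
        # 1 - (model's squared error) / (squared error of repeating the last value).
        actuals = np.asarray(actuals, dtype=float)
        model_sse = np.sum((actuals - np.asarray(predictions, dtype=float)) ** 2)
        baseline_sse = np.sum((actuals - np.asarray(previous_values, dtype=float)) ** 2)
        return 1.0 - model_sse / baseline_sse

With this definition, a model whose predictions are always just the previous value scores exactly 0%, matching the description above.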

In all three cases, the metric is calculated only using out-of-sample results from the backtests, never using the in-sample results from the data that was used to train the model.

When to use each metric

As an example, let’s say that we are trying to predict the revenue for a set of companies. Let’s first assume we use the actual revenue numbers as our target metric. In this case, we recommend looking at the relative prediction error (WAPE). If, for instance, this metric were 4.7%, it would mean that on average, over all the backtests for all the companies, the model’s predictions missed by 4.7%.

We would not recommend using the absolute errors (MAE) in this case. The reason is that the revenue numbers can be orders of magnitude larger for some companies than for others, which means that this measure will be primarily dominated by the companies with the largest revenue. For instance, if one company has revenue of $50B, the prediction errors might be on the order of $1B for that company. If another company in the model has revenue of $3B, then it won’t matter as much to the MAE metric if the prediction error for this company is $30M or $100M, because either way the average is dominated by the $1B prediction error for the larger company.
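
To make the dominance effect concrete, here is the arithmetic from the example above:

    # Prediction errors from the example, in dollars.
    large_company_error = 1_000_000_000     # ~$1B miss for the $50B-revenue company
    small_company_error_low = 30_000_000    # $30M miss for the $3B-revenue company
    small_company_error_high = 100_000_000  # $100M miss for the same company

    mae_low = (large_company_error + small_company_error_low) / 2    # 515,000,000
    mae_high = (large_company_error + small_company_error_high) / 2  # 550,000,000
    # The MAE barely moves even though the smaller company's error more than tripled.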

It would also be meaningful to look at the model fit (R^2) numbers for this model, as a complement to the WAPE metric.

On the other hand, if we use the YoY percentage increase in revenue as our target metric, then we would recommend looking at the absolute (MAE) metric rather than the relative (WAPE) metric. An absolute error of, say, 0.05 would then mean that the predictions are off by 5 percentage points on average. This is more meaningful than the relative metric, which in that case would be a percentage of a percentage and is harder to interpret.

Baseline error

When you look at the results from a model run you will find two values: the prediction error and the baseline error.

The prediction error is calculated by taking the out-of-sample predictions made by the model during the backtest runs and comparing them to the observed values. This value is averaged over all companies the model is run for.

The baseline error is calculated in the same manner, according to the selected metric (either relative errors, absolute errors or model fit), but using predictions coming from a much simpler model. This simpler model uses only the target time series itself as its input and ignores all the input predictors. That is, it tries to predict future values in the target by looking at trends in the target series itself.
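
Purely as an illustration (the actual baseline model is not specified here beyond the description above), a univariate baseline of this kind could be as simple as extrapolating the target's own recent trend:

    import numpy as np

    def baseline_forecast(target_history):
        # An assumed stand-in for the simpler model: extend the last observed
        # step change, using nothing but the target series itself.
        y = np.asarray(target_history, dtype=float)
        if len(y) < 2:
            return float(y[-1])
        return float(y[-1] + (y[-1] - y[-2]))

The baseline error is then the selected metric computed on such baseline predictions over the same backtest periods.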

If the prediction error and the baseline error are roughly equal, it means that the input predictors did not add valuable information to the model. Thus, you should only consider a model useful if the prediction error is significantly better than the baseline error.

Benchmark error

If you have specified a “benchmark signal” when setting up the model, there will also be a third number: the benchmark error.

The benchmark error is also calculated in the same manner as the prediction error and the baseline error, according to the selected metric (either relative errors, absolute errors or model fit). But here the values in the provided signal are used as the predictions.
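
For example, if the selected metric is the relative prediction error, a sketch of the benchmark calculation could look like this (assuming the benchmark signal is already aligned to the same periods as the target):

    import numpy as np

    def benchmark_wape(actuals, benchmark_signal):
        # The benchmark signal's values are used directly as the predictions
        # and scored with the same metric as the model (WAPE shown here).
        actuals = np.asarray(actuals, dtype=float)
        benchmark = np.asarray(benchmark_signal, dtype=float)
        return np.mean(np.abs(actuals - benchmark)) / np.mean(actuals)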

A typical use case would be to specify the analysts’ consensus estimates as the benchmark. This lets you compare the accuracy of the consensus estimates with the accuracy of the prediction model. Note, though, that even if the consensus estimates are more accurate than the prediction model on average, there may still be significant informational value in using the prediction model in addition, because it can give important directional information about the expected surprise in the numbers. If the KPI prediction model predicts a higher number than the estimate, there may be a higher probability of the company’s actual numbers beating the consensus; conversely, if it predicts a lower number, there may be a higher probability of the actual numbers missing the consensus.