Auto-modelling
Automatic model selection, hyperparameter optimization, and ensembling
Introduction
When creating a machine learning model, there are numerous choices to be made, regarding which inputs to use, how those inputs should be processed, which modelling technique to use, and what parameters to use for the chosen model. It is difficult and time-consuming for a human to make these decisions and test which combination of choices gives the best result. Some choices can be made based on past experience, but even expert data scientists would need to experiment to determine the optimal choices for the particular problem at hand.
This is why Exabel's modelling suite includes an auto-modelling capability that uses compute power to search for the best choices, so that the end user doesn't have to make every detailed decision. An expert user can still make some of the decisions, based on experience or insight into the nature of the particular problem, and let the system determine the rest. Most users will use the "Full auto" mode, where they only need to choose the target and input signals.
To know which choices are the best, you need a way to evaluate a given model configuration. In the Exabel system, this is always done through cross-validation, with a choice of either walk-forward backtesting or 5-fold cross-validation. For each model run, the system will evaluate tens or hundreds of model configurations. Each model configuration is tested with cross-validation and scored numerically (WAPE for prediction models and log loss for classification models; lower is better for both). The best-performing model is chosen. If ensemble modelling is enabled, a few (typically 2-3) of the best-performing models are instead chosen and combined into a weighted ensemble model.
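The scoring step can be illustrated with a minimal sketch. WAPE is taken here in its usual form, the sum of absolute errors divided by the sum of absolute actual values. The walk-forward scheme shown (an expanding training window, predicting one point at a time) and the helper names `walk_forward_splits` and `backtest_score` are assumptions for illustration, not a description of Exabel's internal implementation.

```python
def wape(actual, predicted):
    """Weighted absolute percentage error: sum|a - p| / sum|a| (lower is better)."""
    total_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    total_abs_actual = sum(abs(a) for a in actual)
    return total_abs_error / total_abs_actual


def walk_forward_splits(n_points, min_train):
    """Yield (train_indices, test_index) pairs with an expanding training window."""
    for test_index in range(min_train, n_points):
        yield list(range(test_index)), test_index


def backtest_score(series, fit_predict, min_train=4):
    """Score a model by predicting each point using only the data before it."""
    actual, predicted = [], []
    for train_idx, test_idx in walk_forward_splits(len(series), min_train):
        history = [series[i] for i in train_idx]
        predicted.append(fit_predict(history))
        actual.append(series[test_idx])
    return wape(actual, predicted)


def naive_model(history):
    """Toy stand-in for a model: predict that the next value equals the last one."""
    return history[-1]


score = backtest_score([100, 110, 120, 130, 140, 150], naive_model)
```

Here the last two points are backtested, each with an absolute error of 10, so the score is 20 / 290. Comparing such scores across configurations is what allows the search to rank them.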
Hyperparameter optimization process
The number of model configurations that are evaluated depends on the chosen "Hyperopt level":
Level | Runs |
---|---|
No hyperopt | One run with all of the input signals enabled, and for each input signal, one run with only that input signal enabled. |
Level 1 | 30 runs with all of the input signals enabled, and for each input signal, ten runs with only that input signal enabled. For prediction models, the runs use an Elastic Net model only; for "Prediction model (TS)" runs, the order is restricted to at most 1 and linear trend is not included. |
Level 2 | 50 runs where the hyperoptimization process decides which input signals are enabled, and for each input signal, ten runs with only that input signal enabled. |
Level 3 | 100 runs where the hyperoptimization process decides which input signals are enabled, and for each input signal, ten runs with only that input signal enabled. |
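As a rough illustration of how the table translates into a total number of evaluated configurations, the sketch below adds the joint runs to the per-signal runs. The names `RUNS_BY_LEVEL` and `total_runs` are hypothetical and exist only for this example.

```python
# (runs over multiple signals, runs per individual input signal), from the table above
RUNS_BY_LEVEL = {
    "No hyperopt": (1, 1),
    "Level 1": (30, 10),
    "Level 2": (50, 10),
    "Level 3": (100, 10),
}


def total_runs(level, n_input_signals):
    """Total number of model configurations evaluated for a given level."""
    joint_runs, runs_per_signal = RUNS_BY_LEVEL[level]
    return joint_runs + runs_per_signal * n_input_signals


runs = total_runs("Level 2", 4)  # 50 + 10 * 4
```

With four input signals, Level 2 therefore evaluates 90 configurations, each of which is scored with cross-validation.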
The user can specify which hyperopt level to use, or leave it on "Auto" to let the system decide based on the dataset size. The larger the dataset, the higher the hyperopt level that is used, as there is less chance of overfitting with more training data. For a small problem, like modelling a quarterly KPI for a single company, the "No hyperopt" level is chosen. For larger problems where we have enough data to get at least 40 backtested data points for evaluation, Level 3 is chosen. For problems in between, Level 2 is chosen.
The model configurations are not chosen at random. Rather, we use a library for hyperparameter optimization called Hyperopt, which uses an algorithm called Tree of Parzen Estimators to do this search in a smarter way that learns which choices give the best results.
This hyperopt search is randomized, but the seed of the random generator is fixed, so as long as the input data is exactly the same, the result will be identical. Any change in the input data, however, can lead to a different model configuration being chosen. The search space of possible model configurations is huge, and many model configurations obtain almost equally good scores during the cross-validation evaluation. That is why even a small change, like adding a single data point, can lead to a very different model configuration.
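The effect of a fixed seed can be sketched with a toy search. This example uses plain seeded random sampling from the standard library as a stand-in; it is not the Tree of Parzen Estimators algorithm and does not use the Hyperopt library. The objective, the `search` function, and the single `alpha` parameter are hypothetical.

```python
import random


def search(data, n_runs=30, seed=0):
    """Return the best (score, alpha) pair found by a seeded random search."""
    rng = random.Random(seed)  # fixed seed: the candidate sequence is reproducible
    target = sum(data) / len(data)
    best = None
    for _ in range(n_runs):
        alpha = rng.uniform(0.001, 1.0)  # candidate hyperparameter value
        score = abs(alpha - target)      # toy stand-in for a cross-validation score
        if best is None or score < best[0]:
            best = (score, alpha)
    return best


data = [0.2, 0.4, 0.6]
first = search(data)
second = search(data)            # same data, same seed: identical result
changed = search(data + [0.9])   # one extra data point changes every score
```

In this simplified version only the scores change when the data changes. In an adaptive search, where later candidates depend on earlier scores, the candidates themselves would presumably diverge as well, which is consistent with small data changes producing very different configurations.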
Hyperparameters
Selection of inputs
Possibly the most important decision is which inputs to include in the model.
The hyperoptimization process is used to decide whether or not each input signal is included.
For "Prediction model (TS)" models, additionally the following decisions are made:
- Autoregressive: Whether or not to include the previous target value(s) as input to the model.
- Order: The number of lags to include in the model, from 0 to 3. If the order is 0, then only the current value of each input signal is included. If the order is 1, then the previous value of each input signal is also included. If the order is 2, then the value before that is included as well, and so on. If the model is autoregressive and the order is 0, then only the previous value of the target signal is included. If the order is 1, then the two previous values are included, and so on.
- Seasonality: Whether or not to include seasonality in the model. For quarterly target time series, four indicator variables are added, one for each quarter. For monthly target time series, twelve indicator variables are added, one for each month. For non-standard fiscal quarters, each indicator variable shows what fraction of the quarter falls within the corresponding calendar quarter.
- Linear trend: Whether or not to include a linear trend variable for each input signal (and for the target signal, if the model is autoregressive). The linear trend is calculated for the past 5 data points by calculating a linear regression with time as the regressor.
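The feature construction described in the list above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the "past 5 data points" for the trend include the current point; the index `t` must be large enough that all requested lags exist.

```python
def lagged_values(series, t, order):
    """Value of an input signal at time t plus the `order` preceding values."""
    return [series[t - lag] for lag in range(order + 1)]


def autoregressive_values(target, t, order):
    """Previous target values: one for order 0, two for order 1, and so on."""
    return [target[t - lag] for lag in range(1, order + 2)]


def quarter_indicators(quarter):
    """Four indicator variables for a standard quarter numbered 1 to 4."""
    return [1.0 if q == quarter else 0.0 for q in range(1, 5)]


def linear_trend(series, t, window=5):
    """Least-squares slope over the last `window` points, with time as regressor."""
    ys = series[t - window + 1 : t + 1]
    xs = range(window)
    x_mean = sum(xs) / window
    y_mean = sum(ys) / window
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


signal = [10, 12, 14, 16, 18, 20]
current_and_previous = lagged_values(signal, 5, order=1)
previous_target = autoregressive_values(signal, 5, order=0)
trend = linear_trend(signal, 5)
```

For this evenly increasing series, the order-1 features are the current and previous values, the autoregressive feature at order 0 is the single previous value, and the trend is the constant step of 2 per period.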
Preprocessing
With "Prediction model (ML)", the input signals and the target signal are always normalized to have a median value of 0.0 and a mean absolute deviation of 1.0, by applying a linear transform.
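A minimal sketch of such a linear transform is shown below. It assumes that the mean absolute deviation is measured around the median, which the text does not specify, and `normalize` is a hypothetical helper name. A constant series would give a scale of zero and is not handled here.

```python
from statistics import median


def normalize(values):
    """Linear transform to a median of 0.0 and a mean absolute deviation of 1.0."""
    center = median(values)
    # Assumption: deviations are measured around the median, not the mean.
    scale = sum(abs(v - center) for v in values) / len(values)
    return [(v - center) / scale for v in values]


normalized = normalize([2.0, 4.0, 6.0, 8.0, 20.0])
```

Because both statistics are robust to a single extreme value, the large final observation shifts the result far less than it would under mean and standard deviation scaling.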
With "Prediction model (TS)", the input signals and the target signal are also always normalized to have a mean absolute deviation of 1.0. However, the hyperoptimizer chooses whether or not to also normalize them to have a median value of 0.0.
With "Prediction model (TS)", the hyperoptimizer chooses whether or not to reduce the effect of outliers in the input data by applying a non-linear transform that reduces the magnitude of such outliers (the transform is approximately linear within ±3 standard deviations and grows only logarithmically beyond that).
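The exact transform is not given here, so the following is only one possible function with the described shape: it is exactly linear within the threshold and logarithmic beyond it. The name `dampen_outliers` and the specific formula are assumptions, and the input is assumed to be expressed in standard deviations.

```python
import math


def dampen_outliers(z, threshold=3.0):
    """Identity within +/- threshold; logarithmic growth in magnitude beyond it."""
    magnitude = abs(z)
    if magnitude <= threshold:
        return z
    # log1p keeps the function continuous and smooth at the threshold.
    damped = threshold + math.log1p(magnitude - threshold)
    return math.copysign(damped, z)


typical = dampen_outliers(2.0)
extreme = dampen_outliers(10.0)
```

A value of 10 standard deviations is mapped to roughly 5.08, while values inside the threshold pass through unchanged.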
Model types and parameters
When using "Automatic modelling" for classification problems, an Extreme Gradient Boosting model is always used.
When using "Automatic modelling" for regression problems, i.e. when using Prediction model (ML) or (TS), the hyperoptimization process will choose between three modelling techniques: Elastic Net, Neural Network and Extreme Gradient Boosting. Exabel has chosen these particular techniques because they are commonly used and tend to work well across a wide variety of modelling problems. They are also very different in nature, so they complement each other well and span the range of common use cases in modelling financial KPIs.
Alternatively, the user can manually choose from a longer list of modelling techniques in the configuration UI if they choose "Custom configuration".
In either case, the hyperoptimization process will also optimize the parameters of the modelling technique being used. The hyperparameters of the standard modelling techniques are shown below.
Elastic Net
Elastic net linear regression model from scikit-learn.
See documentation at:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
Parameter | min | max | default | Description |
---|---|---|---|---|
l1_ratio | 0.01 | 0.95 | 0.5 | The ElasticNet mixing parameter |
alpha | 0.001 | 1.0 | 0.01 | Constant that multiplies the penalty terms |
positive | False | True | False | When set to True, forces the coefficients to be positive |
Neural Network
Neural network regression model from scikit-learn.
See documentation at:
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
Parameter | min | max | default | Description |
---|---|---|---|---|
alpha | 0.00001 | 0.1 | 0.01 | L2 penalty (regularization term) parameter |
activation | "relu" | "logistic" | "logistic" | Activation function for the hidden layer |
n_layers | 1 | 2 | 1 | Number of hidden layers in the network |
hidden_size | 5 | 50 | 25 | Number of nodes in each hidden layer |
Extreme Gradient Boosting
XGBoost model, used for both regression and classification models.
See detailed documentation of parameters here:
https://xgboost.readthedocs.io/en/latest/parameter.html
Hyperparameters for regression models ("Prediction model (TS/ML)"):
Parameter | min | max | default | Description |
---|---|---|---|---|
objective | squared error | pseudo huber error | squared error | regression with squared loss or with Pseudo Huber loss, a twice differentiable alternative to absolute loss |
n_estimators | 40 | 150 | 100 | The number of decision trees used in boosting |
reg_alpha | 0.001 | 0.5 | 0 | L1 regularization term on weights. Increasing this value makes the model more conservative. |
reg_lambda | 0.1 | 2.0 | 0 | L2 regularization term on weights. Increasing this value makes the model more conservative. |
learning_rate | 0.03 | 0.9 | 0.3 | Step size shrinkage used in update to prevent overfitting |
subsample | 0.52 | 0.95 | 1 | Subsample ratio of the training instances |
max_depth | 1 | 6 | 6 | The maximum depth of each decision tree |
Hyperparameters for classification models:
Parameter | min | max | default | Description |
---|---|---|---|---|
n_estimators | 40 | 150 | 100 | The number of decision trees used in boosting |
learning_rate | 0.01 | 0.5 | 0.1 | Step size shrinkage used in update to prevent overfitting |
subsample | 0.52 | 0.95 | 1 | Subsample ratio of the training instances |
max_depth | 1 | 4 | 3 | The maximum depth of each decision tree |
Ensemble models
Combining multiple models into an ensemble model is frequently done in machine learning to increase robustness and improve accuracy. The idea is that different models may have learnt different things and have different "perspectives", and by combining them you average out some of the errors to produce better predictions.
If ensemble models are enabled (which they are by default), the system will consider assembling several models into an ensemble model. This happens after the hyperoptimization has completed, based on the evaluation results of the individual model configurations.
First, the system narrows down to a set of eligible model configurations. Any model configuration with a WAPE score more than 50% higher than that of the best-performing model is excluded. Then similar models are removed: any model whose cross-validation errors have a correlation higher than 80% with those of a better model is also excluded. This creates variation within the ensemble.
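The two filtering rules can be sketched as below. The data layout (dicts with `name`, `score` and `errors`) and the function names are hypothetical. The sketch also assumes that a candidate is compared only against better models that have themselves been retained, which the text leaves open.

```python
def pearson(xs, ys):
    """Pearson correlation between two equally long error series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def eligible_models(models, max_score_ratio=1.5, max_correlation=0.8):
    """Filter candidates by score and by error correlation with better models."""
    ranked = sorted(models, key=lambda m: m["score"])  # lower score is better
    best_score = ranked[0]["score"]
    kept = []
    for model in ranked:
        if model["score"] > max_score_ratio * best_score:
            continue  # score more than 50% higher than the best model's
        if any(pearson(model["errors"], better["errors"]) > max_correlation
               for better in kept):
            continue  # errors too similar to those of a better model
        kept.append(model)
    return kept


candidates = [
    {"name": "A", "score": 0.10, "errors": [1.0, -2.0, 0.5, 1.5]},
    {"name": "B", "score": 0.11, "errors": [1.1, -1.9, 0.4, 1.6]},
    {"name": "C", "score": 0.12, "errors": [-1.0, 0.5, 2.0, -0.5]},
    {"name": "D", "score": 0.20, "errors": [0.3, 0.1, -0.2, 0.4]},
]
kept_names = [m["name"] for m in eligible_models(candidates)]
```

In this toy data, B is dropped because its errors almost mirror those of A, and D is dropped because its score is more than 50% higher than the best one, leaving A and C as candidates.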
For the remaining candidate models, we calculate the covariances between their errors from the cross-validation (either walk-forward backtesting or 5-fold cross validation). Then an optimization routine assigns optimal weights (between 0.0 and 1.0) to each model in order to minimize the estimated variance of the ensemble. The single best-performing model configuration is always included in the ensemble, with a weight of at least 0.2.
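For the special case of two candidate models, the weight optimization has a closed-form solution, sketched below. The sketch assumes that the weights sum to 1.0, which the text does not state explicitly, and the function names are hypothetical. Because the ensemble variance is a convex quadratic in the weight, clipping the unconstrained optimum to the allowed range gives the constrained optimum.

```python
def two_model_weights(var_best, var_other, cov, min_best_weight=0.2):
    """Weights (best, other) that minimize the estimated ensemble error variance."""
    denom = var_best + var_other - 2.0 * cov  # variance of the error difference
    if denom <= 0.0:
        return 1.0, 0.0  # degenerate case: keep only the best model
    w_best = (var_other - cov) / denom        # unconstrained minimizer
    w_best = min(1.0, max(min_best_weight, w_best))
    return w_best, 1.0 - w_best


def ensemble_variance(w_best, var_best, var_other, cov):
    """Estimated error variance of the weighted two-model ensemble."""
    w_other = 1.0 - w_best
    return (w_best ** 2 * var_best + w_other ** 2 * var_other
            + 2.0 * w_best * w_other * cov)


w_best, w_other = two_model_weights(var_best=1.0, var_other=1.5, cov=0.2)
combined = ensemble_variance(w_best, 1.0, 1.5, 0.2)
single = two_model_weights(var_best=1.0, var_other=4.0, cov=1.5)
```

In the first case the ensemble variance falls below that of either model alone. In the second case the unconstrained optimum exceeds 1.0, so the best model receives the full weight and the result is a single model.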
The outcome of this process may be that only one model is selected, either because all models except the best-performing one were excluded, or because the weight optimization routine found that assigning a weight of 1.0 to the best-performing model gives the lowest estimated variance. Thus, even if ensemble modelling is enabled, it is normal for the result to be a single model. Otherwise, the result will typically be a combination of 2-3 models.