Hyper-Parameters Tuning

Hyper-parameter tuning in skfolio follows the same API as scikit-learn.

Hyper-parameters are parameters that are not directly learnt within estimators. They are passed as arguments to the constructor of the estimator classes.

It is possible and recommended to search the hyper-parameter space for the best cross-validation score.

Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use:

estimator.get_params()

A search consists of:

  • an estimator (such as MeanRisk)

  • a parameter space

  • a method for searching or sampling candidates

  • a cross-validation scheme

  • a score function

Two generic approaches to parameter search are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.
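The difference between the two strategies can be seen without fitting any model, by using scikit-learn's ParameterGrid and ParameterSampler, which generate candidates in the same way as GridSearchCV and RandomizedSearchCV respectively. The following is a minimal sketch: the parameter names and values are placeholders chosen for illustration, not recommended settings.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values is a candidate.
grid = ParameterGrid({"l2_coef": [0.01, 0.1, 1], "risk_aversion": [1, 2]})
grid_candidates = list(grid)
print(len(grid_candidates))  # 3 values x 2 values = 6 candidates

# Randomized search: a fixed number of candidates is drawn,
# whatever the size of the parameter space.
sampler = ParameterSampler(
    {"l2_coef": loguniform(0.01, 1), "risk_aversion": [1, 2]},
    n_iter=4,
    random_state=0,
)
random_candidates = list(sampler)
print(len(random_candidates))  # 4 candidates
```

Adding a third value to risk_aversion would grow the grid to 9 candidates, while the randomized sampler would still produce 4.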

After describing these tools we detail best practices applicable to these approaches.

Randomized Parameter Optimization

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

  • A budget can be chosen independent of the number of parameters and possible values.

  • Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified.

In principle, any object that provides an rvs (random variate sample) method can be passed. Consecutive calls to rvs should return independent random samples from the possible parameter values.
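As an illustration of the rvs requirement, here is a minimal sketch of a hand-written distribution. The class name LogSpaced is hypothetical and not part of skfolio or scikit-learn. The sketch assumes that scikit-learn's sampler calls rvs with a random_state keyword argument, so the method accepts one.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler
from sklearn.utils import check_random_state


class LogSpaced:
    """Hypothetical distribution: uniform in log10 space between low and high."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, random_state=None):
        # Reuse the random state handed over by the sampler so that
        # draws are reproducible when the search is seeded.
        rng = check_random_state(random_state)
        exponent = rng.uniform(np.log10(self.low), np.log10(self.high))
        return float(10.0 ** exponent)


sampler = ParameterSampler({"l2_coef": LogSpaced(0.01, 1)}, n_iter=5, random_state=42)
custom_draws = [candidate["l2_coef"] for candidate in sampler]
print(custom_draws)
```

In practice the ready-made distributions from scipy.stats are usually sufficient. A custom object is mainly useful when the sampling rule has no standard equivalent.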

The scipy.stats module contains many useful distributions for sampling parameters, such as expon, gamma, uniform, loguniform or randint.

For continuous parameters, such as l1_coef above, it is important to specify a continuous distribution to take full advantage of the randomization. This way, increasing n_iter will always lead to a finer search.

A continuous log-uniform random variable is the continuous version of a log-spaced parameter. For example, to specify the equivalent of l2_coef from above, loguniform(0.01, 1) can be used instead of [0.01, 0.1, 1].
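To see why a log-uniform distribution matches a log-spaced grid, the sketch below draws samples from loguniform(0.01, 1) and measures how they spread across the two decades of the interval. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import loguniform

dist = loguniform(0.01, 1)
samples = dist.rvs(size=10_000, random_state=0)

# Each decade should receive roughly the same share of the draws.
share_low = np.mean(samples < 0.1)    # share of draws in [0.01, 0.1)
share_high = np.mean(samples >= 0.1)  # share of draws in [0.1, 1]
print(round(share_low, 2), round(share_high, 2))
```

Both shares should be close to one half. A plain uniform distribution over the same interval would instead place about 90% of the draws above 0.1.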

Mirroring the example above in grid search, we can specify a continuous random variable that is log-uniformly distributed between 0.01 and 1:

import scipy.stats as stats
from skfolio import RiskMeasure

param_dist = {'l1_coef': stats.loguniform(0.01, 1), 'risk_measure': [RiskMeasure.SEMI_VARIANCE]}

Example:

import scipy.stats as stats
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

from skfolio import RiskMeasure
from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)

param_dist = {'l2_coef': stats.loguniform(0.01, 1), 'risk_measure': [RiskMeasure.CVAR]}

rd_search = RandomizedSearchCV(
    estimator=MeanRisk(min_weights=-1),
    cv=KFold(),
    n_iter=10,
    param_distributions=param_dist,
    n_jobs=-1  # using all cores
)
rd_search.fit(X_train)
print(rd_search.cv_results_)

best_model = rd_search.best_estimator_
print(best_model.weights_)