Hyper-Parameter Tuning

Hyper-parameter tuning in skfolio follows the same API as scikit-learn.

Hyper-parameters are parameters that are not directly learnt within estimators. They are passed as arguments to the constructor of the estimator classes.

It is possible and recommended to search the hyper-parameter space for the best cross-validation score.

Any parameter provided when constructing an estimator may be optimized in this manner. To list the names and current values of all parameters of a given estimator, use:

estimator.get_params()

A search consists of:

an estimator (such as MeanRisk)
a parameter space
a method for searching or sampling candidates
a cross-validation scheme
a score function

Two generic approaches to parameter search are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.

After describing these tools, we detail best practices applicable to these approaches.

Exhaustive Grid Search

The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid:

param_grid = [
    {'l1_coef': [0.001, 0.01, 0.1], 'risk_measure': [RiskMeasure.SEMI_VARIANCE]},
    {'l1_coef': [0.001, 0.01, 0.1], 'l2_coef': [0.01, 0.1, 1], 'risk_measure': [RiskMeasure.CVAR]},
]

specifies that two grids should be explored: the first with a Semi-Variance risk measure and l1_coef values in [0.001, 0.01, 0.1] (three candidates), and the second with a CVaR risk measure and the cross-product of l1_coef values in [0.001, 0.01, 0.1] and l2_coef values in [0.01, 0.1, 1] (nine candidates).
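To make the size of this search concrete, the expansion of a list of grids into individual candidates can be sketched with the standard library alone. This is only an illustration of the cross-product logic, not scikit-learn's implementation: the helper name `expand_grid` is hypothetical, and plain strings stand in for the `RiskMeasure` members.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per candidate from a list of grids (hypothetical helper)."""
    for grid in param_grid:
        names = sorted(grid)  # fixed order so the output is reproducible
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


# Strings stand in for RiskMeasure.SEMI_VARIANCE and RiskMeasure.CVAR.
param_grid = [
    {"l1_coef": [0.001, 0.01, 0.1], "risk_measure": ["SEMI_VARIANCE"]},
    {
        "l1_coef": [0.001, 0.01, 0.1],
        "l2_coef": [0.01, 0.1, 1],
        "risk_measure": ["CVAR"],
    },
]

candidates = list(expand_grid(param_grid))
print(len(candidates))  # 3 from the first grid + 3 * 3 from the second
```

In practice, scikit-learn's `ParameterGrid` class performs this expansion for you; the sketch only shows why the candidate count grows multiplicatively within each grid and additively across grids.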

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset, all possible combinations of parameter values are evaluated and the best combination is retained.

Example:

from sklearn.model_selection import GridSearchCV, KFold, train_test_split

from skfolio import RiskMeasure
from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)

param_grid = [
    {'l1_coef': [0.001, 0.01, 0.1], 'risk_measure': [RiskMeasure.SEMI_VARIANCE]},
    {'l1_coef': [0.001, 0.01, 0.1], 'l2_coef': [0.01, 0.1, 1], 'risk_measure': [RiskMeasure.CVAR]},
]

grid_search = GridSearchCV(
    estimator=MeanRisk(min_weights=-1),
    cv=KFold(),
    param_grid=param_grid,
    n_jobs=-1  # using all cores
)
grid_search.fit(X_train)
print(grid_search.cv_results_)

best_model = grid_search.best_estimator_
print(best_model.weights_)

Randomized Parameter Optimization

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

A budget can be chosen independent of the number of parameters and possible values.
Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified.

In principle, any object that provides an rvs (random variate sample) method to sample a value can be passed. Consecutive calls to rvs should return independent random samples from the possible parameter values.
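As an illustration of this interface, here is a minimal sketch of a custom sampler. The class name `UniformSampler` and its arguments are hypothetical, and the sketch assumes that the search passes a random generator exposing a `uniform(low, high)` method through the `random_state` keyword. NumPy's `RandomState` has such a method, as does the standard library's `random.Random`, which is used here for the demonstration.

```python
import random


class UniformSampler:
    """Hypothetical sampler exposing the rvs interface described above."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, random_state=None):
        # Assumption: random_state is a generator object with a uniform() method.
        rng = random_state if random_state is not None else random.Random()
        return float(rng.uniform(self.low, self.high))


rng = random.Random(0)
dist = UniformSampler(0.0, 0.1)
samples = [dist.rvs(random_state=rng) for _ in range(5)]
print(samples)  # five independent draws between 0.0 and 0.1
```

For a plain uniform distribution, `scipy.stats.uniform` already covers this case; the class above only shows the shape of the interface a custom distribution would need.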

The scipy.stats module contains many useful distributions for sampling parameters, such as expon, gamma, uniform, loguniform or randint.

For continuous parameters, such as l1_coef above, it is important to specify a continuous distribution to take full advantage of the randomization. This way, increasing n_iter will always lead to a finer search.

A continuous log-uniform random variable is the continuous version of a log-spaced parameter. For example, to specify the equivalent of l2_coef from above, loguniform(0.01, 1) can be used instead of [0.01, 0.1, 1].

Mirroring the grid search example above, we can specify a continuous random variable that is log-uniformly distributed between 0.01 and 1:

import scipy.stats as stats
from skfolio import RiskMeasure

{'l2_coef': stats.loguniform(0.01, 1), 'risk_measure': [RiskMeasure.CVAR]}
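To see what "log-uniform" means in practice, the sampling can be sketched with the standard library: a value is log-uniform when its logarithm is uniformly distributed. The function name `sample_loguniform` is hypothetical, and this is an illustration of the idea rather than the scipy implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_loguniform(0.01, 1.0, rng) for _ in range(10_000)]

# Each decade should receive roughly the same share of the draws.
share_low_decade = sum(s < 0.1 for s in samples) / len(samples)
print(f"share of draws in [0.01, 0.1): {share_low_decade:.2f}")
```

Roughly half of the draws land in each of the two decades [0.01, 0.1) and [0.1, 1], which is the continuous analogue of the evenly log-spaced grid [0.01, 0.1, 1].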

Example:

import scipy.stats as stats
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

from skfolio import RiskMeasure
from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)

param_dist = {'l2_coef': stats.loguniform(0.01, 1), 'risk_measure': [RiskMeasure.CVAR]}

rd_search = RandomizedSearchCV(
    estimator=MeanRisk(min_weights=-1),
    cv=KFold(),
    n_iter=10,
    param_distributions=param_dist,
    n_jobs=-1  # using all cores
)
rd_search.fit(X_train)
print(rd_search.cv_results_)

best_model = rd_search.best_estimator_
print(best_model.weights_)

Tips for Parameter Search

Specifying an Objective Metric

By default, all portfolio optimization estimators share the same score function, the Sharpe Ratio. This score function can be customized with make_scorer, either by using another measure or by writing your own score function.

Example:

In the example below, the Sortino Ratio is used instead of the default Sharpe Ratio:

from sklearn.model_selection import GridSearchCV, KFold, train_test_split

from skfolio import RatioMeasure
from skfolio.datasets import load_sp500_dataset
from skfolio.metrics import make_scorer
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)

scoring = make_scorer(RatioMeasure.SORTINO_RATIO)

grid_search = GridSearchCV(
    estimator=MeanRisk(min_weights=-1),
    cv=KFold(),
    param_grid={'l2_coef': [0.0001, 0.001, 0.01, 1]},
    scoring=scoring
)

grid_search.fit(X_train)
print(grid_search.cv_results_)

best_model = grid_search.best_estimator_
print(best_model.weights_)

pred = best_model.predict(X_test)
print(pred.sortino_ratio)

Example:

In this example, we use a custom score function:

from skfolio.metrics import make_scorer

def custom_score(pred):
    # pred is the portfolio predicted on a test fold.
    return pred.mean - 2 * pred.variance - 3 * pred.semi_variance

scoring = make_scorer(custom_score)

Composite Estimators and Parameter Spaces

GridSearchCV and RandomizedSearchCV allow searching over parameters of composite or nested estimators using a dedicated <estimator>__<parameter> syntax.
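The way such a name addresses a nested parameter can be sketched with toy objects. The `Node` class and the `set_nested_param` helper are hypothetical and only illustrate how the double-underscore path is walked; scikit-learn's own `set_params` does more, for instance it validates parameter names.

```python
class Node:
    """Hypothetical stand-in for an estimator holding parameters as attributes."""

    def __init__(self, **params):
        for name, value in params.items():
            setattr(self, name, value)


def set_nested_param(estimator, key, value):
    """Set a parameter addressed with the double-underscore syntax."""
    *path, name = key.split("__")
    target = estimator
    for part in path:  # walk down to the innermost estimator
        target = getattr(target, part)
    setattr(target, name, value)


# Toy objects nested three levels deep: model -> prior_estimator -> mu_estimator.
model = Node(prior_estimator=Node(mu_estimator=Node(alpha=0.2)))
set_nested_param(model, "prior_estimator__mu_estimator__alpha", 0.01)
print(model.prior_estimator.mu_estimator.alpha)
```

Each segment before the last names a nested estimator, and the final segment names the parameter to set on it. A key without any double underscore sets a parameter on the top-level estimator.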

Example:

In the example below, we search for the optimal alpha parameter of the nested estimator EWMu:

from sklearn.model_selection import GridSearchCV, KFold, train_test_split

from skfolio.datasets import load_sp500_dataset
from skfolio.moments import EWMu
from skfolio.optimization import MeanRisk, ObjectiveFunction
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior

prices = load_sp500_dataset()
X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)

model = MeanRisk(
    objective_function=ObjectiveFunction.MAXIMIZE_RATIO,
    prior_estimator=EmpiricalPrior(mu_estimator=EWMu(alpha=0.2)),
)

print(model.get_params(deep=True))

param_grid = {"prior_estimator__mu_estimator__alpha": [0.001, 0.01, 0.1]}

grid_search = GridSearchCV(
    estimator=model,
    cv=KFold(),
    param_grid=param_grid,
)

grid_search.fit(X_train)
print(grid_search.best_estimator_)

Example:

The same logic applies to Pipeline. Here we search for the optimal risk measure of MeanRisk, which is part of a Pipeline:

from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline

from skfolio import RiskMeasure
from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.pre_selection import SelectKExtremes
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)

model = Pipeline(
    [
        ("pre_selection", SelectKExtremes(k=10, highest=True)),
        ("optimization", MeanRisk()),
    ]
)

param_grid = {
    "optimization__risk_measure": [RiskMeasure.SEMI_VARIANCE, RiskMeasure.CVAR]
}

grid_search = GridSearchCV(
    estimator=model,
    cv=KFold(),
    param_grid=param_grid,
)

grid_search.fit(X_train)
print(grid_search.best_estimator_)

Parallelism

The parameter search tools evaluate each parameter combination on each data fold independently. Computations can be run in parallel by passing n_jobs=-1. See the function signatures for more details.
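To gauge how much work there is to distribute, the number of model fits can be estimated up front. The helper `count_fits` is hypothetical. It assumes the scikit-learn defaults used in the examples on this page, namely `KFold()` with five splits and a final refit of the best candidate on the whole training set.

```python
def count_fits(n_candidates, n_splits=5, refit=True):
    """Number of model fits performed by a search (hypothetical helper)."""
    # One fit per candidate and per fold, plus one final refit of the best one.
    return n_candidates * n_splits + (1 if refit else 0)


grid_candidates = 3 + 3 * 3  # the two grids of the exhaustive search example
grid_fits = count_fits(grid_candidates)
random_fits = count_fits(10)  # n_iter=10 in the randomized search example

print(grid_fits)    # 61
print(random_fits)  # 51
```

The per-fold fits are independent of each other and can be spread across cores, whereas the final refit only happens once the cross-validation results are in.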