Model Selection#

The Model Selection module extends sklearn.model_selection with additional methods tailored for portfolio selection.

Cross-Validation Prediction#

Every skfolio estimator is compatible with sklearn.model_selection.cross_val_predict. We also implement our own cross_val_predict for enhanced integration with Portfolio and Population objects, as well as compatibility with CombinatorialPurgedCV.

Danger

When using scikit-learn selection tools like KFold or train_test_split, ensure that the parameter shuffle is set to False to avoid data leakage. Financial features often incorporate series that exhibit serial correlation (like ARMA processes), and shuffling the data will lead to leakage from the test set into the training set.
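For instance, a leakage-free chronological split can be written as follows (a minimal sketch; the test_size value is an arbitrary choice):

import numpy as np
from sklearn.model_selection import train_test_split

from skfolio.datasets import load_sp500_dataset
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)

# shuffle=False keeps the chronological order and prevents leakage
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)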

In cross_val_predict, the data is split according to the cv parameter. The portfolio optimization estimator is fitted on the training set and portfolios are predicted on the corresponding test set.

For non-combinatorial cross-validation like KFold, the output is the predicted MultiPeriodPortfolio, where each Portfolio corresponds to the prediction on one train/test pair (k portfolios for KFold).

For combinatorial cross-validation like CombinatorialPurgedCV, the output is the predicted Population of multiple MultiPeriodPortfolio. This is because each test output is a collection of multiple paths instead of one single path.

Example:

import numpy as np
from sklearn.model_selection import KFold

from skfolio.datasets import load_sp500_dataset
from skfolio.model_selection import CombinatorialPurgedCV, cross_val_predict
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)

# One single path -> pred is a MultiPeriodPortfolio
pred = cross_val_predict(MeanRisk(), X, cv=KFold())
print(pred.sharpe_ratio)
np.asarray(pred)  # predicted returns vector

# Multiple paths -> pred is a Population of MultiPeriodPortfolio
pred = cross_val_predict(MeanRisk(), X, cv=CombinatorialPurgedCV())
print(pred.summary())
print(np.asarray(pred))  # predicted returns matrix

Combinatorial Purged Cross-Validation#

Compared to KFold, which splits the data into k folds and generates one single testing path, the CombinatorialPurgedCV uses combinations of multiple train/test sets to generate multiple testing paths.
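To see where the number of paths comes from, note that choosing n_test_folds test folds out of n_folds folds yields C(n_folds, n_test_folds) train/test combinations, and each fold appears in a fraction n_test_folds/n_folds of them. The arithmetic below illustrates this with the default parameters (a plain combinatorial sketch, not part of the skfolio API):

from math import comb

n_folds, n_test_folds = 10, 8

# Number of distinct train/test combinations
n_combinations = comb(n_folds, n_test_folds)  # C(10, 8) = 45

# Each fold appears in n_combinations * n_test_folds / n_folds test sets,
# so the test outputs can be recombined into that many full paths
n_test_paths = n_combinations * n_test_folds // n_folds  # 45 * 8 / 10 = 36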

To avoid data leakage, purging and embargoing can be performed.

Purging consists of removing from the training set all observations whose labels overlapped in time with the labels included in the testing set. Embargoing consists of removing from the training set observations that immediately follow an observation in the testing set, since financial features often incorporate series that exhibit serial correlation (like ARMA processes).
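Both mechanisms are controlled directly on the cross-validator via its purged_size and embargo_size parameters (the sizes below are arbitrary choices for illustration):

from skfolio.model_selection import CombinatorialPurgedCV

# Purge the 10 observations overlapping each test fold and embargo the
# 5 observations immediately following it
cv = CombinatorialPurgedCV(purged_size=10, embargo_size=5)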

When used with cross_val_predict, the object returned is a Population of MultiPeriodPortfolio representing each prediction path.

Example:

from skfolio import RatioMeasure
from skfolio.datasets import load_sp500_dataset
from skfolio.model_selection import CombinatorialPurgedCV, cross_val_predict
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns

prices = load_sp500_dataset()
X = prices_to_returns(prices)

pred = cross_val_predict(MeanRisk(), X, cv=CombinatorialPurgedCV())
print(pred.summary())

portfolio = pred.quantile(measure=RatioMeasure.SHARPE_RATIO, q=0.95)
print(portfolio.annualized_sharpe_ratio)

The default parameters of the CombinatorialPurgedCV are n_folds=10 and n_test_folds=8. You may want to choose these parameters to target a given number of test paths and average training size; the latter depends on the number of observations. For that, you can use the function optimal_folds_number, as shown in the HRP vs HERC example.

from skfolio.model_selection import optimal_folds_number

# X_test is the out-of-sample returns DataFrame (split with shuffle=False)
n_folds, n_test_folds = optimal_folds_number(
    n_observations=X_test.shape[0],
    target_n_test_paths=100,
    target_train_size=252,
)

cv = CombinatorialPurgedCV(n_folds=n_folds, n_test_folds=n_test_folds)
cv.summary(X_test)