Drop Highly Correlated Assets#

This tutorial introduces the pre-selection transformer DropCorrelated to remove highly correlated assets before optimization.

Highly correlated assets tend to increase the instability of mean-variance optimization.
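
As a rough illustration (a minimal sketch using NumPy only, independent of the dataset loaded below), the condition number of a two-asset covariance matrix explodes as the correlation approaches one, which makes the optimal weights very sensitive to small estimation errors:

import numpy as np

# Unit-variance covariance matrix of two assets with correlation rho.
# Its condition number is (1 + rho) / (1 - rho), which blows up as rho -> 1.
for rho in (0.1, 0.9, 0.99):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    print(f"rho={rho:.2f}  condition number={np.linalg.cond(cov):.1f}")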

In this example, we will compare a mean-variance optimization with and without pre-selection.

Data#

We load the FTSE 100 dataset, composed of the daily prices of 64 assets from the FTSE 100 Index, from 2000-01-04 to 2023-05-31:

from plotly.io import show
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from skfolio import Population, RatioMeasure
from skfolio.datasets import load_ftse100_dataset
from skfolio.model_selection import (
    CombinatorialPurgedCV,
    cross_val_predict,
    optimal_folds_number,
)
from skfolio.optimization import MeanRisk, ObjectiveFunction
from skfolio.pre_selection import DropCorrelated
from skfolio.preprocessing import prices_to_returns

prices = load_ftse100_dataset()

X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)
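
As a quick sanity check (a minimal sketch, not part of the original example), we can verify the dataset dimensions and date range described above, as well as the size of the chronological train/test split:

# 64 assets, daily prices from 2000-01-04 to 2023-05-31 (per the description above)
print(prices.shape)
print(prices.index[0], prices.index[-1])
# shuffle=False keeps the split chronological; the last 33% of days form the test set
print(X_train.shape, X_test.shape)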

Model#

First, we create a maximum Sharpe ratio model without pre-selection and fit it on the training set:

model1 = MeanRisk(objective_function=ObjectiveFunction.MAXIMIZE_RATIO)
model1.fit(X_train)
model1.weights_
array([1.11484352e-11, 6.92866673e-02, 2.99620661e-02, 1.10758846e-10,
       1.16254780e-11, 4.14964179e-11, 2.84069930e-11, 1.22736942e-11,
       1.38469617e-01, 4.65046123e-04, 2.67826275e-10, 4.03454161e-11,
       8.79558938e-03, 1.85405564e-11, 1.75767841e-11, 5.92183632e-11,
       8.30675668e-02, 6.89279675e-11, 2.31563874e-11, 2.74514930e-11,
       3.13808863e-02, 1.84375687e-11, 3.89248471e-02, 1.30775206e-11,
       1.08797657e-01, 2.02346903e-11, 1.81108516e-01, 3.40116574e-11,
       1.18849904e-11, 3.64862543e-11, 2.09437885e-11, 9.84526736e-12,
       5.12587013e-10, 8.82872903e-12, 8.41661311e-02, 1.27314017e-11,
       6.20495386e-03, 2.09702783e-11, 3.38535227e-11, 1.68752474e-11,
       1.12489362e-01, 3.53621465e-11, 1.48148340e-11, 1.63374918e-11,
       1.94513306e-11, 2.11982861e-11, 1.84887386e-11, 9.74395933e-11,
       1.79633825e-11, 2.74523289e-11, 3.80823150e-03, 8.28133405e-02,
       2.98247367e-03, 1.54391616e-11, 3.96913808e-11, 1.72770463e-02,
       2.20575709e-11, 2.42859778e-11, 4.36110766e-11, 6.52209193e-11,
       3.12158920e-11, 1.00965673e-10, 1.80664221e-11, 1.14323018e-10])

Pipeline#

Then, we create a maximum Sharpe ratio model with pre-selection inside a Pipeline and fit it on the training set:

set_config(transform_output="pandas")  # keep asset names (DataFrame output) through the pipeline

model2 = Pipeline(
    [
        ("pre_selection", DropCorrelated(threshold=0.5)),
        ("optimization", MeanRisk(objective_function=ObjectiveFunction.MAXIMIZE_RATIO)),
    ]
)
model2.fit(X_train)
model2.named_steps["optimization"].weights_
array([8.17183301e-02, 2.99627408e-02, 5.75907600e-11, 2.59687833e-11,
       1.82564384e-01, 2.92880653e-03, 1.52217700e-10, 1.13239659e-02,
       1.76926114e-11, 1.62035863e-11, 4.94836568e-11, 8.20549958e-02,
       6.08919173e-11, 2.64696218e-11, 3.58312606e-02, 4.20210608e-02,
       1.25426519e-11, 1.83697624e-11, 1.84916828e-01, 2.91846109e-11,
       1.96049980e-11, 1.05201907e-03, 8.54482537e-12, 8.99284245e-02,
       1.91471166e-11, 2.98391244e-11, 1.23117943e-01, 2.98558904e-11,
       1.40973224e-11, 1.60306246e-11, 1.91741539e-11, 1.73658093e-11,
       5.65079997e-11, 1.67279164e-11, 8.53270264e-03, 8.56311233e-02,
       1.32678922e-02, 1.48213167e-11, 3.17212380e-11, 2.51475217e-02,
       2.12051211e-11, 2.42766417e-11, 4.17359260e-11, 2.77319466e-11,
       5.40287802e-11, 7.07120393e-11])
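
To see which assets survived the pre-selection step, we can pass the training set through the fitted "pre_selection" step and inspect the remaining columns (a minimal sketch relying only on the transformer's standard transform method; with transform_output="pandas" the result is a DataFrame of the kept assets):

X_train_selected = model2.named_steps["pre_selection"].transform(X_train)
# 46 assets remain after DropCorrelated(threshold=0.5), matching the weights above
print(X_train_selected.shape[1])
print(list(X_train_selected.columns)[:5])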

Prediction#

We predict both models on the test set:

ptf1 = model1.predict(X_test)
ptf1.name = "model1"
ptf2 = model2.predict(X_test)
ptf2.name = "model2"

print(ptf1.n_assets)
print(ptf2.n_assets)
64
46

Each predicted object is a MultiPeriodPortfolio. For improved analysis, we can add them to a Population:

population = Population([ptf1, ptf2])
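
For instance (a hedged sketch assuming the Population.summary() helper), we can print a side-by-side table of standard metrics for the two test-set portfolios:

# Summary statistics of both portfolios, one column per portfolio
print(population.summary())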

Let's plot the portfolios' cumulative returns on the test set:

population.plot_cumulative_returns()


Combinatorial Purged Cross-Validation#

Using only one testing path (the historical path) may not be enough to compare both models. For a more robust analysis, we can use CombinatorialPurgedCV to create multiple testing paths from different combinations of training folds.

We choose n_folds and n_test_folds to obtain around 100 test paths and an average training size of 800 days:

n_folds, n_test_folds = optimal_folds_number(
    n_observations=X_test.shape[0],
    target_n_test_paths=100,
    target_train_size=800,
)

cv = CombinatorialPurgedCV(n_folds=n_folds, n_test_folds=n_test_folds)
cv.summary(X_test)
Number of Observations             1967
Total Number of Folds                10
Number of Test Folds                  6
Purge Size                            0
Embargo Size                          0
Average Training Size               786
Number of Test Paths                126
Number of Training Combinations     210
dtype: int64
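
The number of test paths reported above follows the usual combinatorial purged CV counting: with N folds and k test folds there are C(N, k) training combinations and C(N, k) * k / N distinct test paths. A quick check (a sketch using only the standard library):

from math import comb

n_combinations = comb(n_folds, n_test_folds)             # C(10, 6) = 210
n_test_paths = n_combinations * n_test_folds // n_folds  # 210 * 6 / 10 = 126
print(n_combinations, n_test_paths)

We now run cross_val_predict for both models on the test set:
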
pred_1 = cross_val_predict(
    model1,
    X_test,
    cv=cv,
    n_jobs=-1,
    portfolio_params=dict(annualized_factor=252, tag="model1"),
)

pred_2 = cross_val_predict(
    model2,
    X_test,
    cv=cv,
    n_jobs=-1,
    portfolio_params=dict(annualized_factor=252, tag="model2"),
)

Each predicted object is a Population of MultiPeriodPortfolio. Each MultiPeriodPortfolio represents one testing path of a rolling portfolio. For improved analysis, we can merge the two populations:

population = pred_1 + pred_2
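
Each prediction holds one MultiPeriodPortfolio per test path, so (a quick check, assuming Population behaves like a list, consistent with the "+" merge above) both predictions should contain the 126 paths reported by the CV summary:

# One MultiPeriodPortfolio per test path (126 each), 252 after merging
print(len(pred_1), len(pred_2))
print(len(population))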

Distribution#

We plot the out-of-sample distribution of Sharpe ratio for both models:

fig = population.plot_distribution(
    measure_list=[RatioMeasure.SHARPE_RATIO], tag_list=["model1", "model2"], n_bins=40
)
show(fig)

Model 1:

print(
    "Average of Annualized Sharpe Ratio:"
    f" {pred_1.measures_mean(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
print(
    "Std of Annualized Sharpe Ratio:"
    f" {pred_1.measures_std(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
Average of Annualized Sharpe Ratio: 0.46
Std of Annualized Sharpe Ratio: 0.20

Model 2:

print(
    "Average of Annualized Sharpe Ratio:"
    f" {pred_2.measures_mean(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
print(
    "Std of Annualized Sharpe Ratio:"
    f" {pred_2.measures_std(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
Average of Annualized Sharpe Ratio: 0.51
Std of Annualized Sharpe Ratio: 0.21
