Drop Highly Correlated Assets

This tutorial introduces the pre-selection transformer DropCorrelated, which removes highly correlated assets before the optimization.

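Highly correlated assets tend to increase the instability of mean-variance optimization: the covariance matrix becomes close to singular, so small changes in the estimated inputs can lead to large changes in the optimal weights.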
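To see where the instability comes from, here is a minimal NumPy sketch that is not part of skfolio. It assumes two hypothetical assets with equal volatility and uses the textbook unconstrained maximum Sharpe solution (weights proportional to the inverse covariance matrix times the expected returns), which differs from the constrained problem solved by MeanRisk. We bump one expected return slightly and compare how far the weights move at low and high correlation:

```python
import numpy as np


def tangency_weights(mu, cov):
    """Unconstrained maximum Sharpe weights, proportional to inv(cov) @ mu."""
    raw = np.linalg.solve(cov, mu)
    return raw / raw.sum()


def two_asset_cov(rho, vol=0.2):
    """Covariance of two assets with equal volatility and correlation rho."""
    return vol**2 * np.array([[1.0, rho], [rho, 1.0]])


# Hypothetical expected returns, and the same vector with a small
# estimation error added to the first asset.
mu = np.array([0.05, 0.06])
mu_bumped = mu + np.array([0.005, 0.0])

shifts = {}
for rho in (0.2, 0.95):
    cov = two_asset_cov(rho)
    w = tangency_weights(mu, cov)
    w_bumped = tangency_weights(mu_bumped, cov)
    # Largest absolute change in any weight caused by the bump.
    shifts[rho] = float(np.abs(w_bumped - w).max())
    print(
        f"rho={rho}: condition number={np.linalg.cond(cov):.1f}, "
        f"max weight shift={shifts[rho]:.3f}"
    )
```

With these assumed inputs, the covariance matrix at a correlation of 0.95 is far more ill-conditioned than at 0.2, and the same small change in expected return moves the weights much further.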

In this example, we will compare a mean-variance optimization with and without pre-selection.

Data

We load the FTSE 100 dataset, which contains the daily prices of 64 assets from the FTSE 100 Index composition from 2000-01-04 to 2023-05-31:

from plotly.io import show
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from skfolio import Population, RatioMeasure
from skfolio.datasets import load_ftse100_dataset
from skfolio.model_selection import (
    CombinatorialPurgedCV,
    cross_val_predict,
    optimal_folds_number,
)
from skfolio.optimization import MeanRisk, ObjectiveFunction
from skfolio.pre_selection import DropCorrelated
from skfolio.preprocessing import prices_to_returns

prices = load_ftse100_dataset()

X = prices_to_returns(prices)
X_train, X_test = train_test_split(X, test_size=0.33, shuffle=False)
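
The two preparation steps above can be sketched in plain NumPy. This is an illustration only, not skfolio's or scikit-learn's implementation: it assumes linear (simple) returns, uses hypothetical prices, and mimics the effect of shuffle=False by slicing the most recent rows off as the test set:

```python
import numpy as np

# Hypothetical daily prices for two assets (rows are consecutive days).
prices = np.array(
    [
        [100.0, 50.0],
        [101.0, 49.5],
        [102.01, 50.49],
        [100.99, 51.0],
    ]
)

# Linear return for day t: price[t] / price[t - 1] - 1.
# The result has one row fewer than the price matrix.
returns = prices[1:] / prices[:-1] - 1.0

# Chronological split: the oldest rows are used for training and the most
# recent rows for testing, so no future information leaks into training.
test_size = 0.33
n_test = int(np.ceil(test_size * len(returns)))
train, test = returns[:-n_test], returns[-n_test:]

print(train.shape, test.shape)
```

Keeping the observations in time order matters here because the models are later evaluated on data that follows the training period.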

Model

First, we create a maximum Sharpe ratio model without pre-selection and fit it on the training set:

model1 = MeanRisk(objective_function=ObjectiveFunction.MAXIMIZE_RATIO)
model1.fit(X_train)
model1.weights_
array([5.72489247e-08, 6.92799374e-02, 2.99565070e-02, 5.15427367e-07,
       5.98248542e-08, 2.11954169e-07, 1.45472664e-07, 6.30531933e-08,
       1.38474971e-01, 5.39755883e-04, 1.19668649e-06, 2.04407314e-07,
       8.79495071e-03, 9.56173665e-08, 9.05345542e-08, 3.00727418e-07,
       8.30609346e-02, 3.48492426e-07, 1.18995204e-07, 1.41496349e-07,
       3.13544690e-02, 9.49328188e-08, 3.89233565e-02, 6.73529746e-08,
       1.08791263e-01, 1.03983576e-07, 1.81104294e-01, 1.71678031e-07,
       6.17026726e-08, 1.85876177e-07, 1.06482315e-07, 5.09757679e-08,
       1.97934913e-06, 4.57814906e-08, 8.41631508e-02, 6.52279083e-08,
       6.16740873e-03, 1.07868906e-07, 1.72699953e-07, 8.59571972e-08,
       1.12484994e-01, 1.77846764e-07, 7.68211297e-08, 8.46476487e-08,
       9.91391602e-08, 1.08820473e-07, 9.52131430e-08, 4.71020368e-07,
       9.27208518e-08, 1.40414869e-07, 3.82148954e-03, 8.28101461e-02,
       3.04499311e-03, 7.93434219e-08, 1.98631697e-07, 1.72168221e-02,
       1.14241628e-07, 1.24128911e-07, 2.22757657e-07, 3.33244835e-07,
       1.59411484e-07, 4.79955612e-07, 9.26676073e-08, 5.55225622e-07])

Pipeline

Then, we create a maximum Sharpe ratio model with pre-selection using Pipeline and fit it on the training set:

set_config(transform_output="pandas")

model2 = Pipeline(
    [
        ("pre_selection", DropCorrelated(threshold=0.5)),
        ("optimization", MeanRisk(objective_function=ObjectiveFunction.MAXIMIZE_RATIO)),
    ]
)
model2.fit(X_train)
model2.named_steps["optimization"].weights_
array([8.18629046e-02, 2.99990921e-02, 1.16397541e-06, 3.33347825e-07,
       1.82548038e-01, 2.85482485e-03, 3.56265072e-06, 1.10513944e-02,
       2.32253197e-07, 2.05141268e-07, 7.34710301e-07, 8.21495217e-02,
       9.52097597e-07, 3.41747905e-07, 3.58028595e-02, 4.20420351e-02,
       1.56245104e-07, 2.35646825e-07, 1.85053358e-01, 3.80020780e-07,
       2.42398710e-07, 1.15032452e-03, 1.09539530e-07, 9.00503120e-02,
       2.41991003e-07, 3.83742726e-07, 1.23265556e-01, 3.91693342e-07,
       1.83251807e-07, 2.09337576e-07, 2.38594386e-07, 2.26123969e-07,
       1.11576161e-06, 2.18898632e-07, 8.33465412e-03, 8.57587244e-02,
       1.29547016e-02, 1.87531533e-07, 4.25937609e-07, 2.51052265e-02,
       2.73616225e-07, 3.03373275e-07, 5.56618079e-07, 3.50951178e-07,
       1.05787166e-06, 1.45723457e-06])
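
To give a feel for what a correlation threshold means in practice, here is a simplified greedy filter written in NumPy. It is a sketch under stated assumptions and not the algorithm used by DropCorrelated, which applies its own rule to decide which asset of a correlated pair to remove. The helper greedy_drop_correlated is hypothetical: it walks through the assets in column order and keeps one only if its correlation with every asset already kept is at or below the threshold:

```python
import numpy as np


def greedy_drop_correlated(returns, threshold=0.5):
    """Return the indices of the kept columns (simplified illustration)."""
    corr = np.corrcoef(returns, rowvar=False)
    kept = []
    for j in range(corr.shape[1]):
        # Keep asset j only if it is not too correlated with any kept asset.
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept


rng = np.random.default_rng(0)
# Three independent synthetic return series.
base = rng.normal(size=(500, 3))
# A fourth series that is a noisy copy of the first one, so the two are
# highly correlated.
noisy_copy = base[:, [0]] + 0.1 * rng.normal(size=(500, 1))
returns = np.hstack([base, noisy_copy])

print(greedy_drop_correlated(returns, threshold=0.5))
```

On this synthetic data the noisy copy is dropped and the three independent series are kept. A first-come rule like this one depends on the column order, which is one reason a library implementation may choose a different tie-breaking rule.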

Prediction

We call predict on both models using the test set:

ptf1 = model1.predict(X_test)
ptf1.name = "model1"
ptf2 = model2.predict(X_test)
ptf2.name = "model2"

print(ptf1.n_assets)
print(ptf2.n_assets)
64
46

Each predicted object is a Portfolio. For improved analysis, we can add them to a Population:

population = Population([ptf1, ptf2])

Let’s plot the portfolios' cumulative returns on the test set:

fig = population.plot_cumulative_returns()
show(fig)


Combinatorial Purged Cross-Validation

Using only one testing path (the historical path) may not be enough to compare the two models. For a more robust analysis, we can use CombinatorialPurgedCV to create multiple testing paths from different combinations of training folds.
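
The number of splits and test paths follows from simple combinatorics. The sketch below uses only the Python standard library and ignores purging and embargoing; the helper cpcv_counts is a hypothetical name, not a skfolio function. It lists every train/test fold combination for a toy case and then evaluates the counts for 10 folds with 6 test folds, the configuration that appears in the summary later in this section:

```python
from itertools import combinations
from math import comb


def cpcv_counts(n_folds, n_test_folds):
    """Number of train/test splits and of recombined test paths."""
    n_splits = comb(n_folds, n_test_folds)
    # Each fold serves as a test fold in n_splits * n_test_folds / n_folds
    # splits, which is the number of full-length test paths available.
    n_paths = n_splits * n_test_folds // n_folds
    return n_splits, n_paths


# Toy case: 4 folds, 2 of them used for testing in each split.
for test_folds in combinations(range(4), 2):
    train_folds = [f for f in range(4) if f not in test_folds]
    print(f"test folds={test_folds}, train folds={train_folds}")

print(cpcv_counts(4, 2))   # (6, 3)
print(cpcv_counts(10, 6))  # (210, 126)
```

Under the same simplification, each split trains on (n_folds - n_test_folds) / n_folds of the observations, which for 1967 observations, 10 folds and 6 test folds is a little under 800 days.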

We choose n_folds and n_test_folds to obtain around 100 test paths and an average training size of 800 days:

n_folds, n_test_folds = optimal_folds_number(
    n_observations=X_test.shape[0],
    target_n_test_paths=100,
    target_train_size=800,
)

cv = CombinatorialPurgedCV(n_folds=n_folds, n_test_folds=n_test_folds)
cv.summary(X_test)
Number of Observations             1967
Total Number of Folds                10
Number of Test Folds                  6
Purge Size                            0
Embargo Size                          0
Average Training Size               786
Number of Test Paths                126
Number of Training Combinations     210
dtype: int64

pred_1 = cross_val_predict(
    model1,
    X_test,
    cv=cv,
    n_jobs=-1,
    portfolio_params=dict(annualized_factor=252, tag="model1"),
)

pred_2 = cross_val_predict(
    model2,
    X_test,
    cv=cv,
    n_jobs=-1,
    portfolio_params=dict(annualized_factor=252, tag="model2"),
)

The predicted object is a Population of MultiPeriodPortfolio. Each MultiPeriodPortfolio represents one testing path of a rolling portfolio. For improved analysis, we can merge the populations of each model:

population = pred_1 + pred_2

Distribution

We plot the out-of-sample distribution of the Sharpe ratio for both models:

fig = population.plot_distribution(
    measure_list=[RatioMeasure.SHARPE_RATIO], tag_list=["model1", "model2"], n_bins=40
)
show(fig)

Model 1:

print(
    "Average of Sharpe Ratio:"
    f" {pred_1.measures_mean(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
print(
    "Std of Sharpe Ratio:"
    f" {pred_1.measures_std(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
Average of Sharpe Ratio: 0.46
Std of Sharpe Ratio: 0.20

Model 2:

print(
    "Average of Sharpe Ratio:"
    f" {pred_2.measures_mean(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
print(
    "Std of Sharpe Ratio:"
    f" {pred_2.measures_std(measure=RatioMeasure.ANNUALIZED_SHARPE_RATIO):0.2f}"
)
Average of Sharpe Ratio: 0.51
Std of Sharpe Ratio: 0.21
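
For reference, the quantity being averaged across test paths can be sketched as follows. This is a minimal NumPy illustration, assuming a zero risk-free rate, the sample standard deviation, and the annualization factor of 252 used above; skfolio's own conventions may differ in the details. The function annualized_sharpe and the return values are hypothetical:

```python
import numpy as np


def annualized_sharpe(returns, annualized_factor=252):
    """Mean over standard deviation of periodic returns, scaled to one year."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    std = returns.std(ddof=1)  # sample standard deviation (an assumption)
    # The mean scales with the factor and the standard deviation with its
    # square root, so the ratio scales with the square root of the factor.
    return mean / std * np.sqrt(annualized_factor)


# Hypothetical daily returns for one test path.
daily = np.array([0.01, -0.005, 0.002, 0.004, -0.003])
print(f"{annualized_sharpe(daily):0.2f}")
```

Computing this value for each test path and then taking the mean and standard deviation across paths gives summary figures of the kind printed above.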

