Factor Model#
This tutorial shows how to use the FactorModel estimator in the MeanRisk optimization.
A Prior Estimator in skfolio fits a ReturnDistribution containing your pre-optimization inputs (\(\mu\), \(\Sigma\), returns, sample weight, Cholesky decomposition).
The term “prior” is used in a general optimization sense, not confined to Bayesian priors. It denotes any a priori assumption or estimation method for the return distribution before optimization, unifying Frequentist, Bayesian, and Information-theoretic approaches into a single cohesive framework:
- Frequentist:
- Bayesian:
- Information-theoretic:
In skfolio’s API, all such methods share the same interface and adhere to scikit-learn’s estimator API: the fit method accepts X (the asset returns) and stores the resulting ReturnDistribution in its return_distribution_ attribute.
The ReturnDistribution is a dataclass containing:
- mu: Estimated expected returns of shape (n_assets,)
- covariance: Estimated covariance matrix of shape (n_assets, n_assets)
- returns: (Estimated) asset returns of shape (n_observations, n_assets)
- sample_weight: Sample weight for each observation of shape (n_observations,) (optional)
- cholesky: Lower-triangular Cholesky factor of the covariance (optional)
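To make the fields above concrete, the following sketch builds a toy container with the same five fields from a returns array using NumPy. `ToyReturnDistribution` and `fit_empirical` are hypothetical names used for illustration only. This is not skfolio's implementation, and the plain sample mean and sample covariance stand in for whatever estimators a real prior would use.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ToyReturnDistribution:
    """Hypothetical stand-in mirroring the fields listed above."""

    mu: np.ndarray  # (n_assets,)
    covariance: np.ndarray  # (n_assets, n_assets)
    returns: np.ndarray  # (n_observations, n_assets)
    sample_weight: Optional[np.ndarray] = None  # (n_observations,)
    cholesky: Optional[np.ndarray] = None  # lower-triangular factor


def fit_empirical(returns: np.ndarray) -> ToyReturnDistribution:
    """Assumption: plain sample moments, computed column-wise."""
    covariance = np.cov(returns, rowvar=False)
    return ToyReturnDistribution(
        mu=returns.mean(axis=0),
        covariance=covariance,
        returns=returns,
        # L such that L @ L.T reproduces the covariance
        cholesky=np.linalg.cholesky(covariance),
    )


rng = np.random.default_rng(0)
toy_returns = rng.normal(0.0, 0.01, size=(250, 4))  # simulated, not market data
dist = fit_empirical(toy_returns)
print(dist.mu.shape, dist.covariance.shape, dist.cholesky.shape)
```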
The FactorModel estimator estimates the ReturnDistribution by fitting a factor model on asset returns alongside a specified prior estimator for the factor returns.
The purpose of factor models is to impose a structure on financial variables and their covariance matrix by explaining them through a small number of common factors. This can help overcome estimation error by reducing the number of parameters, i.e., the dimensionality of the estimation problem, making portfolio optimization more robust against noise in the data. Factor models also provide a decomposition of financial risk into systematic and security-specific components.
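A minimal NumPy sketch of this idea follows, assuming the standard linear factor model: asset returns equal an intercept plus loadings times factor returns plus uncorrelated residuals. Ordinary least squares stands in for the regression, the data are simulated, and all variable names are illustrative; the exact formulas skfolio applies internally are not taken from this tutorial. With 20 assets and 5 factors, a full covariance matrix has 20 × 21 / 2 = 210 free parameters, while the factor structure needs 100 loadings + 15 factor covariance terms + 20 residual variances = 135.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_assets, n_factors = 500, 20, 5

# Simulated data: factor returns and asset returns driven by them (toy values).
factor_returns = rng.normal(0.0005, 0.01, size=(n_obs, n_factors))
true_loadings = rng.uniform(0.2, 1.2, size=(n_assets, n_factors))
residuals = rng.normal(0.0, 0.005, size=(n_obs, n_assets))
asset_returns = factor_returns @ true_loadings.T + residuals

# Regress every asset on the factors (with an intercept column) by least squares.
design = np.column_stack([np.ones(n_obs), factor_returns])
coef, *_ = np.linalg.lstsq(design, asset_returns, rcond=None)
intercepts, loadings = coef[0], coef[1:].T  # loadings: (n_assets, n_factors)

# Factor moments and residual (security-specific) variances.
factor_mu = factor_returns.mean(axis=0)
factor_cov = np.cov(factor_returns, rowvar=False)
fitted_residuals = asset_returns - design @ coef
specific_var = fitted_residuals.var(axis=0, ddof=1)

# Asset moments implied by the factor structure.
asset_mu = intercepts + loadings @ factor_mu
asset_cov = loadings @ factor_cov @ loadings.T + np.diag(specific_var)

# Parameter counts: full covariance matrix vs. factor structure.
n_full = n_assets * (n_assets + 1) // 2
n_factor_model = n_assets * n_factors + n_factors * (n_factors + 1) // 2 + n_assets
print(n_full, n_factor_model)
```

Because the regression includes an intercept, the implied `asset_mu` coincides with the sample mean of the asset returns in this sketch; the structure only changes the covariance estimate.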
To be fully compatible with scikit-learn, the fit method takes X as the asset returns and y as the factor returns. Note that y is in lowercase even for a 2D array (more than one factor). This is for consistency with the scikit-learn API.
In this tutorial we will build a Maximum Sharpe Ratio portfolio using the FactorModel estimator.
Data#
We load the S&P 500 dataset, composed of the daily prices of 20 assets from the SPX Index composition, and the Factors dataset, composed of the daily prices of 5 ETFs representing common factors:
from plotly.io import show
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from skfolio import Population, RiskMeasure
from skfolio.datasets import load_factors_dataset, load_sp500_dataset
from skfolio.moments import GerberCovariance, ShrunkMu
from skfolio.optimization import MeanRisk, ObjectiveFunction
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior, FactorModel, LoadingMatrixRegression
prices = load_sp500_dataset()
factor_prices = load_factors_dataset()
X, y = prices_to_returns(prices, factor_prices)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)
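For intuition, the two preprocessing steps above can be sketched in NumPy on toy prices. The sketch assumes simple (linear) returns, computed as the price ratio minus one, and mirrors how scikit-learn sizes the test set for a fractional `test_size` when `shuffle=False`: the rows stay in order, so the test set is the most recent block of observations. The price values are made up.

```python
import math

import numpy as np

# Toy daily prices for 2 assets over 7 days (made-up values).
toy_prices = np.array(
    [
        [100.0, 50.0],
        [101.0, 49.5],
        [102.0, 50.5],
        [101.5, 51.0],
        [103.0, 50.0],
        [104.0, 52.0],
        [103.5, 52.5],
    ]
)

# Simple returns: one fewer row than prices.
toy_returns = toy_prices[1:] / toy_prices[:-1] - 1.0

# Chronological split: the last ceil(test_size * n) rows become the test set.
test_size = 0.33
n_test = math.ceil(test_size * len(toy_returns))
n_train = len(toy_returns) - n_test
train, test = toy_returns[:n_train], toy_returns[n_train:]
print(train.shape, test.shape)
```

Keeping the chronological order avoids training on observations that come after the test period.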
Factor Model#
We create a Maximum Sharpe Ratio model using the Factor Model that we fit on the training set:
model_factor_1 = MeanRisk(
    risk_measure=RiskMeasure.VARIANCE,
    objective_function=ObjectiveFunction.MAXIMIZE_RATIO,
    prior_estimator=FactorModel(),
    portfolio_params=dict(name="Factor Model 1"),
)
model_factor_1.fit(X_train, y_train)
model_factor_1.weights_
array([1.03294289e-06, 1.27482685e-03, 4.19682806e-07, 3.34130827e-06,
7.36838290e-07, 1.28824409e-06, 5.13031432e-02, 6.35619183e-02,
6.14804836e-07, 1.79106051e-01, 5.03130911e-02, 7.13734379e-02,
4.13002526e-02, 2.27978407e-01, 5.13348034e-02, 1.44130375e-01,
2.99026118e-07, 6.19737850e-02, 5.63413085e-02, 8.67773201e-07])
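The quantity this model maximizes can be written down directly. The sketch below evaluates the Sharpe ratio of a given weight vector from expected returns and a covariance matrix, assuming a zero risk-free rate unless one is passed. `sharpe_ratio` is a hypothetical helper and the inputs are toy numbers; the optimizer's internal formulation is not shown here.

```python
import numpy as np


def sharpe_ratio(weights, mu, covariance, risk_free_rate=0.0):
    """Expected excess return divided by the portfolio standard deviation."""
    weights = np.asarray(weights, dtype=float)
    expected_return = weights @ np.asarray(mu, dtype=float)
    variance = weights @ np.asarray(covariance, dtype=float) @ weights
    return (expected_return - risk_free_rate) / np.sqrt(variance)


# Toy inputs for two uncorrelated assets.
toy_mu = np.array([0.08, 0.04])
toy_cov = np.array([[0.04, 0.0], [0.0, 0.01]])

equal_weight = sharpe_ratio([0.5, 0.5], toy_mu, toy_cov)
concentrated = sharpe_ratio([1.0, 0.0], toy_mu, toy_cov)
print(equal_weight, concentrated)
```

In this toy case the equally weighted portfolio has the higher ratio, because diversification reduces the standard deviation by more than it reduces the expected return.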
We can change the BaseLoadingMatrix that estimates the loading matrix (betas) of the factors. The default is the LoadingMatrixRegression, which fits the factors using a LassoCV on each asset separately. For example, let’s replace the LassoCV with a RidgeCV without intercept and use parallelization:
model_factor_2 = MeanRisk(
    risk_measure=RiskMeasure.VARIANCE,
    objective_function=ObjectiveFunction.MAXIMIZE_RATIO,
    prior_estimator=FactorModel(
        loading_matrix_estimator=LoadingMatrixRegression(
            linear_regressor=RidgeCV(fit_intercept=False), n_jobs=-1
        )
    ),
    portfolio_params=dict(name="Factor Model 2"),
)
model_factor_2.fit(X_train, y_train)
model_factor_2.weights_
array([3.97758339e-02, 6.57843874e-03, 2.18405141e-02, 8.98258882e-03,
3.16197378e-02, 1.42391168e-02, 8.00124906e-02, 8.32090802e-02,
4.74782930e-02, 8.59470407e-02, 4.59776221e-02, 5.91778878e-02,
8.42236770e-02, 1.05684777e-01, 6.43841778e-02, 7.94729901e-02,
3.76786710e-05, 5.23695742e-02, 4.35215146e-02, 4.54669667e-02])
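To illustrate what fitting each asset separately means, here is a sketch of a ridge regression without intercept written with NumPy's linear algebra. The penalty `alpha` is fixed by hand, which is a simplification: `RidgeCV` chooses it by cross-validation. `ridge_loadings` and the simulated data are illustrative and are not skfolio internals.

```python
import numpy as np


def ridge_loadings(asset_returns, factor_returns, alpha=1e-4):
    """Per-asset ridge fit without intercept; returns (n_assets, n_factors)."""
    n_factors = factor_returns.shape[1]
    gram = factor_returns.T @ factor_returns + alpha * np.eye(n_factors)
    rows = []
    for i in range(asset_returns.shape[1]):
        # Closed-form ridge solution for asset i: (F'F + alpha I)^-1 F'x_i
        rows.append(np.linalg.solve(gram, factor_returns.T @ asset_returns[:, i]))
    return np.vstack(rows)


rng = np.random.default_rng(1)
factors = rng.normal(0.0, 0.01, size=(300, 3))  # simulated factor returns
true_betas = np.array([[1.0, 0.2, 0.0], [0.5, 0.5, 0.5]])
assets = factors @ true_betas.T + rng.normal(0.0, 0.001, size=(300, 2))

betas = ridge_loadings(assets, factors)
print(betas.round(2))
```

Each row of the result holds one asset's betas. Because the assets are fitted independently, the loop is a natural candidate for parallel execution, and a larger `alpha` shrinks every beta toward zero.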
We can also change the prior estimator of the factors. It is used to estimate the ReturnDistribution containing the factors’ expected returns and covariance matrix. For example, let’s estimate the factors’ expected returns with James-Stein shrinkage and the factors’ covariance matrix with the Gerber covariance estimator:
model_factor_3 = MeanRisk(
    risk_measure=RiskMeasure.VARIANCE,
    objective_function=ObjectiveFunction.MAXIMIZE_RATIO,
    prior_estimator=FactorModel(
        factor_prior_estimator=EmpiricalPrior(
            mu_estimator=ShrunkMu(), covariance_estimator=GerberCovariance()
        )
    ),
    portfolio_params=dict(name="Factor Model 3"),
)
model_factor_3.fit(X_train, y_train)
model_factor_3.weights_
array([4.86490688e-07, 4.38230191e-07, 4.24408219e-08, 6.69653310e-08,
5.11878211e-08, 6.14581598e-08, 1.68436387e-02, 2.08439608e-06,
5.27854151e-08, 6.45513596e-02, 6.24004728e-02, 9.61498232e-02,
3.68209826e-01, 2.44692220e-01, 5.86512134e-07, 9.30385157e-06,
2.19939734e-08, 1.47139096e-01, 3.09340377e-07, 5.81316428e-08])
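Shrinkage estimators of the mean pull each sample mean toward a common target to reduce estimation error. The sketch below shows only that general form, with the grand mean as the target and a hand-picked intensity `alpha`. Both choices are assumptions for illustration: the James-Stein estimator derives the intensity from the data, and that formula is deliberately not reproduced here. `shrink_mean` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def shrink_mean(sample_mu, alpha):
    """Convex combination of the sample means and their grand mean."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target = np.full_like(sample_mu, sample_mu.mean())
    return (1.0 - alpha) * sample_mu + alpha * target


# Toy sample means for 5 factors.
toy_factor_mu = np.array([0.0010, 0.0004, -0.0002, 0.0008, 0.0005])
shrunk = shrink_mean(toy_factor_mu, alpha=0.5)
print(shrunk)
```

The shrunk means keep the same average as the sample means but are less dispersed, which makes the optimizer less sensitive to any single extreme estimate.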
Factor Analysis#
Each fitted estimator is saved with a trailing underscore. For example, we can access the fitted prior estimator with:
prior_estimator = model_factor_3.prior_estimator_
We can access the return distribution with:
return_distribution = prior_estimator.return_distribution_
We can access the loading matrix with:
loading_matrix = prior_estimator.loading_matrix_estimator_.loading_matrix_
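With a loading matrix in hand, a portfolio's variance can be split into a systematic part explained by the factors and a security-specific part. The sketch below uses toy inputs and assumes the usual factor covariance structure: loadings times factor covariance times loadings transposed, plus a diagonal of residual variances. All arrays are made up rather than read from the fitted estimator.

```python
import numpy as np

# Toy inputs: 3 assets, 2 factors (made-up numbers).
loadings = np.array([[1.0, 0.2], [0.8, 0.5], [0.3, 1.1]])  # (n_assets, n_factors)
factor_cov = np.array([[0.04, 0.01], [0.01, 0.09]])
specific_var = np.array([0.02, 0.01, 0.03])
weights = np.array([0.5, 0.3, 0.2])

# Portfolio exposure to each factor.
exposure = loadings.T @ weights

systematic = exposure @ factor_cov @ exposure
specific = weights**2 @ specific_var
total = systematic + specific

# Cross-check against the full asset covariance matrix.
asset_cov = loadings @ factor_cov @ loadings.T + np.diag(specific_var)
assert np.isclose(total, weights @ asset_cov @ weights)

print(f"systematic share: {systematic / total:.1%}")
```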
Empirical Model#
For comparison, we also create a Maximum Sharpe Ratio model using the default Empirical estimator:
model_empirical = MeanRisk(
    risk_measure=RiskMeasure.VARIANCE,
    objective_function=ObjectiveFunction.MAXIMIZE_RATIO,
    portfolio_params=dict(name="Empirical"),
)
model_empirical.fit(X_train)
model_empirical.weights_
array([1.01561518e-01, 7.81165193e-02, 6.29030035e-07, 1.89005488e-02,
3.05610118e-07, 1.55770502e-07, 1.10594710e-01, 1.22328443e-06,
1.56471742e-06, 3.39453275e-06, 1.62631058e-01, 1.92171373e-06,
1.77783711e-01, 9.61805760e-02, 4.64493061e-07, 9.68566446e-03,
7.41771350e-08, 2.44533886e-01, 1.83984286e-06, 2.34178300e-07])
Prediction#
We apply each fitted model to the test set:
ptf_factor_1_test = model_factor_1.predict(X_test)
ptf_factor_2_test = model_factor_2.predict(X_test)
ptf_factor_3_test = model_factor_3.predict(X_test)
ptf_empirical_test = model_empirical.predict(X_test)
population = Population(
    [ptf_factor_1_test, ptf_factor_2_test, ptf_factor_3_test, ptf_empirical_test]
)
fig = population.plot_cumulative_returns()
show(fig)
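Predicting on the test set amounts to applying the weights learned on the training set to returns the model has not seen. A sketch with toy numbers follows. It computes both a simple running sum and a compounded curve, since which convention the plot uses is not stated in this tutorial.

```python
import numpy as np

# Weights learned on the training period (toy values summing to 1).
weights = np.array([0.6, 0.4])

# Unseen test-period returns: rows are days, columns are assets.
test_returns = np.array(
    [
        [0.010, -0.005],
        [-0.020, 0.010],
        [0.015, 0.005],
    ]
)

# Daily portfolio returns with the weights held fixed.
portfolio_returns = test_returns @ weights

cumulative_sum = np.cumsum(portfolio_returns)  # non-compounded
cumulative_compounded = np.cumprod(1.0 + portfolio_returns) - 1.0
print(portfolio_returns, cumulative_sum[-1], cumulative_compounded[-1])
```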
Let’s plot the portfolios’ composition:
population.plot_composition()