skfolio.moments.EWCovariance
- class skfolio.moments.EWCovariance(half_life=None, assume_centered=True, min_observations=None, window_size=None, alpha=None, nearest=True, higham=False, higham_max_iteration=100)
Exponentially Weighted Covariance estimator with NaN-aware pairwise updates.
This estimator uses the recursive EWMA formula:
\[\Sigma_t = \lambda \Sigma_{t-1} + (1-\lambda) r_t r_t^\top\]
where \(\lambda\) is the decay factor, which determines how much weight is given to past observations. It is computed from the half-life parameter:
\[\lambda = 2^{-1/\text{half-life}}\]
The half-life is the number of observations for the weight to decay to 50%.
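The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the estimator's implementation: the function name `ewma_covariance` is hypothetical, returns are assumed to be finite and centered, and the state starts at zero as described in the late-listing discussion.

```python
import numpy as np


def ewma_covariance(returns, half_life):
    """Run Sigma_t = lam * Sigma_{t-1} + (1 - lam) * r_t r_t^T over all rows."""
    lam = 2.0 ** (-1.0 / half_life)  # decay factor derived from the half-life
    n_assets = returns.shape[1]
    sigma = np.zeros((n_assets, n_assets))  # assumption: zero initial state
    for r in returns:
        sigma = lam * sigma + (1.0 - lam) * np.outer(r, r)
    return sigma, lam


rng = np.random.default_rng(0)
returns = rng.normal(scale=0.01, size=(500, 3))  # synthetic, finite returns
sigma, lam = ewma_covariance(returns, half_life=40)
print(sigma.shape, round(lam, 3))
```

Each step adds a rank-one positive semi-definite term to a scaled positive semi-definite state, so the running matrix stays symmetric and positive semi-definite.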
This estimator supports both batch fitting via fit and incremental updates via partial_fit, making it suitable for online learning.
NaN handling:
The estimator handles missing data (NaN returns) caused by late listings, delistings, and holidays using EWMA updates together with active_mask. An asset with active_mask=True is treated as active at time \(t\). If its return is finite, the EWMA is updated normally. If its return is NaN, the observation is treated as a holiday and covariance entries involving this asset are kept unchanged. An asset with active_mask=False is treated as inactive, for example during pre-listing or post-delisting periods, and covariance entries involving this asset are set to NaN.
- Active with valid return: Normal EWMA update.
- Active with NaN return (holiday): Freeze; covariance entries involving this asset are kept unchanged.
- Inactive (active_mask=False): Covariance entries involving this asset are set to NaN.
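One way to build an active_mask from a returns array is sketched below. The helper `build_active_mask` is hypothetical and encodes a heuristic: an asset is active between its first and last finite return, so interior NaNs count as holidays and leading or trailing NaNs count as inactive periods. Real listing and delisting dates, when available, are a better source than this heuristic.

```python
import numpy as np


def build_active_mask(returns):
    """Mark each asset active from its first to its last finite return."""
    finite = np.isfinite(returns)
    # True once a finite return has been seen at or before each row.
    seen_before = np.cumsum(finite, axis=0) > 0
    # True while a finite return still exists at or after each row.
    seen_after = np.cumsum(finite[::-1], axis=0)[::-1] > 0
    return seen_before & seen_after


returns = np.array(
    [
        [0.01, np.nan],  # asset 1 not listed yet
        [0.02, 0.03],
        [np.nan, 0.01],  # holiday for asset 0 (still active)
        [0.01, np.nan],  # asset 1 treated as delisted
    ]
)
active_mask = build_active_mask(returns)
print(active_mask)
```

The resulting boolean array has the same shape as the returns and can be passed as the active_mask argument described above.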
When active_mask is not provided, trailing NaN returns are ambiguous: they could correspond either to holidays, in which case covariance is frozen, or to inactive periods, in which case covariance is set to NaN.
Late-listing bias correction:
When an asset becomes active (late listing), the EWMA recursion for its covariance entries is initialized at zero rather than at the outer product of the first return. This initialization guarantees that the internal covariance state remains positive semi-definite at every step, but it introduces a transient downward scale bias: after \(n_i\) observations, the raw EWMA for asset \(i\) is damped by a factor \((1 - \lambda^{n_i})\). At output time, a per-asset correction removes this bias:
\[\hat{\Sigma}_{ij} = \frac{S_{ij}}{\sqrt{(1 - \lambda^{n_i})(1 - \lambda^{n_j})}}\]
where \(S\) is the raw internal EWMA. This is a congruence transform \(D S D\) with \(D = \text{diag}(1 / \sqrt{1 - \lambda^{n_i}})\), which preserves positive semi-definiteness while restoring the correct variance scale. Correlations are unaffected by the correction. For assets with a long history, the correction is negligible (\(\lambda^{n_i} \to 0\)).
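The damping factor and the correction can be checked numerically. The sketch below uses made-up numbers (a constant return, an illustrative raw state `S`, and assumed observation counts `n_obs`); it only illustrates the formulas above and is not taken from the library's code.

```python
import numpy as np

lam = 2.0 ** (-1.0 / 40)  # decay factor for a half-life of 40 observations
r = 0.02  # constant return, so the true second moment is r**2

# Zero-initialized EWMA of r**2 over n observations.
n = 25
s = 0.0
for _ in range(n):
    s = lam * s + (1.0 - lam) * r**2
damping = 1.0 - lam**n  # s equals damping * r**2

# Per-asset correction written as the congruence transform D S D.
S = np.array([[4.0e-4, 1.0e-4], [1.0e-4, 1.0e-4]])  # illustrative raw EWMA
n_obs = np.array([400, 25])  # assumed valid observations per asset
d = 1.0 / np.sqrt(1.0 - lam**n_obs)
sigma_hat = S * np.outer(d, d)  # same as np.diag(d) @ S @ np.diag(d)


def corr(m):
    """Convert a covariance matrix to a correlation matrix."""
    std = np.sqrt(np.diag(m))
    return m / np.outer(std, std)


print(np.allclose(corr(S), corr(sigma_hat)))
```

The asset with only 25 observations is scaled up noticeably, while the one with 400 observations is left almost unchanged.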
The min_observations parameter controls a warm-up period: an asset’s covariance entries remain NaN in the output until it has accumulated enough valid observations for a reliable estimate.
- Parameters:
- half_life : float, default=40
Half-life of the exponential weights in number of observations.
The half-life controls how quickly older observations lose their influence:
Larger half-life: More stable estimates, slower to adapt (robust to noise)
Smaller half-life: More responsive estimates, faster to adapt (sensitive to noise)
The decay factor \(\lambda\) is computed as: \(\lambda = 2^{-1/\text{half-life}}\)
- For example:
half-life = 40: \(\lambda \approx 0.983\)
half-life = 23: \(\lambda \approx 0.970\)
half-life = 11: \(\lambda \approx 0.939\)
half-life = 6: \(\lambda \approx 0.891\)
Note
For portfolio optimization, larger half-lives (>= 20) are generally preferred to avoid excessive turnover from estimation noise.
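The decay factors listed above follow directly from the half-life relation, and the deprecated alpha parameter maps to a half-life through the relation given in its deprecation note. The helper names below are hypothetical and only the standard library is used.

```python
import math


def decay_factor(half_life):
    """lambda = 2 ** (-1 / half_life)."""
    return 2.0 ** (-1.0 / half_life)


def half_life_from_alpha(alpha):
    """Invert alpha = 1 - lambda: half_life = -ln(2) / ln(1 - alpha)."""
    return -math.log(2.0) / math.log(1.0 - alpha)


for hl in (40, 23, 11, 6):
    print(f"half-life = {hl:>2}: lambda = {decay_factor(hl):.3f}")

# Round trip: converting a decay factor to alpha and back recovers the half-life.
alpha = 1.0 - decay_factor(23)
print(round(half_life_from_alpha(alpha), 6))
```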
- assume_centered : bool, default=True
If True (default), the EWMA update uses raw returns without demeaning. This is the standard convention for EWMA covariance estimation in finance. If False, returns are demeaned using an EWMA mean estimate before computing the covariance update, and location_ tracks the EWMA mean.
- min_observations : int, optional
Minimum number of valid observations per asset before its covariance entries are considered reliable and exposed in the output covariance_. Until this threshold is reached, the asset’s covariance entries remain NaN.
The default (None) uses int(half_life) as the threshold, ensuring the late-listing initialization bias has decayed to at most 50%. Set to 1 to disable warm-up entirely.
- window_size : int, optional
Window size to truncate data to the last window_size observations before fitting. Only applies to the initial fit call (or equivalently, the first partial_fit call); subsequent partial_fit calls use all provided data.
This is a computational optimization for very long time series. Due to exponential decay, observations far in the past contribute negligibly to the current estimate. For example, with half-life = 23 (\(\lambda \approx 0.97\)), observations beyond ~150 periods carry only about 1% of the total weight. Truncating to a reasonable window (e.g., 252 trading days) speeds up computation without materially affecting results.
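The weight discarded by truncation can be estimated with a short calculation. The sketch assumes an infinite history, where lag \(j\) has weight \((1-\lambda)\lambda^j\) and the tail beyond \(k\) periods sums to \(\lambda^k\); the helper names are hypothetical.

```python
import math

half_life = 23
lam = 2.0 ** (-1.0 / half_life)


def tail_weight(k):
    """Share of total EWMA weight carried by observations older than k periods."""
    return lam**k


def window_for_tail(tol):
    """Smallest window whose discarded tail weight is at most tol."""
    return math.ceil(math.log(tol) / math.log(lam))


print(f"tail beyond 150 periods: {tail_weight(150):.4f}")
print(f"tail beyond 252 periods: {tail_weight(252):.6f}")
print(f"window for a 1% tail: {window_for_tail(0.01)}")
```

With this half-life, a 252-observation window discards far less weight than a 150-observation one, which is why a one-year window is a comfortable choice for daily data.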
The default (None) uses all available data.
- alpha : float, optional
Deprecated since version 0.17.0: alpha is deprecated and will be removed in a future version. Use half_life instead. Note: alpha = 1 - decay_factor and half_life = -ln(2) / ln(1 - alpha).
- nearest : bool, default=True
If this is set to True, the covariance is replaced by the nearest covariance matrix that is positive definite and has a Cholesky decomposition that can be computed. The variance is left unchanged. A covariance matrix that is not positive definite often occurs in high-dimensional problems. It can be due to multicollinearity, floating-point inaccuracies, or the number of observations being smaller than the number of assets. For more details, see cov_nearest. The default is True.
- higham : bool, default=False
If this is set to True, the Higham (2002) algorithm is used to find the nearest PD covariance; otherwise the eigenvalues are clipped to a threshold above zero (1e-13). The default is False and uses the clipping method, as the Higham algorithm can be slow for large datasets.
- higham_max_iteration : int, default=100
Maximum number of iterations of the Higham (2002) algorithm. The default value is 100.
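The clipping alternative to the Higham algorithm can be illustrated as follows. This is a generic eigenvalue-clipping sketch under the assumption that clipping is applied to the correlation matrix and the variances are then restored; it is not taken from cov_nearest, and the function name is hypothetical. The demo uses a larger threshold than 1e-13 so that the result is visibly positive definite despite floating-point noise.

```python
import numpy as np


def clip_to_positive_definite(cov, threshold=1e-13):
    """Clip correlation eigenvalues from below, keeping variances unchanged."""
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    eigvals, eigvecs = np.linalg.eigh(corr)
    clipped = np.clip(eigvals, threshold, None)
    corr_pd = (eigvecs * clipped) @ eigvecs.T
    # Rescale to a unit diagonal so the original variances are preserved.
    scale = np.sqrt(np.diag(corr_pd))
    corr_pd = corr_pd / np.outer(scale, scale)
    return corr_pd * np.outer(std, std)


# Symmetric but indefinite: the three pairwise correlations are inconsistent.
cov = np.array([[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]])
fixed = clip_to_positive_definite(cov, threshold=1e-8)
print(np.linalg.eigvalsh(cov).min() < 0, np.linalg.eigvalsh(fixed).min() > 0)
```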
- Attributes:
- covariance_ : ndarray of shape (n_assets, n_assets)
Estimated covariance. Contains NaN for assets that are inactive or have not yet accumulated min_observations valid observations.
- location_ : ndarray of shape (n_assets,)
Estimated location (mean). If assume_centered=True, this is zeros. Otherwise, it tracks the EWMA mean of returns. Contains NaN for inactive assets.
- n_features_in_ : int
Number of assets seen during fit.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
Methods
- fit(X[, y, active_mask]): Fit the Exponentially Weighted Covariance estimator.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- mahalanobis(X_test): Compute the squared Mahalanobis distance of observations.
- partial_fit(X[, y, active_mask]): Incrementally fit the Exponentially Weighted Covariance estimator.
- score(X_test[, y]): Compute the mean log-likelihood of observations under the estimated model.
- set_fit_request(*[, active_mask]): Configure whether metadata should be requested to be passed to the fit method.
- set_params(**params): Set the parameters of this estimator.
- set_partial_fit_request(*[, active_mask]): Configure whether metadata should be requested to be passed to the partial_fit method.
- set_score_request(*[, X_test]): Configure whether metadata should be requested to be passed to the score method.
Examples
>>> import numpy as np
>>> from skfolio.datasets import load_sp500_dataset
>>> from skfolio.moments import EWCovariance
>>> from skfolio.preprocessing import prices_to_returns
>>>
>>> prices = load_sp500_dataset()
>>> X = prices_to_returns(prices)
>>>
>>> # Batch fitting
>>> model = EWCovariance(half_life=40)
>>> model.fit(X)
>>> print(model.covariance_.shape)
>>>
>>> # Streaming updates with partial_fit
>>> model2 = EWCovariance(half_life=20)
>>> model2.partial_fit(X[:100])  # Initial fit
>>> model2.partial_fit(X[100:200])  # Update with new data
>>> model2.partial_fit(X[200:])  # Continue updating
>>>
>>> # NaN-aware fitting with active_mask
>>> # Asset 2 is listed starting from observation 50
>>> active_mask = np.ones(X.shape, dtype=bool)
>>> active_mask[:50, 2] = False
>>> X_nan = X.copy()
>>> # X is a DataFrame here, so positional assignment goes through .iloc
>>> X_nan.iloc[:50, 2] = np.nan
>>> model3 = EWCovariance(half_life=40)
>>> model3.fit(X_nan, active_mask=active_mask)
- fit(X, y=None, *, active_mask=None)
Fit the Exponentially Weighted Covariance estimator.
- Parameters:
- X : array-like of shape (n_observations, n_assets)
Price returns of the assets. May contain NaN for missing data (holidays, late listings, delistings).
- y : Ignored
Not used, present for API consistency by convention.
- active_mask : array-like of shape (n_observations, n_assets), optional
Boolean mask indicating whether each asset is structurally active at each observation. Use this to distinguish between holidays (active_mask=True and NaN return: covariance is frozen) and inactive periods such as pre-listing or post-delisting (active_mask=False: covariance is set to NaN). If None (default), all assets are assumed active.
- Returns:
- self : EWCovariance
Fitted estimator.
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- mahalanobis(X_test)
Compute the squared Mahalanobis distance of observations.
The squared Mahalanobis distance of an observation \(r\) is defined as:
\[d^2 = (r - \mu)^T \Sigma^{-1} (r - \mu)\]
where \(\Sigma\) is the estimated covariance matrix (self.covariance_) and \(\mu\) is the estimated mean (self.location_ if available, otherwise zero).
This distance measure accounts for correlations between assets and is useful for:
- Outlier detection in portfolio returns
- Risk-adjusted distance calculations
- Identifying unusual market regimes
- Parameters:
- X_test : array-like of shape (n_observations, n_assets) or (n_assets,)
Observations for which to compute the squared Mahalanobis distance. Each row represents one observation. If 1D, treated as a single observation. Assets with non-finite fitted variance are excluded from inference. Inside the retained inference subspace, the observations must be finite.
- Returns:
- distances : ndarray of shape (n_observations,) or float
Squared Mahalanobis distance for each observation. Returns a scalar if input is 1D.
Examples
>>> import numpy as np
>>> from skfolio.moments import EmpiricalCovariance
>>> X = np.random.randn(100, 3)
>>> model = EmpiricalCovariance()
>>> model.fit(X)
>>> distances = model.mahalanobis(X)
>>> # Distances follow approximately chi-squared distribution with n_assets DoF
>>> print(f"Mean distance: {distances.mean():.2f}, Expected: {3:.2f}")
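The distance formula and the exclusion of assets with non-finite variance can be sketched with plain NumPy. The function `squared_mahalanobis` below is a hypothetical stand-in that follows the description above; the actual method may differ in details such as validation and return type.

```python
import numpy as np


def squared_mahalanobis(X_test, covariance, location=None):
    """d^2 = (r - mu)^T Sigma^{-1} (r - mu) on assets with finite variance."""
    X_test = np.atleast_2d(np.asarray(X_test, dtype=float))
    keep = np.isfinite(np.diag(covariance))  # drop assets with NaN variance
    cov = covariance[np.ix_(keep, keep)]
    if location is None:
        mu = np.zeros(int(keep.sum()))
    else:
        mu = np.asarray(location, dtype=float)[keep]
    diff = X_test[:, keep] - mu
    solved = np.linalg.solve(cov, diff.T).T  # Sigma^{-1} (r - mu), row by row
    return np.einsum("ij,ij->i", diff, solved)


covariance = np.array(
    [[4.0, 0.0, np.nan], [0.0, 1.0, np.nan], [np.nan, np.nan, np.nan]]
)
d2 = squared_mahalanobis([[2.0, 1.0, 0.5]], covariance)
print(d2)
```

The third asset has no usable variance, so only the first two contribute: \(2^2/4 + 1^2/1 = 2\).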
- partial_fit(X, y=None, *, active_mask=None)
Incrementally fit the Exponentially Weighted Covariance estimator.
This method allows for streaming/online updates to the covariance estimate. Each call updates the internal state with new observations.
- Parameters:
- X : array-like of shape (n_observations, n_assets)
Price returns of the assets. May contain NaN for missing data (holidays, late listings, delistings).
- y : Ignored
Not used, present for API consistency by convention.
- active_mask : array-like of shape (n_observations, n_assets), optional
Boolean mask indicating whether each asset is structurally active at each observation. Use this to distinguish between holidays (active_mask=True and NaN return: covariance is frozen) and inactive periods such as pre-listing or post-delisting (active_mask=False: covariance is set to NaN). If None (default), all assets are assumed active.
- Returns:
- self : EWCovariance
Fitted estimator.
- score(X_test, y=None)
Compute the mean log-likelihood of observations under the estimated model.
Evaluates how well the fitted covariance matrix explains new observations, assuming a multivariate Gaussian distribution. This is useful for:
- Model selection (comparing different covariance estimators)
- Cross-validation of covariance estimation methods
- Assessing goodness-of-fit
The log-likelihood for a single observation \(r\) is:
\[\log p(r | \mu, \Sigma) = -\frac{1}{2} \left[ n \log(2\pi) + \log|\Sigma| + (r - \mu)^T \Sigma^{-1} (r - \mu) \right]\]
where \(n\) is the number of assets, \(\Sigma\) is the estimated covariance matrix (self.covariance_), and \(\mu\) is the estimated mean (self.location_ if available, otherwise zero).
- Parameters:
- X_test : array-like of shape (n_observations, n_assets)
Observations for which to compute the log-likelihood. Typically held-out test data not used during fitting. Assets with non-finite fitted variance are excluded from inference. This typically happens when the fitted covariance cannot be estimated for an asset, for example before listing, after delisting, or during a warmup period. After this asset-level filtering, each row of X_test is scored using the remaining available values only. This covers row-level missing values in X_test, such as market holidays or pre/post-listing.
- y : Ignored
Not used, present for scikit-learn API consistency.
- Returns:
- score : float
Mean log-likelihood of the observations. Higher values indicate better fit. The score is averaged over all observations.
Examples
>>> import numpy as np
>>> from skfolio.moments import EmpiricalCovariance, LedoitWolf
>>> X_train = np.random.randn(100, 5)
>>> X_test = np.random.randn(50, 5)
>>> emp = EmpiricalCovariance().fit(X_train)
>>> lw = LedoitWolf().fit(X_train)
>>> # Compare models on held-out data
>>> print(f"Empirical: {emp.score(X_test):.2f}")
>>> print(f"LedoitWolf: {lw.score(X_test):.2f}")
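The log-likelihood formula can be written out directly for the simple case where every value is finite. The function below is a hypothetical illustration that leaves out the asset-level and row-level NaN filtering described above.

```python
import numpy as np


def mean_gaussian_log_likelihood(X_test, covariance, location=None):
    """Average Gaussian log-likelihood of the rows of X_test (finite data only)."""
    X_test = np.atleast_2d(np.asarray(X_test, dtype=float))
    n = covariance.shape[0]
    mu = np.zeros(n) if location is None else np.asarray(location, dtype=float)
    diff = X_test - mu
    _, logdet = np.linalg.slogdet(covariance)  # log|Sigma|, numerically stable
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(covariance, diff.T).T)
    per_row = -0.5 * (n * np.log(2.0 * np.pi) + logdet + maha)
    return float(per_row.mean())


# Identity covariance and an observation at the mean: only the constant remains.
ll = mean_gaussian_log_likelihood([[0.0, 0.0]], np.eye(2))
print(ll, -np.log(2.0 * np.pi))
```

Using slogdet and solve avoids forming the determinant or the inverse explicitly, which matters when the covariance is close to singular.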
- set_fit_request(*, active_mask='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- active_mask : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for active_mask parameter in fit.
- Returns:
- self : object
The updated object.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
- set_partial_fit_request(*, active_mask='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the partial_fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- active_mask : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for active_mask parameter in partial_fit.
- Returns:
- self : object
The updated object.
- set_score_request(*, X_test='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- X_test : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for X_test parameter in score.
- Returns:
- self : object
The updated object.