skfolio.preprocessing.CSStandardScaler

class skfolio.preprocessing.CSStandardScaler(*, min_group_size=8, atol=1e-12)

Cross-sectional standardization.

Standardizes the finite values within each observation’s cross-section so that, over the estimation universe, they have weighted mean zero and unit equal-weighted standard deviation.

When cs_weights is provided, it plays two roles: it defines the estimation universe (the assets with cs_weights > 0) and it weights the cross-sectional mean. The standard deviation remains unbiased and equal-weighted over that universe. Assets outside the universe still receive standardized values, computed from the mean and standard deviation of the estimation universe.

NaNs are treated as missing values. They are ignored when computing cross-sectional statistics and are preserved in the output.

When cs_groups is None, standardization is performed globally within each observation. For observation $t$, the standardized value $z_{t,i}$ of asset $i$ is defined by:

\[z_{t,i} = \frac{x_{t,i} - \mu_t}{\sigma_t}\]

where $\mu_t$ is the weighted mean, $\sigma_t$ is the unbiased equal-weighted standard deviation, $w_{t,i}$ is the weight of asset $i$ (taken from cs_weights, or 1 when cs_weights is None), $\mathcal{E}_t$ is the estimation universe, and $N_{\mathcal{E}_t}$ is its number of assets:

\[\mu_t = \frac{\sum_{i \in \mathcal{E}_t} w_{t,i} x_{t,i}} {\sum_{i \in \mathcal{E}_t} w_{t,i}}, \quad \sigma_t = \sqrt{\frac{1}{N_{\mathcal{E}_t} - 1} \sum_{i \in \mathcal{E}_t} (x_{t,i} - \mu_t)^2}\]
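The two formulas above can be sketched in plain NumPy. This is a minimal illustration rather than the library's implementation: the function name `cs_standardize` is hypothetical, input validation and the `atol` handling are omitted, and it covers only the global case (no cs_groups).

```python
import numpy as np


def cs_standardize(X, cs_weights=None):
    """Row-wise z-scores for the global case (hypothetical helper, not skfolio API)."""
    X = np.asarray(X, dtype=float)
    finite = np.isfinite(X)
    if cs_weights is None:
        # No weights: every finite asset is in the universe with unit weight.
        W = finite.astype(float)
    else:
        # Estimation universe: finite values with strictly positive weight.
        cs_weights = np.asarray(cs_weights, dtype=float)
        W = np.where(finite & (cs_weights > 0), cs_weights, 0.0)
    in_universe = W > 0

    X0 = np.where(finite, X, 0.0)  # zero-fill NaNs so the sums stay finite
    mu = (W * X0).sum(axis=1, keepdims=True) / W.sum(axis=1, keepdims=True)

    # Unbiased, equal-weighted dispersion around the weighted mean.
    n = in_universe.sum(axis=1, keepdims=True)
    sq_dev = np.where(in_universe, (X0 - mu) ** 2, 0.0)
    sigma = np.sqrt(sq_dev.sum(axis=1, keepdims=True) / (n - 1))

    # NaNs in X propagate to the output; zero-weight assets still get a score.
    return (X - mu) / sigma
```

With no weights this reduces to the ordinary sample z-score of each row's finite values. With weights, only the mean changes; the divisor stays $N_{\mathcal{E}_t} - 1$.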

When cs_groups is provided, the same centering and scaling scheme is first applied within each group. Groups with fewer than min_group_size estimation assets, and missing groups (cs_groups == -1), fall back to global cross-sectional statistics. The grouped result is then globally recentered to weighted mean zero and globally rescaled to unit equal-weighted standard deviation over the estimation universe.
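The grouped procedure can be read as three steps: compute global scores as the fallback, overwrite them with within-group scores where the group is large enough, then recenter and rescale over the whole estimation universe. The sketch below follows that reading for a single observation. The names `_stats` and `grouped_zscore_row` are hypothetical, and validation and the `atol` zero-dispersion handling are left out, so treat it as an interpretation of the description rather than the library's code.

```python
import numpy as np


def _stats(x, w, mask):
    """Weighted mean and unbiased equal-weighted std over the assets in `mask`."""
    mu = np.sum(w[mask] * x[mask]) / np.sum(w[mask])
    sigma = np.sqrt(np.sum((x[mask] - mu) ** 2) / (mask.sum() - 1))
    return mu, sigma


def grouped_zscore_row(x, w, groups, min_group_size=8):
    """Standardize one observation within groups, then globally (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    groups = np.asarray(groups)
    universe = np.isfinite(x) & (w > 0)

    # Step 1: global statistics give the fallback score for every asset.
    mu, sigma = _stats(x, w, universe)
    z = (x - mu) / sigma

    # Step 2: overwrite with within-group scores for sufficiently large groups.
    for g in np.unique(groups):
        members = groups == g
        est = members & universe
        if g == -1 or est.sum() < min_group_size:
            continue  # missing or too-small group: keep the global score
        mu_g, sigma_g = _stats(x, w, est)
        z[members] = (x[members] - mu_g) / sigma_g

    # Step 3: recenter and rescale the grouped scores over the whole universe.
    mu_z, sigma_z = _stats(z, w, universe)
    return (z - mu_z) / sigma_z
```

Note that a group's size is counted on estimation assets only, so a group with two members can still fall back to global statistics when one of them is NaN or has zero weight.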

This transformer is stateless.

Parameters:

min_group_size : int, default=8: Minimum number of estimation assets required in a group. Smaller groups fall back to global cross-sectional statistics.
atol : float, default=1e-12: Absolute tolerance below which the cross-sectional standard deviation is treated as zero. When cs_groups is None, this means that the observation has no measurable cross-sectional dispersion on its estimation universe, so finite outputs are set to zero rather than NaN and the row is treated as a neutral exposure. When cs_groups is provided, the same convention applies to the within-group standardization step and to the final global rescaling step.
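The `atol` convention can be illustrated for a single unweighted row. This is a sketch under assumptions: `zscore_row_with_atol` is a hypothetical name, and whether the library compares with `<` or `<=` is not stated in the text, so the `<=` below is a guess that only matters when `atol` is exactly zero.

```python
import numpy as np


def zscore_row_with_atol(x, atol=1e-12):
    """Unweighted z-score of one row with a zero-dispersion guard (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    finite = np.isfinite(x)
    mu = x[finite].mean()
    sigma = x[finite].std(ddof=1)  # unbiased standard deviation
    if sigma <= atol:  # assumed comparison; the documented rule is "below atol"
        # No measurable dispersion: finite entries become 0, NaNs are preserved.
        return np.where(finite, 0.0, np.nan)
    return (x - mu) / sigma


flat_row = zscore_row_with_atol([5.0, 5.0, np.nan, 5.0])
```

Here `flat_row` holds zeros for the three finite entries and NaN for the missing one, instead of the NaNs that a division by zero would produce.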

Methods

`fit`(X[, y, cs_weights, cs_groups])	Fit the transformer.
`fit_transform`(X[, y, cs_weights, cs_groups])	Fit to `X` and return the transformed values.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`set_fit_request`(*[, cs_groups, cs_weights])	Configure whether metadata should be requested to be passed to the `fit` method.
`set_params`(**params)	Set the parameters of this estimator.
`set_transform_request`(*[, cs_groups, cs_weights])	Configure whether metadata should be requested to be passed to the `transform` method.
`transform`(X[, cs_weights, cs_groups])	Standardize each observation into cross-sectional z-scores.

See also

CSPercentileRankScaler
CSGaussianRankScaler

Examples

>>> import numpy as np
>>> from skfolio.preprocessing import CSStandardScaler
>>>
>>> X = np.array([[1.0, np.nan, 3.0, 4.0],
...               [4.0, 3.0, 2.0, 1.0],
...               [10.0, 20.0, np.nan, 40.0]])
>>>
>>> transformer = CSStandardScaler()
>>> transformer.fit_transform(X)
array([[-1.09108945,         nan,  0.21821789,  0.87287156],
       [ 1.161895  ,  0.38729833, -0.38729833, -1.161895  ],
       [-0.87287156, -0.21821789,         nan,  1.09108945]])
>>>
>>> # Use cs_weights for the estimation universe and weighted means, then standardize within groups.
>>> cs_weights = np.array([[3.0, 0.0, 1.0, 2.0],
...                        [4.0, 0.0, 2.0, 3.0],
...                        [2.0, 3.0, 0.0, 5.0]])
>>> cs_groups = np.array([[0, 0, 1, 1],
...                       [0, 0, 1, 1],
...                       [0, 0, 1, 1]])
>>>
>>> transformer = CSStandardScaler(min_group_size=2)
>>> transformer.fit_transform(X, cs_weights=cs_weights, cs_groups=cs_groups)
array([[-0.55454325,         nan, -0.62182063,  1.1427252 ],
       [ 0.62254586, -0.15324206,  0.5035012 , -1.16572861],
       [-1.33736075,  0.20821245,         nan,  0.41001683]])

fit(X, y=None, cs_weights=None, cs_groups=None)

Fit the transformer.

Cross-sectional transformers are stateless and do not learn data-dependent parameters. This method validates the estimator parameters, validates X, and records n_features_in_ for scikit-learn compatibility.

Parameters:

X : array-like of shape (n_observations, n_assets): Input matrix where each row is an observation and each column is an asset.
y : Ignored: Not used, present for API consistency by convention.
cs_weights : array-like of shape (n_observations, n_assets), optional: Optional cross-sectional weights accepted for API consistency with transform. They are ignored during fitting.
cs_groups : array-like of shape (n_observations, n_assets), optional: Optional cross-sectional group labels accepted for API consistency with transform. They are ignored during fitting.

Returns:

self : BaseCSTransformer: Fitted estimator.
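Statelessness means that every statistic is recomputed from the row being transformed, so nothing learned from one set of observations carries over to another. The NumPy sketch below illustrates this for the unweighted, ungrouped case. `zscore_rows` is a hypothetical stand-in for the transformer, not part of skfolio.

```python
import numpy as np


def zscore_rows(X):
    """Unweighted cross-sectional z-scores, ignoring NaNs row by row (hypothetical helper)."""
    mu = np.nanmean(X, axis=1, keepdims=True)
    sigma = np.nanstd(X, axis=1, ddof=1, keepdims=True)
    return (X - mu) / sigma


X = np.array([[1.0, np.nan, 3.0, 4.0],
              [4.0, 3.0, 2.0, 1.0],
              [10.0, 20.0, np.nan, 40.0]])

full = zscore_rows(X)
single = zscore_rows(X[[1]])  # transform the second observation on its own

# The scores of a row do not depend on which other rows were present.
same = np.allclose(full[[1]], single)
```

This is why fitting on one sample and transforming another gives the same result as transforming the second sample directly.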

fit_transform(X, y=None, cs_weights=None, cs_groups=None)

Fit to X and return the transformed values.

Parameters:

X : array-like of shape (n_observations, n_assets): Input matrix where each row is an observation and each column is an asset.
y : Ignored: Not used, present for API consistency by convention.
cs_weights : array-like of shape (n_observations, n_assets), optional: Optional cross-sectional weights forwarded to transform.
cs_groups : array-like of shape (n_observations, n_assets), optional: Optional cross-sectional group labels forwarded to transform.

Returns:

X_new : ndarray of shape (n_observations, n_assets): Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features : array-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out : ndarray of str objects: Same as input features.
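The name-resolution rules described for input_features can be written out as a small pure-Python function. This is only a sketch of the documented rules: `resolve_feature_names` and its error message are hypothetical, and the real method returns an ndarray rather than a list.

```python
def resolve_feature_names(n_features_in, feature_names_in=None, input_features=None):
    """Return output feature names following the documented rules (hypothetical helper)."""
    if input_features is None:
        if feature_names_in is not None:
            # Names seen during fit take precedence.
            return list(feature_names_in)
        # Otherwise generate "x0", "x1", ..., "x(n_features_in - 1)".
        return [f"x{i}" for i in range(n_features_in)]

    input_features = list(input_features)
    if feature_names_in is not None and input_features != list(feature_names_in):
        raise ValueError("input_features must match feature_names_in_.")
    return input_features


generated = resolve_feature_names(3)
```

For a transformer fitted on a plain array with three assets, `generated` is the list of placeholder names `x0`, `x1`, `x2`.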

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict: Parameter names mapped to their values.

set_fit_request(*, cs_groups='$UNCHANGED$', cs_weights='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

cs_groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for the cs_groups parameter in fit.
cs_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for the cs_weights parameter in fit.

Returns:

self : object: The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict: Estimator parameters.

Returns:

self : estimator instance: Estimator instance.

set_transform_request(*, cs_groups='$UNCHANGED$', cs_weights='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

cs_groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for the cs_groups parameter in transform.
cs_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for the cs_weights parameter in transform.

Returns:

self : object: The updated object.

transform(X, cs_weights=None, cs_groups=None)

Standardize each observation into cross-sectional z-scores.

Parameters:

X : array-like of shape (n_observations, n_assets): Input matrix where each row represents an observation and each column represents an asset. NaNs are allowed and preserved.
cs_weights : array-like of shape (n_observations, n_assets), optional: Optional non-negative cross-sectional weights. Positive weights define the estimation universe and are used to compute weighted means. The standard deviation remains equal-weighted over the selected assets. If None, all finite assets are included in the estimation universe with unit weight.
cs_groups : array-like of shape (n_observations, n_assets), optional: Integer group labels >= -1. Missing groups (-1) and groups with fewer than min_group_size estimation assets fall back to global cross-sectional statistics. If None, standardization is performed globally within each observation.

Returns:

Z : ndarray of shape (n_observations, n_assets): Standardized values with weighted mean zero and unit equal-weighted standard deviation over the estimation universe.

Raises:

ValueError: If min_group_size < 1, atol < 0, X is not a non-empty 2D array, cs_weights is invalid, cs_groups is invalid, or any observation has no estimation asset.
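Some of the conditions listed above can be checked before calling transform. The sketch below is illustrative only: `check_estimation_universe` is a hypothetical helper, the error messages are invented, and treating a shape mismatch or a negative weight as "invalid" cs_weights is an assumption about what the library checks.

```python
import numpy as np


def check_estimation_universe(X, cs_weights=None):
    """Return the estimation-universe mask, raising if any row is empty (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.size == 0:
        raise ValueError("X must be a non-empty 2D array.")

    universe = np.isfinite(X)
    if cs_weights is not None:
        cs_weights = np.asarray(cs_weights, dtype=float)
        # Assumed validity rules: same shape as X and no negative weights.
        if cs_weights.shape != X.shape or np.any(cs_weights < 0):
            raise ValueError("cs_weights must be non-negative and match the shape of X.")
        universe &= cs_weights > 0

    empty_rows = np.flatnonzero(~universe.any(axis=1))
    if empty_rows.size:
        raise ValueError(f"No estimation asset in observations {empty_rows.tolist()}.")
    return universe
```

A row made entirely of NaNs, or one whose weights are all zero, has an empty estimation universe and triggers the last check.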