skfolio.preprocessing.CSGaussianRankScaler

class skfolio.preprocessing.CSGaussianRankScaler(*, min_group_size=8, scale=True, atol=1e-12)

Cross-sectional rank Gaussianization.

Computes percentile ranks within each cross-section (see CSPercentileRankScaler), maps them through the inverse standard normal CDF \(\Phi^{-1}\), and recenters to weighted mean zero over the estimation universe. When scale=True, the result is also rescaled to unit equal-weighted standard deviation.

When cs_weights is provided, the estimation universe is the set of assets with cs_weights > 0. Assets outside that universe still receive Gaussianized scores relative to it. The weights serve two purposes only: they define the estimation universe and they weight the final recentering. Ranking itself remains equal-weighted over the selected assets.

NaNs are treated as missing values. They are ignored when computing cross-sectional ranks and are preserved in the output.

For observation \(t\), the Gaussianized value of asset \(i\) is:

\[z_{t,i} = \frac{\Phi^{-1}(p_{t,i}) - \mu_t}{\sigma_t}\]

where \(p_{t,i}\) is the percentile rank, \(\mu_t\) is the weighted mean of \(\Phi^{-1}(p_{t,\cdot})\) over the estimation universe, and \(\sigma_t\) is the unbiased equal-weighted standard deviation of \(\Phi^{-1}(p_{t,\cdot})\) over the same universe. When scale=False, the division by \(\sigma_t\) is skipped and only the weighted recentering is applied, so \(z_{t,i} = \Phi^{-1}(p_{t,i}) - \mu_t\).
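
The formula can be sketched for a single cross-section with the Python standard library alone. This is an illustrative sketch, not the library's implementation: the helper name gaussianize_row is hypothetical, equal weights are assumed (so \(\mu_t\) is a plain mean), ties are not handled, and the percentile rank is assumed to be (rank - 0.5) / n. That rank convention is defined by CSPercentileRankScaler rather than on this page; it is used here because it is consistent with the values shown in the Examples section.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, provides the inverse CDF


def gaussianize_row(row, scale=True):
    """Gaussianize one cross-section (hypothetical helper, equal weights)."""
    # NaNs are ignored when ranking and preserved in the output.
    finite = [i for i, x in enumerate(row) if not math.isnan(x)]
    n = len(finite)

    scores = {}
    for i in finite:
        # Rank starts at 1; ties are not handled in this sketch.
        rank = 1 + sum(row[j] < row[i] for j in finite)
        # ASSUMPTION: percentile rank p = (rank - 0.5) / n.
        scores[i] = _STD_NORMAL.inv_cdf((rank - 0.5) / n)

    # Equal weights, so the weighted mean reduces to the plain mean.
    mu = sum(scores.values()) / n
    centered = {i: s - mu for i, s in scores.items()}

    if scale:
        # Unbiased (n - 1) standard deviation; needs at least two finite values.
        sigma = math.sqrt(sum(c * c for c in centered.values()) / (n - 1))
        centered = {i: c / sigma for i, c in centered.items()}

    return [centered.get(i, math.nan) for i in range(len(row))]
```

Under these assumptions, gaussianize_row([4.0, 3.0, 2.0, 1.0]) agrees with the second row of the first example in the Examples section to the printed precision.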

When cs_groups is provided, percentile ranks are computed within each group rather than over the whole cross-section. Groups with fewer than min_group_size estimation assets, and assets with a missing group (cs_groups == -1), are ranked against the global cross-section instead. Recentering and rescaling are always performed over the full cross-section, not within groups.
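
The interaction between groups, weights, and the global fallback can be sketched for one cross-section as follows. This is a rough illustration under stated assumptions, not the library's code: gaussianize_grouped_row is a hypothetical helper, the percentile rank is assumed to be (rank - 0.5) / n, ties are not handled, and \(\sigma_t\) is assumed to use deviations around the weighted mean with an n - 1 divisor. The last assumption is inferred from the second example in the Examples section and is not stated on this page. Finite assets outside the estimation universe are left as NaN in this sketch, whereas the transformer scores them relative to the universe.

```python
import math
from statistics import NormalDist


def _raw_score(value, peers):
    """Inverse-normal score of value among peers (ASSUMED p = (rank - 0.5) / n)."""
    rank = 1 + sum(p < value for p in peers)
    return NormalDist().inv_cdf((rank - 0.5) / len(peers))


def gaussianize_grouped_row(row, weights, groups, min_group_size=8):
    """One cross-section: rank within groups, recenter and rescale globally."""
    n_assets = len(row)
    # Estimation universe: finite value and strictly positive weight.
    universe = [i for i in range(n_assets)
                if not math.isnan(row[i]) and weights[i] > 0]

    # Collect the estimation assets of each group.
    members = {}
    for i in universe:
        members.setdefault(groups[i], []).append(i)

    raw = {}
    for i in universe:
        peers = members[groups[i]]
        if groups[i] == -1 or len(peers) < min_group_size:
            peers = universe  # fall back to the global cross-section
        raw[i] = _raw_score(row[i], [row[j] for j in peers])

    # Weighted recentering over the full estimation universe.
    total = sum(weights[i] for i in universe)
    mu = sum(weights[i] * raw[i] for i in universe) / total
    # ASSUMPTION: equal-weighted deviations around the weighted mean, n - 1 divisor.
    sigma = math.sqrt(
        sum((raw[i] - mu) ** 2 for i in universe) / (len(universe) - 1)
    )
    # Assets outside the universe are left as NaN in this sketch.
    return [(raw[i] - mu) / sigma if i in raw else math.nan
            for i in range(n_assets)]
```

Called with the third row of the second example (values [10.0, 20.0, nan, 40.0], weights [2.0, 3.0, 0.0, 5.0], groups [0, 0, 1, 1], min_group_size=2), group 1 holds a single estimation asset and therefore falls back to the global ranking, while group 0 is ranked on its own.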

This transformer is stateless.

Parameters:
min_group_size : int, default=8

Minimum number of estimation assets required in a group. Smaller groups fall back to the global cross-section.

scale : bool, default=True

If True, rescale final exposures to unit equal-weighted standard deviation over the estimation universe. If False, only weighted recentering is applied. Set scale=False when feeding the output to a scale-invariant downstream model (e.g. gradient-boosted trees) to avoid injecting per-cross-section noise from the unbiased standard-deviation estimate.
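
A small sketch of why the rescaling factor differs between cross-sections: even with untied, equally weighted assets, the unbiased standard deviation of the inverse-normal scores depends on how many assets are present. The helpers raw_scores and sigma_for are hypothetical, and the percentile rank is assumed to be (rank - 0.5) / n, which is not stated on this page.

```python
from statistics import NormalDist, stdev


def raw_scores(n):
    """Inverse-normal scores of n untied assets (ASSUMED p = (rank - 0.5) / n)."""
    return [NormalDist().inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]


def sigma_for(n):
    """Unbiased (n - 1) standard deviation of the raw scores."""
    return stdev(raw_scores(n))


# The same raw score is divided by a different sigma in each cross-section size.
for n in (3, 4, 10, 100):
    top = raw_scores(n)[-1]
    print(f"n={n:>3}  sigma={sigma_for(n):.4f}  top raw={top:.4f}  top scaled={top / sigma_for(n):.4f}")
```

With scale=False the division is skipped, so the ordering within each cross-section is unchanged and only the magnitude of the scores differs.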

atol : float, default=1e-12

Absolute tolerance used to guard against division by a near-zero equal-weighted standard deviation. Must be finite and non-negative.

Methods

fit(X[, y, cs_weights, cs_groups])

Fit the transformer.

fit_transform(X[, y, cs_weights, cs_groups])

Fit to X and return the transformed values.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, cs_groups, cs_weights])

Configure whether metadata should be requested to be passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

set_transform_request(*[, cs_groups, cs_weights])

Configure whether metadata should be requested to be passed to the transform method.

transform(X[, cs_weights, cs_groups])

Transform values into cross-sectional Gaussianized exposures.

Examples

>>> import numpy as np
>>> from skfolio.preprocessing import CSGaussianRankScaler
>>>
>>> X = np.array([[1.0, np.nan, 3.0, 4.0],
...               [4.0, 3.0, 2.0, 1.0],
...               [10.0, 20.0, np.nan, 40.0]])
>>>
>>> transformer = CSGaussianRankScaler()
>>> transformer.fit_transform(X)
array([[-1.        ,         nan,  0.        ,  1.        ],
       [ 1.180302  ,  0.32693605, -0.32693605, -1.180302  ],
       [-1.        ,  0.        ,         nan,  1.        ]])
>>>
>>> # Use cs_weights for the estimation universe and weighted recentering, and rank within groups.
>>> cs_weights = np.array([[3.0, 0.0, 1.0, 2.0],
...                        [4.0, 0.0, 2.0, 3.0],
...                        [2.0, 3.0, 0.0, 5.0]])
>>> cs_groups = np.array([[0, 0, 1, 1],
...                       [0, 0, 1, 1],
...                       [0, 0, 1, 1]])
>>>
>>> transformer = CSGaussianRankScaler(min_group_size=2)
>>> transformer.fit_transform(X, cs_weights=cs_weights, cs_groups=cs_groups)
array([[-0.6791367 ,         nan, -0.34541391,  1.19141201],
       [ 0.69857792,  0.0863589 ,  0.36442412, -1.17438663],
       [-1.33305449,  0.13413753,         nan,  0.45273928]])
fit(X, y=None, cs_weights=None, cs_groups=None)

Fit the transformer.

Cross-sectional transformers are stateless and do not learn data-dependent parameters. This method validates the estimator parameters, validates X, and records n_features_in_ for scikit-learn compatibility.

Parameters:
X : array-like of shape (n_observations, n_assets)

Input matrix where each row is an observation and each column is an asset.

y : Ignored

Not used, present for API consistency by convention.

cs_weights : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional weights accepted for API consistency with transform. They are ignored during fitting.

cs_groups : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional group labels accepted for API consistency with transform. They are ignored during fitting.

Returns:
self : BaseCSTransformer

Fitted estimator.

fit_transform(X, y=None, cs_weights=None, cs_groups=None)

Fit to X and return the transformed values.

Parameters:
X : array-like of shape (n_observations, n_assets)

Input matrix where each row is an observation and each column is an asset.

y : Ignored

Not used, present for API consistency by convention.

cs_weights : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional weights forwarded to transform.

cs_groups : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional group labels forwarded to transform.

Returns:
X_new : ndarray of shape (n_observations, n_assets)

Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:
input_features : array-like of str or None, default=None

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:
feature_names_out : ndarray of str objects

Same as input features.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_fit_request(*, cs_groups='$UNCHANGED$', cs_weights='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
cs_groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_groups parameter in fit.

cs_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_weights parameter in fit.

Returns:
self : object

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

set_transform_request(*, cs_groups='$UNCHANGED$', cs_weights='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
cs_groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_groups parameter in transform.

cs_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_weights parameter in transform.

Returns:
self : object

The updated object.

transform(X, cs_weights=None, cs_groups=None)

Transform values into cross-sectional Gaussianized exposures.

Parameters:
X : array-like of shape (n_observations, n_assets)

Input matrix where each row is an observation and each column is an asset. NaNs are allowed and preserved.

cs_weights : array-like of shape (n_observations, n_assets), optional

Optional non-negative cross-sectional weights. They define the estimation universe through cs_weights > 0 and drive the final weighted recentering. Ranking itself remains equal-weighted over the selected assets. If None, all finite assets are included in the estimation universe.

cs_groups : array-like of shape (n_observations, n_assets), optional

Integer group labels >= -1. Missing groups (-1) and groups with fewer than min_group_size estimation assets fall back to the global cross-section. If None, ranking is performed on the full cross-section of each observation.

Returns:
Z : ndarray of shape (n_observations, n_assets)

Gaussianized exposures. Each cross-section has weighted mean zero over its estimation universe and, when scale=True, unit equal-weighted standard deviation. NaNs from X are preserved.

Raises:
ValueError
If min_group_size is not an integer >= 1, atol is not finite or is negative, X is not a non-empty 2D array, cs_weights is invalid, cs_groups is invalid, or any observation has no estimation asset.