skfolio.preprocessing.CSTanhShrinker

class skfolio.preprocessing.CSTanhShrinker(*, knee=3.0, atol=1e-12)

Cross-sectional tanh outlier shrinker.

Smoothly shrinks extreme values within an observation toward the cross-sectional center while preserving the original scale and units of the input values. Values near the center are left nearly unchanged, while extreme values are compressed inward.

NaNs are treated as missing values. They are ignored when computing the cross-sectional median and MAD and are preserved in the output.

Compared to winsorization (CSWinsorizer):

  • No hard threshold. The mapping is smooth, so small data changes do not create discontinuous jumps at a clipping boundary.

  • Strict monotonicity. Tail ordering is preserved because distinct inputs remain distinct after transformation.

  • Smooth transformed values. This can lead to better-conditioned cross-sectional regressions and more stable coefficient estimates in downstream models.
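The contrast drawn in the list above can be illustrated with a minimal sketch that compares a hard clip with a tanh mapping on a toy vector. It uses plain NumPy rather than the library's transformers. The fixed `center` and `half_width` values are illustrative assumptions, and `np.clip` with fixed bounds only stands in for percentile-based winsorization.

```python
import numpy as np

# Toy cross-section with two extreme values in each tail.
x = np.array([-9.0, -6.0, -1.0, 0.0, 1.0, 6.0, 9.0])

# Illustrative center and half-width (assumed, not estimated from data).
center, half_width = 0.0, 3.0

# Hard clipping: every value beyond the bound collapses onto the bound.
clipped = np.clip(x, center - half_width, center + half_width)

# Tanh mapping: values are compressed smoothly toward the center.
smooth = center + half_width * np.tanh((x - center) / half_width)

# Count distinct outputs and check that ordering is strictly preserved.
n_unique_clipped = np.unique(clipped).size
n_unique_smooth = np.unique(smooth).size
strictly_increasing = bool(np.all(np.diff(smooth) > 0))

print(clipped)
print(smooth)
print(n_unique_clipped, n_unique_smooth, strictly_increasing)
```

With the clip, each pair of tail values collapses onto the same bound, while the tanh mapping keeps all seven values distinct and in order. In floating point this holds only while tanh has not saturated to exactly ±1, which happens on the order of twenty half-widths from the center.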

For observation \(t\) with cross-section \(\mathbf{x}_t\), the transformation is

\[x_{t,i}' = m_t + h_t \cdot \tanh\!\left(\frac{x_{t,i} - m_t}{h_t}\right), \quad h_t = c \cdot s_t\]

where \(m_t = \operatorname{median}(\mathbf{x}_t)\), \(s_t = 1.4826 \cdot \operatorname{MAD}(\mathbf{x}_t)\) is a robust scale estimator consistent for the standard deviation under normality, and \(c\) is the knee parameter (see knee). The quantity \(h_t = c \cdot s_t\) is the half-width of the near-linear region for observation \(t\).
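The formula above can be sketched for a single observation as follows. This is a NumPy illustration of the stated equation, not the library's implementation. The helper name `tanh_shrink_row` is hypothetical, and the sketch omits the `atol` guard and all input validation.

```python
import numpy as np


def tanh_shrink_row(x, knee=3.0):
    """Apply the tanh shrinkage formula to one cross-section (1D array).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    m = np.nanmedian(x)                  # cross-sectional median m_t
    mad = np.nanmedian(np.abs(x - m))    # median absolute deviation
    s = 1.4826 * mad                     # robust scale s_t
    h = knee * s                         # half-width h_t = c * s_t
    # NaN inputs stay NaN because tanh(NaN) is NaN.
    return m + h * np.tanh((x - m) / h)


row = np.array([1.0, np.nan, 3.0, 4.0])
shrunk = tanh_shrink_row(row)
print(shrunk)
```

For this row the median is 3 and the MAD is 1, so \(h_t = 3 \cdot 1.4826 = 4.4478\). The result agrees with the first row of the output shown in the Examples section.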

When cs_weights is provided, median and MAD are computed from the estimation universe, defined by cs_weights > 0. Assets outside the estimation universe still receive shrunk values using those statistics. For this estimator, cs_weights is used only to define the estimation universe; the median and MAD remain equal-weighted over the selected assets.
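The estimation-universe rule described above can be sketched for one observation as follows. This is a NumPy illustration under the stated convention `cs_weights > 0`, not the library's code. The helper name `tanh_shrink_row_universe` is hypothetical, and the `atol` guard and validation are omitted.

```python
import numpy as np


def tanh_shrink_row_universe(x, w, knee=3.0):
    """Shrink one cross-section using statistics from the estimation universe.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # Estimation universe: positive weight and a finite value.
    in_universe = (w > 0) & np.isfinite(x)
    # Equal-weighted median and MAD over the universe only.
    m = np.median(x[in_universe])
    s = 1.4826 * np.median(np.abs(x[in_universe] - m))
    h = knee * s
    # Every asset is shrunk, including those outside the universe.
    return m + h * np.tanh((x - m) / h)


row = np.array([4.0, 3.0, 2.0, 1.0])
weights = np.array([1.0, 0.0, 1.0, 1.0])
shrunk = tanh_shrink_row_universe(row, weights)

# Only the sign of each weight matters, not its magnitude.
shrunk_rescaled = tanh_shrink_row_universe(row, np.array([5.0, 0.0, 0.1, 2.0]))
print(shrunk)
```

The second asset has zero weight, so the median and MAD come from the values 4, 2 and 1, yet the asset is still shrunk with those statistics. The result agrees with the second row of the weighted output in the Examples section.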

This transformer is stateless.

Parameters:
knee : float, default=3.0

Compression knee in robust standard deviations. It controls the width of the near-linear region around the median. Larger values reduce shrinkage. Must be finite and strictly positive.
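The effect of `knee` can be seen in a small sketch that applies the tanh formula to one fixed value for several knee settings. The median `m`, robust scale `s` and input `x` below are illustrative assumptions rather than values estimated from data.

```python
import numpy as np

# Assumed cross-sectional statistics and one value four robust
# standard deviations above the median.
m, s = 0.0, 1.0
x = 4.0

results = {}
for knee in (1.0, 3.0, 10.0):
    h = knee * s                                  # half-width of the near-linear region
    results[knee] = float(m + h * np.tanh((x - m) / h))

for knee, value in results.items():
    print(f"knee={knee:>4}: {value:.4f}")
```

As `knee` grows, the shrunk value moves closer to the original input, and for every setting the output stays strictly inside the interval \((m - h, m + h)\).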

atol : float, default=1e-12

Absolute tolerance for the robust scale. If \(s_t\) is below atol, observation \(t\) is returned unchanged. Must be finite and non-negative.
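The role of `atol` can be sketched as a guard on the robust scale. This is an illustration only: the helper name `tanh_shrink_row_guarded` is hypothetical, and the strict comparison `s < atol` and the returned copy are assumptions based on the wording above, not confirmed library behaviour.

```python
import numpy as np


def tanh_shrink_row_guarded(x, knee=3.0, atol=1e-12):
    """Tanh shrinkage for one cross-section with a degenerate-scale guard.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    m = np.nanmedian(x)
    s = 1.4826 * np.nanmedian(np.abs(x - m))
    if s < atol:
        # Degenerate scale: skip the mapping to avoid dividing by ~0.
        return x.copy()
    h = knee * s
    return m + h * np.tanh((x - m) / h)


# More than half of the values are identical, so the MAD is exactly zero.
flat = np.array([2.0, 2.0, 2.0, 5.0])
out_flat = tanh_shrink_row_guarded(flat)

# A row with a non-degenerate scale is shrunk as usual.
out_regular = tanh_shrink_row_guarded(np.array([1.0, 3.0, 4.0]))
print(out_flat, out_regular)
```

A zero MAD arises whenever more than half of the finite values in a row are equal, which is why such a guard is useful even on data that otherwise looks dispersed.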

Methods

fit(X[, y, cs_weights, cs_groups])

Fit the transformer.

fit_transform(X[, y, cs_weights, cs_groups])

Fit to X and return the transformed values.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, cs_groups, cs_weights])

Configure whether metadata should be requested to be passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

set_transform_request(*[, cs_groups, cs_weights])

Configure whether metadata should be requested to be passed to the transform method.

transform(X[, cs_weights, cs_groups])

Shrink outliers within each observation using a tanh mapping.

See also

CSWinsorizer

Hard percentile-based clipping.

Examples

>>> import numpy as np
>>> from skfolio.preprocessing import CSTanhShrinker
>>>
>>> X = np.array([[1.0, np.nan, 3.0, 4.0],
...               [4.0, 3.0, 2.0, 1.0],
...               [10.0, 20.0, np.nan, 40.0]])
>>>
>>> transformer = CSTanhShrinker()
>>> transformer.fit_transform(X)
array([[ 1.12471866,         nan,  3.        ,  3.98348436],
       [ 3.94560619,  2.99790441,  2.00209559,  1.05439381],
       [10.16515641, 20.        ,         nan, 38.75281341]])
>>>
>>> # Use cs_weights to define the estimation universe for the median and MAD.
>>> cs_weights = np.array([[1.0, 0.0, 1.0, 1.0],
...                        [1.0, 0.0, 1.0, 1.0],
...                        [1.0, 1.0, 0.0, 1.0]])
>>>
>>> transformer.fit_transform(X, cs_weights=cs_weights)
array([[ 1.12471866,         nan,  3.        ,  3.98348436],
       [ 3.87528134,  2.98348436,  2.        ,  1.01651564],
       [10.16515641, 20.        ,         nan, 38.75281341]])

fit(X, y=None, cs_weights=None, cs_groups=None)

Fit the transformer.

Cross-sectional transformers are stateless and do not learn data-dependent parameters. This method validates the estimator parameters, validates X, and records n_features_in_ for scikit-learn compatibility.

Parameters:
X : array-like of shape (n_observations, n_assets)

Input matrix where each row is an observation and each column is an asset.

y : Ignored

Not used, present for API consistency by convention.

cs_weights : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional weights accepted for API consistency with transform. They are ignored during fitting.

cs_groups : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional group labels accepted for API consistency with transform. They are ignored during fitting.

Returns:
self : BaseCSTransformer

Fitted estimator.

fit_transform(X, y=None, cs_weights=None, cs_groups=None)

Fit to X and return the transformed values.

Parameters:
X : array-like of shape (n_observations, n_assets)

Input matrix where each row is an observation and each column is an asset.

y : Ignored

Not used, present for API consistency by convention.

cs_weights : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional weights forwarded to transform.

cs_groups : array-like of shape (n_observations, n_assets), optional

Optional cross-sectional group labels forwarded to transform.

Returns:
X_new : ndarray of shape (n_observations, n_assets)

Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:
input_features : array-like of str or None, default=None

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:
feature_names_out : ndarray of str objects

Same as input features.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_fit_request(*, cs_groups='$UNCHANGED$', cs_weights='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
cs_groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_groups parameter in fit.

cs_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_weights parameter in fit.

Returns:
self : object

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

set_transform_request(*, cs_groups='$UNCHANGED$', cs_weights='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
cs_groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_groups parameter in transform.

cs_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for cs_weights parameter in transform.

Returns:
self : object

The updated object.

transform(X, cs_weights=None, cs_groups=None)

Shrink outliers within each observation using a tanh mapping.

Parameters:
X : array-like of shape (n_observations, n_assets)

Input matrix where each row is an observation and each column is an asset. NaNs are allowed and preserved.

cs_weights : array-like of shape (n_observations, n_assets), optional

Optional non-negative cross-sectional weights used only to define the estimation universe through the convention cs_weights > 0. The median and MAD are then computed in an equal-weighted way over the selected assets. Non-estimation assets still receive shrunk values using those statistics. If None, all finite assets are used.

cs_groups : array-like of shape (n_observations, n_assets), optional

Not used, present for API consistency by convention.

Returns:
X_shrunk : ndarray of shape (n_observations, n_assets)

Shrunk values in the original scale. NaN values from the input are preserved.

Raises:
ValueError

If knee is not finite or <= 0, atol is not finite or < 0, X is not a non-empty 2D array, cs_weights is invalid, or any observation has no estimation asset.