skfolio.model_selection.MultipleRandomizedCV#

class skfolio.model_selection.MultipleRandomizedCV(walk_forward, n_subsamples, asset_subset_size, window_size=None, random_state=None)[source]#

Multiple Randomized Cross-Validation.

Based on the “Multiple Randomized Backtests” methodology of Palomar [1], this cross-validation strategy performs a resampling-based evaluation by repeatedly sampling distinct asset subsets (without replacement) and contiguous time windows, then applying an inner walk-forward split to each subsample, capturing both temporal and cross-sectional variability in performance.

On each of the n_subsamples iterations, the following actions are performed:

  1. Randomly pick a contiguous time window of length window_size (or the full history if None).

  2. Randomly pick an asset subset of size asset_subset_size (without replacement).

  3. Run a walk-forward split (via the supplied walk_forward object) on that sub-dataset.

  4. Yield (train_indices, test_indices, asset_indices) for each inner split.
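The four steps above can be sketched as follows. This is a simplified illustration only, not skfolio's implementation: `multiple_randomized_splits` is a hypothetical helper, the inner walk-forward is reduced to fixed-size rolling windows, and distinct asset subsets across draws are not enforced here.

```python
import random

def multiple_randomized_splits(n_observations, n_assets, n_subsamples,
                               asset_subset_size, window_size,
                               train_size, test_size, seed=0):
    """Sketch of one randomized subsampling pass (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(n_subsamples):
        # 1. Random contiguous time window of length window_size.
        start = rng.randrange(n_observations - window_size + 1)
        window = list(range(start, start + window_size))
        # 2. Random asset subset of size asset_subset_size, no replacement.
        assets = sorted(rng.sample(range(n_assets), asset_subset_size))
        # 3.-4. Inner walk-forward on the sub-dataset, yielding indices.
        pos = 0
        while pos + train_size + test_size <= window_size:
            train = window[pos:pos + train_size]
            test = window[pos + train_size:pos + train_size + test_size]
            yield train, test, assets
            pos += test_size

splits = list(multiple_randomized_splits(
    n_observations=10, n_assets=5, n_subsamples=2,
    asset_subset_size=3, window_size=4, train_size=2, test_size=1))
```

With these parameters each subsample contributes two rolling folds, so the sketch yields four `(train, test, assets)` triples in total.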

Each asset subset is sampled without replacement (assets within each subset are distinct) and no subset is repeated across the n_subsamples draws. A combinatorial unranking algorithm computes any k-combination directly from its rank in O(n_subsamples * asset_subset_size) time and space, without generating or storing all \(M=\binom{n\_assets}{asset\_subset\_size}\) subsets. When \(M\) is small, this guarantees exhaustive coverage of every possible asset universe. Because ranks are drawn without replacement from a finite population of size \(M\), the variance of the sample mean is reduced by the finite-population correction factor \(\tfrac{M - n\_subsamples}{M - 1}\).
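Combinatorial unranking maps an integer rank to the k-combination at that position in lexicographic order, so drawing distinct ranks yields distinct subsets without ever materializing all \(M\) of them. A minimal sketch (standard lexicographic unranking; `unrank_combination` is an illustrative helper, not skfolio's internal function):

```python
from math import comb

def unrank_combination(rank, n, k):
    """Return the k-combination of range(n) at the given 0-based
    lexicographic rank, without enumerating all comb(n, k) subsets."""
    combo = []
    x = 0
    while len(combo) < k:
        remaining = k - len(combo)
        # Number of combinations whose next element is x.
        count = comb(n - x - 1, remaining - 1)
        if rank < count:
            combo.append(x)
        else:
            rank -= count
        x += 1
    return combo

print(unrank_combination(0, 5, 3))  # first subset: [0, 1, 2]
print(unrank_combination(9, 5, 3))  # last subset:  [2, 3, 4]
```

Sampling n_subsamples distinct ranks from range(comb(n, k)) and unranking each one gives the distinct asset universes described above.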

Parameters:
walk_forward : WalkForward

A WalkForward CV object to be applied to each subsample.

n_subsamples : int

Number of independent subsamples (sub-datasets) to draw. Each subsample is a (time window × asset subset) pair on which the inner walk-forward is run.

asset_subset_size : int

Number of assets to include in each subsample. Must be less than or equal to the total number of assets.

window_size : int or None, default=None

Length of the contiguous time slice (number of observations) for each subsample. If None, the full time series is used in every draw.

random_state : int, RandomState instance or None, default=None

Seed or random state to ensure reproducibility.

Methods

get_n_splits([X, y, groups])

Return the number of splitting iterations in the cross-validator.

get_path_ids()

Return the path id of each test set in each split.

split(X[, y])

Generate indices to split data into training and test set.

References

[1]

“Portfolio Optimization: Theory and Application”, Chapter 8, Daniel P. Palomar (2025)

Examples

>>> import numpy as np
>>> from skfolio.datasets import load_sp500_dataset
>>> from skfolio.model_selection import WalkForward, MultipleRandomizedCV
>>> from skfolio.preprocessing import prices_to_returns
>>>
>>> X = np.random.randn(4, 5) # 4 observations and 5 assets.
>>> # Draw 2 subsamples (sub-datasets) with 3 assets chosen randomly among the 5.
>>> # For each subsample, run a Walk Forward.
>>> # Use the full time series (no time resampling).
>>> cv = MultipleRandomizedCV(
...     walk_forward=WalkForward(test_size=1, train_size=2),
...     n_subsamples=2,
...     asset_subset_size=3,
...     window_size=None,
...     random_state=0,
... )
>>> for i, (train_index, test_index, assets) in enumerate(cv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train:  index={train_index}")
...     print(f"  Test:   index={test_index}")
...     print(f"  Assets: columns={assets}")
Fold 0:
  Train:  index=[0 1]
  Test:   index=[2]
  Assets: columns=[0 1 4]
Fold 1:
  Train:  index=[1 2]
  Test:   index=[3]
  Assets: columns=[0 1 4]
Fold 2:
  Train:  index=[0 1]
  Test:   index=[2]
  Assets: columns=[1 3 4]
Fold 3:
  Train:  index=[1 2]
  Test:   index=[3]
  Assets: columns=[1 3 4]
>>> print(f"Path ids: {cv.get_path_ids()}")
Path ids: [0 0 1 1]
>>>
>>> # Random contiguous time slice of 4 observations among 10 observations.
>>> X = np.random.randn(10, 5) # 10 observations and 5 assets.
>>> cv = MultipleRandomizedCV(
...     walk_forward=WalkForward(test_size=1, train_size=2),
...     n_subsamples=2,
...     asset_subset_size=3,
...     window_size=4,
...     random_state=0,
... )
>>> for i, (train_index, test_index, assets) in enumerate(cv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train:  index={train_index}")
...     print(f"  Test:   index={test_index}")
...     print(f"  Assets: columns={assets}")
Fold 0:
  Train:  index=[4 5]
  Test:   index=[6]
  Assets: columns=[0 1 4]
Fold 1:
  Train:  index=[5 6]
  Test:   index=[7]
  Assets: columns=[0 1 4]
Fold 2:
  Train:  index=[5 6]
  Test:   index=[7]
  Assets: columns=[1 3 4]
Fold 3:
  Train:  index=[6 7]
  Test:   index=[8]
  Assets: columns=[1 3 4]
>>>
>>> # Walk Forward with time-based (calendar) rebalancing.
>>> # Rebalance every 3 months on the third Friday, and train on the last 12 months.
>>> prices = load_sp500_dataset()
>>> X = prices_to_returns(prices)
>>> X = X["2021":"2022"]
>>> cv = MultipleRandomizedCV(
...     walk_forward=WalkForward(test_size=3, train_size=12, freq="WOM-3FRI"),
...     n_subsamples=2,
...     asset_subset_size=3,
...     window_size=None,
...     random_state=0,
... )
>>> for i, (train_index, test_index, assets) in enumerate(cv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train:  size={len(train_index)}")
...     print(f"  Test:   size={len(test_index)}")
...     print(f"  Assets: columns={assets}")
Fold 0:
  Train:  size=256
  Test:   size=59
  Assets: columns=[ 9 16 17]
Fold 1:
  Train:  size=253
  Test:   size=61
  Assets: columns=[ 9 16 17]
Fold 2:
  Train:  size=251
  Test:   size=69
  Assets: columns=[ 9 16 17]
Fold 3:
  Train:  size=256
  Test:   size=59
  Assets: columns=[ 7 10 14]
Fold 4:
  Train:  size=253
  Test:   size=61
  Assets: columns=[ 7 10 14]
Fold 5:
  Train:  size=251
  Test:   size=69
  Assets: columns=[ 7 10 14]
>>> print(f"Path ids: {cv.get_path_ids()}")
Path ids: [0 0 0 1 1 1]
get_n_splits(X=None, y=None, groups=None)[source]#

Return the number of splitting iterations in the cross-validator.

When combining a frequency-based walk-forward with window_size, the exact count depends on the random time slices drawn during split, so split must be called first. In all other cases the count is computed directly from the parameters and X.
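In the pre-computable case the count is simply the per-subsample fold count times n_subsamples. A hedged illustration of that arithmetic, assuming a fixed-size inner WalkForward with non-overlapping test folds (no calendar frequency); `expected_n_splits` is a hypothetical helper, not part of the skfolio API:

```python
def expected_n_splits(n_observations, window_size, train_size, test_size,
                      n_subsamples):
    """Pre-compute the split count when every subsample yields the
    same number of rolling folds (fixed-size inner WalkForward)."""
    length = n_observations if window_size is None else window_size
    folds_per_subsample = (length - train_size) // test_size
    return n_subsamples * folds_per_subsample
```

For the first example above (4 observations, window_size=None, train_size=2, test_size=1, n_subsamples=2) this gives 2 folds per subsample, i.e. 4 splits in total, matching the doctest output.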

Parameters:
X : array-like of shape (n_observations, n_assets)

Price returns of the assets. Required when the count can be pre-computed (i.e. window_size is None or the inner walk-forward has no frequency). Ignored after split has been called.

y : array-like of shape (n_observations, n_targets)

Always ignored, exists for compatibility.

groups : array-like of shape (n_observations,)

Always ignored, exists for compatibility.

Returns:
n_splits : int

Number of splitting iterations in the cross-validator.

get_path_ids()[source]#

Return the path id of each test set in each split.

split(X, y=None)[source]#

Generate indices to split data into training and test set.

Parameters:
X : array-like of shape (n_observations, n_assets)

Price returns of the assets.

y : array-like of shape (n_observations, n_targets)

Always ignored, exists for compatibility.

Yields:
train : ndarray

The training set indices for that split.

test : ndarray

The testing set indices for that split.

assets : ndarray

The asset indices for that split.