skfolio.model_selection.MultipleRandomizedCV#

class skfolio.model_selection.MultipleRandomizedCV(walk_forward, n_subsamples, asset_subset_size, window_size=None, random_state=None)[source]#

Multiple Randomized Cross-Validation.

Based on the “Multiple Randomized Backtests” methodology of Palomar [1], this cross-validation strategy performs a resampling-based evaluation by repeatedly sampling distinct asset subsets (without replacement) and contiguous time windows, then applying an inner walk-forward split to each subsample, capturing both temporal and cross-sectional variability in performance.

On each of the n_subsamples iterations, the following actions are performed:

  1. Randomly pick a contiguous time window of length window_size (or the full history if None).

  2. Randomly pick an asset subset of size asset_subset_size (without replacement).

  3. Run a walk-forward split (via the supplied walk_forward object) on that sub-dataset.

  4. Yield (train_indices, test_indices, asset_indices) for each inner split.
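The four steps above can be sketched as follows. This is a simplified illustration only, not skfolio's implementation: `multiple_randomized_splits` is a hypothetical helper, the inner walk-forward is reduced to fixed-size rolling windows, and distinct asset subsets across draws are not enforced here.

```python
import random

def multiple_randomized_splits(n_observations, n_assets, n_subsamples,
                               asset_subset_size, window_size,
                               train_size, test_size, seed=0):
    """Sketch of one randomized subsampling pass (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(n_subsamples):
        # 1. Random contiguous time window of length window_size.
        start = rng.randrange(n_observations - window_size + 1)
        window = list(range(start, start + window_size))
        # 2. Random asset subset of size asset_subset_size, no replacement.
        assets = sorted(rng.sample(range(n_assets), asset_subset_size))
        # 3.-4. Inner walk-forward on the sub-dataset, yielding indices.
        pos = 0
        while pos + train_size + test_size <= window_size:
            train = window[pos:pos + train_size]
            test = window[pos + train_size:pos + train_size + test_size]
            yield train, test, assets
            pos += test_size

splits = list(multiple_randomized_splits(
    n_observations=10, n_assets=5, n_subsamples=2,
    asset_subset_size=3, window_size=4, train_size=2, test_size=1))
```

With these parameters each subsample contributes two rolling folds, so the sketch yields four `(train, test, assets)` triples in total.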

Each asset subset is sampled without replacement (assets within each subset are distinct) and no subset is repeated across the n_subsamples draws. A combinatorial unranking algorithm computes any k-combination directly from its rank in O(n_subsamples * asset_subset_size) time and space, without generating or storing all \(M=\binom{n\_assets}{asset\_subset\_size}\) subsets. When \(M\) is small, this guarantees exhaustive coverage of every possible asset universe. Because ranks are drawn without replacement from a finite population of size \(M\), the variance of the sample mean is reduced by the finite-population correction factor \(\tfrac{M - n\_subsamples}{M - 1}\).
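Combinatorial unranking maps an integer rank to the k-combination at that position in lexicographic order, so drawing distinct ranks yields distinct subsets without ever materializing all \(M\) of them. A minimal sketch (standard lexicographic unranking; `unrank_combination` is an illustrative helper, not skfolio's internal function):

```python
from math import comb

def unrank_combination(rank, n, k):
    """Return the k-combination of range(n) at the given 0-based
    lexicographic rank, without enumerating all comb(n, k) subsets."""
    combo = []
    x = 0
    while len(combo) < k:
        remaining = k - len(combo)
        # Number of combinations whose next element is x.
        count = comb(n - x - 1, remaining - 1)
        if rank < count:
            combo.append(x)
        else:
            rank -= count
        x += 1
    return combo

print(unrank_combination(0, 5, 3))  # first subset: [0, 1, 2]
print(unrank_combination(9, 5, 3))  # last subset:  [2, 3, 4]
```

Sampling n_subsamples distinct ranks from range(comb(n, k)) and unranking each one gives the distinct asset universes described above.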

Parameters:
walk_forward : WalkForward

A WalkForward CV object to be applied to each subsample.

n_subsamples : int

Number of independent subsamples (sub-datasets) to draw. Each subsample is a (time window × asset subset) pair on which the inner walk-forward is run.

asset_subset_size : int

Number of assets to include in each subsample. Must be less than or equal to the total number of assets.

window_size : int or None, default=None

Length of the contiguous time slice (number of observations) for each subsample. If None, the full time series is used in every draw.

random_state : int, RandomState instance or None, default=None

Seed or random state to ensure reproducibility.

Methods

get_n_splits([X, y, groups])

Return the number of splitting iterations in the cross-validator.

get_path_ids()

Return the path id of each test set in each split.

split(X[, y])

Generate indices to split data into training and test set.

References

[1]

“Portfolio Optimization: Theory and Application”, Chapter 8, Daniel P. Palomar (2025)

Examples

>>> import numpy as np
>>> from skfolio.datasets import load_sp500_dataset
>>> from skfolio.model_selection import WalkForward, MultipleRandomizedCV
>>> from skfolio.preprocessing import prices_to_returns
>>>
>>> X = np.random.randn(4, 5) # 4 observations and 5 assets.
>>> # Draw 2 subsamples (sub-datasets) with 3 assets chosen randomly among the 5.
>>> # For each subsample, run a Walk Forward.
>>> # Use the full time series (no time resampling).
>>> cv = MultipleRandomizedCV(
...     walk_forward=WalkForward(test_size=1, train_size=2),
...     n_subsamples=2,
...     asset_subset_size=3,
...     window_size=None,
...     random_state=0,
... )
>>> for i, (train_index, test_index, assets) in enumerate(cv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train:  index={train_index}")
...     print(f"  Test:   index={test_index}")
...     print(f"  Assets: columns={assets}")
Fold 0:
  Train:  index=[0 1]
  Test:   index=[2]
  Assets: columns=[0 1 4]
Fold 1:
  Train:  index=[1 2]
  Test:   index=[3]
  Assets: columns=[0 1 4]
Fold 2:
  Train:  index=[0 1]
  Test:   index=[2]
  Assets: columns=[1 3 4]
Fold 3:
  Train:  index=[1 2]
  Test:   index=[3]
  Assets: columns=[1 3 4]
>>> print(f"Path ids: {cv.get_path_ids()}")
Path ids: [0 0 1 1]
>>>
>>> # Random contiguous time slice of 4 observations among 10 observations.
>>> X = np.random.randn(10, 5) # 10 observations and 5 assets.
>>> cv = MultipleRandomizedCV(
...     walk_forward=WalkForward(test_size=1, train_size=2),
...     n_subsamples=2,
...     asset_subset_size=3,
...     window_size=4,
...     random_state=0,
... )
>>> for i, (train_index, test_index, assets) in enumerate(cv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train:  index={train_index}")
...     print(f"  Test:   index={test_index}")
...     print(f"  Assets: columns={assets}")
Fold 0:
  Train:  index=[4 5]
  Test:   index=[6]
  Assets: columns=[0 1 4]
Fold 1:
  Train:  index=[5 6]
  Test:   index=[7]
  Assets: columns=[0 1 4]
Fold 2:
  Train:  index=[5 6]
  Test:   index=[7]
  Assets: columns=[1 3 4]
Fold 3:
  Train:  index=[6 7]
  Test:   index=[8]
  Assets: columns=[1 3 4]
>>>
>>> # Walk Forward with time-based (calendar) rebalancing.
>>> # Rebalance every 3 months on the third Friday, and train on the last 12 months.
>>> prices = load_sp500_dataset()
>>> X = prices_to_returns(prices)
>>> X = X["2021":"2022"]
>>> cv = MultipleRandomizedCV(
...     walk_forward=WalkForward(test_size=3, train_size=12, freq="WOM-3FRI"),
...     n_subsamples=2,
...     asset_subset_size=3,
...     window_size=None,
...     random_state=0,
... )
>>> for i, (train_index, test_index, assets) in enumerate(cv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train:  size={len(train_index)}")
...     print(f"  Test:   size={len(test_index)}")
...     print(f"  Assets: columns={assets}")
Fold 0:
  Train:  size=256
  Test:   size=59
  Assets: columns=[ 9 16 17]
Fold 1:
  Train:  size=253
  Test:   size=61
  Assets: columns=[ 9 16 17]
Fold 2:
  Train:  size=251
  Test:   size=69
  Assets: columns=[ 9 16 17]
Fold 3:
  Train:  size=256
  Test:   size=59
  Assets: columns=[ 7 10 14]
Fold 4:
  Train:  size=253
  Test:   size=61
  Assets: columns=[ 7 10 14]
Fold 5:
  Train:  size=251
  Test:   size=69
  Assets: columns=[ 7 10 14]
>>> print(f"Path ids: {cv.get_path_ids()}")
Path ids: [0 0 0 1 1 1]
get_n_splits(X=None, y=None, groups=None)[source]#

Return the number of splitting iterations in the cross-validator.

When combining a frequency-based walk-forward with window_size, the exact count depends on the random time slices drawn during split, so split must be called first. In all other cases the count is computed directly from the parameters and X.
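In the pre-computable case the count is simply the per-subsample fold count times n_subsamples. A hedged illustration of that arithmetic, assuming a fixed-size inner WalkForward with non-overlapping test folds (no calendar frequency); `expected_n_splits` is a hypothetical helper, not part of the skfolio API:

```python
def expected_n_splits(n_observations, window_size, train_size, test_size,
                      n_subsamples):
    """Pre-compute the split count when every subsample yields the
    same number of rolling folds (fixed-size inner WalkForward)."""
    length = n_observations if window_size is None else window_size
    folds_per_subsample = (length - train_size) // test_size
    return n_subsamples * folds_per_subsample
```

For the first example above (4 observations, window_size=None, train_size=2, test_size=1, n_subsamples=2) this gives 2 folds per subsample, i.e. 4 splits in total, matching the doctest output.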

Parameters:
X : array-like of shape (n_observations, n_assets)

Price returns of the assets. Required when the count can be pre-computed (i.e. window_size is None or the inner walk-forward has no frequency). Ignored after split has been called.

y : array-like of shape (n_observations, n_targets)

Always ignored, exists for compatibility.

groups : array-like of shape (n_observations,)

Always ignored, exists for compatibility.

Returns:
n_splits : int

Number of splitting iterations in the cross-validator.

get_path_ids()[source]#

Return the path id of each test set in each split.

split(X, y=None)[source]#

Generate indices to split data into training and test set.

Parameters:
X : array-like of shape (n_observations, n_assets)

Price returns of the assets.

y : array-like of shape (n_observations, n_targets)

Always ignored, exists for compatibility.

Yields:
train : ndarray

The training set indices for that split.

test : ndarray

The testing set indices for that split.

assets : ndarray

The asset indices for that split.