skfolio.utils.stats.compute_optimal_n_clusters

skfolio.utils.stats.compute_optimal_n_clusters(distance, linkage_matrix)

Compute the optimal number of clusters based on Two-Order Difference to Gap Statistic [1].

The Two-Order Difference to Gap Statistic was developed to improve the performance and stability of Tibshirani's Gap statistic. It replaces the reference null distribution used in the Gap statistic with the two-order difference of the within-cluster dispersion.

The number of clusters \(k\) is determined by:

\[\begin{split}\begin{cases} \begin{aligned} &\max_{k} & & W_{k+2} + W_{k} - 2 W_{k+1} \\ &\text{s.t.} & & 1 \le k \le \max\bigl(8, \sqrt{n}\bigr) \\ \end{aligned} \end{cases}\end{split}\]

with \(n\) the sample size and \(W_{k}\) the within-cluster dispersions defined as:

\[W_{k} = \sum_{i=1}^{k} \frac{D_{i}}{2|C_{i}|}\]

where \(|C_{i}|\) is the cardinality of cluster \(i\) and \(D_{i}\) its density defined as:

\[D_{i} = \sum_{u \in C_{i}} \sum_{v \in C_{i}} d(u,v)\]

with \(d(u,v)\) the distance between \(u\) and \(v\).
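The formulas above can be sketched in plain Python. This is a minimal illustration of the definitions of \(D_{i}\), \(W_{k}\) and the two-order difference only, not the library's implementation. The helper names `within_cluster_dispersion` and `two_order_differences`, the label-list input format and the hand-written partitions are all assumptions made for the example; in practice the partitions would come from cutting the linkage tree at each level.

```python
from math import sqrt


def within_cluster_dispersion(distance, labels):
    """Return W_k = sum_i D_i / (2 |C_i|) for one partition.

    `labels[p]` is the cluster label of point p (hypothetical input format).
    """
    clusters = {}
    for point, label in enumerate(labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # D_i sums d(u, v) over all ordered pairs (u, v) inside the cluster.
        density = sum(distance[u][v] for u in members for v in members)
        total += density / (2 * len(members))
    return total


def two_order_differences(w, n):
    """Map k to W_{k+2} + W_k - 2 W_{k+1} for 1 <= k <= max(8, sqrt(n))."""
    k_max = max(8, sqrt(n))
    return {
        k: w[k + 2] + w[k] - 2 * w[k + 1]
        for k in sorted(w)
        if 1 <= k <= k_max and k + 1 in w and k + 2 in w
    }


# Toy data: four points on a line, forming two tight pairs.
positions = [0, 1, 10, 11]
distance = [[abs(a - b) for b in positions] for a in positions]

# Hand-written nested partitions into k = 1..4 clusters (illustrative only).
partitions = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
    4: [0, 1, 2, 3],
}

w = {k: within_cluster_dispersion(distance, labels) for k, labels in partitions.items()}
scores = two_order_differences(w, n=len(positions))
k_star = max(scores, key=scores.get)  # maximizer of the two-order difference

print(w)       # {1: 10.5, 2: 1.0, 3: 0.5, 4: 0.0}
print(scores)  # {1: 9.0, 2: 0.0}
print(k_star)  # 1
```

In this toy case the largest two-order difference occurs at \(k = 1\), reflecting the sharp drop in \(W_{k}\) between one and two clusters. The difference indexed by \(k\) is centered on \(k + 1\), so how the maximizer is mapped to the reported number of clusters is a convention this sketch does not settle.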

Parameters:
distance : ndarray of shape (n, n)

Distance matrix.

linkage_matrix : ndarray of shape (n - 1, 4)

Linkage matrix.

Returns:
value : int

Optimal number of clusters.

References

[1] “Application of two-order difference to gap statistic”. Yue, Wang & Wei (2009)