skfolio.utils.stats.compute_optimal_n_clusters
- skfolio.utils.stats.compute_optimal_n_clusters(distance, linkage_matrix)
Compute the optimal number of clusters based on the Two-Order Difference to Gap Statistic [1].
The Two-Order Difference to Gap Statistic was developed to improve the performance and stability of Tibshirani's Gap statistic. It replaces the reference null distribution used in the Gap statistic with the two-order difference of the within-cluster dispersion.
The number of clusters \(k\) is determined by:
\[\begin{split}\begin{cases} \begin{aligned} &\max_{k} & & W_{k+2} + W_{k} - 2 W_{k+1} \\ &\text{s.t.} & & 1 \le k \le \max\bigl(8, \sqrt{n}\bigr) \\ \end{aligned} \end{cases}\end{split}\] with \(n\) the sample size and \(W_{k}\) the within-cluster dispersion defined as:
\[W_{k} = \sum_{i=1}^{k} \frac{D_{i}}{2|C_{i}|}\] where \(|C_{i}|\) is the cardinality of cluster \(i\) and \(D_{i}\) its density defined as:
\[D_{i} = \sum_{u \in C_{i}} \sum_{v \in C_{i}} d(u,v)\] with \(d(u,v)\) the distance between \(u\) and \(v\).
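The formulas above can be traced by hand on a toy example. The following is a minimal pure-Python sketch of \(W_{k}\) and of the two-order difference exactly as written in the formulas; it is not the library's implementation. The helper names `within_cluster_dispersion` and `two_order_difference`, the four sample points, and the hand-written cluster labelings are all assumptions made for illustration.

```python
def within_cluster_dispersion(distance, labels):
    """W_k = sum_i D_i / (2 |C_i|), with D_i the sum of d(u, v) over cluster i."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    total = 0.0
    for members in clusters.values():
        # D_i runs over ordered pairs, so every unordered pair is counted twice.
        d_i = sum(distance[u][v] for u in members for v in members)
        total += d_i / (2 * len(members))
    return total


def two_order_difference(dispersions):
    """Map k to W_{k+2} + W_k - 2 W_{k+1}, given dispersions = [W_1, W_2, ...]."""
    return {
        k: dispersions[k + 1] + dispersions[k - 1] - 2 * dispersions[k]
        for k in range(1, len(dispersions) - 1)
    }


# Hypothetical data: four points on a line forming two tight pairs.
points = [0, 1, 10, 11]
distance = [[abs(a - b) for b in points] for a in points]

# Hand-written labelings for k = 1, 2, 3, 4 clusters (stand-ins for tree cuts).
labelings = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 2], [0, 1, 2, 3]]

dispersions = [within_cluster_dispersion(distance, labels) for labels in labelings]
differences = two_order_difference(dispersions)

print(dispersions)  # [10.5, 1.0, 0.5, 0.0]
print(differences)  # {1: 9.0, 2: 0.0}
```

The largest difference occurs at \(k = 1\): the dispersion drops by 9.5 when going from one to two clusters but only by 0.5 when going from two to three, which is the elbow the criterion looks for.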
- Parameters:
- distance : ndarray of shape (n, n)
Distance matrix.
- linkage_matrix : ndarray of shape (n - 1, 4)
Linkage matrix.
- Returns:
- value : int
Optimal number of clusters.
References
[1] “Application of two-order difference to gap statistic”. Yue, Wang & Wei (2009).