Comparing Estimators

Different fields use different definitions of "effective dimension". This tutorial highlights the differences.

PCA vs Participation Ratio

PCA relies on a hard threshold (e.g., 95% variance). It answers "how many axes do I need to keep?".
Participation Ratio (PR) is a "soft" count. It answers "how spread out is the variance?".
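
The two definitions can be sketched directly on a list of covariance eigenvalues. This is a minimal illustration, not effdim's implementation: the helper names `pca_threshold_dim` and `participation_ratio` are hypothetical, and the threshold rule (smallest count whose cumulative variance share reaches 95%) is an assumption about how the hard threshold is applied.

```python
import numpy as np

def pca_threshold_dim(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose variance share reaches `threshold`."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(lam) / lam.sum()
    # The small tolerance guards against floating-point round-off at the threshold.
    return int(np.searchsorted(cumulative, threshold - 1e-12) + 1)

def participation_ratio(eigenvalues):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    lam = np.asarray(eigenvalues, dtype=float)
    return float(lam.sum() ** 2 / np.sum(lam ** 2))

# Two toy spectra: perfectly flat, and one dominant direction plus weak ones.
flat = np.ones(10)
peaked = np.array([100.0] + [1.0] * 9)

for name, lam in [("flat", flat), ("peaked", peaked)]:
    print(f"{name}: PCA (95%) = {pca_threshold_dim(lam)}, PR = {participation_ratio(lam):.2f}")
```

On the flat spectrum both estimators return 10. On the peaked spectrum the threshold count is 5, because the weak directions are needed to reach 95%, while PR is about 1.19, because nearly all the variance sits on one axis.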

Consider a spectrum where eigenvalues decay slowly: \(\lambda_i = 1/i\).

import numpy as np
import effdim

# Simulate a slow decay spectrum directly
D = 50
lambdas = 1.0 / np.arange(1, D+1)

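# Generate data X (N=1000, D=50) whose sample covariance has this spectrum
# X = U @ diag(s) @ V.T, with V = I here
# Singular values s_i = sqrt(lambda_i * (N-1))
N = 1000
s = np.sqrt(lambdas * (N - 1))
# Random U (N x D) with orthonormal columns. Centering before the QR step
# makes each column zero-mean, so mean-centering X later leaves the spectrum unchanged.
rng = np.random.default_rng(0)  # fixed seed for reproducibility
G = rng.standard_normal((N, D))
U, _ = np.linalg.qr(G - G.mean(axis=0))

X = U @ np.diag(s)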

results = effdim.compute_dim(X)
pca_95 = results['pca_explained_variance_95']
pr = results['participation_ratio']

print(f"PCA (95%): {pca_95}")
print(f"Participation Ratio: {pr:.2f}")

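For a heavy-tailed spectrum like this one, where eigenvalues decay slowly, the PCA threshold tends to return a high dimension because many small eigenvalues are needed to reach 95% of the variance. PR tends to return a lower value because most of the variance is concentrated in the leading eigenvalues.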
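
This behaviour can be checked on eigenvalue spectra alone, without generating data or calling effdim. The sketch below is an independent NumPy calculation: the helper `summarize` is hypothetical, and it assumes the 95% rule means "smallest count whose cumulative variance share reaches 0.95". It compares the slow \(1/i\) decay with a faster \(1/i^2\) decay.

```python
import numpy as np

D = 50
i = np.arange(1, D + 1)

def summarize(lam):
    """Return (PCA 95% count, participation ratio) for a descending spectrum."""
    p = lam / lam.sum()                               # normalized variance shares
    k95 = int(np.searchsorted(np.cumsum(p), 0.95) + 1)
    pr = float(1.0 / np.sum(p ** 2))                  # equals (sum lam)^2 / sum lam^2
    return k95, pr

slow_k95, slow_pr = summarize(1.0 / i)       # heavy tail
fast_k95, fast_pr = summarize(1.0 / i**2)    # lighter tail

print(f"1/i   : PCA (95%) = {slow_k95}, PR = {slow_pr:.2f}")
print(f"1/i^2 : PCA (95%) = {fast_k95}, PR = {fast_pr:.2f}")
```

For the \(1/i\) spectrum with D = 50, the threshold count comes out at 40 while PR is roughly 12.5. The faster decay shrinks both numbers, but the threshold count stays well above PR.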

Shannon vs Rényi

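Both measures treat the normalized eigenvalues \(p_i = \lambda_i / \sum_j \lambda_j\) as a probability distribution. Shannon entropy weights each \(p_i\) by \(-\log p_i\). Rényi entropy with \(\alpha=2\) depends only on \(\sum_i p_i^2\), so it weights higher probabilities more heavily; its exponential, \(1/\sum_i p_i^2\), is exactly PR.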

Shannon is sensitive to the entire distribution.
PR (Rényi-2) is more dominated by the largest eigenvalues.

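When both dimensions are defined as the exponential of the corresponding entropy, the Shannon dimension is never lower than PR, because Rényi entropy does not increase with \(\alpha\). The gap widens when a dataset has many small noise directions: each one barely changes PR but still adds to the Shannon entropy.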
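
The effect of noise directions can be illustrated with a toy spectrum. The sketch below assumes that each dimension is the exponential of the corresponding entropy of the normalized eigenvalues, which may differ from effdim's exact definition; `shannon_dim` and `renyi2_dim` are hypothetical helper names, and the spectrum values are made up for illustration.

```python
import numpy as np

def shannon_dim(eigenvalues):
    """exp(Shannon entropy) of the normalized eigenvalues."""
    p = np.asarray(eigenvalues, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))

def renyi2_dim(eigenvalues):
    """exp(Renyi-2 entropy) = 1 / sum(p_i^2), i.e. the participation ratio."""
    p = np.asarray(eigenvalues, dtype=float)
    p = p / p.sum()
    return float(1.0 / np.sum(p ** 2))

signal = np.array([5.0, 4.0, 3.0])    # three strong directions
noise = np.full(100, 0.05)            # one hundred weak noise directions
spectrum = np.concatenate([signal, noise])

for name, lam in [("signal only", signal), ("signal + noise", spectrum)]:
    print(f"{name}: Shannon = {shannon_dim(lam):.2f}, PR = {renyi2_dim(lam):.2f}")
```

With the signal alone the two values nearly coincide (both a little under 3). Adding the noise floor raises the Shannon dimension to about 15 but PR only to about 5.75, which matches the sensitivity described above.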