Geometric Analysis
Geometric estimators compute the intrinsic dimension (ID) from distances between neighboring points rather than from the variance of global linear projections. This matters for non-linear manifolds such as a Swiss Roll, where a curved low-dimensional surface spreads its variance across every ambient axis.
The Swiss Roll Problem
A "Swiss Roll" is a 2D plane rolled up in 3D.
- PCA will see it as 3D (because variance exists in x, y, z).
- Geometric ID should see it as 2D (locally, it's a plane).
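To make the PCA half of this claim concrete, here is a minimal sketch that uses scikit-learn's `PCA` directly (an assumption made for illustration; it is not necessarily how effdim computes its PCA-based value). It counts how many principal components are needed to reach 95% cumulative explained variance on a Swiss Roll. The `random_state` value and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

# Same kind of data as in this section; the seed is fixed only for repeatability.
X, _ = make_swiss_roll(n_samples=2000, noise=0.01, random_state=0)

# Fit a full PCA and accumulate the explained-variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95%:", n_components_95)
```

Because the roll has comparable spread along all three axes, no pair of linear directions captures 95% of the variance, so a variance threshold reports the ambient dimension rather than the dimension of the surface.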
import numpy as np
import effdim
from sklearn.datasets import make_swiss_roll
# Generate Swiss Roll
X, _ = make_swiss_roll(n_samples=2000, noise=0.01)
# Compute dimensionalities
results = effdim.compute_dim(X)
# PCA
pca_dim = results['pca_explained_variance_95']
print(f"Global PCA Dimension: {pca_dim}")
# Likely 3, because the roll occupies 3D volume globally.
# kNN Intrinsic Dimension (MLE)
knn_dim = results['mle_dimensionality']
print(f"kNN Intrinsic Dimension: {knn_dim:.2f}")
# Should be close to 2.0
# Two-NN
twonn_dim = results['two_nn_dimensionality']
print(f"Two-NN Intrinsic Dimension: {twonn_dim:.2f}")
# Should be close to 2.0
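For readers curious about what distance-based estimators of this kind do, the following is a rough sketch of the textbook Two-NN estimator (in its maximum-likelihood form) and the Levina-Bickel kNN MLE, written with scikit-learn's `NearestNeighbors`. It is an illustration under stated assumptions, not effdim's implementation: the helper names `knn_distances`, `two_nn_id`, and `mle_id`, the choice `k=20`, and the seed are all hypothetical, and effdim may use different variants or corrections.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors


def knn_distances(X, k):
    """Distances from each point to its k nearest neighbors (self excluded)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    return dist[:, 1:]  # column 0 is the zero distance to the point itself


def two_nn_id(X):
    """Two-NN estimate, maximum-likelihood form: N / sum(log(r2 / r1))."""
    d = knn_distances(X, k=2)
    mu = d[:, 1] / d[:, 0]  # assumes no duplicate points (r1 > 0)
    return len(mu) / np.sum(np.log(mu))


def mle_id(X, k=20):
    """Levina-Bickel MLE: mean over points of (k - 1) / sum_j log(T_k / T_j)."""
    d = knn_distances(X, k=k)
    log_ratio = np.log(d[:, -1:] / d[:, :-1])  # log(T_k / T_j) for j = 1..k-1
    local = (k - 1) / np.sum(log_ratio, axis=1)  # one local estimate per point
    return float(np.mean(local))


X, _ = make_swiss_roll(n_samples=2000, noise=0.01, random_state=0)
print(f"Two-NN sketch: {two_nn_id(X):.2f}")
print(f"kNN MLE sketch (k=20): {mle_id(X):.2f}")
```

Both functions look only at ratios of neighbor distances, which is why they respond to the local 2D structure of the roll rather than to its global 3D extent. The per-point values in `mle_id` can also be inspected individually when a local, rather than averaged, estimate is of interest.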
When to use Geometric Estimators?
- Non-linear manifolds: Image datasets (digits, faces) often lie on low-dimensional non-linear manifolds.
- Manifold Learning: Checking whether your autoencoder's latent dimension matches the intrinsic dimension of the data.
- Local Analysis: Because they rely only on neighborhood distances, geometric estimators can capture local structure that a global projection averages out.
Limitations
- Computational Cost: Requires computing nearest neighbors, which can be slow for large \(N\). effdim utilizes the highly efficient Faiss implementation under the hood to speed this up.
- Curse of Dimensionality: In extremely high dimensions, distance concentration can make geometric estimation unstable.
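Distance concentration can be observed directly with a small NumPy experiment. The sketch below is illustrative only: the helper name `relative_contrast`, the uniform sampling in the unit cube, and the sample sizes are assumptions chosen for the demonstration. It measures how far apart the nearest and farthest points are from a query, relative to the nearest distance, as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(dim, n_points=500):
    """(max - min) / min of distances from one query to uniform random points."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

As the dimension increases, the contrast shrinks toward zero: the nearest and farthest neighbors end up at almost the same distance, so estimators built on ratios of neighbor distances have less signal to work with.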