The Weird Geometry of High-Dimensional Representations: Why Your ML Models Behave Differently
By Chima Emmanuel & Angel Ezeahurukwe
Reading Time: 12 minutes
1. Introduction
When students first learn Euclidean geometry, they learn that space is isotropic: all directions are equivalent, and distance is a reliable indicator of similarity. This intuition applies faithfully in 2D or 3D.
But modern machine learning models operate in a very different regime. Word embeddings live in 300 dimensions. Transformer hidden states inhabit spaces of dimension 1024. Image features reside in spaces of 2048 or more.
In these hyper-spaces, geometric intuition becomes not merely imprecise but actively misleading. The culprit is concentration of measure. Loosely, in high dimensions, a smooth function of many independent random variables is very close to its mean with overwhelmingly high probability. This has dramatic consequences: norms concentrate, angles shrink toward 90 degrees, and inter-point distances become approximately equal.
2. The Toolbox: Sub-Gaussianity
To understand these results rigorously, we work with sub-Gaussian distributions: random variables whose tails decay at least as fast as a Gaussian's. This class includes Gaussian variables, Rademacher variables ($\pm 1$ with probability $1/2$ each), and any bounded random variable.
Isotropic Vectors
A random vector $X \in \mathbb{R}^n$ is called isotropic if:
- $\mathbb{E}[XX^\top] = I_n$ (the components are uncorrelated and have unit variance).
Key Inequality: Bernstein's
Our proofs rest on Bernstein's Inequality. For independent mean-zero sub-exponential variables $X_1, \dots, X_n$ (like the centered squares of sub-Gaussian variables), it states:

$$\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i \right| \ge t \right) \le 2 \exp\left( -c\, n \min\left( \frac{t^2}{K^2}, \frac{t}{K} \right) \right),$$

where $K$ bounds the sub-exponential norms. Essentially, as dimension grows, the probability of the sum deviating from the mean drops exponentially.
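As a sanity check, here is a small NumPy simulation of this exponential decay; the sample count, deviation threshold, and Gaussian coordinates are illustrative choices, not part of the theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
trials, t = 5000, 0.2  # deviation threshold from the mean E[X_i^2] = 1

def deviation_prob(n):
    # Estimate P(|1/n * sum(X_i^2) - 1| > t) for standard Gaussian coordinates
    samples = rng.standard_normal((trials, n))
    return np.mean(np.abs((samples ** 2).mean(axis=1) - 1.0) > t)

p_low, p_high = deviation_prob(10), deviation_prob(1000)
print(f"n=10:   P(deviation > {t}) ~ {p_low:.3f}")
print(f"n=1000: P(deviation > {t}) ~ {p_high:.3f}")  # collapses toward 0
```

At $n = 10$ the deviation is common; at $n = 1000$ it essentially never happens, exactly the exponential collapse the bound predicts.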
3. Norm Concentration: Vectors Live on a "Thin Shell"
In 2D, a point can be anywhere in a circle. But what happens as the dimension $n$ grows?
Consider $\|X\|_2^2 = \sum_{i=1}^{n} X_i^2$. By the law of large numbers, $\frac{1}{n}\|X\|_2^2 \to 1$, so $\|X\|_2 \approx \sqrt{n}$. Because of concentration, the fluctuations around this limit are tiny.
Theorem 3.1: Virtually all the probability mass of a high-dimensional isotropic distribution lies within a "thin shell" of radius $\sqrt{n}$ and width $O(1)$.
The Intuition: As $n$ increases, the volume of a ball is concentrated almost entirely in a thin layer near the surface. Vectors aren't "inside" the space; they are all clinging to the edge.
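The thin shell is easy to see numerically. A minimal sketch, assuming Gaussian sampling and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_stats(n, samples=5000):
    # Draw isotropic Gaussian vectors and measure the spread of their norms
    x = rng.standard_normal((samples, n))
    norms = np.linalg.norm(x, axis=1)
    return norms.mean(), norms.std()

mean_low, std_low = norm_stats(2)
mean_high, std_high = norm_stats(2048)

# Relative spread (std / mean) collapses as the dimension grows
rel_low, rel_high = std_low / mean_low, std_high / mean_high
print(f"n=2:    |X| ~ {mean_low:.2f} +/- {std_low:.2f}  (rel. {rel_low:.1%})")
print(f"n=2048: |X| ~ {mean_high:.2f} +/- {std_high:.2f}  (rel. {rel_high:.1%})")
```

In 2D the norm fluctuates by roughly half its mean; in 2048 dimensions the fluctuation is under 2% of the mean, with the mean itself sitting right at $\sqrt{2048} \approx 45.3$.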
4. Distance Concentration: All Points are Equally Far
If individual norms concentrate, what about the distance between two independent points $X$ and $Y$?
Through the same concentration logic (note $\mathbb{E}\|X - Y\|_2^2 = 2n$ for independent isotropic vectors), we find that:

$$\|X - Y\|_2 \approx \sqrt{2n}.$$
The Consequence: In high-dimensional representation spaces, any two randomly drawn vectors will be approximately the same distance apart. There are no "close" neighbors and no "far" neighbors, just a single typical distance.
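A quick check of both the $\sqrt{2n}$ prediction and the tiny relative spread, again with illustrative Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, points = 1024, 1000
x = rng.standard_normal((points, n))

# All pairwise squared distances via the Gram-matrix identity
sq = (x ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
iu = np.triu_indices(points, k=1)          # each unordered pair once
dists = np.sqrt(np.maximum(d2[iu], 0.0))

typical = np.sqrt(2 * n)                   # the predicted sqrt(2n)
print(f"mean distance / sqrt(2n): {dists.mean() / typical:.4f}")
print(f"relative spread:          {dists.std() / dists.mean():.4f}")
```

Half a million pairwise distances, and they all agree with $\sqrt{2n}$ to within a couple of percent.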
5. Asymptotic Orthogonality
In 2D, two random vectors can have any angle. In high dimensions, all pairs are nearly perpendicular.
Consider the cosine similarity between two independent isotropic vectors $X$ and $Y$:
- The dot product $\langle X, Y \rangle$ has a typical magnitude of $\sqrt{n}$.
- The norm product $\|X\|_2 \|Y\|_2$ is approximately $n$.
- Therefore, $\cos\theta \approx \frac{\sqrt{n}}{n} = \frac{1}{\sqrt{n}}$, which goes to zero as $n \to \infty$.
Key Takeaway: Random vectors in high dimensions are nearly orthogonal. This is exploited in random projections and compressed sensing.
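The $1/\sqrt{n}$ decay is directly observable. A sketch with arbitrary dimensions and Gaussian directions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(n, pairs=2000):
    # Average |cos| between pairs of independent Gaussian directions
    a = rng.standard_normal((pairs, n))
    b = rng.standard_normal((pairs, n))
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

results = {n: mean_abs_cosine(n) for n in (4, 64, 1024)}
for n, c in results.items():
    print(f"n={n:5d}: mean |cos| = {c:.3f}")  # shrinks roughly like 1/sqrt(n)
```

Each 16x jump in dimension shrinks the typical cosine by roughly 4x, matching the $1/\sqrt{n}$ rate.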
6. Metric Degeneracy: The Collapse of Contrast
This is the most critical result for ML practitioners. We define Metric Degeneracy as the loss of discriminative power in distance metrics.
As $n$ grows, the relative difference between the farthest and nearest neighbor vanishes:

$$\frac{D_{\max} - D_{\min}}{D_{\min}} \to 0 \quad \text{(in probability)}.$$
If your "nearest" neighbor is 9.9 units away and your "farthest" is 10.1 units away, the distinction between them is essentially noise.
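The collapse of this contrast ratio can be measured directly. A minimal sketch, assuming a random Gaussian point cloud and one random query:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n, points=2000):
    # (D_max - D_min) / D_min for one query against a random point cloud
    query = rng.standard_normal(n)
    cloud = rng.standard_normal((points, n))
    d = np.linalg.norm(cloud - query, axis=1)
    return (d.max() - d.min()) / d.min()

c_low, c_high = relative_contrast(2), relative_contrast(4096)
print(f"n=2:    contrast = {c_low:.2f}")   # distances are informative
print(f"n=4096: contrast = {c_high:.2f}")  # nearest ~ farthest
```

In 2D the farthest point is many times farther than the nearest; in 4096 dimensions the gap is a few percent, exactly the 9.9-vs-10.1 situation described above.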
7. Failure Modes in Machine Learning
1. Breakdown of Nearest-Neighbour Methods
For $n \approx 1000$, all data points are within ~3% of the same distance from any query. In this regime, $k$-NN becomes unreliable and computationally expensive for very little gain.
2. Degeneration of RBF Kernels
The RBF kernel $K(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$ becomes a constant (0 or 1) in high dimensions unless the bandwidth is scaled as $\sigma \propto \sqrt{n}$, matching the typical squared distance of order $n$.
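The collapse, and the fix, in a few lines; the dimension, sample count, and bandwidths below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, points = 1024, 200
x = rng.standard_normal((points, n))

# Pairwise squared distances via the Gram-matrix identity
sq = (x ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
iu = np.triu_indices(points, k=1)

def rbf(d2, sigma):
    return np.exp(-d2 / (2.0 * sigma ** 2))

k_fixed = rbf(d2, sigma=1.0)          # fixed bandwidth: everything underflows
k_scaled = rbf(d2, sigma=np.sqrt(n))  # dimension-aware bandwidth
print(f"fixed:  max off-diagonal value = {k_fixed[iu].max():.2e}")
print(f"scaled: off-diagonal mean      = {k_scaled[iu].mean():.3f}")
```

With $\sigma = 1$ every off-diagonal kernel value is numerically zero (the squared distances sit near $2n$); with $\sigma = \sqrt{n}$ they land near $e^{-1}$ and still carry contrast.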
3. Temperature Scaling in Contrastive Learning
In Contrastive Loss (like SimCLR's), cosine similarities are naturally tiny, on the order of $1/\sqrt{n}$. We use a temperature parameter $\tau$ to rescale these similarities, effectively "zooming in" on the differences so the model can receive a useful gradient signal.
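A toy illustration of why the temperature matters. The similarity values below are hypothetical: negatives at the random-cosine scale $1/\sqrt{n}$ for $n = 1024$, and a positive (augmented view) at a modest 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: one positive pair plus 255 negatives
sims = np.concatenate(([0.5], rng.normal(0.0, 1.0 / np.sqrt(1024), 255)))

def positive_prob(sims, tau):
    # Softmax weight the contrastive loss places on the positive pair
    logits = sims / tau
    p = np.exp(logits - logits.max())
    return (p / p.sum())[0]

p_plain = positive_prob(sims, tau=1.0)   # nearly uniform: weak signal
p_scaled = positive_prob(sims, tau=0.1)  # sharply peaked on the positive
print(f"tau=1.0: P(positive) = {p_plain:.4f}")
print(f"tau=0.1: P(positive) = {p_scaled:.4f}")
```

Without the temperature, the softmax barely distinguishes the positive from 255 near-orthogonal negatives; dividing by a small $\tau$ amplifies the gap into a usable training signal.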
4. The Hubness Phenomenon
High-dimensional datasets often contain "hubs": points that are the "nearest neighbor" for a disproportionately large number of other points. This is a direct result of the geometry of the thin shell.
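One way to see hubness, sketched under the assumption of iid Gaussian data: count how often each point appears in someone else's $k$-nearest-neighbor list. The mean of this in-degree is always exactly $k$, so a growing spread means hubs are forming:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_indegree_std(n, points=1000, k=10):
    # How unevenly "being among someone's k nearest neighbors" is distributed
    x = rng.standard_normal((points, n))
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]
    indegree = np.bincount(knn.ravel(), minlength=points)
    return indegree.std()  # the mean is always exactly k

s_low, s_high = knn_indegree_std(3), knn_indegree_std(1000)
print(f"d=3:    in-degree std = {s_low:.1f}")
print(f"d=1000: in-degree std = {s_high:.1f}")  # hubs inflate the spread
```

In low dimensions the in-degree distribution stays close to its mean; in high dimensions a few points accumulate far more than their share, which is the hubness phenomenon.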
8. Design Principles for Practitioners
How do we navigate this "cursed" geometry?
- Normalization: Always normalize your representations. Projecting onto the unit sphere removes the norm-concentration degree of freedom.
- Cosine Similarity: In high-D, focus on angles rather than absolute Euclidean distances.
- Dimensionality Reduction: Use techniques like PCA or UMAP. Most real-world data lives on a low-dimensional manifold; reducing to the intrinsic dimension restores the power of distance metrics.
- Scale Bandwidths: If using kernels or temperatures, make them track the dimension: kernel bandwidths should grow like $\sigma \propto \sqrt{n}$, while contrastive temperatures should shrink like $\tau \propto 1/\sqrt{n}$.
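The first two principles together amount to a couple of lines of NumPy; the batch below is a hypothetical stand-in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 512))  # hypothetical batch of raw embeddings

# Project onto the unit sphere: only direction carries information now
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine similarity between normalized vectors is a plain dot product
cos_sim = unit @ unit.T
print(np.round(cos_sim[0, :3], 3))
```

After normalization, Euclidean distance and cosine similarity carry the same information ($\|u - v\|_2^2 = 2 - 2\cos\theta$ on the sphere), so the norm-concentration degree of freedom is gone by construction.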
9. Conclusion
High-dimensional spaces are governed by the concentration of measure, imposing a rigid structure: same norms, same distances, and perpendicular angles. Understanding this "metric degeneracy" is the first step toward building more robust and scalable AI systems.
References
- [1] Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press.
- [2] Beyer, K., et al. (1999). When is "nearest neighbor" meaningful? ICDT.
- [3] Chen, T., et al. (2020). A simple framework for contrastive learning (SimCLR). ICML.
If you enjoyed this deep dive, consider sharing it with your fellow researchers!