
The Weird Geometry of High-Dimensional Representations: Why Your ML Models Behave Differently

#MachineLearning #Geometry #Mathematics #VectorSpaces

By Chima Emmanuel & Angel Ezeahurukwe
Reading Time: 12 minutes



1. Introduction

When students first learn Euclidean geometry, they learn that space is isotropic: all directions are equivalent, and distance is a reliable indicator of similarity. This intuition applies faithfully in 2D or 3D.

But modern machine learning models operate in a very different regime. Word embeddings live in 300 dimensions. Transformer hidden states inhabit spaces of dimension 1024. Image features reside in spaces of 2048 or more.

In these hyper-spaces, geometric intuition becomes not merely imprecise but actively misleading. The culprit is concentration of measure. Loosely, in high dimensions, a smooth function of many independent random variables is very close to its mean with overwhelmingly high probability. This has dramatic consequences: norms concentrate, angles shrink toward 90 degrees, and inter-point distances become approximately equal.


2. The Toolbox: Sub-Gaussianity

To understand these results rigorously, we look at sub-Gaussian distributions. This class includes Gaussian, Rademacher ($\pm 1$), and any bounded random variable.

Isotropic Vectors

A random vector $\mathbf{X} \in \mathbb{R}^d$ is called isotropic if $E[\mathbf{X}\mathbf{X}^T] = I_d$, i.e., its coordinates are uncorrelated and have unit variance.

Key Inequality: Bernstein’s

Our proofs rest on Bernstein's Inequality. For independent sub-exponential variables $Z_i$ (like the squares of sub-Gaussian variables): $P\left(\left|\frac{1}{d}\sum_i Z_i - E[Z_1]\right| \ge t\right) \le 2 \exp\left(-\frac{c\, d\, t^2}{K^4}\right)$ for small $t$. Essentially, as the dimension $d$ grows, the probability of the average deviating from its mean drops exponentially.
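This exponential decay is easy to check numerically. The following is a minimal NumPy sketch, assuming $Z_i = g_i^2$ for standard Gaussian $g_i$ (so $E[Z_1] = 1$); `deviation_prob` is an illustrative helper, not a library function:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.2  # fixed deviation threshold

def deviation_prob(d, trials=2000):
    """Empirical P(|sum_i Z_i / d - E[Z_1]| >= t) with Z_i = g_i^2, g_i ~ N(0, 1)."""
    g = rng.standard_normal((trials, d))
    avg = (g ** 2).mean(axis=1)          # sum_i Z_i / d; E[Z_1] = 1
    return float(np.mean(np.abs(avg - 1.0) >= t))

probs = {d: deviation_prob(d) for d in (10, 100, 1000)}
print(probs)  # the deviation probability collapses as d grows
```

With 2000 trials per dimension, a deviation of 0.2 is routine at $d = 10$ but essentially never observed at $d = 1000$.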


3. Norm Concentration: Vectors Live on a "Thin Shell"

In 2D, a point can be anywhere in a circle. But what happens as dd grows?

Consider $\|\mathbf{X}\|^2 = X_1^2 + \dots + X_d^2$. By the law of large numbers, $\|\mathbf{X}\|^2 / d \to 1$. Because of concentration, the fluctuations around this limit are tiny.

Theorem 3.1: Virtually all the probability mass of a high-dimensional isotropic distribution lies within a "thin shell" of radius $\sqrt{d}$ and width $O(1)$.

The Intuition: As dd increases, the volume of a ball is concentrated almost entirely in a thin layer near the surface. Vectors aren't "inside" the space; they are all clinging to the edge.
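The thin shell is visible in a few lines of NumPy (a sketch, assuming standard Gaussian coordinates): the mean norm tracks $\sqrt{d}$ while the absolute spread stays $O(1)$ no matter how large $d$ gets.

```python
import numpy as np

rng = np.random.default_rng(0)

shell = {}
for d in (2, 100, 10_000):
    X = rng.standard_normal((5000, d))   # 5000 isotropic Gaussian vectors
    norms = np.linalg.norm(X, axis=1)
    # mean norm relative to sqrt(d), and the absolute spread of the norms
    shell[d] = (norms.mean() / np.sqrt(d), norms.std())

for d, (ratio, width) in shell.items():
    print(d, round(ratio, 3), round(width, 3))
```

The ratio converges to 1 while the width hovers around a dimension-independent constant: the shell gets relatively thinner as $d$ grows.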


4. Distance Concentration: All Points are Equally Far

If individual norms concentrate, what about the distance between two independent points $\mathbf{X}$ and $\mathbf{Y}$?

Through the same concentration logic, we find that $\|\mathbf{X} - \mathbf{Y}\| \approx \sqrt{2d}$.

The Consequence: In high-dimensional representation spaces, any two randomly drawn vectors will be approximately the same distance apart. There are no "close" neighbors and no "far" neighborsβ€”just a single typical distance.
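A quick simulation (NumPy sketch, again with standard Gaussian coordinates) shows pairwise distances piling up around $\sqrt{2d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
X = rng.standard_normal((2000, d))
Y = rng.standard_normal((2000, d))

dists = np.linalg.norm(X - Y, axis=1)    # distances between independent pairs
ratio = dists / np.sqrt(2 * d)           # should hover around 1
print(ratio.mean(), ratio.std())
```

At $d = 1000$ the relative spread of the 2000 distances is already around 2%: one typical distance, no meaningful neighbors.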


5. Asymptotic Orthogonality

In 2D, two random vectors can have any angle. In high dimensions, all pairs are nearly perpendicular.

Consider the cosine similarity: $\cos(\theta) = \frac{\langle \mathbf{X}, \mathbf{Y} \rangle}{\|\mathbf{X}\| \|\mathbf{Y}\|}$

  1. The dot product $\langle \mathbf{X}, \mathbf{Y} \rangle$ has a typical magnitude of $\sqrt{d}$.
  2. The norm product $\|\mathbf{X}\| \|\mathbf{Y}\|$ is approximately $d$.
  3. Therefore, $\cos(\theta) \approx \sqrt{d}/d = 1/\sqrt{d}$, which goes to zero as $d \to \infty$.

Key Takeaway: Random vectors in high dimensions are nearly orthogonal. This is exploited in random projections and compressed sensing.
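The $1/\sqrt{d}$ decay is easy to observe (NumPy sketch; `typical_cosine` is an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(0)

def typical_cosine(d, pairs=2000):
    """Mean |cos(theta)| over independent Gaussian pairs in R^d."""
    X = rng.standard_normal((pairs, d))
    Y = rng.standard_normal((pairs, d))
    dots = (X * Y).sum(axis=1)
    cos = dots / (np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))
    return float(np.abs(cos).mean())

cosines = {d: typical_cosine(d) for d in (3, 300, 30_000)}
print(cosines)  # shrinks roughly like 1/sqrt(d)
```

Going from $d = 300$ to $d = 30{,}000$ (a 100x increase) shrinks the typical cosine by about 10x, matching the $1/\sqrt{d}$ rate.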


6. Metric Degeneracy: The Collapse of Contrast

This is the most critical result for ML practitioners. We define Metric Degeneracy as the loss of discriminative power in distance metrics.

As $d$ grows, the relative difference between the farthest and nearest neighbor vanishes: $\frac{\max D_i - \min D_i}{\min D_i} \xrightarrow{P} 0$

If your "nearest" neighbor is 9.9 units away and your "farthest" is 10.1 units away, the distinction between them is essentially noise.
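Simulating this ratio directly makes the collapse concrete (NumPy sketch; `relative_contrast` is an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(max D_i - min D_i) / min D_i over n random points around one query."""
    query = rng.standard_normal(d)
    points = rng.standard_normal((n, d))
    D = np.linalg.norm(points - query, axis=1)
    return float((D.max() - D.min()) / D.min())

contrast = {d: relative_contrast(d) for d in (2, 100, 10_000)}
print(contrast)  # nearest and farthest become indistinguishable
```

In 2D the farthest point can be orders of magnitude farther than the nearest; by $d = 10{,}000$ the gap is a few percent.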


7. Failure Modes in Machine Learning

1. Breakdown of Nearest-Neighbor Methods

For $d = 1000$, all data points are within ~3% of the same distance from any query. In this regime, $k$-NN becomes unreliable and computationally expensive for very little gain.

2. Degeneration of RBF Kernels

The RBF kernel $K(x, y) = \exp(-\gamma \|x - y\|^2)$ saturates to a constant (0 or 1) in high dimensions unless the bandwidth $\gamma$ is scaled as $1/d$.
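The saturation is visible directly. This NumPy sketch compares a fixed bandwidth to one scaled as $1/d$ on random Gaussian pairs, where $\|x - y\|^2 \approx 2d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2000
X = rng.standard_normal((500, d))
Y = rng.standard_normal((500, d))
sq = ((X - Y) ** 2).sum(axis=1)    # squared distances, concentrated near 2d

k_fixed  = np.exp(-1.0 * sq)       # fixed gamma = 1: every value underflows to ~0
k_scaled = np.exp(-sq / d)         # gamma = 1/d: values stay in a usable range
print(k_fixed.max(), k_scaled.mean())
```

With a fixed $\gamma$ the kernel matrix is numerically zero off the diagonal; with $\gamma = 1/d$ the values sit near $e^{-2}$ and still vary between pairs.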

3. Temperature Scaling in Contrastive Learning

In Contrastive Loss (like SimCLR), cosine similarities are naturally tiny (on the order of $1/\sqrt{d}$). We use a temperature parameter $\tau$ to rescale these similarities, effectively "zooming in" on the differences so the model can receive a gradient signal.
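A minimal sketch of the rescaling step (the value $\tau = 0.1$ and the batch shape are illustrative, not SimCLR's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 512, 0.1                  # tau chosen for illustration
Z = rng.standard_normal((8, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # unit-norm embeddings

sims = Z @ Z.T                     # cosine similarities; off-diagonal ~ 1/sqrt(d)
logits = sims / tau                # temperature rescales before the softmax
off_diag = np.abs(sims[0, 1:])
print(off_diag.max(), off_diag.max() / tau)
```

Dividing by $\tau$ stretches the near-zero cosines into a range where the softmax in the contrastive loss can actually discriminate between positives and negatives.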

4. The Hubness Phenomenon

High-dimensional datasets often contain "hubs"β€”points that are the "nearest neighbor" for a disproportionately large number of other points. This is a direct result of the geometry of the thin shell.
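Hubness shows up even in pure i.i.d. noise. The sketch below (toy scale, single random seed; `knn_in_degrees` is an illustrative helper) counts how often each point lands in another point's $k$-nearest list; the largest such count tends to grow with dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_in_degrees(d, n=500, k=10):
    """How many times each point appears in another point's k-nearest list."""
    X = rng.standard_normal((n, d))
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    np.fill_diagonal(D2, np.inf)
    knn = np.argsort(D2, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=n)

low, high = knn_in_degrees(3), knn_in_degrees(1000)
print(low.max(), high.max())   # size of the biggest "hub" at d=3 vs d=1000
```

The mean in-degree is exactly $k$ in both cases; what changes with dimension is the skew, with a few points accumulating far more than their share.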


8. Design Principles for Practitioners

How do we navigate this "cursed" geometry?

  1. $\ell_2$ Normalization: Always normalize your representations. Projecting onto the unit sphere removes the norm-concentration degree of freedom.
  2. Cosine Similarity: In high-D, focus on angles rather than absolute Euclidean distances.
  3. Dimensionality Reduction: Use techniques like PCA or UMAP. Most real-world data lives on a low-dimensional manifold; reducing dd restores the power of distance metrics.
  4. Scale Bandwidths: If using kernels or temperatures, ensure they scale inversely with the dimension ($1/d$ or $1/\sqrt{d}$).
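Principles 1 and 2 combine neatly in practice. A minimal NumPy sketch (the embedding shape is arbitrary): after $\ell_2$ normalization, cosine similarity reduces to a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 768))    # raw representations

# Principle 1: project every vector onto the unit sphere.
emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Principle 2: for unit vectors, cosine similarity is just a dot product.
cos = emb_n @ emb_n.T
print(np.allclose(np.diag(cos), 1.0))    # self-similarity is exactly 1
```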

9. Conclusion

High-dimensional spaces are governed by the concentration of measure, imposing a rigid structure: same norms, same distances, and perpendicular angles. Understanding this "metric degeneracy" is the first step toward building more robust and scalable AI systems.



If you enjoyed this deep dive, consider sharing it with your fellow researchers!