Dimensionality Reduction

Why we chose UMAP over PCA, t-SNE, and other algorithms for projecting word embeddings to 3D

Updated: January 25, 2026


Dimensionality Reduction: Why We Use UMAP

The Challenge

Word Space 3D needs to project 512-dimensional word embeddings into a 3D space that players can see and navigate. This is a classic dimensionality reduction problem, and there are several algorithms we could use:

  • PCA (Principal Component Analysis)
  • t-SNE (t-distributed Stochastic Neighbor Embedding)
  • Laplacian Eigenmaps (Spectral Embedding)
  • UMAP (Uniform Manifold Approximation and Projection)

We chose UMAP. Here's why.


The Algorithms Compared

PCA (Principal Component Analysis)

How it works: Finds the directions (principal components) that capture the most variance in the data. Projects data onto these directions.
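
For intuition, here's what a PCA-based reducer would look like with scikit-learn (a sketch with stand-in data, not our actual pipeline):

from sklearn.decomposition import PCA
import numpy as np

# Stand-in for the real 6,017 × 512 embedding matrix
vocabulary_embeddings = np.random.rand(6017, 512).astype(np.float32)

pca = PCA(n_components=3)                              # keep the 3 highest-variance directions
coords_3d = pca.fit_transform(vocabulary_embeddings)   # shape (6017, 3)

# PCA can project new points instantly...
new_coords = pca.transform(np.random.rand(1, 512))     # shape (1, 3)

# ...but the projection is a single linear map, so nonlinear semantic
# structure gets flattened out of the result.
print(pca.explained_variance_ratio_)                   # variance captured by the 3 axes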

Pros:

  • Very fast
  • Mathematically simple
  • Can transform new points instantly
  • Preserves global structure well

Cons:

  • Assumes linear relationships
  • Poor at capturing complex, nonlinear semantic relationships
  • Clusters tend to overlap and blur together

Verdict for Word Space: Not suitable. Word embeddings have highly nonlinear relationships. "King" is related to "queen" and "royalty" in ways that PCA's linear projection can't capture well. The 3D space would look like a blurry cloud with little structure.


t-SNE (t-distributed Stochastic Neighbor Embedding)

How it works: Preserves local neighborhood structure by modeling probability distributions of distances in high and low dimensions, then minimizing their divergence.
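
A quick sketch with scikit-learn's implementation, again with stand-in data; note that TSNE exposes only fit_transform, with no transform for unseen points:

from sklearn.manifold import TSNE
import numpy as np

# Stand-in for the real embedding matrix
vocabulary_embeddings = np.random.rand(6017, 512).astype(np.float32)

tsne = TSNE(n_components=3, metric="cosine", random_state=42)
coords_3d = tsne.fit_transform(vocabulary_embeddings)  # slow: optimizes every point jointly

# There is no tsne.transform(...): adding a single new word means rerunning
# fit_transform on the whole vocabulary plus that word.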

Pros:

  • Excellent at revealing clusters
  • Creates visually striking, well-separated groups
  • Good for static visualizations

Cons:

  • Very slow (O(n²) complexity)
  • Cannot transform new points — must rerun on entire dataset
  • Global structure is distorted (cluster positions are arbitrary)
  • Results vary with random initialization

Verdict for Word Space: Not suitable. The inability to transform new points is a dealbreaker. When a player guesses "banana", we need to project it into 3D instantly. With t-SNE, we'd have to rerun the entire algorithm on 6,017+ words — far too slow for real-time gameplay.


Laplacian Eigenmaps (Spectral Embedding)

How it works: Builds a graph of nearest neighbors, computes the graph Laplacian matrix, and uses its eigenvectors to embed points in low dimensions.
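
The scikit-learn version looks like this (a sketch with stand-in data); as with t-SNE, there is only fit_transform:

from sklearn.manifold import SpectralEmbedding
import numpy as np

# Stand-in for the real embedding matrix
vocabulary_embeddings = np.random.rand(6017, 512).astype(np.float32)

# n_neighbors controls the graph the Laplacian is built from
se = SpectralEmbedding(n_components=3, n_neighbors=15, random_state=42)
coords_3d = se.fit_transform(vocabulary_embeddings)  # eigenvectors of the graph Laplacian

# No se.transform(...): projecting a new word would mean rebuilding the
# neighbor graph and recomputing the eigenvectors from scratch.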

Pros:

  • Good at preserving local structure
  • Theoretically elegant (graph theory foundations)
  • Works well for manifold learning

Cons:

  • Cannot transform new points (no out-of-sample extension)
  • Assumes uniform data distribution
  • Computationally expensive for large datasets
  • Sensitive to graph construction parameters

Verdict for Word Space: Not suitable. Same fundamental problem as t-SNE — no way to project new words without rebuilding the entire graph and recomputing eigenvectors.


UMAP (Uniform Manifold Approximation and Projection)

How it works: Combines topological data analysis with manifold learning. Builds a fuzzy topological representation of the high-dimensional data, then optimizes a low-dimensional representation to match it.
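
To make those two stages concrete, umap-learn exposes both the fuzzy graph and the optimized layout as fitted attributes (graph_ and embedding_ in recent releases). A small sketch with stand-in data:

import umap
import numpy as np

vocabulary_embeddings = np.random.rand(6017, 512).astype(np.float32)  # stand-in data

reducer = umap.UMAP(n_components=3, metric="cosine", random_state=42)
reducer.fit(vocabulary_embeddings)

print(reducer.graph_.shape)      # stage 1: fuzzy topological graph, sparse (6017, 6017)
print(reducer.embedding_.shape)  # stage 2: optimized low-dimensional layout, (6017, 3)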

Pros:

  • Can transform new points via .transform() method
  • Fast (competitive with PCA for moderate datasets)
  • Preserves both local and global structure
  • Theoretically grounded (Riemannian geometry, algebraic topology)
  • Handles non-uniform data distributions well

Cons:

  • More parameters to tune than PCA
  • Results can vary with random seed (we fix this with random_state=42)

Verdict for Word Space: Perfect fit. UMAP gives us the nonlinear projection we need, preserves semantic clusters, and, critically, lets us transform player guesses in real time.


The Dealbreaker: Transforming New Points

This comparison really comes down to one critical requirement: real-time projection of new words.

When a player guesses a word:

  1. We embed it using OpenAI's API → 512-dimensional vector
  2. We project it into 3D using our trained reducer → (x, y, z) coordinates
  3. We display it in the visualization → instant feedback

Algorithm              Can Transform New Points?   Suitable for Word Space?
PCA                    Yes                         No (poor nonlinear capture)
t-SNE                  No                          No
Laplacian Eigenmaps    No                          No
UMAP                   Yes                         Yes

Only PCA and UMAP can transform new points. And only UMAP handles the nonlinear semantic relationships in word embeddings properly.
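
That distinction is visible right in the libraries' APIs. With scikit-learn and umap-learn installed, a quick check (a sketch, not part of our pipeline) shows which estimators expose transform():

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, SpectralEmbedding
import umap

for name, estimator in [("PCA", PCA()), ("t-SNE", TSNE()),
                        ("Laplacian Eigenmaps", SpectralEmbedding()), ("UMAP", umap.UMAP())]:
    # Only estimators with a transform() method can project unseen points
    print(f"{name:>20}: transform available = {hasattr(estimator, 'transform')}")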


How UMAP's Transform Works

When we train UMAP on our 6,017-word vocabulary:

import umap

reducer = umap.UMAP(
    n_components=3,
    metric="cosine",
    n_neighbors=15,
    min_dist=0.08,
    random_state=42
)
reducer.fit(vocabulary_embeddings)  # Train on the 6,017 × 512 embedding matrix

The trained reducer learns the structure of semantic space. Later, when we need to project a new word:

new_embedding = get_embedding("banana")              # 512-dimensional vector
new_coords = reducer.transform([new_embedding])[0]   # → [x, y, z]

This transform operation:

  • Takes ~50-100ms (fast enough for real-time)
  • Uses the learned manifold structure
  • Places the new point where it semantically belongs
  • Doesn't require retraining or accessing the original vocabulary
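
Putting the embedding call and the transform together, the guess-handling path looks roughly like this. The model name, the dimensions argument, and the get_embedding wrapper shown here are illustrative assumptions; the real service code may differ:

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(word: str) -> np.ndarray:
    # Assumed model and dimensions; any 512-dimensional embedding source works here
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=word,
        dimensions=512,
    )
    return np.array(response.data[0].embedding)

def project_guess(word: str, reducer) -> tuple[float, float, float]:
    embedding = get_embedding(word)                # step 1: 512-dimensional vector
    x, y, z = reducer.transform([embedding])[0]    # step 2: learned manifold placement
    return float(x), float(y), float(z)            # step 3: coordinates for the visualization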

Our UMAP Configuration

umap.UMAP(
    n_components=3,       # 3D output for visualization
    metric="cosine",      # Best for normalized embeddings
    n_neighbors=15,       # Balance: local detail vs global structure
    min_dist=0.08,        # Tighter clusters (range: 0.0-1.0)
    random_state=42       # Reproducibility
)

Why these parameters?

  • n_components=3: We need 3D for spatial visualization
  • metric="cosine": Word embeddings are typically compared via cosine similarity, not Euclidean distance
  • n_neighbors=15: Higher values (30+) preserve more global structure but blur local clusters. Lower values (5-10) create tighter clusters but lose global relationships. 15 is a good balance.
  • min_dist=0.08: Controls how tightly points can pack together. Lower values create denser clusters. 0.08 gives us visible clusters without points overlapping.
  • random_state=42: UMAP has stochastic elements. Fixing the seed ensures reproducible results.
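
To see the n_neighbors and min_dist trade-offs above for yourself, a small sweep like this sketch (on a subsample, to keep it quick) makes them easy to eyeball:

import umap
import numpy as np

# Stand-in for a subsample of the vocabulary embeddings (keeps the sweep fast)
sample_embeddings = np.random.rand(1000, 512).astype(np.float32)

layouts = {}
for n_neighbors in (5, 15, 30):          # local detail vs global structure
    for min_dist in (0.0, 0.08, 0.5):    # tight clusters vs spread-out points
        reducer = umap.UMAP(
            n_components=3,
            metric="cosine",
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            random_state=42,
        )
        layouts[(n_neighbors, min_dist)] = reducer.fit_transform(sample_embeddings)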

Visual Comparison

If we visualized the same word embeddings with each algorithm:

PCA: A diffuse cloud with some directional stretching. Similar words might be vaguely near each other, but clusters are poorly defined. Global directions are preserved but local neighborhoods are noisy.

t-SNE: Beautiful, well-separated clusters. "Animals" clearly separate from "vehicles" from "emotions". But the positions are arbitrary — animals might appear "above" vehicles in one run and "below" in another. And we can't add new words without starting over.

Laplacian Eigenmaps: Similar to t-SNE but often with more continuous manifold structure. Still can't handle new points.

UMAP: Clear clusters like t-SNE, but with meaningful global arrangement. Related clusters (e.g., "animals" and "nature") tend to be near each other. And new words slot into the right neighborhoods seamlessly.


Summary

Requirement                  PCA   t-SNE   Laplacian   UMAP
Nonlinear relationships      No    Yes     Yes         Yes
Transform new points         Yes   No      No          Yes
Preserve local structure     No    Yes     Yes         Yes
Preserve global structure    Yes   No      Partial     Yes
Fast training                Yes   No      No          Yes
Suitable for Word Space      No    No      No          Yes

UMAP is the only algorithm that meets all our requirements. It's the right tool for the job.


Further Reading