Technical Specifications

A verification checklist for the Word Space 3D pipeline: embeddings, UMAP, and similarity scoring.

Updated: January 25, 2026


Pipeline Overview

word → OpenAI embedding (512D) → UMAP transform (3D) → normalize to [-1,1]
                ↓
        cosine similarity vs mystery word (computed in 512D, not 3D)
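
A rough end-to-end sketch of the same flow (illustrative only; the real implementation lives in the services listed under File Reference, the helper names here are not the actual API, and an OpenAI API key plus the fitted .joblib model are assumed):

import joblib
import numpy as np
from openai import OpenAI

client = OpenAI()
reducer = joblib.load("server/services/umap_reduce/umap_reducer_3d.joblib")

def embed(word: str) -> np.ndarray:
    """512-D vector from text-embedding-3-small (L2-normalized by the API)."""
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=word, dimensions=512)
    return np.asarray(resp.data[0].embedding, dtype=np.float32)

def to_3d(vec: np.ndarray, axis_min: np.ndarray, axis_max: np.ndarray) -> np.ndarray:
    """UMAP transform to 3-D, then per-axis normalization to [-1, 1]."""
    raw = reducer.transform(vec.reshape(1, -1))[0]
    return 2 * (raw - axis_min) / (axis_max - axis_min) - 1

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, computed in the original 512-D space (not 3-D)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))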

Embedding

Parameter       Value                    Rationale
Model           text-embedding-3-small   Optimized for short content
Dimensions      512                      Sufficient for single words; avoids curse of dimensionality
Normalization   L2 normalized by API     Standard for cosine similarity

Verification: Embedding vectors have unit norm (||v|| ≈ 1.0).
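
A quick stand-alone check (sketch; requires an OpenAI API key, the test word is illustrative):

import numpy as np
from openai import OpenAI

resp = OpenAI().embeddings.create(model="text-embedding-3-small",
                                  input="telescope", dimensions=512)
v = np.asarray(resp.data[0].embedding)
print(np.linalg.norm(v))                      # expect ~1.0
assert abs(np.linalg.norm(v) - 1.0) < 1e-3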


UMAP

Parameter      Value    Rationale
n_components   3        3D visualization
metric         cosine   Matches embedding similarity metric
n_neighbors    15       Balance: local structure vs global coherence
min_dist       0.08     Tighter clusters without overlap
random_state   42       Reproducibility

Training: Fit once on vocabulary (6,017 words). Saved as .joblib.

Inference: .transform() for new points. No retraining required.
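
A minimal fit/save/load sketch with the parameters above (illustrative; the real scripts are fit_vocab_umap_3d.py and umap_service.py, and vocab_embeddings.npy is a hypothetical path for the (6017, 512) vocabulary matrix):

import joblib
import numpy as np
import umap

vocab_vectors = np.load("vocab_embeddings.npy")     # (6017, 512), hypothetical file

reducer = umap.UMAP(n_components=3, metric="cosine",
                    n_neighbors=15, min_dist=0.08, random_state=42)
reducer.fit(vocab_vectors)                          # fit once on the vocabulary
joblib.dump(reducer, "umap_reducer_3d.joblib")

# Inference: load the fitted model and project new points without retraining.
reducer = joblib.load("umap_reducer_3d.joblib")
coords = reducer.transform(vocab_vectors[:1])       # shape (1, 3), raw (unnormalized)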

Verification: UMAP should preserve local neighborhoods; each word's k nearest neighbors in 3D should largely overlap with its k nearest neighbors in 512D (k-NN recall).
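
One way to check this (sketch, using scikit-learn; emb_512 and coords_3d are illustrative names for row-aligned arrays of vocabulary embeddings and 3-D coordinates):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_recall(emb_512: np.ndarray, coords_3d: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each word's 512-D neighbors also found among its 3-D neighbors."""
    idx_hi = (NearestNeighbors(n_neighbors=k + 1, metric="cosine")
              .fit(emb_512).kneighbors(emb_512, return_distance=False)[:, 1:])
    idx_lo = (NearestNeighbors(n_neighbors=k + 1, metric="euclidean")
              .fit(coords_3d).kneighbors(coords_3d, return_distance=False)[:, 1:])
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_hi, idx_lo)]))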


Coordinate Normalization

Raw UMAP output normalized to [-1, 1] per axis:

normalized = 2 × (raw - min) / (max - min) - 1

Min/max computed from vocabulary corpus, stored in database metadata.
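
As a sketch of the same formula (axis_min and axis_max stand for the per-axis values read from the metadata table; the names are illustrative):

import numpy as np

def normalize_coords(raw: np.ndarray, axis_min: np.ndarray, axis_max: np.ndarray) -> np.ndarray:
    """Map raw UMAP output to [-1, 1] per axis; unseen words may land slightly outside."""
    return 2 * (raw - axis_min) / (axis_max - axis_min) - 1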

Verification: All vocabulary coordinates fall in [-1, 1]. New words may exceed bounds slightly (out-of-distribution).


Similarity Scoring

Cosine similarity in original 512D space (not 3D):

similarity(A, B) = (A · B) / (||A|| × ||B||)

Range: [-1, 1], but word embeddings typically yield [0, 1].

Interpretation:

  • 0.0: Orthogonal (semantically unrelated)
  • 0.5: Moderate similarity
  • 0.9+: High similarity
  • 1.0: Identical

Verification: Self-similarity = 1.0. Synonym pairs > 0.7. Unrelated pairs < 0.3.
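
A verification sketch mirroring these criteria (the word pairs and thresholds track the line above but the specific words are illustrative; embed stands for any helper returning the 512-D vector, e.g. the one in the pipeline sketch):

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_ocean = embed("ocean")                                   # illustrative pairs
assert abs(cosine(v_ocean, v_ocean) - 1.0) < 1e-6          # self-similarity = 1.0
assert cosine(v_ocean, embed("sea")) > 0.7                 # synonym pair
assert cosine(v_ocean, embed("stapler")) < 0.3             # unrelated pair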


Related Words (K-NN)

Selection: K nearest neighbors to guess embedding (not mystery word).

Scoring: Cosine similarity of each neighbor to mystery embedding.

Purpose: Reveals semantic temperature of guess's neighborhood.

Implementation: Brute-force scan over 6,017 vocabulary embeddings. O(n) per query, <50ms.
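
A brute-force sketch of this step (since the API returns unit-norm vectors, a single matrix product gives cosine similarity against all 6,017 words; vocab_words and vocab_vectors are illustrative names for the aligned vocabulary list and its (n, 512) embedding matrix):

import numpy as np

def related_words(guess_vec: np.ndarray, mystery_vec: np.ndarray,
                  vocab_words: list[str], vocab_vectors: np.ndarray, k: int = 8):
    sims_to_guess = vocab_vectors @ guess_vec          # select neighbors of the guess...
    nearest = np.argsort(-sims_to_guess)[:k]
    # ...then score each neighbor against the mystery embedding
    return [(vocab_words[i], float(vocab_vectors[i] @ mystery_vec)) for i in nearest]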


Database Schema

CREATE TABLE words (
    word TEXT PRIMARY KEY,
    embedding BLOB,      -- 512 × float32 = 2,048 bytes
    x REAL,              -- Normalized UMAP x ∈ [-1, 1]
    y REAL,              -- Normalized UMAP y ∈ [-1, 1]
    z REAL,              -- Normalized UMAP z ∈ [-1, 1]
    pos TEXT             -- Internal: 'noun', 'verb', 'adj'
);

CREATE TABLE metadata (
    key TEXT PRIMARY KEY,
    value TEXT           -- Stores UMAP min/max for normalization
);
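
Round-tripping the embedding BLOB (sketch; table and column names follow the schema above, the DB path matches the verification commands below, and the test word is illustrative):

import sqlite3
import numpy as np

conn = sqlite3.connect("services/data/word_space.db")
row = conn.execute("SELECT embedding, x, y, z FROM words WHERE word = ?",
                   ("telescope",)).fetchone()
if row is not None:
    vec = np.frombuffer(row[0], dtype=np.float32)   # 2,048 bytes -> (512,) float32
    print(vec.shape, row[1:])                       # (512,) plus normalized 3-D coords
conn.close()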

Latency Profile

Operation           Typical     Notes
OpenAI embedding    200-400ms   Network bound
UMAP transform      10-20ms     First call includes JIT warmup (~500ms)
Cosine similarity   <1ms
K-NN search         30-50ms     Brute force over 6,017 words
Total               ~300ms      Dominated by embedding API
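
A rough way to reproduce these numbers (sketch; uses the same WordSpaceService entry point as the verification command below, and results vary with network and hardware):

import asyncio
import time

from services.word_space_service import WordSpaceService

async def time_pipeline(word: str = "telescope") -> None:
    svc = WordSpaceService()
    await svc.embed_and_reduce(word)                 # warm-up call (JIT, connections)
    start = time.perf_counter()
    await svc.embed_and_reduce(word)
    print(f"embed + reduce: {(time.perf_counter() - start) * 1000:.0f} ms")
    await svc.close()

asyncio.run(time_pipeline())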

Verification Commands

# Check vocabulary size and POS distribution
sqlite3 services/data/word_space.db "SELECT pos, COUNT(*) FROM words GROUP BY pos;"

# Verify coordinate normalization
sqlite3 services/data/word_space.db "SELECT MIN(x), MAX(x), MIN(y), MAX(y), MIN(z), MAX(z) FROM words;"

# Test embedding + UMAP pipeline
python -c "
import asyncio
from services.word_space_service import WordSpaceService
async def test():
    svc = WordSpaceService()
    r = await svc.embed_and_reduce('telescope')
    print(f'telescope: ({r[\"x\"]:.4f}, {r[\"y\"]:.4f}, {r[\"z\"]:.4f})')
    await svc.close()
asyncio.run(test())
"

# Verify self-similarity = 1.0
# (Requires mystery word set to test word)
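
# Alternative (sketch, no game state needed): self-similarity of a stored
# vocabulary embedding with itself should be exactly 1.0
python -c "
import sqlite3, numpy as np
blob = sqlite3.connect('services/data/word_space.db').execute(
    'SELECT embedding FROM words LIMIT 1').fetchone()[0]
v = np.frombuffer(blob, dtype=np.float32)
print(np.dot(v, v) / (np.linalg.norm(v) ** 2))   # expect 1.0
"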

File Reference

File                                                  Purpose
server/services/word_space_service.py                 Pipeline orchestration
server/services/embed/embedding_service.py            OpenAI API wrapper
server/services/umap_reduce/umap_service.py           UMAP transform (loads fitted model)
server/services/umap_reduce/fit_vocab_umap_3d.py      UMAP training script
server/services/umap_reduce/umap_reducer_3d.joblib    Fitted UMAP model
server/services/vector/sqlite_vector_service.py       K-NN search
server/services/data/word_space.db                    Vocabulary + embeddings + coordinates
server/services/data/vocab.csv                        Source vocabulary (word, pos)

Decision Rationale

For detailed justification of design choices: