Technical Specifications

A verification checklist for the Word Space 3D pipeline: embeddings, UMAP, and similarity scoring.

Updated: January 25, 2026


Pipeline Overview

word → OpenAI embedding (512D) → UMAP transform (3D) → normalize to [-1,1]
                ↓
        cosine similarity vs mystery word (computed in 512D, not 3D)
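
A rough end-to-end sketch of the same flow (illustrative only; the real implementation lives in the services listed under File Reference, the helper names here are not the actual API, and an OpenAI API key plus the fitted .joblib model are assumed):

import joblib
import numpy as np
from openai import OpenAI

client = OpenAI()
reducer = joblib.load("server/services/umap_reduce/umap_reducer_3d.joblib")

def embed(word: str) -> np.ndarray:
    """512-D vector from text-embedding-3-small (L2-normalized by the API)."""
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=word, dimensions=512)
    return np.asarray(resp.data[0].embedding, dtype=np.float32)

def to_3d(vec: np.ndarray, axis_min: np.ndarray, axis_max: np.ndarray) -> np.ndarray:
    """UMAP transform to 3-D, then per-axis normalization to [-1, 1]."""
    raw = reducer.transform(vec.reshape(1, -1))[0]
    return 2 * (raw - axis_min) / (axis_max - axis_min) - 1

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, computed in the original 512-D space (not 3-D)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))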

Embedding

Parameter       Value                    Rationale
Model           text-embedding-3-small   Optimized for short content
Dimensions      512                      Sufficient for single words; avoids curse of dimensionality
Normalization   L2 normalized by API     Standard for cosine similarity

Verification: Embedding vectors have unit norm (||v|| ≈ 1.0).
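
A quick stand-alone check (sketch; requires an OpenAI API key, the test word is illustrative):

import numpy as np
from openai import OpenAI

resp = OpenAI().embeddings.create(model="text-embedding-3-small",
                                  input="telescope", dimensions=512)
v = np.asarray(resp.data[0].embedding)
print(np.linalg.norm(v))                      # expect ~1.0
assert abs(np.linalg.norm(v) - 1.0) < 1e-3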


UMAP

Parameter      Value    Rationale
n_components   3        3D visualization
metric         cosine   Matches embedding similarity metric
n_neighbors    15       Balance: local structure vs global coherence
min_dist       0.08     Tighter clusters without overlap
random_state   42       Reproducibility

Training: Fit once on vocabulary (6,017 words). Saved as .joblib.

Inference: .transform() for new points. No retraining required.
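
A minimal fit/save/load sketch with the parameters above (illustrative; the real scripts are fit_vocab_umap_3d.py and umap_service.py, and vocab_embeddings.npy is a hypothetical path for the (6017, 512) vocabulary matrix):

import joblib
import numpy as np
import umap

vocab_vectors = np.load("vocab_embeddings.npy")     # (6017, 512), hypothetical file

reducer = umap.UMAP(n_components=3, metric="cosine",
                    n_neighbors=15, min_dist=0.08, random_state=42)
reducer.fit(vocab_vectors)                          # fit once on the vocabulary
joblib.dump(reducer, "umap_reducer_3d.joblib")

# Inference: load the fitted model and project new points without retraining.
reducer = joblib.load("umap_reducer_3d.joblib")
coords = reducer.transform(vocab_vectors[:1])       # shape (1, 3), raw (unnormalized)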

Verification: UMAP should preserve local neighborhoods; each word's k nearest neighbors in 3D should largely overlap with its k nearest neighbors in 512D (k-NN recall).
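
One way to check this (sketch, using scikit-learn; emb_512 and coords_3d are illustrative names for row-aligned arrays of vocabulary embeddings and 3-D coordinates):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_recall(emb_512: np.ndarray, coords_3d: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each word's 512-D neighbors also found among its 3-D neighbors."""
    idx_hi = (NearestNeighbors(n_neighbors=k + 1, metric="cosine")
              .fit(emb_512).kneighbors(emb_512, return_distance=False)[:, 1:])
    idx_lo = (NearestNeighbors(n_neighbors=k + 1, metric="euclidean")
              .fit(coords_3d).kneighbors(coords_3d, return_distance=False)[:, 1:])
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_hi, idx_lo)]))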


Coordinate Normalization

Raw UMAP output normalized to [-1, 1] per axis:

normalized = 2 × (raw - min) / (max - min) - 1

Min/max computed from vocabulary corpus, stored in database metadata.
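
As a sketch of the same formula (axis_min and axis_max stand for the per-axis values read from the metadata table; the names are illustrative):

import numpy as np

def normalize_coords(raw: np.ndarray, axis_min: np.ndarray, axis_max: np.ndarray) -> np.ndarray:
    """Map raw UMAP output to [-1, 1] per axis; unseen words may land slightly outside."""
    return 2 * (raw - axis_min) / (axis_max - axis_min) - 1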

Verification: All vocabulary coordinates fall in [-1, 1]. New words may exceed bounds slightly (out-of-distribution).


Similarity Scoring

Cosine similarity in original 512D space (not 3D):

similarity(A, B) = (A · B) / (||A|| × ||B||)

Range: [-1, 1], but word embeddings typically yield [0, 1].

Interpretation:

  • 0.0: Orthogonal (semantically unrelated)
  • 0.5: Moderate similarity
  • 0.9+: High similarity
  • 1.0: Identical

Verification: Self-similarity = 1.0. Synonym pairs > 0.7. Unrelated pairs < 0.3.
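
A verification sketch mirroring these criteria (the word pairs and thresholds track the line above but the specific words are illustrative; embed stands for any helper returning the 512-D vector, e.g. the one in the pipeline sketch):

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_ocean = embed("ocean")                                   # illustrative pairs
assert abs(cosine(v_ocean, v_ocean) - 1.0) < 1e-6          # self-similarity = 1.0
assert cosine(v_ocean, embed("sea")) > 0.7                 # synonym pair
assert cosine(v_ocean, embed("stapler")) < 0.3             # unrelated pair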


Related Words (K-NN)

Selection: K nearest neighbors to guess embedding (not mystery word).

Scoring: Cosine similarity of each neighbor to mystery embedding.

Purpose: Reveals semantic temperature of guess's neighborhood.

Implementation: Brute-force scan over 6,017 vocabulary embeddings. O(n) per query, <50ms.
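
A brute-force sketch of this step (since the API returns unit-norm vectors, a single matrix product gives cosine similarity against all 6,017 words; vocab_words and vocab_vectors are illustrative names for the aligned vocabulary list and its (n, 512) embedding matrix):

import numpy as np

def related_words(guess_vec: np.ndarray, mystery_vec: np.ndarray,
                  vocab_words: list[str], vocab_vectors: np.ndarray, k: int = 8):
    sims_to_guess = vocab_vectors @ guess_vec          # select neighbors of the guess...
    nearest = np.argsort(-sims_to_guess)[:k]
    # ...then score each neighbor against the mystery embedding
    return [(vocab_words[i], float(vocab_vectors[i] @ mystery_vec)) for i in nearest]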


Database Schema

CREATE TABLE words (
    word TEXT PRIMARY KEY,
    embedding BLOB,      -- 512 × float32 = 2,048 bytes
    x REAL,              -- Normalized UMAP x ∈ [-1, 1]
    y REAL,              -- Normalized UMAP y ∈ [-1, 1]
    z REAL,              -- Normalized UMAP z ∈ [-1, 1]
    pos TEXT             -- Internal: 'noun', 'verb', 'adj'
);

CREATE TABLE metadata (
    key TEXT PRIMARY KEY,
    value TEXT           -- Stores UMAP min/max for normalization
);
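
Round-tripping the embedding BLOB (sketch; table and column names follow the schema above, the DB path matches the verification commands below, and the test word is illustrative):

import sqlite3
import numpy as np

conn = sqlite3.connect("services/data/word_space.db")
row = conn.execute("SELECT embedding, x, y, z FROM words WHERE word = ?",
                   ("telescope",)).fetchone()
if row is not None:
    vec = np.frombuffer(row[0], dtype=np.float32)   # 2,048 bytes -> (512,) float32
    print(vec.shape, row[1:])                       # (512,) plus normalized 3-D coords
conn.close()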

Latency Profile

Operation           Typical     Notes
OpenAI embedding    200-400ms   Network bound
UMAP transform      10-20ms     First call includes JIT warmup (~500ms)
Cosine similarity   <1ms
K-NN search         30-50ms     Brute force over 6,017 words
Total               ~300ms      Dominated by embedding API
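
A rough way to reproduce these numbers (sketch; uses the same WordSpaceService entry point as the verification command below, and results vary with network and hardware):

import asyncio
import time

from services.word_space_service import WordSpaceService

async def time_pipeline(word: str = "telescope") -> None:
    svc = WordSpaceService()
    await svc.embed_and_reduce(word)                 # warm-up call (JIT, connections)
    start = time.perf_counter()
    await svc.embed_and_reduce(word)
    print(f"embed + reduce: {(time.perf_counter() - start) * 1000:.0f} ms")
    await svc.close()

asyncio.run(time_pipeline())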

Verification Commands

# Check vocabulary size and POS distribution
sqlite3 services/data/word_space.db "SELECT pos, COUNT(*) FROM words GROUP BY pos;"

# Verify coordinate normalization
sqlite3 services/data/word_space.db "SELECT MIN(x), MAX(x), MIN(y), MAX(y), MIN(z), MAX(z) FROM words;"

# Test embedding + UMAP pipeline
python -c "
import asyncio
from services.word_space_service import WordSpaceService
async def test():
    svc = WordSpaceService()
    r = await svc.embed_and_reduce('telescope')
    print(f'telescope: ({r[\"x\"]:.4f}, {r[\"y\"]:.4f}, {r[\"z\"]:.4f})')
    await svc.close()
asyncio.run(test())
"

# Verify self-similarity = 1.0
# (Requires mystery word set to test word)
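
# Alternative (sketch, no game state needed): self-similarity of a stored
# vocabulary embedding with itself should be exactly 1.0
python -c "
import sqlite3, numpy as np
blob = sqlite3.connect('services/data/word_space.db').execute(
    'SELECT embedding FROM words LIMIT 1').fetchone()[0]
v = np.frombuffer(blob, dtype=np.float32)
print(np.dot(v, v) / (np.linalg.norm(v) ** 2))   # expect 1.0
"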

File Reference

File                                                  Purpose
server/services/word_space_service.py                 Pipeline orchestration
server/services/embed/embedding_service.py            OpenAI API wrapper
server/services/umap_reduce/umap_service.py           UMAP transform (loads fitted model)
server/services/umap_reduce/fit_vocab_umap_3d.py      UMAP training script
server/services/umap_reduce/umap_reducer_3d.joblib    Fitted UMAP model
server/services/vector/sqlite_vector_service.py       K-NN search
server/services/data/word_space.db                    Vocabulary + embeddings + coordinates
server/services/data/vocab.csv                        Source vocabulary (word, pos)

Decision Rationale

For detailed justification of design choices: