# Technical Specifications
A verification checklist for the Word Space 3D pipeline.
## Pipeline Overview

```
word → OpenAI embedding (512D) → UMAP transform (3D) → normalize to [-1, 1]
                                       ↓
         cosine similarity vs. mystery word (computed in 512D, not 3D)
```
## Embedding
| Parameter | Value | Rationale |
|---|---|---|
| Model | text-embedding-3-small | Optimized for short content |
| Dimensions | 512 | Sufficient for single words; avoids curse of dimensionality |
| Normalization | L2 normalized by API | Standard for cosine similarity |
Verification: Embedding vectors have unit norm (||v|| ≈ 1.0).
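This invariant can be spot-checked directly. A minimal sketch using NumPy, with synthetic normalized vectors standing in for API output (the `check_unit_norm` helper is illustrative, not part of the codebase):

```python
import numpy as np

def check_unit_norm(vectors: np.ndarray, tol: float = 1e-3) -> bool:
    """Verify every embedding row is L2-normalized (||v|| ≈ 1.0)."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

# Synthetic stand-ins for API output: random vectors, explicitly normalized
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 512)).astype(np.float32)
embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)
print(check_unit_norm(embeddings))  # True
```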
## UMAP

| Parameter | Value | Rationale |
|---|---|---|
| n_components | 3 | 3D visualization |
| metric | cosine | Matches embedding similarity metric |
| n_neighbors | 15 | Balance: local structure vs global coherence |
| min_dist | 0.08 | Tighter clusters without overlap |
| random_state | 42 | Reproducibility |
Training: Fit once on the vocabulary (6,017 words); the fitted model is saved as a `.joblib` file.
Inference: `.transform()` projects new points. No retraining required.
Verification: UMAP preserves local neighborhoods. K-NN recall in 3D should correlate with K-NN in 512D.
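One way to make this check concrete: compute each word's K nearest neighbors in both spaces and measure the overlap. A sketch (the helper names are illustrative; cosine distance in 512D, Euclidean in the 3D projection):

```python
import numpy as np

def knn_indices(X: np.ndarray, k: int, metric: str) -> np.ndarray:
    """Indices of each row's k nearest neighbors (self excluded)."""
    if metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        dist = 1.0 - Xn @ Xn.T
    else:  # euclidean, for the 3D coordinates
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def knn_recall(high_dim: np.ndarray, low_dim: np.ndarray, k: int = 10) -> float:
    """Mean fraction of 512D cosine neighbors preserved among 3D neighbors."""
    hi = knn_indices(high_dim, k, "cosine")
    lo = knn_indices(low_dim, k, "euclidean")
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(hi, lo)]))
```

A recall well above the chance level (k / vocabulary size) indicates local neighborhoods survive the projection.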
## Coordinate Normalization

Raw UMAP output is normalized to [-1, 1] per axis:

```
normalized = 2 × (raw - min) / (max - min) - 1
```
Min/max computed from vocabulary corpus, stored in database metadata.
Verification: All vocabulary coordinates fall in [-1, 1]. New words may exceed bounds slightly (out-of-distribution).
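The formula above, as a direct NumPy translation. The bounds below are assumed example values, not the real metadata; out-of-vocabulary points are left unclamped, matching the spec:

```python
import numpy as np

def normalize_coords(raw: np.ndarray, vmin: np.ndarray, vmax: np.ndarray) -> np.ndarray:
    """Map raw UMAP output to [-1, 1] per axis using corpus min/max.
    Out-of-vocabulary words may land slightly outside the range."""
    return 2.0 * (raw - vmin) / (vmax - vmin) - 1.0

# Illustrative corpus bounds (assumed, not the stored metadata values)
vmin = np.array([-4.0, -3.5, -5.0])
vmax = np.array([6.0, 4.5, 5.0])

print(normalize_coords(vmin, vmin, vmax))  # corpus minimum maps to -1 on every axis
print(normalize_coords(vmax, vmin, vmax))  # corpus maximum maps to +1 on every axis
```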
## Similarity Scoring

Cosine similarity in the original 512D space (not 3D):

```
similarity(A, B) = (A · B) / (||A|| × ||B||)
```
Range: [-1, 1], but word embeddings typically yield [0, 1].
Interpretation:
- 0.0: Orthogonal (semantically unrelated)
- 0.5: Moderate similarity
- 0.9+: High similarity
- 1.0: Identical
Verification: Self-similarity = 1.0. Synonym pairs > 0.7. Unrelated pairs < 0.3.
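The scoring formula translates directly to NumPy; a sketch (the real service operates on the stored 512D vectors, but the math is dimension-agnostic):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """similarity(A, B) = (A · B) / (||A|| × ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([3.0, 4.0])
print(cosine_similarity(v, v))                                        # 1.0 (self-similarity)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
```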
## Related Words (K-NN)
Selection: K nearest neighbors of the guess embedding (not of the mystery word).
Scoring: Cosine similarity of each neighbor to mystery embedding.
Purpose: Reveals semantic temperature of guess's neighborhood.
Implementation: Brute-force scan over 6,017 vocabulary embeddings. O(n) per query, <50ms.
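The split between selection and scoring can be sketched as follows (the `related_words` name is illustrative; vectors are assumed L2-normalized, so a dot product equals cosine similarity):

```python
import numpy as np

def related_words(guess_vec, vocab_vecs, vocab_words, mystery_vec, k=5):
    """Pick the k vocabulary words nearest the *guess*, then score each
    against the *mystery* embedding -- selection and scoring differ."""
    sims_to_guess = vocab_vecs @ guess_vec   # O(n) brute-force scan
    top = np.argsort(-sims_to_guess)[:k]     # k highest similarities
    return [(vocab_words[i], float(vocab_vecs[i] @ mystery_vec)) for i in top]

# Toy 3-word vocabulary in 3D (stand-in for 6,017 words in 512D)
vocab_words = ["alpha", "beta", "gamma"]
vocab_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.8, 0.6, 0.0],
                       [0.0, 0.0, 1.0]])
guess = np.array([1.0, 0.0, 0.0])
mystery = np.array([0.0, 1.0, 0.0])
print(related_words(guess, vocab_vecs, vocab_words, mystery, k=2))
# [('alpha', 0.0), ('beta', 0.6)]
```

Here "alpha" and "beta" are nearest the guess, but only "beta" is warm relative to the mystery word, which is exactly the neighborhood-temperature signal described above.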
## Database Schema

```sql
CREATE TABLE words (
    word      TEXT PRIMARY KEY,
    embedding BLOB,    -- 512 × float32 = 2,048 bytes
    x         REAL,    -- Normalized UMAP x ∈ [-1, 1]
    y         REAL,    -- Normalized UMAP y ∈ [-1, 1]
    z         REAL,    -- Normalized UMAP z ∈ [-1, 1]
    pos       TEXT     -- Internal: 'noun', 'verb', 'adj'
);

CREATE TABLE metadata (
    key   TEXT PRIMARY KEY,
    value TEXT           -- Stores UMAP min/max for normalization
);
```
## Latency Profile
| Operation | Typical | Notes |
|---|---|---|
| OpenAI embedding | 200-400ms | Network bound |
| UMAP transform | 10-20ms | First call includes JIT warmup (~500ms) |
| Cosine similarity | <1ms | |
| K-NN search | 30-50ms | Brute force over 6,017 words |
| Total | ~300ms | Dominated by embedding API |
## Verification Commands

```shell
# Check vocabulary size and POS distribution
sqlite3 services/data/word_space.db "SELECT pos, COUNT(*) FROM words GROUP BY pos;"

# Verify coordinate normalization
sqlite3 services/data/word_space.db "SELECT MIN(x), MAX(x), MIN(y), MAX(y), MIN(z), MAX(z) FROM words;"

# Test embedding + UMAP pipeline
python -c "
import asyncio
from services.word_space_service import WordSpaceService

async def test():
    svc = WordSpaceService()
    r = await svc.embed_and_reduce('telescope')
    print(f'telescope: ({r[\"x\"]:.4f}, {r[\"y\"]:.4f}, {r[\"z\"]:.4f})')
    await svc.close()

asyncio.run(test())
"

# Verify self-similarity = 1.0
# (Requires setting the mystery word to the test word)
```
## File Reference

| File | Purpose |
|---|---|
| server/services/word_space_service.py | Pipeline orchestration |
| server/services/embed/embedding_service.py | OpenAI API wrapper |
| server/services/umap_reduce/umap_service.py | UMAP transform (loads fitted model) |
| server/services/umap_reduce/fit_vocab_umap_3d.py | UMAP training script |
| server/services/umap_reduce/umap_reducer_3d.joblib | Fitted UMAP model |
| server/services/vector/sqlite_vector_service.py | K-NN search |
| server/services/data/word_space.db | Vocabulary + embeddings + coordinates |
| server/services/data/vocab.csv | Source vocabulary (word, pos) |
## Decision Rationale
For detailed justification of design choices: