Vectorizing the spacetag index — eight retrieval decisions, resolved

design-decision · sourced

Semantic + faceted search over ~256 spacetags (a controlled capability/action vocabulary plus NL summary, verdict, and numeric scores). Eight Phase-0 research questions — hybrid retrieval, embedding model, what to embed, similarity & norm, rerank, storage, dedup, evaluation — each resolved with a one-line reason and a stable interface that survives the eventual backend swap.

~256spacetags indexed~1.5 MBfull vector setsub-msbrute-force scan≥10kmigration trigger

Findings

Hybrid retrieval: pre-filter on an exact facet mask, then brute-force cosine over survivors — at a few hundred items a full scan is sub-millisecond and 100% recall, so none of the 'filter breaks the HNSW graph' failure modes exist. [O]
Embed a single vector of name + summary + verdict + capabilities-as-NL-phrase; drop the boilerplate affirmation sentence (near-identical across all items dilutes the vector); keep raw tokens as filter metadata too. [O]
Default model openai/text-embedding-3-small (1536-d, ~$0.02/1M) via Vercel AI Gateway; fallback cohere/embed-v4.0. At this volume monthly spend is sub-cent, so optimize for integration + quality. [O]
Cosine with L2-normalization; surface a calibrated band (strong/weak/none) plus raw score and top1−top2 gap — text-embedding-3 'good' sims sit ~0.30–0.55, not 0.8, and agents over-trust a bare cosine. [O]
In-memory brute force (JSON loaded at module scope); migrate only at ≥~10k vectors, writes more than a few/day, or p95 compute >50ms — target Upstash Vector, runner-up pgvector. A stable VectorStore interface keeps callers untouched across the swap. [O]
Dedup on a normalized owner/repo key (+ GitHub repo_id as immutable identity); near-dup merges only when Jaro-Winkler(title) ≥ 0.92 AND cosine ≥ 0.88, so sibling projects don't over-merge. [O]

Open in the directory →All studies →