# Index strategy
Which indexes exist, what they accelerate, and why each was worth the write cost.
Every read path in SOMA has a named index backing it. Unplanned sequential scans are a bug.
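A quick way to hold that line is an EXPLAIN spot check on each hot path. A minimal sketch, assuming an entities(user_id, type, name) schema matching the catalog; literals are placeholders:

```sql
-- Hypothetical spot check: confirm a hot-path query hits its named index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name
FROM entities
WHERE user_id = '00000000-0000-0000-0000-000000000001'
  AND type = 'project';
-- Expect an Index Scan (or Bitmap Index Scan) on entities_user_type_idx;
-- a Seq Scan here means the index is missing or the planner can't use it.
```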
## The catalog
| Index | Shape | Accelerates |
|---|---|---|
| entities_embedding_hnsw | HNSW on vector(1024), cosine | Semantic recall via memory_recall |
| entities_search_gin | GIN on tsvector | Lexical fallback in search_entities |
| entities_user_type_idx | B-tree on (user_id, type) | Per-type listings (/app/projects, /app/entities?type=book) |
| entities_user_status_idx | B-tree on (user_id, status) | Active filter (hide archived) |
| edges_from_idx | B-tree on (from_id, type) | Outgoing-edge walk |
| edges_to_idx | B-tree on (to_id, type) | Incoming-edge walk (backlinks) |
| events_user_type_time_idx | B-tree on (user_id, type, occurred_at DESC) | Timeline views, filter-by-kind |
| events_entities_gin | GIN on entity_ids uuid[] | "Events involving X" via array containment |
| facts_embedding_hnsw | HNSW on vector(1024), cosine | Fact recall + dedup-on-write |
| facts_user_active_idx | Partial B-tree on user_id WHERE superseded_by IS NULL | Current facts only (skip the history) |
| sources_kind_external_id_uniq | UNIQUE on (kind, external_id) | Idempotent ingestion: repeated imports don't double-count |
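As DDL, a few representative rows from the catalog look roughly like this. A sketch only: table and column names are assumptions inferred from the index names, not the actual migrations.

```sql
-- Assumed schema; adjust names to the real migrations.
CREATE INDEX entities_user_type_idx ON entities USING btree (user_id, type);
CREATE INDEX edges_from_idx         ON edges    USING btree (from_id, type);
CREATE INDEX events_entities_gin    ON events   USING gin   (entity_ids);
CREATE UNIQUE INDEX sources_kind_external_id_uniq
  ON sources (kind, external_id);

-- The GIN index serves "events involving X" via array containment:
-- SELECT * FROM events WHERE entity_ids @> ARRAY[$1]::uuid[];
```

The unique index doubles as the conflict target for idempotent ingestion: INSERT ... ON CONFLICT (kind, external_id) DO NOTHING makes repeated imports a no-op.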
## HNSW tuning
Both entities_embedding_hnsw and facts_embedding_hnsw use the defaults — m=16, ef_construction=64. Voyage-3-large is 1024-dimensional; default ef_search gives good recall for our volumes (MVP has thousands, not millions). Revisit when we cross 100k rows.
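Spelled out in pgvector syntax, those defaults look like the following. A sketch, assuming an embedding vector(1024) column; the WITH clause is redundant at default values but makes the choice explicit:

```sql
-- pgvector HNSW index with the (default) build parameters stated explicitly.
CREATE INDEX entities_embedding_hnsw
  ON entities USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time recall/speed knob; pgvector's default is 40.
-- Raise per-session if recall degrades as row counts grow.
SET hnsw.ef_search = 40;
```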
## Why partial indexes
facts_user_active_idx is partial (WHERE superseded_by IS NULL). Facts are append-on-update: old rows stay for audit but never participate in recall. A full index would keep growing with superseded rows that no query ever touches; the WHERE clause keeps it tight.
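A minimal sketch of the definition and a query it serves, assuming a facts(user_id, superseded_by) schema:

```sql
-- Partial index: only rows that are still current are indexed.
CREATE INDEX facts_user_active_idx
  ON facts (user_id)
  WHERE superseded_by IS NULL;

-- The planner uses the partial index only when the query's
-- predicate provably implies the index's WHERE clause:
SELECT *
FROM facts
WHERE user_id = $1
  AND superseded_by IS NULL;
```

The flip side: queries that omit the superseded_by IS NULL filter (e.g. audit views over fact history) cannot use this index and fall back to other plans, which is the intended trade.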