# Index strategy
Which indexes exist, what they accelerate, and why each was worth the write cost.
Every read path in SOMA has a named index backing it. Unplanned sequential scans are a bug.
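A quick way to hold that line is an EXPLAIN spot check on each hot path. A minimal sketch, assuming an entities(user_id, type, name) schema matching the catalog; literals are placeholders:

```sql
-- Hypothetical spot check: confirm a hot-path query hits its named index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name
FROM entities
WHERE user_id = '00000000-0000-0000-0000-000000000001'
  AND type = 'project';
-- Expect an Index Scan (or Bitmap Index Scan) on entities_user_type_idx;
-- a Seq Scan here means the index is missing or the planner can't use it.
```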
## The catalog
| Index | Shape | Accelerates |
|---|---|---|
| entities_embedding_hnsw | HNSW on vector(1024), cosine | Semantic recall via memory_recall |
| entities_search_gin | GIN on tsvector | Lexical fallback in search_entities |
| entities_user_type_idx | B-tree on (user_id, type) | Per-type listings (/app/projects, /app/entities?type=book) |
| entities_user_status_idx | B-tree on (user_id, status) | Active filter (hide archived) |
| edges_from_idx | B-tree on (from_id, type) | Outgoing-edge walk |
| edges_to_idx | B-tree on (to_id, type) | Incoming-edge walk (backlinks) |
| events_user_type_time_idx | B-tree on (user_id, type, occurred_at DESC) | Timeline views, filter-by-kind |
| events_entities_gin | GIN on entity_ids uuid[] | "Events involving X" via array containment |
| facts_embedding_hnsw | HNSW on vector(1024), cosine | Fact recall + dedup-on-write |
| facts_user_active_idx | Partial B-tree on user_id WHERE superseded_by IS NULL | Current facts only (skip the history) |
| sources_kind_external_id_uniq | UNIQUE on (kind, external_id) | Idempotent ingestion: repeated imports don't double-count |
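As DDL, a few representative rows from the catalog look roughly like this. A sketch only: table and column names are assumptions inferred from the index names, not the actual migrations.

```sql
-- Assumed schema; adjust names to the real migrations.
CREATE INDEX entities_user_type_idx ON entities USING btree (user_id, type);
CREATE INDEX edges_from_idx         ON edges    USING btree (from_id, type);
CREATE INDEX events_entities_gin    ON events   USING gin   (entity_ids);
CREATE UNIQUE INDEX sources_kind_external_id_uniq
  ON sources (kind, external_id);

-- The GIN index serves "events involving X" via array containment:
-- SELECT * FROM events WHERE entity_ids @> ARRAY[$1]::uuid[];
```

The unique index doubles as the conflict target for idempotent ingestion: INSERT ... ON CONFLICT (kind, external_id) DO NOTHING makes repeated imports a no-op.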
## HNSW tuning
Both entities_embedding_hnsw and facts_embedding_hnsw use the defaults — m=16, ef_construction=64. Voyage-3-large is 1024-dimensional; default ef_search gives good recall for our volumes (MVP has thousands, not millions). Revisit when we cross 100k rows.
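Spelled out in pgvector syntax, those defaults look like the following. A sketch, assuming an embedding vector(1024) column; the WITH clause is redundant at default values but makes the choice explicit:

```sql
-- pgvector HNSW index with the (default) build parameters stated explicitly.
CREATE INDEX entities_embedding_hnsw
  ON entities USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time recall/speed knob; pgvector's default is 40.
-- Raise per-session if recall degrades as row counts grow.
SET hnsw.ef_search = 40;
```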
## Why partial indexes
facts_user_active_idx is partial (WHERE superseded_by IS NULL). Facts are append-on-update: old rows stay for audit but never participate in recall. A full index would keep growing with superseded rows that no query ever touches; the WHERE clause keeps it tight.
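A minimal sketch of the definition and a query it serves, assuming a facts(user_id, superseded_by) schema:

```sql
-- Partial index: only rows that are still current are indexed.
CREATE INDEX facts_user_active_idx
  ON facts (user_id)
  WHERE superseded_by IS NULL;

-- The planner uses the partial index only when the query's
-- predicate provably implies the index's WHERE clause:
SELECT *
FROM facts
WHERE user_id = $1
  AND superseded_by IS NULL;
```

The flip side: queries that omit the superseded_by IS NULL filter (e.g. audit views over fact history) cannot use this index and fall back to other plans, which is the intended trade.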