# Memory strategy
**Two distinct memory systems; don't conflate them.** SOMA has two memory layers. They solve different problems and must not be collapsed into one.
| Layer | Store | Purpose | Retention |
|---|---|---|---|
| Conversational (short-term) | Mastra Memory + PostgresStore | Last N turns per thread; keeps multi-turn context coherent | `lastMessages: 12` |
| Knowledge (long-term) | `entities` / `edges` / `events` / `facts` | The graph, queried by tools | Forever (or until forget / supersede) |
## Conversational memory
Mastra's built-in Memory tracks the message history of a thread, identified by `threadId`. On the web, the thread id is `web:<userId>:<epochMs>`; on the bot, it's `tg:<chatId>:<epochMs>`. The agent uses the last 12 turns as immediate context.
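The thread-id scheme above can be sketched as two small helpers. The function names are hypothetical; the real construction lives wherever SOMA creates threads:

```typescript
// Hypothetical helpers mirroring the documented thread-id scheme.
// Neither function exists in SOMA under these names; this is a sketch.

/** Thread id for a web session: web:<userId>:<epochMs> */
function webThreadId(userId: string, epochMs: number = Date.now()): string {
  return `web:${userId}:${epochMs}`;
}

/** Thread id for a Telegram chat: tg:<chatId>:<epochMs> */
function tgThreadId(chatId: number, epochMs: number = Date.now()): string {
  return `tg:${chatId}:${epochMs}`;
}
```

Embedding the epoch timestamp means every new session gets a fresh thread, so the `lastMessages: 12` window never bleeds across sessions.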
Gated on opt-in. Mastra's PgVector worker has a history of failing inside Next.js serverless environments (thread-stream worker resolution), so conversational memory is disabled by default and enabled only when `SOMA_MASTRA_MEMORY=1` is set. SOMA's knowledge graph covers long-term recall through tool calls anyway.
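The opt-in gate amounts to a single strict check. A minimal sketch (function name hypothetical; the real wiring only attaches the Memory instance when the flag is set):

```typescript
// Hypothetical gate for the SOMA_MASTRA_MEMORY flag described above.
// True only when the env var is exactly "1" — "true", "yes", etc. do not count.
function conversationalMemoryEnabled(
  env: Record<string, string | undefined>,
): boolean {
  return env.SOMA_MASTRA_MEMORY === "1";
}
```

An exact-match check keeps the default safely off: any typo or alternative truthy spelling leaves conversational memory disabled rather than accidentally enabled.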
## Knowledge memory
The graph is the canonical memory. Tools hit it directly:
- `memory_recall` — semantic search over `entities` via pgvector HNSW, optionally reranked.
- `search_entities` — lexical full-text search over `entities.name` and searchable properties.
- `graph_neighbors` — BFS walk from a root entity, up to 3 hops.
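The `graph_neighbors` walk can be illustrated with a minimal hop-bounded BFS over an in-memory adjacency map. This is a sketch only; the real tool reads the `edges` table in Postgres:

```typescript
// Minimal BFS sketch of a graph_neighbors-style walk, capped at maxHops.
// Assumes edges are preloaded into an adjacency map (entity id -> neighbor ids).
function neighbors(
  adjacency: Map<string, string[]>,
  root: string,
  maxHops = 3,
): Set<string> {
  const seen = new Set<string>([root]);
  let frontier = [root];
  for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const nb of adjacency.get(node) ?? []) {
        if (!seen.has(nb)) {
          seen.add(nb);
          next.push(nb);
        }
      }
    }
    frontier = next;
  }
  seen.delete(root); // return only the neighborhood, not the root itself
  return seen;
}
```

The `seen` set doubles as cycle protection, so the walk terminates even on graphs with back-edges.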
Facts are pulled in separately when the agent needs durable preferences ("user prefers morning workouts"). The fact-extract workflow runs asynchronously on each conversation turn.
:::note
Mastra's built-in `semanticRecall` is disabled. It would duplicate work (the knowledge graph already handles semantic recall) and require a second embedder config. The `semanticRecall: { ... }` block was removed from the `Memory` constructor.
:::
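For reference, a hedged sketch of what the trimmed constructor might look like, assuming Mastra's `Memory` options shape (`lastMessages`, `semanticRecall`) and a `PostgresStore` taking a connection string; the field names are best-effort, not copied from SOMA's source:

```typescript
import { Memory } from "@mastra/memory";
import { PostgresStore } from "@mastra/pg";

// Sketch: conversational memory with semanticRecall left off, since the
// knowledge graph already handles semantic recall through tool calls.
const memory = new Memory({
  storage: new PostgresStore({ connectionString: process.env.DATABASE_URL! }),
  options: {
    lastMessages: 12,      // matches the retention in the table above
    semanticRecall: false, // no second embedder; graph tools cover recall
  },
});
```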
## Why two layers
Conversational memory answers "what did I just say two messages ago?". Knowledge memory answers "what do I know about Atomic Habits?". They have different:
- Lifetimes — conversational is bounded (12 messages); knowledge is unbounded.
- Granularity — conversational stores raw messages; knowledge stores extracted entities + facts.
- Access patterns — conversational is read back in full on every turn; knowledge is queried by semantic similarity.
Collapsing them would require either caching raw messages indefinitely (storage bloat) or discarding them immediately (losing multi-turn context). The split keeps both paths optimal.