How good is Crowkis agent memory, really? The LoCoMo and LongMemEval numbers
We ran Crowkis memory against two public, demanding retrieval benchmarks, LoCoMo (from Snap Research) and LongMemEval, on a laptop with no cloud calls. Here are the recall numbers, by question type, with the reranker on and off.
Agent memory is easy to demo and hard to measure. Anyone can store a fact and read it back; the real question is whether the right fact surfaces when the question is phrased differently, asked months later, or buried under fifty other sessions. So we stopped trusting our own demos and ran the two benchmarks the memory field actually argues about: LoCoMo, from Snap Research, and LongMemEval.
LoCoMo is 10 multi-session dialogues with 1,986 question-answer pairs, each tagged with the exact conversation turns that contain the evidence. It is deliberately nasty: single-hop lookups, temporal reasoning, multi-hop chains, and open-domain questions. Here is Crowkis retrieval recall@10, with the bi-encoder alone versus the bi-encoder plus the bundled cross-encoder reranker.
Reranking lifts overall recall@10 from ~25% (bi-encoder only) to 70.4%, roughly a 2.8× gain.
That reranker delta is the whole story of why Crowkis memory ships a second model. A bi-encoder embeds the query and the facts independently and compares them — fast, but blunt. The cross-encoder reads the query and each candidate together, which is slower but far sharper; Crowkis runs it only over the top candidates so the cost stays bounded. The result is the difference between a memory that mostly works and one you can build on.
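The two-stage pattern described above can be sketched in a few lines. This is a minimal illustration of retrieve-then-rerank in general, not Crowkis's actual code: `retrieve_then_rerank`, `toy_embed`, `toy_cross_score`, and the `shortlist` and `top_k` parameters are all hypothetical names, and the toy scoring functions stand in for the real bi-encoder and cross-encoder models.

```python
from typing import Callable, Sequence

# Hypothetical toy vocabulary; a real bi-encoder produces dense learned vectors.
VOCAB = ["alice", "moved", "berlin", "paris", "likes", "tea"]


def toy_embed(text: str) -> list[float]:
    """Stand-in for a bi-encoder: embeds one text on its own."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_cross_score(query: str, fact: str) -> float:
    """Stand-in for a cross-encoder: scores the query and one candidate together."""
    q, f = set(query.lower().split()), set(fact.lower().split())
    return len(q & f) / len(q | f)


def retrieve_then_rerank(
    query: str,
    facts: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    cross_score: Callable[[str, str], float],
    shortlist: int = 50,
    top_k: int = 10,
) -> list[str]:
    q_vec = embed(query)

    def similarity(fact: str) -> float:
        # Dot product between independently computed embeddings.
        return sum(a * b for a, b in zip(q_vec, embed(fact)))

    # Stage 1: cheap scoring over every stored fact, keep only a shortlist.
    candidates = sorted(facts, key=similarity, reverse=True)[:shortlist]
    # Stage 2: expensive pairwise scoring, but only over the shortlist.
    reranked = sorted(candidates, key=lambda f: cross_score(query, f), reverse=True)
    return reranked[:top_k]


facts = ["alice likes tea", "alice moved to berlin", "bob moved to paris"]
top = retrieve_then_rerank(
    "alice moved where", facts, toy_embed, toy_cross_score, shortlist=2, top_k=1
)
print(top)  # ['alice moved to berlin']
```

The point of the sketch is the cost profile: the pairwise scorer runs at most `shortlist` times per query, however many facts are stored.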
LongMemEval: the harder, longer test
LongMemEval stretches the context to breaking: its hard split averages ~49 sessions and ~500 turns per question. We measured recall@5 by question type — the kinds of recall an actual assistant needs.
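For readers who want to reproduce this kind of breakdown, here is a rough sketch of recall@k grouped by question type. It assumes recall@k means the fraction of a question's annotated evidence items that appear among the top k retrieved items; other definitions exist (for example, counting a hit when any single evidence item is retrieved), and this post does not pin down which variant the harness uses. The function names, record layout, and sample records below are invented for illustration and are not benchmark data.

```python
from collections import defaultdict


def recall_at_k(retrieved_ids, evidence_ids, k):
    """Fraction of gold evidence items found in the top-k retrieved list."""
    evidence = set(evidence_ids)
    if not evidence:
        raise ValueError("question has no evidence annotations")
    return len(evidence & set(retrieved_ids[:k])) / len(evidence)


def recall_by_type(results, k):
    """Average recall@k per question type.

    `results` is an iterable of dicts with the (hypothetical) keys
    'question_type', 'retrieved' (ranked ids) and 'evidence' (gold ids).
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[r["question_type"]].append(
            recall_at_k(r["retrieved"], r["evidence"], k)
        )
    return {qtype: sum(scores) / len(scores) for qtype, scores in buckets.items()}


# Made-up records, only to show the shape of the computation.
sample = [
    {"question_type": "temporal", "retrieved": ["t3", "t9", "t1"], "evidence": ["t1"]},
    {"question_type": "temporal", "retrieved": ["t4", "t2"], "evidence": ["t7"]},
    {"question_type": "multi-hop", "retrieved": ["t5", "t6", "t8"], "evidence": ["t5", "t8"]},
]
print(recall_by_type(sample, k=3))  # {'temporal': 0.5, 'multi-hop': 1.0}
```

Averaging within each type before reporting keeps a large, easy category from masking a small, hard one.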
84.3% recall@5 on the stratified hard split; 92.7% in oracle mode over 479 focused questions.
The shape of these numbers is honest and useful. Temporal reasoning and knowledge-update score highest because consolidation is doing its job — when a fact changes, Crowkis retires the old version, so 'what's true now' is a clean lookup. Preference questions score lowest because preferences are subtle and rarely restated; that's the frontier we're working on, and we'd rather show you the 60% than hide it.
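To make the consolidation idea concrete, here is a small sketch of "retire the old version when a fact changes". It is an assumption about the general technique, not a description of Crowkis internals: the `Fact` and `MemoryStore` classes, the (subject, attribute) key, and the method names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    retired: bool = False  # retired facts are kept for history, not served


@dataclass
class MemoryStore:
    facts: list = field(default_factory=list)

    def remember(self, subject: str, attribute: str, value: str) -> Fact:
        """Store a fact; if it contradicts the active one, retire the old version."""
        for fact in self.facts:
            same_key = (fact.subject, fact.attribute) == (subject, attribute)
            if same_key and not fact.retired:
                if fact.value == value:
                    return fact  # restatement, nothing to consolidate
                fact.retired = True
        new_fact = Fact(subject, attribute, value)
        self.facts.append(new_fact)
        return new_fact

    def current(self, subject: str, attribute: str) -> Optional[str]:
        """'What's true now': return the single active value, if any."""
        for fact in self.facts:
            same_key = (fact.subject, fact.attribute) == (subject, attribute)
            if same_key and not fact.retired:
                return fact.value
        return None


store = MemoryStore()
store.remember("alice", "city", "Paris")
store.remember("alice", "city", "Berlin")
print(store.current("alice", "city"))  # Berlin
```

Because the superseded fact is flagged rather than deleted, a knowledge-update question resolves to one active value while the history stays available.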
Every number here was produced on a developer laptop with the bundled ONNX models. Nothing left the machine. No API key was involved.
That last point is the part that matters for adoption. Hosted memory services post comparable recall — but they read your users' conversations to do it. Crowkis hits these numbers with a self-hosted binary and two small bundled models, which means agent memory is now something you can run in an air-gapped environment, under a compliance regime, or just without wiring your private conversations through someone else's servers.