Benchmarks · June 24, 2026 · 9 min read

How good is Crowkis agent memory, really? The LoCoMo and LongMemEval numbers

We ran Crowkis memory against two public, hostile retrieval benchmarks, LoCoMo (from SNAP Research) and LongMemEval, on a laptop with no cloud calls. Here are the recall numbers, by question type, with the reranker on and off.

Agent memory is easy to demo and hard to measure. Anyone can store a fact and read it back; the real question is whether the right fact surfaces when the question is phrased differently, asked months later, or buried under fifty other sessions. So we stopped trusting our own demos and ran the two benchmarks the memory field actually argues about: LoCoMo, from SNAP Research, and LongMemEval.

In plain words, recall@k asks: when the agent searches its memory, does the correct fact show up in the top k results? Higher is better. It's the number that largely decides whether your agent answers from memory or hallucinates.
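To make the metric concrete, here is a minimal sketch of how recall@k can be computed. It assumes a question counts as a hit when any of its gold evidence turns appears in the top k results; the benchmarks' official scorers may define it differently (for example, as the fraction of evidence turns retrieved). The function names and turn ids are illustrative, not part of Crowkis.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> bool:
    """True if any gold evidence id appears among the first k retrieved ids."""
    return any(doc_id in gold_ids for doc_id in ranked_ids[:k])


def recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Fraction of questions with at least one gold id in the top k.

    Each run is (ranked ids returned by the memory search, gold evidence ids).
    Assumption: "hit if any gold id is present", as described in the lead-in.
    """
    runs = list(runs)
    if not runs:
        return 0.0
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits / len(runs)


# Three made-up questions with made-up turn ids.
example_runs = [
    (["t7", "t2", "t9"], {"t2"}),  # gold turn at rank 2
    (["t1", "t4", "t5"], {"t5"}),  # gold turn at rank 3
    (["t3", "t8", "t6"], {"t0"}),  # gold turn never retrieved
]
print(recall_at_k(example_runs, k=2))
print(recall_at_k(example_runs, k=3))
```

With k=2 only the first question is a hit; widening to k=3 also catches the second, which is why recall@5 and recall@10 figures are not directly comparable.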

LoCoMo is 10 multi-session dialogues with 1,986 question-answer pairs, each tagged with the exact conversation turns that contain the evidence. It is deliberately nasty: single-hop lookups, temporal reasoning, multi-hop chains, and open-domain questions. Here is Crowkis retrieval recall@10 by question type with the bundled cross-encoder reranker on; the bi-encoder-only baseline follows the chart.

LoCoMo recall@10 with cross-encoder rerank (% recall@10)

Overall: 70.4%
Single-hop: 73.7%
Temporal: 71.3%
Multi-hop: 67%
Open-domain: 47.8%

Reranking lifts overall recall from ~25% (bi-encoder only) to 70.4% — roughly a 3× gain.

That reranker delta is the whole story of why Crowkis memory ships a second model. A bi-encoder embeds the query and the facts independently and compares them — fast, but blunt. The cross-encoder reads the query and each candidate together, which is slower but far sharper; Crowkis runs it only over the top candidates so the cost stays bounded. The result is the difference between a memory that mostly works and one you can build on.
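The two-stage shape described above can be sketched in a few lines. This is a toy illustration, not Crowkis's implementation: the bag-of-words `embed` stands in for a real bi-encoder, the bigram-overlap `toy_joint_score` stands in for a real cross-encoder, and every name here is hypothetical. What it does show is the structure: a cheap independent comparison over all facts, then a more expensive joint scorer over only the top candidates.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    # Toy bi-encoder stand-in: each text is encoded on its own, ignoring word order.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, facts: List[str], top_n: int) -> List[str]:
    """Stage 1: score every fact against the query independently, keep the top_n."""
    q = embed(query)
    return sorted(facts, key=lambda f: cosine(q, embed(f)), reverse=True)[:top_n]


def toy_joint_score(query: str, candidate: str) -> float:
    # Toy cross-encoder stand-in: looks at query and candidate together
    # and counts the word bigrams they share, so word order matters.
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(candidate)))


def rerank(query: str, candidates: List[str],
           joint_score: Callable[[str, str], float], k: int) -> List[str]:
    """Stage 2: run the slower joint scorer over the shortlist only."""
    return sorted(candidates, key=lambda c: joint_score(query, c), reverse=True)[:k]


facts = [
    "berlin moved alice to tears",
    "alice moved to berlin in march",
    "alice likes tea",
    "bob likes coffee",
]
query = "alice moved to berlin"
shortlist = retrieve(query, facts, top_n=2)
print(shortlist)                                       # stage 1 order
print(rerank(query, shortlist, toy_joint_score, k=2))  # stage 2 order
```

In this made-up example the order-blind first stage ranks "berlin moved alice to tears" above the relevant fact, and the joint scorer swaps them. The joint scorer is called only `top_n` times per query regardless of how many facts are stored, which is what keeps the cost bounded.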

LongMemEval: the harder, longer test

LongMemEval stretches the context to breaking: its hard split averages ~49 sessions and ~500 turns per question. We measured recall@5 by question type — the kinds of recall an actual assistant needs.

LongMemEval-S (hard) recall@5 by question type (% recall@5)

Temporal reasoning: 95.5%
Knowledge update: 90.9%
Single-session user: 81.8%
Multi-session: 76.2%
Single-session preference: 60%

84.3% recall@5 on the stratified hard split; 92.7% in oracle mode over 479 focused questions.

The shape of these numbers is honest and useful. Temporal reasoning and knowledge-update score highest because consolidation is doing its job — when a fact changes, Crowkis retires the old version, so 'what's true now' is a clean lookup. Preference questions score lowest because preferences are subtle and rarely restated; that's the frontier we're working on, and we'd rather show you the 60% than hide it.
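As a rough illustration of why retiring old versions makes "what's true now" a clean lookup, here is a minimal sketch of one way consolidation could work. The `ToyMemory` class, its methods, and the (subject, attribute) keying are all assumptions made for this example; they are not Crowkis's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # hypothetical (subject, attribute) key


@dataclass
class Fact:
    key: Key
    value: str
    session: int
    retired: bool = False


@dataclass
class ToyMemory:
    history: List[Fact] = field(default_factory=list)    # every version ever stored
    current: Dict[Key, Fact] = field(default_factory=dict)  # one live fact per key

    def remember(self, subject: str, attribute: str, value: str, session: int) -> None:
        key = (subject, attribute)
        old = self.current.get(key)
        if old is not None:
            if old.value == value:
                return  # restated, nothing changed
            old.retired = True  # the fact changed: retire the old version
        fact = Fact(key, value, session)
        self.history.append(fact)
        self.current[key] = fact

    def whats_true_now(self, subject: str, attribute: str) -> Optional[str]:
        # Only live facts are consulted, so stale versions cannot win the lookup.
        fact = self.current.get((subject, attribute))
        return fact.value if fact else None


memory = ToyMemory()
memory.remember("user", "city", "Lisbon", session=3)
memory.remember("user", "city", "Berlin", session=41)
print(memory.whats_true_now("user", "city"))
print([(f.value, f.retired) for f in memory.history])
```

The older version stays in the history but is marked retired, so a knowledge-update question only ever competes against one live candidate. Preferences are harder to handle this way because they are rarely stated as a clean key and value.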

Every number here was produced on a developer laptop with the bundled ONNX models. Nothing left the machine. No API key was involved.

That last point is the part that matters for adoption. Hosted memory services post comparable recall — but they read your users' conversations to do it. Crowkis hits these numbers with a self-hosted binary and two small bundled models, which means agent memory is now something you can run in an air-gapped environment, under a compliance regime, or just without wiring your private conversations through someone else's servers.