Release

Releasing lexsim — a dictionary-free multilingual similarity engine

We have published lexsim, a Rust crate that scores text similarity in multiple languages without a morphological dictionary. It powers memory retrieval and deduplication in handoff-mcp.

Searching Handoff's memory

While building Handoff, we needed a way to search and deduplicate memories across sessions. Say you saved a lesson like "always use atomic_write for handoff files" in a previous session — next time you hit a similar situation, you want that lesson surfaced. And if you are about to write the same lesson again, you want to know it already exists.

Most NLP libraries that handle Japanese require a multi-megabyte morphological dictionary, which felt too heavy for an MCP server. We wrote lexsim to handle search and deduplication without any dictionary at all.

Two questions, one tokenizer

lexsim answers two straightforward questions:

  • "Are these the same?" is answered by Jaccard similarity. Use it before saving a memory to check for duplicates.
  • "Is this relevant?" is answered by BM25 ranking. Use it to pull past memories related to the task at hand.

Both sit on the same tokenizer, so duplicate detection and search always agree on what counts as a "term." That removes one class of mismatch: a term that dedup can see but search cannot, or vice versa.
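To make the pairing concrete, here is a minimal sketch in Rust of both scores built on one shared tokenizer. It is not lexsim's API: `tokenize`, `jaccard`, and `bm25_scores` are hypothetical names, the whitespace tokenizer is a stand-in, and the BM25 constants k1 = 1.2 and b = 0.75 are common defaults assumed here rather than values taken from lexsim.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in tokenizer (an assumption for this sketch): lowercase, split on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        .map(str::to_owned)
        .collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over the sets of distinct tokens.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<String> = tokenize(a).into_iter().collect();
    let sb: HashSet<String> = tokenize(b).into_iter().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0; // both texts empty: treated as "not similar" in this sketch
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Okapi BM25 score of every document against the query, in input order.
/// Uses the IDF variant ln((N - n + 0.5) / (n + 0.5) + 1), which never goes negative.
fn bm25_scores(query: &str, docs: &[&str]) -> Vec<f64> {
    const K1: f64 = 1.2; // assumed default
    const B: f64 = 0.75; // assumed default

    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n = tokenized.len() as f64;
    let avg_len = tokenized.iter().map(Vec::len).sum::<usize>() as f64 / n;

    // Document frequency: in how many documents each term occurs.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &tokenized {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms: HashSet<String> = tokenize(query).into_iter().collect();
    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    if tf == 0.0 {
                        return 0.0; // term absent from this document
                    }
                    let n_t = df.get(term.as_str()).copied().unwrap_or(0) as f64;
                    let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}
```

Calling `jaccard` on a candidate memory and each stored one gives a duplicate check before saving, while `bm25_scores` with the current task as the query gives a ranking for retrieval. Because both go through `tokenize`, swapping the tokenizer changes both behaviors at once.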

No dictionary, by design

For languages like Japanese that lack whitespace word boundaries, lexsim slides a two-character window across the text to produce overlapping character pairs (bigrams). "メモリ機能" becomes "メモ", "モリ", "リ機", "機能", with no dictionary needed to decide where words start and end. Normalization absorbs full/half-width variation, and character n-grams let identifiers and proper nouns match when they appear verbatim in texts written in different languages.
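The sliding window is easy to sketch in Rust. The functions below are illustrative and not lexsim's implementation: `normalize` and `bigrams` are hypothetical names, and the normalization step is deliberately narrow, folding only full-width ASCII and the ideographic space to their half-width forms before lowercasing. Half-width katakana and other compatibility forms would need full Unicode normalization, which this sketch leaves out.

```rust
/// Fold full-width ASCII (U+FF01..=U+FF5E) and the ideographic space (U+3000)
/// to their half-width equivalents, then lowercase.
/// A narrow stand-in for real Unicode normalization.
fn normalize(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            '\u{3000}' => ' ',
            _ => c,
        })
        .flat_map(char::to_lowercase)
        .collect()
}

/// Overlapping two-character windows over the normalized text.
/// Works on `char`s, not bytes, so multi-byte characters are never split.
/// Texts shorter than two characters produce no bigrams.
fn bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = normalize(text).chars().collect();
    chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

With these definitions, `bigrams("メモリ機能")` yields the four pairs listed above, and `normalize("ＡＰＩ")` yields `"api"`, so a full-width and a half-width spelling of the same identifier end up with identical bigrams.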

Without a dictionary, lexsim cannot detect paraphrases that share no surface tokens — even within the same language. "消す" (erase) and "削除する" (delete) look different to it, despite meaning the same thing. Across languages the gap is wider still: "削除する" and "delete" will not match. Shared identifiers and proper nouns still come through, but translation-level similarity is out of scope. If the need arises, we may develop a separate dictionary crate down the line.
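The limitation follows directly from the arithmetic. As a rough illustration in Rust (again with hypothetical helper names, `bigram_set` and `bigram_jaccard`, and without any of lexsim's normalization), Jaccard similarity over character bigrams is zero whenever two texts share no adjacent character pair, and stays well above zero when they share an identifier:

```rust
use std::collections::HashSet;

/// Set of overlapping two-character windows in the text.
fn bigram_set(text: &str) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Jaccard similarity over character bigrams: shared pairs / all distinct pairs.
fn bigram_jaccard(a: &str, b: &str) -> f64 {
    let (sa, sb) = (bigram_set(a), bigram_set(b));
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0; // neither text has a bigram
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

Here `bigram_jaccard("消す", "削除する")` and `bigram_jaccard("削除する", "delete")` both come out as exactly 0.0, because the inputs have no bigram in common. A pair such as "atomic_write を使う" and "use atomic_write" scores above zero, since every bigram inside the shared identifier appears on both sides.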

Availability

lexsim is the same code that runs inside Handoff's memory engine, extracted into a standalone crate. It is published on crates.io and GitHub.

See the lexsim case study