Module: Llmemory::Tokenizer
- Defined in: lib/llmemory/tokenizer.rb
Overview
Shared word tokenizer for keyword search and lexical scoring (BM25, MMR). Centralizes the tokenization regex that was duplicated across the codebase.
Constant Summary
- WORD = /\b[a-z0-9]{2,}\b/
Class Method Summary
- .matches?(text, query) ⇒ Boolean
Lexical match used by storage-level keyword search.
- .tokenize(text) ⇒ Object
Class Method Details
.matches?(text, query) ⇒ Boolean
Lexical match used by storage-level keyword search. The query is split into tokens, and the text matches if it contains any one of them as a substring (a logical OR). Multi-word queries therefore work without the whole query appearing as one contiguous substring, while single-term and partial matches are preserved. An empty query (one that yields no tokens) matches everything, preserving the earlier “return all” behavior.
    # File 'lib/llmemory/tokenizer.rb', line 20
    def matches?(text, query)
      tokens = tokenize(query)
      return true if tokens.empty?
      haystack = text.to_s.downcase
      tokens.any? { |t| haystack.include?(t) }
    end
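As a rough illustration of the OR-of-substrings behavior, the sketch below re-declares the two methods in a stand-in module. The name `KeywordMatchSketch` and the use of `module_function` are assumptions made so the snippet runs on its own; the real module may expose its class methods differently.

```ruby
# Stand-in for Llmemory::Tokenizer (hypothetical name) so this runs without the gem.
module KeywordMatchSketch
  WORD = /\b[a-z0-9]{2,}\b/

  # Assumption: the methods are exposed as module-level (class) methods.
  module_function

  def tokenize(text)
    text.to_s.downcase.scan(WORD)
  end

  def matches?(text, query)
    tokens = tokenize(query)
    return true if tokens.empty?

    haystack = text.to_s.downcase
    tokens.any? { |t| haystack.include?(t) }
  end
end

doc = "The quick brown fox"

KeywordMatchSketch.matches?(doc, "brown dog")    # => true  ("brown" is present)
KeywordMatchSketch.matches?(doc, "dog")          # => false (no token is present)
KeywordMatchSketch.matches?(doc, "")             # => true  (no tokens: match all)
KeywordMatchSketch.matches?(doc, "x")            # => true  (1-char query yields no tokens)
KeywordMatchSketch.matches?("concatenate", "cat") # => true  (substring, not whole-word)
```

Because each token is tested with `include?`, a token can match inside a longer word, and a query made only of single characters or punctuation yields no tokens and therefore matches everything.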
.tokenize(text) ⇒ Object
    # File 'lib/llmemory/tokenizer.rb', line 11
    def tokenize(text)
      text.to_s.downcase.scan(WORD)
    end
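Reading the source, `tokenize` lowercases its input and returns every run of two or more ASCII letters or digits as an array of strings. A minimal sketch of that behavior, using a stand-in module (`TokenizerSketch` is a hypothetical name, and `module_function` is an assumption about how the method is exposed):

```ruby
# Stand-in for Llmemory::Tokenizer (hypothetical name) so this runs without the gem.
module TokenizerSketch
  WORD = /\b[a-z0-9]{2,}\b/

  # Assumption: the method is exposed as a module-level (class) method.
  module_function

  def tokenize(text)
    # to_s makes nil safe; scan returns all non-overlapping matches of WORD.
    text.to_s.downcase.scan(WORD)
  end
end

TokenizerSketch.tokenize("Hello, World! A b2 x") # => ["hello", "world", "b2"]
TokenizerSketch.tokenize("BM25 and MMR")         # => ["bm25", "and", "mmr"]
TokenizerSketch.tokenize(nil)                    # => []
```

Single-character words such as "A" and "x" are dropped by the `{2,}` quantifier, and punctuation never appears in a token.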