Module: Llmemory::Tokenizer

Defined in: lib/llmemory/tokenizer.rb

Overview

Shared word tokenizer for keyword search and lexical scoring (BM25, MMR). Centralizes the tokenization regex that was duplicated across the codebase.

Constant Summary

WORD = /\b[a-z0-9]{2,}\b/

Class Method Summary

.matches?(text, query) ⇒ Boolean

Lexical match used by storage-level keyword search.
.tokenize(text) ⇒ Object

Class Method Details

.matches?(text, query) ⇒ `Boolean`

Lexical match used by storage-level keyword search. The query is split into tokens, and the text matches if it contains any one of those tokens as a case-insensitive substring (an OR across tokens). Multi-word queries therefore work without the whole query having to appear as a single contiguous substring, while single-term and partial matches are preserved. A query that yields no tokens, such as an empty string, matches everything, which keeps the prior “return all” behavior.

Returns:

(Boolean)

# File 'lib/llmemory/tokenizer.rb', line 20

def matches?(text, query)
  tokens = tokenize(query)
  return true if tokens.empty?
  haystack = text.to_s.downcase
  tokens.any? { |t| haystack.include?(t) }
end
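
The OR-of-substrings behavior described above can be tried in isolation. The sketch below copies the two method bodies from this page into a hypothetical standalone module, `MatcherSketch` (the name and the sample strings are assumptions for illustration, not part of Llmemory), so it runs without loading the library.

```ruby
# Hypothetical stand-in for Llmemory::Tokenizer, for illustration only.
module MatcherSketch
  WORD = /\b[a-z0-9]{2,}\b/

  def self.tokenize(text)
    text.to_s.downcase.scan(WORD)
  end

  def self.matches?(text, query)
    tokens = tokenize(query)
    return true if tokens.empty?          # no usable tokens: match everything
    haystack = text.to_s.downcase
    tokens.any? { |t| haystack.include?(t) }
  end
end

doc = "Deploy the billing service"

MatcherSketch.matches?(doc, "billing outage") # => true  (one token is enough)
MatcherSketch.matches?(doc, "outage")         # => false (no token found)
MatcherSketch.matches?(doc, "bill")           # => true  (substring of "billing")
MatcherSketch.matches?(doc, "BILLING")        # => true  (query is lowercased)
MatcherSketch.matches?(doc, "")               # => true  (empty query)
MatcherSketch.matches?(doc, "x")              # => true  ("x" yields no tokens)
```

Two consequences follow from the code as shown. Tokens are matched as substrings of the text, not as whole words, so a short token can match inside a longer word. And a query made up only of single characters or punctuation produces no tokens, so it is treated like an empty query.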

.tokenize(text) ⇒ `Object`



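Lowercases the input (after converting it with to_s) and returns every match of WORD: each run of two or more ASCII letters or digits delimited by word boundaries.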

# File 'lib/llmemory/tokenizer.rb', line 11

def tokenize(text)
  text.to_s.downcase.scan(WORD)
end
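
As a quick illustration of what the WORD pattern keeps and drops, the sketch below reproduces the method in a hypothetical standalone module, `TokenizerSketch` (the module name and inputs are assumptions, chosen only so the snippet runs without the library).

```ruby
# Hypothetical stand-in for Llmemory::Tokenizer, for illustration only.
module TokenizerSketch
  WORD = /\b[a-z0-9]{2,}\b/

  def self.tokenize(text)
    text.to_s.downcase.scan(WORD)
  end
end

# Punctuation is dropped, case is folded, one-character words are skipped.
TokenizerSketch.tokenize("Hello, World! A 42") # => ["hello", "world", "42"]

# Hyphens split a word into separate tokens.
TokenizerSketch.tokenize("re-index")           # => ["re", "index"]

# nil is tolerated because of the to_s call.
TokenizerSketch.tokenize(nil)                  # => []
```

Because the pattern contains no capture groups, `String#scan` returns a flat array of matched strings, in the order they appear in the input.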