Class: RobotLab::DocumentStore

Inherits:
Object
Defined in:
lib/robot_lab/document_store.rb,
lib/robot_lab/document_store/version.rb

Overview

Embedding-based document store for semantic search over arbitrary text.

Documents are embedded using fastembed (BAAI/bge-small-en-v1.5 by default) and stored in memory. Queries are embedded the same way, then compared by cosine similarity to find the closest documents.
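The comparison step described above can be sketched in a few lines of Ruby. This is a minimal illustration, not the library's private implementation: the helper name sketch_cosine_similarity and the zero-vector guard are assumptions made for the example.

```ruby
# Hypothetical helper (not part of RobotLab): cosine similarity between
# two equal-length numeric vectors.
def sketch_cosine_similarity(a, b)
  raise ArgumentError, 'vectors must have the same length' unless a.size == b.size

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })

  # Assumption: treat a zero vector as "not similar" instead of dividing by zero.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

same       = sketch_cosine_similarity([1.0, 0.0], [1.0, 0.0]) # same direction
orthogonal = sketch_cosine_similarity([1.0, 0.0], [0.0, 1.0]) # nothing in common
```

Vectors pointing in the same direction score highest, which is why search results can be ranked by this value alone.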

The embedding model is initialised lazily on first use: the ONNX model file is downloaded during that first call and cached locally afterwards.

When fastembed is not installed, DocumentStore falls back to a lightweight TF-IDF word-frequency embedder. The fallback is lower quality (no semantic understanding, only lexical overlap) but works offline with no downloads, making it suitable for development and testing.
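To make "only lexical overlap" concrete, here is a rough sketch of a word-frequency embedder. It is deliberately simpler than what the fallback is described as doing: it uses plain term counts with no IDF weighting, and the names SKETCH_STOP_WORDS, term_frequencies and sparse_cosine are hypothetical, not part of RobotLab.

```ruby
require 'set'

# Small illustrative stop-word list (an assumption for this sketch).
SKETCH_STOP_WORDS = %w[a an the is are of in and to].to_set.freeze

# Hypothetical: map text to a sparse term-frequency vector (word => count).
def term_frequencies(text)
  text.downcase.scan(/[a-z0-9]+/)
      .reject { |word| SKETCH_STOP_WORDS.include?(word) }
      .tally
      .transform_values(&:to_f)
end

# Cosine similarity over sparse Hash vectors; missing words count as 0.0.
def sparse_cosine(a, b)
  dot    = a.sum { |word, weight| weight * b.fetch(word, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

doc     = term_frequencies('Q4 revenue came in at $4.2M')
overlap = sparse_cosine(doc, term_frequencies('revenue growth')) # shares "revenue"
no_hit  = sparse_cosine(doc, term_frequencies('sales increase')) # no shared words
```

The second query means roughly the same thing as the first but shares no words with the document, so a purely lexical embedder cannot match it. That is the quality gap the paragraph above refers to.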

Examples:

Standalone

store = RobotLab::DocumentStore.new
store.store(:q4_report, "Q4 revenue came in at $4.2M, up 18% YoY…")
store.store(:q3_report, "Q3 showed 15% growth, driven by APAC…")

results = store.search("revenue growth", limit: 2)
results.each { |r| puts "#{r[:key]} (#{r[:score].round(3)}): #{r[:text][0..60]}" }

With robot_lab Memory

memory.store_document(:readme, File.read("README.md"))
memory.search_documents("how to configure redis", limit: 3)

Constant Summary

FASTEMBED_AVAILABLE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

begin
  require 'fastembed'
  true
  # :nocov:
rescue LoadError
  false
  # :nocov:
end
DEFAULT_MODEL =

Default embedding model used when none is specified.

'BAAI/bge-small-en-v1.5'
STOP_WORDS =
%w[
  a an the is are was were be been being am do does did
  to of in and or but for with on at by from as into
  it its this that these those i you he she we they
  not no nor so yet
].to_set.freeze
VERSION =
'0.2.1'

Instance Method Summary

Constructor Details

#initialize(model_name: DEFAULT_MODEL) ⇒ DocumentStore

Returns a new instance of DocumentStore.

Parameters:

  • model_name (String) (defaults to: DEFAULT_MODEL)

    fastembed model name (ignored when fastembed is unavailable)



# File 'lib/robot_lab/document_store.rb', line 47

def initialize(model_name: DEFAULT_MODEL)
  @model_name      = model_name
  @documents       = {} # key (Symbol) => { text: String, vector: Array<Float> }
  @mutex           = Mutex.new
  @fastembed_model = nil # lazy: initialised on first embed call
  @using_fastembed = FASTEMBED_AVAILABLE
end
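The constructor only records the model name and leaves @fastembed_model as nil; the code that fills it in is not shown on this page. One common way to implement that kind of lazy, thread-safe initialisation is sketched below. The LazyResource class and its loader block are hypothetical stand-ins and involve no fastembed calls.

```ruby
# Hypothetical stand-in showing lazy, mutex-guarded initialisation.
class LazyResource
  attr_reader :load_count

  def initialize(&loader)
    @loader     = loader
    @mutex      = Mutex.new
    @resource   = nil
    @load_count = 0
  end

  # Builds the resource on the first call, then returns the cached instance.
  def get
    @mutex.synchronize do
      if @resource.nil?
        @resource = @loader.call
        @load_count += 1
      end
      @resource
    end
  end
end

model = LazyResource.new { 'expensive model' } # stand-in for a model download
first  = model.get
second = model.get
```

Constructing the object is cheap, and the expensive loader runs once no matter how many times get is called.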

Instance Method Details

#clear ⇒ self

Remove all stored documents.

Returns:

  • (self)


# File 'lib/robot_lab/document_store.rb', line 119

def clear
  @mutex.synchronize { @documents.clear }
  self
end

#delete(key) ⇒ self

Remove the document stored under key.

Parameters:

  • key (Symbol, String)

Returns:

  • (self)


# File 'lib/robot_lab/document_store.rb', line 112

def delete(key)
  @mutex.synchronize { @documents.delete(key.to_sym) }
  self
end

#empty? ⇒ Boolean

Whether the store contains no documents.

Returns:

  • (Boolean)


# File 'lib/robot_lab/document_store.rb', line 105

def empty?
  @mutex.synchronize { @documents.empty? }
end

#keys ⇒ Array<Symbol>

Keys of all stored documents.

Returns:

  • (Array<Symbol>)


# File 'lib/robot_lab/document_store.rb', line 99

def keys
  @mutex.synchronize { @documents.keys }
end

#search(query, limit: 5) ⇒ Array<Hash>

Search for documents semantically similar to query.

Parameters:

  • query (String)

    natural-language search query

  • limit (Integer) (defaults to: 5)

    maximum number of results to return

Returns:

  • (Array<Hash>)

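    results sorted by score descending. Each hash contains :key, :text, and :score (Float cosine similarity, documented as 0.0..1.0; unclamped cosine similarity can be negative for dense embedding vectors).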



# File 'lib/robot_lab/document_store.rb', line 75

def search(query, limit: 5)
  return [] if empty?

  query_vec = query_vector(query)
  results   = []

  @mutex.synchronize do
    @documents.each do |key, doc|
      score = cosine_similarity(query_vec, doc[:vector])
      results << { key: key, text: doc[:text], score: score }
    end
  end

  results.sort_by { |r| -r[:score] }.first(limit)
end
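The final ranking step can be tried in isolation on hand-written result hashes. The keys and scores below are made up for illustration; only the sort_by and first(limit) expression comes from the method above.

```ruby
# Made-up results in the same shape that #search builds internally.
results = [
  { key: :a, text: 'alpha', score: 0.12 },
  { key: :b, text: 'beta',  score: 0.87 },
  { key: :c, text: 'gamma', score: 0.45 }
]

# Negating the score sorts highest first; first(limit) keeps the top entries.
top_two = results.sort_by { |r| -r[:score] }.first(2)
top_two.map { |r| r[:key] } # => [:b, :c]

# A limit larger than the number of documents simply returns everything.
all_ranked = results.sort_by { |r| -r[:score] }.first(10)
```

Every stored document is scored on each query, so search time grows linearly with the size of the store.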

#size ⇒ Integer

Number of stored documents.

Returns:

  • (Integer)


# File 'lib/robot_lab/document_store.rb', line 93

def size
  @mutex.synchronize { @documents.size }
end

#store(key, text) ⇒ self

Embed text and store it under key.

If a document already exists under key it is replaced.

Parameters:

  • key (Symbol, String)

    identifier for this document

  • text (String)

    the document text to embed and store

Returns:

  • (self)


# File 'lib/robot_lab/document_store.rb', line 62

def store(key, text)
  key    = key.to_sym
  vector = passage_vector(text)
  @mutex.synchronize { @documents[key] = { text: text, vector: vector } }
  self
end
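Because keys pass through to_sym and the mutating methods return self, String and Symbol keys address the same document and calls can be chained. The stand-in below mirrors only that key handling so it can run without the gem. KeyedStoreSketch and text_for are hypothetical, and the sketch skips embedding and locking entirely.

```ruby
# Hypothetical stand-in that mirrors only the key handling shown above.
class KeyedStoreSketch
  def initialize
    @documents = {}
  end

  # Replaces any existing entry under the same (symbolised) key.
  def store(key, text)
    @documents[key.to_sym] = { text: text }
    self
  end

  def delete(key)
    @documents.delete(key.to_sym)
    self
  end

  def keys
    @documents.keys
  end

  # Helper for this sketch only; DocumentStore exposes text through #search.
  def text_for(key)
    @documents.dig(key.to_sym, :text)
  end
end

sketch = KeyedStoreSketch.new
sketch.store('readme', 'first draft').store(:readme, 'second draft')
stored_keys = sketch.keys              # one entry: "readme" and :readme coincide
latest_text = sketch.text_for(:readme) # the later text replaced the earlier one
```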