Module: HTM::Observability

Defined in:
lib/htm/observability.rb

Overview

Observability module for monitoring and metrics collection

Provides comprehensive monitoring of HTM components including:

  • Connection pool health monitoring with alerts

  • Query timing and performance metrics

  • Cache efficiency tracking

  • Service health checks

  • Memory usage statistics

Examples:

Basic usage

stats = HTM::Observability.collect_all
puts stats[:connection_pool][:status]  # => :healthy

Connection pool monitoring

pool_stats = HTM::Observability.connection_pool_stats
if pool_stats[:status] == :exhausted
  logger.error "Connection pool exhausted!"
end

Health check

if HTM::Observability.healthy?
  puts "All systems operational"
else
  puts "Health check failed: #{HTM::Observability.health_check[:issues]}"
end

Constant Summary

POOL_WARNING_THRESHOLD =

Connection pool utilization threshold: 75% utilization triggers a warning

0.75
POOL_CRITICAL_THRESHOLD =

Connection pool utilization threshold: 90% utilization is treated as critical

0.90
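
The source listings on this page do not show where these two constants are consulted, so the following is only a minimal sketch of how a utilization ratio could be classified against them. The helper name `classify_utilization` and the use of `>=` comparisons are assumptions made for illustration, not part of HTM.

```ruby
# Values copied from the constants documented above.
POOL_WARNING_THRESHOLD = 0.75
POOL_CRITICAL_THRESHOLD = 0.90

# Hypothetical helper: map a utilization ratio (0.0-1.0) to a status symbol.
# Whether HTM compares with >= or > is not shown in the source; >= is assumed.
def classify_utilization(ratio)
  if ratio >= POOL_CRITICAL_THRESHOLD
    :critical
  elsif ratio >= POOL_WARNING_THRESHOLD
    :warning
  else
    :healthy
  end
end

puts classify_utilization(0.50)
puts classify_utilization(0.80)
puts classify_utilization(0.95)
```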


Class Method Details

.cache_stats ⇒ Hash?

Get query cache statistics

Returns:

  • (Hash, nil)

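    Cache stats, or nil if unavailable. The implementation listed here returns only a placeholder hash whose :info entry points to LongTermMemory#stats[:cache].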



# File 'lib/htm/observability.rb', line 127

def cache_stats
  # Try to access LongTermMemory cache stats
  # Note: This requires access to an LTM instance
  {
    info: "Cache stats available via LongTermMemory#stats[:cache]"
  }
end

.circuit_breaker_stats ⇒ Hash

Get circuit breaker states for all services

Returns:

  • (Hash)

    Circuit breaker states, or { error: message } if an exception is raised:

    • :embedding_service - State, failure count, and last failure time

    • :tag_service - State, failure count, and last failure time



# File 'lib/htm/observability.rb', line 141

def circuit_breaker_stats
  stats = {}

  if defined?(HTM::EmbeddingService)
    cb = HTM::EmbeddingService.circuit_breaker
    stats[:embedding_service] = {
      state: cb.state,
      failure_count: cb.failure_count,
      last_failure_time: cb.last_failure_time
    }
  end

  if defined?(HTM::TagService)
    cb = HTM::TagService.circuit_breaker
    stats[:tag_service] = {
      state: cb.state,
      failure_count: cb.failure_count,
      last_failure_time: cb.last_failure_time
    }
  end

  stats
rescue StandardError => e
  { error: e.message }
end
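
As an illustration of consuming this return value, the sketch below picks out services whose breaker is not closed. It assumes that breaker states are symbols and that `:closed` is the healthy state; neither detail is confirmed by the listing above. The helper name `tripped_breakers` and the sample data are hypothetical.

```ruby
# Hypothetical helper: list services whose circuit breaker is not closed.
# Assumes states are symbols and that :closed means the service is healthy.
def tripped_breakers(stats)
  # circuit_breaker_stats returns { error: message } when it rescues an exception.
  return [] if stats.key?(:error)

  stats.reject { |_service, cb| cb[:state] == :closed }.keys
end

# Made-up sample shaped like the hash documented above.
SAMPLE_BREAKER_STATS = {
  embedding_service: { state: :closed, failure_count: 0, last_failure_time: nil },
  tag_service: { state: :open, failure_count: 5, last_failure_time: Time.now }
}.freeze

tripped_breakers(SAMPLE_BREAKER_STATS).each do |service|
  puts "Circuit breaker tripped for #{service}"
end
```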

.collect_all ⇒ Hash

Collect all observability metrics

Returns:

  • (Hash)

    Comprehensive metrics including:

    • :connection_pool - Pool stats with health status

    • :cache - Query cache hit rates and size

    • :circuit_breakers - Service circuit breaker states

    • :query_timings - Recent query performance

    • :service_timings - Embedding/tag generation times

    • :memory_usage - Process memory (RSS) and GC stats

    • :collected_at - Time the metrics were collected



# File 'lib/htm/observability.rb', line 53

def collect_all
  {
    connection_pool: connection_pool_stats,
    cache: cache_stats,
    circuit_breakers: circuit_breaker_stats,
    query_timings: query_timing_stats,
    service_timings: service_timing_stats,
    memory_usage: memory_stats,
    collected_at: Time.now
  }
end

.connection_pool_stats ⇒ Hash

Get connection pool statistics with health status

Returns:

  • (Hash)

    Pool statistics including:

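    • :size - Maximum pool size

    • :connections - Current total connections

    • :in_use - Connections currently checked out (always 0 in the listed implementation, because Sequel does not expose a checked-out count)

    • :available - Connections available for checkout (reported as the maximum pool size)

    • :waiting - Number of threads waiting for a connection

    • :utilization - Waiting threads as a percentage of the maximum pool size (0.0 when no threads are waiting)

    • :status - Health status (:healthy, :critical, :exhausted; :unavailable when the database is not connected; :error if an exception is raised)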



# File 'lib/htm/observability.rb', line 76

def connection_pool_stats
  return { status: :unavailable, message: "Database not connected" } unless connected?

  pool = HTM.db.pool

  # Sequel's TimedQueueConnectionPool API:
  # - max_size: maximum pool size
  # - size: number of connections currently in pool (pre-allocated up to max_size)
  # - num_waiting: threads waiting for connections
  #
  # Unlike ActiveRecord, Sequel pre-allocates connections. The key health indicator
  # is num_waiting - if threads are waiting, the pool is under stress.
  max_size = pool.max_size
  current_size = pool.size
  waiting = pool.num_waiting

  # For Sequel's TimedQueueConnectionPool:
  # - Pool is healthy if no threads are waiting
  # - Pool is critical only if threads are waiting for connections
  # - size == max_size is normal (pre-allocated pool), not a problem
  status = if waiting.positive?
             waiting > max_size / 2 ? :exhausted : :critical
           else
             :healthy
           end

  # Utilization based on waiting threads (pool stress indicator)
  utilization = waiting.positive? ? ((waiting.to_f / max_size) * 100).round(2) : 0.0

  stats = {
    size: max_size,
    connections: current_size,
    in_use: 0,  # Sequel doesn't expose checked-out count; use waiting as stress indicator
    available: max_size,  # All connections are available when not waiting
    waiting: waiting,
    utilization: utilization,
    status: status
  }

  # Log warnings if pool is stressed
  log_pool_status(stats)

  stats
rescue StandardError => e
  { status: :error, message: e.message }
end
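
To make the status and utilization arithmetic concrete, the sketch below extracts the same two expressions from the listing into a standalone function. The function name `pool_status` is hypothetical, and plain integers stand in for the values read from the Sequel pool.

```ruby
# Standalone copy of the status/utilization logic from connection_pool_stats.
# waiting and max_size stand in for pool.num_waiting and pool.max_size.
def pool_status(waiting, max_size)
  status = if waiting.positive?
             # Integer division: with max_size 10 the cutoff is 5 waiting threads.
             waiting > max_size / 2 ? :exhausted : :critical
           else
             :healthy
           end

  utilization = waiting.positive? ? ((waiting.to_f / max_size) * 100).round(2) : 0.0

  { status: status, utilization: utilization }
end

p pool_status(0, 10)
p pool_status(3, 10)
p pool_status(6, 10)
```

With a maximum pool size of 10, no waiting threads gives :healthy at 0.0, three waiting threads give :critical at 30.0, and six waiting threads give :exhausted at 60.0.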

.health_check ⇒ Hash

Perform comprehensive health check

Returns:

  • (Hash)

    Health check results:

    • :healthy - Boolean overall health status

    • :checks - Individual check results

    • :issues - Array of identified issues

    • :checked_at - Time the check was performed



# File 'lib/htm/observability.rb', line 252

def health_check
  checks = {}
  issues = []
  check_database(checks, issues)
  check_pool(checks, issues)
  check_circuit_breakers(checks, issues)
  check_extensions(checks, issues) if connected?
  { healthy: issues.empty?, checks: checks, issues: issues, checked_at: Time.now }
end
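
The check_* helpers called by health_check are not shown on this page. The sketch below illustrates the accumulator pattern they appear to follow, where each check writes its result into a shared `checks` hash and appends a message to a shared `issues` array. The helper bodies, the `pool_stats` parameter, and the wording of the issue message are assumptions.

```ruby
# Hypothetical check following the (checks, issues) accumulator signature.
# pool_stats is passed in so the sketch runs without a database.
def check_pool_sketch(checks, issues, pool_stats)
  ok = pool_stats[:status] == :healthy
  checks[:connection_pool] = ok
  issues << "Connection pool status: #{pool_stats[:status]}" unless ok
end

# Assemble a result hash with the same keys health_check returns.
def health_check_sketch(pool_stats)
  checks = {}
  issues = []
  check_pool_sketch(checks, issues, pool_stats)
  { healthy: issues.empty?, checks: checks, issues: issues, checked_at: Time.now }
end

result = health_check_sketch({ status: :critical })
puts result[:healthy]
puts result[:issues].inspect
```

Because every check mutates the same two collections, adding a new check is a one-line change in the caller, and the overall :healthy flag falls out of `issues.empty?`.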

.healthy? ⇒ Boolean

Quick health check - returns boolean

Returns:

  • (Boolean)

    true if system is healthy



# File 'lib/htm/observability.rb', line 266

def healthy?
  health_check[:healthy]
end

.memory_stats ⇒ Hash

Get memory usage statistics

Returns:

  • (Hash)

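    Memory stats: :process_rss_mb (process resident set size in MB) and :gc_stats (GC count, allocated heap pages, live heap slots), or { available: false } if they cannot be collected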



# File 'lib/htm/observability.rb', line 236

def memory_stats
  {
    process_rss_mb: process_memory_mb,
    gc_stats: GC.stat.slice(:count, :heap_allocated_pages, :heap_live_slots)
  }
rescue StandardError
  { available: false }
end

.query_timing_stats ⇒ Hash

Get query timing statistics

Returns:

  • (Hash)

    Timing statistics including avg, min, max, p95



# File 'lib/htm/observability.rb', line 217

def query_timing_stats
  calculate_timing_stats(@query_timings, :query)
end
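
The private calculate_timing_stats helper is not shown on this page. As a rough sketch of how avg, min, max, and p95 could be derived from the recorded samples, the function below uses the nearest-rank method for the 95th percentile. The function name `timing_summary`, the result keys, and the percentile method are all assumptions, not HTM's actual implementation.

```ruby
# Hypothetical summary over samples shaped like { duration_ms: Float, ... }.
def timing_summary(samples)
  durations = samples.map { |s| s[:duration_ms] }.sort
  return { sample_count: 0 } if durations.empty?

  # Nearest-rank p95: rank = ceil(0.95 * n), computed with integer arithmetic.
  p95_rank = (durations.size * 95 + 99) / 100

  {
    sample_count: durations.size,
    avg_ms: (durations.sum / durations.size.to_f).round(2),
    min_ms: durations.first,
    max_ms: durations.last,
    p95_ms: durations[p95_rank - 1]
  }
end

samples = (1..20).map { |ms| { duration_ms: ms, recorded_at: Time.now } }
p timing_summary(samples)
```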

.record_embedding_timing(duration_ms) ⇒ Object

Record embedding generation timing

Parameters:

  • duration_ms (Float)

    Generation duration in milliseconds



# File 'lib/htm/observability.rb', line 189

def record_embedding_timing(duration_ms)
  @metrics_mutex.synchronize do
    @embedding_timings << {
      duration_ms: duration_ms,
      recorded_at: Time.now
    }
    @embedding_timings.shift if @embedding_timings.size > @max_timing_samples
  end
end

.record_query_timing(duration_ms, query_type: :unknown) ⇒ Object

Record query timing for metrics

Parameters:

  • duration_ms (Float)

    Query duration in milliseconds

  • query_type (Symbol) (defaults to: :unknown)

    Type of query (:vector, :fulltext, :hybrid)



# File 'lib/htm/observability.rb', line 172

def record_query_timing(duration_ms, query_type: :unknown)
  @metrics_mutex.synchronize do
    @query_timings << {
      duration_ms: duration_ms,
      query_type: query_type,
      recorded_at: Time.now
    }

    # Keep only recent samples
    @query_timings.shift if @query_timings.size > @max_timing_samples
  end
end
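
The three record_* methods share one pattern: append a sample under a mutex, then drop the oldest sample once the buffer exceeds its cap. The sketch below isolates that pattern in a small class and pairs it with a monotonic-clock timer for producing `duration_ms` values. The class name `TimingBuffer` and the helper `elapsed_ms` are hypothetical and are not part of HTM.

```ruby
# Hypothetical bounded, thread-safe sample buffer mirroring the record_* methods.
class TimingBuffer
  def initialize(max_samples)
    @max_samples = max_samples
    @samples = []
    @mutex = Mutex.new
  end

  # Append a sample, then evict the oldest one if the cap is exceeded.
  def record(duration_ms)
    @mutex.synchronize do
      @samples << { duration_ms: duration_ms, recorded_at: Time.now }
      @samples.shift if @samples.size > @max_samples
    end
  end

  # Return a snapshot of the recorded durations, oldest first.
  def durations
    @mutex.synchronize { @samples.map { |s| s[:duration_ms] } }
  end
end

# Time a block with the monotonic clock and return elapsed milliseconds.
def elapsed_ms
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
end

buffer = TimingBuffer.new(3)
buffer.record(elapsed_ms { (1..1000).sum })
puts buffer.durations.size
```

In an application, the value returned by a timer like `elapsed_ms` would be passed to HTM::Observability.record_query_timing together with a query_type such as :vector.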

.record_tag_timing(duration_ms) ⇒ Object

Record tag extraction timing

Parameters:

  • duration_ms (Float)

    Extraction duration in milliseconds



# File 'lib/htm/observability.rb', line 203

def record_tag_timing(duration_ms)
  @metrics_mutex.synchronize do
    @tag_extraction_timings << {
      duration_ms: duration_ms,
      recorded_at: Time.now
    }
    @tag_extraction_timings.shift if @tag_extraction_timings.size > @max_timing_samples
  end
end

.reset_metrics! ⇒ void

This method returns an undefined value.

Clear all collected timing metrics



# File 'lib/htm/observability.rb', line 274

def reset_metrics!
  @metrics_mutex.synchronize do
    @query_timings.clear
    @embedding_timings.clear
    @tag_extraction_timings.clear
  end
end

.service_timing_stats ⇒ Hash

Get service timing statistics (embedding and tag extraction)

Returns:

  • (Hash)

    Timing stats for embedding and tag services



# File 'lib/htm/observability.rb', line 225

def service_timing_stats
  {
    embedding: calculate_timing_stats(@embedding_timings, :embedding),
    tag_extraction: calculate_timing_stats(@tag_extraction_timings, :tag)
  }
end