Class: Pgbus::MCP::HealthAnalyzer

Inherits:
Object
Defined in:
lib/pgbus/mcp/health_analyzer.rb

Overview

Computes the top-level pgbus health verdict (OK / DEGRADED / STALLED) from the existing DataSource read layer. This is the single signal that catches the silent-worker-wedge class of incident (#179, #174, #181): a queue with visible messages and no claim progress while a subscribing worker is heart-beating with idle capacity.

Verdict semantics (issue #180 acceptance criteria):

STALLED  — backlog (visible > 0) AND at least one worker is heart-beating
           but its claim loop has stopped advancing (status :stalled),
           OR backlog with live-but-idle workers and zero claim progress.
DEGRADED — something is wrong but not the wedge: stale processes, a
           paused queue holding a backlog, growing DLQ, or MVCC horizon
           pinned by a long-running transaction.
OK       — draining normally / nothing actionable.
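The precedence between the three verdicts can be sketched as below. This is a minimal illustration, not the class's actual code: `classify` is a hypothetical helper, and the reason strings are made-up sample data. It only assumes that STALLED outranks DEGRADED, which in turn outranks OK.

```ruby
# Hypothetical helper: picks the most severe verdict that has at least one reason.
# STALLED wins over DEGRADED; OK is returned only when both lists are empty.
def classify(stalled_reasons, degraded_reasons)
  return "STALLED"  if stalled_reasons.any?
  return "DEGRADED" if degraded_reasons.any?

  "OK"
end

# Sample reasons (invented for illustration).
wedge = ["backlog with live-but-idle workers and zero claim progress"]
stale = ["stale process detected"]

puts classify(wedge, stale) # both present: the wedge takes priority
puts classify([], stale)
puts classify([], [])
```

A backlog on a wedged worker is reported as STALLED even when DEGRADED conditions are also present, so the most urgent signal is never masked.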

Constant Summary

WORKER_KIND = "worker"

A worker is considered to have idle capacity unless its metadata explicitly reports it is saturated. We treat the presence of any live worker as “has capacity” because a wedged worker reports healthy heartbeats while doing no work — exactly the case we must catch.
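The idle-capacity rule described for WORKER_KIND might look roughly like the sketch below. The process-row keys `:kind` and `:healthy` and both helper methods are assumptions made for illustration; the real shape of the rows returned by the data source is not shown on this page.

```ruby
WORKER_KIND = "worker"

# Hypothetical: selects worker processes that are still heart-beating.
# The :kind and :healthy keys are assumed, not taken from pgbus.
def live_workers(processes)
  processes.select { |p| p[:kind] == WORKER_KIND && p[:healthy] }
end

# Any live worker counts as "has capacity", because a wedged worker
# keeps reporting healthy heartbeats while doing no work.
def idle_capacity?(processes)
  live_workers(processes).any?
end

# Sample rows (invented for illustration).
processes = [
  { kind: "worker",     healthy: true },
  { kind: "dispatcher", healthy: true },
  { kind: "worker",     healthy: false }
]

puts idle_capacity?(processes)
```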

Instance Method Summary

Constructor Details

#initialize(data_source) ⇒ HealthAnalyzer

Returns a new instance of HealthAnalyzer.



# File 'lib/pgbus/mcp/health_analyzer.rb', line 26

def initialize(data_source)
  @data_source = data_source
end

Instance Method Details

#verdict ⇒ Object

Returns a machine-readable verdict hash suitable for both interactive agent use and automated alerting.
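For automated alerting, a caller could translate the verdict hash into an exit code and a one-line message. The sketch below is an assumption-laden example: `EXIT_CODES`, `exit_code_for`, and `alert_line` are hypothetical names, the exit-code convention is borrowed from Nagios-style checks, and the reasons are assumed to be printable with `to_s`. Only the `:status`, `:reasons`, and `:checked_at` keys come from the method itself.

```ruby
# Hypothetical mapping from verdict status to a Nagios-style exit code.
EXIT_CODES = { "OK" => 0, "DEGRADED" => 1, "STALLED" => 2 }.freeze

# Unknown statuses map to 3 so a malformed verdict is never read as healthy.
def exit_code_for(verdict)
  EXIT_CODES.fetch(verdict[:status], 3)
end

# Builds a single log/alert line from the verdict hash.
def alert_line(verdict)
  reasons = Array(verdict[:reasons]).map(&:to_s)
  detail  = reasons.empty? ? "no reasons reported" : reasons.join("; ")
  "pgbus #{verdict[:status]}: #{detail}"
end

# Sample verdict (values invented for illustration).
sample = {
  status: "STALLED",
  reasons: ["backlog with no claim progress"],
  checked_at: "2024-01-01T00:00:00Z"
}

puts alert_line(sample)
puts exit_code_for(sample)
```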



# File 'lib/pgbus/mcp/health_analyzer.rb', line 32

def verdict
  queues          = @data_source.queues_with_metrics
  processes       = @data_source.processes
  health, health_error = safe_queue_health

  # Partition queues once: non-DLQ (the operational set) and the subset
  # of those with visible, claimable backlog. Paused queues are removed
  # from the STALLED backlog (an intentional pause is not the wedge —
  # it's reported under DEGRADED), but kept in `non_dlq` for the summary.
  non_dlq = queues.reject { |q| dlq?(q) }
  backlog = non_dlq.select { |q| q[:queue_visible_length].to_i.positive? }
  active_backlog = backlog.reject { |q| q[:paused] }

  stalled  = stalled_reasons(active_backlog, processes)
  degraded = degraded_reasons(queues, backlog, processes, health, health_error)

  status = if stalled.any?
             "STALLED"
           elsif degraded.any?
             "DEGRADED"
           else
             "OK"
           end

  {
    status: status,
    reasons: stalled + degraded,
    checked_at: Time.now.utc.iso8601,
    summary: build_summary(queues, non_dlq, processes, health)
  }
end
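The three-way queue partition at the top of `verdict` can be tried in isolation on sample data. In this sketch the `dlq?` predicate is a hypothetical stand-in (the real one is private and not shown here), and the `:name` key and the `_dlq` suffix convention are assumptions; `:queue_visible_length` and `:paused` are the keys used in the method above.

```ruby
# Hypothetical stand-in for the private dlq? predicate:
# assumes dead-letter queues are named with a "_dlq" suffix.
def dlq?(queue)
  queue[:name].to_s.end_with?("_dlq")
end

# Sample queue rows (invented for illustration).
queues = [
  { name: "orders",     queue_visible_length: 5, paused: false },
  { name: "emails",     queue_visible_length: 3, paused: true },
  { name: "reports",    queue_visible_length: 0, paused: false },
  { name: "orders_dlq", queue_visible_length: 2, paused: false }
]

# Same partitioning steps as in #verdict.
non_dlq        = queues.reject { |q| dlq?(q) }
backlog        = non_dlq.select { |q| q[:queue_visible_length].to_i.positive? }
active_backlog = backlog.reject { |q| q[:paused] }

puts non_dlq.map { |q| q[:name] }.inspect        # operational queues
puts backlog.map { |q| q[:name] }.inspect        # queues with visible messages
puts active_backlog.map { |q| q[:name] }.inspect # STALLED candidates (unpaused)
```

Only `orders` reaches the STALLED check here. The paused `emails` queue still counts as backlog, so it can be reported under DEGRADED instead.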