Class: Pgbus::MCP::HealthAnalyzer
- Inherits: Object
  - Object
  - Pgbus::MCP::HealthAnalyzer
- Defined in: lib/pgbus/mcp/health_analyzer.rb
Overview
Computes the top-level pgbus health verdict (OK / DEGRADED / STALLED) from the existing DataSource read layer. This is the single signal that catches the silent-worker-wedge class of incident (#179, #174, #181): a queue with visible messages and no claim progress while a subscribing worker is heart-beating with idle capacity.
Verdict semantics (issue #180 acceptance criteria):
  STALLED:  backlog (visible > 0) AND at least one worker is heart-beating
            but its claim loop has stopped advancing (status :stalled),
            OR backlog with live-but-idle workers and zero claim progress.
  DEGRADED: something is wrong but not the wedge: stale processes, a
            paused queue holding a backlog, growing DLQ, or MVCC horizon
            pinned by a long-running transaction.
  OK:       draining normally / nothing actionable.
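The precedence between the three verdicts can be sketched in isolation. This is a minimal illustration, assuming two arrays of reason strings as inputs; `pick_status` is a hypothetical helper name, not part of the class, and the reason strings are invented for the example.

```ruby
# Hypothetical helper illustrating verdict precedence:
# any stalled reason wins, then any degraded reason, otherwise OK.
def pick_status(stalled_reasons, degraded_reasons)
  if stalled_reasons.any?
    "STALLED"
  elsif degraded_reasons.any?
    "DEGRADED"
  else
    "OK"
  end
end

# Invented reason strings, for illustration only.
puts pick_status(["backlog with no claim progress"], ["stale process"])
puts pick_status([], ["paused queue holding a backlog"])
puts pick_status([], [])
```

A STALLED verdict therefore masks any DEGRADED conditions in the status field, so a consumer that wants the full picture should also read the reasons, not only the status.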
Constant Summary
- WORKER_KIND = "worker"
  A worker is considered to have idle capacity unless its metadata explicitly reports it is saturated. We treat the presence of any live worker as “has capacity” because a wedged worker reports healthy heartbeats while doing no work, which is exactly the case we must catch.
Instance Method Summary
- #initialize(data_source) ⇒ HealthAnalyzer (constructor)
  A new instance of HealthAnalyzer.
- #verdict ⇒ Object
  Returns a machine-readable verdict hash suitable for both interactive agent use and automated alerting.
Constructor Details
#initialize(data_source) ⇒ HealthAnalyzer
Returns a new instance of HealthAnalyzer.
    # File 'lib/pgbus/mcp/health_analyzer.rb', line 26

    def initialize(data_source)
      @data_source = data_source
    end
Instance Method Details
#verdict ⇒ Object
Returns a machine-readable verdict hash suitable for both interactive agent use and automated alerting.
    # File 'lib/pgbus/mcp/health_analyzer.rb', line 32

    def verdict
      queues = @data_source.queues_with_metrics
      processes = @data_source.processes
      health, health_error = safe_queue_health

      # Partition queues once: non-DLQ (the operational set) and the subset
      # of those with visible, claimable backlog. Paused queues are removed
      # from the STALLED backlog (an intentional pause is not the wedge —
      # it's reported under DEGRADED), but kept in `non_dlq` for the summary.
      non_dlq = queues.reject { |q| dlq?(q) }
      backlog = non_dlq.select { |q| q[:queue_visible_length].to_i.positive? }
      active_backlog = backlog.reject { |q| q[:paused] }

      stalled = stalled_reasons(active_backlog, processes)
      degraded = degraded_reasons(queues, backlog, processes, health, health_error)

      status = if stalled.any?
                 "STALLED"
               elsif degraded.any?
                 "DEGRADED"
               else
                 "OK"
               end

      {
        status: status,
        reasons: stalled + degraded,
        checked_at: Time.now.utc.iso8601,
        summary: build_summary(queues, non_dlq, processes, health)
      }
    end
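To make the three-way queue partition in #verdict concrete, here is a small standalone sketch run against invented sample rows. It rests on several assumptions: the private `dlq?` helper is not shown on this page, so the version below (a `_dlq` name suffix) is a hypothetical stand-in; the `:name` key and all sample values are made up; and `partition_queues` is an illustrative wrapper, not a method of the class. Only `:queue_visible_length` and `:paused` come from the source above.

```ruby
# Invented sample rows; only :queue_visible_length and :paused are keys
# that #verdict is shown to read. :name is an assumption for this sketch.
SAMPLE_QUEUES = [
  { name: "orders",     queue_visible_length: 12,  paused: false },
  { name: "emails",     queue_visible_length: "3", paused: true  },
  { name: "reports",    queue_visible_length: 0,   paused: false },
  { name: "orders_dlq", queue_visible_length: 5,   paused: false }
].freeze

# Hypothetical stand-in for the private dlq? helper; the real rule may differ.
def dlq?(queue)
  queue[:name].to_s.end_with?("_dlq")
end

# Illustrative wrapper around the same three filtering steps as #verdict.
def partition_queues(queues)
  non_dlq = queues.reject { |q| dlq?(q) }
  # to_i tolerates string or nil counts before the positive? check.
  backlog = non_dlq.select { |q| q[:queue_visible_length].to_i.positive? }
  # Paused queues keep their backlog but are excluded from the STALLED check.
  active_backlog = backlog.reject { |q| q[:paused] }
  { non_dlq: non_dlq, backlog: backlog, active_backlog: active_backlog }
end

parts = partition_queues(SAMPLE_QUEUES)
parts.each { |label, rows| puts "#{label}: #{rows.map { |q| q[:name] }.join(', ')}" }
```

With this sample data, the paused `emails` queue stays in `backlog` (so it can be reported under DEGRADED) but drops out of `active_backlog`, which is the only set passed to the stalled check.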