Module: DispatchPolicy::InflightTracker
- Extended by: ActiveSupport::Concern
- Defined in: lib/dispatch_policy/inflight_tracker.rb
Overview
Around-perform that records each job execution in dispatch_policy_inflight_jobs while it runs, so the concurrency gate can count active jobs per partition.
While the job runs we spawn a heartbeat thread that bumps `heartbeat_at` every `config.inflight_heartbeat_interval` seconds. Without this, jobs longer than `inflight_stale_after` (default 5 min) get their inflight row prematurely swept and the concurrency gate over-admits.
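The staleness rule described above can be illustrated with a small in-memory sketch. `InflightRow`, `STALE_AFTER`, and `stale?` are hypothetical names used only for illustration; the real sweeper and its query are not shown in this page, so treat the comparison as an assumption about how `inflight_stale_after` is applied.

```ruby
# Hypothetical in-memory stand-in for one row of dispatch_policy_inflight_jobs.
InflightRow = Struct.new(:active_job_id, :heartbeat_at)

# Seconds; mirrors the documented 5-minute default of inflight_stale_after.
STALE_AFTER = 300

# Assumed rule: a row is stale once its last heartbeat is older than the cutoff.
def stale?(row, now:, stale_after: STALE_AFTER)
  now - row.heartbeat_at > stale_after
end

started = Time.at(0)

# Ten minutes into a job that never heartbeats: the row looks abandoned.
silent_row = InflightRow.new("job-1", started)
puts stale?(silent_row, now: started + 600)

# Same job, but the heartbeat last touched the row 30 seconds ago.
fresh_row = InflightRow.new("job-1", started + 570)
puts stale?(fresh_row, now: started + 600)
```

Under this assumed rule, the long-running job keeps its row only because the heartbeat keeps moving `heartbeat_at` forward.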
Defined Under Namespace
Classes: Heartbeat, ThreadSafeFlag
Constant Summary
- HEARTBEAT_KEY = :__dispatch_policy_heartbeat_token__
Class Method Summary
- .lookup_admitted_at(active_job_id) ⇒ Object
Reads the admitted_at column from the inflight row that the Tick pre-inserted.
- .record_adaptive_observations(policy:, gates:, partition_key:, admitted_at:, perform_start:, succeeded:) ⇒ Object
- .start_heartbeat(active_job_id) ⇒ Object
- .stop_heartbeat(heartbeat) ⇒ Object
- .track(job) ⇒ Object
Class Method Details
.lookup_admitted_at(active_job_id) ⇒ Object
Reads the admitted_at column from the inflight row that the Tick pre-inserted. Used as the start-of-queue-wait reference for the adaptive_concurrency feedback signal (queue_lag = perform_start - admitted_at). Returns nil if the row vanished or the lookup fails; queue wait then cannot be measured, and record_adaptive_observations records a queue lag of 0.
# File 'lib/dispatch_policy/inflight_tracker.rb', line 83
def self.lookup_admitted_at(active_job_id)
  result = ActiveRecord::Base.connection.exec_query(
    "SELECT admitted_at FROM dispatch_policy_inflight_jobs WHERE active_job_id = $1 LIMIT 1",
    "lookup_admitted_at",
    [active_job_id]
  )
  row = result.first
  return nil unless row
  ts = row["admitted_at"]
  ts.is_a?(Time) ? ts : Time.parse(ts.to_s)
rescue StandardError
  nil
end
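The last two lines of the method normalise whatever the database driver returns for `admitted_at`. A minimal standalone sketch of that coercion follows; `coerce_timestamp` is a hypothetical helper name, and it rescues only `ArgumentError` where the original rescues `StandardError`.

```ruby
require "time"

# Hypothetical helper: accept either a Time or something whose string
# form is a parseable timestamp, and return nil when neither applies.
def coerce_timestamp(value)
  return nil if value.nil?

  value.is_a?(Time) ? value : Time.parse(value.to_s)
rescue ArgumentError
  # Time.parse raises ArgumentError when it finds no time information.
  nil
end

puts coerce_timestamp("2024-01-01 12:00:00 UTC").inspect
puts coerce_timestamp(Time.at(0)).inspect
puts coerce_timestamp("").inspect
```

Returning nil instead of raising keeps a malformed or missing value from failing the job itself, which matches the intent of the rescue in the original method.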
.record_adaptive_observations(policy:, gates:, partition_key:, admitted_at:, perform_start:, succeeded:) ⇒ Object
# File 'lib/dispatch_policy/inflight_tracker.rb', line 97
def self.record_adaptive_observations(policy:, gates:, partition_key:, admitted_at:, perform_start:, succeeded:)
  return if gates.empty?

  queue_lag_ms = if admitted_at
    ((perform_start - admitted_at) * 1000).to_i
  else
    # No admitted_at means we can't measure queue wait. Treat as 0
    # so the observation still increments sample_count and the
    # cap can grow if everything else is healthy.
    0
  end

  gates.each do |gate|
    gate.record_observation(
      policy_name: policy.name,
      partition_key: partition_key,
      queue_lag_ms: queue_lag_ms,
      succeeded: succeeded
    )
  rescue StandardError => e
    DispatchPolicy.config.logger&.warn(
      "[dispatch_policy] adaptive observation failed for #{policy.name}/#{partition_key}: #{e.class}: #{e.message}"
    )
  end
end
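Two behaviours of this method are easy to miss in the listing: the queue lag falls back to 0 when `admitted_at` is nil, and a failure in one gate does not prevent the remaining gates from recording. The sketch below reproduces both with hypothetical gate doubles (`RecordingGate`, `FailingGate`) and helper names (`queue_lag_ms`, `record_all`); the real gates and their storage are not shown in this page.

```ruby
# Hypothetical gate double that keeps observations in memory.
class RecordingGate
  attr_reader :observations

  def initialize
    @observations = []
  end

  def record_observation(**observation)
    @observations << observation
  end
end

# Hypothetical gate double that always fails, to exercise the rescue.
class FailingGate
  def record_observation(**_observation)
    raise "simulated storage failure"
  end
end

# Milliseconds between admission and perform start; 0 when unknown.
def queue_lag_ms(admitted_at, perform_start)
  return 0 if admitted_at.nil?

  ((perform_start - admitted_at) * 1000).to_i
end

# Record on every gate, isolating failures per gate.
def record_all(gates, admitted_at:, perform_start:, succeeded:)
  lag = queue_lag_ms(admitted_at, perform_start)
  gates.each do |gate|
    gate.record_observation(queue_lag_ms: lag, succeeded: succeeded)
  rescue StandardError => e
    warn "observation failed: #{e.class}: #{e.message}"
  end
end

gate = RecordingGate.new
record_all([FailingGate.new, gate],
           admitted_at: Time.at(100), perform_start: Time.at(102.5), succeeded: true)
puts gate.observations.inspect
```

Placing the rescue inside the `each` block, rather than around the whole loop, is what lets the second gate record even though the first one raised.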
.start_heartbeat(active_job_id) ⇒ Object
# File 'lib/dispatch_policy/inflight_tracker.rb', line 129
def self.start_heartbeat(active_job_id)
  interval = DispatchPolicy.config.inflight_heartbeat_interval.to_f
  return nil if interval <= 0

  stop_flag = Concurrent::AtomicBoolean.new(false) if defined?(Concurrent::AtomicBoolean)
  stop_flag ||= ThreadSafeFlag.new

  thread = Thread.new do
    Thread.current.name = "dispatch_policy.heartbeat:#{active_job_id}"
    until stop_flag.true?
      # Sleep in small slices so stop is responsive without polling tight.
      slept = 0.0
      slice = [interval, 1.0].min
      while slept < interval && !stop_flag.true?
        sleep(slice)
        slept += slice
      end
      break if stop_flag.true?

      begin
        ActiveRecord::Base.connection_pool.with_connection do
          Repository.heartbeat_inflight!(active_job_id: active_job_id)
        end
      rescue StandardError => e
        DispatchPolicy.config.logger&.warn("[dispatch_policy] heartbeat #{active_job_id} failed: #{e.class}: #{e.message}")
      end
    end
  end

  Heartbeat.new(thread, stop_flag)
end
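The sliced-sleep loop can be tried outside Rails with a plain block in place of the database heartbeat. In the sketch below, `SketchFlag` is a hypothetical mutex-backed flag exposing the same `true?` and `make_true` calls the listing uses (the implementation of the real `ThreadSafeFlag` is not shown here), and `start_sketch_heartbeat` is likewise an illustrative name.

```ruby
# Hypothetical flag with the two methods the heartbeat loop relies on.
class SketchFlag
  def initialize
    @mutex = Mutex.new
    @value = false
  end

  def make_true
    @mutex.synchronize { @value = true }
  end

  def true?
    @mutex.synchronize { @value }
  end
end

# Calls the given block once per interval until the flag is set.
def start_sketch_heartbeat(interval, flag, &beat)
  Thread.new do
    until flag.true?
      # Sleep in slices of at most one second so a stop request is
      # noticed without waiting out a long interval.
      slept = 0.0
      slice = [interval, 1.0].min
      while slept < interval && !flag.true?
        sleep(slice)
        slept += slice
      end
      break if flag.true?

      beat.call
    end
  end
end

flag = SketchFlag.new
beats = Queue.new
thread = start_sketch_heartbeat(0.01, flag) { beats << :beat }
beats.pop          # block until the first heartbeat arrives
flag.make_true     # request shutdown
thread.join(2)
puts thread.alive? # false once the loop has exited
```

With an interval above one second the inner loop wakes every second to re-check the flag, so the worst-case shutdown delay is bounded by the slice rather than by the interval.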
.stop_heartbeat(heartbeat) ⇒ Object
# File 'lib/dispatch_policy/inflight_tracker.rb', line 162
def self.stop_heartbeat(heartbeat)
  return if heartbeat.nil?

  heartbeat.stop_flag.make_true
  # Wake the thread out of any in-progress sleep so we don't wait the full slice.
  heartbeat.thread.wakeup if heartbeat.thread.alive?
  heartbeat.thread.join(1.0)
rescue StandardError
  # Worst case: the thread is killed by GC; the inflight row gets a stale
  # heartbeat_at and the sweeper will reclaim it after inflight_stale_after.
end
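The flag-then-wakeup-then-bounded-join sequence can be exercised in isolation. The sketch below uses hypothetical names (`StopSignal`, `sleeper_thread`, `stop_sleeper`) and a deliberately long sleep so that the effect of `Thread#wakeup` is visible; it is an illustration of the pattern, not the library's code.

```ruby
# Hypothetical mutex-backed stop signal.
class StopSignal
  def initialize
    @mutex = Mutex.new
    @stopped = false
  end

  def stop!
    @mutex.synchronize { @stopped = true }
  end

  def stopped?
    @mutex.synchronize { @stopped }
  end
end

# A worker that sleeps in long slices until told to stop.
def sleeper_thread(signal, slice)
  Thread.new { sleep(slice) until signal.stopped? }
end

# Set the flag first, then interrupt the sleep, then wait a bounded time.
# Returns true if the thread finished within the timeout.
def stop_sleeper(thread, signal, timeout: 1.0)
  signal.stop!
  thread.wakeup if thread.alive?
  !thread.join(timeout).nil?
rescue ThreadError
  # Thread#wakeup raises ThreadError if the thread died after the alive? check.
  !thread.alive?
end

signal = StopSignal.new
thread = sleeper_thread(signal, 30)
Thread.pass until thread.status == "sleep" # wait until it is really sleeping
puts stop_sleeper(thread, signal)
```

Setting the flag before calling `wakeup` matters: the woken thread re-checks the flag immediately, so reversing the order could let it go back to sleep for another full slice.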
.track(job) ⇒ Object
# File 'lib/dispatch_policy/inflight_tracker.rb', line 24
def self.track(job)
  policy_name = job.class.respond_to?(:dispatch_policy_name) && job.class.dispatch_policy_name
  return yield unless policy_name

  policy = DispatchPolicy.registry.fetch(policy_name)
  return yield unless policy

  # Mirror the stage-time fallback in JobExtension.around_enqueue_for:
  # when the job carries no explicit queue, use the policy's default.
  # Without this, a policy whose partition_by/shard_by reads queue_name
  # would compute a DIFFERENT partition_key here than at admission, so
  # the around_perform inflight row (and adaptive observations) would
  # land under the wrong scope and the concurrency gate's COUNT(*) would
  # miss them.
  queue_name = job.queue_name&.to_s || policy.queue_name
  ctx = policy.build_context(job.arguments, queue_name: queue_name)
  partition_key = policy.partition_key_for(ctx)

  Repository.insert_inflight!([{
    policy_name: policy.name,
    partition_key: partition_key,
    active_job_id: job.job_id
  }])

  adaptive_gates = policy.gates.select { |g| g.name == :adaptive_concurrency }
  admitted_at = adaptive_gates.any? ? lookup_admitted_at(job.job_id) : nil
  perform_start = Time.current

  heartbeat = start_heartbeat(job.job_id)

  succeeded = false
  begin
    yield
    succeeded = true
  ensure
    stop_heartbeat(heartbeat)

    record_adaptive_observations(
      policy: policy,
      gates: adaptive_gates,
      partition_key: partition_key,
      admitted_at: admitted_at,
      perform_start: perform_start,
      succeeded: succeeded
    )

    begin
      Repository.delete_inflight!(active_job_id: job.job_id)
    rescue StandardError => e
      DispatchPolicy.config.logger&.warn("[dispatch_policy] failed to delete inflight row #{job.job_id}: #{e.class}: #{e.message}")
    end
  end
end
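The core guarantee of `track` is that the inflight row exists only while the job body runs and is removed whether the body returns or raises, with `succeeded` reflecting which of the two happened. A reduced sketch of that begin/ensure shape follows; `INFLIGHT_ROWS`, `OUTCOMES`, and `track_sketch` are hypothetical in-memory stand-ins for the repository calls and the adaptive observations.

```ruby
# Hypothetical in-memory stand-ins for the inflight table and the
# recorded outcomes.
INFLIGHT_ROWS = {}
OUTCOMES = []

def track_sketch(job_id, partition_key)
  INFLIGHT_ROWS[job_id] = partition_key
  succeeded = false
  begin
    result = yield
    # Only reached when the block did not raise.
    succeeded = true
    result
  ensure
    # Runs on both success and failure, like the ensure in track.
    OUTCOMES << [job_id, succeeded]
    INFLIGHT_ROWS.delete(job_id)
  end
end

# The row is visible while the block runs and gone afterwards.
puts track_sketch("job-a", "tenant-1") { INFLIGHT_ROWS.key?("job-a") }
puts INFLIGHT_ROWS.empty?

# A failing job still cleans up, and the exception propagates.
begin
  track_sketch("job-b", "tenant-1") { raise ArgumentError, "boom" }
rescue ArgumentError
  puts INFLIGHT_ROWS.empty?
end
```

Initialising `succeeded` to false and flipping it only after the yield means any exception leaves it false without needing an explicit rescue clause.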