Module: Tracelit::Metrics

Defined in:
lib/tracelit/metrics.rb

Class Method Details

.counter(name, description: "", unit: "") ⇒ Object

Exposes a counter for manual instrumentation in user code (returns nil until setup has run):

Tracelit::Metrics.counter("orders.placed").add(1)


# File 'lib/tracelit/metrics.rb', line 76

def self.counter(name, description: "", unit: "")
  @meter&.create_counter(name,
    description: description,
    unit: unit
  )
end

.gauge(name, description: "", unit: "") ⇒ Object
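
Like the counter helper, returns the instrument, or nil until setup has run. A minimal usage sketch (the metric name and attributes here are illustrative):

Tracelit::Metrics.gauge("queue.depth", unit: "{jobs}")&.record(42, attributes: { "queue" => "default" })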



# File 'lib/tracelit/metrics.rb', line 90

def self.gauge(name, description: "", unit: "")
  @meter&.create_gauge(name,
    description: description,
    unit: unit
  )
end

.histogram(name, description: "", unit: "") ⇒ Object
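
Returns the instrument, or nil until setup has run. A minimal usage sketch (the metric name is illustrative):

Tracelit::Metrics.histogram("payment.charge.duration", unit: "ms")&.record(12.5)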



# File 'lib/tracelit/metrics.rb', line 83

def self.histogram(name, description: "", unit: "")
  @meter&.create_histogram(name,
    description: description,
    unit: unit
  )
end

.install_connection_pool_poller ⇒ Object

Polls ActiveRecord connection pool stats every 30 seconds on a daemon thread and records them as gauges. Does not require a live connection at install time — errors during polling are silently retried next cycle.

Fix 11: version-safe pool access that works on Rails 6.0–8.x.
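
The gauges are fed from ActiveRecord's ConnectionPool#stat, which returns a hash shaped roughly like this (values illustrative; keys can vary slightly across Rails versions):

ActiveRecord::Base.connection_pool.stat
# => { size: 5, connections: 2, busy: 1, dead: 0, idle: 1, waiting: 0, checkout_timeout: 5.0 }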



# File 'lib/tracelit/metrics.rb', line 235

def self.install_connection_pool_poller
  return if @connection_pool_poller_installed
  @connection_pool_poller_installed = true

  pool_size = @meter.create_gauge(
    "db.connection_pool.size",
    description: "Maximum connections in the pool",
    unit: "{connections}"
  )

  pool_busy = @meter.create_gauge(
    "db.connection_pool.busy",
    description: "Connections currently checked out",
    unit: "{connections}"
  )

  pool_idle = @meter.create_gauge(
    "db.connection_pool.idle",
    description: "Connections available for checkout",
    unit: "{connections}"
  )

  pool_waiting = @meter.create_gauge(
    "db.connection_pool.waiting",
    description: "Threads waiting for a connection",
    unit: "{threads}"
  )

  thread = Thread.new do
    Thread.current[:tracelit_pool_poller] = true
    loop do
      sleep 30
      begin
        # Fix 11a: Rails 7.2 soft-deprecated connection_pool on the base
        # class. Use the connection handler when available; fall back for
        # Rails 6.0 compatibility.
        pool = if ActiveRecord::Base.respond_to?(:connection_handler)
          begin
            ActiveRecord::Base.connection_handler
              .retrieve_connection_pool(ActiveRecord::Base.connection_specification_name)
          rescue StandardError
            ActiveRecord::Base.connection_pool
          end
        else
          ActiveRecord::Base.connection_pool
        end

        next unless pool

        stat = pool.stat

        # Fix 11b: pool_config.db_config was added in Rails 6.1.
        # Fall back gracefully on older setups.
        adapter = if pool.respond_to?(:pool_config)
          pool.pool_config.db_config.adapter.to_s rescue "unknown"
        else
          "unknown"
        end

        attrs = { "db.system" => adapter }
        pool_size.record(stat[:size], attributes: attrs)
        pool_busy.record(stat[:busy], attributes: attrs)
        pool_idle.record(stat[:idle], attributes: attrs)
        pool_waiting.record(stat[:waiting], attributes: attrs)
      rescue StandardError
        # Pool may not be connected yet — retry next cycle
      end
    end
  end
  thread.abort_on_exception = false
  thread
rescue StandardError => e
  OpenTelemetry.logger.warn("[Tracelit] failed to install connection pool poller: #{e.message}")
end

.install_cpu_poller ⇒ Object

Polls process CPU utilisation every 30 seconds on a daemon thread. Computes a percentage by tracking the delta in CPU time (user + system) against wall-clock elapsed time — same approach as the Go and Node SDKs.

On Linux: reads /proc/self/stat (utime + stime in jiffies at 100 Hz). On macOS: reads `ps -o %cpu= -p <pid>` as a direct percentage.

Emits: process.runtime.cpu.usage (%)
Attributes: process.pid, process.runtime
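
For example, 6 CPU-seconds accumulated over a 30-second window records as 6 / 30 * 100 = 20%. The core of the calculation, as in the source below:

delta   = cpu_time - last_cpu_time   # CPU-seconds consumed since the last poll
elapsed = now - last_wall_time       # wall-clock seconds since the last poll
pct     = [[delta / elapsed * 100.0, 100.0].min, 0.0].max   # clamp to 0..100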



# File 'lib/tracelit/metrics.rb', line 365

def self.install_cpu_poller
  return if @cpu_poller_installed
  @cpu_poller_installed = true

  cpu_gauge = @meter.create_gauge(
    "process.runtime.cpu.usage",
    description: "Process CPU utilisation percentage",
    unit: "%"
  )

  pid      = Process.pid
  linux    = File.exist?("/proc/self/stat")
  interval = 30 # seconds

  thread = Thread.new do
    Thread.current[:tracelit_cpu_poller] = true

    last_cpu_time  = read_cpu_time_s(pid, linux)
    last_wall_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    loop do
      sleep interval
      begin
        now      = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        elapsed  = now - last_wall_time
        cpu_time = read_cpu_time_s(pid, linux)

        next if elapsed <= 0 || cpu_time.nil? || last_cpu_time.nil?

        delta = cpu_time - last_cpu_time
        last_cpu_time  = cpu_time
        last_wall_time = now

        next if delta < 0

        pct = [[delta / elapsed * 100.0, 100.0].min, 0.0].max

        cpu_gauge.record(pct, attributes: {
          "process.pid"     => pid.to_s,
          "process.runtime" => "ruby",
        })
      rescue StandardError
        # Retry next cycle — never crash on a metric poll failure
      end
    end
  end
  thread.abort_on_exception = false
  thread
rescue StandardError => e
  OpenTelemetry.logger.warn("[Tracelit] failed to install CPU poller: #{e.message}")
end

.install_memory_poller ⇒ Object

Polls process RSS memory every 60 seconds on a daemon thread.

Fix 12: On Linux use /proc/self/status (always present, no subprocess). Fall back to `ps` on macOS/BSD. The previous implementation always used a shell backtick, which spawns a child process and fails silently in minimal Docker containers that lack procps.
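
The Linux branch extracts the kilobyte count from the VmRSS line of /proc/self/status, e.g. (illustrative value):

File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1]
# the file contains a line like "VmRSS:    123456 kB" => "123456"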



# File 'lib/tracelit/metrics.rb', line 313

def self.install_memory_poller
  return if @memory_poller_installed
  @memory_poller_installed = true

  memory_gauge = @meter.create_gauge(
    "process.memory.rss",
    description: "Process resident set size (RSS)",
    unit: "MB"
  )

  pid = Process.pid

  thread = Thread.new do
    Thread.current[:tracelit_memory_poller] = true
    loop do
      sleep 60
      begin
        rss_kb = if File.exist?("/proc/self/status")
          # Linux: read VmRSS from /proc — no subprocess, always available
          File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
        else
          # macOS / BSD fallback
          `ps -o rss= -p #{Integer(pid)} 2>/dev/null`.strip.to_i
        end

        next if rss_kb == 0

        rss_mb = rss_kb / 1024.0
        memory_gauge.record(rss_mb, attributes: {
          "process.pid"     => pid.to_s,
          "process.runtime" => "ruby",
        })
      rescue StandardError
        # Ignore — environment may not support RSS polling
      end
    end
  end
  thread.abort_on_exception = false
  thread
rescue StandardError => e
  OpenTelemetry.logger.warn("[Tracelit] failed to install memory poller: #{e.message}")
end

.install_rails_subscriber ⇒ Object

Subscribes to Rails process_action.action_controller to emit:

http.server.request.count    — counter per request
http.server.request.duration — histogram in milliseconds
http.server.error.count      — counter for 5xx responses
db.query.duration            — histogram for ActiveRecord time per request

Fix 6: guarded against double-registration so reset! + re-setup in tests or Rails code-reloading scenarios does not duplicate metric counts.
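
For a successful GET to OrdersController#show (illustrative values), the attribute set recorded on each instrument looks like:

{
  "http.route"       => "OrdersController#show",
  "http.method"      => "GET",
  "http.status_code" => "200",
  "controller"       => "OrdersController",
  "action"           => "show",
}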



# File 'lib/tracelit/metrics.rb', line 105

def self.install_rails_subscriber
  return if @rails_subscriber_installed
  @rails_subscriber_installed = true

  request_counter = @meter.create_counter(
    "http.server.request.count",
    description: "Total HTTP requests processed",
    unit: "{requests}"
  )

  duration_histogram = @meter.create_histogram(
    "http.server.request.duration",
    description: "HTTP request duration",
    unit: "ms"
  )

  error_counter = @meter.create_counter(
    "http.server.error.count",
    description: "Total HTTP 5xx responses",
    unit: "{errors}"
  )

  db_duration_histogram = @meter.create_histogram(
    "db.query.duration",
    description: "Database query duration",
    unit: "ms"
  )

  ActiveSupport::Notifications.subscribe("process_action.action_controller") do |*args|
    event   = ActiveSupport::Notifications::Event.new(*args)
    payload = event.payload

    attrs = {
      # Fix 7: use controller#action (stable, low-cardinality route template)
      # instead of payload[:path] which contains raw IDs and causes metric
      # cardinality explosion on apps with resource IDs in URLs.
      "http.route"       => "#{payload[:controller]}##{payload[:action]}",
      "http.method"      => payload[:method].to_s,
      "http.status_code" => payload[:status].to_s,
      "controller"       => payload[:controller].to_s,
      "action"           => payload[:action].to_s,
    }

    request_counter.add(1, attributes: attrs)
    duration_histogram.record(event.duration, attributes: attrs)

    error_counter.add(1, attributes: attrs) if payload[:status].to_i >= 500

    if payload[:db_runtime]
      db_duration_histogram.record(
        payload[:db_runtime].to_f,
        attributes: { "controller" => payload[:controller].to_s }
      )
    end
  rescue StandardError
    # Never let metric errors surface to the application
  end
end

.install_sidekiq_middleware ⇒ Object

Installs a Sidekiq server middleware that emits per-job metrics. Uses a dynamically defined class so the instrument references are captured in the closure without global state.

Fix 6: guarded against double-registration.
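
Each job records an attribute set like the following (job class and queue are illustrative):

{
  "sidekiq.job.class" => "HardWorker",
  "sidekiq.queue"     => "default",
  "sidekiq.status"    => "success",
}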



# File 'lib/tracelit/metrics.rb', line 169

def self.install_sidekiq_middleware
  return if @sidekiq_middleware_installed
  @sidekiq_middleware_installed = true

  job_counter = @meter.create_counter(
    "sidekiq.job.count",
    description: "Total Sidekiq jobs processed",
    unit: "{jobs}"
  )

  job_duration = @meter.create_histogram(
    "sidekiq.job.duration",
    description: "Sidekiq job execution duration",
    unit: "ms"
  )

  job_error_counter = @meter.create_counter(
    "sidekiq.job.error.count",
    description: "Total Sidekiq jobs that raised an error",
    unit: "{jobs}"
  )

  middleware_class = Class.new do
    define_method(:call) do |_worker, msg, queue, &block|
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      error_raised = false

      begin
        block.call
      rescue StandardError
        error_raised = true
        raise
      ensure
        elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0

        attrs = {
          "sidekiq.job.class" => msg["class"].to_s,
          "sidekiq.queue"     => queue.to_s,
          "sidekiq.status"    => error_raised ? "error" : "success",
        }

        job_counter.add(1, attributes: attrs)
        job_duration.record(elapsed_ms, attributes: attrs)
        job_error_counter.add(1, attributes: attrs) if error_raised
      end
    end
  end

  Sidekiq.configure_server do |config|
    config.server_middleware do |chain|
      chain.add middleware_class
    end
  end
rescue StandardError => e
  OpenTelemetry.logger.warn("[Tracelit] failed to install Sidekiq middleware: #{e.message}")
end

.meter ⇒ Object
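
Returns the module's OpenTelemetry meter, or nil until setup has run. Useful for creating instruments the helpers above do not cover, e.g. (assuming the installed metrics API provides create_up_down_counter):

Tracelit::Metrics.meter&.create_up_down_counter("jobs.in_flight", unit: "{jobs}")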



# File 'lib/tracelit/metrics.rb', line 56

def self.meter
  @meter
end

.read_cpu_time_s(pid, linux) ⇒ Object

Returns cumulative CPU time (user + system) for this process in seconds. On Linux reads /proc/self/stat; on macOS/BSD falls back to ps %cpu which gives an instantaneous percentage instead (treated as fractional seconds over a 1-second window — good enough for a 30 s gauge).
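
A worked example at the conventional USER_HZ of 100: with utime = 250 and stime = 40 jiffies,

(250 + 40) / 100.0   # => 2.9 cumulative CPU-seconds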



# File 'lib/tracelit/metrics.rb', line 421

def self.read_cpu_time_s(pid, linux)
  if linux
    stat = begin
      File.read("/proc/self/stat")
    rescue
      return nil
    end
    # Format: pid (comm) state ppid ... utime stime ...
    # comm can contain spaces — find last ')' and split from there.
    after_comm = stat[stat.rindex(")").to_i + 1..]
    return nil unless after_comm

    fields = after_comm.split
    # After ')': state(0) ppid(1) ... utime(11) stime(12)
    utime = fields[11]&.to_i
    stime = fields[12]&.to_i
    return nil unless utime && stime

    # Jiffies at 100 Hz → seconds
    (utime + stime) / 100.0
  else
    # macOS/BSD: `ps` gives current CPU % directly.
    # Return it as a fractional "seconds per second" proxy so the
    # delta calculation above yields the right percentage.
    out = `ps -o %cpu= -p #{Integer(pid)} 2>/dev/null`.strip
    return nil if out.empty?
    out.to_f / 100.0
  end
end

.restart_pollers(config) ⇒ Object

Fix 5 (support): Called from the Process._fork hook in Instrumentation to restart background polling threads inside each forked Puma/Unicorn worker. The parent-process threads are dead in the child; this revives them.
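
A minimal sketch of such a hook (requires Ruby 3.1+ for Process._fork; Tracelit.config is assumed here and is not part of this module's API):

module TracelitForkHook
  def _fork
    pid = super
    # In the child the parent's poller threads no longer run; revive them.
    Tracelit::Metrics.restart_pollers(Tracelit.config) if pid.zero?
    pid
  end
end
Process.singleton_class.prepend(TracelitForkHook)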



# File 'lib/tracelit/metrics.rb', line 63

def self.restart_pollers(config)
  @connection_pool_poller_installed = false
  @memory_poller_installed          = false
  @cpu_poller_installed             = false
  install_connection_pool_poller if defined?(::ActiveRecord)
  install_memory_poller
  install_cpu_poller
rescue StandardError => e
  OpenTelemetry.logger.warn("[Tracelit] failed to restart pollers after fork: #{e.message}")
end

.setup(config) ⇒ Object

Sets up the OpenTelemetry MeterProvider with an OTLP metrics exporter. Called once from Instrumentation.setup after trace setup.
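
A minimal usage sketch. Any object exposing the four readers used below will do; the Struct here is illustrative, not the gem's actual config class:

config = Struct.new(:endpoint, :api_key, :resolved_service_name, :environment)
               .new("https://collector.example.com", "secret", "my-app", "production")
Tracelit::Metrics.setup(config)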



# File 'lib/tracelit/metrics.rb', line 16

def self.setup(config)
  exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new(
    endpoint: "#{config.endpoint}/v1/metrics",
    headers: {
      "Authorization"  => "Bearer #{config.api_key}",
      "X-Service-Name" => config.resolved_service_name,
      "X-Environment"  => config.environment,
    }
  )

  reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
    exporter: exporter,
    export_interval_millis: 60_000,
    export_timeout_millis:  10_000
  )

  tp = OpenTelemetry.tracer_provider
  resource = tp.respond_to?(:resource) ? tp.resource : OpenTelemetry::SDK::Resources::Resource.create({})

  provider = OpenTelemetry::SDK::Metrics::MeterProvider.new(
    resource: resource
  )
  provider.add_metric_reader(reader)

  OpenTelemetry.meter_provider = provider

  @meter = provider.meter(
    config.resolved_service_name,
    version: Tracelit::VERSION
  )

  install_rails_subscriber        if defined?(::Rails)
  install_sidekiq_middleware      if defined?(::Sidekiq)
  install_connection_pool_poller  if defined?(::ActiveRecord)
  install_memory_poller
  install_cpu_poller
rescue StandardError => e
  OpenTelemetry.logger.warn("[Tracelit] failed to set up metrics: #{e.message}")
end