Module: EzLogsAgent::BulkSqlParser

Defined in:
lib/ez_logs_agent/bulk_sql_parser.rb

Overview

Pure-functional parser for the four ActiveRecord bulk operations that bypass per-row callbacks: delete_all, update_all, insert_all, upsert_all. Used by BulkDatabaseCapturer to turn an AS::Notifications “sql.active_record” payload into a structured wire shape the server can humanize.

## What it extracts

For delete_all:

{ operation: :delete_all, where_template: "status = $1", where_binds: [{column:, value:}] }

For update_all:

{ operation: :update_all, set: {"status" => "paid"}, where_template:, where_binds: }

For insert_all / upsert_all:

{ operation: :insert_all|:upsert_all, columns: ["name", "email"] }
(No values — per the product decision; column SHAPE only.)

For anything else (subqueries, joins, raw SQL the regex can’t parse, malformed binds): returns { unparseable: true }. BulkDatabaseCapturer falls back to shipping row_count + operation + model_class with no template / binds / set — the timeline still reads “Bulk delete: 50,000 orders” minus the WHERE detail.

## Why regex (not Arel / pg_query)

Both Arel and pg_query are adapter-specific and either slow or heavy to add as a runtime dependency on every customer host app. Regex on the standardized AR-emitted SQL string is fast (sub-millisecond on typical statements) and adapter-tolerant. The graceful “unparseable” branch covers the long tail.

## Adapter quoting handled

  • PostgreSQL: “orders”.“status” = $1 (double-quoted, $N placeholders)

  • SQLite: “orders”.“status” = ? (double-quoted, ? placeholders)

  • MySQL: ‘orders`.`status` = ? (backticks, ? placeholders)

Constant Summary collapse

IDENTIFIER =

Loose match for an identifier wrapped in any of the three quote styles AR uses. Captures the unquoted name.

/["`]([^"`]+)["`]/.freeze
COLUMN =

Either a “qualified” or “bare” identifier reference, used in WHERE/SET clause columns. AR prefixes the table name in some paths and not in others — accept both. Captures just the column.

/(?:#{IDENTIFIER}\.)?#{IDENTIFIER}/.freeze
PLACEHOLDER =

A bind placeholder — Postgres uses $N (1-indexed), SQLite/MySQL use ?. The order of placeholders in the SQL corresponds 1:1 with the order of values in ‘binds` / `type_casted_binds`.

/(?:\$\d+|\?)/.freeze

Class Method Summary collapse

Class Method Details

.build_template_and_binds(where_sql, type_casted_binds) ⇒ Object

Given a WHERE clause SQL and the binds in placeholder order, walk the placeholders and pair each one with the column to its LEFT (the standard ‘“table”.“col” = $1` shape AR emits). Returns the template (placeholder-preserved) and the binds tagged with column.

Unrecognized shapes (subselects, joins, NULL checks without binds) leave the template intact but produce no column for the bind, in which case the bind ships with column: nil. The display layer handles nil-column binds.



253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 253

def build_template_and_binds(where_sql, type_casted_binds)
  return [nil, []] if where_sql.nil?

  template = where_sql
  binds = []
  bind_index = 0

  # Walk the template scanning each placeholder, looking backward
  # for the nearest column identifier to its left.
  template.scan(PLACEHOLDER) do
    match_data = Regexp.last_match
    next unless match_data

    column = nearest_column_left_of(template, match_data.begin(0))
    value = type_casted_binds[bind_index]
    binds << { column: column, value: value }
    bind_index += 1
  end

  [template, binds]
end

.detect_operation(sql) ⇒ Symbol?

Returns the symbolic operation name we expect downstream. Detected from SQL shape, independent of the ‘payload` Rails version variance (see plan §“insert_all/upsert_all payload :name varies”). Upsert detection requires the `ON CONFLICT` (PG) or `ON DUPLICATE KEY` (MySQL) clause — bare INSERTs are insert_all.

Returns:

  • (Symbol, nil)

    :delete_all, :update_all, :insert_all, :upsert_all, or nil if not a bulk op.



95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 95

def detect_operation(sql)
  return nil if sql.nil?

  sql_up = sql.lstrip.upcase

  return :delete_all if sql_up.start_with?("DELETE FROM ")
  return :update_all if sql_up.start_with?("UPDATE ")

  if sql_up.start_with?("INSERT INTO ")
    # Disambiguate insert_all vs upsert_all:
  # - insert_all on PG/SQLite emits `ON CONFLICT DO NOTHING` (still insert).
  # - upsert_all on PG/SQLite emits `ON CONFLICT ... DO UPDATE SET ...`.
  # - upsert_all on MySQL emits `ON DUPLICATE KEY UPDATE ...`.
    if (sql_up.include?("ON CONFLICT") && sql_up.include?("DO UPDATE")) ||
       sql_up.include?("ON DUPLICATE KEY")
      return :upsert_all
    end

    :insert_all
  end
end

.extract_insert_columns(sql) ⇒ Object

— INSERT INTO “users” (“name”,“email”) VALUES … — Extracts the ordered column list. Returns nil if the open paren column list isn’t present (e.g., INSERT … DEFAULT VALUES).



291
292
293
294
295
296
297
298
299
300
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 291

def extract_insert_columns(sql)
  # First parenthesized list after the table name.
  m = sql.match(/\AINSERT INTO\s+#{IDENTIFIER}\s*\(([^)]+)\)/i)
  return nil unless m

  columns_block = m[2]
  columns_block.split(",").map do |col|
    col.strip.gsub(/\A["`]|["`]\z/, "")
  end.reject(&:empty?)
end

.extract_set(sql) ⇒ Object

Returns the SET-clause SQL between “SET” and “WHERE”/end-of-string.



167
168
169
170
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 167

def extract_set(sql)
  match = sql.match(/\sSET\s+(.+?)(?:\s+WHERE\s+|\s+RETURNING\b|\z)/i)
  match && match[1].strip
end

.extract_where(sql) ⇒ Object

Returns the WHERE-clause SQL (without the keyword), or nil if absent. Stops at end-of-string, RETURNING, ORDER BY, LIMIT (AR rarely emits these on bulk DML but cheap insurance).



161
162
163
164
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 161

def extract_where(sql)
  match = sql.match(/\sWHERE\s+(.+?)(?:\s+(?:RETURNING|ORDER BY|LIMIT)\b|\z)/i)
  match && match[1].strip
end

.nearest_column_left_of(where_sql, pos) ⇒ Object

Finds the most recent column identifier to the left of ‘pos` in the WHERE clause. Used to attribute each bind to its column name so we can mask sensitive values by name downstream.



278
279
280
281
282
283
284
285
286
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 278

def nearest_column_left_of(where_sql, pos)
  prefix = where_sql[0...pos]
  # Look for the last identifier before the operator (=, <, >, etc.)
  match = prefix.match(/#{COLUMN}\s*(?:=|<>|!=|<=|>=|<|>|LIKE|IN|IS)\s*\z/i)
  return nil unless match

  # COLUMN has two capture groups: [table, col] or [_, col].
  match[2] || match[1]
end

.parse(sql:, type_casted_binds:) ⇒ Hash

Returns Always returns a Hash. Either the parsed structure (keys above) or ‘{ unparseable: true }`.

Parameters:

  • sql (String)

    Raw SQL string from ‘payload`.

  • type_casted_binds (Array)

    payload` —already-typecast values in placeholder order. May be nil for raw SQL paths.

Returns:

  • (Hash)

    Always returns a Hash. Either the parsed structure (keys above) or ‘{ unparseable: true }`.



63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 63

def parse(sql:, type_casted_binds:)
  return { unparseable: true } if sql.nil? || sql.empty?

  sql_stripped = sql.strip

  case sql_stripped
  when /\ADELETE FROM /i
    parse_delete(sql_stripped, type_casted_binds || [])
  when /\AUPDATE /i
    parse_update(sql_stripped, type_casted_binds || [])
  when /\AINSERT INTO /i
    parse_insert(sql_stripped)
  else
    { unparseable: true }
  end
rescue StandardError
  # If the regex engine, binds zipping, or any sub-parse step raises,
  # ship unparseable rather than crash the capture handler. The
  # BulkDatabaseCapturer's own rescue would also catch this, but
  # defending here means the rest of the capturer sees a uniform
  # return shape.
  { unparseable: true }
end

.parse_delete(sql, type_casted_binds) ⇒ Object

— DELETE FROM “orders” WHERE “orders”.“status” = $1 —



118
119
120
121
122
123
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 118

def parse_delete(sql, type_casted_binds)
  where_sql = extract_where(sql)
  template, binds = build_template_and_binds(where_sql, type_casted_binds)

  { operation: :delete_all, where_template: template, where_binds: binds }
end

.parse_insert(sql) ⇒ Object

— INSERT INTO “users” (“name”,“email”) VALUES (…), (…) —



148
149
150
151
152
153
154
155
156
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 148

def parse_insert(sql)
  operation = detect_operation(sql)
  return { unparseable: true } unless operation == :insert_all || operation == :upsert_all

  columns = extract_insert_columns(sql)
  return { unparseable: true } if columns.nil? || columns.empty?

  { operation: operation, columns: columns }
end

.parse_set_assignments(set_sql) ⇒ Object

SET clause is comma-separated ‘“col” = <placeholder|literal>` pairs. AR inlines non-symbol values as literals (no placeholder) — we still surface those in the result so the reader sees what changed. Returns [[col_name, placeholder_or_literal_str], …] + total bind count.



176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 176

def parse_set_assignments(set_sql)
  pairs = []
  bind_count = 0

  split_top_level_commas(set_sql).each do |assignment|
    m = assignment.match(/\A#{COLUMN}\s*=\s*(.+)\z/)
    next unless m

    col = m[2] || m[1]
    rhs = m[3].strip
    pairs << [col, rhs]
    bind_count += 1 if rhs.match?(PLACEHOLDER)
  end

  [pairs, bind_count]
end

.parse_update(sql, type_casted_binds) ⇒ Object

— UPDATE “orders” SET “status” = $1 WHERE “orders”.“status” = $2 — SET binds come first in placeholder order, then WHERE binds.



127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 127

def parse_update(sql, type_casted_binds)
  set_sql = extract_set(sql)
  where_sql = extract_where(sql)
  return { unparseable: true } if set_sql.nil?

  set_pairs, set_bind_count = parse_set_assignments(set_sql)
  set_binds = type_casted_binds.first(set_bind_count)
  where_binds_raw = type_casted_binds.drop(set_bind_count)

  set_hash = zip_set_values(set_pairs, set_binds)
  where_template, where_binds = build_template_and_binds(where_sql, where_binds_raw)

  {
    operation: :update_all,
    set: set_hash,
    where_template: where_template,
    where_binds: where_binds
  }
end

.split_top_level_commas(str) ⇒ Object

Splits on commas that are NOT inside (parens) or quotes. Bulk DML SET / WHERE almost never nests, but defensive against function calls like ‘coalesce(x, 0)`.



212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 212

def split_top_level_commas(str)
  result = []
  depth = 0
  in_quote = false
  quote_char = nil
  buffer = +""

  str.each_char do |ch|
    if in_quote
      buffer << ch
      in_quote = false if ch == quote_char
    elsif ch == "'" || ch == '"' || ch == "`"
      in_quote = true
      quote_char = ch
      buffer << ch
    elsif ch == "("
      depth += 1
      buffer << ch
    elsif ch == ")"
      depth -= 1
      buffer << ch
    elsif ch == "," && depth.zero?
      result << buffer.strip unless buffer.empty?
      buffer = +""
    else
      buffer << ch
    end
  end
  result << buffer.strip unless buffer.empty?
  result
end

.zip_set_values(set_pairs, set_binds) ⇒ Object

Walks set_pairs in order, taking the next bind for each placeholder RHS and pulling the literal text for non-placeholder RHS (e.g., ‘updated_at = ’2026-06-05 12:00:00’‘).



196
197
198
199
200
201
202
203
204
205
206
207
# File 'lib/ez_logs_agent/bulk_sql_parser.rb', line 196

def zip_set_values(set_pairs, set_binds)
  bind_idx = 0
  set_pairs.each_with_object({}) do |(col, rhs), acc|
    if rhs.match?(PLACEHOLDER)
      acc[col] = set_binds[bind_idx]
      bind_idx += 1
    else
      # Strip surrounding quotes to surface the actual value.
      acc[col] = rhs.gsub(/\A['"]|['"]\z/, "")
    end
  end
end