Module: Rubino::Tools::FuzzyMatch
- Defined in:
- lib/rubino/tools/fuzzy_match.rb
Overview
Fuzzy fallback for the edit/multi_edit tools, ported from the `pi` coding agent’s edit-diff (the single biggest edit success-rate lever).
The edit tools first try a byte-EXACT match. On a miss, the model’s `old_string` usually drifted from the on-disk bytes in a few predictable, cosmetic ways while it retyped the snippet: smart quotes for ASCII quotes, en/em dashes for hyphens, exotic spaces for plain spaces, trailing whitespace, or a Unicode form difference. This module NORMALIZES both the file content and the needle so those drifts collapse to a canonical form, then LOCATES the normalized needle in the normalized content.
CRITICAL: normalization is used ONLY to locate the span. The returned offsets index back into the ORIGINAL bytes, so the caller splices the replacement into the untouched original buffer — the normalized text is NEVER written to disk. Matches are aligned to original-character boundaries, so a length-changing normalization (e.g. NFKC ligature expansion) can never cause a mid-character splice.
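The locate-then-splice idea above can be sketched in a few lines. This is a simplified illustration, not the module's code: `QUOTES` and `locate_and_replace` are hypothetical names, only curly quotes are normalized, and every original character maps to exactly one normalized character.

```ruby
# Hypothetical, simplified illustration: normalize only to FIND the span,
# then splice the replacement into the untouched original bytes.
QUOTES = { "\u2018" => "'", "\u2019" => "'", "\u201C" => '"', "\u201D" => '"' }.freeze

def locate_and_replace(original, needle, replacement)
  chars = original.chars
  # Byte offset of each character in the ORIGINAL string.
  offsets = []
  pos = 0
  chars.each do |c|
    offsets << pos
    pos += c.bytesize
  end
  offsets << pos # sentinel: end of buffer

  normalized  = chars.map { |c| QUOTES.fetch(c, c) }.join
  norm_needle = needle.chars.map { |c| QUOTES.fetch(c, c) }.join
  idx = normalized.index(norm_needle)
  return nil unless idx

  start_byte = offsets[idx]
  end_byte   = offsets[idx + norm_needle.length]
  # The normalized text is discarded here; only original bytes are kept.
  original.byteslice(0, start_byte) + replacement +
    original.byteslice(end_byte, original.bytesize - end_byte)
end

source = "puts \u201Chello\u201D # greet \u2014 once\n"
result = locate_and_replace(source, 'puts "hello"', 'puts "goodbye"')
# The em dash outside the matched span survives unchanged in `result`.
```

Because the offsets index the original buffer, characters outside the matched span (here the em dash in the comment) are never rewritten.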
Constant Summary
- SMART_QUOTES =
Smart/curly quotes → ASCII. Single: ‘ ’ ‚ ‛ → ' Double: “ ” „ ‟ → "
{ "‘" => "'", "’" => "'", "‚" => "'", "‛" => "'", "“" => '"', "”" => '"', "„" => '"', "‟" => '"' }.freeze
- DASHES =
En dash, em dash, figure dash, horizontal bar, minus sign, non-breaking hyphen → ASCII hyphen-minus.
{ "–" => "-", "—" => "-", "‒" => "-", "―" => "-", "−" => "-", "‑" => "-" }.freeze
- EXOTIC_SPACES =
Various non-ASCII spaces → a regular ASCII space. Covers NBSP, the en/em/thin/hair quad family, narrow NBSP, ideographic space, etc.
[ " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " " ].freeze
- EXOTIC_SPACES_RE =
Regexp.union(EXOTIC_SPACES).freeze
- UTF8_WIDTHS =
[1, 2, 3, 4].freeze
Class Method Summary
- .each_char_utf8(buf) ⇒ Object
Yields/collects each UTF-8 character of a binary buffer as [char_string, byte_offset, byte_length].
- .find_spans(content, needle) ⇒ Object
Locate `needle` inside `content` using fuzzy normalization, returning the matching spans as ORIGINAL byte ranges.
- .next_char(buf, off) ⇒ Object
Returns the next whole UTF-8 character (as a BINARY substring) starting at byte `off`, or the single raw byte when the bytes there are not a valid UTF-8 sequence.
- .normalize_char(char) ⇒ Object
Canonicalizes a single character (or short cluster) the way the model is likely to have re-typed it.
- .normalize_plain(buf) ⇒ Object
Normalize a buffer for plain comparison (the needle side), with no map.
- .normalize_with_map(buf) ⇒ Object
Normalize a buffer AND build a parallel map: for each normalized-output character index, the [byte_offset, byte_length] of the ORIGINAL character it came from.
- .space_like?(char) ⇒ Boolean
A space, tab, or exotic space (anything we’d canonicalize to a space), but NOT a newline/carriage return (those delimit lines).
- .span_for(byte_map, idx, length) ⇒ Object
Map a normalized [idx, length) match back to an original byte span, but ONLY when it aligns to original-character boundaries: the first matched normalized char must be the FIRST normalized char emitted by its original char, and the char just past the match must start a NEW original char (or be end-of-content).
- .splice(content, spans, new_bytes) ⇒ Object
Splices `new_bytes` into `content` at each [start, end) byte span returned by .find_spans, back-to-front so earlier offsets stay valid as the buffer length changes.
- .trailing_whitespace?(chars, char, idx) ⇒ Boolean
True when `char` (at index `idx`) is whitespace AND every char up to the next newline / end-of-buffer is also whitespace, i.e. trailing-of-line whitespace that the model likely dropped.
Class Method Details
.each_char_utf8(buf) ⇒ Object
Yields/collects each UTF-8 character of a binary buffer as [char_string, byte_offset, byte_length]. An invalid byte is surfaced as a single-byte “character” carrying its own raw byte, so the byte offsets stay aligned with the ORIGINAL buffer and an invalid byte simply never matches a (valid) normalized needle.
# File 'lib/rubino/tools/fuzzy_match.rb', line 185
def each_char_utf8(buf)
  return enum_for(:each_char_utf8, buf) unless block_given?

  off = 0
  len = buf.bytesize
  while off < len
    chunk = next_char(buf, off)
    yield chunk.force_encoding(Encoding::UTF_8), off, chunk.bytesize
    off += chunk.bytesize
  end
end
.find_spans(content, needle) ⇒ Object
Locate `needle` inside `content` using fuzzy normalization, returning the matching spans as ORIGINAL byte ranges.
`content` and `needle` are binary (ASCII-8BIT) buffers carrying UTF-8 bytes, exactly as the edit tools hold them. Returns ALL non-overlapping match locations as an Array of [start_byte, end_byte] ranges (end exclusive), in order — the caller decides whether 0 (not found), 1 (apply), or >1 (ambiguous, unless replace_all) is acceptable.
Returns nil for an empty/blank-after-normalization needle so the caller never tries to splice at every boundary.
# File 'lib/rubino/tools/fuzzy_match.rb', line 75
def find_spans(content, needle)
  norm_content, byte_map = normalize_with_map(content)
  norm_needle = normalize_plain(needle)
  return nil if norm_needle.empty?

  spans = []
  search_from = 0
  while (idx = norm_content.index(norm_needle, search_from))
    span = span_for(byte_map, idx, norm_needle.length)
    # nil span ⇒ the match straddles an original-character boundary
    # (length-changing normalization); skip it rather than splice
    # mid-character.
    spans << span if span
    # Advance past this match so overlapping matches aren't double-counted
    # (use the normalized end; for a nil/straddling span, advance by one).
    search_from = span ? idx + norm_needle.length : idx + 1
  end
  spans
end
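The description leaves the 0 / 1 / >1 decision to the caller. A minimal sketch of such a policy follows; `classify_spans` and its `replace_all:` keyword are hypothetical names used only for illustration, and treating a nil result (blank needle) as "not found" is an assumption, not something the module prescribes.

```ruby
# Hypothetical caller-side policy. `spans` is whatever find_spans returned:
# nil (blank needle), [] (no match), or an Array of [start_byte, end_byte].
def classify_spans(spans, replace_all: false)
  return :not_found if spans.nil? || spans.empty?
  return :apply if spans.length == 1 || replace_all

  :ambiguous
end

blank_needle = classify_spans(nil)
single_match = classify_spans([[10, 24]])
two_matches  = classify_spans([[10, 24], [40, 54]])
replace_both = classify_spans([[10, 24], [40, 54]], replace_all: true)
```

Only the `:apply` outcome would go on to call `splice`; the other two would be reported back to the model as an edit failure.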
.next_char(buf, off) ⇒ Object
Returns the next whole UTF-8 character (as a BINARY substring) starting at byte `off`, or the single raw byte when the bytes there are not a valid UTF-8 sequence.
# File 'lib/rubino/tools/fuzzy_match.rb', line 200
def next_char(buf, off)
  UTF8_WIDTHS.each do |width|
    slice = buf.byteslice(off, width)
    return slice if slice && slice.bytesize == width &&
                    slice.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end
  buf.byteslice(off, 1)
end
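To see how the width probe behaves on mixed input, the sketch below restates it as a standalone function (`probe_char` is a hypothetical name) and walks a binary buffer containing a 1-byte, 2-byte and 3-byte character followed by a stray lead byte.

```ruby
# Illustrative restatement of the width probe, not the module itself.
def probe_char(buf, off)
  (1..4).each do |width|
    slice = buf.byteslice(off, width)
    next unless slice && slice.bytesize == width
    return slice if slice.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end
  buf.byteslice(off, 1) # invalid here: surface one raw byte
end

# "a", U+00E9 (2 bytes), U+2014 (3 bytes), a lone 0xC3 lead byte, then "z".
buf = "a\u00E9\u2014".b + "\xC3".b + "z".b

widths = []
off = 0
while off < buf.bytesize
  chunk = probe_char(buf, off)
  widths << chunk.bytesize
  off += chunk.bytesize
end
# widths is [1, 2, 3, 1, 1]: the stray 0xC3 is consumed as a single byte.
```

The lone `0xC3` cannot pair with the following `z`, so it falls through to the one-byte fallback and the walk stays aligned with the original offsets.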
.normalize_char(char) ⇒ Object
Canonicalizes a single character (or short cluster) the way the model is likely to have re-typed it. Operates per-character so the caller can keep a normalized-char → original-byte map.
# File 'lib/rubino/tools/fuzzy_match.rb', line 50
def normalize_char(char)
  # An invalid-UTF-8 byte (e.g. a lone \xC3 or Latin-1 \xE9) can never be a
  # smart quote/dash/exotic space, and unicode_normalize/gsub would raise
  # on it. Pass it through verbatim so the char→original-byte map stays 1:1
  # and the invalid byte is written back unchanged.
  return char unless char.valid_encoding?

  char = char.unicode_normalize(:nfkc)
  char = char.gsub(EXOTIC_SPACES_RE, " ")
  SMART_QUOTES.each { |from, to| char = char.gsub(from, to) }
  DASHES.each { |from, to| char = char.gsub(from, to) }
  char
end
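A small sketch of the same order of operations (validity guard, NFKC, then table lookup) may help show what each step contributes. `SKETCH_MAP` and `sketch_normalize` are hypothetical stand-ins with a three-entry table, not the module's constants.

```ruby
# Illustrative subset of the replacement tables.
SKETCH_MAP = { "\u2019" => "'", "\u201C" => '"', "\u2014" => "-" }.freeze

def sketch_normalize(char)
  # Invalid bytes pass through untouched, as in the listing above.
  return char unless char.valid_encoding?

  nfkc = char.unicode_normalize(:nfkc)
  SKETCH_MAP.fetch(nfkc, nfkc)
end

nbsp     = sketch_normalize("\u00A0") # NFKC alone turns NBSP into " "
ligature = sketch_normalize("\uFB01") # one char in, two chars ("fi") out
em_dash  = sketch_normalize("\u2014") # NFKC leaves it; the table maps it
raw_byte = sketch_normalize("\xE9")   # invalid UTF-8, returned as-is
```

The ligature case is the length-changing one: a single original character yields two normalized characters, which is why the caller keeps a per-character map.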
.normalize_plain(buf) ⇒ Object
Normalize a buffer for plain comparison (the needle side), with no map.
# File 'lib/rubino/tools/fuzzy_match.rb', line 109
def normalize_plain(buf)
  normalize_with_map(buf).first
end
.normalize_with_map(buf) ⇒ Object
Normalize a buffer AND build a parallel map: for each normalized-output character index, the [byte_offset, byte_length] of the ORIGINAL character it came from. One original char can emit 0..n normalized chars; all of them point back at the same original span.
TRAILING whitespace is stripped per line (matching pi): a run of whitespace whose next non-whitespace char is a newline (or end of buffer) emits nothing. Leading/interior indentation is preserved so the snippet still has to line up structurally.
# File 'lib/rubino/tools/fuzzy_match.rb', line 122
def normalize_with_map(buf)
  chars = each_char_utf8(buf).to_a
  out = +""
  map = []
  chars.each_with_index do |(char, off, len), idx|
    next if trailing_whitespace?(chars, char, idx)

    normalize_char(char).each_char do |nc|
      out << nc
      map << [off, len]
    end
  end
  [out, map]
end
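The shape of the parallel map is easiest to see on a length-changing character. The sketch below is a reduced stand-in (`nfkc_with_map` is a hypothetical name) that applies NFKC only, with no quote, dash, or trailing-whitespace handling, and takes a regular UTF-8 string rather than a binary buffer.

```ruby
# Illustrative sketch: NFKC only, one map entry per normalized character.
def nfkc_with_map(text)
  out = +""
  map = []
  off = 0
  text.each_char do |char|
    len = char.bytesize
    char.unicode_normalize(:nfkc).each_char do |nc|
      out << nc
      map << [off, len] # every emitted char points at its ORIGINAL span
    end
    off += len
  end
  [out, map]
end

# U+FB01 is the "fi" ligature: 3 bytes in UTF-8, 2 chars after NFKC.
norm, map = nfkc_with_map("a\uFB01x")
# norm is "afix"
# map  is [[0, 1], [1, 3], [1, 3], [4, 1]]
```

Both `f` and `i` carry the same `[1, 3]` entry, which is what lets a later boundary check detect a match that begins or ends inside the ligature.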
.space_like?(char) ⇒ Boolean
A space, tab, or exotic space (anything we’d canonicalize to a space), but NOT a newline/carriage return (those delimit lines).
# File 'lib/rubino/tools/fuzzy_match.rb', line 153
def space_like?(char)
  char == " " || char == "\t" || EXOTIC_SPACES.include?(char)
end
.span_for(byte_map, idx, length) ⇒ Object
Map a normalized [idx, length) match back to an original byte span, but ONLY when it aligns to original-character boundaries:
- the first matched normalized char must be the FIRST normalized char
emitted by its original char, and
- the char just past the match must start a NEW original char (or be
end-of-content).
Otherwise return nil (straddles a multi-char expansion).
# File 'lib/rubino/tools/fuzzy_match.rb', line 164
def span_for(byte_map, idx, length)
  first_off, = byte_map[idx]
  last_off, last_len = byte_map[idx + length - 1]

  # Boundary at the start: previous normalized char (if any) came from a
  # DIFFERENT original char.
  return nil if idx.positive? && byte_map[idx - 1][0] == first_off

  # Boundary at the end: next normalized char (if any) came from a
  # DIFFERENT original char.
  next_idx = idx + length
  return nil if next_idx < byte_map.length && byte_map[next_idx][0] == last_off

  [first_off, last_off + last_len]
end
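A worked example of the two boundary checks, using a hand-built map. The map below is what one would expect for an original text of `a`, the 3-byte `fi` ligature (U+FB01), and `x`, whose normalized form is `afix`; `aligned_span` is a hypothetical restatement of the checks for illustration.

```ruby
# Normalized "afix": index 1 ("f") and index 2 ("i") share original span [1, 3].
byte_map = [[0, 1], [1, 3], [1, 3], [4, 1]]

# Illustrative restatement of the start/end boundary checks.
def aligned_span(map, idx, length)
  first_off, = map[idx]
  last_off, last_len = map[idx + length - 1]
  # Start must not share an original char with the previous normalized char.
  return nil if idx.positive? && map[idx - 1][0] == first_off

  # End must not share an original char with the next normalized char.
  nxt = idx + length
  return nil if nxt < map.length && map[nxt][0] == last_off

  [first_off, last_off + last_len]
end

whole_ligature = aligned_span(byte_map, 1, 2) # "fi" => [1, 4]
ends_inside    = aligned_span(byte_map, 0, 2) # "af" => nil
starts_inside  = aligned_span(byte_map, 2, 2) # "ix" => nil
whole_buffer   = aligned_span(byte_map, 0, 4) # "afix" => [0, 5]
```

A needle such as `af` matches the normalized text but would require cutting the ligature in half in the original bytes, so it is rejected rather than spliced.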
.splice(content, spans, new_bytes) ⇒ Object
Splices `new_bytes` into `content` at each [start, end) byte span returned by .find_spans, back-to-front so earlier offsets stay valid as the buffer length changes. `content`/`new_bytes` are binary buffers.
# File 'lib/rubino/tools/fuzzy_match.rb', line 98
def splice(content, spans, new_bytes)
  out = content.dup
  spans.sort_by(&:first).reverse_each do |(start, finish)|
    out[start...finish] = new_bytes
  end
  out
end
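The reason for the back-to-front order shows up as soon as the replacement differs in length from the span. The following sketch applies the same two spans in both orders on a small binary buffer; the buffer and span values are made up for illustration.

```ruby
content   = "aXbXc".b
spans     = [[1, 2], [3, 4]] # the two "X" bytes
new_bytes = "LONG".b

# Back-to-front: later spans are replaced first, so earlier offsets
# still describe the bytes they were computed for.
back = content.dup
spans.sort_by(&:first).reverse_each { |(start, finish)| back[start...finish] = new_bytes }

# Front-to-back with the same offsets: the first replacement shifts the
# buffer by 3 bytes, so the second span now points into "LONG".
front = content.dup
spans.sort_by(&:first).each { |(start, finish)| front[start...finish] = new_bytes }

# back  is "aLONGbLONGc"
# front is "aLOLONGGbXc"
```

An alternative would be to adjust each later offset by the accumulated length delta, but reversing the order avoids that bookkeeping entirely.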
.trailing_whitespace?(chars, char, idx) ⇒ Boolean
True when `char` (at index `idx`) is whitespace AND every char up to the next newline / end-of-buffer is also whitespace — i.e. trailing-of-line whitespace that the model likely dropped. The newline itself is kept.
# File 'lib/rubino/tools/fuzzy_match.rb', line 140
def trailing_whitespace?(chars, char, idx)
  return false unless space_like?(char)

  ((idx + 1)...chars.length).each do |j|
    nxt = chars[j][0]
    return true if nxt == "\n"
    return false unless space_like?(nxt)
  end
  true # ran to end of buffer
end