Module: Clacky::Utils::StringMatcher
Defined in: lib/clacky/utils/string_matcher.rb
Overview
Utilities for finding and matching strings in file content. Used by the Edit tool and edit preview to apply a consistent layered matching strategy: exact → trim → unescape → smart line match.
Class Method Summary
- .count_occurrences(haystack, needle) ⇒ Object
  Count non-overlapping occurrences of `needle` in `haystack` without going through Regexp (safer on mixed-encoding strings and avoids an extra escape step).
- .find_match(content, old_string) ⇒ Hash?
  Find a matching string in content using a layered strategy.
- .generate_candidates(old_string) ⇒ Array<String>
  Generate candidate strings by applying different transformations.
- .lines_match_normalized?(lines1, lines2) ⇒ Boolean
  Compare two arrays of lines after normalising leading whitespace.
- .try_smart_match(content, old_string) ⇒ Hash?
  Try smart line-by-line matching that tolerates leading whitespace differences.
- .unescape_over_escaped(str) ⇒ String
  Convert over-escaped sequences back to their real characters.
Class Method Details
.count_occurrences(haystack, needle) ⇒ Object
Count non-overlapping occurrences of `needle` in `haystack` without going through Regexp (safer on mixed-encoding strings and avoids an extra escape step).
# File 'lib/clacky/utils/string_matcher.rb', line 52

def self.count_occurrences(haystack, needle)
  return 0 if needle.empty?
  count = 0
  offset = 0
  while (idx = haystack.index(needle, offset))
    count += 1
    offset = idx + needle.length
  end
  count
end
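The following sketch illustrates what "non-overlapping" and "without Regexp" mean in practice. It uses a standalone top-level copy of the method (an assumption made only so the snippet runs without loading the gem).

```ruby
# Standalone copy of count_occurrences, defined at top level for illustration.
def count_occurrences(haystack, needle)
  return 0 if needle.empty?

  count = 0
  offset = 0
  while (idx = haystack.index(needle, offset))
    count += 1
    offset = idx + needle.length # jump past the whole match, so matches never overlap
  end
  count
end

overlapping = count_occurrences("aaaa", "aa")  # => 2 (an overlapping count would be 3)
literal_dot = count_occurrences("a.b.c", ".")  # => 2 ("." is a literal, not a wildcard)
empty       = count_occurrences("abc", "")     # => 0 (empty needle is short-circuited)
```

Because the needle is passed to `String#index` as a plain String, characters such as `.` or `*` need no escaping.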
.find_match(content, old_string) ⇒ Hash?
Find a matching string in content using a layered strategy.
Strategy (applied in order):
1. Exact match (original old_string)
2. Trimmed match (leading/trailing whitespace stripped)
3. Unescaped match (over-escaped sequences normalised)
4. Combined trim + unescape
5. Smart line-by-line match (tolerates indent differences)
# File 'lib/clacky/utils/string_matcher.rb', line 22

def self.find_match(content, old_string)
  # Defensive: if either side contains invalid UTF-8 bytes (binary files,
  # mixed-encoding content, etc.), Regexp#scan / String#include? with a
  # UTF-8-tagged candidate can raise `ArgumentError: invalid byte sequence
  # in UTF-8`. Scrub once at the entry point so every matching layer,
  # including callers like the edit preview, is safe.
  content = Clacky::Utils::Encoding.to_utf8(content) unless content.nil?
  old_string = Clacky::Utils::Encoding.to_utf8(old_string) unless old_string.nil?

  candidates = generate_candidates(old_string)

  # Simple string matching for each candidate
  candidates.each do |candidate|
    next if candidate.empty?

    if content.include?(candidate)
      return {
        matched_string: candidate,
        occurrences: count_occurrences(content, candidate)
      }
    end
  end

  # Fall back to smart line-by-line matching (tabs vs spaces, etc.)
  try_smart_match(content, old_string)
end
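As a rough illustration of layers 1 to 4, the sketch below shows which candidate wins for a few inputs. `layered_match` is a hypothetical, trimmed-down stand-in, not the library method: it uses `String#scrub` in place of `Clacky::Utils::Encoding.to_utf8` (whose behaviour is not shown on this page), unescapes only `\n` and `\t`, and omits the smart line-by-line fallback.

```ruby
# Hypothetical stand-in for find_match covering layers 1-4 only.
def layered_match(content, old_string)
  content = content.scrub       # assumption: stands in for Encoding.to_utf8
  old_string = old_string.scrub

  # Reduced unescape step: only \n and \t are handled in this sketch.
  unescape = ->(s) { s.gsub('\\n', "\n").gsub('\\t', "\t") }
  trimmed = old_string.strip
  candidates = [old_string, trimmed, unescape.call(old_string), unescape.call(trimmed)].uniq

  candidates.each do |candidate|
    next if candidate.empty?
    next unless content.include?(candidate)

    # String#scan with a String pattern matches literally and without overlap.
    return { matched_string: candidate, occurrences: content.scan(candidate).length }
  end
  nil # the real method would fall back to try_smart_match here
end

content = "def hello\n  puts \"hi\"\nend\n"

exact     = layered_match(content, "puts \"hi\"")              # layer 1: exact
trimmed   = layered_match(content, "\n\n  puts \"hi\"\n\n")    # layer 2: trimmed
unescaped = layered_match(content, 'def hello\n  puts')        # layer 3: literal \n unescaped
missing   = layered_match(content, "puts 'bye'")               # no layer matches
```

Note that `matched_string` is the candidate that matched, so for the trimmed case it is the stripped text rather than the original `old_string`.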
.generate_candidates(old_string) ⇒ Array<String>
Generate candidate strings by applying different transformations.
# File 'lib/clacky/utils/string_matcher.rb', line 67

def self.generate_candidates(old_string)
  trimmed = old_string.strip
  unescaped = unescape_over_escaped(old_string)
  unescaped_trimmed = unescape_over_escaped(trimmed)

  [
    old_string,         # Original
    trimmed,            # Trim leading/trailing whitespace
    unescaped,          # Unescape over-escaped sequences
    unescaped_trimmed   # Combined: trim + unescape
  ].uniq
end
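To make the candidate order concrete, here is a small sketch with top-level copies of the method. The `unescape_over_escaped` used here is a reduced stand-in that handles only `\n` and `\t`, which is an assumption made to keep the example short.

```ruby
# Reduced stand-in for unescape_over_escaped (only \n and \t), for illustration.
def unescape_over_escaped(str)
  str.gsub('\\n', "\n").gsub('\\t', "\t")
end

# Copy of generate_candidates as listed above.
def generate_candidates(old_string)
  trimmed = old_string.strip
  unescaped = unescape_over_escaped(old_string)
  unescaped_trimmed = unescape_over_escaped(trimmed)

  [old_string, trimmed, unescaped, unescaped_trimmed].uniq
end

# Single quotes: the input holds a literal backslash followed by "n".
padded = generate_candidates('  foo\nbar  ')
# padded[0] is the original, padded[1] is trimmed,
# padded[2] is unescaped, padded[3] is trimmed and unescaped.

plain = generate_candidates("foo")
# All four transformations give "foo", so uniq leaves a single candidate.
```

The `uniq` call means later matching layers never retry a string that an earlier layer already tested.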
.lines_match_normalized?(lines1, lines2) ⇒ Boolean
Compare two arrays of lines after normalising leading whitespace.
# File 'lib/clacky/utils/string_matcher.rb', line 146

def self.lines_match_normalized?(lines1, lines2)
  return false unless lines1.length == lines2.length

  lines1.zip(lines2).all? do |line1, line2|
    norm1 = line1.sub(/^\s+/, " ").chomp
    norm2 = line2.sub(/^\s+/, " ").chomp

    norm1 == norm2 || norm1 == unescape_over_escaped(norm2)
  end
end
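A short sketch of the comparison, using a top-level copy of the method as listed and a reduced stand-in for `unescape_over_escaped` (only `\n` and `\t`, an assumption for brevity). It shows tab-indented lines matching space-indented ones, and the unescape fallback being applied to the second argument only.

```ruby
# Reduced stand-in for unescape_over_escaped (only \n and \t), for illustration.
def unescape_over_escaped(str)
  str.gsub('\\n', "\n").gsub('\\t', "\t")
end

# Copy of lines_match_normalized? as listed above.
def lines_match_normalized?(lines1, lines2)
  return false unless lines1.length == lines2.length

  lines1.zip(lines2).all? do |line1, line2|
    norm1 = line1.sub(/^\s+/, " ").chomp
    norm2 = line2.sub(/^\s+/, " ").chomp
    norm1 == norm2 || norm1 == unescape_over_escaped(norm2)
  end
end

tabs   = ["\tif ok\n", "\t\trun\n"]
spaces = ["  if ok\n", "      run\n"]
indent_ok = lines_match_normalized?(tabs, spaces)            # tabs vs spaces

real_tab    = ["  x = \"a\tb\"\n"]          # contains an actual tab character
escaped_tab = ['  x = "a\tb"' + "\n"]       # contains a literal backslash + t
escape_ok = lines_match_normalized?(real_tab, escaped_tab)   # fallback unescapes lines2

length_mismatch = lines_match_normalized?(tabs, spaces.first(1))
```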
.try_smart_match(content, old_string) ⇒ Hash?
Try smart line-by-line matching that tolerates leading whitespace differences.
# File 'lib/clacky/utils/string_matcher.rb', line 109

def self.try_smart_match(content, old_string)
  candidates = generate_candidates(old_string)

  candidates.each do |candidate|
    next if candidate.empty?

    candidate_lines = candidate.lines
    next if candidate_lines.empty?

    content_lines = content.lines
    matches = []

    (0..content_lines.length - candidate_lines.length).each do |start_idx|
      slice = content_lines[start_idx, candidate_lines.length]
      next unless slice

      if lines_match_normalized?(slice, candidate_lines)
        matches << { start: start_idx, matched_string: slice.join }
      end
    end

    unless matches.empty?
      return {
        matched_string: matches.first[:matched_string],
        occurrences: matches.length
      }
    end
  end

  nil
end
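The sliding-window idea can be sketched for a single candidate as follows. Both helpers are hypothetical simplifications: `same_modulo_indent?` drops the unescape fallback, and `smart_match` skips candidate generation.

```ruby
# Hypothetical stand-in for lines_match_normalized? without the unescape fallback.
def same_modulo_indent?(lines1, lines2)
  return false unless lines1.length == lines2.length

  lines1.zip(lines2).all? do |a, b|
    a.sub(/^\s+/, " ").chomp == b.sub(/^\s+/, " ").chomp
  end
end

# Slide a window of candidate_lines.length lines over the content.
def smart_match(content, candidate)
  content_lines = content.lines
  candidate_lines = candidate.lines
  return nil if candidate_lines.empty?

  matches = []
  (0..content_lines.length - candidate_lines.length).each do |start_idx|
    slice = content_lines[start_idx, candidate_lines.length]
    matches << slice.join if same_modulo_indent?(slice, candidate_lines)
  end
  matches.empty? ? nil : { matched_string: matches.first, occurrences: matches.length }
end

content = "def a\n\tputs 1\nend\ndef b\n\tputs 1\nend\n"
result = smart_match(content, "  puts 1\n")
# result[:matched_string] is the tab-indented text taken from the content,
# and result[:occurrences] counts both windows that matched.

too_long = smart_match("one\n", "one\ntwo\n") # window larger than content
```

The returned `matched_string` is the slice from `content`, not the candidate, so a caller can substitute the text exactly as it appears in the file.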
.unescape_over_escaped(str) ⇒ String
Convert over-escaped sequences back to their real characters. This handles the common case where LLMs double-escape backslashes.
# File 'lib/clacky/utils/string_matcher.rb', line 85

def self.unescape_over_escaped(str)
  result = str.dup

  # Unicode escapes: \uXXXX → actual Unicode character
  result = result.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }

  # Common escape sequences
  result = result.gsub('\\n', "\n")
  result = result.gsub('\\t', "\t")
  result = result.gsub('\\r', "\r")
  result = result.gsub('\\f', "\f")
  result = result.gsub('\\b', "\b")
  result = result.gsub('\\v', "\v")
  result = result.gsub('\\"', '"')
  result = result.gsub('\\\\', "\\")

  result
end
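A few inputs and their results, using a top-level copy of the method as listed (defined standalone only so the snippet runs without the gem). The inputs are single-quoted so that each backslash is a real character in the string.

```ruby
# Copy of unescape_over_escaped as listed above.
def unescape_over_escaped(str)
  result = str.dup
  result = result.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
  result = result.gsub('\\n', "\n")
  result = result.gsub('\\t', "\t")
  result = result.gsub('\\r', "\r")
  result = result.gsub('\\f', "\f")
  result = result.gsub('\\b', "\b")
  result = result.gsub('\\v', "\v")
  result = result.gsub('\\"', '"')
  result = result.gsub('\\\\', "\\")
  result
end

unicode = unescape_over_escaped('caf\u00e9')    # literal \u00e9 becomes "é"
tabbed  = unescape_over_escaped('a\tb')         # literal \t becomes a tab
quoted  = unescape_over_escaped('say \"hi\"')   # literal \" becomes "
slashes = unescape_over_escaped('x\\\\y')       # two backslashes collapse to one

# Order matters: the \n rule runs before the double-backslash rule, so an
# input of backslash, backslash, "n" ends up as backslash + newline.
ordering = unescape_over_escaped('\\\\n')
```

The last case suggests that content which legitimately contains an escaped backslash followed by `n`, `t`, `b` and so on may be altered by this step; since the unescaped string is only one of several candidates, the original is still tried first.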