Module: Clacky::Utils::StringMatcher

Defined in:
lib/clacky/utils/string_matcher.rb

Overview

Utilities for finding and matching strings in file content. Used by the Edit tool and edit preview to apply a consistent layered matching strategy: exact → trim → unescape → smart line match.

Class Method Details

.count_occurrences(haystack, needle) ⇒ Object

Count non-overlapping occurrences of `needle` in `haystack` without going through Regexp (safer on mixed-encoding strings and avoids an extra escape step).



# File 'lib/clacky/utils/string_matcher.rb', line 52

def self.count_occurrences(haystack, needle)
  return 0 if needle.empty?
  count = 0
  offset = 0
  while (idx = haystack.index(needle, offset))
    count += 1
    offset = idx + needle.length
  end
  count
end
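
A minimal usage sketch, assuming the method behaves as listed above. The body is copied into a hypothetical `Demo` module (not part of Clacky) so the snippet runs without loading the library.

```ruby
# Standalone copy of the listing above, placed in a hypothetical Demo module
# so the example runs without Clacky being installed.
module Demo
  def self.count_occurrences(haystack, needle)
    return 0 if needle.empty?

    count = 0
    offset = 0
    while (idx = haystack.index(needle, offset))
      count += 1
      offset = idx + needle.length # resume after the match, so hits never overlap
    end
    count
  end
end

overlapping = Demo.count_occurrences("aaaa", "aa")         # hits at index 0 and 2 only
literal     = Demo.count_occurrences("a.b a.b axb", "a.b") # "." is matched literally
empty       = Demo.count_occurrences("abc", "")            # empty needle is guarded
```

Because the search resumes at `idx + needle.length`, `"aaaa"` contains two hits of `"aa"`, not three, and `"."` is never treated as a wildcard.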

.find_match(content, old_string) ⇒ Hash?

Find a matching string in content using a layered strategy.

Strategy (applied in order):

1. Exact match (original old_string)
2. Trimmed match (leading/trailing whitespace stripped)
3. Unescaped match (over-escaped sequences normalised)
4. Combined trim + unescape
5. Smart line-by-line match (tolerates indent differences)

Parameters:

  • content (String)

    File content to search in

  • old_string (String)

    String to locate

Returns:

  • (Hash, nil)

    { matched_string: String, occurrences: Integer } or nil when nothing matches



# File 'lib/clacky/utils/string_matcher.rb', line 22

def self.find_match(content, old_string)
  # Defensive: if either side contains invalid UTF-8 bytes (binary files,
  # mixed-encoding content, etc.), Regexp#scan / String#include? with a
  # UTF-8-tagged candidate can raise `ArgumentError: invalid byte sequence
  # in UTF-8`. Scrub once at the entry point so every matching layer —
  # including callers like the edit preview — is safe.
  content    = Clacky::Utils::Encoding.to_utf8(content)    unless content.nil?
  old_string = Clacky::Utils::Encoding.to_utf8(old_string) unless old_string.nil?

  candidates = generate_candidates(old_string)

  # Simple string matching for each candidate
  candidates.each do |candidate|
    next if candidate.empty?

    if content.include?(candidate)
      return {
        matched_string: candidate,
        occurrences: count_occurrences(content, candidate)
      }
    end
  end

  # Fall back to smart line-by-line matching (tabs vs spaces, etc.)
  try_smart_match(content, old_string)
end
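
The layered idea can be sketched in isolation. The helper below is a hypothetical, simplified stand-in that covers only the first two layers (exact, then trimmed). It is not the Clacky implementation, and the name `layered_match` is an assumption made for illustration.

```ruby
# Hypothetical stand-in for the first two layers of find_match:
# try the original string, then the stripped string, and stop at the first hit.
def layered_match(content, old_string)
  [old_string, old_string.strip].uniq.each do |candidate|
    next if candidate.empty?
    next unless content.include?(candidate)

    # String#scan with a String pattern counts literal, non-overlapping hits.
    return { matched_string: candidate, occurrences: content.scan(candidate).length }
  end
  nil
end

content = "def a\n  1\nend\n"
exact   = layered_match(content, "def a\n") # found by the exact layer
trimmed = layered_match(content, "   1   ") # found only after stripping
missing = layered_match(content, "def b")   # no layer matches
```

Note that `matched_string` is the candidate that actually matched, which may differ from the string originally passed in.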

.generate_candidates(old_string) ⇒ Array<String>

Generate candidate strings from old_string, in order: the original, a trimmed copy, an unescaped copy, and a trimmed and unescaped copy. Duplicates are removed.

Parameters:

  • old_string (String)

Returns:

  • (Array<String>)

    Unique list of candidates



# File 'lib/clacky/utils/string_matcher.rb', line 67

def self.generate_candidates(old_string)
  trimmed           = old_string.strip
  unescaped         = unescape_over_escaped(old_string)
  unescaped_trimmed = unescape_over_escaped(trimmed)

  [
    old_string,        # Original
    trimmed,           # Trim leading/trailing whitespace
    unescaped,         # Unescape over-escaped sequences
    unescaped_trimmed  # Combined: trim + unescape
  ].uniq
end
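
A small sketch of how the candidate list shrinks or grows with the input. The `unescape` lambda here is a hypothetical, reduced helper that handles only `\n` and `\t`. The candidate order follows the listing above.

```ruby
# Hypothetical, reduced unescape helper: handles only \n and \t.
unescape = ->(s) { s.gsub('\\n', "\n").gsub('\\t', "\t") }

# Same candidate order as the listing above: original, trimmed,
# unescaped, trimmed + unescaped. Duplicates are removed by uniq.
candidates_for = lambda do |old_string|
  trimmed = old_string.strip
  [old_string, trimmed, unescape.call(old_string), unescape.call(trimmed)].uniq
end

plain  = candidates_for.call("foo")       # all four variants are identical
padded = candidates_for.call("  a\\nb  ") # padding plus a literal backslash-n
```

For `"foo"` the four variants collapse to one candidate. For the padded, over-escaped input all four differ, so later layers have something new to try.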

.lines_match_normalized?(lines1, lines2) ⇒ Boolean

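Compare two arrays of lines pairwise after collapsing each line's leading whitespace to a single space and removing the trailing newline. Arrays of different lengths never match. A pair also matches when the normalised line from lines1 equals the unescaped form of the normalised line from lines2.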

Parameters:

  • lines1 (Array<String>)
  • lines2 (Array<String>)

Returns:

  • (Boolean)


# File 'lib/clacky/utils/string_matcher.rb', line 146

def self.lines_match_normalized?(lines1, lines2)
  return false unless lines1.length == lines2.length

  lines1.zip(lines2).all? do |line1, line2|
    norm1 = line1.sub(/^\s+/, " ").chomp
    norm2 = line2.sub(/^\s+/, " ").chomp

    norm1 == norm2 || norm1 == unescape_over_escaped(norm2)
  end
end
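
The normalisation step can be tried on its own. The sketch below isolates the expression from the listing above as a lambda; the name `normalize` is an assumption made for illustration.

```ruby
# The normalisation step from the listing above, isolated as a lambda.
normalize = ->(line) { line.sub(/^\s+/, " ").chomp }

tabbed = normalize.call("\tputs 1\n")   # => " puts 1"
spaced = normalize.call("    puts 1\n") # => " puts 1"
flush  = normalize.call("puts 1\n")     # => "puts 1" (no leading space is added)
```

Tabs and spaces therefore compare equal, but an unindented line does not compare equal to an indented one, because only existing leading whitespace is collapsed.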

.try_smart_match(content, old_string) ⇒ Hash?

Try smart line-by-line matching that tolerates leading whitespace differences.

Parameters:

  • content (String)
  • old_string (String)

Returns:

  • (Hash, nil)


# File 'lib/clacky/utils/string_matcher.rb', line 109

def self.try_smart_match(content, old_string)
  candidates = generate_candidates(old_string)

  candidates.each do |candidate|
    next if candidate.empty?

    candidate_lines = candidate.lines
    next if candidate_lines.empty?

    content_lines = content.lines
    matches = []

    (0..content_lines.length - candidate_lines.length).each do |start_idx|
      slice = content_lines[start_idx, candidate_lines.length]
      next unless slice

      if lines_match_normalized?(slice, candidate_lines)
        matches << { start: start_idx, matched_string: slice.join }
      end
    end

    unless matches.empty?
      return {
        matched_string: matches.first[:matched_string],
        occurrences: matches.length
      }
    end
  end

  nil
end
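
The sliding-window search can be sketched as follows. This is a simplified illustration, not the library code: it skips candidate generation and the unescape comparison, and the names `normalize` and `window_matches` are hypothetical.

```ruby
# Same normalisation as lines_match_normalized? (without the unescape fallback).
normalize = ->(line) { line.sub(/^\s+/, " ").chomp }

# Slide a window of needle.lines.length lines over the content and keep the
# windows whose normalised lines equal the normalised needle lines.
window_matches = lambda do |content, needle|
  needle_lines = needle.lines
  return [] if needle_lines.empty? # each_cons(0) would raise ArgumentError

  wanted  = needle_lines.map(&normalize)
  windows = content.lines.each_cons(needle_lines.length).select do |slice|
    slice.map(&normalize) == wanted
  end
  windows.map(&:join) # return the file's own text, not the needle
end

content = "if x\n\t\tfoo\n\t\tbar\nend\n"
needle  = "  foo\n  bar\n" # spaces where the file uses tabs
hits    = window_matches.call(content, needle)
```

As in the listing, the returned text is joined from the content's own lines, so a later replacement operates on what is really in the file (tabs here) and not on the caller's approximation.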

.unescape_over_escaped(str) ⇒ String

Convert over-escaped sequences back to their real characters. This handles the common case where LLMs double-escape backslashes. The sequences handled are \uXXXX, \n, \t, \r, \f, \b, \v, \" and \\, applied in that order.

Parameters:

  • str (String)

Returns:

  • (String)


# File 'lib/clacky/utils/string_matcher.rb', line 85

def self.unescape_over_escaped(str)
  result = str.dup

  # Unicode escapes: \uXXXX → actual Unicode character
  result = result.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }

  # Common escape sequences
  result = result.gsub('\\n',  "\n")
  result = result.gsub('\\t',  "\t")
  result = result.gsub('\\r',  "\r")
  result = result.gsub('\\f',  "\f")
  result = result.gsub('\\b',  "\b")
  result = result.gsub('\\v',  "\v")
  result = result.gsub('\\"',  '"')
  result = result.gsub('\\\\', "\\")

  result
end
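
A short sketch of the effect on typical inputs. The lambda is a reduced copy of the listing above (Unicode, `\n`, `\"` and `\\` only, in the same order); the name `unescape` is an assumption made for illustration.

```ruby
# Reduced copy of the listing above: \uXXXX, \n, \" and \\ only, same order.
unescape = lambda do |str|
  str.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
     .gsub('\\n', "\n")
     .gsub('\\"', '"')
     .gsub('\\\\', "\\")
end

newline = unescape.call('a\\nb')        # backslash + n becomes a real newline
unicode = unescape.call('caf\\u00e9')   # \u00e9 becomes the character U+00E9
quoted  = unescape.call('say \\"hi\\"') # \" becomes a plain double quote
doubled = unescape.call('\\\\n')        # backslash, backslash, n
```

One consequence of the ordering is worth noting: because the `\\` rule runs last, the input backslash, backslash, `n` comes out as a backslash followed by a newline, not as a backslash followed by `n`. Whether that matters depends on the content being matched.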