Module: Canon::Comparison::WhitespaceSensitivity

Defined in:
lib/canon/comparison/whitespace_sensitivity.rb

Overview

Whitespace sensitivity utilities for element-level control

This module provides three-way classification of whitespace behaviour at the element level:

  • :preserve → every whitespace character is significant: " " ≠ "\n". Configured via preserve_whitespace_elements (HTML default: pre, code, textarea, script, style; XML default: none).

  • :collapse → presence ≠ absence, but all whitespace forms are equivalent: " " == "\n ". Configured via collapse_whitespace_elements (HTML default: p, li, dt, dd, td, th, h1-h6, caption, figcaption, label, legend, summary, blockquote, address, button; XML default: none).

  • :strip → all whitespace is structural formatting noise and is dropped. This is the default for XML, and for HTML elements not in the above lists.

Classification is ancestor-based: the closest matching ancestor determines the class. The strip blacklist (strip_whitespace_elements) overrides any sensitive ancestor.
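The ancestor rule can be illustrated with a small standalone sketch. It does not call Canon: WalkNode and classify_by_ancestors are hypothetical names, and the per-level precedence (strip blacklist, then preserve, then collapse, closest match first) is one reading of the description above, not the body of the library's private walk. The default argument stands in for whatever the format default would be.

```ruby
require "set"

# Hypothetical minimal node: a name and a link to its parent.
WalkNode = Struct.new(:name, :parent)

# Illustrative only: walk from the element up to the root and return the
# class of the closest element found in one of the three sets.
def classify_by_ancestors(element, preserve:, collapse:, strip:, default: :strip)
  node = element
  while node
    name = node.name.to_s
    return :strip    if strip.include?(name)
    return :preserve if preserve.include?(name)
    return :collapse if collapse.include?(name)

    node = node.parent
  end
  default
end

sets = {
  preserve: Set["pre", "code"],
  collapse: Set["p", "li"],
  strip: Set["table"],
}

body  = WalkNode.new("body", nil)
pre   = WalkNode.new("pre", body)
span  = WalkNode.new("span", pre)   # no entry of its own, inherits from <pre>
para  = WalkNode.new("p", body)
table = WalkNode.new("table", pre)  # blacklisted, so the <pre> ancestor is ignored

puts classify_by_ancestors(span, **sets)   # preserve
puts classify_by_ancestors(para, **sets)   # collapse
puts classify_by_ancestors(table, **sets)  # strip
puts classify_by_ancestors(body, **sets)   # strip (the placeholder default)
```

Checking the strip set first at each level is what lets a blacklisted element opt out of a sensitive ancestor.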

Priority Order

  1. respect_xml_space: false → User config only (ignore xml:space)

  2. Ancestor walk (strip blacklist wins; then preserve; then collapse)

  3. xml:space="preserve" → preserve

  4. xml:space="default" → use configured behaviour

  5. Format defaults (HTML: collapse for most elements; XML: strip)

Usage

WhitespaceSensitivity.classify_element(element, match_opts)
=> :preserve, :collapse, or :strip

WhitespaceSensitivity.element_sensitive?(node, opts)
=> true if whitespace should be preserved (preserve or collapse)

Constant Summary

HTML_COLLAPSE_ELEMENTS =

HTML mixed-content “leaf block” elements where whitespace presence matters but all forms are equivalent (CSS block whitespace collapsing).

%w[
  p li dt dd td th caption figcaption label legend summary
  h1 h2 h3 h4 h5 h6
  blockquote address button
].freeze
HTML_PRESERVE_ELEMENTS =

HTML elements where every whitespace character is significant.

%w[pre code textarea script style].freeze
INLINE_ELEMENTS =

HTML inline elements — whitespace between these is semantically significant (renders as a visible space). Whitespace-only text nodes that sit between two inline siblings must not be stripped.

%w[
  a abbr acronym b bdo big br button cite code dfn em i img input kbd
  label map object output q s samp select small span strong sub sup
  time tt u var wbr
].freeze

Class Method Details

.classify_element(element, match_opts) ⇒ Symbol

Classify the whitespace behaviour for an element using an ancestor walk.

Parameters:

  • element (Object)

    The element node to classify

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Symbol)

    :preserve, :collapse, or :strip



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 68

def classify_element(element, match_opts)
  return :strip unless element
  return :strip unless element.respond_to?(:name)

  preserve_set = resolved_preserve_elements_set(match_opts)
  collapse_set = resolved_collapse_elements_set(match_opts)
  strip_set    = resolved_strip_elements_set(match_opts)

  # Ancestor walk: start at the element itself, walk up.
  # Strip blacklist wins over any sensitive ancestor.
  walk_ancestor_classification(element, preserve_set, collapse_set,
                               strip_set, match_opts)
end

.classify_text_node(node, opts) ⇒ Symbol

Return the whitespace class for a text node used during comparison.

  • :preserve → preserve all whitespace character-by-character

  • :collapse → preserve presence (normalize to single space)

  • :strip → drop whitespace-only text nodes

Parameters:

  • node (Object)

    Text node to classify

  • opts (Hash)

    Comparison options containing match_opts

Returns:

  • (Symbol)

    :preserve, :collapse, or :strip



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 132

def classify_text_node(node, opts)
  match_opts = opts[:match_opts]
  return :strip unless match_opts
  return :strip unless text_node_parent?(node)

  parent = node.parent

  unless respect_xml_space?(match_opts)
    return user_config_sensitive?(parent,
                                  match_opts) ? :preserve : :strip
  end

  return :preserve if xml_space_preserve?(parent)
  return :strip if xml_space_default?(parent)

  classify_element(parent, match_opts)
end
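How a comparison might consume the three classes returned by classify_text_node can be sketched as follows. normalize_for_class is a hypothetical helper, not part of Canon, and leaving non-blank text untouched under :strip is an assumption, since the description only says that whitespace-only nodes are dropped.

```ruby
# Hypothetical helper: normalise a text node's content for comparison
# according to its whitespace class. Returns nil when the node is dropped.
def normalize_for_class(text, whitespace_class)
  case whitespace_class
  when :preserve
    text                             # compared character by character
  when :collapse
    text.gsub(/[ \t\r\n\f]+/, " ")   # every whitespace run becomes one space
  when :strip
    text.strip.empty? ? nil : text   # whitespace-only nodes are dropped
  else
    raise ArgumentError, "unknown whitespace class: #{whitespace_class.inspect}"
  end
end

# :preserve keeps " " and "\n" distinct.
puts normalize_for_class(" ", :preserve) == normalize_for_class("\n", :preserve)   # false

# :collapse treats " " and "\n " as the same, but still differs from "".
puts normalize_for_class(" ", :collapse) == normalize_for_class("\n ", :collapse)  # true
puts normalize_for_class("", :collapse) == normalize_for_class(" ", :collapse)     # false

# :strip drops a whitespace-only node entirely.
puts normalize_for_class("\n  ", :strip).inspect                                   # nil
```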

.contains_nbsp?(text) ⇒ Boolean

Check if text content contains a non-breaking space (U+00A0). NBSP is NOT collapsible whitespace in HTML — it always renders as a visible space and must never be stripped.

Parameters:

  • text (String)

    Text content to check

Returns:

  • (Boolean)

    true if text contains U+00A0



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 284

def contains_nbsp?(text)
  text.to_s.include?("\u00A0")
end
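A short standalone check shows why NBSP needs its own test: Ruby's String#strip does not treat U+00A0 as whitespace, so a blank-looking string that contains it is not empty after stripping. The method body below is copied from the listing above; the sample strings are illustrative.

```ruby
NBSP = "\u00A0"

# Same body as the listing above, reproduced so the sketch runs on its own.
def contains_nbsp?(text)
  text.to_s.include?(NBSP)
end

ascii_only = " \n\t"
with_nbsp  = " #{NBSP} "

puts ascii_only.strip.empty?    # true
puts with_nbsp.strip.empty?     # false: strip leaves U+00A0 in place
puts contains_nbsp?(ascii_only) # false
puts contains_nbsp?(with_nbsp)  # true
puts contains_nbsp?(nil)        # false: nil.to_s is ""
```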

.default_sensitive_element?(element_name, match_opts) ⇒ Boolean

Check whether an element is in the default preserve (exact-whitespace) list for its format. Default collapse elements are not consulted.

Parameters:

  • element_name (String, Symbol)

    The element name to check

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Boolean)

    true if element is in default sensitive list



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 220

def default_sensitive_element?(element_name, match_opts)
  format_default_preserve_elements(match_opts)
    .include?(element_name.to_sym)
end
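The following standalone sketch reproduces this method together with format_default_preserve_elements and the HTML_PRESERVE_ELEMENTS constant from the listings in this document, so that it runs without loading Canon. The option hashes are illustrative.

```ruby
HTML_PRESERVE_ELEMENTS = %w[pre code textarea script style].freeze

# Copied from the listing for format_default_preserve_elements.
def format_default_preserve_elements(match_opts)
  format = match_opts[:format] || :xml
  case format
  when :html, :html4, :html5
    HTML_PRESERVE_ELEMENTS.map(&:to_sym).freeze
  else
    [].freeze
  end
end

# Copied from the listing above.
def default_sensitive_element?(element_name, match_opts)
  format_default_preserve_elements(match_opts)
    .include?(element_name.to_sym)
end

puts default_sensitive_element?("pre", { format: :html5 })  # true
puts default_sensitive_element?("pre", { format: :xml })    # false
puts default_sensitive_element?(:p, { format: :html })      # false: p is a collapse element
puts default_sensitive_element?("code", {})                 # false: format falls back to :xml
```

Note that only the preserve defaults are checked here, so an HTML collapse element such as p is reported as not sensitive by this particular method.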

.element_sensitive?(node, opts) ⇒ Boolean

Check if an element is whitespace-sensitive based on configuration. Returns true for :preserve or :collapse classification.

Parameters:

  • node (Object)

    The text node whose parent element is checked

  • opts (Hash)

    Comparison options containing match_opts

Returns:

  • (Boolean)

    true if whitespace should be preserved for this element



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 88

def element_sensitive?(node, opts)
  match_opts = opts[:match_opts]
  return false unless match_opts
  return false unless text_node_parent?(node)

  parent = node.parent

  # 1. Check if we should ignore xml:space (user override)
  unless respect_xml_space?(match_opts)
    return user_config_sensitive?(parent, match_opts)
  end

  # 2. Check xml:space="preserve" (document declaration)
  return true if xml_space_preserve?(parent)

  # 3. Check xml:space="default" (use configured behavior)
  return false if xml_space_default?(parent)

  # 4. Three-way classification (ancestor-based)
  classification = classify_element(parent, match_opts)
  %i[preserve collapse].include?(classification)
end

.format_default_collapse_elements(match_opts) ⇒ Array<Symbol>

Get format-specific default collapse elements.

Parameters:

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Array<Symbol>)

    Default collapse element names



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 205

def format_default_collapse_elements(match_opts)
  format = match_opts[:format] || :xml
  case format
  when :html, :html4, :html5
    HTML_COLLAPSE_ELEMENTS.map(&:to_sym).freeze
  else
    [].freeze
  end
end

.format_default_preserve_elements(match_opts) ⇒ Array<Symbol>

Get format-specific default preserve (exact-whitespace) elements. This is the SINGLE SOURCE OF TRUTH for default preserve-whitespace elements.

Parameters:

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Array<Symbol>)

    Default preserve element names



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 191

def format_default_preserve_elements(match_opts)
  format = match_opts[:format] || :xml
  case format
  when :html, :html4, :html5
    HTML_PRESERVE_ELEMENTS.map(&:to_sym).freeze
  else
    [].freeze
  end
end

.inline_whitespace_significant?(text_node) ⇒ Boolean

Check if a whitespace-only text node sits between two inline element siblings, making the whitespace semantically significant.

In HTML rendering, a space between <span>A</span> <span>B</span> produces visible output. Stripping such nodes produces false equivalence.

Works with any parent type (element, DocumentFragment, RootNode) since the check is about sibling context, not parent type.

Parameters:

  • text_node (Object)

    Text node (Nokogiri or Canon::Xml::Node)

Returns:

  • (Boolean)

    true if whitespace is between inline siblings



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 237

def inline_whitespace_significant?(text_node)
  return false unless text_node.respond_to?(:parent)

  parent = text_node.parent
  return false unless parent
  return false unless parent.respond_to?(:children)

  siblings = parent.children
  idx = siblings.index(text_node)
  return false unless idx

  # Look at the IMMEDIATE non-whitespace-text neighbour on each
  # side. Whitespace at a block boundary is collapsed per CSS,
  # so both immediate neighbours must be inline for the
  # whitespace to be significant. Walking all siblings (the
  # earlier behaviour) misclassified whitespace at a block
  # boundary as significant whenever any inline element existed
  # elsewhere among the siblings.
  prev_neighbour = nearest_non_whitespace_sibling(siblings, idx, -1)
  next_neighbour = nearest_non_whitespace_sibling(siblings, idx,  1)

  inline_element?(prev_neighbour) && inline_element?(next_neighbour)
end
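The sibling check can be exercised without a parser. In the sketch below, SketchNode is a hypothetical stand-in for a Nokogiri or Canon::Xml::Node object, SKETCH_INLINE is an abbreviated inline list, and sketch_inline? is an assumed approximation of the private inline_element? helper (in particular, it treats a non-blank text neighbour as not inline, which the source does not confirm).

```ruby
# Hypothetical stand-in for a DOM node.
class SketchNode
  attr_reader :name, :content, :children
  attr_accessor :parent

  def initialize(name, content: nil, children: [])
    @name = name
    @content = content
    @children = children
    children.each { |child| child.parent = self }
  end

  def text?
    name == "#text"
  end
end

SKETCH_INLINE = %w[a b em i span strong].freeze

# Assumed behaviour: only element nodes with an inline name count.
def sketch_inline?(node)
  !node.nil? && !node.text? && SKETCH_INLINE.include?(node.name)
end

# Nearest sibling in one direction, skipping whitespace-only text nodes.
def sketch_neighbour(siblings, idx, direction)
  i = idx + direction
  while i >= 0 && i < siblings.length
    s = siblings[i]
    return s unless s.text? && s.content.to_s.strip.empty?

    i += direction
  end
  nil
end

def sketch_significant?(text_node)
  siblings = text_node.parent.children
  idx = siblings.index { |s| s.equal?(text_node) }
  sketch_inline?(sketch_neighbour(siblings, idx, -1)) &&
    sketch_inline?(sketch_neighbour(siblings, idx, 1))
end

# <p><span/> <span/></p>: whitespace between two inline elements.
gap = SketchNode.new("#text", content: " ")
SketchNode.new("p", children: [SketchNode.new("span"), gap, SketchNode.new("span")])
puts sketch_significant?(gap)   # true

# <div><span/> <div/></div>: whitespace at a block boundary.
edge = SketchNode.new("#text", content: "\n  ")
SketchNode.new("div", children: [SketchNode.new("span"), edge, SketchNode.new("div")])
puts sketch_significant?(edge)  # false
```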

.nearest_non_whitespace_sibling(siblings, idx, direction) ⇒ Object

Walk outward from idx in direction (+1 forward, -1 back), skipping whitespace-only text nodes, and return the first non-whitespace sibling found. Returns nil if none.



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 264

def nearest_non_whitespace_sibling(siblings, idx, direction)
  i = idx + direction
  while i >= 0 && i < siblings.length
    s = siblings[i]
    unless s.respond_to?(:text?) && s.text? &&
        s.respond_to?(:content) && s.content.to_s.strip.empty?
      return s
    end

    i += direction
  end
  nil
end

.preserve_whitespace_node?(node, opts) ⇒ Boolean

Check whether a whitespace-only text node should be kept rather than filtered out.

Parameters:

  • node (Object)

    The text node to check

  • opts (Hash)

    Comparison options

Returns:

  • (Boolean)

    true if node should be preserved (not filtered)



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 116

def preserve_whitespace_node?(node, opts)
  return false unless node.respond_to?(:parent)
  return false unless node.parent

  element_sensitive?(node, opts)
end

.resolved_collapse_elements(match_opts) ⇒ Array<String>

Get resolved list of collapse whitespace element names (strings).

Parameters:

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Array<String>)

    Collapse element names



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 182

def resolved_collapse_elements(match_opts)
  resolved_collapse_elements_set(match_opts).to_a
end

.resolved_preserve_elements(match_opts) ⇒ Array<String>

Get resolved list of preserve whitespace element names (strings).

Parameters:

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Array<String>)

    Preserve element names



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 174

def resolved_preserve_elements(match_opts)
  resolved_preserve_elements_set(match_opts).to_a
end

.whitespace_preserved?(element, match_opts) ⇒ Boolean

Check if structural whitespace is preserved (not stripped) for an element.

Uses the same priority chain as element_sensitive? / classify_text_node:

1. xml:space="preserve" → always preserved
2. xml:space="default"  → use configured behaviour
3. ancestor-walk classification (strip = dropped)

Parameters:

  • element (Object)

    Element node to check

  • match_opts (Hash)

    Resolved match options

Returns:

  • (Boolean)

    true if whitespace is preserved (not stripped)



# File 'lib/canon/comparison/whitespace_sensitivity.rb', line 160

def whitespace_preserved?(element, match_opts)
  if respect_xml_space?(match_opts)
    return true  if xml_space_preserve?(element)
    return false if xml_space_default?(element)
  end

  classification = classify_element(element, match_opts)
  %i[preserve collapse].include?(classification)
end
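The control flow of this method can be traced with a standalone sketch. SketchElement and sketch_preserved? are hypothetical: xml_space stands for whatever the private xml_space_preserve? and xml_space_default? helpers would report for the element, and walk_class stands for the result of classify_element. The sketch mirrors the listing, in which xml:space="default" returns false directly.

```ruby
# Hypothetical element: a name, the xml:space value the private helpers
# would see (or nil), and the class the ancestor walk would assign.
SketchElement = Struct.new(:name, :xml_space, :walk_class)

def sketch_preserved?(element, respect_xml_space:)
  if respect_xml_space
    return true  if element.xml_space == "preserve"
    return false if element.xml_space == "default"
  end

  # Fall through to the three-way classification.
  %i[preserve collapse].include?(element.walk_class)
end

note = SketchElement.new("note", "preserve", :strip)
puts sketch_preserved?(note, respect_xml_space: true)   # true: xml:space wins
puts sketch_preserved?(note, respect_xml_space: false)  # false: attribute ignored, walk says :strip

code = SketchElement.new("code", "default", :preserve)
puts sketch_preserved?(code, respect_xml_space: true)   # false: short-circuits before the walk
puts sketch_preserved?(code, respect_xml_space: false)  # true: walk says :preserve

para = SketchElement.new("p", nil, :collapse)
puts sketch_preserved?(para, respect_xml_space: true)   # true: :collapse counts as preserved
```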