Class: Canon::Comparison::HtmlParser

Inherits:
Object
Defined in:
lib/canon/comparison/html_parser.rb

Overview

HTML parsing service with version detection and fragment support

Provides HTML parsing capabilities with automatic HTML4/HTML5 version detection. Handles both full documents and fragments.

Examples:

Parse HTML string

HtmlParser.parse("<div>content</div>", :html5)

Auto-detect and parse

HtmlParser.detect_and_parse("<!DOCTYPE html><html>...</html>")

Class Method Details

.already_parsed?(content) ⇒ Boolean

Check if content is already a parsed HTML document/fragment

Parameters:

  • content (Object)

    Content to check

Returns:

  • (Boolean)

    true if already parsed



# File 'lib/canon/comparison/html_parser.rb', line 54

def already_parsed?(content)
  content.is_a?(Nokogiri::HTML::Document) ||
    content.is_a?(Nokogiri::HTML5::Document) ||
    content.is_a?(Nokogiri::HTML::DocumentFragment) ||
    content.is_a?(Nokogiri::HTML5::DocumentFragment)
end

.detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment

Detect HTML version from content and parse with appropriate parser

Parameters:

  • content (String)

    HTML content to parse

Returns:

  • (Nokogiri::HTML::DocumentFragment)

    Parsed fragment



# File 'lib/canon/comparison/html_parser.rb', line 65

def detect_and_parse(content)
  version = detect_version(content)
  if version == :html5
    Nokogiri::HTML5.fragment(content)
  else
    Nokogiri::HTML4.fragment(content)
  end
end

.detect_version(content) ⇒ Symbol

Detect HTML version from content string

Parameters:

  • content (String)

    HTML content

Returns:

  • (Symbol)

    :html5 or :html4



# File 'lib/canon/comparison/html_parser.rb', line 78

def detect_version(content)
  # Check for the HTML5 DOCTYPE (case-insensitive, tolerates extra whitespace)
  content.match?(/<!DOCTYPE\s+html\s*>/i) ? :html5 : :html4
end
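
As a rough illustration of how DOCTYPE-based detection sorts typical inputs, the sketch below uses a standalone stand-in named `sketch_detect_version` (a hypothetical name, not part of the library). It assumes the case-insensitive matching that the source comment describes; the sample strings are invented for the example.

```ruby
# Illustrative stand-in for DOCTYPE-based detection; not the library method.
# Assumption: the match ignores case and allows whitespace before ">".
def sketch_detect_version(content)
  content.match?(/<!DOCTYPE\s+html\s*>/i) ? :html5 : :html4
end

html5_doc = "<!doctype html><html><body></body></html>"
html4_doc = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><html></html>'
fragment  = "<div>content</div>"

puts sketch_detect_version(html5_doc)  # html5
puts sketch_detect_version(html4_doc)  # html4
puts sketch_detect_version(fragment)   # html4 (no DOCTYPE, so the HTML4 default applies)
```

Note that a fragment without a DOCTYPE always falls through to :html4, so a caller who wants HTML5 parsing for such a fragment would pass :html5 to .parse explicitly rather than rely on detection.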

.normalize_html_for_parsing(content) ⇒ String

Normalize HTML to ensure consistent parsing by HTML4.fragment

The key issue is that HTML4.fragment treats whitespace after </head> differently than no whitespace, causing inconsistent parsing:

  • “</head>\n<body>” parses to [body, …] (body is treated as content)

  • “</head><body>” parses to [meta, div, …] (wrapper tags stripped)

This method removes any whitespace between </head> and <body> so that formatted and minified HTML parse the same way.

Parameters:

  • content (String)

    HTML content

Returns:

  • (String)

    Normalized HTML content



# File 'lib/canon/comparison/html_parser.rb', line 94

def normalize_html_for_parsing(content)
  # Remove whitespace between </head> and <body> to ensure consistent parsing
  # This makes formatted and minified HTML parse the same way
  content.gsub(%r{</head>\s*<body>}i, "</head><body>")
end
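
To make the effect of the substitution concrete, here is a small sketch that applies the same pattern to a formatted and a minified input. `sketch_normalize` is a hypothetical stand-in that mirrors the gsub shown in the source above; the sample strings are invented for the example.

```ruby
# Illustrative stand-in mirroring the gsub in the source; not the library method.
def sketch_normalize(content)
  content.gsub(%r{</head>\s*<body>}i, "</head><body>")
end

formatted = "<head><title>t</title></head>\n  <body><p>hi</p></body>"
minified  = "<head><title>t</title></head><body><p>hi</p></body>"

# After normalization, both inputs hand the parser the same string.
puts sketch_normalize(formatted) == sketch_normalize(minified)  # true

# The pattern requires a bare <body> tag, so a body with attributes is left as-is.
with_attrs = "</head>\n<body class=\"x\">"
puts sketch_normalize(with_attrs) == with_attrs  # true
```

Two properties follow directly from the pattern: uppercase tags such as `</HEAD>` and `<BODY>` are matched but rewritten in lowercase, and a `<body>` tag carrying attributes is not matched at all, so whitespace before it survives.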

.parse(content, format) ⇒ Nokogiri::HTML::Document, ...

Parse HTML string into Nokogiri document with the correct parser

Parameters:

  • content (String, Object)

    Content to parse (returns as-is if not a string)

  • format (Symbol)

    HTML format: :html5, :html4, or :html (auto-detects the version). Any other value returns the content unchanged.

Returns:

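  • (Nokogiri::HTML::Document, Nokogiri::HTML5::Document, Nokogiri::HTML::DocumentFragment, Object)

    Parsed fragment for String input; the original content if it is not a String, the format is unrecognized, or parsing raises an error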


# File 'lib/canon/comparison/html_parser.rb', line 24

def parse(content, format)
  return content unless content.is_a?(String)
  return content if already_parsed?(content)

  # Normalize HTML to ensure consistent parsing by HTML4.fragment
  # The key issue is that HTML4.fragment treats whitespace after </head>
  # differently than no whitespace, causing inconsistent parsing
  content = normalize_html_for_parsing(content)

  begin
    case format
    when :html5
      Nokogiri::HTML5.fragment(content)
    when :html4
      Nokogiri::HTML4.fragment(content)
    when :html
      detect_and_parse(content)
    else
      content
    end
  rescue StandardError
    # Fallback to raw string if parsing fails (maintains backward compatibility)
    content
  end
end
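
The dispatch and fallback behaviour of .parse can be sketched without Nokogiri by swapping in stand-in parsers. In the sketch below, `sketch_parse` and the `parsers` hash of callables are hypothetical names introduced for illustration; the callables stand in for Nokogiri::HTML5.fragment and Nokogiri::HTML4.fragment, and the normalization step is left out for brevity.

```ruby
# Illustrative sketch of the dispatch-and-fallback flow; not the library method.
# `parsers` is a hypothetical hash of callables standing in for the real
# Nokogiri::HTML5.fragment and Nokogiri::HTML4.fragment calls.
def sketch_parse(content, format, parsers)
  return content unless content.is_a?(String)

  begin
    case format
    when :html5 then parsers.fetch(:html5).call(content)
    when :html4 then parsers.fetch(:html4).call(content)
    when :html
      # Auto-detect, using the same DOCTYPE check as the source
      key = content.include?("<!DOCTYPE html>") ? :html5 : :html4
      parsers.fetch(key).call(content)
    else
      content # unknown format: hand the string back untouched
    end
  rescue StandardError
    content # parser raised: fall back to the raw string
  end
end

tagging = {
  html5: ->(s) { [:html5, s] },
  html4: ->(s) { [:html4, s] }
}
failing = {
  html5: ->(_s) { raise ArgumentError, "boom" },
  html4: ->(_s) { raise ArgumentError, "boom" }
}

p sketch_parse("<p>x</p>", :html5, tagging)  # [:html5, "<p>x</p>"]
p sketch_parse("<p>x</p>", :xml, tagging)    # "<p>x</p>"
p sketch_parse("<p>x</p>", :html4, failing)  # "<p>x</p>"
p sketch_parse(42, :html5, tagging)          # 42
```

Because an unrecognized format and a failed parse both return the original String, a caller that needs to tell success from fallback has to inspect the return type, for example with `result.is_a?(String)`.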