Class: Canon::Comparison::HtmlParser

Inherits:
Object
Defined in:
lib/canon/comparison/html_parser.rb

Overview

HTML parsing service with version detection and fragment support

Provides HTML parsing capabilities with automatic HTML4/HTML5 version detection. Handles both full documents and fragments.

Examples:

Parse HTML string

HtmlParser.parse("<div>content</div>", :html5)

Auto-detect and parse

HtmlParser.detect_and_parse("<!DOCTYPE html><html>...</html>")

Class Method Details

.already_parsed?(content) ⇒ Boolean

Check if content is already a parsed HTML document/fragment

Parameters:

  • content (Object)

    Content to check

Returns:

  • (Boolean)

    true if already parsed



# File 'lib/canon/comparison/html_parser.rb', line 49

def already_parsed?(content)
  content.is_a?(Nokogiri::HTML::Document) ||
    content.is_a?(Nokogiri::HTML5::Document) ||
    content.is_a?(Nokogiri::HTML::DocumentFragment) ||
    content.is_a?(Nokogiri::HTML5::DocumentFragment)
end

.detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment

Detect HTML version from content and parse with appropriate parser

Parameters:

  • content (String)

    HTML content to parse

Returns:

  • (Nokogiri::HTML::DocumentFragment)

    Parsed fragment



# File 'lib/canon/comparison/html_parser.rb', line 60

def detect_and_parse(content)
  version = detect_version(content)
  if version == :html5
    Nokogiri::HTML5.fragment(content)
  else
    Nokogiri::HTML4.fragment(content)
  end
end

.detect_version(content) ⇒ Symbol

Detect HTML version from content string

Parameters:

  • content (String)

    HTML content

Returns:

  • (Symbol)

    :html5 or :html4



# File 'lib/canon/comparison/html_parser.rb', line 73

def detect_version(content)
  # Check for the HTML5 DOCTYPE (case-insensitive, tolerant of extra whitespace)
  content.match?(/<!DOCTYPE\s+html\s*>/i) ? :html5 : :html4
end
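
To make the detection rule concrete, here is a standalone sketch of a case-insensitive DOCTYPE check run against a few inputs. `HTML5_DOCTYPE` and `sniff_html_version` are hypothetical names used only for illustration; the pattern is an assumption about what "case-insensitive" should cover, not the library's confirmed behavior.

```ruby
# Hypothetical pattern: the short HTML5 doctype, in any letter case.
HTML5_DOCTYPE = /<!DOCTYPE\s+html\s*>/i

# Classify markup as :html5 when the short doctype is present, else :html4.
def sniff_html_version(content)
  content.match?(HTML5_DOCTYPE) ? :html5 : :html4
end

html4_doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' \
                '"http://www.w3.org/TR/html4/strict.dtd">'

puts sniff_html_version("<!DOCTYPE html><html></html>") # html5
puts sniff_html_version("<!doctype html><html></html>") # html5
puts sniff_html_version(html4_doctype)                  # html4
puts sniff_html_version("<div>content</div>")           # html4
```

Content without any DOCTYPE, such as a bare fragment, falls through to `:html4` under this rule.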

.parse(content, format) ⇒ Nokogiri::HTML::Document, ...

Parse an HTML string into a Nokogiri fragment using the parser that matches the requested format

Parameters:

  • content (String, Object)

    Content to parse (returns as-is if not a string)

  • format (Symbol)

    HTML format (:html5, :html4, or :html to auto-detect the version)

Returns:

  • (Nokogiri::HTML::Document, Nokogiri::HTML5::Document, Nokogiri::HTML::DocumentFragment, Object)

    Parsed content, or the input unchanged when it is not a String, the format is not recognized, or parsing raises an error


# File 'lib/canon/comparison/html_parser.rb', line 24

def parse(content, format)
  return content unless content.is_a?(String)
  return content if already_parsed?(content)

  begin
    case format
    when :html5
      Nokogiri::HTML5.fragment(content)
    when :html4
      Nokogiri::HTML4.fragment(content)
    when :html
      detect_and_parse(content)
    else
      content
    end
  rescue StandardError
    # Fallback to raw string if parsing fails (maintains backward compatibility)
    content
  end
end
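
The pass-through and fallback rules in this method can be explored without Nokogiri using a sketch like the following. `STUB_PARSERS` and `parse_with_fallback` are hypothetical; the lambdas are stand-ins for `Nokogiri::HTML5.fragment` and `Nokogiri::HTML4.fragment`, and the `:html4` stub raises on purpose to simulate a parser failure. The `:html` auto-detect branch is left out for brevity.

```ruby
# Stand-in parsers (assumptions, not Canon or Nokogiri code).
STUB_PARSERS = {
  html5: ->(html) { [:html5_fragment, html] },
  html4: ->(_html) { raise ArgumentError, "simulated parser failure" },
}.freeze

# Dispatch on format; hand the input back whenever it cannot be parsed.
def parse_with_fallback(content, format, parsers = STUB_PARSERS)
  return content unless content.is_a?(String) # non-strings pass through

  parser = parsers[format]
  return content if parser.nil? # unrecognized format: return input as-is

  parser.call(content)
rescue StandardError
  content # parsing failed: fall back to the raw string
end

p parse_with_fallback("<p>x</p>", :html5) # parsed by the :html5 stub
p parse_with_fallback("<p>x</p>", :html4) # stub raises, raw string returned
p parse_with_fallback("<p>x</p>", :xml)   # unknown format, raw string returned
p parse_with_fallback(42, :html5)         # non-string, returned unchanged
```

Because a failure yields the original String rather than an exception, callers that need a parsed node should check the class of the result before using it.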