Class: Canon::Comparison::HtmlParser

Inherits:

Object


Defined in: lib/canon/comparison/html_parser.rb

Overview

HTML parsing service with version detection and fragment support

Provides HTML parsing capabilities with automatic HTML4/HTML5 version detection. Handles both full documents and fragments.

Examples:

Parse HTML string

HtmlParser.parse("<div>content</div>", :html5)

Auto-detect and parse

HtmlParser.detect_and_parse("<!DOCTYPE html><html>...</html>")

Class Method Summary

.already_parsed?(content) ⇒ Boolean

Check if content is already a parsed HTML document/fragment.

.detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment

Detect HTML version from content and parse with appropriate parser.

.detect_version(content) ⇒ Symbol

Detect HTML version from content string.

.normalize_html_for_parsing(content) ⇒ String

Normalize HTML to ensure consistent parsing by HTML4.fragment.

.parse(content, format) ⇒ Nokogiri::HTML::Document, ...

Parse HTML string into Nokogiri document with the correct parser.

Class Method Details

.already_parsed?(content) ⇒ `Boolean`

Check if content is already a parsed HTML document/fragment

Parameters:

content (Object) —

Content to check

Returns:

(Boolean) —

true if already parsed

# File 'lib/canon/comparison/html_parser.rb', line 54

def already_parsed?(content)
  content.is_a?(Nokogiri::HTML::Document) ||
    content.is_a?(Nokogiri::HTML5::Document) ||
    content.is_a?(Nokogiri::HTML::DocumentFragment) ||
    content.is_a?(Nokogiri::HTML5::DocumentFragment)
end

.detect_and_parse(content) ⇒ `Nokogiri::HTML::DocumentFragment`

Detect HTML version from content and parse with appropriate parser

Parameters:

content (String) —

HTML content to parse

Returns:

(Nokogiri::HTML::DocumentFragment) —

Parsed fragment

# File 'lib/canon/comparison/html_parser.rb', line 65

def detect_and_parse(content)
  version = detect_version(content)
  if version == :html5
    Nokogiri::HTML5.fragment(content)
  else
    Nokogiri::HTML4.fragment(content)
  end
end

.detect_version(content) ⇒ `Symbol`

Detect HTML version from content string

Parameters:

content (String) —

HTML content

Returns:

(Symbol) —

:html5 or :html4

# File 'lib/canon/comparison/html_parser.rb', line 78

def detect_version(content)
  # Check for HTML5 DOCTYPE (case-insensitive)
  content.match?(/<!DOCTYPE html>/i) ? :html5 : :html4
end
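
As a quick illustration of the detection rule, the following standalone sketch restates the DOCTYPE check outside the class. The helper name `sketch_detect_version` is hypothetical, and the case-insensitive regular expression is an assumption based on the inline comment above rather than confirmed library behavior.

```ruby
# Hypothetical standalone helper mirroring the documented rule:
# an HTML5 DOCTYPE selects :html5, anything else falls back to :html4.
def sketch_detect_version(content)
  # Assumption: match case-insensitively, as the source comment describes.
  content.match?(/<!DOCTYPE html>/i) ? :html5 : :html4
end

html5_doc = "<!DOCTYPE html><html><body><p>hi</p></body></html>"
html4_doc = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' \
            '"http://www.w3.org/TR/html4/strict.dtd"><html></html>'
fragment  = "<div>content</div>"

[html5_doc, html4_doc, fragment].each do |html|
  puts sketch_detect_version(html)
end
```

Note that a fragment without any DOCTYPE is classified as :html4 under this rule, so callers who know their input is HTML5 may prefer to pass :html5 to .parse explicitly.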

.normalize_html_for_parsing(content) ⇒ `String`

Normalize HTML to ensure consistent parsing by HTML4.fragment

The key issue is that HTML4.fragment treats whitespace after </head> differently than no whitespace, causing inconsistent parsing:

“</head>\n<body>” parses to [body, …] (body is treated as content)
“</head><body>” parses to [meta, div, …] (wrapper tags stripped)

This method normalizes the HTML to ensure consistent parsing.

Parameters:

content (String) —

HTML content

Returns:

(String) —

Normalized HTML content

# File 'lib/canon/comparison/html_parser.rb', line 94

def normalize_html_for_parsing(content)
  # Remove whitespace between </head> and <body> to ensure consistent parsing
  # This makes formatted and minified HTML parse the same way
  content.gsub(%r{</head>\s*<body>}i, "</head><body>")
end
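
The effect of the substitution can be checked without Nokogiri. The sketch below copies the documented regular expression into a hypothetical helper, `sketch_normalize`, and compares a formatted and a minified input; the sample strings are made up for illustration.

```ruby
# Hypothetical standalone copy of the documented substitution.
def sketch_normalize(content)
  # Collapse any whitespace between </head> and <body> (case-insensitive).
  content.gsub(%r{</head>\s*<body>}i, "</head><body>")
end

formatted = "<html><head><title>t</title></head>\n  <body><p>x</p></body></html>"
minified  = "<html><head><title>t</title></head><body><p>x</p></body></html>"

# Both inputs normalize to the same string.
puts sketch_normalize(formatted) == sketch_normalize(minified)

# The pattern matches a bare <body> tag only; a tag with attributes
# is left untouched by this expression.
with_attrs = "</head>\n<body class=\"x\">"
puts sketch_normalize(with_attrs) == with_attrs
```

Because the pattern requires a literal `<body>`, input whose body tag carries attributes passes through unchanged, which may be worth keeping in mind when comparing such documents.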

.parse(content, format) ⇒ `Nokogiri::HTML::Document`, ...

Parse HTML string into Nokogiri document with the correct parser

Parameters:

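content (String, Object) —

Content to parse (returned as-is if not a String)

format (Symbol) —

HTML format: :html5 or :html4 selects that parser, :html auto-detects the version; any other value returns the normalized content string unparsed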

Returns:

(Nokogiri::HTML::Document, Nokogiri::HTML5::Document, Nokogiri::HTML::DocumentFragment, Object)

# File 'lib/canon/comparison/html_parser.rb', line 24

def parse(content, format)
  return content unless content.is_a?(String)
  return content if already_parsed?(content)

  # Normalize HTML to ensure consistent parsing by HTML4.fragment
  # The key issue is that HTML4.fragment treats whitespace after </head>
  # differently than no whitespace, causing inconsistent parsing
  content = normalize_html_for_parsing(content)

  begin
    case format
    when :html5
      Nokogiri::HTML5.fragment(content)
    when :html4
      Nokogiri::HTML4.fragment(content)
    when :html
      detect_and_parse(content)
    else
      content
    end
  rescue StandardError
    # Fallback to raw string if parsing fails (maintains backward compatibility)
    content
  end
end
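
The control flow of .parse (pass non-strings through, dispatch on format, fall back to the raw string on error) can be sketched without the gem. In the example below the Nokogiri parsers are replaced by hypothetical lambdas in `PARSERS`, one of which deliberately raises; `sketch_parse` is likewise a made-up name, and the :html auto-detect branch and the normalization step are omitted for brevity.

```ruby
# Hypothetical stand-ins for the Nokogiri fragment parsers, so the
# dispatch logic can be run without the gem installed.
PARSERS = {
  html5: ->(s) { [:html5_fragment, s] },
  html4: ->(_s) { raise ArgumentError, "simulated parser failure" },
}.freeze

def sketch_parse(content, format)
  return content unless content.is_a?(String) # non-strings pass through

  parser = PARSERS[format]
  return content if parser.nil?               # unknown format: return input

  parser.call(content)
rescue StandardError
  content                                     # parser error: fall back to raw string
end

p sketch_parse(42, :html5)          # non-string input is returned unchanged
p sketch_parse("<p>x</p>", :html5)  # handled by the :html5 stand-in
p sketch_parse("<p>x</p>", :html4)  # stand-in raises, raw string comes back
p sketch_parse("<p>x</p>", :xml)    # unrecognized format, raw string comes back
```

One consequence of this design is that a caller cannot distinguish "parsing failed" from "format not recognized" by the return value alone: both yield a String, so checking the result with .already_parsed? is one way to tell whether parsing actually happened.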
