Class: Canon::Comparison::HtmlParser
- Inherits: Object
- Defined in: lib/canon/comparison/html_parser.rb
Overview
HTML parsing service with version detection and fragment support
Provides HTML parsing capabilities with automatic HTML4/HTML5 version detection. Handles both full documents and fragments.
Class Method Summary
- .already_parsed?(content) ⇒ Boolean
  Check if content is already a parsed HTML document/fragment.
- .detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment
  Detect HTML version from content and parse with appropriate parser.
- .detect_version(content) ⇒ Symbol
  Detect HTML version from content string.
- .normalize_html_for_parsing(content) ⇒ String
  Normalize HTML to ensure consistent parsing by HTML4.fragment.
- .parse(content, format) ⇒ Nokogiri::HTML::Document, ...
  Parse HTML string into Nokogiri document with the correct parser.
Class Method Details
.already_parsed?(content) ⇒ Boolean
Check if content is already a parsed HTML document/fragment
    # File 'lib/canon/comparison/html_parser.rb', line 54
    def already_parsed?(content)
      content.is_a?(Nokogiri::HTML::Document) ||
        content.is_a?(Nokogiri::HTML5::Document) ||
        content.is_a?(Nokogiri::HTML::DocumentFragment) ||
        content.is_a?(Nokogiri::HTML5::DocumentFragment)
    end
.detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment
Detect HTML version from content and parse with appropriate parser
    # File 'lib/canon/comparison/html_parser.rb', line 65
    def detect_and_parse(content)
      version = detect_version(content)
      if version == :html5
        Nokogiri::HTML5.fragment(content)
      else
        Nokogiri::HTML4.fragment(content)
      end
    end
.detect_version(content) ⇒ Symbol
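Detect HTML version from content string. Returns :html5 when the content contains the exact string "<!DOCTYPE html>", and :html4 otherwise. String#include? is case-sensitive, so a lowercase "<!doctype html>" is classified as :html4, even though the source comment describes the check as case-insensitive.

    # File 'lib/canon/comparison/html_parser.rb', line 78
    def detect_version(content)
      # Check for HTML5 DOCTYPE (case-insensitive)
      content.include?("<!DOCTYPE html>") ? :html5 : :html4
    end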
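To make the behaviour of the DOCTYPE check concrete, the standalone sketch below reproduces the literal match and contrasts it with a case-insensitive variant. The method detect_version_ci is hypothetical and not part of Canon; it is included only for comparison.

```ruby
# Mirrors the documented check: a literal, case-sensitive substring match.
def detect_version(content)
  content.include?("<!DOCTYPE html>") ? :html5 : :html4
end

# Hypothetical variant (NOT part of Canon): case-insensitive DOCTYPE match.
def detect_version_ci(content)
  content.match?(/<!DOCTYPE html>/i) ? :html5 : :html4
end

upper = "<!DOCTYPE html><html><body><p>Hi</p></body></html>"
lower = "<!doctype html><html><body><p>Hi</p></body></html>"
html4 = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><html></html>'

[upper, lower, html4].each do |doc|
  puts "#{doc[0, 22].inspect}: #{detect_version(doc)} / #{detect_version_ci(doc)}"
end
```

Only the lowercase DOCTYPE is classified differently by the two versions. The HTML 4.01 DOCTYPE is reported as :html4 by both, because "html" is not immediately followed by ">".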
.normalize_html_for_parsing(content) ⇒ String
Normalize HTML to ensure consistent parsing by HTML4.fragment
The key issue is that HTML4.fragment treats whitespace after </head> differently than no whitespace, causing inconsistent parsing:
- "</head>\n<body>" parses to [body, …] (body is treated as content)
- "</head><body>" parses to [meta, div, …] (wrapper tags stripped)
This method normalizes the HTML to ensure consistent parsing.
    # File 'lib/canon/comparison/html_parser.rb', line 94
    def normalize_html_for_parsing(content)
      # Remove whitespace between </head> and <body> to ensure consistent parsing
      # This makes formatted and minified HTML parse the same way
      content.gsub(%r{</head>\s*<body>}i, "</head><body>")
    end
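The effect of the substitution can be checked in isolation, since it relies only on String#gsub. The sketch below copies the method body from the listing above; the sample strings are invented for illustration.

```ruby
# Same substitution as the documented method: collapse any whitespace
# between </head> and <body>. The match is case-insensitive.
def normalize_html_for_parsing(content)
  content.gsub(%r{</head>\s*<body>}i, "</head><body>")
end

formatted = "<html><head><title>T</title></head>\n  <body><p>Hi</p></body></html>"
minified  = "<html><head><title>T</title></head><body><p>Hi</p></body></html>"

# After normalization the formatted input is identical to the minified one.
normalized = normalize_html_for_parsing(formatted)
puts normalized == minified

# A <body> tag with attributes does not match the pattern and is left alone.
with_attrs = "<head></head>\n<body class=\"x\"></body>"
puts normalize_html_for_parsing(with_attrs) == with_attrs
```

Because the pattern requires a bare <body> tag, input whose body element carries attributes passes through unchanged. The replacement text is a literal, so uppercase tags such as </HEAD> and <BODY> are rewritten in lowercase.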
.parse(content, format) ⇒ Nokogiri::HTML::Document, ...
Parse HTML string into Nokogiri document with the correct parser
    # File 'lib/canon/comparison/html_parser.rb', line 24
    def parse(content, format)
      return content unless content.is_a?(String)
      return content if already_parsed?(content)

      # Normalize HTML to ensure consistent parsing by HTML4.fragment
      # The key issue is that HTML4.fragment treats newlines after </head>
      # differently than no newlines, causing inconsistent parsing
      content = normalize_html_for_parsing(content)

      begin
        case format
        when :html5
          Nokogiri::HTML5.fragment(content)
        when :html4
          Nokogiri::HTML4.fragment(content)
        when :html
          detect_and_parse(content)
        else
          content
        end
      rescue StandardError
        # Fallback to raw string if parsing fails (maintains backward compatibility)
        content
      end
    end
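The branching in parse can be followed without Nokogiri by substituting stand-in parsers. This is a rough sketch only: PARSERS, sketch_parse, and the lambdas are hypothetical names that mirror the control flow of the listing above, whereas the real method calls Nokogiri::HTML5.fragment and Nokogiri::HTML4.fragment.

```ruby
# Hypothetical stand-ins for the Nokogiri fragment parsers; each returns a
# tagged pair so the branch that was taken is visible.
PARSERS = {
  html5: ->(s) { [:html5_fragment, s] },
  html4: ->(s) { [:html4_fragment, s] }
}.freeze

# Hypothetical mirror of the documented control flow (not Canon's code).
def sketch_parse(content, format, parsers = PARSERS)
  # Non-String input is returned untouched.
  return content unless content.is_a?(String)

  # Same normalization step as normalize_html_for_parsing.
  content = content.gsub(%r{</head>\s*<body>}i, "</head><body>")
  begin
    case format
    when :html5, :html4
      parsers.fetch(format).call(content)
    when :html
      # Same literal DOCTYPE check as detect_version.
      key = content.include?("<!DOCTYPE html>") ? :html5 : :html4
      parsers.fetch(key).call(content)
    else
      content # unknown format: the normalized string is returned
    end
  rescue StandardError
    content # parser failure: fall back to the normalized string
  end
end

p sketch_parse("<!DOCTYPE html><p>x</p>", :html).first
p sketch_parse("<p>x</p>", :html).first
p sketch_parse("<p>x</p>", :xml)
```

Both the unknown-format branch and the rescue branch return the string after normalization, not the original input, because content is reassigned before the case expression.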