Class: Canon::Comparison::HtmlParser
- Inherits: Object
  - Object
    - Canon::Comparison::HtmlParser
- Defined in: lib/canon/comparison/html_parser.rb
Overview
HTML parsing service with version detection and fragment support
Provides HTML parsing capabilities with automatic HTML4/HTML5 version detection. Handles both full documents and fragments.
Class Method Summary
- .already_parsed?(content) ⇒ Boolean
  Check if content is already a parsed HTML document/fragment.
- .detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment
  Detect HTML version from content and parse with appropriate parser.
- .detect_version(content) ⇒ Symbol
  Detect HTML version from content string.
- .parse(content, format) ⇒ Nokogiri::HTML::Document, ...
  Parse HTML string into Nokogiri document with the correct parser.
Class Method Details
.already_parsed?(content) ⇒ Boolean
Check if content is already a parsed HTML document/fragment
# File 'lib/canon/comparison/html_parser.rb', line 49
def already_parsed?(content)
  content.is_a?(Nokogiri::HTML::Document) ||
    content.is_a?(Nokogiri::HTML5::Document) ||
    content.is_a?(Nokogiri::HTML::DocumentFragment) ||
    content.is_a?(Nokogiri::HTML5::DocumentFragment)
end
.detect_and_parse(content) ⇒ Nokogiri::HTML::DocumentFragment
Detect HTML version from content and parse with appropriate parser
# File 'lib/canon/comparison/html_parser.rb', line 60
def detect_and_parse(content)
  version = detect_version(content)
  if version == :html5
    Nokogiri::HTML5.fragment(content)
  else
    Nokogiri::HTML4.fragment(content)
  end
end
.detect_version(content) ⇒ Symbol
Detect HTML version from content string
# File 'lib/canon/comparison/html_parser.rb', line 73
def detect_version(content)
  # Check for the HTML5 DOCTYPE (exact, case-sensitive substring match)
  content.include?("<!DOCTYPE html>") ? :html5 : :html4
end
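Because String#include? compares characters exactly, a lowercase `<!doctype html>` is classified as :html4 by this check. The sketch below reproduces the documented check as a standalone method and contrasts it with a hypothetical case-insensitive variant. `detect_version_ci` and the sample strings are illustrative assumptions, not part of Canon.

```ruby
# Standalone copy of the documented check: exact, case-sensitive substring match.
def detect_version(content)
  content.include?("<!DOCTYPE html>") ? :html5 : :html4
end

# Hypothetical variant (not part of Canon): accept any letter case and
# flexible whitespace, both of which the HTML standard permits in a DOCTYPE.
def detect_version_ci(content)
  content.match?(/<!DOCTYPE\s+html\s*>/i) ? :html5 : :html4
end

samples = [
  "<!DOCTYPE html><p>Hi</p>",
  "<!doctype html><p>Hi</p>",
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><p>Hi</p>',
  "<p>Hi</p>"
]

# Print both classifications side by side for each sample.
samples.each do |html|
  puts format("%-8s %-8s %s", detect_version(html), detect_version_ci(html), html[0, 30])
end
```

Only the second sample is classified differently by the two methods. The HTML 4.01 DOCTYPE and the bare fragment map to :html4 under both.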
.parse(content, format) ⇒ Nokogiri::HTML::Document, ...
Parse HTML string into Nokogiri document with the correct parser
# File 'lib/canon/comparison/html_parser.rb', line 24
def parse(content, format)
  return content unless content.is_a?(String)
  return content if already_parsed?(content)

  begin
    case format
    when :html5
      Nokogiri::HTML5.fragment(content)
    when :html4
      Nokogiri::HTML4.fragment(content)
    when :html
      detect_and_parse(content)
    else
      content
    end
  rescue StandardError
    # Fallback to raw string if parsing fails (maintains backward compatibility)
    content
  end
end
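The control flow of parse (pass non-strings through, dispatch on format, fall back to the raw input on any error) can be sketched without Nokogiri by injecting the parsers. In this minimal sketch, `safe_parse` and the `PARSERS` table are hypothetical stand-ins, and the lambdas only tag their input. In the real class those slots are filled by Nokogiri::HTML5.fragment and Nokogiri::HTML4.fragment.

```ruby
# Hypothetical stand-ins for the Nokogiri fragment parsers; each one just
# tags its input so the dispatch decision is visible.
PARSERS = {
  html5: ->(s) { [:html5_fragment, s] },
  html4: ->(s) { [:html4_fragment, s] }
}.freeze

# Mirrors the shape of HtmlParser.parse: non-strings and unknown formats
# pass through, and any StandardError falls back to the raw input.
def safe_parse(content, format, parsers = PARSERS)
  return content unless content.is_a?(String)

  case format
  when :html5, :html4
    parsers.fetch(format).call(content)
  when :html
    # Same exact-match DOCTYPE check as detect_version.
    version = content.include?("<!DOCTYPE html>") ? :html5 : :html4
    parsers.fetch(version).call(content)
  else
    content
  end
rescue StandardError
  content
end

# A parser table whose HTML5 entry always raises, to exercise the fallback.
failing = { html5: ->(_s) { raise ArgumentError, "boom" }, html4: PARSERS[:html4] }

p safe_parse("<p>Hi</p>", :html)                 # dispatched to the HTML4 stand-in
p safe_parse("<!DOCTYPE html><p>Hi</p>", :html)  # dispatched to the HTML5 stand-in
p safe_parse("<p>Hi</p>", :xml)                  # unknown format: returned unchanged
p safe_parse("<p>Hi</p>", :html5, failing)       # parser raised: raw string returned
p safe_parse(42, :html5)                         # non-string: returned unchanged
```

One consequence of this design is that a failed parse is silent: the caller receives the original String rather than an exception, so code that needs a parsed node should check the return type before using it.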