Class: Markbridge::Parsers::HTML::Parser
- Inherits:
- Object
- Inheritance chain: Object, Markbridge::Parsers::HTML::Parser
- Defined in:
- lib/markbridge/parsers/html/parser.rb
Overview
Parses HTML into an AST using Nokogiri.
Constant Summary
- IGNORED_TAGS =
Tags whose contents should be dropped entirely (not emitted as text). These are raw-text/metadata elements whose children are CSS, JavaScript, or document metadata that shouldn’t appear in output.
%w[style script head title noscript template].freeze
- WHITESPACE_RUN =
/[ \t\r\n\f]+/
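The two constants can be tried in isolation. The sketch below copies their values from the summary above; the helper names `ignored_tag?` and `collapse_whitespace` are hypothetical, and how the parser itself applies the constants (for example, whether it downcases tag names or replaces each run with a single space) is an assumption, not something this page documents.

```ruby
# Values copied from the constant summary above.
IGNORED_TAGS = %w[style script head title noscript template].freeze
WHITESPACE_RUN = /[ \t\r\n\f]+/

# Hypothetical helper: true when an element's contents would be dropped.
# Downcasing the name is an assumption made for this sketch.
def ignored_tag?(name)
  IGNORED_TAGS.include?(name.to_s.downcase)
end

# Hypothetical helper: replace each run of space, tab, CR, LF, or form feed
# with a single space. Other characters, such as U+00A0, are left alone
# because the character class does not include them.
def collapse_whitespace(text)
  text.gsub(WHITESPACE_RUN, " ")
end

puts ignored_tag?("script")                        # true
puts ignored_tag?("p")                             # false
puts collapse_whitespace("Hello,\n\t  world").inspect  # "Hello, world"
```

Note that the pattern lists exactly five ASCII whitespace characters, so a non-breaking space survives collapsing.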
Instance Attribute Summary
- #unknown_tags ⇒ Object (readonly)
Returns the value of attribute unknown_tags.
Instance Method Summary
- #initialize(handlers: nil) {|HandlerRegistry| ... } ⇒ Parser (constructor)
Create a new parser with optional custom handlers.
- #parse(input) ⇒ AST::Document
Parse HTML into an AST.
- #process_children(node, parent) ⇒ Object
Process child nodes of an element (used by handlers).
Constructor Details
#initialize(handlers: nil) {|HandlerRegistry| ... } ⇒ Parser
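Create a new parser with optional custom handlers. A block takes precedence: when one is given, the registry is built with HandlerRegistry.build_from_default and the handlers: argument is ignored. Without a block, handlers: is used, falling back to HandlerRegistry.default when it is nil.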
# File 'lib/markbridge/parsers/html/parser.rb', line 20
def initialize(handlers: nil, &block)
  @handlers =
    if block_given?
      HandlerRegistry.build_from_default(&block)
    else
      handlers || HandlerRegistry.default
    end
  @unknown_tags = Hash.new(0)
end
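The selection order in the constructor (block first, then `handlers:`, then the default registry) can be checked with a stand-in. In this sketch `StubRegistry`, its `label` and `register` members, and the `pick_handlers` function are all hypothetical; only the method names `default` and `build_from_default` come from the source, and what they do internally is assumed.

```ruby
# Stand-in for HandlerRegistry. Only the names .default and
# .build_from_default appear in the source; their bodies are assumed.
class StubRegistry
  attr_reader :label, :tags

  def initialize(label)
    @label = label
    @tags = []
  end

  def self.default
    new(:default)
  end

  # Assumed behaviour: build a registry, yield it for customisation, return it.
  def self.build_from_default
    registry = new(:customised_default)
    yield registry
    registry
  end

  # Hypothetical registration method, for illustration only.
  def register(tag)
    @tags << tag
    self
  end
end

# The same branching as Parser#initialize, extracted into a function.
def pick_handlers(handlers: nil, &block)
  if block_given?
    StubRegistry.build_from_default(&block)
  else
    handlers || StubRegistry.default
  end
end

explicit = StubRegistry.new(:explicit)

puts pick_handlers.label                      # default
puts pick_handlers(handlers: explicit).label  # explicit
# A block wins even when handlers: is also passed.
puts pick_handlers(handlers: explicit) { |r| r.register("mark") }.label  # customised_default
```

One consequence worth noting: passing both `handlers:` and a block silently discards the `handlers:` argument, so callers would normally choose one or the other.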
Instance Attribute Details
#unknown_tags ⇒ Object (readonly)
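Returns the value of attribute unknown_tags: a Hash with a default value of 0, created in the constructor and cleared at the start of every #parse call.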
# File 'lib/markbridge/parsers/html/parser.rb', line 15
def unknown_tags
  @unknown_tags
end
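Because the attribute is created with `Hash.new(0)` and cleared on each parse, it behaves like a per-call tally. The sketch below shows only the plain Ruby Hash semantics involved; that the keys are tag names and the values are occurrence counts is an assumption based on the attribute's name, and the tag names used are made up.

```ruby
# Same construction as in Parser#initialize.
unknown_tags = Hash.new(0)

# Assumed usage: one increment per unrecognised tag encountered.
%w[blink marquee blink].each { |tag| unknown_tags[tag] += 1 }

p unknown_tags               # {"blink"=>2, "marquee"=>1}

# Reading a missing key returns the default without storing the key.
p unknown_tags["video"]      # 0
p unknown_tags.key?("video") # false

# What Parser#parse does first: drop the previous call's entries.
unknown_tags.clear
p unknown_tags.empty?        # true
p unknown_tags["blink"]      # 0 (the default value survives #clear)
```

Since the hash is cleared rather than replaced, a caller that wants to keep the tally from one parse across a later one would need to copy it (for example with `dup`) before parsing again.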
Instance Method Details
#parse(input) ⇒ AST::Document
Parse HTML into an AST.
Accepts either a String of HTML source or a pre-parsed Nokogiri node (typically a DocumentFragment from Nokogiri::HTML.fragment or a full Document from Nokogiri::HTML.parse). Passing a pre-parsed tree lets a caller run their own Nokogiri-driven pre-processing without forcing Markbridge to re-parse the same bytes.
A Nokogiri::HTML::Document is unwrapped to its <body> children so the <html> / <body> / <head> wrappers don’t pollute #unknown_tags; fragments and bare elements iterate their own children directly.
# File 'lib/markbridge/parsers/html/parser.rb', line 47
def parse(input)
  @unknown_tags.clear

  # Parse HTML with Nokogiri. Using the generic HTML (HTML4) parser rather
  # than HTML5 because Nokogiri::HTML5 is not available on JRuby
  # (see sparklemotion/nokogiri#2227). Table support treats thead/tbody/tfoot
  # as transparent, so the parse-tree difference (HTML5 auto-inserts tbody,
  # HTML4 does not) has no effect on the AST.
  doc =
    if input.is_a?(Nokogiri::XML::Node)
      input
    else
      Nokogiri::HTML.fragment(input.to_s)
    end

  children = doc.is_a?(Nokogiri::HTML::Document) ? body_children(doc) : doc.children

  # Create root AST document
  document = AST::Document.new

  # Process all nodes
  children.each { |node| process_node(node, document) }

  trim_trailing_whitespace(document)
  document
end
#process_children(node, parent) ⇒ Object
Process child nodes of an element (used by handlers).
# File 'lib/markbridge/parsers/html/parser.rb', line 77
def process_children(node, parent)
  node.children.each { |child| process_node(child, parent) }
end
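To illustrate why `process_children` is public, here is a minimal, self-contained sketch of the callback pattern: a handler creates an AST node and then asks the parser to recurse into the element's children. Everything in it is a stand-in. `Node`, `AstNode`, `MiniParser`, the handler signature `(parser, node, parent)`, and the fallback for unknown tags are assumptions for illustration, not Markbridge's actual handler interface.

```ruby
# Stand-ins for Nokogiri nodes and AST nodes (hypothetical shapes).
Node = Struct.new(:name, :text, :children)
AstNode = Struct.new(:type, :value, :children)

class MiniParser
  attr_reader :unknown_tags

  def initialize(handlers)
    @handlers = handlers
    @unknown_tags = Hash.new(0)
  end

  def parse(nodes)
    @unknown_tags.clear
    root = AstNode.new(:document, nil, [])
    nodes.each { |node| process_node(node, root) }
    root
  end

  # Public so that handlers can recurse, mirroring Parser#process_children.
  def process_children(node, parent)
    node.children.each { |child| process_node(child, parent) }
  end

  private

  def process_node(node, parent)
    if node.name == "text"
      parent.children << AstNode.new(:text, node.text, [])
    elsif (handler = @handlers[node.name])
      handler.call(self, node, parent)
    else
      # Assumed fallback: count the tag and keep its contents.
      @unknown_tags[node.name] += 1
      process_children(node, parent)
    end
  end
end

# A handler builds its own AST node, then hands the children back to the parser.
bold = lambda do |parser, node, parent|
  element = AstNode.new(:bold, nil, [])
  parent.children << element
  parser.process_children(node, element)
end

tree = [
  Node.new("b", nil, [
    Node.new("text", "hi", []),
    Node.new("blink", nil, [Node.new("text", "!", [])])
  ])
]

parser = MiniParser.new({ "b" => bold })
ast = parser.parse(tree)

p ast.children.first.type                         # :bold
p ast.children.first.children.map(&:value)        # ["hi", "!"]
p parser.unknown_tags                             # {"blink"=>1}
```

The design point is that the handler decides which AST node becomes the new parent, while the parser keeps ownership of dispatch, so nested elements are routed through the same handler lookup as top-level ones.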