Class: Markbridge::Parsers::HTML::Parser

Inherits:
Object
Defined in:
lib/markbridge/parsers/html/parser.rb

Overview

Parses HTML into an AST using Nokogiri.

Constant Summary

IGNORED_TAGS = %w[style script head title noscript template].freeze

Tags whose contents are dropped entirely (not emitted as text). These are raw-text or metadata elements whose children are CSS, JavaScript, or document metadata that should not appear in the output.

WHITESPACE_RUN = /[ \t\r\n\f]+/
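
As a rough illustration of what WHITESPACE_RUN matches, the sketch below collapses each run to a single space. How the parser actually applies the pattern is not shown on this page, so the collapse_whitespace helper is an assumption made for the example.

```ruby
# Same pattern as the constant above: space, tab, CR, LF and form feed,
# which is the set the HTML standard calls ASCII whitespace.
WHITESPACE_RUN = /[ \t\r\n\f]+/

# Hypothetical helper (not part of Markbridge): replace each run with one space.
def collapse_whitespace(text)
  text.gsub(WHITESPACE_RUN, " ")
end

collapsed = collapse_whitespace("Hello,\n\t  world")

# A non-breaking space (U+00A0) is not in the character class, so it survives.
nbsp_kept = collapse_whitespace("a\u00A0\u00A0b")
```

Listing the characters explicitly, rather than using a broader whitespace class, leaves non-breaking spaces untouched.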

Constructor Details

#initialize(handlers: nil) {|HandlerRegistry| ... } ⇒ Parser

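Creates a new parser with optional custom handlers. When a block is given, the registry is built with HandlerRegistry.build_from_default and the handlers argument is ignored.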

Parameters:

  • handlers (HandlerRegistry, nil) (defaults to: nil)

    custom handler registry, defaults to HandlerRegistry.default

Yields:

  • (HandlerRegistry)

# File 'lib/markbridge/parsers/html/parser.rb', line 20

def initialize(handlers: nil, &block)
  @handlers =
    if block_given?
      HandlerRegistry.build_from_default(&block)
    else
      handlers || HandlerRegistry.default
    end
  @unknown_tags = Hash.new(0)
end
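
The branch order in the constructor means a block takes precedence over the handlers keyword, which in turn takes precedence over the default registry. The sketch below reproduces that order with stand-in classes. SketchRegistry and SketchParser are hypothetical, and the assumption that build_from_default yields a registry seeded from the defaults is inferred from its name, not confirmed by this page.

```ruby
# Stand-in registry (hypothetical): the real HandlerRegistry API beyond
# .default and .build_from_default is not shown on this page.
class SketchRegistry
  attr_reader :tags

  def initialize(tags = [])
    @tags = tags
  end

  def self.default
    new(%w[b i a])
  end

  # Assumed behavior: start from the defaults and yield the registry.
  def self.build_from_default
    registry = default
    yield registry
    registry
  end
end

# Mirrors the branch structure of Parser#initialize.
class SketchParser
  attr_reader :handlers

  def initialize(handlers: nil, &block)
    @handlers =
      if block_given?
        SketchRegistry.build_from_default(&block)
      else
        handlers || SketchRegistry.default
      end
  end
end

custom = SketchRegistry.new(%w[p])

from_default = SketchParser.new.handlers.tags
from_keyword = SketchParser.new(handlers: custom).handlers.tags
# Both a keyword and a block are given: the block wins, custom is ignored.
from_block   = SketchParser.new(handlers: custom) { |r| r.tags << "u" }.handlers.tags
```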

Instance Attribute Details

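#unknown_tags ⇒ Object (readonly)

Returns the value of attribute unknown_tags: a Hash created with Hash.new(0) in the constructor and cleared at the start of each #parse call.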



# File 'lib/markbridge/parsers/html/parser.rb', line 15

def unknown_tags
  @unknown_tags
end
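
Because the hash is created with Hash.new(0), a missing key reads as zero and can be incremented directly. The sketch below shows that idiom in isolation. The tag names are made up, and the code that increments the counter inside the parser is not shown on this page.

```ruby
unknown_tags = Hash.new(0)

# Incrementing a key that does not exist yet starts from the default of 0.
%w[marquee blink marquee].each { |tag| unknown_tags[tag] += 1 }
counts_after_run = unknown_tags.dup

# Reading a missing key returns the default without storing the key.
missing = unknown_tags["center"]
stored  = unknown_tags.key?("center")

# #parse calls clear first, so counts do not carry over between runs.
unknown_tags.clear
```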

Instance Method Details

#parse(input) ⇒ AST::Document

Parse HTML into an AST.

Accepts either a String of HTML source or a pre-parsed Nokogiri node (typically a DocumentFragment from Nokogiri::HTML.fragment or a full Document from Nokogiri::HTML.parse). Passing a pre-parsed tree lets a caller run their own Nokogiri-driven pre-processing without forcing Markbridge to re-parse the same bytes.

A Nokogiri::HTML::Document is unwrapped to its <body> children so the <html> / <body> / <head> wrappers don’t pollute #unknown_tags; fragments and bare elements iterate their own children directly.

Parameters:

  • input (String, Nokogiri::XML::Node)

    HTML source or pre-parsed Nokogiri tree

Returns:

  • (AST::Document)

# File 'lib/markbridge/parsers/html/parser.rb', line 47

def parse(input)
  @unknown_tags.clear

  # Parse HTML with Nokogiri. Using the generic HTML (HTML4) parser rather
  # than HTML5 because Nokogiri::HTML5 is not available on JRuby
  # (see sparklemotion/nokogiri#2227). Table support treats thead/tbody/tfoot
  # as transparent, so the parse-tree difference (HTML5 auto-inserts tbody,
  # HTML4 does not) has no effect on the AST.
  doc =
    if input.is_a?(Nokogiri::XML::Node)
      input
    else
      Nokogiri::HTML.fragment(input.to_s)
    end

  children = doc.is_a?(Nokogiri::HTML::Document) ? body_children(doc) : doc.children

  # Create root AST document
  document = AST::Document.new

  # Process all nodes
  children.each { |node| process_node(node, document) }
  trim_trailing_whitespace(document)

  document
end

#process_children(node, parent) ⇒ Object

Process child nodes of an element (used by handlers)

Parameters:

  • node

    node whose children are processed

  • parent

    parent passed along when each child is processed

# File 'lib/markbridge/parsers/html/parser.rb', line 77

def process_children(node, parent)
  node.children.each { |child| process_node(child, parent) }
end
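
The method is public so that a handler can create its own AST node and then ask the parser to descend into the element's children. A minimal sketch of that callback pattern follows, using stand-in structs instead of Nokogiri and Markbridge classes. The handler signature and the treatment of unknown tags (count them, then process their children under the same parent) are assumptions for illustration, not the library's confirmed behavior.

```ruby
# Stand-ins for a parsed tree node and an AST node; both are hypothetical.
SketchNode = Struct.new(:name, :text, :children)
SketchAst  = Struct.new(:type, :children)

class SketchWalker
  attr_reader :unknown_tags

  def initialize
    @unknown_tags = Hash.new(0)
    # Each handler receives the source node, the AST parent, and the walker.
    @handlers = {
      "b" => ->(node, parent, walker) {
        bold = SketchAst.new(:bold, [])
        parent.children << bold
        walker.process_children(node, bold) # recurse under the new AST node
      }
    }
  end

  # Same shape as Parser#process_children.
  def process_children(node, parent)
    node.children.each { |child| process_node(child, parent) }
  end

  private

  # Assumed dispatch: text becomes a leaf, known tags go to their handler,
  # unknown tags are counted and their children attach to the same parent.
  def process_node(node, parent)
    if node.name == "text"
      parent.children << SketchAst.new(:text, [node.text])
    elsif (handler = @handlers[node.name])
      handler.call(node, parent, self)
    else
      @unknown_tags[node.name] += 1
      process_children(node, parent)
    end
  end
end

tree = SketchNode.new("root", nil, [
  SketchNode.new("b", nil, [SketchNode.new("text", "hi", [])]),
  SketchNode.new("blink", nil, [SketchNode.new("text", "yo", [])])
])

root = SketchAst.new(:document, [])
walker = SketchWalker.new
walker.process_children(tree, root)
```

Here the "b" handler nests its text under a new bold node, while the unhandled "blink" element is counted and its text lands directly under the document.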