Class: Markbridge::Parsers::HTML::Parser

Inherits:
Object
Defined in:
lib/markbridge/parsers/html/parser.rb

Overview

Parses an HTML string into an AST using Nokogiri.

Constant Summary

IGNORED_TAGS =

Tags whose contents are dropped entirely rather than emitted as text. These are raw-text or metadata elements whose children are CSS, JavaScript, or document metadata that should not appear in the output.

%w[style script head title noscript template].freeze
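The effect of such an ignore list can be sketched in plain Ruby. This is only an illustration: the Node struct and the visible_text helper are hypothetical stand-ins, and the way the real parser consults IGNORED_TAGS is not shown on this page.

```ruby
# Hypothetical stand-in for a parsed node; the real parser walks Nokogiri nodes.
Node = Struct.new(:name, :text, :children) do
  def text?
    name == "text"
  end
end

IGNORED_TAGS = %w[style script head title noscript template].freeze

# Collect visible text, skipping ignored elements and everything below them.
def visible_text(node, out = [])
  return out if IGNORED_TAGS.include?(node.name)

  if node.text?
    out << node.text
  else
    node.children.each { |child| visible_text(child, out) }
  end
  out
end

tree = Node.new("div", nil, [
  Node.new("style", nil, [Node.new("text", "p { color: red }", [])]),
  Node.new("p", nil, [Node.new("text", "Hello", [])]),
  Node.new("script", nil, [Node.new("text", "alert(1)", [])])
])

p visible_text(tree) # ["Hello"]
```

The check runs before any recursion, so the CSS and JavaScript text nodes are never visited.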

Constructor Details

#initialize(handlers: nil) {|HandlerRegistry| ... } ⇒ Parser

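Create a new parser with optional custom handlers. If a block is given, the registry is built with HandlerRegistry.build_from_default and the handlers argument is ignored; otherwise handlers is used, falling back to HandlerRegistry.default.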

Parameters:

  • handlers (HandlerRegistry, nil) (defaults to: nil)

    custom handler registry, defaults to HandlerRegistry.default

Yields:

  • (HandlerRegistry)



# File 'lib/markbridge/parsers/html/parser.rb', line 18

def initialize(handlers: nil, &block)
  @handlers =
    if block_given?
      HandlerRegistry.build_from_default(&block)
    else
      handlers || HandlerRegistry.default
    end
  @unknown_tags = Hash.new(0)
end
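The selection order in the constructor can be tried out in isolation. The sketch below copies the constructor body but replaces HandlerRegistry with a hypothetical stand-in: the tags list, the register method, and the assumption that build_from_default yields a copy of the default registry are all illustrative, not the library's API.

```ruby
# Hypothetical stand-in for HandlerRegistry; only the two class methods the
# constructor calls are modelled here.
class HandlerRegistry
  attr_reader :tags

  def initialize(tags = [])
    @tags = tags
  end

  def self.default
    new(%w[p a])
  end

  # Assumed behavior: start from the defaults and yield them for customization.
  def self.build_from_default
    registry = default
    yield registry
    registry
  end

  # Hypothetical registration method, for illustration only.
  def register(tag)
    @tags << tag
    self
  end
end

class Parser
  attr_reader :handlers

  # Same selection logic as the documented constructor.
  def initialize(handlers: nil, &block)
    @handlers =
      if block_given?
        HandlerRegistry.build_from_default(&block)
      else
        handlers || HandlerRegistry.default
      end
  end
end

custom = HandlerRegistry.new(%w[b])

default_parser  = Parser.new
explicit_parser = Parser.new(handlers: custom)
block_parser    = Parser.new(handlers: custom) { |registry| registry.register("video") }

p default_parser.handlers.tags  # ["p", "a"]
p explicit_parser.handlers.tags # ["b"]
p block_parser.handlers.tags    # ["p", "a", "video"]
```

The last case shows that passing both handlers: and a block silently discards the handlers: argument, so callers should pick one style.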

Instance Attribute Details

#unknown_tags ⇒ Object (readonly)

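Returns the value of attribute unknown_tags: a Hash with a default value of 0, created in the constructor and cleared at the start of every #parse call.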



# File 'lib/markbridge/parsers/html/parser.rb', line 13

def unknown_tags
  @unknown_tags
end
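Because the reader returns the same Hash object that #parse clears in place, results from one parse disappear on the next unless they are copied first. A minimal sketch of that behavior with a plain Hash follows; the tag names and the counting loop are assumptions about how the hash might be filled, since the code that populates it is not shown here.

```ruby
unknown_tags = Hash.new(0)

# Assumed usage: bump a counter each time an unhandled tag is seen.
%w[marquee blink marquee].each { |tag| unknown_tags[tag] += 1 }

# Take a copy before the next parse, which would clear the original in place.
snapshot = unknown_tags.dup

unknown_tags.clear

p snapshot["marquee"]     # 2
p unknown_tags.empty?     # true
p unknown_tags["marquee"] # 0, the default value survives clear
```

Reading a missing key returns 0 without inserting it, so lookups do not grow the hash.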

Instance Method Details

#parse(input) ⇒ AST::Document

Parse an HTML string into an AST. The input is parsed as a fragment, and unknown_tags is cleared at the start of each call.

Parameters:

  • input (String)

    HTML source

Returns:

  • (AST::Document)

    root document node of the parsed AST



# File 'lib/markbridge/parsers/html/parser.rb', line 31

def parse(input)
  @unknown_tags.clear

  # Parse HTML with Nokogiri. Using the generic HTML (HTML4) parser rather
  # than HTML5 because Nokogiri::HTML5 is not available on JRuby
  # (see sparklemotion/nokogiri#2227). Table support treats thead/tbody/tfoot
  # as transparent, so the parse-tree difference (HTML5 auto-inserts tbody,
  # HTML4 does not) has no effect on the AST.
  doc = Nokogiri::HTML.fragment(input)

  # Create root AST document
  document = AST::Document.new

  # Process all nodes
  doc.children.each { |node| process_node(node, document) }

  document
end
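The comment's claim that the tbody difference does not matter can be illustrated without Nokogiri. The sketch below is a simplified model, not the library's table handler: Node, TRANSPARENT, and table_rows are hypothetical names, and the two trees mimic the shapes the comment attributes to the HTML4 and HTML5 parsers.

```ruby
# Hypothetical stand-in for an element node.
Node = Struct.new(:name, :children)

# Section wrappers that are looked through rather than represented.
TRANSPARENT = %w[thead tbody tfoot].freeze

# Collect the rows of a table, descending through transparent wrappers.
def table_rows(node, rows = [])
  node.children.each do |child|
    if TRANSPARENT.include?(child.name)
      table_rows(child, rows)
    elsif child.name == "tr"
      rows << child
    end
  end
  rows
end

row = Node.new("tr", [])
without_tbody = Node.new("table", [row])
with_tbody    = Node.new("table", [Node.new("tbody", [row])])

p table_rows(without_tbody) == table_rows(with_tbody) # true
```

Both shapes yield the same row list, which is why the choice of parser has no visible effect in this model.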

#process_children(node, parent) ⇒ Object

Process child nodes of an element (used by handlers)

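Parameters:

  • node

    node whose children are processed

  • parent

    AST node passed to process_node as the parent for each child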



# File 'lib/markbridge/parsers/html/parser.rb', line 53

def process_children(node, parent)
  node.children.each { |child| process_node(child, parent) }
end
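A handler would typically create its own AST node and then call process_children so that nested content lands under it. The following is a self-contained sketch under several assumptions: MiniParser, HtmlNode, AstNode, the lambda-based handler signature, and the pass-through treatment of unknown elements are all hypothetical and only mirror the recursion shown above.

```ruby
HtmlNode = Struct.new(:name, :text, :children)
AstNode  = Struct.new(:type, :text, :children)

class MiniParser
  def initialize(handlers)
    @handlers = handlers
  end

  def process_node(node, parent)
    if node.name == "text"
      parent.children << AstNode.new(:text, node.text, [])
    elsif (handler = @handlers[node.name])
      handler.call(node, parent, self)
    else
      # Assumed fallback: keep the contents of an unknown element, drop the wrapper.
      process_children(node, parent)
    end
  end

  # Same body as the documented method.
  def process_children(node, parent)
    node.children.each { |child| process_node(child, parent) }
  end
end

# Hypothetical handler: wraps the children of <b> in a :bold node.
bold = lambda do |node, parent, parser|
  element = AstNode.new(:bold, nil, [])
  parent.children << element
  parser.process_children(node, element)
end

parser = MiniParser.new({ "b" => bold })
root = AstNode.new(:document, nil, [])
html = HtmlNode.new("span", nil, [
  HtmlNode.new("b", nil, [HtmlNode.new("text", "hi", [])])
])

parser.process_node(html, root)
p root.children.first.type                # :bold
p root.children.first.children.first.text # "hi"
```

Passing the new element as parent is what nests the text under :bold; passing the original parent instead would flatten it.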