Class: RubyUIConverter::HtmlTokenizer

Inherits:
Object
Defined in:
lib/ruby_ui_converter/html_tokenizer.rb

Overview

A small, forgiving HTML tokenizer. It does not validate markup; it emits a flat stream of tokens that the Parser turns into a tree. ERB placeholders (from the Lexer) are treated as ordinary text/attribute characters.

Token shapes:

[:text, string]
[:html_comment, inner]
[:doctype, raw]
[:open, name, attrs]        attrs => [[name, value_or_nil], ...]
[:selfclose, name, attrs]
[:close, name]
[:raw_element, name, attrs, inner_text]   (script/style)
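
Because every token is a plain array whose first element names its kind, a consumer can dispatch on that symbol. The sketch below is illustrative only: `SAMPLE_TOKENS` is a hand-written stream in the shapes listed above (not captured output from the tokenizer), and `describe_token` is a hypothetical helper that is not part of the library.

```ruby
# Hand-written sample stream in the documented shapes (illustrative data,
# not real HtmlTokenizer output).
SAMPLE_TOKENS = [
  [:doctype, "<!DOCTYPE html>"],
  [:open, "div", [["class", "card"], ["hidden", nil]]],
  [:text, "Hello"],
  [:selfclose, "br", []],
  [:raw_element, "script", [], "let x = 1 < 2;"],
  [:close, "div"],
  [:html_comment, " done "]
].freeze

# Hypothetical helper: renders one token as a short human-readable string.
def describe_token(token)
  kind, *fields = token
  case kind
  when :text, :html_comment, :doctype
    "#{kind}: #{fields[0].inspect}"
  when :open, :selfclose
    name, attrs = fields
    # A nil value marks a valueless (boolean) attribute.
    rendered = attrs.map { |key, value| value.nil? ? key : "#{key}=#{value.inspect}" }
    "#{kind}: <#{name}> #{rendered.join(' ')}".strip
  when :close
    "close: </#{fields[0]}>"
  when :raw_element
    name, _attrs, inner_text = fields
    "raw_element: <#{name}> with #{inner_text.length} chars of raw text"
  else
    raise ArgumentError, "unknown token kind: #{kind.inspect}"
  end
end

DESCRIPTIONS = SAMPLE_TOKENS.map { |token| describe_token(token) }
```

Attributes are kept as an ordered list of pairs rather than a Hash, so source order and any duplicate attribute names can survive into the token.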

Constant Summary

VOID =
%w[area base br col embed hr img input link meta param source track wbr].freeze
RAW =
%w[script style].freeze
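
The source of `scan_tag`, which presumably consults these constants, is not shown on this page. As a rough sketch of how the two lists could map a start tag to a token kind, assuming that void elements and tags ending in `/>` become `:selfclose` and that names are compared case-insensitively (both are assumptions), a hypothetical helper might look like this:

```ruby
VOID = %w[area base br col embed hr img input link meta param source track wbr].freeze
RAW = %w[script style].freeze

# Hypothetical helper, not part of the documented class: chooses a token kind
# for a start tag from its name and whether the tag was written with "/>".
def start_tag_kind(name, explicit_self_close: false)
  key = name.downcase # assumption: matching ignores case
  return :raw_element if RAW.include?(key)
  return :selfclose if explicit_self_close || VOID.include?(key)

  :open
end
```

With these assumptions, `start_tag_kind("br")` gives `:selfclose`, `start_tag_kind("div")` gives `:open`, and `start_tag_kind("style")` gives `:raw_element`.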

Instance Method Summary

Constructor Details

#initialize(html) ⇒ HtmlTokenizer

Returns a new instance of HtmlTokenizer.



# File 'lib/ruby_ui_converter/html_tokenizer.rb', line 22

def initialize(html)
  @s = StringScanner.new(html.to_s)
end

Instance Method Details

#tokens ⇒ Object



# File 'lib/ruby_ui_converter/html_tokenizer.rb', line 26

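def tokens
  out = []

  # Branch order matters: the more specific "<!" forms are tried first.
  until @s.eos?
    if @s.scan(/<!--(.*?)-->/m)
      # HTML comment: keep only the text between the delimiters.
      out << [:html_comment, @s[1]]
    elsif @s.scan(/<!\[CDATA\[.*?\]\]>/m)
      # CDATA section: passed through verbatim as text, wrapper included.
      out << [:text, @s.matched]
    elsif @s.scan(/<![^>]*>/m)
      # Any other "<!...>" declaration (e.g. DOCTYPE), kept raw.
      out << [:doctype, @s.matched]
    elsif @s.scan(%r{</\s*([a-zA-Z][\w:-]*)\s*>})
      out << [:close, @s[1]]
    elsif @s.scan(/<([a-zA-Z][\w:-]*)/)
      # Start tag: scan_tag (not shown) is handed the name and returns the token.
      out << scan_tag(@s[1])
    else
      # Plain text up to the next "<", or a single stray "<".
      text = @s.scan(/[^<]+/) || @s.getch
      out << [:text, text]
    end
  end

  out
end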
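
To see how the branch order plays out, here is a simplified, standalone stand-in for `#tokens`. The comment, CDATA, declaration, close-tag, and text branches reuse the regexes from the listing above. The start-tag branch is a hypothetical placeholder, because `scan_tag` is not shown on this page. The name `sketch_tokens` is likewise illustrative.

```ruby
require "strscan"

# Simplified stand-in for HtmlTokenizer#tokens (illustrative, not the real class).
def sketch_tokens(html)
  s = StringScanner.new(html.to_s)
  out = []

  until s.eos?
    if s.scan(/<!--(.*?)-->/m)
      out << [:html_comment, s[1]]
    elsif s.scan(/<!\[CDATA\[.*?\]\]>/m)
      # Must come before the generic "<!...>" branch, which would also match.
      out << [:text, s.matched]
    elsif s.scan(/<![^>]*>/m)
      out << [:doctype, s.matched]
    elsif s.scan(%r{</\s*([a-zA-Z][\w:-]*)\s*>})
      out << [:close, s[1]]
    elsif s.scan(/<([a-zA-Z][\w:-]*)[^>]*>/)
      # Placeholder only: attributes are discarded here, unlike scan_tag.
      out << [:open, s[1], []]
    else
      # Text up to the next "<", or one stray "<" that starts no construct.
      out << [:text, s.scan(/[^<]+/) || s.getch]
    end
  end

  out
end

RESULT = sketch_tokens("<!-- hi --><p>1 < 2</p><![CDATA[x]]>")
# RESULT is:
# [[:html_comment, " hi "], [:open, "p", []], [:text, "1 "], [:text, "<"],
#  [:text, " 2"], [:close, "p"], [:text, "<![CDATA[x]]>"]]
```

A bare `<` that does not begin a comment, declaration, or tag is emitted as its own one-character `:text` token, so adjacent text tokens are not merged. In the real class the scanner is created once in `#initialize`, so a second call to `#tokens` on the same instance finds the scanner already at the end of input and returns an empty array.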