Class: RubyUIConverter::HtmlTokenizer

Inherits:

Object


Defined in: lib/ruby_ui_converter/html_tokenizer.rb

Overview

A small, forgiving HTML tokenizer. It does not validate markup; it emits a flat stream of tokens that the Parser turns into a tree. ERB placeholders (from the Lexer) are treated as ordinary text/attribute characters.

Token shapes:

[:text, string]
[:html_comment, inner]
[:doctype, raw]
[:open, name, attrs]        attrs => [[name, value_or_nil], ...]
[:selfclose, name, attrs]
[:close, name]
[:raw_element, name, attrs, inner_text]   (script/style)
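
As an illustration of the shapes above, the following sketch walks a hand-written token array and summarizes each entry. The token data and the `describe` helper are hypothetical and exist only for this example; they are not output captured from the tokenizer or part of its API.

```ruby
# Hand-written tokens in the documented shapes (not real tokenizer output).
tokens = [
  [:doctype, "<!DOCTYPE html>"],
  [:open, "div", [["class", "card"], ["hidden", nil]]],
  [:text, "Hello"],
  [:selfclose, "br", []],
  [:raw_element, "script", [], "let x = 1;"],
  [:html_comment, " note "],
  [:close, "div"]
]

# Hypothetical helper: the first element is the token kind, and the
# remaining elements depend on that kind.
def describe(token)
  kind, *rest = token
  case kind
  when :text         then "text(#{rest[0].length})"
  when :html_comment then "comment"
  when :doctype      then "doctype"
  when :open         then "open #{rest[0]} (#{rest[1].length} attrs)"
  when :selfclose    then "selfclose #{rest[0]}"
  when :close        then "close #{rest[0]}"
  when :raw_element  then "raw #{rest[0]} (#{rest[2].length} chars)"
  else raise ArgumentError, "unknown token kind: #{kind.inspect}"
  end
end

summary = tokens.map { |t| describe(t) }
```

An attribute without a value, such as `hidden` here, appears as a `[name, nil]` pair, so a consumer should check for `nil` before treating the value as a string.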

Constant Summary

VOID =

%w[area base br col embed hr img input link meta param source track wbr].freeze

RAW =

%w[script style].freeze
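
The `scan_tag` method that consults these constants is not shown on this page, so the sketch below is only a guess at how the two lists might map a tag name to a token kind. `token_kind` and its `explicit_self_close` parameter are hypothetical names, and the constants are local copies made so the example runs on its own.

```ruby
# Local copies of the documented constants so the sketch is self-contained.
VOID = %w[area base br col embed hr img input link meta param source track wbr].freeze
RAW  = %w[script style].freeze

# Hypothetical helper (an assumption, not the library's code): choose a
# token kind from the tag name and whether the tag was written as "<x />".
def token_kind(name, explicit_self_close: false)
  return :raw_element if RAW.include?(name)
  return :selfclose if explicit_self_close || VOID.include?(name)

  :open
end

kinds = %w[br script div].map { |n| token_kind(n) }
```

Under this assumption a void element such as `br` never needs a matching `[:close, name]` token, and the content of `script` or `style` is carried as opaque text instead of being tokenized.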

Instance Method Summary

#initialize(html) ⇒ HtmlTokenizer constructor

A new instance of HtmlTokenizer.
#tokens ⇒ Object

Constructor Details

#initialize(html) ⇒ `HtmlTokenizer`

Returns a new instance of HtmlTokenizer.




# File 'lib/ruby_ui_converter/html_tokenizer.rb', line 22

def initialize(html)
  @s = StringScanner.new(html.to_s)
end
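
The constructor relies on two things that a short standalone sketch can show: `to_s` turns a `nil` argument into an empty string, and `StringScanner` (from Ruby's `strscan` standard library) keeps a cursor that advances as patterns match. The variable names below are illustrative only.

```ruby
require "strscan"

# nil.to_s is "", so a scanner built from nil starts at end-of-string.
empty = StringScanner.new(nil.to_s)
empty_at_end = empty.eos?

# scan matches only at the current position and moves the cursor past the match.
scanner = StringScanner.new("<p>hi</p>")
scanner.scan(/<p>/)
position = scanner.pos
rest = scanner.rest
```

Because the scanner is created once and stored in `@s`, its cursor position is state shared by every later call on the same tokenizer instance.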

Instance Method Details

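#tokens ⇒ `Object`

Scans the input from the current scanner position to the end and returns the tokens as a flat Array, in the shapes listed in the Overview.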

# File 'lib/ruby_ui_converter/html_tokenizer.rb', line 26

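def tokens
  out = []

  # Branch order matters: comments and CDATA are tried before the generic
  # "<!...>" pattern, which would otherwise match them first.
  until @s.eos?
    if @s.scan(/<!--(.*?)-->/m)
      out << [:html_comment, @s[1]]   # inner text only, delimiters dropped
    elsif @s.scan(/<!\[CDATA\[.*?\]\]>/m)
      out << [:text, @s.matched]      # CDATA section kept verbatim as text
    elsif @s.scan(/<![^>]*>/m)
      out << [:doctype, @s.matched]   # any other "<!...>" declaration
    elsif @s.scan(%r{</\s*([a-zA-Z][\w:-]*)\s*>})
      out << [:close, @s[1]]
    elsif @s.scan(/<([a-zA-Z][\w:-]*)/)
      out << scan_tag(@s[1])          # attributes and tag end are read by scan_tag
    else
      # Plain text up to the next "<". A stray "<" that starts no token is
      # emitted as a one-character text token so the loop always advances.
      text = @s.scan(/[^<]+/) || @s.getch
      out << [:text, text]
    end
  end

  out
end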
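
To see the branch order in action without the rest of the class, here is a standalone copy of the loop. `sketch_tokens` is a hypothetical name, and the opening-tag branch is left out because `scan_tag` is not shown on this page, so this sketch only handles comments, CDATA, declarations, closing tags, and text.

```ruby
require "strscan"

# Hypothetical standalone copy of the #tokens loop, minus the opening-tag
# branch (scan_tag is not available here).
def sketch_tokens(html)
  s = StringScanner.new(html.to_s)
  out = []

  until s.eos?
    if s.scan(/<!--(.*?)-->/m)
      out << [:html_comment, s[1]]
    elsif s.scan(/<!\[CDATA\[.*?\]\]>/m)
      out << [:text, s.matched]
    elsif s.scan(/<![^>]*>/m)
      out << [:doctype, s.matched]
    elsif s.scan(%r{</\s*([a-zA-Z][\w:-]*)\s*>})
      out << [:close, s[1]]
    else
      # Text up to the next "<", or a single stray "<" character.
      out << [:text, s.scan(/[^<]+/) || s.getch]
    end
  end

  out
end

result = sketch_tokens("<!DOCTYPE html><!-- hi -->a < b</p>")
# result:
# [[:doctype, "<!DOCTYPE html>"], [:html_comment, " hi "],
#  [:text, "a "], [:text, "<"], [:text, " b"], [:close, "p"]]
```

The stray `<` in `a < b` comes out as its own one-character text token, which is what makes the tokenizer forgiving of malformed markup. In the real class the scanner lives in `@s` and is consumed as it goes, so a second call to `#tokens` on the same instance starts at end-of-string and returns an empty array.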