Class: RubyUIConverter::HtmlTokenizer
- Inherits: Object
- Hierarchy: Object > RubyUIConverter::HtmlTokenizer
- Defined in: lib/ruby_ui_converter/html_tokenizer.rb
Overview
A small, forgiving HTML tokenizer. It does not validate markup; it emits a flat stream of tokens that the Parser turns into a tree. ERB placeholders (from the Lexer) are treated as ordinary text/attribute characters.
Token shapes:
  [:text, string]
  [:html_comment, inner]
  [:doctype, raw]
  [:open, name, attrs]                      # attrs => [[name, value_or_nil], ...]
  [:selfclose, name, attrs]
  [:close, name]
  [:raw_element, name, attrs, inner_text]   # script/style
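As an illustration of how a consumer might work with these shapes, the sketch below pattern-matches on each token kind and rebuilds approximate markup. The names `render_token`, `render_attrs`, and `sample_tokens` are hypothetical and not part of RubyUIConverter, and the token array is hand-written to match the shapes listed above rather than captured from the tokenizer. It assumes Ruby 3.0 or later for `case`/`in`.

```ruby
# Hypothetical consumer of the documented token shapes (not part of the gem).
def render_attrs(attrs)
  # attrs is [[name, value_or_nil], ...]; a nil value means a bare attribute.
  attrs.map { |name, value| value.nil? ? " #{name}" : %( #{name}="#{value}") }.join
end

def render_token(token)
  case token
  in [:text, text]
    text
  in [:html_comment, inner]
    "<!--#{inner}-->"
  in [:doctype, raw]
    raw
  in [:open, name, attrs]
    "<#{name}#{render_attrs(attrs)}>"
  in [:selfclose, name, attrs]
    "<#{name}#{render_attrs(attrs)} />"
  in [:close, name]
    "</#{name}>"
  in [:raw_element, name, attrs, inner_text]
    "<#{name}#{render_attrs(attrs)}>#{inner_text}</#{name}>"
  end
end

# Hand-written sample stream; an assumption about what the tokenizer could emit.
def sample_tokens
  [
    [:doctype, "<!DOCTYPE html>"],
    [:open, "div", [["class", "box"], ["hidden", nil]]],
    [:text, "Hi"],
    [:selfclose, "br", []],
    [:close, "div"]
  ]
end

puts sample_tokens.map { |t| render_token(t) }.join
```

Because every token is a plain array whose first element is the kind, array patterns of different lengths are enough to tell the shapes apart without any wrapper classes.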
Constant Summary
- VOID = %w[area base br col embed hr img input link meta param source track wbr].freeze
- RAW = %w[script style].freeze
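This page does not show where VOID and RAW are consulted (presumably inside `scan_tag`, whose source is not listed here). The sketch below is one plausible way such lists could map a tag name to a token kind. The module `TagKindSketch`, the method `token_type_for`, the `explicit_self_close` flag, and the case-insensitive lookup are all assumptions for illustration, not the library's behavior.

```ruby
# Hypothetical sketch; the real scan_tag may classify tags differently.
module TagKindSketch
  VOID = %w[area base br col embed hr img input link meta param source track wbr].freeze
  RAW  = %w[script style].freeze

  module_function

  # Returns :raw_element, :selfclose, or :open for a tag name.
  # explicit_self_close stands in for a trailing "/>" seen in the markup.
  def token_type_for(name, explicit_self_close: false)
    key = name.downcase # assumption: tag names are compared case-insensitively
    return :raw_element if RAW.include?(key)
    return :selfclose if explicit_self_close || VOID.include?(key)

    :open
  end
end

p TagKindSketch.token_type_for("br")
p TagKindSketch.token_type_for("script")
p TagKindSketch.token_type_for("div")
```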
Instance Method Summary
- #initialize(html) ⇒ HtmlTokenizer (constructor)
  A new instance of HtmlTokenizer.
- #tokens ⇒ Object
Constructor Details
#initialize(html) ⇒ HtmlTokenizer
Returns a new instance of HtmlTokenizer.
# File 'lib/ruby_ui_converter/html_tokenizer.rb', line 22

  def initialize(html)
    @s = StringScanner.new(html.to_s)
  end
Instance Method Details
#tokens ⇒ Object
# File 'lib/ruby_ui_converter/html_tokenizer.rb', line 26

  def tokens
    out = []
    until @s.eos?
      if @s.scan(/<!--(.*?)-->/m)
        out << [:html_comment, @s[1]]
      elsif @s.scan(/<!\[CDATA\[.*?\]\]>/m)
        out << [:text, @s.matched]
      elsif @s.scan(/<![^>]*>/m)
        out << [:doctype, @s.matched]
      elsif @s.scan(%r{</\s*([a-zA-Z][\w:-]*)\s*>})
        out << [:close, @s[1]]
      elsif @s.scan(/<([a-zA-Z][\w:-]*)/)
        out << scan_tag(@s[1])
      else
        text = @s.scan(/[^<]+/) || @s.getch
        out << [:text, text]
      end
    end
    out
  end
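To see the branch order in action without the rest of the library, the following standalone sketch copies the branches of #tokens that do not depend on `scan_tag` (which is not shown on this page). The method name `sketch_tokens` is hypothetical. Opening tags are deliberately not handled here, so the sample input contains none; for such input the sketch should take the same branches as the listing above.

```ruby
require "strscan"

# Sketch only: the #tokens loop minus the opening-tag branch (scan_tag).
def sketch_tokens(html)
  s = StringScanner.new(html.to_s) # to_s lets nil behave like an empty document
  out = []
  until s.eos?
    if s.scan(/<!--(.*?)-->/m)            # comments are tried before the generic "<!" rule
      out << [:html_comment, s[1]]
    elsif s.scan(/<!\[CDATA\[.*?\]\]>/m)  # CDATA is kept verbatim as text
      out << [:text, s.matched]
    elsif s.scan(/<![^>]*>/m)             # any other "<!...>" is reported as a doctype
      out << [:doctype, s.matched]
    elsif s.scan(%r{</\s*([a-zA-Z][\w:-]*)\s*>})
      out << [:close, s[1]]
    else
      # Text up to the next "<"; a "<" that starts nothing recognizable
      # is consumed one character at a time so the loop always advances.
      out << [:text, s.scan(/[^<]+/) || s.getch]
    end
  end
  out
end

p sketch_tokens("<!DOCTYPE html><!-- a -->1 < 2</p>")
# => [[:doctype, "<!DOCTYPE html>"], [:html_comment, " a "],
#     [:text, "1 "], [:text, "<"], [:text, " 2"], [:close, "p"]]
```

Note how the stray `<` in `1 < 2` becomes its own one-character text token, which splits the surrounding text into three tokens rather than raising an error. That fallback is what makes the tokenizer forgiving.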