Class: Scrapetor::SAX::Tokenizer

Inherits:
Object
Defined in:
lib/scrapetor/sax.rb

Overview

Standalone tokenizer that yields events directly, without going through a handler. Useful when you just want to iterate over events or obtain an enumerator:

Scrapetor::SAX::Tokenizer.new(html).each_event do |type, *args|
  # ...
end

Constant Summary

VOID =
%w[
  area base br col embed hr img input link meta source track wbr
].freeze
RAW_TEXT =
%w[script style].freeze
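
This page does not show how the two constants are consumed. As a rough sketch, assuming they carry the usual HTML meanings (VOID elements never take a closing tag, RAW_TEXT elements have bodies that are not parsed as markup), a tokenizer might consult them as below. The helpers `void_element?` and `raw_text_element?` are hypothetical names, not part of Scrapetor.

```ruby
# Local copies of the two lists so the sketch runs without Scrapetor loaded.
VOID = %w[
  area base br col embed hr img input link meta source track wbr
].freeze
RAW_TEXT = %w[script style].freeze

# Hypothetical helper: true when the tag never has a closing tag (e.g. <br>).
def void_element?(name)
  VOID.include?(name.to_s.downcase)
end

# Hypothetical helper: true when the tag's body should be read as raw text
# up to its closing tag (e.g. <script>).
def raw_text_element?(name)
  RAW_TEXT.include?(name.to_s.downcase)
end

void_element?("BR")        # true
raw_text_element?(:style)  # true
void_element?("div")       # false
```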

Instance Method Summary

Constructor Details

#initialize(html) ⇒ Tokenizer

Returns a new instance of Tokenizer.

# File 'lib/scrapetor/sax.rb', line 80

def initialize(html)
  @html = Scrapetor::Encoding.to_utf8(html)
  @pos  = 0
  @len  = @html.bytesize
end
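
The constructor stores `bytesize` rather than `length`, which suggests that `@pos` is a byte offset. A short plain-Ruby illustration of why the two differ for non-ASCII input follows. The private `byte` helper is not shown on this page, so `String#getbyte` stands in for it here as an assumption.

```ruby
html = "<p>\u00E9</p>" # "é" between the tags

char_count = html.length      # 8 characters
byte_count = html.bytesize    # 9 bytes, since "é" takes two bytes in UTF-8
first_byte = html.getbyte(0)  # 0x3C, the '<' byte
```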

Instance Method Details

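#each_event(&block) ⇒ Object

Yields each event as an array whose first element is the event type, starting with [:doc_start] and ending with [:doc_end]. Returns an Enumerator when called without a block, otherwise returns self.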

# File 'lib/scrapetor/sax.rb', line 86

def each_event(&block)
  return enum_for(:each_event) unless block_given?
  block.call([:doc_start])
  while @pos < @len
    ch = byte(@pos)
    if ch == 0x3C # '<'
      handle_open(&block)
    else
      handle_text(&block)
    end
  end
  block.call([:doc_end])
  self
end
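
The method above combines two idioms: returning `enum_for` when no block is given, and dispatching on the current byte. The sketch below reproduces that control flow in a self-contained class so it can be run without Scrapetor. `MiniTokenizer` and its `:tag` and `:text` events are hypothetical stand-ins, since the real `handle_open` and `handle_text` are not shown on this page and almost certainly emit richer events.

```ruby
# Hypothetical stand-in, not the Scrapetor implementation: it only splits
# input into :tag and :text events to illustrate the control flow.
class MiniTokenizer
  LT = 0x3C # '<'
  GT = 0x3E # '>'

  def initialize(html)
    @html = html.dup.force_encoding(Encoding::UTF_8)
    @pos  = 0
    @len  = @html.bytesize
  end

  def each_event(&block)
    return enum_for(:each_event) unless block_given?
    block.call([:doc_start])
    while @pos < @len
      if @html.getbyte(@pos) == LT
        handle_open(&block)
      else
        handle_text(&block)
      end
    end
    block.call([:doc_end])
    self
  end

  private

  # Emit everything from '<' through the next '>' (or end of input) as a tag.
  def handle_open
    start = @pos
    @pos += 1 while @pos < @len && @html.getbyte(@pos) != GT
    @pos += 1 if @pos < @len # consume the '>'
    yield [:tag, @html.byteslice(start, @pos - start)]
  end

  # Emit everything up to the next '<' (or end of input) as text.
  def handle_text
    start = @pos
    @pos += 1 while @pos < @len && @html.getbyte(@pos) != LT
    yield [:text, @html.byteslice(start, @pos - start)]
  end
end

# Enumerator form: no block, so each_event returns an Enumerator.
events = MiniTokenizer.new("<p>hi</p>").each_event.to_a
# events == [[:doc_start], [:tag, "<p>"], [:text, "hi"], [:tag, "</p>"], [:doc_end]]

# Block form: the single array argument is splatted into type and args.
MiniTokenizer.new("a<b>").each_event do |type, *args|
  puts "#{type}: #{args.inspect}"
end
```

Because each event is passed as one array, a block declared as `|type, *args|` receives the type and the remaining elements separately, while the enumerator form hands back the arrays unchanged.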