Module: Scrapetor::SAX
Defined in: lib/scrapetor/sax.rb
Overview
Pure-Ruby SAX-style streaming HTML parser.
The hot path for production extraction is the C streaming engine behind `doc.extract`. This module exists for the cases where you genuinely want token-by-token control: debugging, custom incremental processors, or conversion to other formats.
Usage:
class MyHandler < Scrapetor::SAX::Document
  def start_element(name, attrs); puts "<#{name}>"; end
  def end_element(name); puts "</#{name}>"; end
  def characters(text); puts text; end
  def comment(text); puts "<!--#{text}-->"; end
  def doctype(name); puts "<!DOCTYPE #{name}>"; end
end
Scrapetor::SAX::Parser.new(MyHandler.new).parse(html)
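
As an illustration of the "conversion to other formats" case, the sketch below shows a handler that accumulates state across callbacks instead of printing each token. It is not part of the library: `TextCollector`, its `text` method, and the skip list are hypothetical names chosen for this example. So that the snippet runs without the gem, it defines a stand-in `Scrapetor::SAX::Document` with no-op callbacks when the real one is not loaded, and it drives the callbacks by hand rather than through the parser. It also assumes element names arrive as strings and ignores `attrs`, whose exact shape is not documented here.

```ruby
# Stand-in base class so the sketch runs without the gem installed.
# Assumption: the real Scrapetor::SAX::Document offers the same five
# callbacks shown in the usage example above.
unless defined?(Scrapetor::SAX::Document)
  module Scrapetor
    module SAX
      class Document
        def start_element(name, attrs); end
        def end_element(name); end
        def characters(text); end
        def comment(text); end
        def doctype(name); end
      end
    end
  end
end

# Hypothetical handler: collects visible text and skips the contents
# of <script> and <style> elements.
class TextCollector < Scrapetor::SAX::Document
  SKIPPED = %w[script style].freeze

  def initialize
    super()
    @chunks = []
    @skip_depth = 0 # > 0 while inside a skipped element
  end

  def start_element(name, _attrs)
    @skip_depth += 1 if SKIPPED.include?(name.to_s.downcase)
  end

  def end_element(name)
    return unless SKIPPED.include?(name.to_s.downcase)
    @skip_depth -= 1 if @skip_depth > 0
  end

  def characters(text)
    return if @skip_depth > 0
    stripped = text.strip
    @chunks << stripped unless stripped.empty?
  end

  # Joins the collected text runs with single spaces.
  def text
    @chunks.join(" ")
  end
end

# Simulate the event stream a parser might emit for
# "<p>Hello</p><script>x()</script><p> world </p>".
collector = TextCollector.new
collector.start_element("p", {})
collector.characters("Hello")
collector.end_element("p")
collector.start_element("script", {})
collector.characters("x()")
collector.end_element("script")
collector.start_element("p", {})
collector.characters(" world ")
collector.end_element("p")
puts collector.text
```

With the gem loaded, the same handler would presumably be passed to the parser exactly as in the usage above, i.e. `Scrapetor::SAX::Parser.new(collector).parse(html)`, and `collector.text` read afterwards.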