Module: Scrapetor::SAX

Defined in:
lib/scrapetor/sax.rb

Overview

Pure-Ruby SAX-style streaming HTML parser.

The hot path for production extraction is the C streaming engine behind `doc.extract`. This module exists for the cases where you genuinely want token-by-token control: debugging, custom incremental processors, or conversion to other formats.

Usage:

class MyHandler < Scrapetor::SAX::Document
  def start_element(name, attrs); puts "<#{name}>"; end
  def end_element(name);          puts "</#{name}>"; end
  def characters(text);           puts text; end
  def comment(text);              puts "<!--#{text}-->"; end
  def doctype(name);              puts "<!DOCTYPE #{name}>"; end
end

Scrapetor::SAX::Parser.new(MyHandler.new).parse(html)
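
As an illustration of the "conversion to other formats" case, here is a minimal sketch of a handler that reduces the event stream to plain text. The callback names and arities come from the usage example above; everything else is an assumption made for illustration: the TextCollector class, its SKIP list, and the empty Hash passed for attrs are hypothetical. The class is written as a plain Ruby object and driven by hand so that the sketch runs without the gem; in real use it would presumably subclass Scrapetor::SAX::Document and be passed to Scrapetor::SAX::Parser.

```ruby
# Hypothetical handler: keeps visible text, drops markup, comments, and the
# contents of <script>/<style>. Not part of Scrapetor itself.
class TextCollector
  # Elements whose character data should not appear in the output (assumed list).
  SKIP = %w[script style].freeze

  attr_reader :chunks

  def initialize
    @chunks = []
    @skip_depth = 0 # > 0 while inside a skipped element
  end

  def start_element(name, attrs)
    @skip_depth += 1 if SKIP.include?(name.to_s.downcase)
  end

  def end_element(name)
    @skip_depth -= 1 if SKIP.include?(name.to_s.downcase) && @skip_depth > 0
  end

  def characters(text)
    return if @skip_depth > 0

    stripped = text.strip
    @chunks << stripped unless stripped.empty?
  end

  # Comments and the doctype carry no visible text, so they are ignored.
  def comment(text); end
  def doctype(name); end

  # Collected text, with chunks joined by single spaces.
  def text
    @chunks.join(" ")
  end
end

# Simulated event stream for:
#   <p> Hello </p><script>var x = 1;</script><p>world</p>
# The shape of attrs is not documented above; an empty Hash is a placeholder.
collector = TextCollector.new
collector.start_element("p", {})
collector.characters(" Hello ")
collector.end_element("p")
collector.start_element("script", {})
collector.characters("var x = 1;")
collector.end_element("script")
collector.start_element("p", {})
collector.characters("world")
collector.end_element("p")

puts collector.text # prints "Hello world"
```

With the gem loaded, the hand-driven calls would be replaced by Scrapetor::SAX::Parser.new(collector).parse(html), assuming the parser only requires that the handler respond to these five callbacks.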

Defined Under Namespace

Classes: Document, Parser, Tokenizer