Module: Pikuri::Tool::Scraper::PDF
Defined in: lib/pikuri/tool/scraper/pdf.rb
Overview
PDF → text extractor used by Simple.visit when the fetched response carries application/pdf. Wraps the pdf-reader gem: walk every page, concatenate the extracted text, hand the result back as a single string the LLM can read.
Best-effort by design. pdf-reader produces clean text from PDFs generated from a digital source (LaTeX, Word export, …) but returns nothing useful from scanned documents — there is no OCR in this path. When extraction yields no text we still return an empty string rather than raising, so the caller’s cache stores a consistent result and the LLM sees an empty observation it can react to.
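The empty-string behaviour can be illustrated without a real PDF. The sketch below assumes the join step works the way the method source on this page shows (strip each page, drop empty ones, join with a blank line). `FakePage` and `join_pages` are hypothetical stand-ins invented for this example; they are not part of Pikuri or pdf-reader.

```ruby
# Hypothetical stand-in for a pdf-reader page: only #text is needed here.
FakePage = Struct.new(:text)

# Illustrative helper mirroring the join step of PDF.extract:
# strip each page, drop pages with no text, separate pages by a blank line.
def join_pages(pages)
  pages.map { |p| p.text.strip }.reject(&:empty?).join("\n\n")
end

# A digitally generated document: pages carry text, one page is blank.
digital = [FakePage.new("  Title\n"), FakePage.new(""), FakePage.new("Body text ")]

# A scanned document: every page yields nothing but whitespace.
scanned = [FakePage.new(""), FakePage.new("   \n")]

digital_text = join_pages(digital) # => "Title\n\nBody text"
scanned_text = join_pages(scanned) # => ""
```

Because the scanned case produces `""` instead of an exception, a caller can cache and forward the result exactly as it would any other extraction.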
Pure parser — no I/O. PDF.extract takes PDF bytes and returns text, so tests can drive it against an in-memory fixture without touching the network.
Class Method Summary
- .extract(bytes) ⇒ String
Render bytes as plain text, one page per paragraph.
Class Method Details
.extract(bytes) ⇒ String
Render bytes as plain text, one page per paragraph.
pdf-reader raises a handful of typed exceptions for documents it cannot parse: broken xrefs (PDF::Reader::MalformedPDFError), invalid page references (PDF::Reader::InvalidPageError), and encrypted/XFA files (PDF::Reader::UnsupportedFeatureError). All three describe a property of the PDF that the LLM can react to (“try a different URL”), so we re-raise them as FetchError, following the same convention as the HTTP layer in Simple.fetch. Any other exception, such as a genuine bug in pdf-reader itself, is not rescued: it surfaces under its own class and fails loudly.
    # File 'lib/pikuri/tool/scraper/pdf.rb', line 43

    def self.extract(bytes)
      reader = ::PDF::Reader.new(StringIO.new(bytes))
      # Strip each page, drop pages with no text, separate pages by a blank line.
      reader.pages.map { |p| p.text.strip }.reject(&:empty?).join("\n\n")
    rescue ::PDF::Reader::MalformedPDFError,
           ::PDF::Reader::InvalidPageError,
           ::PDF::Reader::UnsupportedFeatureError => e
      # Translate document-level failures into FetchError; anything else propagates.
      raise FetchError, "PDF rendering failed: #{e.class.name.split('::').last}: #{e.message}"
    end
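The error translation described above can be sketched in isolation to show the shape of the resulting message and what is left unrescued. Everything in the `Demo` module is a hypothetical stand-in: `Demo::Reader::MalformedPDFError` plays the role of the pdf-reader exception, `Demo::FetchError` the role of Pikuri's FetchError, and the error text is invented for illustration.

```ruby
module Demo
  # Stand-in for the FetchError used by the scraper (assumed to be a StandardError).
  class FetchError < StandardError; end

  module Reader
    # Stand-in for PDF::Reader::MalformedPDFError.
    class MalformedPDFError < StandardError; end
  end

  # Runs the block and translates the "typed" parse failure into FetchError,
  # keeping only the last segment of the exception's class name in the message.
  def self.translate
    yield
  rescue Reader::MalformedPDFError => e
    raise FetchError, "PDF rendering failed: #{e.class.name.split('::').last}: #{e.message}"
  end
end

# A document-level failure becomes a FetchError with a compact message.
translated =
  begin
    Demo.translate { raise Demo::Reader::MalformedPDFError, 'xref table not found' }
  rescue Demo::FetchError => e
    e.message
  end

# An unrelated error is not rescued and keeps its own class.
untouched =
  begin
    Demo.translate { raise ArgumentError, 'unexpected bug' }
  rescue StandardError => e
    e.class
  end
```

Stripping the namespace keeps the message short while still naming the failure kind, and the narrow rescue list leaves unexpected errors visible under their original class.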