pikuri-pdf
PDF text extraction for the pikuri AI-assistant toolkit: in-process, pure Ruby, and lazy — paged reads parse only the pages the window needs, so showing the first page of a 500-page PDF never pays for the other 499.
Provides:
Pikuri::Extractors::PDF: an extractor for the Pikuri::Extractor registry, wrapping the pure-Ruby pdf-reader gem. Once registered, every pikuri surface that routes through the registry picks PDFs up for free: the read tool pages through a local .pdf with --- Page N --- markers, web_scrape extracts a downloaded paper, the pikuri-vectordb indexer ingests a PDF corpus.
Why a separate gem
pikuri-core's pitch is a dependency tree you can audit in an evening. pdf-reader brings five transitive gems (Ascii85, afm, hashery, ruby-rc4, ttfunk) that serve nothing else in core — the largest single bite in that tree, for one file format. So PDF support is an opt-in sibling instead: install it when your agent needs PDFs, skip it (and its whole subtree) when it doesn't.
Everything is pure Ruby, so the worst a malicious PDF can do to the parser is burn CPU and memory — there's no native code to corrupt.
This gem or pikuri-extractors — pick one
per wiring. pikuri-extractors' converter container also has a PDF
arm (poppler's pdftotext, sandboxed, same --- Page N ---
markers): on an agent that fetches untrusted documents from the
web, parsing them in the networkless container is the stronger
posture. This gem is the no-infrastructure wiring — in-process
means no docker and no host CLIs, and it's what makes the lazy
page-windowed reads possible (a subprocess converter must convert
the whole document before emitting anything, and re-converts it on
every paged read). The guide wires this gem in chapter 3 and
supersedes it with pikuri-extractors in chapter 7's assistant.
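The lazy page-windowed read described above can be sketched roughly as follows. This is an illustration, not the gem's actual code: read_page_window and FakeDoc are hypothetical names, and the blank line between pages is an assumption. The only thing the sketch relies on is a document object that answers page_count and page(n) (1-based) and whose pages answer text, which is the shape pdf-reader's PDF::Reader exposes.

```ruby
# Hypothetical helper, not pikuri's real API. `doc` is anything that
# responds to #page_count and #page(n) (1-based), where each page
# responds to #text. Only the pages inside the window are requested.
def read_page_window(doc, first:, count:)
  last = [first + count - 1, doc.page_count].min
  return '' if first < 1 || first > last

  (first..last).map do |n|
    "--- Page #{n} ---\n#{doc.page(n).text}"
  end.join("\n\n") # separator between pages is an assumption
end

# Stand-in document that records which pages were actually asked for,
# so the laziness is observable without a real PDF.
class FakeDoc
  Page = Struct.new(:text)
  attr_reader :parsed

  def initialize(texts)
    @texts = texts
    @parsed = []
  end

  def page_count
    @texts.size
  end

  def page(number)
    @parsed << number
    Page.new(@texts.fetch(number - 1))
  end
end

doc = FakeDoc.new(Array.new(500) { |i| "text of page #{i + 1}" })
puts read_page_window(doc, first: 1, count: 1)
puts "pages parsed: #{doc.parsed.inspect}" # prints "pages parsed: [1]"
```

With a real file, PDF::Reader.new('paper.pdf') could stand in for FakeDoc. A subprocess converter cannot offer the same window, because it has to emit the whole document before the caller can slice it.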
Install
# Gemfile
gem 'pikuri-pdf'
Usage
Requiring the gem changes nothing — registration is an explicit
opt-in your script makes, same philosophy as c.add_extension:
require 'pathname'
require 'pikuri-core'
require 'pikuri-pdf'
Pikuri::Extractors::PDF.register
# From here on, the registry handles PDFs everywhere:
text = Pikuri::FileType.read_as_text(Pathname.new('paper.pdf'))
register inserts the extractor at the front of the registry: the
%PDF- magic-byte sniff is the strongest signal there — it never
misfires on text, and it must win over the HTML extractor's
content-type match so a PDF served under a lying Content-Type
header still extracts.
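Why the front position matters can be shown with a toy registry. Everything here is hypothetical (the Extractor struct, the pick helper, the matcher lambdas) and is not pikuri's real registry; it only illustrates first-match-wins ordering with a %PDF- sniff placed ahead of a content-type match.

```ruby
# Toy registry for illustration only; pikuri's actual Pikuri::Extractor
# registry may be shaped differently.
Extractor = Struct.new(:name, :matcher) do
  def match?(bytes, content_type)
    matcher.call(bytes, content_type)
  end
end

# Matches on the declared Content-Type, whatever the bytes say.
html = Extractor.new(:html, ->(_bytes, content_type) { content_type.to_s.start_with?('text/html') })
# Matches on the leading magic bytes, whatever the header says.
pdf = Extractor.new(:pdf, ->(bytes, _content_type) { bytes.b.start_with?('%PDF-') })

registry = [html]
registry.unshift(pdf) # "register at the front" in this toy model

# First extractor that matches wins.
def pick(registry, bytes, content_type)
  registry.find { |e| e.match?(bytes, content_type) }&.name
end

puts pick(registry, "%PDF-1.7\n...", 'text/html') # prints "pdf"
puts pick(registry, '<html></html>', 'text/html') # prints "html"
```

Had the PDF extractor been appended instead, the first call would have returned :html, and the mislabelled PDF would have gone to the wrong extractor.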
Limits
Best-effort by design: pdf-reader produces clean text from PDFs
generated from a digital source (LaTeX, Word export, ...) but
nothing useful from scanned documents: those extract to the empty
string, which the read tool reports to the model as a scanned-image
hint. No OCR. Encrypted and XFA-form PDFs surface as
Error: ... observations the model can react to.
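The behaviour above, where empty text becomes a hint and failures become Error: ... observations, could look roughly like this. The wrapper name and the hint wording are invented for illustration, not taken from pikuri. A real wiring would probably rescue pdf-reader's own error classes (for example PDF::Reader::EncryptedPDFError or PDF::Reader::MalformedPDFError) rather than StandardError.

```ruby
# Hypothetical wrapper: runs the extraction block and turns its outcome
# into a string the model can read. Hint wording is made up.
def pdf_observation
  text = yield
  return 'Note: no text layer found; this PDF may be a scanned image.' if text.strip.empty?

  text
rescue StandardError => e
  # Assumption: a real implementation would rescue narrower error classes.
  "Error: #{e.message}"
end

puts pdf_observation { 'Abstract. We study...' }
puts pdf_observation { '' }
puts pdf_observation { raise 'document is encrypted' } # prints "Error: document is encrypted"
```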