pikuri-extractors
Additional document extractors for the pikuri AI-assistant toolkit: office documents and PDFs → Markdown, converted preferably inside a one-shot networkless docker container, so a malicious document downloaded from the web is parsed somewhere it can't phone home or read your files.
Provides:
Pikuri::Extractors::DOCUMENTS: an extractor for the Pikuri::Extractor registry covering DOCX, ODT, XLSX, legacy XLS, PPTX, EPUB, RTF, and PDF. Once registered, every pikuri surface that routes through the registry picks the formats up for free: the read tool pages through a local .docx, web_scrape / fetch convert a downloaded .odt, the pikuri-vectordb indexer ingests an .epub or a paper PDF (with --- Page N --- page markers preserved for citations).
The actual conversion is done by pandoc (ODT, RTF, EPUB, DOCX; its
readers preserve the most structure), markitdown (the XLSX / XLS /
PPTX arms), and poppler's pdftotext (the PDF arm; its \f page
separators are rebuilt into --- Page N --- markers on the Ruby
side), dispatched per format. One stdin→stdout contract, two ways
to run it:
- Container (preferred). A small, locally-built docker image
  (pinned pandoc + pinned markitdown, built from docker/Dockerfile
  on first use; read it, it's short) run as
  docker run --rm -i --network=none --read-only --cap-drop=ALL,
  bytes in via stdin, Markdown out via stdout, no volume mounts.
  Office-format parsers are large, complex codebases and the
  documents they parse are exactly the bytes an attacker controls;
  in the container, the blast radius of a parser exploit is one
  throwaway process that can see neither your network nor your
  filesystem.
- Host CLI (fallback). When docker is absent or the daemon is
  down, a host-installed pandoc / markitdown / pdftotext is used
  directly: convenient, but unpinned and unsandboxed.
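The per-format dispatch and the container-or-host choice can be sketched roughly as below. This is an illustration under assumptions, not the gem's code: the names CONVERTER_FOR, docker_usable? and runner_prefix are hypothetical, the image tag is a placeholder, and the image's entrypoint and each converter's own arguments are not described in this README, so they are left out. Only the docker flags and the format-to-tool mapping come from the text above.

```ruby
# frozen_string_literal: true

require 'open3'

# Sandbox flags as quoted in this README.
DOCKER_FLAGS = %w[run --rm -i --network=none --read-only --cap-drop=ALL].freeze

# Format => converter, following the split described above.
CONVERTER_FOR = {
  docx: 'pandoc', odt: 'pandoc', epub: 'pandoc', rtf: 'pandoc',
  xlsx: 'markitdown', xls: 'markitdown', pptx: 'markitdown',
  pdf: 'pdftotext'
}.freeze

# Hypothetical probe: true only when the docker CLI exists and the
# daemon answers. `docker info` exits non-zero when the daemon is down.
def docker_usable?
  _output, status = Open3.capture2e('docker', 'info')
  status.success?
rescue Errno::ENOENT
  false
end

# Start of the command line for one conversion. Converter-specific
# arguments are deliberately omitted; the image tag is a placeholder.
def runner_prefix(format, image: 'pikuri-extractors:example', docker: docker_usable?)
  tool = CONVERTER_FOR.fetch(format) # KeyError for uncovered formats such as :ods
  docker ? ['docker', *DOCKER_FLAGS, image] : [tool]
end
```

Passing docker: explicitly, as in runner_prefix(:pdf, docker: false), skips the probe, which keeps the choice easy to exercise on a machine without docker.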
Deliberately not covered: ODS / ODP (neither converter reads them; the only one that does is LibreOffice, a 2 GB+ image), and image OCR / audio transcription (need model downloads — point a multi-modal main LLM at images instead).
PDF: this gem or pikuri-pdf — pick one per wiring. This gem parses PDFs inside the sandbox (poppler is native code chewing attacker-controlled bytes — exactly what the container is for) but re-converts the whole document on every paged read; pikuri-pdf is in-process pure Ruby with lazy page-windowed reads and no infrastructure to set up. The guide wires pikuri-pdf in chapter 3 and supersedes it with this gem in chapter 7's assistant.
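For the PDF arm, rebuilding page markers from pdftotext's form feeds might look like the sketch below. The helper name add_page_markers is hypothetical, and two details are assumptions rather than documented behavior of this gem: that every page, including the last, ends with a \f, and that each marker sits on its own line ahead of the page text.

```ruby
# frozen_string_literal: true

# Hypothetical helper: split pdftotext output on form feeds and
# prefix each page with a "--- Page N ---" marker.
def add_page_markers(raw)
  pages = raw.split("\f", -1)
  # Assumption: a trailing \f closes the last page, leaving one empty tail.
  pages.pop if pages.last&.empty?
  pages.each_with_index
       .map { |page, index| "--- Page #{index + 1} ---\n#{page.strip}" }
       .join("\n\n")
end
```

Because the markers are derived from the full output, this step needs the whole converted document, which fits the full re-conversion on every paged read described above.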
Install
# Gemfile
gem 'pikuri-extractors'
Plus one of: a working docker (recommended; the image builds
itself on first use, network is only needed for that build), or
host pandoc / markitdown / pdftotext CLIs.
Usage
Requiring the gem changes nothing — registration is an explicit
opt-in your script makes, same philosophy as c.add_extension:
require 'pikuri-core'
require 'pikuri-extractors'
Pikuri::Extractors::DOCUMENTS.register
# From here on, the registry handles the new formats everywhere:
text = Pikuri::FileType.read_as_text(Pathname.new('report.docx'))
# Optional: pay the one-time image build (~minutes) at boot instead
# of mid-conversation. Requires docker; skip when relying on host CLIs.
Pikuri::Extractors::DOCUMENTS.ensure_image!
Performance posture
A subprocess converter has no lazy mode: every paged read of a
long document re-runs the full conversion (a cold docker run plus
the converter itself — roughly a second for pandoc, a few for
markitdown). That's accepted v1 behavior — no result cache — and it
is well inside what an LLM tool call tolerates.
Format detection
Content-type when the transport provides one (HTTP header), byte
sniff otherwise: RTF by its {\rtf prefix; ODT / EPUB by the
uncompressed mimetype zip entry their specs mandate first; DOCX /
XLSX / PPTX by [Content_Types].xml plus an entry-name scan.
Legacy .xls is recognised by content-type only (its OLE2 container
shares its leading bytes with other OLE2 formats such as legacy .doc
and .ppt, so a byte sniff can't single it out), so a local .xls
file, which carries no content-type, keeps pikuri-core's binary
refusal.
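The RTF and ODT / EPUB sniffs can be illustrated with a few lines of plain Ruby. This is a minimal sketch, not the gem's detector: sniff_format and leading_mimetype are hypothetical names, the OOXML branch ([Content_Types].xml plus the entry-name scan) is omitted because it needs a walk over the zip entries, and the code assumes the first zip entry records its size in the local header rather than in a trailing data descriptor.

```ruby
# frozen_string_literal: true

RTF_PREFIX = '{\rtf'.b
ZIP_LOCAL_HEADER = "PK\x03\x04".b

# mimetype payloads defined by the ODF and EPUB specs.
MIMETYPE_FORMATS = {
  'application/vnd.oasis.opendocument.text' => :odt,
  'application/epub+zip' => :epub
}.freeze

# Payload of the first zip entry when it is an uncompressed file
# named "mimetype", else nil. Offsets follow the zip local file header.
def leading_mimetype(bytes)
  return nil unless bytes.bytesize >= 30 && bytes.start_with?(ZIP_LOCAL_HEADER)

  method = bytes.byteslice(8, 2).unpack1('v')  # 0 means stored (uncompressed)
  size = bytes.byteslice(18, 4).unpack1('V')   # equals the raw size when stored
  name_len, extra_len = bytes.byteslice(26, 4).unpack('vv')
  return nil unless method.zero? && bytes.byteslice(30, name_len) == 'mimetype'

  bytes.byteslice(30 + name_len + extra_len, size)
end

# Returns :rtf, :odt, :epub, or nil when none of these sniffs match.
def sniff_format(bytes)
  bytes = bytes.b # compare raw bytes regardless of the caller's encoding tag
  return :rtf if bytes.start_with?(RTF_PREFIX)

  mimetype = leading_mimetype(bytes)
  mimetype && MIMETYPE_FORMATS[mimetype.force_encoding(Encoding::UTF_8)]
end
```

Only the leading bytes are inspected, so a caller could run this on a short prefix of a download before deciding whether to fetch or convert the rest.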