Class: Pikuri::Extractors::Documents
- Inherits: Object
- Inheritance chain: Object, Pikuri::Extractors::Documents
- Defined in: lib/pikuri/extractors/documents.rb
Overview
Document extractor for the Pikuri::Extractor registry: DOCX / ODT / XLSX / legacy XLS / PPTX / EPUB / RTF / PDF → Markdown, by piping the document bytes through pandoc (ODF, RTF, EPUB, DOCX), markitdown (the OOXML spreadsheet / presentation arms), or pdftotext (PDF), selected per format.
Container first, host CLI second
The preferred converter is a one-shot, locally-built docker container (IMAGE, built from this gem’s docker/ directory): docker run --rm -i --network=none --read-only --cap-drop=ALL, bytes in via stdin, Markdown out via stdout, **no volume mounts**. Two reasons this beats running a host-installed converter directly:
- Security. These documents typically arrive via fetch/web_scrape (untrusted bytes), and complex format parsers are a classic exploitation surface. In the container the parser sees no network and no host filesystem; the worst a malicious document can do is produce garbage Markdown.
- Reproducibility. The Dockerfile pins pandoc (via the base image’s apt) and markitdown (exact pip version); a host install is whatever version the machine happens to have.
When docker is unavailable (binary absent or daemon down), the extractor falls back to host-installed pandoc / markitdown / pdftotext CLIs: same stdin→stdout contract, one code path with two argv builders. Which arm was picked is logged once via Pikuri.logger_for('Extractors').
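The "one code path, two argv builders" idea can be sketched as below. This is only an illustration: docker_argv, host_argv, converter_argv and SKETCH_IMAGE are hypothetical names, the image tag is a placeholder, and the host flags are assumptions about each tool's usual stdin→stdout usage rather than the gem's actual argv.

```ruby
# Hypothetical sketch: two argv builders feeding one stdin -> stdout code path.
SKETCH_IMAGE = 'pikuri-internal-extractors:0.0.0' # placeholder tag, not the real version

# Container arm: no network, read-only root filesystem, no capabilities,
# and no volume mounts. The format tag is the entrypoint's dispatch argument.
def docker_argv(format)
  ['docker', 'run', '--rm', '-i', '--network=none', '--read-only',
   '--cap-drop=ALL', SKETCH_IMAGE, format]
end

# Host arm: the flags here are assumptions about each CLI's stdin/stdout mode.
def host_argv(cli, format)
  case cli
  when :pandoc     then ['pandoc', '-f', format, '-t', 'markdown']
  when :markitdown then ['markitdown', '-x', format]
  when :pdftotext  then ['pdftotext', '-', '-']
  else raise ArgumentError, "unknown converter: #{cli}"
  end
end

# Both arms yield an argv that reads the document on stdin and writes
# Markdown (or text) to stdout, so the caller needs no further branching.
def converter_argv(format, docker_available:, host_cli: :pandoc)
  docker_available ? docker_argv(format) : host_argv(host_cli, format)
end
```

Because both builders return a plain argv with the same I/O contract, whatever spawns the process does not need to know which arm was chosen.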
Registration is explicit
Requiring pikuri-extractors defines this class and the shared DOCUMENTS instance but registers nothing. A host script opts in with Pikuri::Extractors::DOCUMENTS.register, which inserts the instance before the registry’s terminal Passthrough entry (and after core’s HTML, plus pikuri-pdf’s front-inserted PDF when the host registers that too, both of which keep winning their formats). Same opt-in philosophy as c.add_extension: no behavior changes by require alone.
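The ordering that results from inserting before the terminal entry can be illustrated with a plain Array standing in for the registry. The symbols and the helper name register_before_terminal are hypothetical stand-ins, not the gem's objects.

```ruby
# Plain-Array stand-in for the registry; symbols stand in for extractor instances.
def register_before_terminal(registry, extractor)
  # insert(-2, x) places x directly before the last element.
  registry.insert(-2, extractor) unless registry.include?(extractor)
  registry
end

# A front-inserted PDF extractor and core's HTML stay ahead of the new entry.
registry = %i[pdf html passthrough]
register_before_terminal(registry, :documents)
register_before_terminal(registry, :documents) # second call changes nothing
```

With first-match-wins lookup, the entries ahead of the inserted one keep claiming their formats, and the terminal entry only sees what nothing else claimed.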
Format detection
#matches? claims content by normalized content-type (CONTENT_TYPES) or by byte sniff: PDF’s %PDF- prefix, RTF’s {\rtf prefix; for zip-based formats, ODF and EPUB mandate an uncompressed mimetype first entry (so the literal mime string sits inside the leading sample), and OOXML is recognised by the [Content_Types].xml entry plus a word/ / ppt/ / xl/ entry-name scan.
#extract re-sniffs (the registry duck type doesn’t pass content_type to extract). When content was claimed by content-type but the sniff is blind (legacy XLS, an OLE2 container whose discriminating directory sits at the end of the file, past the sample), the bytes go to markitdown with no format hint and its own magic-byte detection takes over. The consequence: a local .xls (no transport content-type, sniff blind) is not claimed at all and keeps today’s binary refusal.
One ordering edge vs pikuri-pdf: this instance sits after core’s HTML in the registry, so a PDF served under a lying text/html header goes to the HTML extractor (pikuri-pdf front-inserts and wins that case). This is accepted: lying-header PDFs under specifically text/html are rare.
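A minimal sketch of the byte sniff described above might look like this. The names sniff_sketch and ZIP_MAGIC_SKETCH are hypothetical, the matching is deliberately simplistic, and it assumes the zip's first entry carries no extra field, so the entry name and its stored content sit back to back.

```ruby
ZIP_MAGIC_SKETCH = "PK\x03\x04".b

# Returns a format tag, or nil when the leading sample is not recognised.
def sniff_sketch(sample)
  bytes = sample.b
  return 'pdf' if bytes.start_with?('%PDF-')
  return 'rtf' if bytes.start_with?('{\rtf')
  return nil unless bytes.start_with?(ZIP_MAGIC_SKETCH)

  # ODF / EPUB: the first entry is named "mimetype" and stored uncompressed,
  # so the literal mime string directly follows the entry name.
  return 'odt'  if bytes.include?('mimetypeapplication/vnd.oasis.opendocument.text')
  return 'epub' if bytes.include?('mimetypeapplication/epub+zip')
  return nil unless bytes.include?('[Content_Types].xml')

  # OOXML: entry names are plain text even when the entry data is compressed.
  return 'docx' if bytes.include?('word/')
  return 'pptx' if bytes.include?('ppt/')
  return 'xlsx' if bytes.include?('xl/')

  nil
end
```

A legacy XLS sample matches none of these branches, which is the "blind sniff" case the text describes.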
PDF: this gem or pikuri-pdf — pick one per wiring
The PDF arm (pdftotext, with #pdf_page_lines restoring the “— Page N —” markers from pdftotext’s \f separators) makes this extractor a complete superset of pikuri-pdf’s formats, so a host that registers DOCUMENTS does NOT also register Extractors::PDF; one extractor per format keeps the registry’s first-match-wins semantics legible. The trade per wiring:
- This gem: PDF parsing happens inside the sandbox (poppler is native code parsing attacker-controlled bytes; the container is exactly the right place for it), and one gem covers every document format. Costs: docker (or host CLIs), no lazy paging (each paged read re-converts the whole PDF), and the generic :document kind (the Read tools say “End of file”, not “End of PDF”, and a scanned PDF reads as “(Empty file)” rather than the scanned-image hint).
- pikuri-pdf: in-process pure Ruby (no infrastructure), lazy extract_lines paging (a windowed read of a 500-page PDF parses only its window), PDF-specific Read-tool wording. Costs: pdf-reader’s dependency subtree, parsing untrusted bytes in-process (pure Ruby, so DoS at worst).
The guide walks this as a progression: chapter 3 wires pikuri-pdf (no docker yet), chapter 7’s assistant supersedes it with this extractor.
Deliberately out of scope
- ODS / ODP: neither pandoc nor markitdown reads them; the only converter that does (LibreOffice headless) costs a 2 GB+ image. Excluded rather than half-supported.
- Image OCR / audio transcription: markitdown’s optional arms need model downloads; the converter image stays networkless and small. A multi-modal main LLM is the pikuri answer to images.
Paging economics
A subprocess converter needs the whole document before it can emit anything, so there is no lazy parse: every Extractor.extract_paged call (each Read page of a long DOCX) re-runs the full conversion. This is accepted; there is no result cache in v1. Both legs of one conversion still stream, though. The source io is handed to Pikuri::Subprocess.run and copied straight into the converter’s stdin (IO.copy_stream, so a big local file never loads into the Ruby heap). The converter’s stdout lands in a Tempfile (also what makes the stdin/stdout pumping deadlock-free; see Pikuri::Subprocess.run), and #extract_lines yields that file’s lines from disk. Neither the document nor the full Markdown String is therefore ever resident during paging.
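The streaming shape described above can be approximated with Ruby's standard library alone. This is a sketch under stated assumptions, not Pikuri::Subprocess.run itself: convert_to_tempfile and converted_lines are hypothetical helpers, and the usage note substitutes cat for a real converter.

```ruby
require 'stringio'
require 'tempfile'

# Streams io into argv's stdin and returns a rewound Tempfile holding its stdout.
def convert_to_tempfile(io, argv)
  out = Tempfile.new('converted')
  reader, writer = IO.pipe
  # stdout goes to a file, so the child never blocks on a full stdout pipe
  # while this process is still busy writing its stdin.
  pid = Process.spawn(*argv, in: reader, out: out.path)
  reader.close
  IO.copy_stream(io, writer) # chunked copy; the document is never one String
  writer.close               # EOF tells the converter the input is complete
  _, status = Process.wait2(pid)
  raise "converter exited with #{status.exitstatus}" unless status.success?

  out.rewind
  out
end

# Yields chomped lines from disk; the conversion runs on first consumption.
def converted_lines(io, argv)
  Enumerator.new do |yielder|
    file = convert_to_tempfile(io, argv)
    begin
      file.each_line(chomp: true) { |line| yielder << line }
    ensure
      file.close! # close and delete the Tempfile when iteration ends
    end
  end
end
```

For instance, converted_lines(StringIO.new("alpha\nbeta\n"), ['cat']).to_a passes the input through cat and reads the lines back from the Tempfile. Sending stdout to a file rather than a second pipe is what lets a single thread write stdin to completion without risking a deadlock.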
Constant Summary
- LOGGER =
  Returns gem-wide diagnostics logger.
  Pikuri.logger_for('Extractors')
- IMAGE =
  Returns converter image tag. Version-tied so a gem upgrade rebuilds with the new pins; pikuri-internal- prefix matches the container-naming convention of the vectordb/memory supervisors.
  "pikuri-internal-extractors:#{Pikuri::VERSION}"
- DOCKER_DIR =
  Returns absolute path to the shipped docker build context (Dockerfile + convert.sh).
  File.('../../../docker', __dir__)
- CONVERT_TIMEOUT =
  Returns coreutils-timeout budget for one conversion. Generous (a huge PPTX through markitdown can take a while) but bounded, so a wedged converter can’t hang the agent loop.
  '300s'
- AUTO =
  Returns sentinel format meaning “let markitdown’s magic-byte detection decide”: the fallback when content was claimed by content-type but the byte sniff is blind.
  'auto'
- PDF =
  Returns the PDF format tag. Singled out as a constant because PDF is the one format whose converter output gets a post-processing pass: pdftotext emits \f between pages, and #extract / #extract_lines turn those into the same “— Page N —” marker lines pikuri-pdf’s extractor emits, so page provenance (vectordb chunk citations, the Read tools’ page references) survives whichever PDF extractor a host wires.
  'pdf'
- CONTENT_TYPES =
  Returns normalized content-type → format tag (the tag doubles as the container entrypoint’s dispatch argument and pandoc’s -f / markitdown’s -x value).
  {
    'application/vnd.oasis.opendocument.text' => 'odt',
    'application/rtf' => 'rtf',
    'text/rtf' => 'rtf',
    'application/epub+zip' => 'epub',
    'application/pdf' => PDF,
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'xlsx',
    'application/vnd.ms-excel' => 'xls',
    'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'pptx'
  }.freeze
- HOST_CONVERTERS =
  Returns format tag → host CLIs that can convert it, in preference order. Mirrors the container entrypoint’s dispatch (docker/convert.sh); keep the two in sync. pandoc leads where both could serve (DOCX, EPUB): its readers preserve more structure.
  {
    'odt' => %i[pandoc],
    'rtf' => %i[pandoc],
    'epub' => %i[pandoc markitdown],
    'docx' => %i[pandoc markitdown],
    'xlsx' => %i[markitdown],
    'xls' => %i[markitdown],
    'pptx' => %i[markitdown],
    PDF => %i[pdftotext],
    AUTO => %i[markitdown]
  }.freeze
- ZIP_MAGIC =
  Returns zip local-file-header magic, shared by every OOXML / ODF / EPUB document.
  "PK\x03\x04".b
- VERSION_PROBE_FLAGS =
  Returns host-CLI name → the flag that makes it print a version and exit 0, where --version (the default probe, see #cli?) doesn’t work: poppler’s pdftotext parses --version as a filename and exits 1, but accepts -v.
  { 'pdftotext' => '-v' }.freeze
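How a version probe with per-CLI flag overrides could work is sketched below. The names cli_sketch? and PROBE_FLAGS_SKETCH are hypothetical; the real #cli? is not shown on this page, so this is only one plausible shape.

```ruby
PROBE_FLAGS_SKETCH = { 'pdftotext' => '-v' }.freeze

# True when the CLI can be executed and its version probe exits 0.
def cli_sketch?(name)
  flag = PROBE_FLAGS_SKETCH.fetch(name, '--version')
  # Kernel#system returns nil when the binary cannot be executed, false on a
  # non-zero exit, and true on exit 0; the probe's output is discarded.
  system(name, flag, out: File::NULL, err: File::NULL) == true
end
```

Comparing against true folds "binary missing" and "probe failed" into the same negative answer, which is all a fallback decision needs.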
Instance Method Summary
- #ensure_image! ⇒ void
  Build the converter image now if it isn’t present, for host scripts that prefer paying the one-time build (pip install + apt, minutes) at boot rather than mid-conversation.
- #extract(io) ⇒ String
  Convert the whole document behind io to one Markdown String.
- #extract_lines(io) ⇒ Enumerator<String>
  Same content as #extract, as a stream of chomped lines read off the converter’s stdout Tempfile. The whole-document conversion still runs up front (subprocess converters can’t parse lazily), but neither the document nor the Markdown ever materialises as one String: the conversion fires on first consumption, streaming io into the converter.
- #kind ⇒ Symbol
  Kind tag carried on Extractor::Page#kind.
- #matches?(sample:, content_type:) ⇒ Boolean
  Claim content this extractor can convert: a recognised content-type, or a positive byte sniff (see “Format detection” in the class docs).
- #register ⇒ Documents
  Plug this extractor into Pikuri::Extractor.registry, before the terminal Passthrough entry.
Instance Method Details
#ensure_image! ⇒ void
This method returns an undefined value.
Build the converter image now if it isn’t present — for host scripts that prefer paying the one-time build (pip install + apt, minutes) at boot rather than mid-conversation. Entirely optional: #extract builds lazily on first use otherwise.
    # File 'lib/pikuri/extractors/documents.rb', line 279

    def ensure_image!
      raise Pikuri::Extractor::Error, '`docker` is unavailable; cannot build the converter image' unless docker?

      image_ready!
      nil
    end
#extract(io) ⇒ String
Convert the whole document behind io to one Markdown String. PDFs come back as one “— Page N —”-headed block per text-carrying page (see PDF); a fully scanned PDF extracts to the empty String — same contract as pikuri-pdf’s extractor.
    # File 'lib/pikuri/extractors/documents.rb', line 228

    def extract(io)
      with_converted(io) do |file, format|
        format == PDF ? pdf_page_lines(file).to_a.join("\n") : file.read
      end
    end
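The PDF post-processing pass (form-feed separators turned into page marker lines, with text-free pages skipped) might be sketched as follows. The helper page_lines_sketch is hypothetical and works on an in-memory String for brevity, and the default marker_format is an ASCII stand-in, since the exact marker bytes are an assumption here.

```ruby
# marker_format is a hypothetical ASCII stand-in for the real marker text.
def page_lines_sketch(text, marker_format: '--- Page %d ---')
  Enumerator.new do |yielder|
    text.split("\f").each_with_index do |page, index|
      next if page.strip.empty? # scanned or blank page: no marker, no lines

      yielder << format(marker_format, index + 1) # page numbers are 1-based
      page.each_line(chomp: true) { |line| yielder << line }
    end
  end
end
```

Skipped pages still advance the page number, so a marker always names the page's real position in the PDF, and a document with no text at all joins to the empty String.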
#extract_lines(io) ⇒ Enumerator<String>
Same content as #extract, as a stream of chomped lines read off the converter’s stdout Tempfile — the whole-document conversion still runs up front (subprocess converters can’t parse lazily), but neither the document nor the Markdown ever materialises as one String: the conversion fires on first consumption, streaming io into the converter. The enumerator owns the Tempfile and deletes it when iteration ends.
    # File 'lib/pikuri/extractors/documents.rb', line 248

    def extract_lines(io)
      Enumerator.new do |yielder|
        with_converted(io) do |file, format|
          if format == PDF
            pdf_page_lines(file).each { |line| yielder << line }
          else
            file.each_line { |line| yielder << line.chomp }
          end
        end
      end
    end
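The "conversion fires on first consumption" behaviour follows from how Enumerator.new defers its block. A small stand-alone illustration, where a counter stands in for the expensive conversion and all names are hypothetical:

```ruby
conversions = 0

lines = Enumerator.new do |yielder|
  conversions += 1 # stands in for the expensive whole-document conversion
  %w[alpha beta gamma].each { |line| yielder << line }
end

before = conversions # building the enumerator has not run the block
first  = lines.first # runs the block up to the first yielded line
after  = conversions
```

Each fresh iteration runs the block again, which matches the paging economics described in the class overview: every paged read pays for a full conversion.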
#kind ⇒ Symbol
Returns kind tag carried on Extractor::Page#kind.
    # File 'lib/pikuri/extractors/documents.rb', line 203

    def kind
      :document
    end
#matches?(sample:, content_type:) ⇒ Boolean
Claim content this extractor can convert: a recognised content-type, or a positive byte sniff (see “Format detection” in the class docs).
    # File 'lib/pikuri/extractors/documents.rb', line 215

    def matches?(sample:, content_type:)
      CONTENT_TYPES.key?(content_type) || !sniff(sample).nil?
    end
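Since #matches? looks content_type up directly, the value presumably arrives already normalized. What that normalization involves is not shown on this page; the sketch below assumes it means dropping header parameters, trimming whitespace, and lowercasing. Both normalize_content_type and CONTENT_TYPES_SKETCH are hypothetical.

```ruby
# Abbreviated stand-in for the real CONTENT_TYPES table.
CONTENT_TYPES_SKETCH = {
  'application/pdf' => 'pdf',
  'application/rtf' => 'rtf',
  'text/rtf' => 'rtf'
}.freeze

# Hypothetical normalizer: drop parameters, trim whitespace, lowercase.
def normalize_content_type(header)
  return nil if header.nil?

  header.split(';', 2).first.to_s.strip.downcase
end

# A missing header normalizes to nil, which no table key matches.
def claimed_by_content_type?(header)
  CONTENT_TYPES_SKETCH.key?(normalize_content_type(header))
end
```

Under this assumption a header such as 'Application/PDF; charset=binary' would be claimed, while a local file with no transport content-type falls through to the byte sniff.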
#register ⇒ Documents
Plug this extractor into Pikuri::Extractor.registry, before the terminal Passthrough entry. Idempotent — a second call is a no-op.
    # File 'lib/pikuri/extractors/documents.rb', line 265

    def register
      registry = Pikuri::Extractor.registry
      registry.insert(-2, self) unless registry.include?(self)
      self
    end