Class: Pikuri::Extractors::Documents

Inherits:
Object
  • Object
Defined in:
lib/pikuri/extractors/documents.rb

Overview

Document extractor for the Pikuri::Extractor registry: DOCX / ODT / XLSX / legacy XLS / PPTX / EPUB / RTF / PDF → Markdown, by piping the document bytes through pandoc (ODT, RTF, EPUB, DOCX), markitdown (XLSX, legacy XLS, PPTX), or pdftotext (PDF), selected per format.

Container first, host CLI second

The preferred converter is a one-shot, locally-built docker container (IMAGE, built from this gem’s docker/ directory): docker run --rm -i --network=none --read-only --cap-drop=ALL, bytes in via stdin, Markdown out via stdout, no volume mounts. Two reasons this beats running a host-installed converter directly:

  • Security. These documents typically arrive via fetch / web_scrape — untrusted bytes — and complex format parsers are a classic exploitation surface. In the container the parser sees no network and no host filesystem; the worst a malicious document can do is produce garbage Markdown.

  • Reproducibility. The Dockerfile pins pandoc (via the base image’s apt) and markitdown (exact pip version); a host install is whatever version the machine happens to have.

When docker is unavailable (binary absent or daemon down), the extractor falls back to host-installed pandoc / markitdown / pdftotext CLIs: same stdin→stdout contract, one code path with two argv builders. Which arm was picked is logged once via Pikuri.logger_for('Extractors').
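The two argv builders could look roughly like the sketch below. Only the docker flags and the -f / -x options are taken from this page. The helper names docker_argv and host_argv, pandoc's `-t markdown` target, and pdftotext's `- -` (stdin to stdout) are assumptions made for illustration, not the gem's actual code.

```ruby
# Hypothetical helpers; only the docker flags are copied from the docs above.
DOCKER_FLAGS = %w[--rm -i --network=none --read-only --cap-drop=ALL].freeze

# Container arm: the format tag is handed to the image's entrypoint.
def docker_argv(image, format)
  ['docker', 'run', *DOCKER_FLAGS, image, format]
end

# Host arm: same stdin -> stdout contract, one CLI per format.
# The exact flags are assumptions about each tool's command line.
def host_argv(cli, format)
  case cli
  when :pandoc     then ['pandoc', '-f', format, '-t', 'markdown']
  when :markitdown then format == 'auto' ? ['markitdown'] : ['markitdown', '-x', format]
  when :pdftotext  then ['pdftotext', '-', '-']
  else raise ArgumentError, "unknown converter: #{cli}"
  end
end
```

Either builder returns a plain Array, so the caller can run both arms through the same subprocess code path.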

Registration is explicit

Requiring pikuri-extractors defines this class and the shared DOCUMENTS instance but registers nothing. A host script opts in with Pikuri::Extractors::DOCUMENTS.register, which inserts the instance before the registry’s terminal Passthrough entry (and therefore after core’s HTML and, when the host registers it too, pikuri-pdf’s front-inserted PDF, both of which keep winning their formats). Same opt-in philosophy as c.add_extension: no behavior changes by require alone.
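The resulting order can be illustrated with a plain Array standing in for the registry. This is a sketch only: the symbols and the add_before_last helper are hypothetical, and the real registry holds extractor objects.

```ruby
# Array#insert with index -2 places the new entry just before the last one,
# which is where the terminal passthrough entry sits in this stand-in.
def add_before_last(registry, extractor)
  registry.insert(-2, extractor) unless registry.include?(extractor)
  registry
end

registry = %i[html passthrough]
add_before_last(registry, :documents)
add_before_last(registry, :documents) # second call changes nothing
registry.unshift(:pdf)                # a front-inserted entry still comes first
```

With first-match-wins lookup, the entries ahead of :documents keep claiming their own formats, and :passthrough only sees what nobody else claimed.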

Format detection

#matches? claims content by normalized content-type (CONTENT_TYPES) or by byte sniff: PDF’s %PDF- prefix, RTF’s {\rtf prefix; for zip-based formats, ODF and EPUB mandate an uncompressed mimetype first entry (so the literal mime string sits inside the leading sample), and OOXML is recognised by the [Content_Types].xml entry plus a word/ / ppt/ / xl/ entry-name scan. #extract re-sniffs, because the registry duck type doesn’t pass content_type to extract. When content was claimed by content-type but the sniff is blind (legacy XLS is an OLE2 container whose discriminating directory sits at the end of the file, past the sample), the bytes go to markitdown with no format hint and its own magic-byte detection takes over. The consequence: a local .xls (no transport content-type, sniff blind) is not claimed at all and keeps today’s binary refusal. One ordering edge vs pikuri-pdf: this instance sits after core’s HTML in the registry, so a PDF served under a lying text/html header goes to the HTML extractor (pikuri-pdf front-inserts and wins that case). This is accepted, since lying-header PDFs under specifically text/html are rare.
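A byte sniff along these lines can be sketched as follows. The function name sniff_format, the two lookup tables, and the plain substring scans are assumptions for illustration; the gem's own sniff may inspect the zip headers more carefully.

```ruby
ZIP_MAGIC = "PK\x03\x04".b

# ODF and EPUB store an uncompressed "mimetype" entry first, so the entry
# name is followed directly by the mime string in the leading bytes.
ODF_EPUB_MIMES = {
  'application/vnd.oasis.opendocument.text' => 'odt',
  'application/epub+zip' => 'epub'
}.freeze

# OOXML: top-level directory name -> format tag.
OOXML_DIRS = { 'word/' => 'docx', 'ppt/' => 'pptx', 'xl/' => 'xlsx' }.freeze

# Returns a format tag, or nil when the sample is not recognised.
def sniff_format(sample)
  bytes = sample.b
  return 'pdf' if bytes.start_with?('%PDF-')
  return 'rtf' if bytes.start_with?('{\rtf')
  return nil unless bytes.start_with?(ZIP_MAGIC)

  ODF_EPUB_MIMES.each { |mime, tag| return tag if bytes.include?("mimetype#{mime}") }
  return nil unless bytes.include?('[Content_Types].xml')

  OOXML_DIRS.each { |dir, tag| return tag if bytes.include?(dir) }
  nil
end
```

A legacy XLS sample matches none of these branches, which is the "sniff is blind" case described above.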

PDF: this gem or pikuri-pdf — pick one per wiring

The PDF arm (pdftotext, with #pdf_page_lines restoring the “— Page N —” markers from pdftotext’s \f separators) makes this extractor a complete superset of pikuri-pdf’s formats, so a host that registers DOCUMENTS does NOT also register Extractors::PDF: one extractor per format keeps the registry’s first-match-wins semantics legible. The trade per wiring:

  • This gem. PDF parsing happens inside the sandbox (poppler is native code parsing attacker-controlled bytes; the container is exactly the right place for it), and one gem covers every document format. Costs: docker (or host CLIs), no lazy paging (each paged read re-converts the whole PDF), and the generic :document kind (the Read tools say “End of file”, not “End of PDF”, and a scanned PDF reads as “(Empty file)” rather than the scanned-image hint).

  • pikuri-pdf. In-process pure Ruby (no infrastructure), lazy extract_lines paging (a windowed read of a 500-page PDF parses only its window), PDF-specific Read-tool wording. Costs: pdf-reader’s dependency subtree, parsing untrusted bytes in-process (pure Ruby, so DoS at worst).

The guide walks this as a progression: chapter 3 wires pikuri-pdf (no docker yet), chapter 7’s assistant supersedes it with this extractor.
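The form-feed post-processing mentioned for the PDF arm could work roughly as in this sketch. It is not the gem's #pdf_page_lines: the name page_lines, the PAGE_MARKER constant (the marker text quoted above, written with escapes), and the blank-line handling are assumptions.

```ruby
require 'stringio'

# Marker text as quoted in the docs above; %d receives the page number.
PAGE_MARKER = "\u2014 Page %d \u2014"

# Turns text with \f page separators into marker-headed lines.
# Pages without any text get no marker at all.
def page_lines(io)
  Enumerator.new do |out|
    page = 1
    headed = false
    io.each_line do |raw|
      line = raw.chomp
      segments = line.empty? ? [''] : line.split("\f", -1)
      segments.each_with_index do |segment, index|
        if index.positive? # crossed a form feed: next page starts here
          page += 1
          headed = false
        end
        if segment.strip.empty?
          # Keep genuine blank lines inside a page, drop separator residue.
          out << '' if headed && segments.size == 1
          next
        end
        unless headed
          out << format(PAGE_MARKER, page)
          headed = true
        end
        out << segment
      end
    end
  end
end
```

Because empty pages emit nothing, input consisting only of form feeds yields no lines, which matches the "fully scanned PDF extracts to the empty String" contract once the lines are joined.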

Deliberately out of scope

  • ODS / ODP. Neither pandoc nor markitdown reads them; the only converter that does (LibreOffice headless) costs a 2 GB+ image. Excluded rather than half-supported.

  • Image OCR / audio transcription. markitdown’s optional arms need model downloads; the converter image stays networkless and small. A multi-modal main LLM is the pikuri answer to images.

Paging economics

A subprocess converter needs the whole document before it can emit anything, so there is no lazy parse: every Extractor.extract_paged call (each Read page of a long DOCX) re-runs the full conversion. This is accepted; there is no result cache in v1. Both legs of one conversion still stream, though. The source io is handed to Pikuri::Subprocess.run and copied straight into the converter’s stdin (via IO.copy_stream, so a big local file never loads into the Ruby heap). The converter’s stdout lands in a Tempfile, which is also what makes the stdin/stdout pumping deadlock-free (see Pikuri::Subprocess.run), and #extract_lines yields its lines from disk. Neither the document nor the full Markdown String is ever resident during paging.
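The streaming shape described here can be sketched with Ruby's standard library alone. The run_filter helper below is a hypothetical stand-in, not Pikuri::Subprocess.run, and it leaves out the timeout handling a real runner would need.

```ruby
require 'tempfile'
require 'stringio'
require 'rbconfig'

# Streams `source` into the command's stdin and returns a Tempfile holding
# its stdout. Stdout goes to disk, so the parent never has to read a pipe
# while it is still writing: that is what avoids the classic deadlock.
def run_filter(argv, source)
  out = Tempfile.new('converted')
  reader, writer = IO.pipe
  pid = Process.spawn(*argv, in: reader, out: [out.path, 'w'])
  reader.close
  begin
    IO.copy_stream(source, writer) # chunked copy, no full read into memory
  rescue Errno::EPIPE
    # The child exited early; its exit status is checked below.
  ensure
    writer.close
  end
  _, status = Process.wait2(pid)
  unless status.success?
    out.close!
    raise "converter exited with status #{status.exitstatus}"
  end
  out.rewind
  out
end

# Usage with a trivial "converter" that upcases its input.
upcase = [RbConfig.ruby, '-e', '$stdout.write($stdin.read.upcase)']
file = run_filter(upcase, StringIO.new("alpha\nbeta\n"))
converted = file.each_line(chomp: true).to_a
file.close! # the caller owns the Tempfile and deletes it
```

Reading the result line by line from the Tempfile keeps memory use flat regardless of how large the converted document is.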

Constant Summary

LOGGER =

Returns gem-wide diagnostics logger.

Returns:

  • (Logger)

    gem-wide diagnostics logger.

Pikuri.logger_for('Extractors')
IMAGE =

Returns converter image tag. Version-tied so a gem upgrade rebuilds with the new pins; pikuri-internal- prefix matches the container-naming convention of the vectordb/memory supervisors.

Returns:

  • (String)

    converter image tag. Version-tied so a gem upgrade rebuilds with the new pins; pikuri-internal- prefix matches the container-naming convention of the vectordb/memory supervisors.

"pikuri-internal-extractors:#{Pikuri::VERSION}"
DOCKER_DIR =

Returns absolute path to the shipped docker build context (Dockerfile + convert.sh).

Returns:

  • (String)

    absolute path to the shipped docker build context (Dockerfile + convert.sh).

File.expand_path('../../../docker', __dir__)
CONVERT_TIMEOUT =

Returns coreutils-timeout budget for one conversion. Generous — a huge PPTX through markitdown can take a while — but bounded, so a wedged converter can’t hang the agent loop.

Returns:

  • (String)

    coreutils-timeout budget for one conversion. Generous — a huge PPTX through markitdown can take a while — but bounded, so a wedged converter can’t hang the agent loop.

'300s'
AUTO =

Returns sentinel format meaning “let markitdown’s magic-byte detection decide” — the fallback when content was claimed by content-type but the byte sniff is blind.

Returns:

  • (String)

    sentinel format meaning “let markitdown’s magic-byte detection decide” — the fallback when content was claimed by content-type but the byte sniff is blind.

'auto'
PDF =

Returns the PDF format tag. Singled out as a constant because PDF is the one format whose converter output gets a post-processing pass: pdftotext emits \f between pages, and #extract / #extract_lines turn those into the same “— Page N —” marker lines pikuri-pdf’s extractor emits, so page provenance (vectordb chunk citations, the Read tools’ page references) survives whichever PDF extractor a host wires.

Returns:

  • (String)

    the PDF format tag. Singled out as a constant because PDF is the one format whose converter output gets a post-processing pass: pdftotext emits \f between pages, and #extract / #extract_lines turn those into the same “— Page N —” marker lines pikuri-pdf’s extractor emits, so page provenance (vectordb chunk citations, the Read tools’ page references) survives whichever PDF extractor a host wires.

'pdf'
CONTENT_TYPES =

Returns normalized content-type → format tag (the tag doubles as the container entrypoint’s dispatch argument and pandoc’s -f / markitdown’s -x value).

Returns:

  • (Hash{String => String})

    normalized content-type → format tag (the tag doubles as the container entrypoint’s dispatch argument and pandoc’s -f / markitdown’s -x value).

{
  'application/vnd.oasis.opendocument.text' => 'odt',
  'application/rtf' => 'rtf',
  'text/rtf' => 'rtf',
  'application/epub+zip' => 'epub',
  'application/pdf' => PDF,
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'xlsx',
  'application/vnd.ms-excel' => 'xls',
  'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'pptx'
}.freeze
HOST_CONVERTERS =

Returns format tag → host CLIs that can convert it, in preference order. Mirrors the container entrypoint’s dispatch (docker/convert.sh) — keep the two in sync. pandoc leads where both could serve (DOCX, EPUB): its readers preserve more structure.

Returns:

  • (Hash{String => Array<Symbol>})

    format tag → host CLIs that can convert it, in preference order. Mirrors the container entrypoint’s dispatch (docker/convert.sh) — keep the two in sync. pandoc leads where both could serve (DOCX, EPUB): its readers preserve more structure.

{
  'odt'  => %i[pandoc],
  'rtf'  => %i[pandoc],
  'epub' => %i[pandoc markitdown],
  'docx' => %i[pandoc markitdown],
  'xlsx' => %i[markitdown],
  'xls'  => %i[markitdown],
  'pptx' => %i[markitdown],
  PDF    => %i[pdftotext],
  AUTO   => %i[markitdown]
}.freeze
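
Picking a host CLI from such a preference table can be sketched as below. The PREFERENCE table is a shortened copy used for illustration, and pick_host_converter with its `available` argument is a hypothetical helper, not part of the class.

```ruby
# Shortened copy of the preference table, for illustration only.
PREFERENCE = {
  'docx' => %i[pandoc markitdown],
  'xlsx' => %i[markitdown],
  'pdf'  => %i[pdftotext]
}.freeze

# Returns the first preferred CLI that is installed, or nil when none is.
def pick_host_converter(format, available)
  PREFERENCE.fetch(format, []).find { |cli| available.include?(cli) }
end
```

With only markitdown installed, a DOCX falls through to the second choice; with neither installed the result is nil and the caller can raise.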
ZIP_MAGIC =

Returns zip local-file-header magic, shared by every OOXML / ODF / EPUB document.

Returns:

  • (String)

    zip local-file-header magic, shared by every OOXML / ODF / EPUB document.

"PK\x03\x04".b
VERSION_PROBE_FLAGS =

Returns host-CLI name → the flag that makes it print a version and exit 0, where --version (the default probe, see #cli?) doesn’t work: poppler’s pdftotext parses --version as a filename and exits 1, but accepts -v.

Returns:

  • (Hash{String => String})

    host-CLI name → the flag that makes it print a version and exit 0, where --version (the default probe, see #cli?) doesn’t work: poppler’s pdftotext parses --version as a filename and exits 1, but accepts -v.

{ 'pdftotext' => '-v' }.freeze
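
A version probe using this table might look like the following sketch. The helper name cli_available? is hypothetical and this is not the body of #cli?; it only shows the per-CLI flag lookup with --version as the default.

```ruby
require 'rbconfig'

VERSION_PROBE_FLAGS = { 'pdftotext' => '-v' }.freeze

# True when the CLI starts and exits 0 for its version flag.
# Kernel#system returns nil when the binary cannot be executed at all
# and false on a non-zero exit, so both count as "not available".
def cli_available?(name)
  flag = VERSION_PROBE_FLAGS.fetch(File.basename(name), '--version')
  system(name, flag, out: File::NULL, err: File::NULL) == true
end
```

The probe output is discarded; only the exit status matters.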

Instance Method Details

#ensure_image! ⇒ void

This method returns an undefined value.

Build the converter image now if it isn’t present — for host scripts that prefer paying the one-time build (pip install + apt, minutes) at boot rather than mid-conversation. Entirely optional: #extract builds lazily on first use otherwise.

Raises:

  • (Pikuri::Extractor::Error)

    when docker is unavailable or the build fails.



# File 'lib/pikuri/extractors/documents.rb', line 279

def ensure_image!
  raise Pikuri::Extractor::Error, '`docker` is unavailable; cannot build the converter image' unless docker?

  image_ready!
  nil
end

#extract(io) ⇒ String

Convert the whole document behind io to one Markdown String. PDFs come back as one “— Page N —”-headed block per text-carrying page (see PDF); a fully scanned PDF extracts to the empty String — same contract as pikuri-pdf’s extractor.

Parameters:

  • io (IO, StringIO)

    seekable IO positioned at the start.

Returns:

  • (String)

    Markdown-flavoured UTF-8 text.

Raises:

  • (Pikuri::Extractor::Error)

    when no converter is available, the conversion exits non-zero, or it times out.



# File 'lib/pikuri/extractors/documents.rb', line 228

def extract(io)
  with_converted(io) do |file, format|
    format == PDF ? pdf_page_lines(file).to_a.join("\n") : file.read
  end
end

#extract_lines(io) ⇒ Enumerator<String>

Same content as #extract, as a stream of chomped lines read off the converter’s stdout Tempfile — the whole-document conversion still runs up front (subprocess converters can’t parse lazily), but neither the document nor the Markdown ever materialises as one String: the conversion fires on first consumption, streaming io into the converter. The enumerator owns the Tempfile and deletes it when iteration ends.

Parameters:

  • io (IO, StringIO)

    seekable IO positioned at the start; must remain open until the enumerator is consumed (same contract as pikuri-pdf’s lazy extract_lines).

Returns:

  • (Enumerator<String>)

Raises:

  • (Pikuri::Extractor::Error)

    as for #extract, raised on first consumption.



# File 'lib/pikuri/extractors/documents.rb', line 248

def extract_lines(io)
  Enumerator.new do |yielder|
    with_converted(io) do |file, format|
      if format == PDF
        pdf_page_lines(file).each { |line| yielder << line }
      else
        file.each_line { |line| yielder << line.chomp }
      end
    end
  end
end
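
The "fires on first consumption" and "owns the Tempfile" behaviour can be reproduced in isolation. In this sketch lazy_lines and its stats Hash are hypothetical, and a plain file write stands in for the converter run.

```ruby
require 'tempfile'

# Returns [enumerator, stats]; stats records what the fake conversion did.
def lazy_lines(markdown)
  stats = { conversions: 0, path: nil }
  lines = Enumerator.new do |yielder|
    # Tempfile.create deletes the file when this block is left, including
    # when the consumer stops iterating early.
    Tempfile.create('markdown') do |file|
      stats[:conversions] += 1 # stands in for running the converter
      stats[:path] = file.path
      file.write(markdown)
      file.rewind
      file.each_line { |line| yielder << line.chomp }
    end
  end
  [lines, stats]
end

lines, stats = lazy_lines("# Title\n\nBody\n")
untouched = stats[:conversions] # nothing has run yet
head = lines.first(2)           # runs the fake conversion, then stops early
```

Each fresh iteration of the enumerator runs the block again, which mirrors the re-conversion cost described under "Paging economics".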

#kind ⇒ Symbol

Returns kind tag carried on Extractor::Page#kind.

Returns:

  • (Symbol)

    kind tag carried on Extractor::Page#kind.



# File 'lib/pikuri/extractors/documents.rb', line 203

def kind
  :document
end

#matches?(sample:, content_type:) ⇒ Boolean

Claim content this extractor can convert: a recognised content-type, or a positive byte sniff (see “Format detection” in the class docs).

Parameters:

  • sample (String)

    leading bytes of the content.

  • content_type (String, nil)

    normalized content-type, or nil when the transport carries none (local files).

Returns:

  • (Boolean)


# File 'lib/pikuri/extractors/documents.rb', line 215

def matches?(sample:, content_type:)
  CONTENT_TYPES.key?(content_type) || !sniff(sample).nil?
end

#register ⇒ Documents

Plug this extractor into Pikuri::Extractor.registry, before the terminal Passthrough entry. Idempotent — a second call is a no-op.

Returns:

  • (Documents)

    self.



# File 'lib/pikuri/extractors/documents.rb', line 265

def register
  registry = Pikuri::Extractor.registry
  registry.insert(-2, self) unless registry.include?(self)
  self
end