Module: Pikuri::Extractor

Defined in:
lib/pikuri/extractor.rb,
lib/pikuri/extractor/html.rb,
lib/pikuri/extractor/passthrough.rb

Overview

The format→text extraction seam: one registry of extractors that turn an IO of some recognised format (HTML and plain text out of the box; PDF / office formats via the pikuri-pdf / pikuri-extractors plug-in gems) into Markdown-flavoured UTF-8 text, consumed through two front doors:

  • Extractor.extract: the whole document as one String. This is the shape the indexing / caching callers want (VectorDb’s indexer, Tool::WebScrape’s URL cache): no windowing, no presentation.

  • Extractor.extract_paged: the LLM-tool shape. The same extraction, windowed to a line range with a byte cap, returned as a Page the caller renders. Backs Workspace::Read and VectorDb::Tools::Read (via the FileType path wrappers) so the offset/limit/byte-cap logic lives in one tested place.

Both dispatchers (Tool::Scraper, which dispatches on the HTTP Content-Type header for the web tools, and FileType, which resolves local paths) route through this one registry, so they share one set of format truths. Supporting a new format is therefore a registry entry (pikuri-pdf and pikuri-extractors plug PDF and office formats in without pikuri-core knowing), not a new special case in two dispatchers.

The extractor duck type

Each Extractor.registry entry implements three methods:

  • matches?(sample:, content_type:) ⇒ Boolean: claim the content. sample is the leading FileType::SAMPLE_BYTES bytes (for magic-byte sniffs); content_type is the normalized HTTP Content-Type for web content or the FileType.detect_mime result for local files, and may be nil (“no transport metadata; sniff if you can”).

  • extract(io) ⇒ String: the whole document as Markdown-flavoured UTF-8 text. Raises Error on content the extractor claimed but cannot parse (malformed PDF, …).

  • kind ⇒ Symbol: a short tag (:text / :pdf / :html) carried on Page#kind so rendering callers can word format-specific trailers (“End of PDF”, the scanned-image hint) without re-sniffing.

plus one optional method for formats whose lines can be produced incrementally:

  • extract_lines(io) ⇒ Enumerator<String>: the same content as extract, as a lazy stream of already-chomped lines. Extractor.extract_paged prefers this when present and stops consuming the moment the window fills, so the rest of the document is never parsed (pikuri-pdf’s extractor: pdf-reader’s page list parses on access; Passthrough: the IO is read line-by-line). The enumerator must be consumed while io is still open, and may raise Error mid-iteration. Extractors that need the whole document to produce anything (HTML: Readability walks the full DOM; the same is true of any subprocess-based extractor) simply omit it; Extractor.extract_paged then extracts in full and windows the result.

Windowing itself (offset / limit / byte cap / line truncation) is presentation and deliberately lives once in Extractor.extract_paged, not per extractor — extract_lines is line production, the only genuinely format-specific half of paging.
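
As an illustration of the duck type described above, here is a minimal sketch of an extractor for tab-separated values. TsvExtractor, its :tsv kind, and the ExtractError stand-in (used in place of Pikuri::Extractor::Error so the sketch runs on its own) are all hypothetical and not part of pikuri; the sketch only shows the three required methods plus the optional extract_lines.

```ruby
require "stringio"

# Hypothetical stand-in for Pikuri::Extractor::Error so the sketch runs alone.
ExtractError = Class.new(StandardError)

# Hypothetical extractor for tab-separated values; not part of pikuri.
module TsvExtractor
  module_function

  # Claim the content by transport metadata only (TSV has no magic bytes,
  # so the sample is ignored here).
  def matches?(sample:, content_type:)
    content_type == "text/tab-separated-values"
  end

  # Short tag that would be carried on Page#kind.
  def kind
    :tsv
  end

  # Optional lazy line production: one Markdown table row per input line.
  # Lines are chomped before they are yielded, and nothing is read from io
  # until the enumerator is consumed.
  def extract_lines(io)
    Enumerator.new do |out|
      io.each_line do |raw|
        line = raw.chomp
        raise ExtractError, "invalid UTF-8 in TSV" unless line.valid_encoding?
        out << "| #{line.split("\t").join(" | ")} |"
      end
    end
  end

  # Whole document as one String, built from the same line stream.
  def extract(io)
    extract_lines(io).to_a.join("\n")
  end
end

puts TsvExtractor.extract(StringIO.new("name\tsize\nreport.pdf\t12\n"))
```

Building extract on top of extract_lines keeps the two outputs identical by construction, which is one way to satisfy the "same content as extract" requirement.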

Errors

Both failure modes are failures the caller’s LLM can react to, so they share one rescuable root:

  • Unsupported — nothing in Extractor.registry claimed the content (opaque binary, an unhandled content-type).

  • Error (the root) — an extractor claimed the content but the parse failed (malformed PDF, …).

Callers map them to their own conventions: Tool::Scraper re-raises both as FetchError; FileType.read_as_text maps Unsupported to the ArgumentError binary refusal and Error to a RuntimeError carrying the path.
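
The mapping described above can be sketched as follows. The ExtractorSketch module mirrors the documented hierarchy (Unsupported is a subclass of Error) so the example runs without pikuri; read_for_llm and describe_failure are hypothetical caller helpers, not pikuri methods.

```ruby
# Stand-ins mirroring the documented hierarchy: Unsupported < Error.
module ExtractorSketch
  Error = Class.new(StandardError)
  Unsupported = Class.new(Error)
end

# Hypothetical caller that distinguishes the two failure modes. The
# subclass must be rescued first, otherwise the Error clause catches it.
def read_for_llm(path)
  yield
rescue ExtractorSketch::Unsupported
  raise ArgumentError, "#{path}: binary or unsupported format"
rescue ExtractorSketch::Error => e
  raise RuntimeError, "#{path}: #{e.message}"
end

# Hypothetical caller that does not care about the distinction and
# rescues only the root class.
def describe_failure
  yield
rescue ExtractorSketch::Error => e
  "could not read: #{e.message}"
end

puts describe_failure { raise ExtractorSketch::Unsupported, "no extractor claims this content" }
```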

Defined Under Namespace

Modules: HTML, Passthrough

Classes: Page

Constant Summary

Error =

Raised when an extractor claims content but fails to parse it (e.g. a malformed PDF). Message is LLM-presentable.

Class.new(StandardError)
Unsupported =

Raised by extract / extract_paged when no registry entry claims the content. Subclass of Error so callers that don’t care about the distinction rescue one class.

Class.new(Error)
PAGE_DEFAULT_LIMIT =

Default line-window size for extract_paged when the caller omits limit.

Returns:

  • (Integer)

2000
PAGE_MAX_BYTES =

Default hard byte cap on the content collected by a single extract_paged call. The cap applies per call, so a caller can still read past it by paging with a larger offset. The rendered output is slightly larger (line numbering, trailer); that is the caller’s concern.

Returns:

  • (Integer)

50 * 1024
PAGE_MAX_LINE_LENGTH =

Default per-line character cap; extract_paged truncates longer lines and appends PAGE_LINE_TRUNCATION_MARKER.

Returns:

  • (Integer)

2000
PAGE_LINE_TRUNCATION_MARKER =

Suffix appended to a line truncated at PAGE_MAX_LINE_LENGTH.

Returns:

  • (String)

"... (line truncated to #{PAGE_MAX_LINE_LENGTH} chars)"

Class Method Details

.extract(io, content_type: nil) ⇒ String

Extract the whole document behind io as one Markdown-flavoured UTF-8 String. May be empty (empty text file, scanned-image PDF with no extractable text).

Parameters:

  • io (IO, StringIO)

    seekable IO positioned at the start of the content; this method reads a leading sample for the matches? sniff and rewinds before extracting.

  • content_type (String, nil) (defaults to: nil)

    normalized content-type when the transport supplies one (HTTP header, FileType.detect_mime result); nil when unknown — extractors then rely on their byte sniffs.

Returns:

  • (String)

Raises:

  • (Unsupported)

    when no registry entry claims the content.

  • (Error)

    when the matched extractor cannot parse it.



# File 'lib/pikuri/extractor.rb', line 179

def extract(io, content_type: nil)
  extractor_for(io, content_type).extract(io)
end
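
The extractor_for helper is not shown on this page. Going only by the documented behaviour (read a leading sample, rewind, consult the registry in order, raise Unsupported when nothing matches), a rough sketch might look like the following. Every name here is hypothetical: SKETCH_SAMPLE_BYTES stands in for FileType::SAMPLE_BYTES, whose value is not given, and the three stand-in extractors use simplified matching rules that are assumptions, in particular the NUL-byte check in TextLike.

```ruby
require "stringio"

SKETCH_SAMPLE_BYTES = 16 # assumed value; the real FileType::SAMPLE_BYTES is not given
SketchUnsupported = Class.new(StandardError)

# Stand-in extractors with simplified claims.
module PdfLike
  def self.matches?(sample:, content_type:)
    sample.start_with?("%PDF-") # magic-byte sniff, ignores the header
  end
end

module HtmlLike
  def self.matches?(sample:, content_type:)
    content_type == "text/html" # content-type match only
  end
end

module TextLike
  def self.matches?(sample:, content_type:)
    !sample.include?("\x00") # assumed heuristic: NUL bytes mean binary
  end
end

SKETCH_REGISTRY = [PdfLike, HtmlLike, TextLike].freeze

# Read a leading sample, rewind so the extractor sees the whole document,
# then return the first registry entry that claims the content.
def extractor_for_sketch(registry, io, content_type)
  sample = io.read(SKETCH_SAMPLE_BYTES) || ""
  io.rewind
  registry.find { |e| e.matches?(sample: sample, content_type: content_type) } or
    raise SketchUnsupported, "no extractor claims this content"
end

# A PDF served under a lying text/html header is still claimed by PdfLike,
# because the magic-byte sniffer sits ahead of the content-type matcher.
p extractor_for_sketch(SKETCH_REGISTRY, StringIO.new("%PDF-1.7 body"), "text/html")
```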

.extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page

Extract io and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.

Lazy where the format allows: extractors that implement extract_lines (plain text; pikuri-pdf’s PDF) are consumed only until the window fills. Reading the first window of a 500-page PDF parses a handful of pages, and reading the first window of a gigabyte log never loads the whole file into memory. Extractors without it (HTML) are extracted in full and then windowed, which is also what makes their total_lines always exact.

Parameters:

  • io (IO, StringIO)

    seekable IO positioned at the start.

  • content_type (String, nil) (defaults to: nil)

    as for extract.

  • offset (Integer) (defaults to: 1)

    1-indexed first line to include. The caller is responsible for validating offset >= 1.

  • limit (Integer) (defaults to: PAGE_DEFAULT_LIMIT)

    maximum lines to collect. Caller validates limit >= 1.

  • max_bytes (Integer) (defaults to: PAGE_MAX_BYTES)

    hard byte cap on collected content.

  • max_line_length (Integer) (defaults to: PAGE_MAX_LINE_LENGTH)

    per-line truncation threshold.

Returns:

  • (Page)

    the windowed slice.

Raises:

  • (Unsupported)

    when no registry entry claims the content.

  • (Error)

    when the matched extractor cannot parse it.



# File 'lib/pikuri/extractor.rb', line 207

def extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT,
                  max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH)
  extractor = extractor_for(io, content_type)
  if extractor.respond_to?(:extract_lines)
    # count_tail is a per-format economics call: once the window
    # fills, counting the rest of a plain-text stream is a cheap
    # sequential read (so the trailer can say "of N"), while for
    # a PDF it would mean parsing every remaining page — exactly
    # what extract_lines exists to avoid. Plugged-in extractors
    # (pikuri-pdf's included) get the conservative default (stop
    # early, total unknown).
    window(extractor.extract_lines(io),
           offset: offset, limit: limit, max_bytes: max_bytes,
           max_line_length: max_line_length, kind: extractor.kind,
           known_total: nil, count_tail: extractor.equal?(Passthrough))
  else
    lines = extractor.extract(io).split("\n")
    window(lines, offset: offset, limit: limit, max_bytes: max_bytes,
                  max_line_length: max_line_length, kind: extractor.kind,
                  known_total: lines.length)
  end
end
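
The window helper called above is not shown on this page. As a rough sketch of the documented windowing rules (1-indexed offset, line limit, byte cap, per-line truncation, and stopping as soon as the window fills), something like the following would behave as described. window_sketch and the SKETCH_ constants are hypothetical; it returns a plain Array rather than a Page, ignores known_total and count_tail, and its treatment of a line that would cross the byte cap (left out entirely) is an assumption.

```ruby
# Values copied from the constants documented on this page.
SKETCH_MAX_LINE_LENGTH = 2000
SKETCH_MARKER = "... (line truncated to #{SKETCH_MAX_LINE_LENGTH} chars)"

# Hypothetical helper, not pikuri's `window`: collects up to `limit` lines
# starting at the 1-indexed `offset`, and stops pulling from `lines` as
# soon as the window is full.
def window_sketch(lines, offset: 1, limit: 2000, max_bytes: 50 * 1024,
                  max_line_length: SKETCH_MAX_LINE_LENGTH)
  collected = []
  bytes = 0
  lines.each_with_index do |line, index|
    next if index + 1 < offset # still before the requested window
    line = line[0, max_line_length] + SKETCH_MARKER if line.length > max_line_length
    break if bytes + line.bytesize > max_bytes # assumed: a line crossing the cap is left out
    collected << line
    bytes += line.bytesize
    break if collected.length >= limit # window full: stop consuming the source
  end
  collected
end

# A lazy source that records how many lines it was asked to produce.
produced = 0
source = Enumerator.new do |out|
  1.upto(1_000) do |n|
    produced += 1
    out << "line #{n}"
  end
end

p window_sketch(source, offset: 2, limit: 3) # ["line 2", "line 3", "line 4"]
p produced                                   # 4
```

Checking the limit right after appending, rather than at the top of the next iteration, is what keeps the source from producing one line more than the window needs.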

.registry ⇒ Array<#matches?>

The extractor registry, consulted in order; the first match wins. Core ships two entries: HTML matches on content-type, and Passthrough is the terminal plain-text arm. A gem adding a format picks its insertion point by the strength of its claim: a magic-byte sniff that never misfires on text goes at the front so it beats HTML’s content-type match even under a lying header (registry.unshift(X), as pikuri-pdf does); a content-type / weaker-sniff claimer inserts before the terminal entry (registry.insert(-2, X), as pikuri-extractors does).

Returns:

  • (Array<#matches?>)

    mutable, deliberately — this is the plug-in seam.



# File 'lib/pikuri/extractor.rb', line 161

def registry
  @registry ||= [HTML, Passthrough]
end
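
To make the two insertion points concrete, here is a small sketch using plain symbols in place of the extractor modules. The :PDF and :Office names are placeholders for whatever a plug-in gem would register; only the Array calls themselves (unshift and insert with a negative index) are taken from the description above.

```ruby
# Stand-in registry; symbols take the place of the extractor modules.
registry = [:HTML, :Passthrough]

# Strong magic-byte claim: goes to the front, ahead of HTML.
registry.unshift(:PDF)

# Weaker claim: inserted just before the terminal Passthrough entry.
# With a negative index, Array#insert places the new element after the
# element at that position counted from the end, so -2 lands it
# immediately before the last entry.
registry.insert(-2, :Office)

p registry # [:PDF, :HTML, :Office, :Passthrough]
```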