Module: Pikuri::Extractor

Defined in: lib/pikuri/extractor.rb,
lib/pikuri/extractor/html.rb,
lib/pikuri/extractor/passthrough.rb

Overview

The format→text extraction seam: one registry of extractors that turn an IO of some recognised format (HTML and plain text out of the box; PDF / office formats via the pikuri-pdf / pikuri-extractors plug-in gems) into Markdown-flavoured UTF-8 text, consumed through two front doors:

Extractor.extract — the whole document as one String. The shape the indexing / caching callers want (VectorDb’s indexer, Tool::WebScrape’s URL cache): no windowing, no presentation.
Extractor.extract_paged — the LLM-tool shape: the same extraction, windowed to a line range with a byte cap, returned as a Page the caller renders. Backs Workspace::Read and VectorDb::Tools::Read (via the FileType path wrappers) so the offset/limit/byte-cap logic lives in one tested place.

Both dispatchers (Tool::Scraper, which dispatches on the HTTP Content-Type header for the web tools, and FileType, which resolves local paths) route through this one registry, so they share one set of format truths. Supporting a new format is therefore a registry entry rather than a new special case in two dispatchers: pikuri-pdf and pikuri-extractors plug in PDF and office formats without pikuri-core knowing.

The extractor duck type

Each Extractor.registry entry implements three methods:

matches?(sample:, content_type:) → Boolean — claim the content. sample is the leading FileType::SAMPLE_BYTES bytes (for magic-byte sniffs); content_type is the normalized HTTP Content-Type for web content, the FileType.detect_mime result for local files, and may be nil (“no transport metadata — sniff if you can”).
extract(io) → String — the whole document as Markdown-flavoured UTF-8 text. Raises Error on content the extractor claimed but cannot parse (malformed PDF, …).
kind → Symbol — a short tag (:text / :pdf / :html) carried on Page#kind so rendering callers can word format-specific trailers (“End of PDF”, the scanned-image hint) without re-sniffing.

plus one optional method for formats whose lines can be produced incrementally:

extract_lines(io) → Enumerator<String> — the same content as extract, as a lazy stream of already-chomped lines. Extractor.extract_paged prefers this when present and stops consuming the moment the window fills, so the rest of the document is never parsed (pikuri-pdf’s extractor relies on pdf-reader’s page list parsing on access; Passthrough reads the IO line by line). The enumerator must be consumed while io is still open, and may raise Error mid-iteration. Extractors that need the whole document to produce anything (HTML, because Readability walks the full DOM; the same is true of any subprocess-based extractor) simply omit it; Extractor.extract_paged then extracts in full and windows the result.

Windowing itself (offset / limit / byte cap / line truncation) is presentation and deliberately lives once in Extractor.extract_paged, not per extractor — extract_lines is line production, the only genuinely format-specific half of paging.
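As a rough illustration of the duck type described above, here is a minimal sketch of an extractor for tab-separated values. Everything named in it (the TsvSketch module, the content-type string it claims, the :tsv kind) is hypothetical and not part of pikuri; a real extractor would also raise Pikuri::Extractor::Error on input it claimed but cannot parse.

```ruby
require "stringio"

# Hypothetical extractor: renders tab-separated values as Markdown table rows.
# The module name, MIME string, and :tsv kind are assumptions for illustration.
module TsvSketch
  module_function

  # Claim the content purely on the transport-supplied content type.
  # A nil content_type is never claimed here, because this sketch has
  # no byte sniff to fall back on.
  def matches?(sample:, content_type:)
    content_type == "text/tab-separated-values"
  end

  # Short tag a rendering caller could use for format-specific trailers.
  def kind
    :tsv
  end

  # Optional lazy line production: one Markdown row per input line,
  # already chomped. Must be consumed while io is still open.
  def extract_lines(io)
    Enumerator.new do |out|
      io.each_line do |raw|
        out << "| #{raw.chomp.split("\t").join(" | ")} |"
      end
    end
  end

  # Whole document as one String, built on the same line producer.
  def extract(io)
    extract_lines(io).to_a.join("\n")
  end
end

puts TsvSketch.extract(StringIO.new("name\tsize\nreadme\t12\n"))
```

Building extract on top of extract_lines keeps the two outputs identical by construction, which is the property the duck type asks for.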

Errors

Both failure modes are failures the caller’s LLM can react to, so they share one rescuable root:

Unsupported — nothing in Extractor.registry claimed the content (opaque binary, an unhandled content-type).
Error (the root) — an extractor claimed the content but the parse failed (malformed PDF, …).

Callers map them to their own conventions: Tool::Scraper re-raises both as FetchError; FileType.read_as_text maps Unsupported to the ArgumentError binary refusal and Error to a RuntimeError carrying the path.
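The shared root means a caller can rescue either the subclass or the root. The sketch below uses stand-in classes that mirror the documented hierarchy so it runs without pikuri loaded; ExtractorSketch and describe_failure are hypothetical names, and the returned strings are only one possible caller convention.

```ruby
# Stand-ins mirroring the documented hierarchy
# (Unsupported < Error < StandardError).
module ExtractorSketch
  Error = Class.new(StandardError)
  Unsupported = Class.new(Error)
end

# Hypothetical caller-side mapping: rescue the subclass first when the
# distinction matters, then the root for parse failures.
def describe_failure
  yield
  "ok"
rescue ExtractorSketch::Unsupported => e
  "unsupported: #{e.message}"
rescue ExtractorSketch::Error => e
  "parse failed: #{e.message}"
end

puts describe_failure { raise ExtractorSketch::Unsupported, "opaque binary" }
puts describe_failure { raise ExtractorSketch::Error, "malformed PDF" }
```

A caller that does not care about the distinction can drop the first rescue clause; the root clause then catches both cases.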

Defined Under Namespace

Modules: HTML, Passthrough
Classes: Page

Constant Summary

Error = Raised when an extractor claims content but fails to parse it (e.g. a malformed PDF). Message is LLM-presentable.

Class.new(StandardError)

Unsupported = Raised by extract / extract_paged when no registry entry claims the content. Subclass of Error so callers that don’t care about the distinction rescue one class.

Class.new(Error)

PAGE_DEFAULT_LIMIT = Default line-window size for extract_paged when the caller omits limit. Returns: (Integer)

PAGE_MAX_BYTES = Default hard byte cap on the content collected by a single extract_paged call. Bypassable by paging via offset. The rendered output is slightly larger (line numbering, trailer); that is the caller’s concern. Returns: (Integer)

50 * 1024

PAGE_MAX_LINE_LENGTH = Default per-line character cap; extract_paged truncates longer lines and appends PAGE_LINE_TRUNCATION_MARKER. Returns: (Integer)
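To make the per-line cap concrete, here is a minimal sketch of truncate-and-append. The cap of 20 is a hypothetical value chosen for the example (the real PAGE_MAX_LINE_LENGTH is not shown here), and truncate_line is an illustrative helper, not a pikuri method; whether the real implementation counts characters exactly this way is an assumption.

```ruby
# Hypothetical cap; the real PAGE_MAX_LINE_LENGTH value is not shown here.
SKETCH_MAX_LINE_LENGTH = 20
SKETCH_MARKER = "... (line truncated to #{SKETCH_MAX_LINE_LENGTH} chars)"

# Keep the first max characters of an over-long line and append the marker;
# lines at or under the cap pass through untouched.
def truncate_line(line, max: SKETCH_MAX_LINE_LENGTH, marker: SKETCH_MARKER)
  return line if line.length <= max

  line[0, max] + marker
end

puts truncate_line("short line")
puts truncate_line("x" * 50)
```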

PAGE_LINE_TRUNCATION_MARKER = Suffix appended to a line truncated at PAGE_MAX_LINE_LENGTH. Returns: (String)

"... (line truncated to #{PAGE_MAX_LINE_LENGTH} chars)"

Class Method Summary

.extract(io, content_type: nil) ⇒ String

Extract the whole document behind io as one Markdown-flavoured UTF-8 String.

.extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page

Extract io and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.

.registry ⇒ Array<#matches?>

The extractor registry, consulted in order: first match wins.

Class Method Details

.extract(io, content_type: nil) ⇒ `String`

Extract the whole document behind io as one Markdown-flavoured UTF-8 String. May be empty (empty text file, scanned-image PDF with no extractable text).

Parameters:

io (IO, StringIO) —

seekable IO positioned at the start of the content; this method reads a leading sample for the matches? sniff and rewinds before extracting.
content_type (String, nil) (defaults to: nil) —

normalized content-type when the transport supplies one (HTTP header, FileType.detect_mime result); nil when unknown — extractors then rely on their byte sniffs.

Returns:

(String)

Raises:

(Unsupported) —

when no registry entry claims the content.
(Error) —

when the matched extractor cannot parse it.



# File 'lib/pikuri/extractor.rb', line 179

def extract(io, content_type: nil)
  extractor_for(io, content_type).extract(io)
end
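The extractor_for helper called above is not shown on this page. Based only on the parameter notes (read a leading sample, rewind, first match wins), a dispatch step of that kind could be sketched as follows. The names pick_extractor, PdfLike, TextLike, and the 512-byte sample size are all hypothetical; the real sample size is FileType::SAMPLE_BYTES, and the real method raises Unsupported where this sketch returns nil.

```ruby
require "stringio"

# Hypothetical sample size; the real FileType::SAMPLE_BYTES is not shown here.
SKETCH_SAMPLE_BYTES = 512

# Toy registry entry with a magic-byte sniff.
module PdfLike
  def self.matches?(sample:, content_type:)
    sample.start_with?("%PDF-")
  end
end

# Toy terminal entry that claims everything.
module TextLike
  def self.matches?(sample:, content_type:)
    true
  end
end

# Read a leading sample for the sniff, rewind so the chosen extractor
# sees the content from the start, and return the first entry that
# claims it (nil when nothing does).
def pick_extractor(registry, io, content_type)
  sample = io.read(SKETCH_SAMPLE_BYTES) || ""
  io.rewind
  registry.find { |entry| entry.matches?(sample: sample, content_type: content_type) }
end

p pick_extractor([PdfLike, TextLike], StringIO.new("%PDF-1.7 body"), nil)
p pick_extractor([PdfLike, TextLike], StringIO.new("plain notes"), nil)
```

The rewind is why the io parameter is documented as seekable: the sniff consumes bytes that the extractor still needs.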

.extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ `Page`

Extract io and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.

Lazy where the format allows: extractors that implement extract_lines (plain text; pikuri-pdf’s PDF) are consumed only until the window fills. Reading the first window of a 500-page PDF parses only a handful of pages, and reading the first page of a gigabyte log never loads the whole log into memory. Extractors without it (HTML) are extracted in full and then windowed, which is also what makes their total_lines always exact.

Parameters:

io (IO, StringIO) —

seekable IO positioned at the start.
content_type (String, nil) (defaults to: nil) —

as for extract.
offset (Integer) (defaults to: 1) —

1-indexed first line to include. The caller is responsible for validating offset >= 1.
limit (Integer) (defaults to: PAGE_DEFAULT_LIMIT) —

maximum lines to collect. Caller validates limit >= 1.
max_bytes (Integer) (defaults to: PAGE_MAX_BYTES) —

hard byte cap on collected content.
max_line_length (Integer) (defaults to: PAGE_MAX_LINE_LENGTH) —

per-line truncation threshold.

Returns:

(Page) —

the windowed slice.

Raises:

(Unsupported) —

when no registry entry claims the content.
(Error) —

when the matched extractor cannot parse it.

# File 'lib/pikuri/extractor.rb', line 207

def extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT,
                  max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH)
  extractor = extractor_for(io, content_type)
  if extractor.respond_to?(:extract_lines)
    # count_tail is a per-format economics call: once the window
    # fills, counting the rest of a plain-text stream is a cheap
    # sequential read (so the trailer can say "of N"), while for
    # a PDF it would mean parsing every remaining page — exactly
    # what extract_lines exists to avoid. Plugged-in extractors
    # (pikuri-pdf's included) get the conservative default (stop
    # early, total unknown).
    window(extractor.extract_lines(io),
           offset: offset, limit: limit, max_bytes: max_bytes,
           max_line_length: max_line_length, kind: extractor.kind,
           known_total: nil, count_tail: extractor.equal?(Passthrough))
  else
    lines = extractor.extract(io).split("\n")
    window(lines, offset: offset, limit: limit, max_bytes: max_bytes,
                  max_line_length: max_line_length, kind: extractor.kind,
                  known_total: lines.length)
  end
end
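The private window helper is not shown on this page. The sketch below illustrates only the lazy part of the idea: walk a line enumerator, skip to offset, and stop consuming once limit lines or max_bytes are reached. window_sketch is a hypothetical stand-in that returns a plain Array rather than a Page, and its byte accounting (line bytes only, no newlines) and small defaults are assumptions; line truncation, known_total, and count_tail are left out.

```ruby
# Hypothetical stand-in for the window helper: collects lines from offset
# (1-indexed) until limit lines or max_bytes would be exceeded.
def window_sketch(lines, offset: 1, limit: 3, max_bytes: 1024)
  collected = []
  bytes = 0
  lines.each_with_index do |line, index|
    next if index + 1 < offset                  # before the window: skip
    break if bytes + line.bytesize > max_bytes  # byte cap reached: stop

    collected << line
    bytes += line.bytesize
    break if collected.length >= limit          # window full: stop consuming
  end
  collected
end

# A producer that records how many lines it was actually asked for.
consumed = 0
source = Enumerator.new do |out|
  1.upto(1_000) do |n|
    consumed += 1
    out << "line #{n}"
  end
end

p window_sketch(source, offset: 2, limit: 3)
p consumed # only the lines up to the end of the window were produced
```

Breaking out of the iteration is what keeps the rest of the document unparsed; a count_tail mode would instead keep iterating after the window fills, counting lines without collecting them.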

.registry ⇒ `Array<#matches?>`

The extractor registry, consulted in order: first match wins. Core ships two entries: HTML matches on content-type, and Passthrough is the terminal plain-text arm. A gem adding a format picks its insertion point by the strength of its claim. A magic-byte sniff that never misfires on text goes at the front, so it beats HTML’s content-type match even under a lying header (registry.unshift(X); pikuri-pdf does this). A content-type or weaker-sniff claimer inserts before the terminal entry (registry.insert(-2, X); pikuri-extractors does this).

Returns:

(Array<#matches?>) —

mutable, deliberately — this is the plug-in seam.



# File 'lib/pikuri/extractor.rb', line 161

def registry
  @registry ||= [HTML, Passthrough]
end
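The two insertion points described above can be seen with plain Array calls. In this sketch, symbols stand in for the extractor modules; :pdf_sniffer and :office_by_type are hypothetical placeholders for plug-in extractors, and only the resulting order is the point.

```ruby
# Symbols stand in for extractor modules; only the ordering matters here.
registry = [:html, :passthrough]

# Strong magic-byte claim: goes to the front so it is consulted first.
registry.unshift(:pdf_sniffer)

# Weaker content-type claim: goes just before the terminal entry.
registry.insert(-2, :office_by_type)

p registry
```

With a negative index, Array#insert places the new element after the element at that position, so -2 lands it immediately before the last entry and the terminal plain-text arm stays last.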
