Module: Pikuri::Extractor
- Defined in:
- lib/pikuri/extractor.rb,
lib/pikuri/extractor/html.rb,
lib/pikuri/extractor/passthrough.rb
Overview
The format→text extraction seam: one registry of extractors that turn an IO of some recognised format (HTML and plain text out of the box; PDF / office formats via the pikuri-pdf / pikuri-extractors plug-in gems) into Markdown-flavoured UTF-8 text, consumed through two front doors:
-
Extractor.extract: the whole document as one String. The shape the indexing / caching callers want (VectorDb's indexer, Tool::WebScrape's URL cache): no windowing, no presentation.
-
Extractor.extract_paged: the LLM-tool shape. The same extraction, windowed to a line range with a byte cap, returned as a Page the caller renders. Backs Workspace::Read and VectorDb::Tools::Read (via the FileType path wrappers) so the offset/limit/byte-cap logic lives in one tested place.
Both dispatchers (Tool::Scraper, which dispatches on the HTTP Content-Type header for the web tools, and FileType, which resolves local paths) route through this one registry, so both share one set of format truths, and “support a new format” is a registry entry (pikuri-pdf and pikuri-extractors plug PDF and office formats in without pikuri-core knowing), not a new special case in two dispatchers.
The extractor duck type
Each Extractor.registry entry implements three methods:
-
matches?(sample:, content_type:) → Boolean: claim the content. sample is the leading FileType::SAMPLE_BYTES bytes (for magic-byte sniffs); content_type is the normalized HTTP Content-Type for web content, the FileType.detect_mime result for local files, and may be nil (“no transport metadata; sniff if you can”).
-
extract(io) → String: the whole document as Markdown-flavoured UTF-8 text. Raises Error on content the extractor claimed but cannot parse (malformed PDF, …).
-
kind → Symbol: a short tag (:text / :pdf / :html) carried on Page#kind so rendering callers can word format-specific trailers (“End of PDF”, the scanned-image hint) without re-sniffing.
plus one optional method for formats whose lines can be produced incrementally:
-
extract_lines(io) → Enumerator<String>: the same content as extract, as a lazy stream of already-chomped lines. Extractor.extract_paged prefers this when present and stops consuming the moment the window fills, so the rest of the document is never parsed (pikuri-pdf's extractor: pdf-reader's page list parses on access; Passthrough: the IO is read line-by-line). The enumerator must be consumed while io is still open, and may raise Error mid-iteration. Extractors that need the whole document to produce anything (HTML: Readability walks the full DOM; the same is true of any subprocess-based extractor) simply omit it; Extractor.extract_paged then extracts in full and windows the result.
Windowing itself (offset / limit / byte cap / line truncation) is presentation and deliberately lives once in Extractor.extract_paged, not per extractor — extract_lines is line production, the only genuinely format-specific half of paging.
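As a rough illustration of the duck type above, here is a minimal sketch of an extractor that implements all four methods. TsvExtractor, its content-type string, and its row rendering are hypothetical and not part of pikuri; the sketch only assumes the method shapes listed in this section.

```ruby
require "stringio"

# Hypothetical extractor for tab-separated text; not part of pikuri.
module TsvExtractor
  module_function

  # Claim the content by content type only. A nil content_type means
  # "no transport metadata"; this sketch declines rather than sniffing.
  def matches?(sample:, content_type:)
    content_type == "text/tab-separated-values"
  end

  # Whole document as one String, built from the lazy line stream.
  def extract(io)
    extract_lines(io).to_a.join("\n")
  end

  # Short tag a rendering caller could use for format-specific trailers.
  def kind
    :tsv
  end

  # Lazy stream of chomped lines; each row becomes a Markdown table row.
  # Must be consumed while io is still open.
  def extract_lines(io)
    Enumerator.new do |out|
      io.each_line(chomp: true) do |line|
        out << "| #{line.split("\t").join(" | ")} |"
      end
    end
  end
end

io = StringIO.new("name\tsize\nreadme\t12\n")
first_line = TsvExtractor.extract_lines(io).first
```

Because extract_lines returns an Enumerator, asking for only the first line reads only the first row of the IO, which is the property extract_paged is described as relying on.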
Errors
Both failure modes are failures the *caller’s* LLM can react to, so they share one rescuable root:
-
Unsupported — nothing in Extractor.registry claimed the content (opaque binary, an unhandled content-type).
-
Error (the root) — an extractor claimed the content but the parse failed (malformed PDF, …).
Callers map them to their own conventions: Tool::Scraper re-raises both as FetchError; FileType.read_as_text maps Unsupported to the ArgumentError binary refusal and Error to a RuntimeError carrying the path.
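A caller-side mapping might look like the sketch below. ExtractorSketch and describe_failure are stand-ins invented for illustration; only the Error / Unsupported relationship is taken from this page. The point it shows is ordering: since Unsupported is a subclass of Error, the more specific rescue has to come first.

```ruby
# Stand-ins mirroring the documented hierarchy; real code would reference
# Pikuri::Extractor::Error and Pikuri::Extractor::Unsupported instead.
module ExtractorSketch
  Error = Class.new(StandardError)
  Unsupported = Class.new(Error)
end

# Hypothetical helper: maps each failure mode to a caller-specific message.
def describe_failure
  yield
  "ok"
rescue ExtractorSketch::Unsupported => e
  # Nothing in the registry claimed the content.
  "unsupported: #{e.message}"
rescue ExtractorSketch::Error => e
  # An extractor claimed the content but could not parse it.
  "parse failed: #{e.message}"
end

unsupported = describe_failure { raise ExtractorSketch::Unsupported, "opaque binary" }
parse_error = describe_failure { raise ExtractorSketch::Error, "malformed PDF" }
```

A caller that does not care about the distinction can rescue the root class alone and handle both cases in one branch.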
Defined Under Namespace
Modules: HTML, Passthrough Classes: Page
Constant Summary
- Error =
Raised when an extractor claims content but fails to parse it (e.g. a malformed PDF). Message is LLM-presentable.
Class.new(StandardError)
- Unsupported =
Raised by extract / extract_paged when no registry entry claims the content. Subclass of Error so callers that don’t care about the distinction rescue one class.
Class.new(Error)
- PAGE_DEFAULT_LIMIT =
Default line-window size for extract_paged when the caller omits limit.
2000
- PAGE_MAX_BYTES =
Default hard byte cap on the content collected by a single extract_paged call. Bypassable by paging via offset. The rendered output is slightly larger (line numbering, trailer); that is the caller's concern.
50 * 1024
- PAGE_MAX_LINE_LENGTH =
Default per-line character cap; extract_paged truncates longer lines and appends PAGE_LINE_TRUNCATION_MARKER.
2000
- PAGE_LINE_TRUNCATION_MARKER =
Suffix appended to a line truncated at PAGE_MAX_LINE_LENGTH.
"... (line truncated to #{PAGE_MAX_LINE_LENGTH} chars)"
Class Method Summary
-
.extract(io, content_type: nil) ⇒ String
Extract the whole document behind io as one Markdown-flavoured UTF-8 String.
-
.extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page
Extract io and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.
-
.registry ⇒ Array<#matches?>
The extractor registry, consulted in order — first match wins.
Class Method Details
.extract(io, content_type: nil) ⇒ String
Extract the whole document behind io as one Markdown-flavoured UTF-8 String. May be empty (empty text file, scanned-image PDF with no extractable text).
    # File 'lib/pikuri/extractor.rb', line 179
    def extract(io, content_type: nil)
      extractor_for(io, content_type).extract(io)
    end
.extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page
Extract io and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.
Lazy where the format allows: extractors that implement extract_lines (plain text; pikuri-pdf’s PDF) are consumed only until the window fills — reading the first window of a 500-page PDF parses a handful of pages, and the first page of a gigabyte log never loads it. Extractors without it (HTML) are extracted in full and then windowed, which is also what makes their total_lines always exact.
    # File 'lib/pikuri/extractor.rb', line 207
    def extract_paged(io, content_type: nil, offset: 1, limit: PAGE_DEFAULT_LIMIT,
                      max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH)
      extractor = extractor_for(io, content_type)
      if extractor.respond_to?(:extract_lines)
        # count_tail is a per-format economics call: once the window
        # fills, counting the rest of a plain-text stream is a cheap
        # sequential read (so the trailer can say "of N"), while for
        # a PDF it would mean parsing every remaining page, exactly
        # what extract_lines exists to avoid. Plugged-in extractors
        # (pikuri-pdf's included) get the conservative default (stop
        # early, total unknown).
        window(extractor.extract_lines(io),
               offset: offset, limit: limit, max_bytes: max_bytes,
               max_line_length: max_line_length, kind: extractor.kind,
               known_total: nil, count_tail: extractor.equal?(Passthrough))
      else
        lines = extractor.extract(io).split("\n")
        window(lines,
               offset: offset, limit: limit, max_bytes: max_bytes,
               max_line_length: max_line_length, kind: extractor.kind,
               known_total: lines.length)
      end
    end
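The private window helper is not shown on this page, so the following is only a guess at its general shape, useful for seeing why a lazy line source stops being consumed once the window fills. PageSketch, MARKER, window_sketch, and the exact truncation and byte-counting rules are all assumptions; the real Page class and helper may differ.

```ruby
# Hypothetical result shape; the real Page class may expose different fields.
PageSketch = Struct.new(:lines, :hit_byte_cap, keyword_init: true)

# Stand-in for PAGE_LINE_TRUNCATION_MARKER.
MARKER = "... (line truncated)"

def window_sketch(lines, offset: 1, limit: 2000, max_bytes: 50 * 1024, max_line_length: 2000)
  collected = []
  bytes = 0
  hit_cap = false
  lines.each_with_index do |line, index|
    next if index + 1 < offset # offset is 1-indexed
    # Assumed rule: cut to max_line_length characters, then append the marker.
    line = line[0, max_line_length] + MARKER if line.length > max_line_length
    if bytes + line.bytesize > max_bytes
      hit_cap = true
      break
    end
    collected << line
    bytes += line.bytesize
    break if collected.length >= limit # window full: stop pulling from the source
  end
  PageSketch.new(lines: collected, hit_byte_cap: hit_cap)
end

# A lazy source that counts how many lines it was asked to produce.
produced = 0
source = Enumerator.new do |out|
  1.upto(1_000_000) do |n|
    produced += 1
    out << "line #{n}"
  end
end

page = window_sketch(source, offset: 2, limit: 3)
```

In this sketch the million-line source is asked for only four lines (one skipped, three collected), which mirrors the behaviour described for extractors that implement extract_lines.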
.registry ⇒ Array<#matches?>
The extractor registry, consulted in order; the first match wins. Core ships two entries: HTML matches on content-type, and Passthrough is the terminal plain-text arm. A gem adding a format picks its insertion point by the strength of its claim: a magic-byte sniff that never misfires on text goes at the front so it beats HTML's content-type match even under a lying header (registry.unshift(X), as pikuri-pdf does); a content-type / weaker-sniff claimer inserts before the terminal entry (registry.insert(-2, X), as pikuri-extractors does).
    # File 'lib/pikuri/extractor.rb', line 161
    def registry
      @registry ||= [HTML, Passthrough]
    end
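The two insertion points can be tried out on a plain Array. In the sketch below, Claimer and the four matching rules are hypothetical stand-ins for real extractor modules (in particular, the passthrough rule is a simplification); only unshift, insert(-2, ...), and first-match-wins lookup are taken from the description above.

```ruby
# Stand-in extractors: each is a hypothetical object answering matches?.
Claimer = Struct.new(:name, :rule) do
  def matches?(sample:, content_type:)
    rule.call(sample, content_type)
  end
end

html        = Claimer.new(:html,        ->(_s, ct) { ct == "text/html" })
passthrough = Claimer.new(:passthrough, ->(s, _ct) { !s.include?("\0") }) # simplified text check
pdf         = Claimer.new(:pdf,         ->(s, _ct) { s.start_with?("%PDF-") })
office      = Claimer.new(:office,      ->(_s, ct) { ct == "application/msword" })

registry = [html, passthrough]
registry.unshift(pdf)       # strong magic-byte claim: consulted first
registry.insert(-2, office) # weaker claim: just before the terminal entry

order = registry.map(&:name)

# First match wins, so PDF bytes served under a text/html header still
# resolve to the PDF stand-in rather than the HTML one.
winner = registry.find do |entry|
  entry.matches?(sample: "%PDF-1.7\n", content_type: "text/html")
end
```

Here order ends up as pdf, html, office, passthrough: Array#insert with a negative index places the new element after the element at that index, so -2 lands immediately before the last entry.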