Module: Pikuri::Extractor::Passthrough
Defined in: lib/pikuri/extractor/passthrough.rb
Overview
The terminal plain-text arm of the registry: content that is already text needs no extraction, so it passes through verbatim (forced to UTF-8 — invalid bytes are left in for downstream to deal with, matching what File.read with a UTF-8 encoding does). Markdown, source files, JSON, robots.txt all land here.
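The "forced to UTF-8, invalid bytes left in" behaviour can be illustrated with a small sketch. The helper name `passthrough_extract` and the sample bytes are assumptions made for illustration; only the `io.read.force_encoding(Encoding::UTF_8)` expression comes from the source shown further down.

```ruby
require "stringio"

# Hypothetical stand-in for the documented one-liner: read everything,
# then re-tag the bytes as UTF-8 without validating or converting them.
def passthrough_extract(io)
  io.read.force_encoding(Encoding::UTF_8)
end

# "\xE9" is Latin-1 for "é"; on its own it is not a valid UTF-8 sequence.
raw  = "caf\xE9\n".b
text = passthrough_extract(StringIO.new(raw))

text.encoding            # Encoding::UTF_8
text.valid_encoding?     # false: the invalid byte is still present
text.bytes == raw.bytes  # true: nothing was replaced or dropped
text.scrub("?")          # "caf?\n", one way a downstream consumer might clean up
```

Whether downstream code scrubs, rejects, or ignores such bytes is outside this module; the sketch only shows that the passthrough itself does not touch them.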
Matching is split by whether the transport supplied a content-type:
- With a content-type (the web path): claim text/* only. A non-text type that no earlier extractor claimed is not second-guessed by sniffing: a server declaring application/octet-stream gets the Unsupported refusal the LLM can react to, same as before this registry existed.
- Without one (the local-file path, where FileType.detect_mime returned nil for “unrecognised”): claim anything that the FileType.binary? heuristic does not flag as binary on the sample. Opaque binaries stay unclaimed and surface as Unsupported.
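The two-branch rule above could be sketched roughly as follows. This is not the module's actual implementation: `passthrough_matches?` and `binary_sample?` are hypothetical names, the NUL-byte check is only a placeholder for whatever FileType.binary? really does, and the handling of letter case and parameters in the content-type is an assumption.

```ruby
# Hypothetical placeholder for FileType.binary?: treat a sample as binary
# if it contains a NUL byte. The real heuristic is not shown on this page.
def binary_sample?(sample)
  sample.b.include?("\x00".b)
end

# Sketch of the matching rule: trust a declared content-type when there is
# one, and fall back to sniffing the sample only when there is none.
def passthrough_matches?(sample:, content_type:)
  if content_type
    # Web path: claim text/* only, never second-guess by sniffing.
    content_type.downcase.start_with?("text/")
  else
    # Local-file path: claim anything that does not look binary.
    !binary_sample?(sample)
  end
end

passthrough_matches?(sample: "{}", content_type: "text/plain; charset=utf-8") # true
passthrough_matches?(sample: "{}", content_type: "application/octet-stream")  # false, although the bytes are text
passthrough_matches?(sample: "# notes\n", content_type: nil)                   # true
passthrough_matches?(sample: "\x89PNG\x00\x00".b, content_type: nil)           # false
```

The second call is the case the overview singles out: a declared non-text type is refused even when the payload happens to be readable text.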
Class Method Summary
- .extract(io) ⇒ String
  The content, tagged UTF-8.
- .extract_lines(io) ⇒ Enumerator::Lazy<String>
  The lazy line stream for extract_paged: the IO is read line-by-line, so a window over the head of a gigabyte log never loads the rest.
- .kind ⇒ Symbol
- .matches?(sample:, content_type:) ⇒ Boolean
Class Method Details
.extract(io) ⇒ String
Returns the content, tagged UTF-8. Deliberately NOT derived from extract_lines — a passthrough must stay verbatim (trailing newline, CRLF line endings), which a join of chomped lines would silently normalize away.
    # File 'lib/pikuri/extractor/passthrough.rb', line 44
    def self.extract(io)
      io.read.force_encoding(Encoding::UTF_8)
    end
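To see why .extract is not built on top of the line stream, the sketch below compares a verbatim read with a hypothetical "join the chomped lines" alternative. Both helper names are invented for illustration.

```ruby
require "stringio"

# Verbatim read, mirroring the body of .extract.
def verbatim_read(io)
  io.read.force_encoding(Encoding::UTF_8)
end

# Hypothetical alternative that .extract deliberately avoids:
# rebuild the text from chomped lines.
def rebuilt_from_lines(io)
  io.each_line.map(&:chomp).join("\n")
end

source = "alpha\r\nbeta\r\n"

verbatim_read(StringIO.new(source)) == source  # true: CRLF and trailing newline kept
rebuilt_from_lines(StringIO.new(source))       # "alpha\nbeta": CRLF became LF, final newline lost
```

String#chomp strips "\r\n" as well as "\n", so the join cannot tell which line ending the source used or whether the file ended with a newline.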
.extract_lines(io) ⇒ Enumerator::Lazy<String>
The lazy line stream for Pikuri::Extractor.extract_paged: the IO is read line-by-line, so a window over the head of a gigabyte log never loads the rest. Consuming the whole stream is a cheap sequential read — which is why the paging window counts this stream’s tail for an exact total_lines (see Pikuri::Extractor.extract_paged).
    # File 'lib/pikuri/extractor/passthrough.rb', line 59
    def self.extract_lines(io)
      io.each_line.lazy.map { |raw| raw.chomp.force_encoding(Encoding::UTF_8) }
    end
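The laziness can be observed with an in-memory IO. In this sketch `passthrough_lines` is a hypothetical copy of the method body above, and StringIO stands in for a large file; how Pikuri::Extractor.extract_paged actually slices the stream is not shown on this page.

```ruby
require "stringio"

# Hypothetical copy of .extract_lines for experimentation.
def passthrough_lines(io)
  io.each_line.lazy.map { |raw| raw.chomp.force_encoding(Encoding::UTF_8) }
end

io   = StringIO.new("line 1\nline 2\nline 3\nline 4\n")
head = passthrough_lines(io).first(2)

head     # ["line 1", "line 2"]
io.eof?  # false: the tail of the input has not been consumed

# Counting lines means draining the stream, which is a plain sequential read.
passthrough_lines(StringIO.new("a\nb\nc\n")).count  # 3
```

Taking a window from the head stops pulling from the IO once enough lines have been yielded, while an exact line total still requires reading to the end.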
.kind ⇒ Symbol
Returns the Pikuri::Extractor::Page#kind tag.
    # File 'lib/pikuri/extractor/passthrough.rb', line 25
    def self.kind
      :text
    end