Module: Pikuri::Extractor::Passthrough

Defined in: lib/pikuri/extractor/passthrough.rb

Overview

The terminal plain-text arm of the registry: content that is already text needs no extraction, so it passes through verbatim (forced to UTF-8 — invalid bytes are left in for downstream to deal with, matching what File.read with a UTF-8 encoding does). Markdown, source files, JSON, robots.txt all land here.

Matching is split by whether the transport supplied a content-type:

With a content-type (the web path): claim text/* only. A non-text type that no earlier extractor claimed is not second-guessed by sniffing — a server declaring application/octet-stream gets the Unsupported refusal the LLM can react to, same as before this registry existed.
Without one (the local-file path, where FileType.detect_mime returned nil for “unrecognised”): claim anything that passes the FileType.binary? heuristic on the sample. Opaque binaries stay unclaimed and surface as Unsupported.
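
The two-branch rule above can be sketched in isolation. This is a minimal illustration, not the library's code: FileTypeSketch.binary? is a hypothetical stand-in for FileType.binary? (whose real heuristic is not shown on this page) and simply treats a NUL byte as the sign of binary content; passthrough_match? is likewise an illustrative name.

```ruby
# Hypothetical stand-in for FileType.binary?. Assumption: a NUL byte in the
# sample means "binary". The real heuristic may differ.
module FileTypeSketch
  def self.binary?(sample)
    sample.include?("\x00")
  end
end

# Illustrative copy of the rule: trust a declared content-type, and sniff the
# sample only when no content-type was supplied.
def passthrough_match?(sample:, content_type:)
  return content_type.start_with?('text/') unless content_type.nil?

  !FileTypeSketch.binary?(sample)
end

# Web path: only text/* is claimed, even if the bytes look like text.
web_markdown = passthrough_match?(sample: "# Title\n", content_type: 'text/markdown')
web_octets   = passthrough_match?(sample: "# Title\n", content_type: 'application/octet-stream')

# Local-file path: no content-type, so the sample decides.
local_text   = passthrough_match?(sample: "key: value\n", content_type: nil)
local_binary = passthrough_match?(sample: "\x7FELF\x00\x01".b, content_type: nil)
```

Note that web_octets is false although the sample is plainly text: with a declared content-type the sample is never consulted.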

Class Method Summary

.extract(io) ⇒ String

The content, tagged UTF-8.
.extract_lines(io) ⇒ Enumerator::Lazy<String>

The lazy line stream for extract_paged: the IO is read line-by-line, so a window over the head of a gigabyte log never loads the rest.
.kind ⇒ Symbol

Pikuri::Extractor::Page#kind tag.
.matches?(sample:, content_type:) ⇒ Boolean

Class Method Details

.extract(io) ⇒ `String`

Returns the content, tagged UTF-8. Deliberately NOT derived from extract_lines — a passthrough must stay verbatim (trailing newline, CRLF line endings), which a join of chomped lines would silently normalize away.

Parameters:

io (IO, StringIO) —

IO over the text content.

Returns:

(String) —

the content, tagged UTF-8.



# File 'lib/pikuri/extractor/passthrough.rb', line 44

def self.extract(io)
  io.read.force_encoding(Encoding::UTF_8)
end
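
A small sketch of why the verbatim read matters, using only StringIO and core String methods. The sample strings and variable names are illustrative assumptions; the one-liner mirrors the method body above rather than calling the module itself.

```ruby
require 'stringio'

# Bytes as they might arrive from a transport: CRLF endings, trailing newline.
io = StringIO.new("first\r\nsecond\r\n".b)

# What extract does: read everything, then retag as UTF-8 without transcoding.
verbatim = io.read.force_encoding(Encoding::UTF_8)

# What a join of chomped lines would produce instead.
io.rewind
rejoined = io.each_line.map(&:chomp).join("\n")

# Invalid UTF-8 is left in place: force_encoding only changes the tag.
tagged = StringIO.new("caf\xE9\n".b).read.force_encoding(Encoding::UTF_8)
still_invalid = !tagged.valid_encoding?
byte_count    = tagged.bytesize
```

Here verbatim keeps both CRLF pairs and the trailing newline, while rejoined is "first\nsecond" and has lost them. The Latin-1 byte in tagged survives untouched and the string simply reports an invalid encoding.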

.extract_lines(io) ⇒ `Enumerator::Lazy<String>`

The lazy line stream for Pikuri::Extractor.extract_paged: the IO is read line-by-line, so a window over the head of a gigabyte log never loads the rest. Consuming the whole stream is a cheap sequential read — which is why the paging window counts this stream’s tail for an exact total_lines (see Pikuri::Extractor.extract_paged).

Parameters:

io (IO, StringIO) —

IO over the text content; must remain open while the enumerator is consumed.

Returns:

(Enumerator::Lazy<String>) —

chomped lines, tagged UTF-8.



# File 'lib/pikuri/extractor/passthrough.rb', line 59

def self.extract_lines(io)
  io.each_line.lazy.map { |raw| raw.chomp.force_encoding(Encoding::UTF_8) }
end
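
The laziness can be observed with a StringIO, whose read position shows how much has been consumed. This is a sketch under assumptions: the variable names are illustrative, and the head-plus-tail count only imitates the idea of an exact total_lines; it is not the implementation of Pikuri::Extractor.extract_paged.

```ruby
require 'stringio'

io = StringIO.new("one\ntwo\nthree\nfour\n")

# Same pipeline as the method body above.
lines = io.each_line.lazy.map { |raw| raw.chomp.force_encoding(Encoding::UTF_8) }

# Taking a two-line window reads only those two lines from the IO.
head = lines.first(2)
pos_after_head = io.pos

# Counting the tail is a plain sequential read from the current position.
total_lines = head.size + lines.count
```

After the window is taken, the read position sits just past "two\n"; the remaining lines are touched only when the tail is counted. The IO has to stay open for both steps.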

.kind ⇒ `Symbol`

Returns the Pikuri::Extractor::Page#kind tag, which is :text for this extractor.

Returns:

(Symbol) —

Pikuri::Extractor::Page#kind tag.



# File 'lib/pikuri/extractor/passthrough.rb', line 25

def self.kind
  :text
end

.matches?(sample:, content_type:) ⇒ `Boolean`

Parameters:

sample (String) —

leading bytes of the content.
content_type (String, nil) —

normalized content-type, nil when the transport has none.

Returns:

(Boolean)

# File 'lib/pikuri/extractor/passthrough.rb', line 33

def self.matches?(sample:, content_type:)
  return content_type.start_with?('text/') unless content_type.nil?

  !FileType.binary?(sample)
end
