Module: Pikuri::FileType

Defined in: lib/pikuri/file_type.rb

Overview

Magic-byte content sniffing, plus the path-aware front over the Extractor registry. Two responsibilities:

FileType.detect_mime: recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf, the four image formats), or nil for “unrecognised: could be text, could be opaque binary; caller decides”.
FileType.binary?: heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.

On top of those sit the two Pathname conveniences, FileType.read_as_text (whole document, the VectorDb indexer’s shape) and FileType.read_as_text_paged (line-windowed, the Read tools’ shape). Both are thin wrappers: they own the path-level refusals (missing file, directory, image) and the exception mapping, then hand the opened IO to Extractor.extract / Extractor.extract_paged — which format the bytes are and how they become text is entirely the registry’s business, so a gem plugging a new extractor in extends these wrappers for free.

FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open.
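The bytes-or-Pathname dispatch described above can be sketched in isolation. This is a hypothetical stand-in rather than pikuri's code: demo_sample_of and DEMO_SAMPLE_BYTES (set to an arbitrary 4096) are names invented for illustration, since the real helper body and the value of SAMPLE_BYTES are not shown on this page.

```ruby
require 'pathname'
require 'tempfile'

# Hypothetical sample size, used by this sketch only.
DEMO_SAMPLE_BYTES = 4096

# Return a binary-encoded sample from either raw bytes or a Pathname.
def demo_sample_of(input)
  case input
  when Pathname
    # Pathname#binread(length) reads at most `length` bytes in binary mode.
    # The fallback covers a nil result at end of file (empty file).
    input.binread(DEMO_SAMPLE_BYTES) || ''.b
  when String
    input.b
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

# Usage: take one sample, then reuse it for several checks without reopening.
Tempfile.create('demo') do |file|
  file.binmode
  file.write("%PDF-1.7\n")
  file.flush
  sample = demo_sample_of(Pathname.new(file.path))
  puts sample.start_with?('%PDF-') # prefix sniff on the shared sample
  puts sample.encoding             # the sample is binary-encoded
end
```

Passing the already-read sample to both checks is what saves the second open; the Pathname branch exists purely for convenience.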

Why a separate module

Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools share one set of magic-byte truths.

Deliberate non-goals

*Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.
*No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.
*No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
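As an illustration of the last point, caller-side routing on a detect_mime-style result (a MIME String or nil) might look like the sketch below. demo_route and its return symbols are hypothetical; only the two comparisons come from the text above.

```ruby
# Hypothetical caller-side routing; no predicates needed on the module itself.
def demo_route(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/') # safe navigation keeps nil from raising
    :image
  else
    :unrecognised
  end
end

p demo_route('application/pdf') # => :pdf
p demo_route('image/webp')      # => :image
p demo_route(nil)               # => :unrecognised
```

The safe-navigation operator is the "one extra character": it lets the nil result fall through to the unrecognised branch.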

Constant Summary

SAMPLE_BYTES = Recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free. Returns: (Integer).

BINARY_NONPRINTABLE_THRESHOLD = Fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold. Returns: (Float).

0.30

IMAGE_MAGIC_BYTES = Magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here because its signature is split across the RIFF container header; it is handled directly in detect_mime. Returns: (Hash{String => String}).

{
  "\x89PNG\r\n\x1a\n".b => 'image/png',
  "\xff\xd8\xff".b      => 'image/jpeg',
  "GIF87a".b            => 'image/gif',
  "GIF89a".b            => 'image/gif'
}.freeze

PDF_MAGIC = PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2. Returns: (String).

'%PDF-'

Class Method Summary

.binary?(input) ⇒ Boolean

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the bytes outside the printable set (\t \n \v \f \r plus ASCII 32..126) and compare their share of the sample against BINARY_NONPRINTABLE_THRESHOLD.
.detect_mime(input) ⇒ String?

Recognise a file from its leading bytes.
.read_as_text(path) ⇒ String

Read path and return its content as plain UTF-8 text, routed through the Extractor registry: anything unrecognised-but-textual passes through verbatim (Extractor::Passthrough); with pikuri-pdf registered, PDFs are extracted with “— Page N —” markers (a scanned-image PDF with no extractable text comes back as the empty String, a deliberate silent skip callers detect by length if they care).
.read_as_text_paged(path, offset: 1, limit: Extractor::PAGE_DEFAULT_LIMIT, max_bytes: Extractor::PAGE_MAX_BYTES, max_line_length: Extractor::PAGE_MAX_LINE_LENGTH) ⇒ Extractor::Page

Extract path and return a windowed Extractor::Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.

Class Method Details

.binary?(input) ⇒ `Boolean`

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the bytes outside the printable set (\t \n \v \f \r plus ASCII 32..126) and compare their share of the sample against BINARY_NONPRINTABLE_THRESHOLD. As implemented below, only control bytes under 32 are counted, so UTF-8 lead and continuation bytes (all above 127) pass through unflagged and UTF-8 text reads fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).

Parameters:

input (String, Pathname) —

the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists.

Returns:

(Boolean)

# File 'lib/pikuri/file_type.rb', line 126

def binary?(input)
  bytes = sample_of(input)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end
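To make the heuristic concrete, the sketch below applies the same counting rule to a few hand-built samples and reports the ratio alongside the verdict. demo_classify and DEMO_THRESHOLD are illustrative names, not part of pikuri; the threshold mirrors the 0.30 listed for BINARY_NONPRINTABLE_THRESHOLD, and the verdict logic is intended to match the method above.

```ruby
DEMO_THRESHOLD = 0.30 # mirrors BINARY_NONPRINTABLE_THRESHOLD as listed above

# Report the NUL check, the control-byte ratio, and the resulting verdict.
def demo_classify(sample)
  bytes = sample.b
  return { nul: false, ratio: 0.0, binary: false } if bytes.empty?

  nul = bytes.include?("\x00".b)
  # Same rule as the method above: control bytes below 32, minus \t..\r (9..13).
  control = bytes.each_byte.count { |b| b < 9 || (b > 13 && b < 32) }
  ratio = control.to_f / bytes.bytesize
  { nul: nul, ratio: ratio, binary: nul || ratio > DEMO_THRESHOLD }
end

p demo_classify("plain text\n")[:binary]       # => false (no control bytes)
p demo_classify('héllo wörld')[:binary]        # => false (UTF-8 bytes are >= 32)
p demo_classify("\x01abcdefghi")[:binary]      # => false (1 of 10 bytes, ratio 0.1)
p demo_classify("\x01\x02\x03abc")[:binary]    # => true  (3 of 6 bytes, ratio 0.5)
p demo_classify("abc\x00def")[:binary]         # => true  (NUL byte)
p demo_classify('')[:binary]                   # => false (empty sample)
```

The fourth and fifth samples show the two independent routes to a binary verdict: crossing the ratio threshold, or containing any NUL byte at all.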

.detect_mime(input) ⇒ `String`?

Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).

Parameters:

input (String, Pathname) —

the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists; missing-file errors propagate as Errno::ENOENT.

Returns:

(String, nil)

# File 'lib/pikuri/file_type.rb', line 98

def detect_mime(input)
  bytes = sample_of(input)
  return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

  IMAGE_MAGIC_BYTES.each do |prefix, mime|
    return mime if bytes.start_with?(prefix)
  end
  return 'image/webp' if bytes.bytesize >= 12 &&
                         bytes.byteslice(0, 4) == 'RIFF'.b &&
                         bytes.byteslice(8, 4) == 'WEBP'.b

  nil
end
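The WebP branch is the only one that cannot use start_with?, because bytes 4..7 of a RIFF header hold a little-endian chunk size that varies from file to file. The sketch below builds synthetic 12-byte headers to show this; demo_riff_header and demo_webp? are hypothetical helpers, and the sizes are arbitrary.

```ruby
# Build a 12-byte RIFF header: 'RIFF', a 4-byte little-endian size, a form type.
def demo_riff_header(form_type, size)
  'RIFF'.b + [size].pack('V') + form_type.b
end

# Same three-part check as the WebP branch above.
def demo_webp?(bytes)
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&
    bytes.byteslice(8, 4) == 'WEBP'.b
end

small_webp = demo_riff_header('WEBP', 1024)
large_webp = demo_riff_header('WEBP', 2_000_000)
wav        = demo_riff_header('WAVE', 1024) # WAV audio is also a RIFF container

p demo_webp?(small_webp) # => true
p demo_webp?(large_webp) # => true (different size bytes, same verdict)
p demo_webp?(wav)        # => false (RIFF container, but not WebP)
p demo_webp?('RIFF'.b)   # => false (shorter than 12 bytes)
```

Checking the form type at offset 8 is what keeps other RIFF containers, such as WAV, from being misreported as images.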

.read_as_text(path) ⇒ `String`

Read path and return its content as plain UTF-8 text, routed through the Extractor registry: anything unrecognised-but-textual passes through verbatim (Extractor::Passthrough); with pikuri-pdf registered, PDFs are extracted with “— Page N —” markers (a scanned-image PDF with no extractable text comes back as the empty String, a deliberate silent skip callers detect by length if they care).

Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.

Path doesn’t exist → Errno::ENOENT.
Path is a directory → ArgumentError.
Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.
Content no extractor claims (opaque binary) → ArgumentError, mapped from Extractor::Unsupported.
Extraction failure (malformed PDF, …) → RuntimeError with the path included, mapped from Extractor::Error so callers don’t need to know any extractor’s exception hierarchy.

Parameters:

path (Pathname) —

file to read.

Returns:

(String) —

UTF-8 text. May be empty (empty text file, or scanned-image PDF).

Raises:

(ArgumentError) —

if path isn’t a Pathname, points at a directory, is an image, or is binary.
(Errno::ENOENT) —

if path doesn’t exist.
(RuntimeError) —

on an extraction failure (malformed / unsupported PDF, …).

# File 'lib/pikuri/file_type.rb', line 170

def read_as_text(path)
  mime = guard_extractable(path)
  path.open('rb') { |io| Extractor.extract(io, content_type: mime) }
rescue Extractor::Unsupported
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text"
rescue Extractor::Error => e
  raise "Cannot extract text from #{path}: #{e.message}"
end
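The exception mapping in the rescue clauses above can be exercised without the real registry. In this sketch DemoExtractor, demo_read_as_text, and demo_outcome are hypothetical stand-ins, and making Unsupported a subclass of Error is an assumption (the page does not show the real hierarchy); the rescue order and messages follow the method above.

```ruby
# Hypothetical stand-in for the extractor's exception classes.
module DemoExtractor
  class Error < StandardError; end
  # Assumption: Unsupported inherits from Error, so it must be rescued first.
  class Unsupported < Error; end
end

# Run the block and translate extractor errors as read_as_text does.
def demo_read_as_text(label)
  yield
rescue DemoExtractor::Unsupported
  raise ArgumentError, "#{label} appears to be binary; cannot extract as text"
rescue DemoExtractor::Error => e
  raise "Cannot extract text from #{label}: #{e.message}"
end

# What an internal caller sees: either the text or a mapped exception.
def demo_outcome(label, &block)
  demo_read_as_text(label, &block)
rescue ArgumentError, RuntimeError => e
  "#{e.class}: #{e.message}"
end

puts demo_outcome('notes.txt') { 'hello' }
# prints: hello
puts demo_outcome('blob.bin') { raise DemoExtractor::Unsupported }
# prints: ArgumentError: blob.bin appears to be binary; cannot extract as text
puts demo_outcome('bad.pdf') { raise DemoExtractor::Error, 'truncated xref' }
# prints: RuntimeError: Cannot extract text from bad.pdf: truncated xref
```

Callers therefore only need to know ArgumentError, Errno::ENOENT, and RuntimeError, never an extractor-specific class.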

.read_as_text_paged(path, offset: 1, limit: Extractor::PAGE_DEFAULT_LIMIT, max_bytes: Extractor::PAGE_MAX_BYTES, max_line_length: Extractor::PAGE_MAX_LINE_LENGTH) ⇒ `Extractor::Page`

Extract path and return a windowed Extractor::Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length. Same routing and refusal contract as read_as_text; the windowing semantics (including the lazy extract_lines consumption that stops parsing once the window fills) are those of Extractor.extract_paged. The LLM-facing callers map the exceptions into “Error: …” observations themselves.

Parameters:

path (Pathname) —

file to read.
offset (Integer) (defaults to: 1) —

1-indexed first line to include. The caller is responsible for validating offset >= 1.
limit (Integer) (defaults to: Extractor::PAGE_DEFAULT_LIMIT) —

maximum lines to collect. Caller validates limit >= 1.
max_bytes (Integer) (defaults to: Extractor::PAGE_MAX_BYTES) —

hard byte cap on collected content.
max_line_length (Integer) (defaults to: Extractor::PAGE_MAX_LINE_LENGTH) —

per-line truncation threshold.

Returns:

(Extractor::Page) —

the windowed slice.

Raises:

(ArgumentError) —

if path isn’t a Pathname, is a directory, an image, or binary.
(Errno::ENOENT) —

if path doesn’t exist.
(RuntimeError) —

on an extraction failure (malformed / unsupported PDF, …).

# File 'lib/pikuri/file_type.rb', line 202

def read_as_text_paged(path, offset: 1, limit: Extractor::PAGE_DEFAULT_LIMIT,
                       max_bytes: Extractor::PAGE_MAX_BYTES,
                       max_line_length: Extractor::PAGE_MAX_LINE_LENGTH)
  mime = guard_extractable(path)
  path.open('rb') do |io|
    Extractor.extract_paged(io, content_type: mime, offset: offset, limit: limit,
                                max_bytes: max_bytes, max_line_length: max_line_length)
  end
rescue Extractor::Unsupported
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text"
rescue Extractor::Error => e
  raise "Cannot extract text from #{path}: #{e.message}"
end
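The windowing parameters can be illustrated with a small stand-alone sketch. This is not Extractor.extract_paged: demo_window is a hypothetical function that returns a plain Array rather than an Extractor::Page, its defaults are arbitrary, and the exact byte-cap and truncation rules of the real method are assumptions here.

```ruby
# Collect up to `limit` lines starting at the 1-indexed `offset`, truncating
# long lines and stopping once the next line would exceed `max_bytes`.
def demo_window(lines, offset: 1, limit: 3, max_bytes: 1024, max_line_length: 20)
  collected = []
  used = 0
  # lazy + drop skips the leading lines without materialising the whole source.
  lines.lazy.drop(offset - 1).each do |line|
    line = line[0, max_line_length] if line.length > max_line_length
    break if used + line.bytesize > max_bytes

    collected << line
    used += line.bytesize
    break if collected.size >= limit # window full: stop consuming the source
  end
  collected
end

produced = 0
source = Enumerator.new do |yielder|
  1.upto(1000) do |i|
    produced += 1
    yielder << "line #{i}"
  end
end

p demo_window(source, offset: 2, limit: 3) # => ["line 2", "line 3", "line 4"]
p produced < 1000                          # => true (the source was not drained)
```

Stopping as soon as the window fills is what makes a paged read of a large document cheap: lines past the window are never produced.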
