Module: Pikuri::FileType

Defined in: lib/pikuri/file_type.rb

Overview

Magic-byte content sniffing + text extraction, centralised. Three responsibilities:

FileType.detect_mime: recognises a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf and the four image formats: PNG, JPEG, GIF, WebP), or nil for “unrecognised”, meaning the content could be text or opaque binary and the caller decides.
FileType.binary?: heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.
FileType.read_as_text: reads a file and returns its content as plain UTF-8 text. PDFs go through pdf-reader page-by-page; plain text passes through; images, binaries, and missing files raise. This is the pure-extraction shape that consumers like Pikuri::VectorDb’s indexer want: no LLM-tool concerns (no paging, no line numbering, no byte caps), just bytes in, text out.

FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open. FileType.read_as_text takes a Pathname only — there’s no bytes-in shortcut because the PDF case needs to seek the file.
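The String-or-Pathname input handling described above could look roughly like the following. This is a minimal sketch, not the module's actual private helper: the name sample_of_sketch and the 4096-byte cap are assumptions made for illustration, since this page does not show the real helper or the value of SAMPLE_BYTES.

```ruby
require 'pathname'
require 'tempfile'

# Hypothetical cap; the real SAMPLE_BYTES value is not shown on this page.
SAMPLE_BYTES_SKETCH = 4096

# Hypothetical stand-in for the private helper behind detect_mime / binary?.
def sample_of_sketch(input)
  case input
  when Pathname
    # Binary mode, capped read. A missing file raises Errno::ENOENT here.
    # File#read(n) returns nil on an empty file, so fall back to an empty String.
    input.open('rb') { |f| f.read(SAMPLE_BYTES_SKETCH) } || ''.b
  when String
    # Caller already holds the sample; just make sure it is treated as raw bytes.
    input.b
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

# Sample once, then reuse the same bytes for several checks.
Tempfile.create('sample') do |f|
  f.binmode
  f.write('%PDF-1.7 rest of the file')
  f.flush
  bytes = sample_of_sketch(Pathname.new(f.path))
  puts bytes.start_with?('%PDF-')
end
```

Sampling once and passing the bytes on is what lets a caller run both checks with a single open of the file.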

Why a separate module

Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools (a future Workspace::Diff, an attachment-aware web fetcher, …) share one set of magic-byte truths.

Deliberate non-goals

*Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.
*No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.
*No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
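As an illustration of the last point, caller-side routing on the value returned by detect_mime might be sketched as follows. The method name route_for and the returned symbols are hypothetical; only the two comparisons come from the text above.

```ruby
# Hypothetical caller-side routing; `mime` is whatever detect_mime returned.
def route_for(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/')
    # The safe-navigation operator keeps this branch nil-safe.
    :image
  else
    # nil means "unrecognised": the caller decides, e.g. by asking binary?.
    :text_or_binary
  end
end

puts route_for('application/pdf')
puts route_for('image/webp')
puts route_for(nil)
```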

Constant Summary

SAMPLE_BYTES = Recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free. Returns: (Integer)

BINARY_NONPRINTABLE_THRESHOLD = Fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold. Returns: (Float)

0.30

IMAGE_MAGIC_BYTES = Magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here because its signature is split across the RIFF container header; it is handled directly in detect_mime. Returns: (Hash{String => String})

{
  "\x89PNG\r\n\x1a\n".b => 'image/png',
  "\xff\xd8\xff".b      => 'image/jpeg',
  "GIF87a".b            => 'image/gif',
  "GIF89a".b            => 'image/gif'
}.freeze

PDF_MAGIC = PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2. Returns: (String)

'%PDF-'

Class Method Summary

.binary?(input) ⇒ Boolean

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count bytes outside the printable \t \n \v \f \r + ASCII 32..126 range and compare their ratio to the sample size against the threshold.
.detect_mime(input) ⇒ String?

Recognise a file from its leading bytes.
.read_as_text(path) ⇒ String

Read path and return its content as plain UTF-8 text.

Class Method Details

.binary?(input) ⇒ `Boolean`

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count bytes outside the printable \t \n \v \f \r + ASCII 32..126 range and compare their ratio to the sample size against BINARY_NONPRINTABLE_THRESHOLD. In the code below, only control bytes under 32 (other than 9..13) are counted, so bytes of 127 and above, including UTF-8 continuation bytes (0x80-0xBF), pass through unflagged, letting UTF-8 text read fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).

Parameters:

input (String, Pathname) —

the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists.

Returns:

(Boolean)

# File 'lib/pikuri/file_type.rb', line 126

def binary?(input)
  bytes = sample_of(input)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end
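To make the counting rule concrete, here is a standalone sketch that copies the loop above and runs it on a few samples. The names binary_sketch? and BINARY_THRESHOLD_SKETCH are illustrative only; the sketch takes raw bytes and skips the Pathname handling.

```ruby
# Same value as BINARY_NONPRINTABLE_THRESHOLD above.
BINARY_THRESHOLD_SKETCH = 0.30

# Standalone copy of the heuristic, operating on a String of bytes.
def binary_sketch?(bytes)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?                      # any NUL byte decides immediately

    # Control bytes below 32, excluding \t \n \v \f \r (9..13).
    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_THRESHOLD_SKETCH
end

puts binary_sketch?("plain ASCII\n")              # false: nothing counted
puts binary_sketch?("caf\u00e9")                  # false: bytes >= 0x80 are not counted
puts binary_sketch?("abc\x00def".b)               # true: contains NUL
puts binary_sketch?("\x01\x02\x03abcdefg".b)      # false: 3/10 is not above 0.30
puts binary_sketch?("\x01\x02\x03\x04abcdef".b)   # true: 4/10 exceeds 0.30
```

The comparison is strict, so a sample sitting exactly at the threshold still counts as text.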

.detect_mime(input) ⇒ `String`?

Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).

Parameters:

input (String, Pathname) —

the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists; missing-file errors propagate as Errno::ENOENT.

Returns:

(String, nil)

# File 'lib/pikuri/file_type.rb', line 98

def detect_mime(input)
  bytes = sample_of(input)
  return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

  IMAGE_MAGIC_BYTES.each do |prefix, mime|
    return mime if bytes.start_with?(prefix)
  end
  return 'image/webp' if bytes.bytesize >= 12 &&
                         bytes.byteslice(0, 4) == 'RIFF'.b &&
                         bytes.byteslice(8, 4) == 'WEBP'.b

  nil
end
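The WebP branch is separate from the prefix table because a RIFF container stores a four-byte little-endian size between the 'RIFF' tag and the form type, so the first twelve bytes are not a fixed prefix. The sketch below isolates that check; webp_header? is a hypothetical name, and the sample headers are built by hand for illustration.

```ruby
# Hypothetical standalone copy of the WebP check from detect_mime.
def webp_header?(bytes)
  bytes = bytes.b
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&   # container tag at offset 0
    bytes.byteslice(8, 4) == 'WEBP'.b      # form type at offset 8
end

# Bytes 4..7 hold the chunk size (little-endian uint32) and differ per file.
size = [1024].pack('V')
webp = 'RIFF'.b + size + 'WEBP'.b + 'VP8 '.b
wav  = 'RIFF'.b + size + 'WAVE'.b + 'fmt '.b

puts webp_header?(webp)     # true
puts webp_header?(wav)      # false: RIFF container, but not WebP
puts webp_header?('RIFF')   # false: shorter than 12 bytes
```

Checking the form type at offset 8 is what keeps other RIFF formats, such as WAV audio, from being reported as image/webp.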

.read_as_text(path) ⇒ `String`

Read path and return its content as plain UTF-8 text. Two extraction paths, picked by detect_mime:

PDF: walked page-by-page via pdf-reader; each page’s extracted text is stripped and pages are joined with a blank line. A scanned-image PDF (no extractable text) comes back as the empty String. This is a deliberate silent skip; callers detect it by length if they care.
Plain text: anything that detect_mime doesn’t recognise and that binary? classifies as not binary. Read with UTF-8 encoding; behaviour on non-UTF-8 bytes is whatever File.read does with encoding: Encoding::UTF_8 (which is “leave invalid bytes in, let downstream decide”).

Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.

Path doesn’t exist → Errno::ENOENT.
Path is a directory → ArgumentError.
Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.
Binary content (per binary?) and not a recognised MIME → ArgumentError.
Malformed PDF → RuntimeError: pdf-reader’s MalformedPDFError / UnsupportedFeatureError / InvalidPageError are re-raised as a RuntimeError with the path included, so callers don’t need to know pdf-reader’s exception hierarchy.

Parameters:

path (Pathname) —

file to read.

Returns:

(String) —

UTF-8 text. May be empty (empty text file, or scanned-image PDF).

Raises:

(ArgumentError) —

if path isn’t a Pathname, points at a directory, is an image, or is binary.
(Errno::ENOENT) —

if path doesn’t exist.
(RuntimeError) —

on a malformed / unsupported PDF.

# File 'lib/pikuri/file_type.rb', line 177

def read_as_text(path)
  raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
  raise Errno::ENOENT, path.to_s unless path.exist?
  raise ArgumentError, "#{path} is a directory" if path.directory?

  mime = detect_mime(path)
  return read_pdf_text(path) if mime == 'application/pdf'
  raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

  path.read(encoding: Encoding::UTF_8)
end
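The “leave invalid bytes in, let downstream decide” behaviour of the plain-text path can be seen with a small standalone experiment. The helper name read_plain_text_sketch is hypothetical and mirrors only the final line of the method above; the scrub call stands for one possible downstream choice, not something the module does.

```ruby
require 'pathname'
require 'tempfile'

# Hypothetical mirror of the last line of read_as_text.
def read_plain_text_sketch(path)
  path.read(encoding: Encoding::UTF_8)
end

Tempfile.create('latin1') do |f|
  f.binmode
  f.write("caf\xE9\n".b)   # ISO-8859-1 "é": a lone 0xE9 is not valid UTF-8
  f.flush

  text = read_plain_text_sketch(Pathname.new(f.path))
  puts text.encoding          # UTF-8: the label is applied without transcoding
  puts text.valid_encoding?   # false: the invalid byte is still in the String
  puts text.scrub('?')        # caf? (one way a caller might clean up)
end
```

Because 0xE9 is above 127, binary? does not count it, so a Latin-1 file like this one reaches the plain-text path instead of being rejected as binary.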
