Module: Pikuri::FileType

Defined in: lib/pikuri/file_type.rb

Overview

Magic-byte content sniffing + text extraction, centralised. Four responsibilities:

FileType.detect_mime - recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf and the four image formats: PNG, JPEG, GIF, WebP), or nil for “unrecognised”: the content could be text or opaque binary, and the caller decides.
FileType.binary? - heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.
FileType.read_as_text - read a file and return its content as plain UTF-8 text. PDFs go through pdf-reader page-by-page; plain text passes through; images / binaries / missing files raise. This is the pure-extraction shape that consumers like Pikuri::VectorDb’s indexer want: no LLM-tool concerns such as paging, line numbering, or byte caps; just bytes in, text out.
FileType.read_as_text_paged - the LLM-tool shape: the same extraction as FileType.read_as_text, but lazily windowed to a line range with a byte cap, returning a Page value the caller renders. Shared by Workspace::Read and VectorDb::Tools::Read so the offset/limit/byte-cap windowing lives in one tested place; each tool keeps its own presentation (cat-n numbering, trailer wording, citation vs. path). Same refusal contract as FileType.read_as_text (raises on image / binary / missing / malformed-PDF).

FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open. FileType.read_as_text takes a Pathname only — there’s no bytes-in shortcut because the PDF case needs to seek the file.
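The two input forms can be pictured with a small normalising helper. The sketch below is an assumption about how a private helper such as sample_of (it is called in the method sources on this page, but its body is not shown) might behave; sketch_sample_of and SKETCH_SAMPLE_BYTES are hypothetical names, and the 512-byte value is a placeholder because the real SAMPLE_BYTES value is not listed here.

```ruby
require 'pathname'
require 'tempfile'

# Placeholder value; the real SAMPLE_BYTES is not shown in this documentation.
SKETCH_SAMPLE_BYTES = 512

# Hypothetical stand-in for the private sample_of helper.
def sketch_sample_of(input)
  case input
  when Pathname
    # Binary mode keeps the bytes untouched; read(n) returns nil on an empty file.
    input.open('rb') { |io| io.read(SKETCH_SAMPLE_BYTES) } || ''.b
  when String
    # The caller already holds the sample; just make sure it is binary-encoded.
    input.b
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

Tempfile.create('sketch') do |f|
  f.binmode
  f.write('%PDF-1.7 rest of file')
  f.flush
  from_path  = sketch_sample_of(Pathname.new(f.path))
  from_bytes = sketch_sample_of('%PDF-1.7 rest of file')
  puts from_path == from_bytes # both forms yield the same sample
end
```

A caller that wants both detect_mime and binary? for one file could take the sample once and pass the bytes to each, which is the second-open saving the paragraph above describes.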

Why a separate module

Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools (a future Workspace::Diff, an attachment-aware web fetcher, …) share one set of magic-byte truths.

Deliberate non-goals

*Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.
*No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.
*No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
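As a rough illustration of what caller-side routing without predicates could look like, here is a minimal sketch. The method name sketch_route and the returned symbols are hypothetical; only the two comparisons come from the text above.

```ruby
# Hypothetical caller-side routing on the value detect_mime returns.
def sketch_route(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/')
    :image
  else
    # nil means "unrecognised": the caller decides between text and binary.
    :text_or_binary
  end
end

puts sketch_route('application/pdf') # prints "pdf"
puts sketch_route('image/webp')      # prints "image"
puts sketch_route(nil)               # prints "text_or_binary"
```

The safe-navigation operator matters here: detect_mime may return nil, and mime&.start_with? then evaluates to nil instead of raising NoMethodError.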

Defined Under Namespace

Classes: Page

Constant Summary

SAMPLE_BYTES = Recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free. Returns: (Integer)

BINARY_NONPRINTABLE_THRESHOLD = Fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold. Returns: (Float)

0.30

IMAGE_MAGIC_BYTES = Magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here: its signature is split across the RIFF container header, so it is handled directly in detect_mime. Returns: (Hash{String => String})

{
  "\x89PNG\r\n\x1a\n".b => 'image/png',
  "\xff\xd8\xff".b      => 'image/jpeg',
  "GIF87a".b            => 'image/gif',
  "GIF89a".b            => 'image/gif'
}.freeze

PDF_MAGIC = PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2. Returns: (String)

'%PDF-'

PAGE_DEFAULT_LIMIT = Default line-window size for read_as_text_paged when the caller omits limit. Returns: (Integer)

PAGE_MAX_BYTES = Default hard byte cap on the content collected by a single read_as_text_paged call. Bypassable by paging via offset. The rendered output is slightly larger (line numbering, trailer); that’s the caller’s concern. Returns: (Integer)

50 * 1024

PAGE_MAX_LINE_LENGTH = Default per-line character cap; read_as_text_paged truncates longer lines and appends PAGE_LINE_TRUNCATION_MARKER. Returns: (Integer)

PAGE_LINE_TRUNCATION_MARKER = Suffix appended to a line truncated at PAGE_MAX_LINE_LENGTH. Returns: (String)

"... (line truncated to #{PAGE_MAX_LINE_LENGTH} chars)"

Class Method Summary

.binary?(input) ⇒ Boolean

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count bytes outside the printable \t \n \v \f \r + ASCII-32..126 range and ratio against the sample size.
.detect_mime(input) ⇒ String?

Recognise a file from its leading bytes.
.read_as_text(path) ⇒ String

Read path and return its content as plain UTF-8 text.
.read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page

Extract path as text and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.

Class Method Details

.binary?(input) ⇒ `Boolean`

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count bytes outside the printable \t \n \v \f \r + ASCII-32..126 range and flag the sample as binary when their share of the sample size exceeds BINARY_NONPRINTABLE_THRESHOLD. As the source below shows, only control bytes below 32 are counted; bytes of 0x80 and above, which covers every byte of a multi-byte UTF-8 sequence, are never counted, so UTF-8 text reads fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).

Parameters:

input (String, Pathname) —

the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists.

Returns:

(Boolean)

# File 'lib/pikuri/file_type.rb', line 187

def binary?(input)
  bytes = sample_of(input)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end
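To make the threshold behaviour concrete, the following sketch re-runs the same counting logic on a few hand-built samples. It is a standalone copy for illustration only (sketch_binary? and SKETCH_THRESHOLD are hypothetical names, and it accepts a String only), not a replacement for the method above.

```ruby
SKETCH_THRESHOLD = 0.30 # same value as BINARY_NONPRINTABLE_THRESHOLD

# Standalone copy of the heuristic, String input only.
def sketch_binary?(bytes)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero? # a single NUL byte decides immediately

    # Count control bytes other than \t \n \v \f \r (9..13).
    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > SKETCH_THRESHOLD
end

samples = {
  'plain ASCII'     => "hello\tworld\n",
  'UTF-8 text'      => "naïve café\n",
  'embedded NUL'    => "abc\x00def".b,
  '4 of 10 control' => "\x01\x02\x03\x04abcdef".b,
  '3 of 10 control' => "\x01\x02\x03abcdefg".b
}
samples.each { |label, bytes| puts "#{label}: #{sketch_binary?(bytes)}" }
```

The last two samples show that the comparison is strict: 40% control bytes is flagged, while exactly 30% is not.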

.detect_mime(input) ⇒ `String`?

Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).

Parameters:

input (String, Pathname) —

the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists; missing-file errors propagate as Errno::ENOENT.

Returns:

(String, nil)

# File 'lib/pikuri/file_type.rb', line 159

def detect_mime(input)
  bytes = sample_of(input)
  return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

  IMAGE_MAGIC_BYTES.each do |prefix, mime|
    return mime if bytes.start_with?(prefix)
  end
  return 'image/webp' if bytes.bytesize >= 12 &&
                         bytes.byteslice(0, 4) == 'RIFF'.b &&
                         bytes.byteslice(8, 4) == 'WEBP'.b

  nil
end
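The WebP branch differs from the flat prefixes because a RIFF container puts a four-byte size field between the "RIFF" tag and the form type. The sketch below builds synthetic 12-byte headers to show why both byte ranges are checked; sketch_riff_header and sketch_webp? are hypothetical helpers, and the size value is arbitrary.

```ruby
# Build a 12-byte RIFF container header: tag, little-endian size, form type.
def sketch_riff_header(form_type, size)
  'RIFF'.b + [size].pack('V') + form_type.b
end

# Same three conditions as the WebP branch of detect_mime.
def sketch_webp?(bytes)
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&
    bytes.byteslice(8, 4) == 'WEBP'.b
end

webp = sketch_riff_header('WEBP', 1234)
wav  = sketch_riff_header('WAVE', 1234) # another RIFF-based format

puts sketch_webp?(webp)                  # prints "true"
puts sketch_webp?(wav)                   # prints "false"
puts sketch_webp?(webp.byteslice(0, 8))  # prints "false" (sample too short)
```

A plain start_with?('RIFF') test would also match WAV and other RIFF files, which is presumably why the signature is not listed in IMAGE_MAGIC_BYTES.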

.read_as_text(path) ⇒ `String`

Read path and return its content as plain UTF-8 text. Two extraction paths, picked by detect_mime:

PDF: walked page-by-page via pdf-reader; each page’s extracted text is stripped and pages are joined with a blank line. A scanned-image PDF (no extractable text) comes back as the empty String. This is a deliberate silent skip; callers can detect it by length if they care.
Plain text: anything that detect_mime doesn’t recognise and that binary? does not flag. Read with UTF-8 encoding; behaviour on non-UTF-8 bytes is whatever File.read does with encoding: Encoding::UTF_8 (which is “leave invalid bytes in, let downstream decide”).

Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.

Path doesn’t exist → Errno::ENOENT.
Path is a directory → ArgumentError.
Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.
Binary content (per binary?) and not a recognised MIME → ArgumentError.
Malformed PDF: pdf-reader’s MalformedPDFError / UnsupportedFeatureError / InvalidPageError are re-raised as a RuntimeError with the path included, so callers don’t need to know pdf-reader’s exception hierarchy.

Parameters:

path (Pathname) —

file to read.

Returns:

(String) —

UTF-8 text. May be empty (empty text file, or scanned-image PDF).

Raises:

(ArgumentError) —

if path isn’t a Pathname, points at a directory, is an image, or is binary.
(Errno::ENOENT) —

if path doesn’t exist.
(RuntimeError) —

on a malformed / unsupported PDF.

# File 'lib/pikuri/file_type.rb', line 238

def read_as_text(path)
  raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
  raise Errno::ENOENT, path.to_s unless path.exist?
  raise ArgumentError, "#{path} is a directory" if path.directory?

  mime = detect_mime(path)
  return read_pdf_text(path) if mime == 'application/pdf'
  raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

  path.read(encoding: Encoding::UTF_8)
end
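Since every refusal raises, an LLM-facing caller has to translate the exceptions itself. The following is a minimal sketch of such a wrapper, assuming an "Error: ..." string is the desired observation format; sketch_observe and its message wording are hypothetical, and in real use the block would call Pikuri::FileType.read_as_text(path).

```ruby
# Hypothetical wrapper: run an extraction block and map the documented
# raises (ArgumentError, Errno::ENOENT, RuntimeError) to an error string.
def sketch_observe
  yield
rescue Errno::ENOENT => e
  "Error: file not found (#{e.message})"
rescue ArgumentError, RuntimeError => e
  "Error: #{e.message}"
end

# Stand-in blocks simulate the outcomes without needing real files.
puts sketch_observe { 'extracted text' }
puts sketch_observe { raise ArgumentError, 'notes.png is an image (image/png); cannot extract as text' }
puts sketch_observe { raise Errno::ENOENT, 'missing.txt' }
```

Rescuing only the three documented classes, rather than StandardError, keeps unexpected failures visible instead of folding them into an observation.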

.read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ `Page`

Extract path as text and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length. Lazy by design — a text file is streamed line-by-line and a PDF is parsed page-by-page only until the window fills, so reading the first page of a 500-page PDF parses a handful of pages, not all of them.

Same routing and refusal contract as read_as_text. PDFs are extracted with “— Page N —” marker lines, unlike read_as_text’s marker-free join: paging is a display path, while the marker-free form stays the indexing path. Images, binaries, directories, missing files, and malformed PDFs all raise rather than returning a sentinel. The LLM-facing callers map those into “Error: …” observations themselves.

Parameters:

path (Pathname) —

file to read.
offset (Integer) (defaults to: 1) —

1-indexed first line to include. The caller is responsible for validating offset >= 1.
limit (Integer) (defaults to: PAGE_DEFAULT_LIMIT) —

maximum lines to collect. Caller validates limit >= 1.
max_bytes (Integer) (defaults to: PAGE_MAX_BYTES) —

hard byte cap on collected content.
max_line_length (Integer) (defaults to: PAGE_MAX_LINE_LENGTH) —

per-line truncation threshold.

Returns:

(Page) —

the windowed slice.

Raises:

(ArgumentError) —

if path isn’t a Pathname, is a directory, an image, or binary.
(Errno::ENOENT) —

if path doesn’t exist.
(RuntimeError) —

on a malformed / unsupported PDF.

# File 'lib/pikuri/file_type.rb', line 300

def read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT,
                       max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH)
  raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
  raise Errno::ENOENT, path.to_s unless path.exist?
  raise ArgumentError, "#{path} is a directory" if path.directory?

  mime = detect_mime(path)
  if mime == 'application/pdf'
    return paged_pdf(path, offset: offset, limit: limit,
                           max_bytes: max_bytes, max_line_length: max_line_length)
  end
  raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

  paged_text(path, offset: offset, limit: limit,
                   max_bytes: max_bytes, max_line_length: max_line_length)
end
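The private paged_text and paged_pdf helpers are not shown on this page, so the following is only a sketch of how offset/limit/byte-cap windowing over a lazy line source might work. SketchWindow, its fields, sketch_window, and the default values are all hypothetical; the real Page class and the exact byte-cap rule may differ.

```ruby
# Hypothetical result shape; the real Page fields are not documented here.
SketchWindow = Struct.new(:lines, :last_line, :byte_capped, keyword_init: true)

# lines: any Enumerable of Strings (File.foreach, a per-page enumerator, ...).
def sketch_window(lines, offset: 1, limit: 3, max_bytes: 50 * 1024, max_line_length: 20)
  marker = "... (line truncated to #{max_line_length} chars)"
  collected = []
  bytes = 0
  byte_capped = false
  last_line = offset - 1

  lines.each_with_index do |line, index|
    number = index + 1
    next if number < offset             # skip lines before the window
    break if collected.size >= limit    # window full: stop pulling lines

    text = line.chomp
    text = text[0, max_line_length] + marker if text.length > max_line_length
    if bytes + text.bytesize > max_bytes
      byte_capped = true                # assumed rule: drop the overflowing line
      break
    end

    collected << text
    bytes += text.bytesize
    last_line = number
  end

  SketchWindow.new(lines: collected, last_line: last_line, byte_capped: byte_capped)
end

# A counting source shows that only a handful of lines are ever produced.
pulled = 0
source = Enumerator.new do |y|
  1.upto(1_000) do |i|
    pulled += 1
    y << "line #{i}\n"
  end
end

page = sketch_window(source, offset: 2, limit: 3)
puts page.lines.inspect
puts "pulled #{pulled} of 1000 lines"
```

Breaking out of the iteration as soon as the window is full is what keeps the read lazy; a caller could use last_line + 1 as the next offset to page past the byte cap.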
