Module: Pikuri::FileType

Defined in:
lib/pikuri/file_type.rb

Overview

Magic-byte content sniffing + text extraction, centralised. Four responsibilities:

  • FileType.detect_mime — recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf, the four image formats), or nil for “unrecognised — could be text, could be opaque binary; caller decides”.

  • FileType.binary? — heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.

  • FileType.read_as_text — read a file and return its content as plain UTF-8 text. PDFs go through pdf-reader page-by-page; plain text passes through; images / binaries / missing files raise. This is the pure-extraction shape that consumers like Pikuri::VectorDb’s indexer want (no LLM-tool concerns — no paging, no line numbering, no byte caps; just bytes-in-text-out).

  • FileType.read_as_text_paged — the LLM-tool shape: the same extraction as FileType.read_as_text, but lazily windowed to a line range with a byte cap, returning a Page value the caller renders. Shared by Workspace::Read and VectorDb::Tools::Read so the offset/limit/byte-cap windowing lives in one tested place; each tool keeps its own presentation (cat-n numbering, trailer wording, citation vs. path). Same refusal contract as FileType.read_as_text (raises on image / binary / missing / malformed-PDF).

FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open. FileType.read_as_text takes a Pathname only — there’s no bytes-in shortcut because the PDF case needs to seek the file.
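
The two input forms can be pictured with a small normalising helper. The sketch below is only an assumption about how such a helper might look: the module's own private sampling code is not shown in this documentation, and sample_bytes is a hypothetical name. It illustrates the bytes form passing through untouched and the Pathname form costing one binary-mode read of at most SAMPLE_BYTES.

```ruby
require 'pathname'
require 'tempfile'

SAMPLE_BYTES = 4096 # mirrors the documented constant

# Hypothetical helper, not pikuri's actual code: normalise either input form
# to a binary String of at most SAMPLE_BYTES.
def sample_bytes(input)
  case input
  when String   then input.b                             # caller already took the sample
  when Pathname then input.binread(SAMPLE_BYTES) || ''.b # nil-safe for empty files
  else raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

# Usage: sample once, then reuse the same bytes for every check.
Tempfile.create('sample') do |f|
  f.binmode
  f.write('%PDF-1.7' + ('x' * 10_000))
  f.flush

  bytes = sample_bytes(Pathname.new(f.path))
  puts bytes.bytesize               # never more than SAMPLE_BYTES
  puts bytes.start_with?('%PDF-')
  # With pikuri loaded, `bytes` could be handed to both
  # Pikuri::FileType.detect_mime(bytes) and Pikuri::FileType.binary?(bytes)
  # without opening the file a second time.
end
```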

Why a separate module

Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools (a future Workspace::Diff, an attachment-aware web fetcher, …) share one set of magic-byte truths.

Deliberate non-goals

  • *Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.

  • *No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.

  • *No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
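
Without predicates, routing on a detect_mime result stays a short conditional. A minimal sketch, in which route_for and the returned symbols are hypothetical names chosen for illustration:

```ruby
# Hypothetical router: maps a detect_mime result (String or nil) to a handler tag.
def route_for(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/') # safe navigation covers the nil case
    :image
  else
    :text_or_binary                 # nil: the caller still has to ask binary?
  end
end

puts route_for('application/pdf') # pdf
puts route_for('image/webp')      # image
puts route_for(nil)               # text_or_binary
```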

Defined Under Namespace

Classes: Page

Constant Summary

SAMPLE_BYTES =

Returns recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free.

Returns:

  • (Integer)

4096
BINARY_NONPRINTABLE_THRESHOLD =

Returns fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold.

Returns:

  • (Float)

0.30
IMAGE_MAGIC_BYTES =

Returns magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here — its signature is split across the RIFF container header — and is handled directly in detect_mime.

Returns:

  • (Hash{String => String})

{
  "\x89PNG\r\n\x1a\n".b => 'image/png',
  "\xff\xd8\xff".b      => 'image/jpeg',
  "GIF87a".b            => 'image/gif',
  "GIF89a".b            => 'image/gif'
}.freeze
PDF_MAGIC =

Returns PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2.

Returns:

  • (String)

'%PDF-'
PAGE_DEFAULT_LIMIT =

Returns default line-window size for read_as_text_paged when the caller omits limit.

Returns:

  • (Integer)

2000
PAGE_MAX_BYTES =

Returns default hard byte cap on the content collected by a single read_as_text_paged call. Bypassable by paging via offset. The rendered output is slightly larger (line numbering, trailer) — that’s the caller’s concern.

Returns:

  • (Integer)

50 * 1024
PAGE_MAX_LINE_LENGTH =

Returns default per-line character cap; read_as_text_paged truncates longer lines and appends PAGE_LINE_TRUNCATION_MARKER.

Returns:

  • (Integer)

2000
PAGE_LINE_TRUNCATION_MARKER =

Returns suffix appended to a line truncated at PAGE_MAX_LINE_LENGTH.

Returns:

  • (String)

"... (line truncated to #{PAGE_MAX_LINE_LENGTH} chars)"

Class Method Details

.binary?(input) ⇒ Boolean

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the control bytes other than \t \n \v \f \r (that is, bytes below 9 or in 14..31) and compare their share of the sample against BINARY_NONPRINTABLE_THRESHOLD. Bytes of 32 and above are never counted, which includes UTF-8 lead and continuation bytes (0x80 and up), so UTF-8 text reads fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).

Parameters:

  • input (String, Pathname)

    the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists.

Returns:

  • (Boolean)


# File 'lib/pikuri/file_type.rb', line 187

def binary?(input)
  bytes = sample_of(input)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end
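
To see how the heuristic behaves on concrete inputs, the standalone sketch below restates the same byte test outside the module. The looks_binary? name is made up for illustration, and the threshold constant simply mirrors BINARY_NONPRINTABLE_THRESHOLD.

```ruby
THRESHOLD = 0.30 # mirrors BINARY_NONPRINTABLE_THRESHOLD

# Standalone restatement of the documented heuristic, for experimentation only.
def looks_binary?(bytes)
  return false if bytes.empty?

  control = 0
  bytes.each_byte do |b|
    return true if b.zero?                      # any NUL byte decides immediately
    control += 1 if b < 9 || (b > 13 && b < 32) # control bytes except \t \n \v \f \r
  end
  control.to_f / bytes.bytesize > THRESHOLD
end

puts looks_binary?('')                 # false: empty sample
puts looks_binary?("héllo wörld\n")    # false: high UTF-8 bytes are not counted
puts looks_binary?("abc\x00def")       # true: contains NUL
puts looks_binary?("\x01\x02\x03abc")  # true: 3 of 6 bytes are control bytes
puts looks_binary?("\x01abcdefghi")    # false: 1 of 10 is under the threshold
```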

.detect_mime(input) ⇒ String?

Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).

Parameters:

  • input (String, Pathname)

    the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists; missing-file errors propagate as Errno::ENOENT.

Returns:

  • (String, nil)


# File 'lib/pikuri/file_type.rb', line 159

def detect_mime(input)
  bytes = sample_of(input)
  return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

  IMAGE_MAGIC_BYTES.each do |prefix, mime|
    return mime if bytes.start_with?(prefix)
  end
  return 'image/webp' if bytes.bytesize >= 12 &&
                         bytes.byteslice(0, 4) == 'RIFF'.b &&
                         bytes.byteslice(8, 4) == 'WEBP'.b

  nil
end
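
The WebP branch is the one signature that cannot live in a prefix table: bytes 4..7 of a RIFF container hold the file size, so they differ from file to file. The sketch below isolates that check and shows a RIFF file of another type (WAVE) falling through. The webp_header? helper and the sample byte strings are illustrative and not part of pikuri.

```ruby
# Illustrative helper: the same three conditions as the WebP branch of detect_mime.
def webp_header?(bytes)
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&
    bytes.byteslice(8, 4) == 'WEBP'.b
end

size = [36].pack('V') # little-endian 32-bit size field; its value varies per file
webp = ('RIFF' + size + 'WEBPVP8 ').b
wave = ('RIFF' + size + 'WAVEfmt ').b

puts webp_header?(webp)     # true
puts webp_header?(wave)     # false: RIFF container, but not WebP
puts webp_header?('RIFF'.b) # false: shorter than the 12-byte header
```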

.read_as_text(path) ⇒ String

Read path and return its content as plain UTF-8 text. Two extraction paths, picked by detect_mime:

  • PDF — walked page-by-page via pdf-reader; each page’s extracted text is stripped and pages are joined with a blank line. A scanned-image PDF (no extractable text) comes back as the empty String — a deliberate silent skip, callers detect by length if they care.

  • Plain text — anything that detect_mime doesn’t recognise and that binary? accepts. Read with UTF-8 encoding; behaviour on non-UTF-8 bytes is whatever File.read does with encoding: Encoding::UTF_8 (which is “leave invalid bytes in, let downstream decide”).

Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.

  • Path doesn’t exist → Errno::ENOENT.

  • Path is a directory → ArgumentError.

  • Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.

  • Binary content (per binary?) and not a recognised MIME → ArgumentError.

  • Malformed PDF — pdf-reader’s MalformedPDFError / UnsupportedFeatureError / InvalidPageError are re-raised as a RuntimeError with the path included so callers don’t need to know pdf-reader’s exception hierarchy.

Parameters:

  • path (Pathname)

    file to read.

Returns:

  • (String)

    UTF-8 text. May be empty (empty text file, or scanned-image PDF).

Raises:

  • (ArgumentError)

    if path isn’t a Pathname, points at a directory, is an image, or is binary.

  • (Errno::ENOENT)

    if path doesn’t exist.

  • (RuntimeError)

    on a malformed / unsupported PDF.



# File 'lib/pikuri/file_type.rb', line 238

def read_as_text(path)
  raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
  raise Errno::ENOENT, path.to_s unless path.exist?
  raise ArgumentError, "#{path} is a directory" if path.directory?

  mime = detect_mime(path)
  return read_pdf_text(path) if mime == 'application/pdf'
  raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

  path.read(encoding: Encoding::UTF_8)
end
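
Because every refusal is an exception, an internal caller that wants to skip unreadable files has to rescue the three documented classes itself. The sketch below shows one possible shape. extract_or_skip and its result tags are hypothetical, and the extractor is passed in as a callable so the example runs without pikuri; with the module loaded, Pikuri::FileType.method(:read_as_text) would presumably fill that role.

```ruby
require 'pathname'

# Hypothetical wrapper: turns the documented refusal contract into tagged results.
# `extractor` is any callable that takes a Pathname and returns a String.
def extract_or_skip(path, extractor)
  text = extractor.call(path)
  text.empty? ? [:empty, nil] : [:ok, text] # empty file or scanned-image PDF
rescue Errno::ENOENT
  [:missing, nil]
rescue ArgumentError => e
  [:refused, e.message]                     # directory, image, or binary
rescue RuntimeError => e
  [:bad_pdf, e.message]                     # malformed / unsupported PDF
end

# Stand-in extractor so the sketch is runnable on its own.
stand_in = lambda do |path|
  raise Errno::ENOENT, path.to_s unless path.exist?

  path.read(encoding: Encoding::UTF_8)
end

p extract_or_skip(Pathname.new('/no/such/file'), stand_in)
```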

.read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page

Extract path as text and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length. Lazy by design — a text file is streamed line-by-line and a PDF is parsed page-by-page only until the window fills, so reading the first page of a 500-page PDF parses a handful of pages, not all of them.

Same routing and refusal contract as read_as_text, with one difference in PDF output: pages are extracted with “— Page N —” marker lines, unlike read_as_text’s marker-free join. Paging is a display path, while the marker-free form stays the indexing path. Images, binaries, directories, missing files, and malformed PDFs all raise rather than returning a sentinel. The LLM-facing callers map those into “Error: …” observations themselves.

Parameters:

  • path (Pathname)

    file to read.

  • offset (Integer) (defaults to: 1)

    1-indexed first line to include. The caller is responsible for validating offset >= 1.

  • limit (Integer) (defaults to: PAGE_DEFAULT_LIMIT)

    maximum lines to collect. Caller validates limit >= 1.

  • max_bytes (Integer) (defaults to: PAGE_MAX_BYTES)

    hard byte cap on collected content.

  • max_line_length (Integer) (defaults to: PAGE_MAX_LINE_LENGTH)

    per-line truncation threshold.

Returns:

  • (Page)

    the windowed slice.

Raises:

  • (ArgumentError)

    if path isn’t a Pathname, is a directory, an image, or binary.

  • (Errno::ENOENT)

    if path doesn’t exist.

  • (RuntimeError)

    on a malformed / unsupported PDF.



# File 'lib/pikuri/file_type.rb', line 300

def read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT,
                       max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH)
  raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
  raise Errno::ENOENT, path.to_s unless path.exist?
  raise ArgumentError, "#{path} is a directory" if path.directory?

  mime = detect_mime(path)
  if mime == 'application/pdf'
    return paged_pdf(path, offset: offset, limit: limit,
                           max_bytes: max_bytes, max_line_length: max_line_length)
  end
  raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

  paged_text(path, offset: offset, limit: limit,
                   max_bytes: max_bytes, max_line_length: max_line_length)
end
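
The private paged_text and paged_pdf helpers are not shown here, so the following is only a sketch of how a lazy line window with a byte cap and per-line truncation could work for plain text. SketchPage, its fields, and window_lines are hypothetical names; the real Page class may expose different attributes, and the real helpers may count bytes or apply the truncation marker differently.

```ruby
require 'stringio'

# Hypothetical stand-in for the Page value; field names are assumptions.
SketchPage = Struct.new(:lines, :first_line, :truncated, keyword_init: true)

# Stream lines from `io`, keep those inside the window, stop as soon as the
# line limit or the byte cap is hit. Nothing past the window is read.
def window_lines(io, offset: 1, limit: 2000, max_bytes: 50 * 1024, max_line_length: 2000)
  marker = "... (line truncated to #{max_line_length} chars)"
  lines = []
  bytes = 0
  truncated = false

  io.each_line.with_index(1) do |raw, lineno|
    next if lineno < offset # skip lines before the window

    if lines.size >= limit  # window is full and more input remains
      truncated = true
      break
    end

    line = raw.chomp
    line = line[0, max_line_length] + marker if line.length > max_line_length
    if bytes + line.bytesize > max_bytes # byte cap reached
      truncated = true
      break
    end

    lines << line
    bytes += line.bytesize
  end

  SketchPage.new(lines: lines, first_line: offset, truncated: truncated)
end

page = window_lines(StringIO.new("a\nb\nc\nd\n"), offset: 2, limit: 2)
p page.lines     # ["b", "c"]
p page.truncated # true: line 4 was left unread
```

A caller would render line numbers and any trailer from such a value, which is the split between windowing and presentation described above.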