Module: Pikuri::FileType

Defined in:
lib/pikuri/file_type.rb

Overview

Magic-byte content sniffing + text extraction, centralised. Three responsibilities:

  • FileType.detect_mime: recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf, the four image formats), or nil for “unrecognised: could be text, could be opaque binary; caller decides”.

  • FileType.binary?: heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.

  • FileType.read_as_text: read a file and return its content as plain UTF-8 text. PDFs go through pdf-reader page-by-page; plain text passes through; images / binaries / missing files raise. This is the pure-extraction shape that consumers like Pikuri::VectorDb’s indexer want (no LLM-tool concerns: no paging, no line numbering, no byte caps; just bytes in, text out).

FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open. FileType.read_as_text takes a Pathname only — there’s no bytes-in shortcut because the PDF case needs to seek the file.
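The dual-input convention described above can be sketched as follows. The method listings on this page call a helper named sample_of, but its body is not shown here, so the version below is an assumption about its shape rather than pikuri’s actual code; SAMPLE_BYTES mirrors the documented value.

```ruby
require 'pathname'

SAMPLE_BYTES = 4096 # mirrors the documented constant

# Hypothetical stand-in for the helper the listings call sample_of.
# Accepts raw bytes or a Pathname and always returns a binary String.
def sample_of(input)
  case input
  when Pathname
    # binread with a length returns nil for an empty file, so fall
    # back to an empty binary String.
    input.binread(SAMPLE_BYTES) || ''.b
  when String
    input.b # binary-encoded copy; the caller's String is untouched
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end
```

A caller that wants both checks on the same file could read the sample once, for example with path.binread(Pikuri::FileType::SAMPLE_BYTES), and pass the resulting String to each method, which avoids the second open.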

Why a separate module

Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools (a future Workspace::Diff, an attachment-aware web fetcher, …) share one set of magic-byte truths.

Deliberate non-goals

  • *Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.

  • *No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.

  • *No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
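As an illustration of the last point, caller-side routing on a detect_mime result might look like the sketch below. The method name route_for and the returned symbols are hypothetical; only the two comparisons come from the text above.

```ruby
# Hypothetical caller-side routing on a detect_mime result,
# which is either a MIME String or nil.
def route_for(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/') # safe navigation covers the nil case
    :image
  else
    :text_or_binary # nil: unrecognised, the caller decides (e.g. via binary?)
  end
end

route_for('application/pdf') # => :pdf
route_for('image/webp')      # => :image
route_for(nil)               # => :text_or_binary
```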

Constant Summary

SAMPLE_BYTES =

Returns recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free.

Returns:

  • (Integer)

4096
BINARY_NONPRINTABLE_THRESHOLD =

Returns fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold.

Returns:

  • (Float)

0.30
IMAGE_MAGIC_BYTES =

Returns magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here — its signature is split across the RIFF container header — and is handled directly in detect_mime.

Returns:

  • (Hash{String => String})

{
  "\x89PNG\r\n\x1a\n".b => 'image/png',
  "\xff\xd8\xff".b      => 'image/jpeg',
  "GIF87a".b            => 'image/gif',
  "GIF89a".b            => 'image/gif'
}.freeze
PDF_MAGIC =

Returns PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2.

Returns:

  • (String)

'%PDF-'

Class Method Summary

Class Method Details

.binary?(input) ⇒ Boolean

Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the control bytes that are not whitespace (anything below 32 other than \t \n \v \f \r) and compare their share of the sample against BINARY_NONPRINTABLE_THRESHOLD. Bytes of 127 and above are never counted, so UTF-8 lead and continuation bytes (0x80 and up) pass through unflagged and UTF-8 text reads fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).

Parameters:

  • input (String, Pathname)

    the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists.

Returns:

  • (Boolean)


# File 'lib/pikuri/file_type.rb', line 126

def binary?(input)
  bytes = sample_of(input)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end
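
To make the edge cases concrete, the listing above can be restated as a standalone method over byte Strings. This is a sketch for experimentation, not pikuri’s code: it inlines the threshold, drops the Pathname handling, and the name looks_binary? is illustrative.

```ruby
THRESHOLD = 0.30 # same value as BINARY_NONPRINTABLE_THRESHOLD

# Standalone restatement of the heuristic for byte Strings only.
def looks_binary?(bytes)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero? # a single NUL byte decides immediately

    # Count control bytes except \t \n \v \f \r (9..13).
    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > THRESHOLD
end

looks_binary?('')                          # => false (empty sample)
looks_binary?("héllo wörld\n")             # => false (high bytes are not counted)
looks_binary?("text\x00more".b)            # => true  (NUL byte)
looks_binary?("\x01\x02\x03abcdefg".b)     # => false (3 of 10 is not above 0.30)
looks_binary?("\x01\x02\x03\x04abcdef".b)  # => true  (4 of 10 exceeds 0.30)
```

The comparison is strict, so a sample sitting exactly at the threshold is still treated as text.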

.detect_mime(input) ⇒ String?

Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).

Parameters:

  • input (String, Pathname)

    the bytes to inspect, or a Pathname that this method opens in binary mode and reads up to SAMPLE_BYTES from. Caller is responsible for verifying the path exists; missing-file errors propagate as Errno::ENOENT.

Returns:

  • (String, nil)


# File 'lib/pikuri/file_type.rb', line 98

def detect_mime(input)
  bytes = sample_of(input)
  return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

  IMAGE_MAGIC_BYTES.each do |prefix, mime|
    return mime if bytes.start_with?(prefix)
  end
  return 'image/webp' if bytes.bytesize >= 12 &&
                         bytes.byteslice(0, 4) == 'RIFF'.b &&
                         bytes.byteslice(8, 4) == 'WEBP'.b

  nil
end
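
The WebP branch checks two separate four-byte fields because a RIFF file stores a 32-bit little-endian chunk size between the 'RIFF' tag and the 'WEBP' FourCC, so no fixed prefix can match it. The sketch below builds such a header to show the layout; the helper names webp_header and webp? are illustrative and not part of pikuri.

```ruby
# Build the first 12 bytes of a RIFF/WebP file: the 'RIFF' tag, a
# 32-bit little-endian chunk size, then the 'WEBP' FourCC.
def webp_header(chunk_size)
  'RIFF'.b + [chunk_size].pack('V') + 'WEBP'.b
end

# Same three-part test as the WebP branch of detect_mime.
def webp?(bytes)
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&
    bytes.byteslice(8, 4) == 'WEBP'.b
end

webp?(webp_header(1024))                          # => true
webp?(webp_header(99_999))                        # => true (size bytes differ)
webp?('RIFF'.b + [1024].pack('V') + 'WAVE'.b)     # => false (RIFF, but a WAV file)
webp?('RIFF'.b)                                   # => false (shorter than 12 bytes)
```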

.read_as_text(path) ⇒ String

Read path and return its content as plain UTF-8 text. Two extraction paths, picked by detect_mime:

  • PDF — walked page-by-page via pdf-reader; each page’s extracted text is stripped and pages are joined with a blank line. A scanned-image PDF (no extractable text) comes back as the empty String — a deliberate silent skip, callers detect by length if they care.

  • **Plain text** — anything that detect_mime doesn’t recognise and that binary? accepts. Read with UTF-8 encoding; behaviour on non-UTF-8 bytes is whatever File.read does with encoding: Encoding::UTF_8 (which is “leave invalid bytes in, let downstream decide”).
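
The “leave invalid bytes in” behaviour of the plain-text path can be observed with the standard library alone. In the sketch below, read_tagged_utf8 is an illustrative name wrapping the same Pathname#read call that read_as_text uses, and the scrub step is one possible downstream choice, not something pikuri is documented to do.

```ruby
require 'pathname'
require 'tempfile'

# Read a file the way the plain-text path does: the bytes are tagged
# as UTF-8 without being validated or transcoded.
def read_tagged_utf8(path)
  path.read(encoding: Encoding::UTF_8)
end

Tempfile.create('latin1') do |f|
  f.binmode
  f.write("caf\xE9\n".b) # "café" in ISO-8859-1; a lone 0xE9 is invalid UTF-8
  f.flush

  text = read_tagged_utf8(Pathname.new(f.path))
  text.encoding        # => #<Encoding:UTF-8>
  text.valid_encoding? # => false
  text.scrub('?')      # => "caf?\n"
end
```

A consumer that needs strictly valid UTF-8 can check valid_encoding? on the returned String and decide whether to scrub, transcode, or reject.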

Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.

  • Path doesn’t exist → Errno::ENOENT.

  • Path is a directory → ArgumentError.

  • Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.

  • Binary content (per binary?) and not a recognised MIME → ArgumentError.

  • Malformed PDF: pdf-reader’s MalformedPDFError / UnsupportedFeatureError / InvalidPageError are re-raised as a RuntimeError with the path included so callers don’t need to know pdf-reader’s exception hierarchy.

Parameters:

  • path (Pathname)

    file to read.

Returns:

  • (String)

    UTF-8 text. May be empty (empty text file, or scanned-image PDF).

Raises:

  • (ArgumentError)

    if path isn’t a Pathname, points at a directory, is an image, or is binary.

  • (Errno::ENOENT)

    if path doesn’t exist.

  • (RuntimeError)

    on a malformed / unsupported PDF.



# File 'lib/pikuri/file_type.rb', line 177

def read_as_text(path)
  raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
  raise Errno::ENOENT, path.to_s unless path.exist?
  raise ArgumentError, "#{path} is a directory" if path.directory?

  mime = detect_mime(path)
  return read_pdf_text(path) if mime == 'application/pdf'
  raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
  raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

  path.read(encoding: Encoding::UTF_8)
end
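
Because every refusal is an exception, an internal caller such as an indexer has to decide which ones to tolerate. The sketch below shows one possible wrapper; extract_or_skip and its reader: parameter are hypothetical, and the reader is injected so the example runs without pikuri loaded.

```ruby
require 'pathname'

# Hypothetical indexer-side wrapper. reader is any callable that
# follows read_as_text's contract: returns a String, or raises
# Errno::ENOENT, ArgumentError, or RuntimeError.
def extract_or_skip(path, reader:)
  text = reader.call(path)
  text.empty? ? nil : text # empty text file or scanned-image PDF
rescue Errno::ENOENT, ArgumentError, RuntimeError => e
  warn "skipping #{path}: #{e.message}"
  nil
end
```

With pikuri loaded, the reader would presumably be Pikuri::FileType.method(:read_as_text). Treating an empty result as a skip follows the note above that callers detect scanned-image PDFs by length.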