Module: Pikuri::FileType
Defined in: lib/pikuri/file_type.rb
Overview
Magic-byte content sniffing and text extraction, centralised. Three responsibilities:
- FileType.detect_mime: recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf, the four image formats), or nil for “unrecognised: could be text, could be opaque binary; caller decides”.
- FileType.binary?: heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.
- FileType.read_as_text: read a file and return its content as plain UTF-8 text. PDFs go through pdf-reader page-by-page; plain text passes through; images, binaries and missing files raise. This is the pure-extraction shape that consumers like Pikuri::VectorDb’s indexer want (no LLM-tool concerns: no paging, no line numbering, no byte caps; just bytes in, text out).
FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open. FileType.read_as_text takes a Pathname only — there’s no bytes-in shortcut because the PDF case needs to seek the file.
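The two input forms can be pictured with a small sketch. The method sources on this page call a helper named sample_of, but its body is not shown here, so sample_of_sketch below is a hypothetical stand-in written only from the description above; the real helper may differ.

```ruby
require 'pathname'
require 'tempfile'

SAMPLE_BYTES = 4096 # same value as the documented constant

# Hypothetical stand-in for the module's sampling helper (assumption, not pikuri code).
def sample_of_sketch(input)
  case input
  when Pathname
    # Open in binary mode and read at most SAMPLE_BYTES.
    # IO#read(n) returns nil at EOF, so an empty file becomes an empty binary String.
    input.open('rb') { |io| io.read(SAMPLE_BYTES) } || ''.b
  when String
    input.b # the caller already took the sample; treat it as raw bytes
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

# Sample once, then reuse the same bytes for several checks.
Tempfile.create('sample') do |f|
  f.binmode
  f.write('%PDF-1.7' + 'x' * 10_000)
  f.flush
  sample = sample_of_sketch(Pathname.new(f.path))
  puts sample.bytesize
  puts sample.start_with?('%PDF-')
end
```

A caller that wants both the MIME type and the binary verdict for one file would take the sample once, as above, and pass the same String to both methods, which avoids the second open.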
Why a separate module
Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools (a future Workspace::Diff, an attachment-aware web fetcher, …) share one set of magic-byte truths.
Deliberate non-goals
- *Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.
- *No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.
- *No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
Constant Summary
- SAMPLE_BYTES = 4096
  Recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free.
- BINARY_NONPRINTABLE_THRESHOLD = 0.30
  Fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold.
- IMAGE_MAGIC_BYTES = { "\x89PNG\r\n\x1a\n".b => 'image/png', "\xff\xd8\xff".b => 'image/jpeg', "GIF87a".b => 'image/gif', "GIF89a".b => 'image/gif' }.freeze
  Magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here because its signature is split across the RIFF container header; it is handled directly in detect_mime.
- PDF_MAGIC = '%PDF-'
  PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2.
Class Method Summary
- .binary?(input) ⇒ Boolean
  Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the control bytes other than \t \n \v \f \r and compare their share of the sample against the threshold.
- .detect_mime(input) ⇒ String?
  Recognise a file from its leading bytes.
- .read_as_text(path) ⇒ String
  Read path and return its content as plain UTF-8 text.
Class Method Details
.binary?(input) ⇒ Boolean
Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the control bytes other than \t \n \v \f \r (bytes 1..8 and 14..31) and flag the sample as binary when their share of the sample size exceeds BINARY_NONPRINTABLE_THRESHOLD. As the source below shows, only bytes under 32 are ever counted, so DEL (127) and every byte from 0x80 upward are left alone. That includes UTF-8 lead and continuation bytes, which is what lets UTF-8 text read fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).
    # File 'lib/pikuri/file_type.rb', line 126
    def binary?(input)
      bytes = sample_of(input)
      return false if bytes.empty?

      non_printable = 0
      bytes.each_byte do |b|
        return true if b.zero?

        non_printable += 1 if b < 9 || (b > 13 && b < 32)
      end
      non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
    end
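A few hand-built samples make the threshold behaviour concrete. The sketch below restates the listed logic for a bytes-only input so it runs without pikuri; binary_sketch? and THRESHOLD are illustrative names, not part of the module.

```ruby
THRESHOLD = 0.30 # mirrors BINARY_NONPRINTABLE_THRESHOLD

# Bytes-only restatement of the listed logic (no Pathname handling).
def binary_sketch?(bytes)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero? # any NUL decides immediately

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > THRESHOLD
end

p binary_sketch?("plain text\n")                # => false (newline is allowed)
p binary_sketch?("caf\u00e9 na\u00efve")        # => false (multibyte UTF-8 is not counted)
p binary_sketch?("abc\x00def".b)                # => true  (contains NUL)
p binary_sketch?("\x01\x02\x03abcdefg".b)       # => false (3 of 10 bytes: exactly 0.30, not above)
p binary_sketch?("\x01\x02\x03\x04abcdef".b)    # => true  (4 of 10 bytes: 0.40)
```

The fourth case shows that the comparison is strict: a sample sitting exactly on the threshold is still treated as text.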
.detect_mime(input) ⇒ String?
Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).
    # File 'lib/pikuri/file_type.rb', line 98
    def detect_mime(input)
      bytes = sample_of(input)
      return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

      IMAGE_MAGIC_BYTES.each do |prefix, mime|
        return mime if bytes.start_with?(prefix)
      end
      return 'image/webp' if bytes.bytesize >= 12 &&
                             bytes.byteslice(0, 4) == 'RIFF'.b &&
                             bytes.byteslice(8, 4) == 'WEBP'.b

      nil
    end
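The WebP branch is the only one that looks past a flat prefix: bytes 0..3 are the RIFF tag, bytes 4..7 hold a little-endian size, and bytes 8..11 name the form type. The following standalone sketch, with illustrative names (MAGIC, detect_mime_sketch, riff) that are not part of pikuri, exercises that layout on hand-built headers, including a RIFF file that is not WebP.

```ruby
# Flat prefixes copied from the documented constants.
MAGIC = {
  '%PDF-'.b => 'application/pdf',
  "\x89PNG\r\n\x1a\n".b => 'image/png',
  "\xff\xd8\xff".b => 'image/jpeg',
  'GIF87a'.b => 'image/gif',
  'GIF89a'.b => 'image/gif'
}.freeze

# Bytes-only restatement of the listed detection order.
def detect_mime_sketch(bytes)
  bytes = bytes.b
  MAGIC.each { |prefix, mime| return mime if bytes.start_with?(prefix) }
  # WebP: 'RIFF' at offset 0, 'WEBP' at offset 8, size field in between.
  return 'image/webp' if bytes.bytesize >= 12 &&
                         bytes.byteslice(0, 4) == 'RIFF' &&
                         bytes.byteslice(8, 4) == 'WEBP'

  nil
end

# Build a 12-byte RIFF header with a zero size field ('V' = 32-bit little-endian).
riff = ->(form) { 'RIFF'.b + [0].pack('V') + form.b }

p detect_mime_sketch(riff.call('WEBP'))    # => "image/webp"
p detect_mime_sketch(riff.call('WAVE'))    # => nil (RIFF container, but not WebP)
p detect_mime_sketch('%PDF-1.7')           # => "application/pdf"
p detect_mime_sketch('GIF89a' + 'x' * 10)  # => "image/gif"
p detect_mime_sketch('hello world')        # => nil
```

A nil result only means "no known signature"; whether the bytes are text or opaque binary is a separate question answered by binary?.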
.read_as_text(path) ⇒ String
Read path and return its content as plain UTF-8 text. Two extraction paths, picked by detect_mime:
- **PDF**: walked page-by-page via pdf-reader; each page’s extracted text is stripped and pages are joined with a blank line. A scanned-image PDF (no extractable text) comes back as the empty String. This is a deliberate silent skip; callers detect it by length if they care.
- **Plain text**: anything that detect_mime doesn’t recognise and that binary? accepts. Read with UTF-8 encoding; behaviour on non-UTF-8 bytes is whatever File.read does with encoding: Encoding::UTF_8 (which is “leave invalid bytes in, let downstream decide”).
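What “leave invalid bytes in” means in practice can be seen with a Latin-1 file. The sketch below uses the same Pathname#read call as the plain-text branch; read_tagged_utf8 is an illustrative wrapper, and the scrub step is only one option a downstream caller might choose, not something the module does.

```ruby
require 'pathname'
require 'tempfile'

# Same call the plain-text branch ends with; the wrapper name is illustrative.
def read_tagged_utf8(path)
  path.read(encoding: Encoding::UTF_8)
end

Tempfile.create('latin1') do |f|
  f.binmode
  f.write("caf\xE9\n".b) # ISO-8859-1 "café": a lone 0xE9 is not valid UTF-8
  f.flush

  text = read_tagged_utf8(Pathname.new(f.path))
  p text.valid_encoding? # => false (tagged UTF-8, stray byte still present)
  p text.scrub('?')      # => "caf?\n" (a caller-side repair, if valid text is required)
end
```

Because 0xE9 is above 31, binary? does not count it, so a file like this reaches the plain-text branch and the caller receives a String that may fail valid_encoding?.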
Refusal cases all raise rather than return a sentinel, because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.
- Argument is not a Pathname → ArgumentError.
- Path doesn’t exist → Errno::ENOENT.
- Path is a directory → ArgumentError.
- Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.
- Binary content (per binary?) and not a recognised MIME → ArgumentError.
- Malformed PDF → pdf-reader’s MalformedPDFError / UnsupportedFeatureError / InvalidPageError are re-raised as a RuntimeError with the path included, so callers don’t need to know pdf-reader’s exception hierarchy.
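On the caller side, the refusal cases map onto three rescue clauses plus a length check for text-free PDFs. The loop below is a hypothetical consumer, not pikuri code: extract_all is an invented name, and reader is an injected callable standing in for FileType.read_as_text so the sketch runs on its own. The stub uses plain Strings as paths for brevity, whereas the real method requires Pathname objects.

```ruby
# Hypothetical indexer-style loop; `reader` stands in for FileType.read_as_text.
def extract_all(paths, reader:)
  texts = {}
  skipped = {}
  paths.each do |path|
    text = reader.call(path)
    if text.empty?
      skipped[path] = 'no extractable text' # e.g. a scanned-image PDF
    else
      texts[path] = text
    end
  rescue Errno::ENOENT
    skipped[path] = 'missing'
  rescue ArgumentError => e
    skipped[path] = "not text: #{e.message}" # directory, image, or binary
  rescue RuntimeError => e
    skipped[path] = "unreadable PDF: #{e.message}"
  end
  [texts, skipped]
end

# Stub reader that imitates each documented outcome.
stub = lambda do |path|
  case path
  when 'notes.txt' then 'hello'
  when 'scan.pdf' then ''
  when 'logo.png' then raise ArgumentError, 'logo.png is an image'
  when 'broken.pdf' then raise 'broken.pdf: malformed'
  else raise Errno::ENOENT, path
  end
end

texts, skipped = extract_all(%w[notes.txt scan.pdf logo.png broken.pdf gone.txt], reader: stub)
p texts.keys   # => ["notes.txt"]
p skipped.keys # => ["scan.pdf", "logo.png", "broken.pdf", "gone.txt"]
```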
    # File 'lib/pikuri/file_type.rb', line 177
    def read_as_text(path)
      raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
      raise Errno::ENOENT, path.to_s unless path.exist?
      raise ArgumentError, "#{path} is a directory" if path.directory?

      mime = detect_mime(path)
      return read_pdf_text(path) if mime == 'application/pdf'
      raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
      raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

      path.read(encoding: Encoding::UTF_8)
    end