Module: Pikuri::FileType
- Defined in:
- lib/pikuri/file_type.rb
Overview
Magic-byte content sniffing, plus the path-aware front over the Extractor registry. Two responsibilities:
- FileType.detect_mime: recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf, the four image formats), or nil for “unrecognised”: could be text, could be opaque binary; the caller decides.
- FileType.binary?: heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.
On top of those sit the two Pathname conveniences, FileType.read_as_text (whole document, the VectorDb indexer’s shape) and FileType.read_as_text_paged (line-windowed, the Read tools’ shape). Both are thin wrappers: they own the path-level refusals (missing file, directory, image) and the exception mapping, then hand the opened IO to Extractor.extract / Extractor.extract_paged — which format the bytes are and how they become text is entirely the registry’s business, so a gem plugging a new extractor in extends these wrappers for free.
FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open.
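The two input forms can be pictured with a small normalising helper. The source listings on this page call a private sample_of whose body is not shown, so the following is only a sketch of what such a helper might look like. The truncation of String input to SAMPLE_BYTES and the ArgumentError for other types are assumptions, not documented behaviour.

```ruby
require 'pathname'
require 'tempfile'

SAMPLE_BYTES = 4096 # mirrors the documented FileType::SAMPLE_BYTES

# Hypothetical stand-in for the private helper: accept raw bytes or a
# Pathname and return at most SAMPLE_BYTES as a binary-encoded String.
def sample_of(input)
  case input
  when Pathname
    # IO#read(n) returns nil at EOF, so an empty file maps to an empty sample.
    input.open('rb') { |io| io.read(SAMPLE_BYTES) } || ''.b
  when String
    input.byteslice(0, SAMPLE_BYTES).b
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

Tempfile.create('sniff') do |file|
  file.binmode
  file.write('%PDF-1.7' + ('x' * 5000))
  file.flush
  sample = sample_of(Pathname.new(file.path))
  puts sample.bytesize              # 4096
  puts sample.start_with?('%PDF-')  # true
end
```

A caller that needs both answers would take the sample once and pass the same String to detect_mime and binary?, which is the second-open saving the paragraph above describes.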
Why a separate module
Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools share one set of magic-byte truths.
Deliberate non-goals
- *Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.
- *No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.
- *No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
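The last non-goal is easiest to see from the caller's side. A minimal sketch of routing on the MIME String directly follows; route_for and its symbols are hypothetical and not part of pikuri.

```ruby
# Hypothetical caller-side routing on the value detect_mime returns.
def route_for(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/')
    :image
  else
    :unknown # nil from detect_mime: text or opaque binary, caller decides
  end
end

puts route_for('application/pdf') # pdf
puts route_for('image/webp')      # image
puts route_for(nil)               # unknown
```

The safe-navigation operator is what keeps the nil case from raising, so no predicate helper is needed.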
Constant Summary
- SAMPLE_BYTES = 4096
  Recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free.
- BINARY_NONPRINTABLE_THRESHOLD = 0.30
  Fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold.
- IMAGE_MAGIC_BYTES =
    {
      "\x89PNG\r\n\x1a\n".b => 'image/png',
      "\xff\xd8\xff".b => 'image/jpeg',
      "GIF87a".b => 'image/gif',
      "GIF89a".b => 'image/gif'
    }.freeze
  Magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here, because its signature is split across the RIFF container header; it is handled directly in detect_mime.
- PDF_MAGIC = '%PDF-'
  PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2.
Class Method Summary
- .binary?(input) ⇒ Boolean
  Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count bytes outside the printable \t \n \v \f \r + ASCII-32..126 range and ratio against the sample size.
- .detect_mime(input) ⇒ String?
  Recognise a file from its leading bytes.
- .read_as_text(path) ⇒ String
  Read path and return its content as plain UTF-8 text, routed through the Extractor registry: anything unrecognised-but-textual passes through verbatim (Extractor::Passthrough); with pikuri-pdf registered, PDFs are extracted with “— Page N —” markers (a scanned-image PDF with no extractable text comes back as the empty String, a deliberate silent skip callers detect by length if they care).
- .read_as_text_paged(path, offset: 1, limit: Extractor::PAGE_DEFAULT_LIMIT, max_bytes: Extractor::PAGE_MAX_BYTES, max_line_length: Extractor::PAGE_MAX_LINE_LENGTH) ⇒ Extractor::Page
  Extract path and return a windowed Extractor::Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.
Class Method Details
.binary?(input) ⇒ Boolean
Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the control bytes other than \t \n \v \f \r (that is, bytes below 9 or in 14..31) and flag the sample as binary when their share of the sample size exceeds BINARY_NONPRINTABLE_THRESHOLD. Bytes above 126 are never counted, and that includes UTF-8 continuation bytes (0x80-0xBF), so UTF-8 text reads fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).
    # File 'lib/pikuri/file_type.rb', line 126
    def binary?(input)
      bytes = sample_of(input)
      return false if bytes.empty?

      non_printable = 0
      bytes.each_byte do |b|
        return true if b.zero?

        non_printable += 1 if b < 9 || (b > 13 && b < 32)
      end
      non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
    end
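To make the threshold arithmetic concrete, here is a standalone copy of the heuristic from the listing above. It takes a String of bytes only and is renamed binary_sample? to make clear it is an illustration, not the module's method. The sample inputs are made up.

```ruby
BINARY_NONPRINTABLE_THRESHOLD = 0.30

# Same logic as the listing above, minus the Pathname handling.
def binary_sample?(bytes)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero?

    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end

puts binary_sample?("plain text\n")           # false: \n is allowed
puts binary_sample?('héllo wörld')            # false: bytes above 126 are not counted
puts binary_sample?("abc\x00def")             # true: a single NUL decides it
puts binary_sample?("\x01\x02\x03abcdefg")    # false: 3/10 = 0.30 does not exceed 0.30
puts binary_sample?("\x01\x02\x03\x04abcdef") # true: 4/10 = 0.40 exceeds 0.30
```

The comparison is strict, so a sample sitting exactly at the threshold still counts as text.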
.detect_mime(input) ⇒ String?
Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).
    # File 'lib/pikuri/file_type.rb', line 98
    def detect_mime(input)
      bytes = sample_of(input)
      return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

      IMAGE_MAGIC_BYTES.each do |prefix, mime|
        return mime if bytes.start_with?(prefix)
      end
      return 'image/webp' if bytes.bytesize >= 12 &&
                             bytes.byteslice(0, 4) == 'RIFF'.b &&
                             bytes.byteslice(8, 4) == 'WEBP'.b

      nil
    end
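The WebP branch is the only one that cannot be a flat prefix match, because bytes 4..7 of a RIFF header hold a size field that differs from file to file. The sketch below builds a 12-byte header and applies the same three checks as the listing above. The helper names webp_header and webp? are hypothetical, and the size value is arbitrary.

```ruby
# Build a 12-byte RIFF header: "RIFF", a little-endian uint32 size field,
# then the four-byte form type.
def webp_header(riff_size)
  ['RIFF', riff_size, 'WEBP'].pack('a4Va4')
end

# The same length and offset checks detect_mime performs for WebP.
def webp?(bytes)
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&
    bytes.byteslice(8, 4) == 'WEBP'.b
end

puts webp?(webp_header(1234))              # true
puts webp?(webp_header(99_999))            # true: the size field is ignored
puts webp?("RIFF\x00\x00\x00\x00WAVE")     # false: RIFF container, but a WAV file
puts webp?('RIFF')                         # false: shorter than 12 bytes
```

Checking the form type at offset 8 is what keeps other RIFF containers, such as WAV or AVI, from being reported as image/webp.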
.read_as_text(path) ⇒ String
Read path and return its content as plain UTF-8 text, routed through the Extractor registry: anything unrecognised-but-textual passes through verbatim (Extractor::Passthrough); with pikuri-pdf registered, PDFs are extracted with “— Page N —” markers (a scanned-image PDF with no extractable text comes back as the empty String, a deliberate silent skip callers detect by length if they care).
Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.
- Path doesn’t exist → Errno::ENOENT.
- Path is a directory → ArgumentError.
- Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.
- Content no extractor claims (opaque binary) → ArgumentError, mapped from Extractor::Unsupported.
- Extraction failure (malformed PDF, …) → RuntimeError with the path included, mapped from Extractor::Error so callers don’t need to know any extractor’s exception hierarchy.
    # File 'lib/pikuri/file_type.rb', line 170
    def read_as_text(path)
      mime = guard_extractable(path)
      path.open('rb') { |io| Extractor.extract(io, content_type: mime) }
    rescue Extractor::Unsupported
      raise ArgumentError, "#{path} appears to be binary; cannot extract as text"
    rescue Extractor::Error => e
      raise "Cannot extract text from #{path}: #{e.message}"
    end
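From a caller's point of view, the refusal contract above reduces to three exception classes plus the empty-String case. The following is a sketch of how an internal caller might consume it. text_or_skip is hypothetical, and the reader: argument stands in for Pikuri::FileType.method(:read_as_text) so the example runs without pikuri installed.

```ruby
require 'pathname'

# Hypothetical indexer-side wrapper: returns the text, or nil when the
# file should be skipped.
def text_or_skip(path, reader:)
  text = reader.call(path)
  text.empty? ? nil : text # empty String, e.g. a scanned-image PDF
rescue Errno::ENOENT, ArgumentError => e
  # Missing path, directory, image, or opaque binary.
  warn "skipping #{path}: #{e.message}"
  nil
rescue RuntimeError => e
  # Extraction failure; the message already includes the path.
  warn e.message
  nil
end

path = Pathname.new('notes.txt')
p text_or_skip(path, reader: ->(_) { 'hello' })                      # "hello"
p text_or_skip(path, reader: ->(_) { '' })                           # nil
p text_or_skip(path, reader: ->(_) { raise ArgumentError, 'image' }) # nil
```

Because the mapping happens inside read_as_text, the caller never has to reference Extractor::Unsupported or Extractor::Error.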
.read_as_text_paged(path, offset: 1, limit: Extractor::PAGE_DEFAULT_LIMIT, max_bytes: Extractor::PAGE_MAX_BYTES, max_line_length: Extractor::PAGE_MAX_LINE_LENGTH) ⇒ Extractor::Page
Extract path and return a windowed Extractor::Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length. Same routing and refusal contract as read_as_text; the windowing semantics (including the lazy extract_lines consumption that stops parsing once the window fills) are Extractor.extract_paged’s. The LLM-facing callers map the exceptions into “Error: …” observations themselves.
    # File 'lib/pikuri/file_type.rb', line 202
    def read_as_text_paged(path, offset: 1,
                           limit: Extractor::PAGE_DEFAULT_LIMIT,
                           max_bytes: Extractor::PAGE_MAX_BYTES,
                           max_line_length: Extractor::PAGE_MAX_LINE_LENGTH)
      mime = guard_extractable(path)
      path.open('rb') do |io|
        Extractor.extract_paged(io, content_type: mime, offset: offset, limit: limit,
                                max_bytes: max_bytes, max_line_length: max_line_length)
      end
    rescue Extractor::Unsupported
      raise ArgumentError, "#{path} appears to be binary; cannot extract as text"
    rescue Extractor::Error => e
      raise "Cannot extract text from #{path}: #{e.message}"
    end
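The windowing itself lives in Extractor.extract_paged, which this page does not show. As a rough illustration of the semantics described above (1-indexed offset, a line limit, a byte cap, per-line truncation, lazy consumption), here is a standalone sketch. Window, window_lines, the default values, and the choice to count truncated lines against max_bytes are all assumptions, not the real Extractor::Page or its rules.

```ruby
# Hypothetical stand-in for Extractor::Page.
Window = Struct.new(:lines, :truncated, keyword_init: true)

def window_lines(lines, offset: 1, limit: 3, max_bytes: 1024, max_line_length: 20)
  taken = []
  bytes = 0
  truncated = false
  lines.each_with_index do |line, index|
    next if index + 1 < offset # offset is 1-indexed

    if taken.size >= limit
      truncated = true # at least one more line exists past the window
      break
    end
    line = line[0, max_line_length] if line.length > max_line_length
    if bytes + line.bytesize > max_bytes
      truncated = true
      break
    end
    taken << line
    bytes += line.bytesize
  end
  Window.new(lines: taken, truncated: truncated)
end

# A lazy source that records how many lines were actually produced.
produced = 0
source = Enumerator.new do |y|
  1.upto(1_000) do |n|
    produced += 1
    y << "line #{n}"
  end
end

page = window_lines(source, offset: 2, limit: 3)
p page.lines # ["line 2", "line 3", "line 4"]
p produced   # 5: one skipped, three taken, one peeked to detect truncation
```

Feeding the window from an enumerator rather than an array is what lets extraction stop after five lines instead of producing all thousand.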