Module: Pikuri::FileType
Defined in: lib/pikuri/file_type.rb
Overview
Magic-byte content sniffing and text extraction, centralised. Four responsibilities:
- FileType.detect_mime: recognise a file from its leading bytes. Returns a MIME String for formats pikuri knows how to handle specially (application/pdf, the four image formats), or nil for “unrecognised: could be text, could be opaque binary; caller decides”.
- FileType.binary?: heuristic text-vs-binary classifier. Independent of FileType.detect_mime: a file can be both recognised (e.g. PDF) and binary. FileType.detect_mime tells you what the bytes are; FileType.binary? tells you whether they’re safe to render as text.
- FileType.read_as_text: read a file and return its content as plain UTF-8 text. PDFs go through pdf-reader page-by-page; plain text passes through; images, binaries, and missing files raise. This is the pure-extraction shape that consumers like Pikuri::VectorDb’s indexer want (no LLM-tool concerns: no paging, no line numbering, no byte caps; just bytes in, text out).
- FileType.read_as_text_paged: the LLM-tool shape. The same extraction as FileType.read_as_text, but lazily windowed to a line range with a byte cap, returning a Page value the caller renders. Shared by Workspace::Read and VectorDb::Tools::Read so the offset/limit/byte-cap windowing lives in one tested place; each tool keeps its own presentation (cat -n numbering, trailer wording, citation vs. path). Same refusal contract as FileType.read_as_text (raises on image / binary / missing / malformed-PDF).
FileType.detect_mime and FileType.binary? accept either a String of bytes (sample taken by the caller) or a Pathname — when given a path, the module opens the file in binary mode and reads SAMPLE_BYTES for the sniff itself. The Pathname form is the convenience path; the bytes form is for callers that already have the sample or are calling both methods on the same file and want to avoid a second open. FileType.read_as_text takes a Pathname only — there’s no bytes-in shortcut because the PDF case needs to seek the file.
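The code listings in this section call a sample_of helper whose source is not shown here. The following is a minimal sketch of what such a helper might look like, assuming it only normalises the two accepted input forms to a binary String. The body is an assumption for illustration, not pikuri's implementation.

```ruby
require 'pathname'
require 'tempfile'

SAMPLE_BYTES = 4096 # value from the constant summary

# Hypothetical stand-in for the module's helper (its real body is not shown
# in this section): normalise either input form to a binary String.
def sample_of(input)
  case input
  when Pathname
    # Binary mode: no newline translation, no transcoding. IO#read(n) returns
    # nil at EOF, so an empty file falls back to an empty binary String.
    input.open('rb') { |f| f.read(SAMPLE_BYTES) } || ''.b
  when String
    input.b # binary-encoded copy; the caller's String is left untouched
  else
    raise ArgumentError, "expected String or Pathname, got #{input.class}"
  end
end

# Usage: sample once, then reuse the same bytes for both checks.
Tempfile.create('sample') do |f|
  f.write('%PDF-1.7')
  f.flush
  bytes = sample_of(Pathname.new(f.path))
  puts bytes.start_with?('%PDF-') # prints "true"
end
```

Sampling once and passing the bytes to both detect_mime and binary? is the "avoid a second open" case described above.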
Why a separate module
Without this module, magic-byte tables and the binary heuristic ended up scattered through whichever tool needed them — first PDF in Workspace::Read, then images alongside it, then a copy of FileType.binary? reached for by Workspace::Edit. Collecting the detection logic here lets Read focus on routing (mime-to-formatter), Edit drop its cross-tool reach, and new tools (a future Workspace::Diff, an attachment-aware web fetcher, …) share one set of magic-byte truths.
Deliberate non-goals
- *Not a full MIME database.* The set grows when a pikuri tool needs a new format, not speculatively. Keeps the “audit in an evening” ceiling honest.
- *No path / extension fallback.* Extensions lie (a renamed .png → opaque garbage); magic-byte detection on the actual content is the source of truth. Callers that need extension-based behaviour can layer it themselves.
- *No convenience predicates* like image? / pdf?. Callers do mime == 'application/pdf' or mime&.start_with?('image/'): one extra character, zero added API surface.
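As an illustration of the last non-goal, a caller-side router can be built from the two comparisons alone. The method name and the symbols it returns are hypothetical; only the two mime checks come from the text above.

```ruby
# Hypothetical caller-side routing on the value returned by detect_mime.
def route_for(mime)
  if mime == 'application/pdf'
    :pdf
  elsif mime&.start_with?('image/')
    :image
  else
    # nil means "unrecognised": the caller decides between text and binary,
    # typically by asking binary? next.
    :text_or_binary
  end
end

puts route_for('application/pdf') # prints "pdf"
puts route_for('image/webp')      # prints "image"
puts route_for(nil)               # prints "text_or_binary"
```

The safe-navigation operator is what makes the nil case fall through without an explicit guard.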
Defined Under Namespace
Classes: Page
Constant Summary
- SAMPLE_BYTES = 4096
Returns recommended number of bytes to sample for detect_mime and binary?. Big enough to catch every prefix pikuri sniffs today (the largest is WebP’s 12-byte container header) with comfortable slack; small enough that reading it off any reasonable filesystem is effectively free.
- BINARY_NONPRINTABLE_THRESHOLD = 0.30
Returns fraction of the sample that may be non-printable before binary? flags the bytes as binary. Matches opencode’s threshold.
- IMAGE_MAGIC_BYTES =
    {
      "\x89PNG\r\n\x1a\n".b => 'image/png',
      "\xff\xd8\xff".b => 'image/jpeg',
      "GIF87a".b => 'image/gif',
      "GIF89a".b => 'image/gif'
    }.freeze
Returns magic-byte prefixes → MIME types for the image formats with flat (offset-zero, fixed-length) signatures. WebP isn’t here (its signature is split across the RIFF container header) and is handled directly in detect_mime.
- PDF_MAGIC = '%PDF-'
Returns PDF magic prefix. Every conformant PDF starts with this five-byte ASCII sequence per ISO 32000-1 §7.5.2.
- PAGE_DEFAULT_LIMIT = 2000
Returns default line-window size for read_as_text_paged when the caller omits limit.
- PAGE_MAX_BYTES = 50 * 1024
Returns default hard byte cap on the content collected by a single read_as_text_paged call. Bypassable by paging via offset. The rendered output is slightly larger (line numbering, trailer); that’s the caller’s concern.
- PAGE_MAX_LINE_LENGTH = 2000
Returns default per-line character cap; read_as_text_paged truncates longer lines and appends PAGE_LINE_TRUNCATION_MARKER.
- PAGE_LINE_TRUNCATION_MARKER = "... (line truncated to #{PAGE_MAX_LINE_LENGTH} chars)"
Returns suffix appended to a line truncated at PAGE_MAX_LINE_LENGTH.
Class Method Summary
- .binary?(input) ⇒ Boolean
Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count bytes outside the printable \t \n \v \f \r + ASCII 32..126 range and take their ratio to the sample size.
- .detect_mime(input) ⇒ String?
Recognise a file from its leading bytes.
- .read_as_text(path) ⇒ String
Read path and return its content as plain UTF-8 text.
- .read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page
Extract path as text and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length.
Class Method Details
.binary?(input) ⇒ Boolean
Heuristic text-vs-binary classifier matching opencode’s: any NUL byte forces true; otherwise count the control bytes other than \t \n \v \f \r (that is, bytes below 9 or in 14..31) and flag the sample as binary when their fraction of the sample size exceeds BINARY_NONPRINTABLE_THRESHOLD. Bytes of 127 and above are never counted, so UTF-8 lead and continuation bytes (0x80-0xBF among them) pass through unflagged, letting UTF-8 text read fine. An empty sample is treated as not-binary (callers reading an empty file take the empty-text path).
    # File 'lib/pikuri/file_type.rb', line 187
    def binary?(input)
      bytes = sample_of(input)
      return false if bytes.empty?

      non_printable = 0
      bytes.each_byte do |b|
        return true if b.zero?

        non_printable += 1 if b < 9 || (b > 13 && b < 32)
      end
      non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
    end
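To see how the threshold behaves at its edges, here is a standalone copy of the heuristic that takes a String directly (no Pathname handling). The name binaryish? is made up for this sketch; the loop body is the one from the listing above.

```ruby
BINARY_NONPRINTABLE_THRESHOLD = 0.30 # value from the constant summary

# Standalone, String-only copy of the listed heuristic (hypothetical name).
def binaryish?(bytes)
  return false if bytes.empty?

  non_printable = 0
  bytes.each_byte do |b|
    return true if b.zero? # a single NUL short-circuits to "binary"

    # Control bytes except \t \n \v \f \r (9..13); bytes >= 127 never count.
    non_printable += 1 if b < 9 || (b > 13 && b < 32)
  end
  non_printable.to_f / bytes.bytesize > BINARY_NONPRINTABLE_THRESHOLD
end

p binaryish?('')                        # => false (empty sample)
p binaryish?("héllo wörld\n")           # => false (multi-byte UTF-8 is not counted)
p binaryish?("text\x00more".b)          # => true  (NUL byte)
p binaryish?("\x01\x02\x03abcdefg")     # => false (3 of 10 is not above 0.30)
p binaryish?("\x01\x02\x03\x04abcdef")  # => true  (4 of 10 is above 0.30)
```

The comparison is strict, so a sample sitting exactly on the threshold is still treated as text.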
.detect_mime(input) ⇒ String?
Recognise a file from its leading bytes. Returns the MIME type as a String for formats pikuri handles specially, or nil for “unrecognised” — callers interpret nil themselves (text, opaque binary, …).
    # File 'lib/pikuri/file_type.rb', line 159
    def detect_mime(input)
      bytes = sample_of(input)
      return 'application/pdf' if bytes.start_with?(PDF_MAGIC)

      IMAGE_MAGIC_BYTES.each do |prefix, mime|
        return mime if bytes.start_with?(prefix)
      end

      return 'image/webp' if bytes.bytesize >= 12 &&
                             bytes.byteslice(0, 4) == 'RIFF'.b &&
                             bytes.byteslice(8, 4) == 'WEBP'.b

      nil
    end
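The WebP branch cannot live in a flat prefix table because bytes 4..7 of a RIFF container hold a little-endian chunk size that differs per file. A small sketch of that check in isolation follows; webp? is a hypothetical name (the module deliberately ships no such predicate) and the size values are arbitrary.

```ruby
# Hypothetical standalone version of the WebP branch in detect_mime.
def webp?(bytes)
  bytes.bytesize >= 12 &&
    bytes.byteslice(0, 4) == 'RIFF'.b &&   # container tag
    bytes.byteslice(8, 4) == 'WEBP'.b      # form type, after the 4-byte size
end

# Build 12-byte headers by hand: tag, 32-bit little-endian size, form type.
webp_header = 'RIFF'.b + [26].pack('V') + 'WEBP'.b
wav_header  = 'RIFF'.b + [36].pack('V') + 'WAVE'.b

p webp?(webp_header) # => true
p webp?(wav_header)  # => false (RIFF container, but not WebP)
p webp?('RIFF'.b)    # => false (shorter than 12 bytes)
```

Checking only the RIFF tag would misidentify other RIFF formats such as WAV, which is why both slices are compared.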
.read_as_text(path) ⇒ String
Read path and return its content as plain UTF-8 text. Two extraction paths, picked by detect_mime:
- **PDF**: walked page-by-page via pdf-reader; each page’s extracted text is stripped and pages are joined with a blank line. A scanned-image PDF (no extractable text) comes back as the empty String. This is a deliberate silent skip; callers detect it by length if they care.
- **Plain text**: anything that detect_mime doesn’t recognise and that binary? accepts. Read with UTF-8 encoding; behaviour on non-UTF-8 bytes is whatever File.read does with encoding: Encoding::UTF_8 (which is “leave invalid bytes in, let downstream decide”).
Refusal cases — all raise rather than returning a sentinel because the callers are internal pikuri code, not an LLM tool. The LLM-facing Workspace::Read does its own routing and returns “Error: …” observations; that’s a separate concern.
- Path doesn’t exist → Errno::ENOENT.
- Path is a directory → ArgumentError.
- Image (PNG / JPEG / GIF / WebP per detect_mime) → ArgumentError; images aren’t text.
- Binary content (per binary?) and not a recognised MIME → ArgumentError.
- Malformed PDF: pdf-reader’s MalformedPDFError / UnsupportedFeatureError / InvalidPageError are re-raised as a RuntimeError with the path included so callers don’t need to know pdf-reader’s exception hierarchy.
    # File 'lib/pikuri/file_type.rb', line 238
    def read_as_text(path)
      raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
      raise Errno::ENOENT, path.to_s unless path.exist?
      raise ArgumentError, "#{path} is a directory" if path.directory?

      mime = detect_mime(path)
      return read_pdf_text(path) if mime == 'application/pdf'
      raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
      raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

      path.read(encoding: Encoding::UTF_8)
    end
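Because every refusal case raises, an internal caller such as an indexer needs a rescue ladder around the call. The wrapper below is a sketch of one way to write it: extract_or_skip and its status symbols are hypothetical, and the extractor is passed in as a callable so the example runs without pikuri loaded (in real use it could be Pikuri::FileType.method(:read_as_text)).

```ruby
require 'pathname'

# Hypothetical caller-side wrapper mapping the documented refusal cases to
# a [status, payload] pair instead of letting them propagate.
def extract_or_skip(path, extractor)
  text = extractor.call(path)
  # A scanned-image PDF comes back as the empty String; detect it by length.
  text.empty? ? [:empty, text] : [:ok, text]
rescue Errno::ENOENT
  [:missing, nil]
rescue ArgumentError => e
  [:refused, e.message]   # directory, image, or binary content
rescue RuntimeError => e
  [:malformed, e.message] # pdf-reader errors re-raised with the path
end

# Usage with stand-in extractors that mimic the documented behaviour.
path = Pathname.new('notes.txt')
p extract_or_skip(path, ->(_p) { 'hello' })
# => [:ok, "hello"]
p extract_or_skip(path, ->(p) { raise ArgumentError, "#{p} appears to be binary" })
# => [:refused, "notes.txt appears to be binary"]
```

ArgumentError and RuntimeError are siblings under StandardError, so the order of those two rescue clauses does not matter.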
.read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT, max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH) ⇒ Page
Extract path as text and return a windowed Page: the lines from offset (1-indexed) up to limit of them, stopping early if max_bytes is reached, with over-long lines truncated at max_line_length. Lazy by design — a text file is streamed line-by-line and a PDF is parsed page-by-page only until the window fills, so reading the first page of a 500-page PDF parses a handful of pages, not all of them.
Same routing and refusal contract as read_as_text: PDFs are extracted (with “— Page N —” marker lines, unlike read_as_text’s marker-free join; paging is a display path, while the marker-free form stays the indexing path); images, binaries, directories, missing files, and malformed PDFs all raise rather than returning a sentinel. The LLM-facing callers map those into “Error: …” observations themselves.
    # File 'lib/pikuri/file_type.rb', line 300
    def read_as_text_paged(path, offset: 1, limit: PAGE_DEFAULT_LIMIT,
                           max_bytes: PAGE_MAX_BYTES, max_line_length: PAGE_MAX_LINE_LENGTH)
      raise ArgumentError, "expected Pathname, got #{path.class}" unless path.is_a?(Pathname)
      raise Errno::ENOENT, path.to_s unless path.exist?
      raise ArgumentError, "#{path} is a directory" if path.directory?

      mime = detect_mime(path)
      if mime == 'application/pdf'
        return paged_pdf(path, offset: offset, limit: limit,
                               max_bytes: max_bytes, max_line_length: max_line_length)
      end
      raise ArgumentError, "#{path} is an image (#{mime}); cannot extract as text" if mime&.start_with?('image/')
      raise ArgumentError, "#{path} appears to be binary; cannot extract as text" if binary?(path)

      paged_text(path, offset: offset, limit: limit,
                       max_bytes: max_bytes, max_line_length: max_line_length)
    end
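The paged_text and paged_pdf helpers and the Page class are not shown in this section. The sketch below illustrates the kind of windowing the description implies, under stated assumptions: Window is a hypothetical stand-in for Page, the truncation marker is rebuilt from the max_line_length argument, and at least one line is always admitted so paging makes progress. It is not pikuri's implementation.

```ruby
# Hypothetical result shape; the real Page fields are not documented here.
Window = Struct.new(:lines, :next_offset, :hit_byte_cap, keyword_init: true)

# Window any Enumerable of lines. Iteration stops as soon as the window is
# full, so a lazy source (for example File.foreach(path)) is only read as far
# as needed.
def window_lines(lines, offset: 1, limit: 2000, max_bytes: 50 * 1024, max_line_length: 2000)
  marker = "... (line truncated to #{max_line_length} chars)"
  collected = []
  bytes = 0
  hit_cap = false
  more = false

  lines.each_with_index do |raw, index|
    next if index + 1 < offset # offset is 1-indexed

    if collected.size >= limit
      more = true
      break
    end

    line = raw.chomp
    line = line[0, max_line_length] + marker if line.length > max_line_length

    # Byte cap applies to collected content; the first line always fits.
    if !collected.empty? && bytes + line.bytesize > max_bytes
      hit_cap = true
      more = true
      break
    end

    collected << line
    bytes += line.bytesize
  end

  Window.new(lines: collected,
             next_offset: more ? offset + collected.size : nil,
             hit_byte_cap: hit_cap)
end

page = window_lines(%w[a b c d e], offset: 2, limit: 2)
p page.lines       # => ["b", "c"]
p page.next_offset # => 4
```

A caller pages through a file by feeding next_offset back in as offset until it comes back nil, which is how the byte cap is bypassed by paging.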