Module: Clacky::Utils::FileProcessor
- Defined in:
- lib/clacky/utils/file_processor.rb
Overview
File processing pipeline.
Two entry points:
FileProcessor.save(body:, filename:)
→ Store raw bytes to disk only. Returns { name:, path: }.
Used by http_server and channel adapters — no parsing here.
FileProcessor.process_path(path, name: nil)
→ Parse an already-saved file. Returns FileRef (with preview_path or parse_error).
Used by agent.run when building the file prompt.
(FileProcessor.process = save + process_path in one call, for convenience.)
Defined Under Namespace
Classes: FileRef
Constant Summary
- UPLOAD_DIR =
File.join(Dir.tmpdir, "clacky-uploads").freeze
- MAX_FILE_BYTES =
32 MB
32 * 1024 * 1024
- MAX_IMAGE_BYTES =
5 MB
5 * 1024 * 1024
- MAX_FILE_SIZE =
Alias used by FileReader tool
MAX_FILE_BYTES
- IMAGE_MAX_WIDTH =
Images wider than this will be downscaled before sending to LLM (pixels)
800
- IMAGE_MAX_BASE64_BYTES =
Hard limit for images that can’t be resized: Anthropic/Bedrock vision API supports up to 5MB
5_000_000
- BINARY_EXTENSIONS =
%w[ .png .jpg .jpeg .gif .webp .bmp .tiff .ico .svg .pdf .zip .gz .tgz .tar .rar .7z .exe .dll .so .dylib .mp3 .mp4 .avi .mov .mkv .wav .flac .ttf .otf .woff .woff2 .db .sqlite .bin .dat .wps .et .dps ].freeze
- GLOB_ALLOWED_BINARY_EXTENSIONS =
%w[ .pdf .doc .docx .ppt .pptx .xls .xlsx .odt .odp .ods .wps .et .dps ].freeze
- LLM_BINARY_EXTENSIONS =
%w[.png .jpg .jpeg .gif .webp .pdf].freeze
- MIME_TYPES =
{ ".png" => "image/png", ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg", ".gif" => "image/gif", ".webp" => "image/webp", ".mp4" => "video/mp4", ".webm" => "video/webm", ".mov" => "video/quicktime", ".wav" => "audio/wav", ".mp3" => "audio/mpeg", ".ogg" => "audio/ogg", ".aac" => "audio/aac", ".flac" => "audio/flac", ".m4a" => "audio/mp4", ".pdf" => "application/pdf" }.freeze
- FILE_TYPES =
{ ".docx" => :document, ".doc" => :document, ".xlsx" => :spreadsheet, ".xls" => :spreadsheet, ".pptx" => :presentation, ".ppt" => :presentation, ".wps" => :document, ".et" => :spreadsheet, ".dps" => :presentation, ".pdf" => :pdf, ".zip" => :zip, ".gz" => :zip, ".tgz" => :zip, ".tar" => :zip, ".rar" => :zip, ".7z" => :zip, ".png" => :image, ".jpg" => :image, ".jpeg" => :image, ".gif" => :image, ".webp" => :image, ".csv" => :csv, ".md" => :text, ".markdown" => :text, ".txt" => :text, ".log" => :text }.freeze
- TEXT_PREVIEW_EXTENSIONS =
Plain-text extensions whose raw content can be embedded directly as the preview (no external parser needed). Kept conservative to avoid pulling in huge source files by mistake.
%w[.md .markdown .txt .log].freeze
- LOCAL_IMAGE_EXTENSIONS =
Image extensions that can be inlined as data URLs in markdown content.
%w[.png .jpg .jpeg .gif .webp].freeze
- LOCAL_VIDEO_EXTENSIONS =
%w[.mp4 .webm .mov].freeze
- LOCAL_AUDIO_EXTENSIONS =
%w[.wav .mp3 .ogg .aac .flac .m4a].freeze
- LOCAL_MEDIA_EXTENSIONS =
(LOCAL_IMAGE_EXTENSIONS + LOCAL_VIDEO_EXTENSIONS + LOCAL_AUDIO_EXTENSIONS).freeze
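All of the lookup tables above are keyed by a lowercase, dot-prefixed extension. The following is a minimal sketch of how such a lookup behaves, using a small excerpt of FILE_TYPES and MIME_TYPES copied from the summary. The module name `ExtensionLookupSketch` and the helper `classify` are hypothetical and not part of FileProcessor; the fallbacks (`:file` and `"application/octet-stream"`) mirror those shown in the method details.

```ruby
module ExtensionLookupSketch
  # Excerpts of the tables listed above; the real constants have more entries.
  FILE_TYPES = { ".docx" => :document, ".pdf" => :pdf, ".gz" => :zip, ".md" => :text }.freeze
  MIME_TYPES = { ".png" => "image/png", ".pdf" => "application/pdf" }.freeze

  # Hypothetical helper: normalise the extension, then look it up in both tables.
  def self.classify(path)
    ext = File.extname(path).downcase
    {
      ext: ext,
      type: FILE_TYPES[ext] || :file,
      mime: MIME_TYPES[ext] || "application/octet-stream"
    }
  end
end

p ExtensionLookupSketch.classify("Report.PDF")
p ExtensionLookupSketch.classify("notes.md")
p ExtensionLookupSketch.classify("archive.tar.gz")
```

Note that `File.extname` only sees the last extension, so `archive.tar.gz` is looked up as `.gz`; process_path handles the compound `.tar.gz` case separately.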
Class Method Summary
-
.binary_file_path?(path) ⇒ Boolean
File type helpers (used by tools and agent).
-
.detect_image_mime_type(data, fallback_mime = "image/png") ⇒ String
Detect the actual image MIME type from raw binary data by inspecting magic bytes, ignoring the file extension.
- .detect_mime_type(path, _data = nil) ⇒ Object
-
.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH) ⇒ String
Downscale a base64-encoded image so its width is at most max_width pixels.
- .file_to_base64(path) ⇒ Object
- .glob_allowed_binary?(path) ⇒ Boolean
- .image_path_to_data_url(path) ⇒ Object
-
.inline_local_images(content) ⇒ String
Replace local image paths in markdown content with base64 data URLs.
-
.process(body:, filename:) ⇒ FileRef
Save + parse in one call (convenience method).
-
.process_path(path, name: nil) ⇒ FileRef
Parse an already-saved file and return a FileRef.
-
.rewrite_local_image_urls(content) ⇒ String?
Rewrite local image paths in markdown content to use the /api/local-image proxy.
-
.save(body:, filename:) ⇒ Hash
Store raw bytes to disk — no parsing.
-
.save_image_to_disk(body:, mime_type:, filename: "image.jpg") ⇒ Object
Save raw image bytes to disk and return a FileRef.
- .supported_binary_file?(path) ⇒ Boolean
Class Method Details
.binary_file_path?(path) ⇒ Boolean
File type helpers (used by tools and agent)
# File 'lib/clacky/utils/file_processor.rb', line 199
def self.binary_file_path?(path)
  ext = File.extname(path).downcase
  return true if BINARY_EXTENSIONS.include?(ext)
  File.binread(path, 512).to_s.include?("\x00")
rescue
  false
end
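The content check in binary_file_path? can be tried in isolation. The sketch below leaves out the BINARY_EXTENSIONS shortcut and keeps only the second step: read at most 512 bytes and look for a NUL byte. The helper name `looks_binary?` is hypothetical.

```ruby
require "tempfile"

# Content-only variant of the check above: a NUL byte in the first
# 512 bytes is treated as a sign of binary data. Any read error
# (missing file, permissions) is reported as "not binary".
def looks_binary?(path)
  File.binread(path, 512).to_s.include?("\x00")
rescue StandardError
  false
end

text = Tempfile.new(["sample", ".txt"])
text.write("plain text\n")
text.close

blob = Tempfile.new(["sample", ".unknown"])
blob.binmode
blob.write("\x00\x01\x02".b)
blob.close

puts looks_binary?(text.path)          # false
puts looks_binary?(blob.path)          # true
puts looks_binary?("/no/such/file")    # false
```

Because errors are swallowed, a missing or unreadable file is indistinguishable from a text file; callers that care about the difference would need to check `File.exist?` first.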
.detect_image_mime_type(data, fallback_mime = "image/png") ⇒ String
Detect the actual image MIME type from raw binary data by inspecting magic bytes, ignoring the file extension. Returns fallback_mime when the data is nil, shorter than 4 bytes, or matches no known signature; callers typically pass an extension-based guess as the fallback.
Handles: PNG, JPEG, GIF, WEBP, BMP, TIFF
# File 'lib/clacky/utils/file_processor.rb', line 432
def self.detect_image_mime_type(data, fallback_mime = "image/png")
  return fallback_mime if data.nil? || data.bytesize < 4

  bytes = data.bytes
  case
  # PNG: \x89 P N G \r \n \x1a \n
  when bytes[0] == 0x89 && bytes[1] == 0x50 && bytes[2] == 0x4E && bytes[3] == 0x47
    "image/png"
  # JPEG: \xFF \xD8 \xFF
  when bytes[0] == 0xFF && bytes[1] == 0xD8 && bytes[2] == 0xFF
    "image/jpeg"
  # GIF: GIF87a or GIF89a
  when bytes[0] == 0x47 && bytes[1] == 0x49 && bytes[2] == 0x46 && bytes[3] == 0x38
    "image/gif"
  # WEBP: RIFF .... WEBP
  when bytes[0] == 0x52 && bytes[1] == 0x49 && bytes[2] == 0x46 && bytes[3] == 0x46 &&
       data.bytesize >= 12 && data[8, 4] == "WEBP"
    "image/webp"
  # BMP: BM
  when bytes[0] == 0x42 && bytes[1] == 0x4D
    "image/bmp"
  # TIFF: II*\x00 (little-endian) or MM\x00* (big-endian)
  when (bytes[0] == 0x49 && bytes[1] == 0x49 && bytes[2] == 0x2A && bytes[3] == 0x00) ||
       (bytes[0] == 0x4D && bytes[1] == 0x4D && bytes[2] == 0x00 && bytes[3] == 0x2A)
    "image/tiff"
  else
    fallback_mime
  end
end
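The same magic-byte idea can be expressed as a lookup table. The sketch below is an illustrative alternative, not the module's implementation: `IMAGE_SIGNATURES` and `sniff_mime` are hypothetical names, and TIFF is omitted for brevity.

```ruby
# Leading byte sequences for a few formats, as binary strings.
IMAGE_SIGNATURES = {
  "\x89PNG".b      => "image/png",
  "\xFF\xD8\xFF".b => "image/jpeg",
  "GIF8".b         => "image/gif",
  "BM".b           => "image/bmp"
}.freeze

# Return the MIME type implied by the leading bytes, or the fallback
# when the data is too short or matches no signature.
def sniff_mime(data, fallback = "image/png")
  return fallback if data.nil? || data.bytesize < 4

  head = data.b
  IMAGE_SIGNATURES.each do |magic, mime|
    return mime if head.start_with?(magic)
  end
  # WEBP is a RIFF container with "WEBP" at byte offset 8.
  return "image/webp" if head.start_with?("RIFF".b) && head.byteslice(8, 4) == "WEBP".b

  fallback
end

puts sniff_mime("\xFF\xD8\xFF\xE0rest-of-file".b, "image/png")  # image/jpeg
puts sniff_mime("plain text", "image/gif")                      # image/gif
```

The first call shows the case the method is designed for: a file named with a `.png` extension (fallback `image/png`) that actually contains JPEG bytes is reported as `image/jpeg`.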
.detect_mime_type(path, _data = nil) ⇒ Object
# File 'lib/clacky/utils/file_processor.rb', line 215
def self.detect_mime_type(path, _data = nil)
  MIME_TYPES[File.extname(path).downcase] || "application/octet-stream"
end
.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH) ⇒ String
Downscale a base64-encoded image so its width is at most max_width pixels.
Strategy:
PNG → chunky_png (pure Ruby, always available as gem dependency)
other formats (JPG/WEBP/GIF) → sips on macOS, `convert` (ImageMagick) on Linux
fallback (no CLI tool) → return as-is, but raise if larger than IMAGE_MAX_BASE64_BYTES
# File 'lib/clacky/utils/file_processor.rb', line 230
def self.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH)
  require "base64"

  result =
    if mime_type == "image/png"
      downscale_png_chunky(b64, max_width)
    else
      downscale_via_cli(b64, mime_type, max_width)
    end
  return result if result

  # No resize tool available: enforce API hard size limit (5MB)
  if b64.bytesize > IMAGE_MAX_BASE64_BYTES
    size_kb = b64.bytesize / 1024
    limit_mb = IMAGE_MAX_BASE64_BYTES / 1_000_000
    raise ArgumentError,
          "Image too large to send (#{size_kb}KB > #{limit_mb}MB). " \
          "Install ImageMagick (`brew install imagemagick`) to enable automatic resizing."
  end
  b64
end
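The size check above is applied to the base64 string, not to the raw image. Since base64 emits 4 output bytes for every 3 input bytes, the raw payload that fits under IMAGE_MAX_BASE64_BYTES is smaller than 5 MB. The arithmetic can be sketched as follows; the helper names and the constant `B64_LIMIT` are hypothetical, and `Array#pack("m0")` is used as a stand-in for `Base64.strict_encode64`, which produces the same unwrapped output.

```ruby
B64_LIMIT = 5_000_000 # value of IMAGE_MAX_BASE64_BYTES in the constant summary

# Length of the padded base64 encoding of raw_bytes bytes.
def base64_size(raw_bytes)
  4 * ((raw_bytes + 2) / 3)
end

# Largest raw payload whose encoding does not exceed the limit.
def max_raw_bytes(limit)
  (limit / 4) * 3
end

raw = "x" * 1000
puts [raw].pack("m0").bytesize == base64_size(1000)  # true
puts max_raw_bytes(B64_LIMIT)                        # 3750000
puts base64_size(3_750_001) > B64_LIMIT              # true
```

In other words, when no resize tool is available, an image of roughly 3.75 MB on disk is the largest that passes the check.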
.file_to_base64(path) ⇒ Object
# File 'lib/clacky/utils/file_processor.rb', line 252
def self.file_to_base64(path)
  require "base64"

  ext = File.extname(path).downcase
  size = File.size(path)
  raise ArgumentError, "File too large: #{path}" if size > MAX_FILE_BYTES

  ext_mime = MIME_TYPES[ext] || "application/octet-stream"
  raw_data = File.binread(path)
  # Detect actual image format from magic bytes (ignore misleading extensions)
  mime = ext_mime.start_with?("image/") ? detect_image_mime_type(raw_data, ext_mime) : ext_mime
  data = Base64.strict_encode64(raw_data)
  # Downscale images before sending to LLM to reduce token cost
  data = downscale_image_base64(data, mime) if mime.start_with?("image/")
  { format: ext[1..], mime_type: mime, size_bytes: size, base64_data: data }
end
.glob_allowed_binary?(path) ⇒ Boolean
# File 'lib/clacky/utils/file_processor.rb', line 207
def self.glob_allowed_binary?(path)
  GLOB_ALLOWED_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
end
.image_path_to_data_url(path) ⇒ Object
# File 'lib/clacky/utils/file_processor.rb', line 267
def self.image_path_to_data_url(path)
  raise ArgumentError, "Image file not found: #{path}" unless File.exist?(path)

  size = File.size(path)
  if size > MAX_IMAGE_BYTES
    raise ArgumentError, "Image too large (#{size / 1024}KB > #{MAX_IMAGE_BYTES / 1024}KB): #{path}"
  end

  require "base64"
  # Extension-based guess as fallback only
  ext = File.extname(path).downcase.delete(".")
  ext_mime =
    case ext
    when "jpg", "jpeg" then "image/jpeg"
    when "png" then "image/png"
    when "gif" then "image/gif"
    when "webp" then "image/webp"
    else "image/#{ext}"
    end
  raw_data = File.binread(path)
  # Detect actual image format from magic bytes (ignore misleading extensions)
  mime = detect_image_mime_type(raw_data, ext_mime)
  b64 = Base64.strict_encode64(raw_data)
  # Downscale images before sending to LLM to reduce token cost
  b64 = downscale_image_base64(b64, mime)
  "data:#{mime};base64,#{b64}"
end
.inline_local_images(content) ⇒ String
Replace local image paths in markdown content with base64 data URLs.
Handles both `file:///path/to/img.png` and bare `/path/to/img.png` in markdown image syntax `![alt](path)`.
# File 'lib/clacky/utils/file_processor.rb', line 551
def self.inline_local_images(content)
  return content if content.nil? || content.empty?

  content.gsub(%r{(!\[[^\]]*\])\((file://)?(/[^)]+)\)}) do
    prefix = $1
    _scheme = $2
    raw_path = $3
    path = CGI.unescape(raw_path)
    ext = File.extname(path).downcase
    full_match = $&

    unless LOCAL_IMAGE_EXTENSIONS.include?(ext) && File.exist?(path)
      next full_match
    end

    begin
      data_url = image_path_to_data_url(path)
      Clacky::Logger.info("file_processor.inline_local_images", path: path, size: File.size(path))
      "#{prefix}(#{data_url})"
    rescue StandardError => e
      Clacky::Logger.warn("file_processor.inline_local_images.failed", path: path, error: e.message)
      full_match
    end
  end
end
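To see which links the pattern in inline_local_images picks up, the regex can be run against a sample string. The regex below is copied from the listing; the constant name `IMAGE_LINK` and the sample text are illustrative only.

```ruby
# Regex from inline_local_images: group 1 is the "![alt]" prefix,
# group 2 the optional "file://" scheme, group 3 the absolute path.
IMAGE_LINK = %r{(!\[[^\]]*\])\((file://)?(/[^)]+)\)}

sample = "See ![chart](file:///tmp/chart.png), ![logo](/opt/app/logo.webp) " \
         "and ![remote](https://example.com/x.png)."

sample.scan(IMAGE_LINK).each do |prefix, scheme, path|
  puts "#{prefix} scheme=#{scheme.inspect} path=#{path}"
end
```

Only the first two links match. The `https://` link is skipped because the text after the opening parenthesis must start with `/` (optionally preceded by `file://`), which is what keeps remote images untouched.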
.process(body:, filename:) ⇒ FileRef
Save + parse in one call (convenience method).
# File 'lib/clacky/utils/file_processor.rb', line 180
def self.process(body:, filename:)
  saved = save(body: body, filename: filename)
  process_path(saved[:path], name: saved[:name])
end
.process_path(path, name: nil) ⇒ FileRef
Parse an already-saved file and return a FileRef. Called by agent.run for each disk file before building the prompt.
# File 'lib/clacky/utils/file_processor.rb', line 121
def self.process_path(path, name: nil)
  name ||= File.basename(path.to_s)

  # Use compound extension for .tar.gz so it's treated as a tarball, not gzip.
  basename_lower = name.to_s.downcase
  ext = if basename_lower.end_with?(".tar.gz")
          ".tar.gz"
        else
          File.extname(path.to_s).downcase
        end
  type = FILE_TYPES[ext] || :file

  case ext
  when ".zip"
    body = File.binread(path)
    preview_content = parse_zip_listing(body)
    preview_path = save_preview(preview_content, path)
    FileRef.new(name: name, type: :zip, original_path: path, preview_path: preview_path)
  when ".tar", ".tar.gz", ".tgz", ".gz"
    # Archive listing for tarballs and gzip'd files. Provides the LLM a
    # file-tree preview so it can decide whether to ask the user to
    # extract them (via the shell tool).
    begin
      preview_content = parse_tar_listing(path, ext)
      preview_path = save_preview(preview_content, path)
      FileRef.new(name: name, type: :zip, original_path: path, preview_path: preview_path)
    rescue => e
      FileRef.new(name: name, type: :zip, original_path: path, parse_error: e.message)
    end
  when ".png", ".jpg", ".jpeg", ".gif", ".webp"
    FileRef.new(name: name, type: :image, original_path: path)
  when ".csv"
    # CSV is plain text: the file itself IS the preview. No parser, no copy.
    # FileReader handles encoding fallback via safe_utf8 when it reads the file.
    FileRef.new(name: name, type: :csv, original_path: path, preview_path: path)
  when *TEXT_PREVIEW_EXTENSIONS
    # Markdown / plain text / log: the file itself IS the preview.
    # No parser needed, no tmpdir copy; just point preview_path at the original.
    FileRef.new(name: name, type: :text, original_path: path, preview_path: path)
  else
    result = Utils::ParserManager.parse(path)
    if result[:success]
      preview_path = save_preview(result[:text], path)
      FileRef.new(name: name, type: type, original_path: path, preview_path: preview_path)
    else
      FileRef.new(name: name, type: type, original_path: path,
                  parse_error: result[:error], parser_path: result[:parser_path])
    end
  end
end
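The extension logic at the top of process_path is worth isolating, because it reads the compound `.tar.gz` suffix from the display name but the plain extension from the path. A minimal sketch of that step, with the hypothetical helper name `effective_extension`:

```ruby
# Mirror of the extension detection in process_path: the display name
# decides whether this is a .tar.gz; otherwise the path's own extension
# (lowercased) is used.
def effective_extension(path, name: nil)
  name ||= File.basename(path.to_s)
  return ".tar.gz" if name.to_s.downcase.end_with?(".tar.gz")

  File.extname(path.to_s).downcase
end

puts effective_extension("/tmp/ab12_backup.TAR.GZ")           # .tar.gz
puts effective_extension("/tmp/data.gz")                      # .gz
puts effective_extension("/tmp/ab12_blob", name: "src.tar.gz") # .tar.gz
puts effective_extension("/tmp/README").inspect               # ""
```

An empty extension, as in the last call, falls through to the `else` branch of the `case`, so the file is handed to the parser with type `:file`.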
.rewrite_local_image_urls(content) ⇒ String?
Rewrite local image paths in markdown content to use the /api/local-image proxy.
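Matches two patterns inside `![alt](...)`:
1. file:// URLs → `![alt](file:///path/to/img.png)`
2. bare absolute paths → `![alt](/path/to/img.png)`
The same rewrite is applied to `<video src="...">` and `<audio src="...">` tags and to plain markdown links that point at local video or audio files. https:// URLs, missing files, and non-media files are left untouched.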
# File 'lib/clacky/utils/file_processor.rb', line 594
def self.rewrite_local_image_urls(content)
  return content if content.nil? || content.empty?

  # Rewrite markdown image syntax ![alt](file:///path/img.png) → proxy URL
  content = content.gsub(/!\[([^\]]*)\]\(((?:file:\/\/)?\/[^)]+)\)/) do |_match|
    alt = Regexp.last_match(1)
    href = Regexp.last_match(2)
    path = href.sub(%r{\Afile://}, "")
    path = CGI.unescape(path)
    ext = File.extname(path).downcase
    if LOCAL_MEDIA_EXTENSIONS.include?(ext) && File.exist?(path)
      encoded = CGI.escape(href)
      "![#{alt}](/api/local-image?path=#{encoded})"
    else
      _match
    end
  end

  # Rewrite <video src="file:///path/vid.mp4" ...> → proxy URL
  content = content.gsub(/<video\b([^>]*)\bsrc="((?:file:\/\/)?\/[^"]+)"([^>]*)>/) do |_match|
    pre = Regexp.last_match(1) || ""
    href = Regexp.last_match(2)
    post = Regexp.last_match(3) || ""
    path = href.sub(%r{\Afile://}, "")
    path = CGI.unescape(path)
    ext = File.extname(path).downcase
    if LOCAL_VIDEO_EXTENSIONS.include?(ext) && File.exist?(path)
      encoded = CGI.escape(href)
      "<video#{pre} src=\"/api/local-image?path=#{encoded}\"#{post}>"
    else
      _match
    end
  end

  # Rewrite <audio src="file:///path/audio.wav" ...> → proxy URL
  content = content.gsub(/<audio\b([^>]*)\bsrc="((?:file:\/\/)?\/[^"]+)"([^>]*)>/) do |_match|
    pre = Regexp.last_match(1) || ""
    href = Regexp.last_match(2)
    post = Regexp.last_match(3) || ""
    path = href.sub(%r{\Afile://}, "")
    path = CGI.unescape(path)
    ext = File.extname(path).downcase
    if LOCAL_AUDIO_EXTENSIONS.include?(ext) && File.exist?(path)
      encoded = CGI.escape(href)
      "<audio#{pre} src=\"/api/local-image?path=#{encoded}\"#{post}>"
    else
      _match
    end
  end

  # Rewrite video/audio markdown links [text](file:///path) → proxy URL
  content = content.gsub(/(?<!!)\[([^\]]*)\]\(((?:file:\/\/)?\/[^)]+)\)/) do |_match|
    text = Regexp.last_match(1)
    href = Regexp.last_match(2)
    path = href.sub(%r{\Afile://}, "")
    path = CGI.unescape(path)
    ext = File.extname(path).downcase
    if LOCAL_VIDEO_EXTENSIONS.include?(ext) || LOCAL_AUDIO_EXTENSIONS.include?(ext)
      if File.exist?(path)
        encoded = CGI.escape(href)
        "[#{text}](/api/local-image?path=#{encoded})"
      else
        _match
      end
    else
      _match
    end
  end

  content
end
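Each branch of rewrite_local_image_urls performs the same two conversions on the matched href: it derives a filesystem path for the existence check, and it builds the proxy URL from the escaped original href. A small sketch of those two steps, with the hypothetical helper names `local_path` and `proxy_url`:

```ruby
require "cgi"

# Path used for the File.exist? and extension checks:
# strip a leading file:// and undo percent-encoding.
def local_path(href)
  CGI.unescape(href.sub(%r{\Afile://}, ""))
end

# URL written into the output: the original href (scheme included)
# is form-encoded into the path query parameter.
def proxy_url(href)
  "/api/local-image?path=#{CGI.escape(href)}"
end

puts local_path("file:///tmp/my%20chart.png")  # /tmp/my chart.png
puts proxy_url("file:///tmp/chart.png")        # /api/local-image?path=file%3A%2F%2F%2Ftmp%2Fchart.png
```

Note that the escaped value is the href as written, not the unescaped path, so whatever serves `/api/local-image` presumably has to strip the `file://` prefix and decode the path itself; that endpoint is not shown on this page.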
.save(body:, filename:) ⇒ Hash
Store raw bytes to disk — no parsing. Used by http_server upload endpoint and channel adapters.
# File 'lib/clacky/utils/file_processor.rb', line 107
def self.save(body:, filename:)
  FileUtils.mkdir_p(UPLOAD_DIR)
  safe_name = sanitize_filename(filename)
  dest = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  { name: safe_name, path: dest }
end
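The naming scheme used by save (a random 16-character hex prefix plus the sanitized name) can be tried without the library. The sketch below is a stand-alone approximation: `demo_save` writes into a caller-supplied directory instead of UPLOAD_DIR, and `demo_sanitize` is a hypothetical stand-in for the private `sanitize_filename`, whose actual rules are not shown on this page.

```ruby
require "securerandom"
require "tmpdir"
require "fileutils"

# Hypothetical sanitizer (an assumption, not the module's rules):
# drop directory components and replace unusual characters.
def demo_sanitize(filename)
  File.basename(filename.to_s).gsub(/[^\w.\-]/, "_")
end

# Same shape as FileProcessor.save: store bytes under a unique name
# and return the display name together with the on-disk path.
def demo_save(body:, filename:, dir:)
  FileUtils.mkdir_p(dir)
  safe_name = demo_sanitize(filename)
  dest = File.join(dir, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  { name: safe_name, path: dest }
end

dir = Dir.mktmpdir("clacky-demo")
first  = demo_save(body: "v1", filename: "my report.pdf", dir: dir)
second = demo_save(body: "v2", filename: "my report.pdf", dir: dir)
puts first[:name]                    # my_report.pdf
puts first[:path] != second[:path]   # true
```

The random prefix is what lets two uploads with the same filename coexist; the returned `name` stays prefix-free so it can be shown to the user or passed on as `name:` to process_path.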
.save_image_to_disk(body:, mime_type:, filename: "image.jpg") ⇒ Object
Save raw image bytes to disk and return a FileRef. Used by agent when an image exceeds MAX_IMAGE_BYTES and must be downgraded to disk.
# File 'lib/clacky/utils/file_processor.rb', line 187
def self.save_image_to_disk(body:, mime_type:, filename: "image.jpg")
  FileUtils.mkdir_p(UPLOAD_DIR)
  safe_name = sanitize_filename(filename)
  dest = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  FileRef.new(name: safe_name, type: :image, original_path: dest)
end
.supported_binary_file?(path) ⇒ Boolean
# File 'lib/clacky/utils/file_processor.rb', line 211
def self.supported_binary_file?(path)
  LLM_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
end