Module: Clacky::Utils::FileProcessor
- Defined in:
- lib/clacky/utils/file_processor.rb
Overview
File processing pipeline.
Two entry points:
FileProcessor.save(body:, filename:)
→ Store raw bytes to disk only. Returns { name:, path: }.
Used by http_server and channel adapters — no parsing here.
FileProcessor.process_path(path, name: nil)
→ Parse an already-saved file. Returns FileRef (with preview_path or parse_error).
Used by agent.run when building the file prompt.
(FileProcessor.process = save + process_path in one call, for convenience.)
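The two-phase split described above can be sketched with a small stand-in module. This is a simplified illustration, not the Clacky implementation: `MiniFileProcessor`, its upload directory, its two-entry `FILE_TYPES` table, and the filename sanitizer are all assumptions made for the example.

```ruby
require "fileutils"
require "securerandom"
require "tmpdir"

# Hypothetical stand-in for the two entry points; NOT the Clacky code.
module MiniFileProcessor
  FileRef = Struct.new(:name, :type, :original_path, :preview_path, :parse_error,
                       keyword_init: true)
  UPLOAD_DIR = File.join(Dir.tmpdir, "mini-uploads-example").freeze
  FILE_TYPES = { ".csv" => :csv, ".png" => :image }.freeze

  # Phase 1: store raw bytes only, no parsing.
  def self.save(body:, filename:)
    FileUtils.mkdir_p(UPLOAD_DIR)
    # Assumed sanitizer: drop directories, replace unusual characters.
    safe_name = File.basename(filename.to_s).gsub(/[^\w.\-]/, "_")
    dest = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
    File.binwrite(dest, body)
    { name: safe_name, path: dest }
  end

  # Phase 2: classify an already-saved file by its extension.
  def self.process_path(path, name: nil)
    name ||= File.basename(path.to_s)
    type = FILE_TYPES[File.extname(path.to_s).downcase] || :file
    FileRef.new(name: name, type: type, original_path: path)
  end

  # Convenience: both phases in one call.
  def self.process(body:, filename:)
    saved = save(body: body, filename: filename)
    process_path(saved[:path], name: saved[:name])
  end
end

saved = MiniFileProcessor.save(body: "a,b\n1,2\n", filename: "report.csv")
ref = MiniFileProcessor.process_path(saved[:path], name: saved[:name])
puts ref.type # prints "csv"
```

Keeping the save step free of parsing lets an upload endpoint return as soon as the bytes are on disk, with the more expensive parse deferred until the file is actually needed.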
Defined Under Namespace
Classes: FileRef
Constant Summary
- UPLOAD_DIR =
File.join(Dir.tmpdir, "clacky-uploads").freeze
- MAX_FILE_BYTES =
32 MB
32 * 1024 * 1024
- MAX_IMAGE_BYTES =
5 MB
5 * 1024 * 1024
- MAX_FILE_SIZE =
Alias used by FileReader tool
MAX_FILE_BYTES
- IMAGE_MAX_WIDTH =
Images wider than this are downscaled before being sent to the LLM (pixels)
800
- IMAGE_MAX_BASE64_BYTES =
Hard limit for images that can’t be resized: the Anthropic/Bedrock vision API supports up to 5 MB
5_000_000
- BINARY_EXTENSIONS =
%w[ .png .jpg .jpeg .gif .webp .bmp .tiff .ico .svg .pdf .zip .gz .tar .rar .7z .exe .dll .so .dylib .mp3 .mp4 .avi .mov .mkv .wav .flac .ttf .otf .woff .woff2 .db .sqlite .bin .dat ].freeze
- GLOB_ALLOWED_BINARY_EXTENSIONS =
%w[ .pdf .doc .docx .ppt .pptx .xls .xlsx .odt .odp .ods ].freeze
- LLM_BINARY_EXTENSIONS =
%w[.png .jpg .jpeg .gif .webp .pdf].freeze
- MIME_TYPES =
{ ".png" => "image/png", ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg", ".gif" => "image/gif", ".webp" => "image/webp", ".pdf" => "application/pdf" }.freeze
- FILE_TYPES =
{ ".docx" => :document, ".doc" => :document, ".xlsx" => :spreadsheet, ".xls" => :spreadsheet, ".pptx" => :presentation, ".ppt" => :presentation, ".pdf" => :pdf, ".zip" => :zip, ".gz" => :zip, ".tar" => :zip, ".rar" => :zip, ".7z" => :zip, ".png" => :image, ".jpg" => :image, ".jpeg" => :image, ".gif" => :image, ".webp" => :image, ".csv" => :csv }.freeze
Class Method Summary
-
.binary_file_path?(path) ⇒ Boolean
File type helpers (used by tools and agent).
- .detect_mime_type(path, _data = nil) ⇒ Object
-
.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH) ⇒ String
Downscale a base64-encoded image so its width is at most max_width pixels.
- .file_to_base64(path) ⇒ Object
- .glob_allowed_binary?(path) ⇒ Boolean
- .image_path_to_data_url(path) ⇒ Object
-
.process(body:, filename:) ⇒ FileRef
Save + parse in one call (convenience method).
-
.process_path(path, name: nil) ⇒ FileRef
Parse an already-saved file and return a FileRef.
-
.save(body:, filename:) ⇒ Hash
Store raw bytes to disk — no parsing.
-
.save_image_to_disk(body:, mime_type:, filename: "image.jpg") ⇒ Object
Save raw image bytes to disk and return a FileRef.
- .supported_binary_file?(path) ⇒ Boolean
Class Method Details
.binary_file_path?(path) ⇒ Boolean
File type helpers (used by tools and agent)
    # File 'lib/clacky/utils/file_processor.rb', line 163
    def self.binary_file_path?(path)
      ext = File.extname(path).downcase
      return true if BINARY_EXTENSIONS.include?(ext)
      File.binread(path, 512).to_s.include?("\x00")
    rescue
      false
    end
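The check above combines an extension lookup with a scan of the first 512 bytes for a NUL byte. A minimal standalone sketch of the same heuristic follows; `looks_binary?` and the shortened `KNOWN_BINARY_EXTS` list are hypothetical names for illustration only.

```ruby
require "tempfile"

# Illustrative subset; the real BINARY_EXTENSIONS list is much longer.
KNOWN_BINARY_EXTS = %w[.png .zip .pdf].freeze

# True when the extension is known-binary or the first 512 bytes contain NUL.
def looks_binary?(path)
  return true if KNOWN_BINARY_EXTS.include?(File.extname(path).downcase)
  # binread returns nil for an empty file, hence the to_s.
  File.binread(path, 512).to_s.include?("\x00".b)
rescue SystemCallError
  false # missing or unreadable file: report "not binary"
end

Tempfile.create(["sample", ".txt"]) do |f|
  f.write("plain text")
  f.flush
  puts looks_binary?(f.path) # prints "false"
end
```

Two properties of this heuristic are worth keeping in mind: the extension test runs before any read, so a listed extension is reported as binary even when the file does not exist, and UTF-16 text usually contains NUL bytes, so it is classified as binary.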
.detect_mime_type(path, _data = nil) ⇒ Object
    # File 'lib/clacky/utils/file_processor.rb', line 179
    def self.detect_mime_type(path, _data = nil)
      MIME_TYPES[File.extname(path).downcase] || "application/octet-stream"
    end
.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH) ⇒ String
Downscale a base64-encoded image so its width is at most max_width pixels.
Strategy:
PNG → chunky_png (pure Ruby, always available as gem dependency)
other formats (JPG/WEBP/GIF) → sips on macOS, `convert` (ImageMagick) on Linux
fallback (no CLI tool) → return as-is, but raise if larger than IMAGE_MAX_BASE64_BYTES
    # File 'lib/clacky/utils/file_processor.rb', line 194
    def self.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH)
      require "base64"
      result = if mime_type == "image/png"
                 downscale_png_chunky(b64, max_width)
               else
                 downscale_via_cli(b64, mime_type, max_width)
               end
      return result if result

      # No resize tool available: enforce API hard size limit (5MB)
      if b64.bytesize > IMAGE_MAX_BASE64_BYTES
        size_kb = b64.bytesize / 1024
        limit_mb = IMAGE_MAX_BASE64_BYTES / 1_000_000
        raise ArgumentError,
              "Image too large to send (#{size_kb}KB > #{limit_mb}MB). " \
              "Install ImageMagick (`brew install imagemagick`) to enable automatic resizing."
      end

      b64
    end
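The fallback limit applies to the base64 string, not to the raw file. Because base64 emits 4 output bytes per 3 input bytes (with padding), the 5_000_000-byte limit corresponds to roughly 3.75 MB of raw image data. The sketch below works through that arithmetic and isolates the size check; `base64_length` and `enforce_base64_limit` are hypothetical helpers, not part of FileProcessor.

```ruby
LIMIT = 5_000_000 # mirrors IMAGE_MAX_BASE64_BYTES above

# Length of padded, unwrapped base64 output for raw_bytes bytes of input.
def base64_length(raw_bytes)
  4 * ((raw_bytes + 2) / 3)
end

# Hypothetical helper: the fallback size check in isolation.
def enforce_base64_limit(b64, limit: LIMIT)
  return b64 if b64.bytesize <= limit

  raise ArgumentError,
        "Image too large to send (#{b64.bytesize / 1024}KB > #{limit / 1_000_000}MB)."
end

puts base64_length(3_750_000) # prints 5000000 (exactly at the limit, accepted)
puts base64_length(3_750_001) # prints 5000004 (over the limit, rejected)
```

Note that the error message mixes units: the size is divided by 1024 while the limit is divided by 1_000_000, so the two figures are not directly comparable.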
.file_to_base64(path) ⇒ Object
    # File 'lib/clacky/utils/file_processor.rb', line 216
    def self.file_to_base64(path)
      require "base64"
      ext = File.extname(path).downcase
      size = File.size(path)
      raise ArgumentError, "File too large: #{path}" if size > MAX_FILE_BYTES

      mime = MIME_TYPES[ext] || "application/octet-stream"
      data = Base64.strict_encode64(File.binread(path))
      # Downscale images before sending to LLM to reduce token cost
      data = downscale_image_base64(data, mime) if mime.start_with?("image/")
      { format: ext[1..], mime_type: mime, size_bytes: size, base64_data: data }
    end
.glob_allowed_binary?(path) ⇒ Boolean
    # File 'lib/clacky/utils/file_processor.rb', line 171
    def self.glob_allowed_binary?(path)
      GLOB_ALLOWED_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
    end
.image_path_to_data_url(path) ⇒ Object
    # File 'lib/clacky/utils/file_processor.rb', line 228
    def self.image_path_to_data_url(path)
      raise ArgumentError, "Image file not found: #{path}" unless File.exist?(path)

      size = File.size(path)
      if size > MAX_IMAGE_BYTES
        raise ArgumentError, "Image too large (#{size / 1024}KB > #{MAX_IMAGE_BYTES / 1024}KB): #{path}"
      end

      require "base64"
      ext = File.extname(path).downcase.delete(".")
      mime = case ext
             when "jpg", "jpeg" then "image/jpeg"
             when "png" then "image/png"
             when "gif" then "image/gif"
             when "webp" then "image/webp"
             else "image/#{ext}"
             end
      b64 = Base64.strict_encode64(File.binread(path))
      # Downscale images before sending to LLM to reduce token cost
      b64 = downscale_image_base64(b64, mime)
      "data:#{mime};base64,#{b64}"
    end
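The data URL assembled above has the shape `data:<mime>;base64,<payload>`. A minimal sketch of just that assembly step, working on in-memory bytes and skipping the size check and downscaling, might look like this. `data_url_for` is a hypothetical helper, and `pack("m0")` is used here because it produces the same unwrapped output as `Base64.strict_encode64` without requiring the base64 library.

```ruby
# Hypothetical helper, not part of FileProcessor: build an image data URL
# from raw bytes and a file extension.
def data_url_for(bytes, ext)
  ext = ext.downcase.delete(".")
  mime = case ext
         when "jpg", "jpeg" then "image/jpeg"
         else "image/#{ext}" # same fall-through as the method above
         end
  "data:#{mime};base64,#{[bytes].pack('m0')}"
end

puts data_url_for("hi", ".png") # prints "data:image/png;base64,aGk="
```

The fall-through branch derives the MIME type from the extension verbatim, so an `.svg` file would yield `image/svg` rather than the registered `image/svg+xml`. Callers that accept arbitrary extensions may want to handle such cases explicitly.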
.process(body:, filename:) ⇒ FileRef
Save + parse in one call (convenience method).
    # File 'lib/clacky/utils/file_processor.rb', line 144
    def self.process(body:, filename:)
      saved = save(body: body, filename: filename)
      process_path(saved[:path], name: saved[:name])
    end
.process_path(path, name: nil) ⇒ FileRef
Parse an already-saved file and return a FileRef. Called by agent.run for each disk file before building the prompt.
    # File 'lib/clacky/utils/file_processor.rb', line 103
    def self.process_path(path, name: nil)
      name ||= File.basename(path.to_s)
      ext = File.extname(path.to_s).downcase
      type = FILE_TYPES[ext] || :file

      case ext
      when ".zip"
        body = File.binread(path)
        preview_content = parse_zip_listing(body)
        preview_path = save_preview(preview_content, path)
        FileRef.new(name: name, type: :zip, original_path: path, preview_path: preview_path)
      when ".png", ".jpg", ".jpeg", ".gif", ".webp"
        FileRef.new(name: name, type: :image, original_path: path)
      when ".csv"
        # CSV is plain text: read directly, no external parser needed.
        # Try UTF-8 first, then GBK (common in Chinese-origin CSV), then binary with replacement.
        begin
          text = read_text_with_encoding_fallback(path)
          preview_path = save_preview(text, path)
          FileRef.new(name: name, type: :csv, original_path: path, preview_path: preview_path)
        rescue => e
          FileRef.new(name: name, type: :csv, original_path: path, parse_error: e.message)
        end
      else
        result = Utils::ParserManager.parse(path)
        if result[:success]
          preview_path = save_preview(result[:text], path)
          FileRef.new(name: name, type: type, original_path: path, preview_path: preview_path)
        else
          FileRef.new(name: name, type: type, original_path: path,
                      parse_error: result[:error], parser_path: result[:parser_path])
        end
      end
    end
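The CSV branch relies on a private helper, `read_text_with_encoding_fallback`, whose body is not shown on this page. The comment describes its strategy as UTF-8 first, then GBK, then binary with replacement. One way such a fallback chain could be written is sketched below; `decode_with_fallback` is a hypothetical name, it takes raw bytes (for example from `File.binread`) rather than a path, and the real helper may differ in its details.

```ruby
# Hypothetical sketch of a UTF-8 -> GBK -> binary-with-replacement chain.
# Input: raw bytes. Output: a valid UTF-8 string.
def decode_with_fallback(raw)
  # 1. Accept the bytes as UTF-8 if they already form valid UTF-8.
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # 2. Otherwise interpret them as GBK and transcode to UTF-8.
  gbk = raw.dup.force_encoding(Encoding::GBK)
  begin
    return gbk.encode(Encoding::UTF_8) if gbk.valid_encoding?
  rescue EncodingError
    # Valid GBK framing but no Unicode mapping: fall through to step 3.
  end

  # 3. Last resort: keep ASCII, replace every other byte with "?".
  raw.dup.force_encoding(Encoding::BINARY)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

gbk_bytes = "中文".encode(Encoding::GBK).b
puts decode_with_fallback(gbk_bytes) # prints "中文"
```

The order matters: UTF-8 is checked first because its validity check is strict, whereas many byte sequences happen to be valid GBK and would otherwise be silently misread.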
.save(body:, filename:) ⇒ Hash
Store raw bytes to disk — no parsing. Used by http_server upload endpoint and channel adapters.
    # File 'lib/clacky/utils/file_processor.rb', line 89
    def self.save(body:, filename:)
      FileUtils.mkdir_p(UPLOAD_DIR)
      safe_name = sanitize_filename(filename)
      dest = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
      File.binwrite(dest, body)
      { name: safe_name, path: dest }
    end
.save_image_to_disk(body:, mime_type:, filename: "image.jpg") ⇒ Object
Save raw image bytes to disk and return a FileRef. Used by agent when an image exceeds MAX_IMAGE_BYTES and must be downgraded to disk.
    # File 'lib/clacky/utils/file_processor.rb', line 151
    def self.save_image_to_disk(body:, mime_type:, filename: "image.jpg")
      FileUtils.mkdir_p(UPLOAD_DIR)
      safe_name = sanitize_filename(filename)
      dest = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
      File.binwrite(dest, body)
      FileRef.new(name: safe_name, type: :image, original_path: dest)
    end
.supported_binary_file?(path) ⇒ Boolean
    # File 'lib/clacky/utils/file_processor.rb', line 175
    def self.supported_binary_file?(path)
      LLM_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
    end