Module: Clacky::Utils::FileProcessor

Defined in:
lib/clacky/utils/file_processor.rb

Overview

File processing pipeline.

Two entry points:

FileProcessor.save(body:, filename:)
  → Store raw bytes to disk only. Returns { name:, path: }.
    Used by http_server and channel adapters — no parsing here.

FileProcessor.process_path(path, name: nil)
  → Parse an already-saved file. Returns FileRef (with preview_path or parse_error).
    Used by agent.run when building the file prompt.

(FileProcessor.process = save + process_path in one call, for convenience.)

Defined Under Namespace

Classes: FileRef

Constant Summary

UPLOAD_DIR =
File.join(Dir.tmpdir, "clacky-uploads").freeze
MAX_FILE_BYTES =
# 32 MB
32 * 1024 * 1024
MAX_IMAGE_BYTES =
# 5 MB
5 * 1024 * 1024
MAX_FILE_SIZE =
# Alias used by FileReader tool
MAX_FILE_BYTES
IMAGE_MAX_WIDTH =
# Images wider than this will be downscaled before sending to LLM (pixels)
800
IMAGE_MAX_BASE64_BYTES =
# Hard limit for images that can't be resized: Anthropic/Bedrock vision API supports up to 5MB
5_000_000
BINARY_EXTENSIONS =
%w[
  .png .jpg .jpeg .gif .webp .bmp .tiff .ico .svg
  .pdf
  .zip .gz .tar .rar .7z
  .exe .dll .so .dylib
  .mp3 .mp4 .avi .mov .mkv .wav .flac
  .ttf .otf .woff .woff2
  .db .sqlite .bin .dat
].freeze
GLOB_ALLOWED_BINARY_EXTENSIONS =
%w[
  .pdf .doc .docx .ppt .pptx .xls .xlsx .odt .odp .ods
].freeze
LLM_BINARY_EXTENSIONS =
%w[.png .jpg .jpeg .gif .webp .pdf].freeze
MIME_TYPES =
{
  ".png"  => "image/png",
  ".jpg"  => "image/jpeg",
  ".jpeg" => "image/jpeg",
  ".gif"  => "image/gif",
  ".webp" => "image/webp",
  ".pdf"  => "application/pdf"
}.freeze
FILE_TYPES =
{
  ".docx" => :document,  ".doc"  => :document,
  ".xlsx" => :spreadsheet, ".xls" => :spreadsheet,
  ".pptx" => :presentation, ".ppt" => :presentation,
  ".pdf"  => :pdf,
  ".zip"  => :zip, ".gz" => :zip, ".tar" => :zip, ".rar" => :zip, ".7z" => :zip,
  ".png"  => :image, ".jpg" => :image, ".jpeg" => :image,
  ".gif"  => :image, ".webp" => :image,
  ".csv"  => :csv
}.freeze

Class Method Details

.binary_file_path?(path) ⇒ Boolean


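File type helper (used by tools and agent). Returns true when the extension is listed in BINARY_EXTENSIONS or when the first 512 bytes of the file contain a NUL byte. Any error while reading the file (for example, a missing path) yields false.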


Returns:

  • (Boolean)


# File 'lib/clacky/utils/file_processor.rb', line 163

def self.binary_file_path?(path)
  ext = File.extname(path).downcase
  return true if BINARY_EXTENSIONS.include?(ext)
  File.binread(path, 512).to_s.include?("\x00")
rescue
  false
end
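
The two-step check above (extension first, then a NUL-byte sniff of the first 512 bytes) can be tried in isolation. The following is only a sketch: sketch_binary? and with_temp_file are hypothetical names, and the extension list is a shortened stand-in for BINARY_EXTENSIONS.

```ruby
require "tempfile"

# Hypothetical stand-in for binary_file_path?, with a shortened extension list.
def sketch_binary?(path)
  binary_extensions = %w[.png .pdf .zip]
  return true if binary_extensions.include?(File.extname(path).downcase)

  # binread returns nil for an empty file, hence to_s before include?.
  File.binread(path, 512).to_s.include?("\x00")
rescue StandardError
  # Unreadable or missing files are treated as "not binary".
  false
end

# Write bytes to a temporary file, yield its path, then clean up.
def with_temp_file(suffix, bytes)
  file = Tempfile.new(["sketch", suffix])
  file.binmode
  file.write(bytes)
  file.close
  yield file.path
ensure
  file&.unlink
end

text_result    = with_temp_file(".txt", "hello\n") { |p| sketch_binary?(p) }
nul_result     = with_temp_file(".txt", "ab\x00cd") { |p| sketch_binary?(p) }
ext_result     = sketch_binary?("report.PDF")        # extension match, file never opened
missing_result = sketch_binary?("/no/such/file.txt") # read error is rescued

puts [text_result, nul_result, ext_result, missing_result].inspect
```

Note that the extension check short-circuits before any I/O, so a path with a listed extension is reported as binary even if the file does not exist.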

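.detect_mime_type(path, _data = nil) ⇒ Object

Looks up the MIME type from the file extension in MIME_TYPES and falls back to "application/octet-stream" for unknown extensions. The _data argument is ignored.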



# File 'lib/clacky/utils/file_processor.rb', line 179

def self.detect_mime_type(path, _data = nil)
  MIME_TYPES[File.extname(path).downcase] || "application/octet-stream"
end

.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH) ⇒ String

Downscale a base64-encoded image so its width is at most max_width pixels.

Strategy:

PNG  → chunky_png (pure Ruby, always available as gem dependency)
other formats (JPG/WEBP/GIF) → sips on macOS, `convert` (ImageMagick) on Linux
fallback (no CLI tool) → return as-is, but raise if larger than IMAGE_MAX_BASE64_BYTES

Parameters:

  • b64 (String)

    base64-encoded image data

  • mime_type (String)

    e.g. "image/png", "image/jpeg", "image/webp"

  • max_width (Integer) (defaults to: IMAGE_MAX_WIDTH)

    maximum output width in pixels (default: IMAGE_MAX_WIDTH)

Returns:

  • (String)

    base64-encoded (possibly downscaled) image data



# File 'lib/clacky/utils/file_processor.rb', line 194

def self.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH)
  require "base64"

  result = if mime_type == "image/png"
             downscale_png_chunky(b64, max_width)
           else
             downscale_via_cli(b64, mime_type, max_width)
           end

  return result if result

  # No resize tool available — enforce API hard size limit (5MB)
  if b64.bytesize > IMAGE_MAX_BASE64_BYTES
    size_kb = b64.bytesize / 1024
    limit_mb = IMAGE_MAX_BASE64_BYTES / 1_000_000
    raise ArgumentError,
      "Image too large to send (#{size_kb}KB > #{limit_mb}MB). " \
      "Install ImageMagick (`brew install imagemagick`) to enable automatic resizing."
  end
  b64
end
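
Note that the fallback branch compares the Base64 string's size, not the raw file size, against IMAGE_MAX_BASE64_BYTES. Base64 emits 4 characters for every 3 input bytes, so the limit is reached well before the raw image is 5 MB. The arithmetic below is a sketch; base64_length is a hypothetical helper, and limit mirrors the constant's value.

```ruby
require "base64"

# Length of the padded Base64 encoding of raw_bytes bytes of input.
def base64_length(raw_bytes)
  ((raw_bytes + 2) / 3) * 4
end

limit = 5_000_000 # same value as IMAGE_MAX_BASE64_BYTES

# Largest raw size whose encoding does not exceed the limit.
max_raw = (limit / 4) * 3 # 3_750_000

# Cross-check the formula against the standard library on a small input.
sample_length = Base64.strict_encode64("x" * 10).bytesize

puts max_raw
puts base64_length(max_raw)     # exactly at the limit, so not rejected
puts base64_length(max_raw + 1) # one more raw byte pushes it over
puts sample_length == base64_length(10)
```

Under these assumptions, an image that cannot be resized is rejected once its raw size exceeds roughly 3.75 MB.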

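.file_to_base64(path) ⇒ Object

Reads the file and returns a Hash with the keys format:, mime_type:, size_bytes: and base64_data:. Images are downscaled via downscale_image_base64 before being returned. Raises ArgumentError when the file is larger than MAX_FILE_BYTES.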

Raises:

  • (ArgumentError)


# File 'lib/clacky/utils/file_processor.rb', line 216

def self.file_to_base64(path)
  require "base64"
  ext  = File.extname(path).downcase
  size = File.size(path)
  raise ArgumentError, "File too large: #{path}" if size > MAX_FILE_BYTES
  mime = MIME_TYPES[ext] || "application/octet-stream"
  data = Base64.strict_encode64(File.binread(path))
  # Downscale images before sending to LLM to reduce token cost
  data = downscale_image_base64(data, mime) if mime.start_with?("image/")
  { format: ext[1..], mime_type: mime, size_bytes: size, base64_data: data }
end

.glob_allowed_binary?(path) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/clacky/utils/file_processor.rb', line 171

def self.glob_allowed_binary?(path)
  GLOB_ALLOWED_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
end

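.image_path_to_data_url(path) ⇒ Object

Encodes an image file as a data URL of the form data:<mime>;base64,<payload>, downscaling it first via downscale_image_base64. Raises ArgumentError when the file does not exist or is larger than MAX_IMAGE_BYTES.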

Raises:

  • (ArgumentError)


# File 'lib/clacky/utils/file_processor.rb', line 228

def self.image_path_to_data_url(path)
  raise ArgumentError, "Image file not found: #{path}" unless File.exist?(path)
  size = File.size(path)
  if size > MAX_IMAGE_BYTES
    raise ArgumentError, "Image too large (#{size / 1024}KB > #{MAX_IMAGE_BYTES / 1024}KB): #{path}"
  end
  require "base64"
  ext  = File.extname(path).downcase.delete(".")
  mime = case ext
         when "jpg", "jpeg" then "image/jpeg"
         when "png"         then "image/png"
         when "gif"         then "image/gif"
         when "webp"        then "image/webp"
         else "image/#{ext}"
         end
  b64 = Base64.strict_encode64(File.binread(path))
  # Downscale images before sending to LLM to reduce token cost
  b64 = downscale_image_base64(b64, mime)
  "data:#{mime};base64,#{b64}"
end
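
The data URL layout produced by the last line of the method can be illustrated without touching the filesystem. This is a minimal sketch that skips the size check and the downscaling step; sketch_data_url and sketch_parse_data_url are hypothetical helpers, not part of FileProcessor.

```ruby
require "base64"

# Build a data URL from raw bytes and a MIME type.
def sketch_data_url(bytes, mime)
  "data:#{mime};base64,#{Base64.strict_encode64(bytes)}"
end

# Split a data URL back into its MIME type and raw bytes.
def sketch_parse_data_url(url)
  header, payload = url.split(",", 2)
  mime = header.delete_prefix("data:").delete_suffix(";base64")
  [mime, Base64.strict_decode64(payload)]
end

# The 8-byte PNG file signature, used here as tiny sample input.
png_signature = "\x89PNG\r\n\x1A\n".b

url = sketch_data_url(png_signature, "image/png")
mime, bytes = sketch_parse_data_url(url)

puts url
puts mime
puts bytes == png_signature
```

strict_encode64 is used, as in the method above, because it emits no line breaks, which keeps the URL on a single line.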

.process(body:, filename:) ⇒ FileRef

Save + parse in one call (convenience method).

Returns:

  • (FileRef)

# File 'lib/clacky/utils/file_processor.rb', line 144

def self.process(body:, filename:)
  saved = save(body: body, filename: filename)
  process_path(saved[:path], name: saved[:name])
end

.process_path(path, name: nil) ⇒ FileRef

Parse an already-saved file and return a FileRef. Called by agent.run for each disk file before building the prompt.

Parameters:

  • path (String)

    Path to the file on disk

  • name (String) (defaults to: nil)

    Display name (defaults to basename)

Returns:

  • (FileRef)

# File 'lib/clacky/utils/file_processor.rb', line 103

def self.process_path(path, name: nil)
  name ||= File.basename(path.to_s)
  ext   = File.extname(path.to_s).downcase
  type  = FILE_TYPES[ext] || :file

  case ext
  when ".zip"
    body            = File.binread(path)
    preview_content = parse_zip_listing(body)
    preview_path    = save_preview(preview_content, path)
    FileRef.new(name: name, type: :zip, original_path: path, preview_path: preview_path)

  when ".png", ".jpg", ".jpeg", ".gif", ".webp"
    FileRef.new(name: name, type: :image, original_path: path)

  when ".csv"
    # CSV is plain text — read directly, no external parser needed.
    # Try UTF-8 first, then GBK (common in Chinese-origin CSV), then binary with replacement.
    begin
      text         = read_text_with_encoding_fallback(path)
      preview_path = save_preview(text, path)
      FileRef.new(name: name, type: :csv, original_path: path, preview_path: preview_path)
    rescue => e
      FileRef.new(name: name, type: :csv, original_path: path, parse_error: e.message)
    end

  else
    result = Utils::ParserManager.parse(path)
    if result[:success]
      preview_path = save_preview(result[:text], path)
      FileRef.new(name: name, type: type, original_path: path, preview_path: preview_path)
    else
      FileRef.new(name: name, type: type, original_path: path,
                  parse_error: result[:error], parser_path: result[:parser_path])
    end
  end
end
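
The CSV branch relies on read_text_with_encoding_fallback, whose body is not shown on this page. Going only by the comment (UTF-8 first, then GBK, then binary with replacement), one possible shape is sketched below. sketch_decode_text is a hypothetical name, and it works on a byte string rather than a path to keep the example self-contained.

```ruby
# Hypothetical decoder following the order described in the CSV comment.
def sketch_decode_text(bytes)
  # 1. Accept the bytes unchanged if they are already valid UTF-8.
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  begin
    # 2. Otherwise interpret them as GBK and transcode to UTF-8.
    return bytes.dup.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
  rescue EncodingError
    # Not valid GBK either; fall through to the lossy path.
  end

  # 3. Last resort: keep ASCII and replace every other byte with "?".
  bytes.dup.force_encoding(Encoding::BINARY)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

from_utf8  = sketch_decode_text("héllo".b)
from_gbk   = sketch_decode_text("\xC4\xE3\xBA\xC3".b) # "你好" encoded as GBK
from_other = sketch_decode_text("a\xFFb".b)           # invalid as UTF-8

puts from_utf8
puts from_gbk
puts from_other.valid_encoding?
```

Whatever the real helper does, the property the caller needs is the one checked at the end: the result is always a valid UTF-8 string, so save_preview never receives undecodable text.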

.save(body:, filename:) ⇒ Hash

Store raw bytes to disk — no parsing. Used by http_server upload endpoint and channel adapters.

Returns:

  • (Hash)

    { name: String, path: String }



# File 'lib/clacky/utils/file_processor.rb', line 89

def self.save(body:, filename:)
  FileUtils.mkdir_p(UPLOAD_DIR)
  safe_name = sanitize_filename(filename)
  dest      = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  { name: safe_name, path: dest }
end
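
The random hex prefix means that two uploads with the same filename never overwrite each other, while the returned name stays human-readable. The sketch below reproduces that behaviour in a throwaway directory. sanitize_filename is not shown on this page, so sketch_sanitize is a hypothetical substitute, and the dir: parameter is added here only so the example does not write to UPLOAD_DIR.

```ruby
require "fileutils"
require "securerandom"
require "tmpdir"

# Hypothetical sanitizer: drop directory parts, replace unusual characters.
def sketch_sanitize(filename)
  base = File.basename(filename.to_s).gsub(/[^\w.\-]/, "_")
  base.empty? ? "upload" : base
end

# Same steps as save, but writing into a caller-supplied directory.
def sketch_save(body:, filename:, dir:)
  FileUtils.mkdir_p(dir)
  safe_name = sketch_sanitize(filename)
  dest = File.join(dir, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  { name: safe_name, path: dest }
end

results = Dir.mktmpdir("sketch-uploads") do |dir|
  first  = sketch_save(body: "a,b\n1,2\n", filename: "../../etc/report 1.csv", dir: dir)
  second = sketch_save(body: "a,b\n3,4\n", filename: "../../etc/report 1.csv", dir: dir)
  # Read back inside the block; the directory is removed when the block ends.
  { first: first, second: second, first_body: File.binread(first[:path]) }
end

puts results[:first][:name]
puts results[:first][:path] != results[:second][:path]
```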

.save_image_to_disk(body:, mime_type:, filename: "image.jpg") ⇒ Object

Save raw image bytes to disk and return a FileRef. Used by the agent when an image exceeds MAX_IMAGE_BYTES and has to be stored on disk. The mime_type argument is accepted but not used in the method body shown here.



# File 'lib/clacky/utils/file_processor.rb', line 151

def self.save_image_to_disk(body:, mime_type:, filename: "image.jpg")
  FileUtils.mkdir_p(UPLOAD_DIR)
  safe_name = sanitize_filename(filename)
  dest      = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  FileRef.new(name: safe_name, type: :image, original_path: dest)
end

.supported_binary_file?(path) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/clacky/utils/file_processor.rb', line 175

def self.supported_binary_file?(path)
  LLM_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
end