Module: Clacky::Utils::FileProcessor

Defined in:
lib/clacky/utils/file_processor.rb

Overview

File processing pipeline.

Two entry points:

FileProcessor.save(body:, filename:)
  → Store raw bytes to disk only. Returns { name:, path: }.
    Used by http_server and channel adapters — no parsing here.

FileProcessor.process_path(path, name: nil)
  → Parse an already-saved file. Returns FileRef (with preview_path or parse_error).
    Used by agent.run when building the file prompt.

(FileProcessor.process = save + process_path in one call, for convenience.)

Defined Under Namespace

Classes: FileRef

Constant Summary

UPLOAD_DIR =
File.join(Dir.tmpdir, "clacky-uploads").freeze
MAX_FILE_BYTES =

32 MB

32 * 1024 * 1024
MAX_IMAGE_BYTES =

5 MB

5 * 1024 * 1024
MAX_FILE_SIZE =

Alias used by FileReader tool

MAX_FILE_BYTES
IMAGE_MAX_WIDTH =

Images wider than this will be downscaled before sending to LLM (pixels)

800
IMAGE_MAX_BASE64_BYTES =

Hard limit for images that can’t be resized: Anthropic/Bedrock vision API supports up to 5MB

5_000_000
BINARY_EXTENSIONS =
%w[
  .png .jpg .jpeg .gif .webp .bmp .tiff .ico .svg
  .pdf
  .zip .gz .tgz .tar .rar .7z
  .exe .dll .so .dylib
  .mp3 .mp4 .avi .mov .mkv .wav .flac
  .ttf .otf .woff .woff2
  .db .sqlite .bin .dat
  .wps .et .dps
].freeze
GLOB_ALLOWED_BINARY_EXTENSIONS =
%w[
  .pdf .doc .docx .ppt .pptx .xls .xlsx .odt .odp .ods
  .wps .et .dps
].freeze
LLM_BINARY_EXTENSIONS =
%w[.png .jpg .jpeg .gif .webp .pdf].freeze
MIME_TYPES =
{
  ".png"  => "image/png",
  ".jpg"  => "image/jpeg",
  ".jpeg" => "image/jpeg",
  ".gif"  => "image/gif",
  ".webp" => "image/webp",
  ".pdf"  => "application/pdf"
}.freeze
FILE_TYPES =
{
  ".docx" => :document,  ".doc"  => :document,
  ".xlsx" => :spreadsheet, ".xls" => :spreadsheet,
  ".pptx" => :presentation, ".ppt" => :presentation,
  ".wps"  => :document, ".et"  => :spreadsheet, ".dps" => :presentation,
  ".pdf"  => :pdf,
  ".zip"  => :zip, ".gz" => :zip, ".tgz" => :zip, ".tar" => :zip, ".rar" => :zip, ".7z" => :zip,
  ".png"  => :image, ".jpg" => :image, ".jpeg" => :image,
  ".gif"  => :image, ".webp" => :image,
  ".csv"  => :csv,
  ".md"   => :text, ".markdown" => :text, ".txt" => :text, ".log" => :text
}.freeze
TEXT_PREVIEW_EXTENSIONS =

Plain-text extensions whose raw content can be embedded directly as the preview (no external parser needed). Kept conservative to avoid pulling in huge source files by mistake.

%w[.md .markdown .txt .log].freeze
LOCAL_IMAGE_EXTENSIONS =

Image extensions that can be inlined as data URLs in markdown content.

%w[.png .jpg .jpeg .gif .webp].freeze

Class Method Details

.binary_file_path?(path) ⇒ Boolean


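Returns true when the path has an extension listed in BINARY_EXTENSIONS, or when the first 512 bytes of the file contain a NUL byte. Any error while reading the file yields false. Part of the file type helpers used by tools and the agent.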


Returns:

  • (Boolean)


# File 'lib/clacky/utils/file_processor.rb', line 190

def self.binary_file_path?(path)
  ext = File.extname(path).downcase
  return true if BINARY_EXTENSIONS.include?(ext)
  File.binread(path, 512).to_s.include?("\x00")
rescue
  false
end
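
The two-stage check above (extension first, then a NUL-byte sniff of the leading bytes) can be tried in isolation. The following is a minimal sketch, not the module's code: `SKETCH_BINARY_EXTENSIONS`, `sketch_binary_path?` and `sketch_classify_bytes` are hypothetical names, and the extension list is a small illustrative subset.

```ruby
require "tempfile"

# Illustrative subset only; the documented list is BINARY_EXTENSIONS.
SKETCH_BINARY_EXTENSIONS = %w[.png .pdf .zip].freeze

# True for a listed extension, or when the first 512 bytes contain a NUL byte.
# Any read error (missing file, permissions) is treated as "not binary".
def sketch_binary_path?(path)
  return true if SKETCH_BINARY_EXTENSIONS.include?(File.extname(path).downcase)

  File.binread(path, 512).to_s.include?("\x00")
rescue StandardError
  false
end

# Hypothetical helper: write bytes to a temp file with the given suffix and
# report how the sketch classifies it.
def sketch_classify_bytes(bytes, suffix)
  Tempfile.create(["sketch", suffix]) do |f|
    f.binmode
    f.write(bytes)
    f.flush
    sketch_binary_path?(f.path)
  end
end
```

Note that the extension check never touches the disk, so a path with a listed extension is reported as binary even if the file does not exist.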

.detect_image_mime_type(data, fallback_mime = "image/png") ⇒ String

Detect the actual image MIME type from raw binary data by inspecting magic bytes, ignoring the file extension. Returns fallback_mime (normally the extension-based guess) when data is nil, shorter than 4 bytes, or matches no known signature.

Handles: PNG, JPEG, GIF, WEBP, BMP, TIFF

Parameters:

  • data (String)

    raw binary data (the first 12 bytes are sufficient)

  • fallback_mime (String) (defaults to: "image/png")

    MIME type from extension, used as fallback

Returns:

  • (String)

    detected MIME type (e.g. “image/png”, “image/jpeg”)



# File 'lib/clacky/utils/file_processor.rb', line 423

def self.detect_image_mime_type(data, fallback_mime = "image/png")
  return fallback_mime if data.nil? || data.bytesize < 4

  bytes = data.bytes

  case
  # PNG: \x89 P N G \r \n \x1a \n
  when bytes[0] == 0x89 && bytes[1] == 0x50 && bytes[2] == 0x4E && bytes[3] == 0x47
    "image/png"
  # JPEG: \xFF \xD8 \xFF
  when bytes[0] == 0xFF && bytes[1] == 0xD8 && bytes[2] == 0xFF
    "image/jpeg"
  # GIF: GIF87a or GIF89a
  when bytes[0] == 0x47 && bytes[1] == 0x49 && bytes[2] == 0x46 && bytes[3] == 0x38
    "image/gif"
  # WEBP: RIFF .... WEBP
  when bytes[0] == 0x52 && bytes[1] == 0x49 && bytes[2] == 0x46 && bytes[3] == 0x46 &&
       data.bytesize >= 12 && data[8, 4] == "WEBP"
    "image/webp"
  # BMP: BM
  when bytes[0] == 0x42 && bytes[1] == 0x4D
    "image/bmp"
  # TIFF: II*\x00 (little-endian) or MM\x00* (big-endian)
  when (bytes[0] == 0x49 && bytes[1] == 0x49 && bytes[2] == 0x2A && bytes[3] == 0x00) ||
       (bytes[0] == 0x4D && bytes[1] == 0x4D && bytes[2] == 0x00 && bytes[3] == 0x2A)
    "image/tiff"
  else
    fallback_mime
  end
end
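
Magic-byte detection matters mostly when an extension is misleading, for example a JPEG saved as `photo.png`. The sketch below shows the same idea in a more compact form, using prefix matching on a binary copy of the data. It covers only some of the formats, and `SKETCH_SIGNATURES` and `sketch_sniff_image` are hypothetical names, not part of the module.

```ruby
# Leading-byte signatures for a few formats. The documented method compares
# bytes one at a time; prefix matching is an equivalent, shorter formulation.
SKETCH_SIGNATURES = {
  "\x89PNG".b      => "image/png",
  "\xFF\xD8\xFF".b => "image/jpeg",
  "GIF8".b         => "image/gif",
  "BM".b           => "image/bmp"
}.freeze

def sketch_sniff_image(data, fallback = "image/png")
  return fallback if data.nil? || data.bytesize < 4

  head = data.b # binary copy, so every comparison below is byte-wise
  SKETCH_SIGNATURES.each do |magic, mime|
    return mime if head.start_with?(magic)
  end

  # WEBP: "RIFF", four size bytes, then "WEBP" at byte offset 8.
  return "image/webp" if head.start_with?("RIFF".b) && head.byteslice(8, 4) == "WEBP".b

  fallback
end
```

The sketch uses `byteslice` for the WEBP check so the offset is always counted in bytes, whatever encoding the caller's string carries.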

.detect_mime_type(path, _data = nil) ⇒ Object

Look up the MIME type for the path's extension in MIME_TYPES, returning "application/octet-stream" for unknown extensions. The second argument is ignored.



# File 'lib/clacky/utils/file_processor.rb', line 206

def self.detect_mime_type(path, _data = nil)
  MIME_TYPES[File.extname(path).downcase] || "application/octet-stream"
end

.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH) ⇒ String

Downscale a base64-encoded image so its width is at most max_width pixels.

Strategy:

PNG  → chunky_png (pure Ruby, always available as gem dependency)
other formats (JPG/WEBP/GIF) → sips on macOS, `convert` (ImageMagick) on Linux
fallback (no resize tool available) → return as-is, but raise ArgumentError if the base64 data is larger than IMAGE_MAX_BASE64_BYTES

Parameters:

  • b64 (String)

    base64-encoded image data

  • mime_type (String)

    e.g. “image/png”, “image/jpeg”, “image/webp”

  • max_width (Integer) (defaults to: IMAGE_MAX_WIDTH)

    maximum output width in pixels (default: IMAGE_MAX_WIDTH)

Returns:

  • (String)

    base64-encoded (possibly downscaled) image data



# File 'lib/clacky/utils/file_processor.rb', line 221

def self.downscale_image_base64(b64, mime_type, max_width: IMAGE_MAX_WIDTH)
  require "base64"

  result = if mime_type == "image/png"
             downscale_png_chunky(b64, max_width)
           else
             downscale_via_cli(b64, mime_type, max_width)
           end

  return result if result

  # No resize tool available — enforce API hard size limit (5MB)
  if b64.bytesize > IMAGE_MAX_BASE64_BYTES
    size_kb = b64.bytesize / 1024
    limit_mb = IMAGE_MAX_BASE64_BYTES / 1_000_000
    raise ArgumentError,
      "Image too large to send (#{size_kb}KB > #{limit_mb}MB). " \
      "Install ImageMagick (`brew install imagemagick`) to enable automatic resizing."
  end
  b64
end
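
The fallback branch is the part most likely to surprise callers, so here is a minimal sketch of just that decision. The `resizer` argument is a hypothetical callable standing in for the PNG and CLI resizers, which are not shown on this page; it is assumed to return resized base64 data, or nil when nothing could resize the image.

```ruby
SKETCH_MAX_B64_BYTES = 5_000_000 # mirrors IMAGE_MAX_BASE64_BYTES

# Prefer the resized result; otherwise pass the input through unchanged,
# unless it is over the hard limit, in which case raise.
def sketch_downscale(b64, resizer)
  resized = resizer.call(b64)
  return resized if resized

  if b64.bytesize > SKETCH_MAX_B64_BYTES
    raise ArgumentError, "Image too large to send (#{b64.bytesize} bytes)"
  end
  b64
end

no_tool = ->(_b64) { nil } # simulates a machine without any resize tool
puts sketch_downscale("QUJD", no_tool) # small input is returned as-is
```

Callers that cannot guarantee a resize tool is installed should be prepared to rescue ArgumentError from this path.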

.file_to_base64(path) ⇒ Object

Raises:

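  • (ArgumentError)

    if the file is larger than MAX_FILE_BYTES, or if an image could not be resized and its base64 data exceeds IMAGE_MAX_BASE64_BYTES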


# File 'lib/clacky/utils/file_processor.rb', line 243

def self.file_to_base64(path)
  require "base64"
  ext  = File.extname(path).downcase
  size = File.size(path)
  raise ArgumentError, "File too large: #{path}" if size > MAX_FILE_BYTES
  ext_mime = MIME_TYPES[ext] || "application/octet-stream"
  raw_data = File.binread(path)
  # Detect actual image format from magic bytes (ignore misleading extensions)
  mime = ext_mime.start_with?("image/") ? detect_image_mime_type(raw_data, ext_mime) : ext_mime
  data = Base64.strict_encode64(raw_data)
  # Downscale images before sending to LLM to reduce token cost
  data = downscale_image_base64(data, mime) if mime.start_with?("image/")
  { format: ext[1..], mime_type: mime, size_bytes: size, base64_data: data }
end
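
The size checks in these helpers apply to different quantities: MAX_FILE_BYTES and MAX_IMAGE_BYTES are compared against raw file sizes, while IMAGE_MAX_BASE64_BYTES is compared against the encoded payload. The sketch below works out the relationship between the two; `sketch_base64_length` is a hypothetical helper, not part of the module.

```ruby
require "base64"

# Padded base64 emits 4 output bytes for every 3 input bytes, rounding up.
def sketch_base64_length(raw_bytes)
  4 * ((raw_bytes + 2) / 3)
end

# The formula agrees with the encoder on a small sample.
sample = "a" * 10
puts Base64.strict_encode64(sample).bytesize == sketch_base64_length(sample.bytesize)

# A raw image exactly at MAX_IMAGE_BYTES (5 * 1024 * 1024) encodes to roughly
# 7 MB, which is above the 5_000_000-byte base64 limit if it cannot be resized.
puts sketch_base64_length(5 * 1024 * 1024)
```

In other words, an image that passes the raw-size check can still trip the base64 limit when no resize tool is available; about 3.75 MB of raw data is the break-even point.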

.glob_allowed_binary?(path) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/clacky/utils/file_processor.rb', line 198

def self.glob_allowed_binary?(path)
  GLOB_ALLOWED_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
end

.image_path_to_data_url(path) ⇒ Object

Raises:

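  • (ArgumentError)

    if the file does not exist, is larger than MAX_IMAGE_BYTES, or could not be resized and its base64 data exceeds IMAGE_MAX_BASE64_BYTES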


# File 'lib/clacky/utils/file_processor.rb', line 258

def self.image_path_to_data_url(path)
  raise ArgumentError, "Image file not found: #{path}" unless File.exist?(path)
  size = File.size(path)
  if size > MAX_IMAGE_BYTES
    raise ArgumentError, "Image too large (#{size / 1024}KB > #{MAX_IMAGE_BYTES / 1024}KB): #{path}"
  end
  require "base64"
  # Extension-based guess as fallback only
  ext  = File.extname(path).downcase.delete(".")
  ext_mime = case ext
             when "jpg", "jpeg" then "image/jpeg"
             when "png"         then "image/png"
             when "gif"         then "image/gif"
             when "webp"        then "image/webp"
             else "image/#{ext}"
             end
  raw_data = File.binread(path)
  # Detect actual image format from magic bytes (ignore misleading extensions)
  mime = detect_image_mime_type(raw_data, ext_mime)
  b64 = Base64.strict_encode64(raw_data)
  # Downscale images before sending to LLM to reduce token cost
  b64 = downscale_image_base64(b64, mime)
  "data:#{mime};base64,#{b64}"
end
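
For reference, the final string assembled on the last line is an ordinary `data:` URL. A minimal sketch of that step alone, leaving out MIME detection and downscaling, might look as follows (`sketch_data_url` is a hypothetical name):

```ruby
require "base64"

# Build a data URL from raw bytes and an already-known MIME type.
# strict_encode64 is used because it emits no line breaks.
def sketch_data_url(raw_bytes, mime)
  "data:#{mime};base64,#{Base64.strict_encode64(raw_bytes)}"
end

url = sketch_data_url("hi", "image/png")
puts url # => data:image/png;base64,aGk=

# The payload after the comma decodes back to the original bytes.
puts Base64.strict_decode64(url.split(",", 2).last)
```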

.inline_local_images(content) ⇒ String

Replace local image paths in markdown content with base64 data URLs.

Handles both `file:///path/to/img.png` and bare `/path/to/img.png` in markdown image syntax `![alt](src)`.

Parameters:

  • content (String)

    markdown text potentially containing local image references

Returns:

  • (String)

    content with local images replaced by data URLs



# File 'lib/clacky/utils/file_processor.rb', line 539

def self.inline_local_images(content)
  return content if content.nil? || content.empty?

  content.gsub(%r{(!\[[^\]]*\])\((file://)?(/[^)]+)\)}) do
    prefix     = $1
    _scheme    = $2
    raw_path   = $3
    path       = CGI.unescape(raw_path)
    ext        = File.extname(path).downcase
    full_match = $&

    unless LOCAL_IMAGE_EXTENSIONS.include?(ext) && File.exist?(path)
      next full_match
    end

    begin
      data_url = image_path_to_data_url(path)
      Clacky::Logger.info("file_processor.inline_local_images", path: path, size: File.size(path))
      "#{prefix}(#{data_url})"
    rescue StandardError => e
      Clacky::Logger.warn("file_processor.inline_local_images.failed", path: path, error: e.message)
      full_match
    end
  end
end
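
To see which references the pattern above actually picks up, the same regular expression can be run with `scan`. This is a sketch for experimentation only; `SKETCH_LOCAL_IMAGE_RE` and `sketch_local_image_refs` are hypothetical names, and the extension and file-existence checks from the method are left out.

```ruby
# Same pattern as in the method: group 1 is the "![alt]" part, group 2 the
# optional "file://" scheme, group 3 the absolute path.
SKETCH_LOCAL_IMAGE_RE = %r{(!\[[^\]]*\])\((file://)?(/[^)]+)\)}

# Return the absolute paths referenced by markdown image syntax.
def sketch_local_image_refs(markdown)
  markdown.scan(SKETCH_LOCAL_IMAGE_RE).map { |_prefix, _scheme, path| path }
end

doc = "![a](file:///tmp/x.png) ![b](/tmp/y.jpg) ![c](https://example.com/z.png) ![d](img/rel.png)"
p sketch_local_image_refs(doc) # => ["/tmp/x.png", "/tmp/y.jpg"]
```

Remote URLs and relative paths do not match, because the pattern requires the path to start with `/` directly after the opening parenthesis or the `file://` scheme.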

.process(body:, filename:) ⇒ FileRef

Save + parse in one call (convenience method).

Returns:

  • (FileRef)


# File 'lib/clacky/utils/file_processor.rb', line 171

def self.process(body:, filename:)
  saved = save(body: body, filename: filename)
  process_path(saved[:path], name: saved[:name])
end

.process_path(path, name: nil) ⇒ FileRef

Parse an already-saved file and return a FileRef. Called by agent.run for each disk file before building the prompt.

Parameters:

  • path (String)

    Path to the file on disk

  • name (String) (defaults to: nil)

    Display name (defaults to basename)

Returns:

  • (FileRef)


# File 'lib/clacky/utils/file_processor.rb', line 112

def self.process_path(path, name: nil)
  name ||= File.basename(path.to_s)
  # Use compound extension for .tar.gz so it's treated as a tarball, not gzip.
  basename_lower = name.to_s.downcase
  ext =
    if basename_lower.end_with?(".tar.gz")
      ".tar.gz"
    else
      File.extname(path.to_s).downcase
    end
  type  = FILE_TYPES[ext] || :file

  case ext
  when ".zip"
    body            = File.binread(path)
    preview_content = parse_zip_listing(body)
    preview_path    = save_preview(preview_content, path)
    FileRef.new(name: name, type: :zip, original_path: path, preview_path: preview_path)

  when ".tar", ".tar.gz", ".tgz", ".gz"
    # Archive listing for tarballs and gzip'd files. Provides the LLM a
    # file-tree preview so it can decide whether to ask the user to
    # extract them (via the shell tool).
    begin
      preview_content = parse_tar_listing(path, ext)
      preview_path    = save_preview(preview_content, path)
      FileRef.new(name: name, type: :zip, original_path: path, preview_path: preview_path)
    rescue => e
      FileRef.new(name: name, type: :zip, original_path: path, parse_error: e.message)
    end

  when ".png", ".jpg", ".jpeg", ".gif", ".webp"
    FileRef.new(name: name, type: :image, original_path: path)

  when ".csv"
    # CSV is plain text — the file itself IS the preview. No parser, no copy.
    # FileReader handles encoding fallback via safe_utf8 when it reads the file.
    FileRef.new(name: name, type: :csv, original_path: path, preview_path: path)

  when *TEXT_PREVIEW_EXTENSIONS
    # Markdown / plain text / log: the file itself IS the preview.
    # No parser needed, no tmpdir copy — just point preview_path at the original.
    FileRef.new(name: name, type: :text, original_path: path, preview_path: path)

  else
    result = Utils::ParserManager.parse(path)
    if result[:success]
      preview_path = save_preview(result[:text], path)
      FileRef.new(name: name, type: type, original_path: path, preview_path: preview_path)
    else
      FileRef.new(name: name, type: type, original_path: path,
                  parse_error: result[:error], parser_path: result[:parser_path])
    end
  end
end
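
The compound-extension step at the top of the method exists because `File.extname` only returns the last suffix, so `backup.tar.gz` would otherwise be seen as `.gz`. A minimal sketch of that step, with the hypothetical name `sketch_effective_ext`:

```ruby
# Resolve the extension used for dispatch: ".tar.gz" is recognised from the
# display name; everything else falls back to File.extname on the path.
def sketch_effective_ext(path, name = nil)
  name ||= File.basename(path.to_s)
  return ".tar.gz" if name.to_s.downcase.end_with?(".tar.gz")

  File.extname(path.to_s).downcase
end

puts File.extname("backup.tar.gz")                # => .gz
puts sketch_effective_ext("/tmp/ab12_backup.tar.gz") # => .tar.gz
```

As in the method, the compound check looks at the display name while the fallback looks at the path, so the two can disagree when a caller passes a name unrelated to the file on disk.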

.rewrite_local_image_urls(content) ⇒ String?

Rewrite local image paths in markdown content to use the /api/local-image proxy.

Matches two patterns inside `![alt](url)`:

1. file:// URLs  →  ![alt](/api/local-image?path=file:///abs/path.png)
2. bare absolute paths  →  ![alt](/api/local-image?path=/abs/path.png)

https:// URLs, non-image files, and paths that do not exist on disk are left untouched.

Parameters:

  • content (String, nil)

    markdown text

Returns:

  • (String, nil)

    rewritten content (or original if nothing matched)



# File 'lib/clacky/utils/file_processor.rb', line 582

def self.rewrite_local_image_urls(content)
  return content if content.nil? || content.empty?

  content.gsub(/!\[([^\]]*)\]\(((?:file:\/\/)?\/[^)]+)\)/) do |match|
    alt  = Regexp.last_match(1)
    href = Regexp.last_match(2)

    # Extract the filesystem path from the href
    path = href.sub(%r{\Afile://}, "")
    path = CGI.unescape(path)

    ext = File.extname(path).downcase
    if LOCAL_IMAGE_EXTENSIONS.include?(ext) && File.exist?(path)
      encoded = CGI.escape(href)
      "![#{alt}](/api/local-image?path=#{encoded})"
    else
      match # return original match unchanged
    end
  end
end
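
One detail worth noting from the code above: the whole href, including any `file://` prefix, is passed through `CGI.escape`, so the query value in the rewritten URL is percent-encoded rather than the literal path. A minimal sketch of the replacement string alone (`sketch_proxy_url` is a hypothetical name):

```ruby
require "cgi"

# Build the proxied markdown image reference for a local href.
def sketch_proxy_url(alt, href)
  "![#{alt}](/api/local-image?path=#{CGI.escape(href)})"
end

puts sketch_proxy_url("logo", "/abs/path.png")
# => ![logo](/api/local-image?path=%2Fabs%2Fpath.png)

# The proxy endpoint can recover the original href with CGI.unescape.
puts CGI.unescape(CGI.escape("file:///abs/my img.png"))
```

Whether the `/api/local-image` endpoint strips the `file://` prefix itself is not shown on this page; the sketch only demonstrates the encoding round trip.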

.save(body:, filename:) ⇒ Hash

Store raw bytes to disk — no parsing. Used by http_server upload endpoint and channel adapters.

Returns:

  • (Hash)

    { name: String, path: String }



# File 'lib/clacky/utils/file_processor.rb', line 98

def self.save(body:, filename:)
  FileUtils.mkdir_p(UPLOAD_DIR)
  safe_name = sanitize_filename(filename)
  dest      = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  { name: safe_name, path: dest }
end
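
The random hex prefix means two uploads with the same filename never overwrite each other. The sketch below reproduces the naming scheme against a temporary directory. `sanitize_filename` is not shown on this page, so `sketch_sanitize` is a hypothetical stand-in whose rules are an assumption, and `sketch_save` takes the target directory as an argument instead of using UPLOAD_DIR.

```ruby
require "fileutils"
require "securerandom"
require "tmpdir"

# Hypothetical sanitizer (the real rules are not documented here): drop any
# directory part and replace characters outside [A-Za-z0-9_.-] with "_".
def sketch_sanitize(filename)
  File.basename(filename.to_s).gsub(/[^\w.\-]/, "_")
end

# Write the bytes under "<16 hex chars>_<safe name>" and report both the
# display name and the on-disk path.
def sketch_save(body, filename, dir)
  FileUtils.mkdir_p(dir)
  safe_name = sketch_sanitize(filename)
  dest      = File.join(dir, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  { name: safe_name, path: dest }
end

Dir.mktmpdir do |dir|
  first  = sketch_save("one", "my report.txt", dir)
  second = sketch_save("two", "my report.txt", dir)
  puts first[:name]                   # same display name for both uploads
  puts first[:path] != second[:path]  # distinct files on disk
end
```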

.save_image_to_disk(body:, mime_type:, filename: "image.jpg") ⇒ Object

Save raw image bytes to disk and return a FileRef. Used by the agent when an image exceeds MAX_IMAGE_BYTES and must be downgraded to disk. The mime_type argument is accepted but not used by the code shown.



# File 'lib/clacky/utils/file_processor.rb', line 178

def self.save_image_to_disk(body:, mime_type:, filename: "image.jpg")
  FileUtils.mkdir_p(UPLOAD_DIR)
  safe_name = sanitize_filename(filename)
  dest      = File.join(UPLOAD_DIR, "#{SecureRandom.hex(8)}_#{safe_name}")
  File.binwrite(dest, body)
  FileRef.new(name: safe_name, type: :image, original_path: dest)
end

.supported_binary_file?(path) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/clacky/utils/file_processor.rb', line 202

def self.supported_binary_file?(path)
  LLM_BINARY_EXTENSIONS.include?(File.extname(path).downcase)
end
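
The three extension predicates answer different questions, and a single file can get a different answer from each. The sketch below compares them side by side using abbreviated copies of the lists; the `SKETCH_*` constants and `sketch_classify` are hypothetical, and only the LLM list is reproduced in full.

```ruby
SKETCH_BINARY = %w[.png .jpg .pdf .zip .wps].freeze        # subset of BINARY_EXTENSIONS
SKETCH_GLOB   = %w[.pdf .doc .docx .wps].freeze            # subset of GLOB_ALLOWED_BINARY_EXTENSIONS
SKETCH_LLM    = %w[.png .jpg .jpeg .gif .webp .pdf].freeze # LLM_BINARY_EXTENSIONS

# Report how each extension list treats the given path (extension only;
# the NUL-byte sniff done by binary_file_path? is not reproduced here).
def sketch_classify(path)
  ext = File.extname(path).downcase
  {
    binary_ext:    SKETCH_BINARY.include?(ext),
    glob_allowed:  SKETCH_GLOB.include?(ext),
    llm_supported: SKETCH_LLM.include?(ext)
  }
end

p sketch_classify("report.PDF")
p sketch_classify("notes.docx")
p sketch_classify("bundle.zip")
```

A `.pdf` is on all three lists, a `.docx` is glob-allowed without being on the binary extension list, and a `.zip` is binary by extension but neither glob-allowed nor sendable to the LLM as-is.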