Module: Rubino::Attachments::Classify
- Defined in:
- lib/rubino/attachments/classify.rb
Overview
Deterministic, no-LLM attachment classifier with a fail-closed safety pipeline. Magic bytes (Marcel content sniff) WIN over the extension; the extension only breaks ties when the sniff returns octet-stream, and any magic/extension disagreement resolves to the STRICTER kind (never upgraded to :image or :text). Reuses the gem’s existing primitives (Tools::ReadTool’s magic-byte binary? detector and the Tools::Base realpath confine) rather than adding a second classifier.
Constant Summary
- IMAGE_MIMES =
%w[ image/png image/jpeg image/gif image/webp image/bmp image/tiff image/x-ms-bmp ].freeze
- DOCUMENT_MIMES =
SVG is XML -> treat as text, never as a native image.
%w[
  application/pdf
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
  application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  application/vnd.openxmlformats-officedocument.presentationml.presentation
  application/vnd.oasis.opendocument.text
  application/vnd.oasis.opendocument.spreadsheet
  application/msword
  application/vnd.ms-excel
  application/vnd.ms-powerpoint
  application/rtf
  text/rtf
].freeze
- ARCHIVE_MIMES =
%w[
  application/zip
  application/x-tar
  application/gzip
  application/x-gzip
  application/x-7z-compressed
  application/x-rar-compressed
  application/vnd.rar
  application/x-bzip2
  application/x-xz
].freeze
- IMAGE_EXTS =
%w[.png .jpg .jpeg .gif .webp .bmp .tiff .tif].freeze
- IMAGE_SIGNATURES =
Leading magic bytes per recognised image MIME (WebP is special-cased: RIFF container + WEBP tag). Marcel lets the file NAME break the tie when the content sniff only yields a generic type (text/plain, octet-stream), so a text file renamed fake.png came back image/png and was shipped to the provider (#158). An image verdict must therefore be backed by the actual signature.
{
  "image/png" => ["\x89PNG\r\n\x1a\n".b],
  "image/jpeg" => ["\xFF\xD8\xFF".b],
  "image/gif" => ["GIF87a".b, "GIF89a".b],
  "image/bmp" => ["BM".b],
  "image/x-ms-bmp" => ["BM".b],
  "image/tiff" => ["II*\x00".b, "MM\x00*".b]
}.freeze
Class Method Summary
-
.base_helper ⇒ Object
A throwaway ReadTool instance gives us binary?/canonical_path without re-implementing the magic-byte list or the realpath confine.
-
.call(path, confine_dir: nil) ⇒ Object
Returns a Classification.
- .classify_kind(real) ⇒ Object
-
.image_signature?(real, mime) ⇒ Boolean
True when the file’s leading bytes carry the signature mime claims.
-
.kind_for_mime(mime) ⇒ Object
Maps a recognised MIME to a kind.
-
.textual_application_mime?(mime) ⇒ Boolean
JSON/XML/YAML/JS and friends arrive as application/* but are text.
- .unsafe(path, reason) ⇒ Object
Class Method Details
.base_helper ⇒ Object
A throwaway ReadTool instance gives us binary?/canonical_path without re-implementing the magic-byte list or the realpath confine. They are protected on Tools::Base, so we reach them with send – deliberate reuse of the audited primitives rather than a second copy.
# File 'lib/rubino/attachments/classify.rb', line 161

def base_helper
  @base_helper ||= Tools::ReadTool.new
end
.call(path, confine_dir: nil) ⇒ Object
Returns a Classification. Never raises on suspicious input; it returns safe: false so the executor skips the attachment with a warning.
# File 'lib/rubino/attachments/classify.rb', line 57

def call(path, confine_dir: nil)
  original = path.to_s

  # --- Safety pipeline (BEFORE classify; order matters; fail closed) ---
  # 1. lstat first: reject symlink/FIFO/device/socket (non-regular).
  lst = begin
    File.lstat(original)
  rescue SystemCallError => e
    return unsafe(original, "cannot stat: #{e.class}")
  end
  return unsafe(original, "not a regular file (#{lst.ftype})") unless lst.file?

  # 2. realpath-confine to the attachment dir (reuse Base helper). Skip
  #    when no confine_dir is given (unit calls) -- the lstat above
  #    already blocked the symlink-escape vector.
  real = base_helper.send(:canonical_path, original)
  return unsafe(original, "cannot resolve realpath") if real.nil?

  if confine_dir
    root = base_helper.send(:canonical_path, confine_dir)
    unless root && (real == root || real.start_with?("#{root}#{File::SEPARATOR}"))
      return unsafe(original, "resolves outside attachment dir")
    end
  end

  # 3. size cap before reading.
  size = File.size(real)
  if size > Policy.max_file_bytes
    return unsafe(real, "exceeds max_file_bytes (#{size} > #{Policy.max_file_bytes})")
  end

  # 4. classify (magic wins).
  kind, mime = classify_kind(real)
  Classification.new(path: real, kind: kind, mime: mime, size_bytes: size,
                     safe: true, reason: nil)
rescue SystemCallError => e
  unsafe(original, "io error: #{e.class}")
end
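The first three pipeline steps (lstat, realpath confine, size cap) can be tried in isolation with the standard library alone. The following is a minimal sketch, not the gem's code: SketchVerdict, SKETCH_MAX_BYTES, sketch_reject and sketch_safety_check are hypothetical names, File.realpath stands in for the protected canonical_path helper, and the 1024-byte cap is an assumed value in place of Policy.max_file_bytes. It assumes a POSIX filesystem where File.symlink is available.

```ruby
require "tmpdir"

# Hypothetical stand-ins: the gem's Classification and Policy are not shown here.
SketchVerdict = Struct.new(:path, :safe, :reason, keyword_init: true)
SKETCH_MAX_BYTES = 1024 # assumed cap, in place of Policy.max_file_bytes

def sketch_reject(path, reason)
  SketchVerdict.new(path: path, safe: false, reason: reason)
end

def sketch_safety_check(path, confine_dir:)
  # 1. lstat does not follow symlinks, so a link is reported as a link.
  lst = File.lstat(path)
  return sketch_reject(path, "not a regular file (#{lst.ftype})") unless lst.file?

  # 2. Resolve both paths, then require the file to sit under the root.
  real = File.realpath(path)
  root = File.realpath(confine_dir)
  unless real == root || real.start_with?("#{root}#{File::SEPARATOR}")
    return sketch_reject(path, "resolves outside attachment dir")
  end

  # 3. Size cap before any read.
  size = File.size(real)
  return sketch_reject(real, "exceeds cap (#{size} > #{SKETCH_MAX_BYTES})") if size > SKETCH_MAX_BYTES

  SketchVerdict.new(path: real, safe: true, reason: nil)
rescue SystemCallError => e
  # Any filesystem error (for example a missing file) fails closed.
  sketch_reject(path, "io error: #{e.class}")
end

SKETCH_VERDICTS = Dir.mktmpdir do |dir|
  inbox = File.join(dir, "inbox")
  Dir.mkdir(inbox)
  File.write(File.join(inbox, "note.txt"), "hello")
  File.symlink(File.join(inbox, "note.txt"), File.join(inbox, "link.txt"))
  File.write(File.join(inbox, "big.bin"), "x" * (SKETCH_MAX_BYTES + 1))
  File.write(File.join(dir, "outside.txt"), "hi")

  {
    ok:      sketch_safety_check(File.join(inbox, "note.txt"), confine_dir: inbox),
    link:    sketch_safety_check(File.join(inbox, "link.txt"), confine_dir: inbox),
    big:     sketch_safety_check(File.join(inbox, "big.bin"), confine_dir: inbox),
    outside: sketch_safety_check(File.join(dir, "outside.txt"), confine_dir: inbox),
    missing: sketch_safety_check(File.join(inbox, "nope.txt"), confine_dir: inbox)
  }
end

SKETCH_VERDICTS.each { |name, v| puts "#{name}: safe=#{v.safe} reason=#{v.reason.inspect}" }
```

Appending File::SEPARATOR to the root before the prefix test matters: without it, a sibling directory such as inbox-evil would pass a plain start_with? check against inbox.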
.classify_kind(real) ⇒ Object
# File 'lib/rubino/attachments/classify.rb', line 96

def classify_kind(real)
  basename = File.basename(real)
  mime = Marcel::MimeType.for(Pathname(real), name: basename).to_s

  # Extension-spoof gate (#158): an image verdict that the magic bytes
  # don't back up came from the extension, not the content. Re-resolve
  # from content alone (no name:); when that is generic too, the text/
  # binary sniff names the honest type -- so fake.png full of text is
  # rejected at the staging gate as text/plain, before any network call.
  if IMAGE_MIMES.include?(mime) && !image_signature?(real, mime)
    mime = Marcel::MimeType.for(Pathname(real)).to_s
    if mime.empty? || mime == "application/octet-stream"
      return base_helper.send(:binary?, real) ? [:binary, "application/octet-stream"] : [:text, "text/plain"]
    end
  end

  # Octet-stream / unknown: magic gave nothing -> fall back to a
  # text-vs-binary sniff (reuse ReadTool#binary?). A binary sniff stays
  # binary (stricter); a text sniff is text.
  if mime.empty? || mime == "application/octet-stream"
    sniff_kind = base_helper.send(:binary?, real) ? :binary : :text
    return [sniff_kind, mime.empty? ? "application/octet-stream" : mime]
  end

  # Magic recognised a type. If the extension claims image but magic says
  # otherwise (.png-named zip), magic wins and we keep the stricter,
  # non-image kind -- closes the MIME-spoof hole.
  [kind_for_mime(mime), mime]
end
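The branching in classify_kind is easier to follow as a pure decision function. The sketch below mirrors the control flow under stated assumptions: the Marcel lookups, the signature check and the binary? sniff are passed in as plain arguments rather than computed, and sketch_resolve, sketch_kind, SKETCH_IMAGE_MIMES and SKETCH_GENERIC are hypothetical names with trimmed-down MIME lists.

```ruby
SKETCH_IMAGE_MIMES = %w[image/png image/jpeg image/gif].freeze
SKETCH_GENERIC = ["", "application/octet-stream"].freeze

# Trimmed stand-in for kind_for_mime.
def sketch_kind(mime)
  return :image   if SKETCH_IMAGE_MIMES.include?(mime)
  return :archive if mime == "application/zip"
  return :text    if mime.start_with?("text/")

  :binary
end

# named_mime:   result of a name-aware lookup (Marcel called with name:)
# content_mime: result of a content-only lookup (Marcel called without name:)
# signature_ok: whether the leading bytes match named_mime's signature
# binary:       result of the text-vs-binary sniff
def sketch_resolve(named_mime:, content_mime:, signature_ok:, binary:)
  mime = named_mime

  # Extension-spoof gate: an unbacked image verdict is re-resolved from content.
  if SKETCH_IMAGE_MIMES.include?(mime) && !signature_ok
    mime = content_mime
    if SKETCH_GENERIC.include?(mime)
      return binary ? [:binary, "application/octet-stream"] : [:text, "text/plain"]
    end
  end

  # Nothing recognised: fall back to the text-vs-binary sniff.
  if SKETCH_GENERIC.include?(mime)
    return [binary ? :binary : :text, mime.empty? ? "application/octet-stream" : mime]
  end

  [sketch_kind(mime), mime]
end

# A text file renamed fake.png: the name says PNG, the bytes do not.
FAKE_PNG = sketch_resolve(named_mime: "image/png", content_mime: "application/octet-stream",
                          signature_ok: false, binary: false)
# A genuine PNG: signature backs the verdict.
REAL_PNG = sketch_resolve(named_mime: "image/png", content_mime: "image/png",
                          signature_ok: true, binary: true)
# A zip named .png: magic already reported zip, so the image gate never fires.
ZIP_AS_PNG = sketch_resolve(named_mime: "application/zip", content_mime: "application/zip",
                            signature_ok: false, binary: true)
# Unknown bytes with no useful extension.
UNKNOWN_BLOB = sketch_resolve(named_mime: "", content_mime: "",
                              signature_ok: false, binary: true)

p FAKE_PNG, REAL_PNG, ZIP_AS_PNG, UNKNOWN_BLOB
```

In every spoofed case the result is a non-image kind, which is the "stricter kind" behaviour the overview describes.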
.image_signature?(real, mime) ⇒ Boolean
True when the file’s leading bytes carry the signature mime claims. Unknown image MIMEs fail closed (no signature -> not verified).
# File 'lib/rubino/attachments/classify.rb', line 140

def image_signature?(real, mime)
  head = File.binread(real, 16).to_s.b
  return head.start_with?("RIFF") && head[8, 4] == "WEBP" if mime == "image/webp"

  Array(IMAGE_SIGNATURES[mime]).any? { |sig| head.start_with?(sig) }
end
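To see the signature check reject a renamed text file, the same logic can be exercised against temporary files. This is a sketch under assumptions: SKETCH_SIGNATURES is a two-entry subset of IMAGE_SIGNATURES, sketch_image_signature? is a hypothetical standalone copy of the method, and the file contents are placeholders that carry only the leading bytes, not valid images.

```ruby
require "tmpdir"

# Subset of the documented signature table, enough for the demonstration.
SKETCH_SIGNATURES = {
  "image/png" => ["\x89PNG\r\n\x1a\n".b],
  "image/gif" => ["GIF87a".b, "GIF89a".b]
}.freeze

def sketch_image_signature?(path, mime)
  # binread returns nil for an empty file, hence to_s.
  head = File.binread(path, 16).to_s.b
  # WebP: RIFF container, four size bytes, then the WEBP tag at offset 8.
  return head.start_with?("RIFF") && head[8, 4] == "WEBP" if mime == "image/webp"

  # Unknown MIME -> Array(nil) is empty -> any? is false (fail closed).
  Array(SKETCH_SIGNATURES[mime]).any? { |sig| head.start_with?(sig) }
end

SIGNATURE_CHECKS = Dir.mktmpdir do |dir|
  fake = File.join(dir, "fake.png")
  File.binwrite(fake, "just some text\n")

  real = File.join(dir, "real.png")
  File.binwrite(real, "\x89PNG\r\n\x1a\n".b + "placeholder")

  webp = File.join(dir, "pic.webp")
  File.binwrite(webp, "RIFF\x00\x00\x00\x00WEBPVP8 ".b)

  {
    fake_png:     sketch_image_signature?(fake, "image/png"),
    real_png:     sketch_image_signature?(real, "image/png"),
    webp:         sketch_image_signature?(webp, "image/webp"),
    unknown_mime: sketch_image_signature?(real, "image/heic")
  }
end

p SIGNATURE_CHECKS
```

The last case shows the fail-closed default: a MIME with no table entry is never verified, even when the file itself is a well-formed image of another type.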
.kind_for_mime(mime) ⇒ Object
Maps a recognised MIME to a kind. text/* and code MIMEs are text; SVG is text.
# File 'lib/rubino/attachments/classify.rb', line 127

def kind_for_mime(mime)
  return :image if IMAGE_MIMES.include?(mime)
  return :document if DOCUMENT_MIMES.include?(mime)
  return :archive if ARCHIVE_MIMES.include?(mime)
  return :text if mime.start_with?("text/")
  return :text if mime == "image/svg+xml"
  return :text if textual_application_mime?(mime)

  :binary
end
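The order of the guards decides the outcome when a MIME could match more than one rule. A small sketch with shortened lists makes this visible; demo_kind_for_mime and the DEMO_* constants are hypothetical, and the textual check is inlined in reduced form rather than calling textual_application_mime?.

```ruby
# Shortened stand-ins for the documented MIME lists.
DEMO_IMAGE_MIMES    = %w[image/png image/jpeg].freeze
DEMO_DOCUMENT_MIMES = %w[application/pdf application/rtf text/rtf].freeze
DEMO_ARCHIVE_MIMES  = %w[application/zip application/gzip].freeze

def demo_kind_for_mime(mime)
  return :image    if DEMO_IMAGE_MIMES.include?(mime)
  return :document if DEMO_DOCUMENT_MIMES.include?(mime)
  return :archive  if DEMO_ARCHIVE_MIMES.include?(mime)
  return :text     if mime.start_with?("text/")
  return :text     if mime == "image/svg+xml"
  return :text     if mime == "application/json" || mime.end_with?("+json", "+xml")

  :binary
end

DEMO_KINDS = %w[
  image/png text/rtf text/x-ruby image/svg+xml application/zip application/x-executable
].to_h { |mime| [mime, demo_kind_for_mime(mime)] }

DEMO_KINDS.each { |mime, kind| puts "#{mime} -> #{kind}" }
```

text/rtf comes back as :document because the document list is consulted before the text/ prefix rule, and image/svg+xml comes back as :text because it is absent from the image list.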
.textual_application_mime?(mime) ⇒ Boolean
JSON/XML/YAML/JS and friends arrive as application/* but are text.
# File 'lib/rubino/attachments/classify.rb', line 148

def textual_application_mime?(mime)
  mime == "application/json" ||
    mime == "application/xml" ||
    mime == "application/javascript" ||
    mime == "application/x-yaml" ||
    mime.end_with?("+json") ||
    mime.end_with?("+xml")
end
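The two end_with? clauses pick up structured-syntax suffixes, so types that are not listed by name still count as text. The sketch below copies the predicate under a hypothetical name, demo_textual_application_mime?, and runs it over a handful of sample MIME strings chosen for illustration.

```ruby
# Standalone copy of the documented predicate, renamed for the demonstration.
def demo_textual_application_mime?(mime)
  mime == "application/json" ||
    mime == "application/xml" ||
    mime == "application/javascript" ||
    mime == "application/x-yaml" ||
    mime.end_with?("+json") ||
    mime.end_with?("+xml")
end

DEMO_TEXTUAL = %w[
  application/json
  application/ld+json
  application/atom+xml
  application/x-yaml
  application/yaml
  application/pdf
].to_h { |mime| [mime, demo_textual_application_mime?(mime)] }

DEMO_TEXTUAL.each { |mime, textual| puts "#{mime}: #{textual}" }
```

With the list as documented, application/yaml does not match (only application/x-yaml is named), so such a file would fall through to :binary in kind_for_mime.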
.unsafe(path, reason) ⇒ Object
# File 'lib/rubino/attachments/classify.rb', line 165

def unsafe(path, reason)
  Classification.new(path: path.to_s, kind: :binary, mime: nil, size_bytes: nil,
                     safe: false, reason: reason)
end