Module: Parse::Embeddings::ImageFetch
Defined in: lib/parse/embeddings/image_fetch.rb
Overview
SDK-side image download for the bytes-fetch embedding path (v5.5).
Where the URL-forwarding path (v5.1) hands a validated URL to the embedding provider and lets the provider issue its own fetch, the bytes path downloads the image through the SDK's own SSRF-hardened primitive, File.safe_open_url (CIDR blocks, port allowlist, DNS-rebinding re-check, size caps, timeouts). NO parallel SSRF mechanism is introduced here. The SDK then verifies the content and forwards the bytes to the provider as a base64 data URI.
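The final step of the bytes path, wrapping verified bytes in a base64 data URI, can be sketched as follows. This is a minimal illustration only: image_data_uri is a hypothetical helper name, not part of the SDK, and the exact URI the SDK hands to a provider may differ.

```ruby
# Hypothetical helper (not an SDK method): wrap already-verified image bytes
# in a data URI of the form data:<mime>;base64,<payload>.
def image_data_uri(bytes, mime)
  # pack("m0") is strict base64 with no line breaks, which a data URI needs.
  payload = [bytes.b].pack("m0")
  "data:#{mime};base64,#{payload}"
end

puts image_data_uri("GIF89a".b, "image/gif")
# prints: data:image/gif;base64,R0lGODlh
```

The MIME type placed in the URI should be the sniffed type, never the type claimed by the HTTP response.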
== Content verification (closes NEW-NET-4, "File MIME laundering")
The HTTP Content-Type header is never trusted. The MIME type
is determined exclusively by magic-byte sniffing of the leading
bytes (ImageFetch.sniff_mime), then:
- The sniffed type must be in allowed_image_types (default: JPEG / PNG / GIF / WebP).
- When the URL path carries a recognized image extension, the
extension's implied type must AGREE with the sniffed type:
a .png URL serving JPEG bytes (or an .html payload with an image extension) is refused as a laundering attempt.
Unknown magic bytes are always refused: there is no fallthrough to header- or extension-derived typing.
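The three rules above can be illustrated with a standalone sketch. SNIFFERS, EXT_MIME, and check_image are hypothetical names chosen for this example and do not exist in the SDK; the reason symbols mirror the ones raised by verify! later on this page.

```ruby
# Standalone illustration; none of these names are SDK API.
SNIFFERS = {
  "image/jpeg" => ->(b) { b.start_with?("\xFF\xD8\xFF".b) },
  "image/png"  => ->(b) { b.start_with?("\x89PNG\r\n\x1A\n".b) },
  "image/gif"  => ->(b) { b.start_with?("GIF87a".b, "GIF89a".b) },
  "image/webp" => ->(b) { b.start_with?("RIFF".b) && b.byteslice(8, 4) == "WEBP".b },
}.freeze

EXT_MIME = {
  ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg", ".png" => "image/png",
  ".gif" => "image/gif", ".webp" => "image/webp",
}.freeze

# Returns [:ok, mime] or [:refused, reason]. Only the bytes decide the type;
# the URL extension can veto a result but never supply one.
def check_image(bytes, url_path, allowed: SNIFFERS.keys)
  head = bytes.byteslice(0, 16).to_s.b
  mime, = SNIFFERS.find { |_, matches| matches.call(head) }
  return [:refused, :unknown_magic] if mime.nil?
  return [:refused, :type_not_allowed] unless allowed.include?(mime)

  implied = EXT_MIME[File.extname(url_path).downcase]
  return [:refused, :extension_mismatch] if implied && implied != mime

  [:ok, mime]
end

png_bytes  = "\x89PNG\r\n\x1A\n".b + "\x00" * 8
jpeg_bytes = "\xFF\xD8\xFF\xE0".b + "\x00" * 12

p check_image(png_bytes,  "/photos/cat.png")   # [:ok, "image/png"]
p check_image(jpeg_bytes, "/photos/cat.png")   # [:refused, :extension_mismatch]
p check_image("<html><body>hi</body></html>".b, "/photos/cat.png") # [:refused, :unknown_magic]
```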
== EXIF stripping (default ON)
User-uploaded photos commonly carry GPS coordinates and device serial numbers in EXIF. Forwarding those to a third-party embedding provider is a PII egress, so metadata is stripped by default:
- JPEG: APP1 segments (Exif and XMP) are removed.
- PNG: eXIf chunks are removed.
- WebP: EXIF / XMP RIFF chunks are removed and the VP8X EXIF/XMP flag bits cleared.
- GIF: no EXIF container; pass-through.
Callers that need orientation metadata preserved opt out per call
with exif_strip: false (the embed_image source: :bytes
directive forwards its own exif_strip: declaration).
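As an illustration of the JPEG case, the sketch below walks the segment list that precedes the scan data and drops every APP1 segment. It is an assumption-laden approximation, not the SDK's strip_jpeg_app1: drop_jpeg_app1 is a hypothetical name, and the sketch bails out (returning the original object) on anything it does not understand.

```ruby
# Hypothetical sketch of APP1 removal; not the SDK implementation.
def drop_jpeg_app1(bytes)
  data = bytes.b
  return bytes unless data.start_with?("\xFF\xD8".b) # must begin with SOI

  out = data.byteslice(0, 2) # keep the SOI marker
  pos = 2
  while pos + 4 <= data.bytesize
    return bytes unless data.getbyte(pos) == 0xFF # not at a marker: bail
    marker = data.getbyte(pos + 1)

    # SOS (0xDA): entropy-coded data follows, so copy the remainder verbatim.
    return out << data.byteslice(pos, data.bytesize - pos) if marker == 0xDA

    # Segment length is big-endian and counts its own two bytes.
    length = data.byteslice(pos + 2, 2).unpack1("n")
    return bytes if length < 2 || pos + 2 + length > data.bytesize

    out << data.byteslice(pos, 2 + length) unless marker == 0xE1 # skip APP1
    pos += 2 + length
  end
  bytes # never reached SOS: treat the container as unparseable
end

# Synthetic JPEG: SOI, APP0 (JFIF), APP1 (Exif), SOS + two data bytes + EOI.
soi  = "\xFF\xD8".b
app0 = "\xFF\xE0".b + [7].pack("n") + "JFIF\x00".b
app1 = "\xFF\xE1".b + [8].pack("n") + "Exif\x00\x00".b
sos  = "\xFF\xDA".b + [3].pack("n") + "\x00".b + "\x12\x34\xFF\xD9".b

stripped = drop_jpeg_app1(soi + app0 + app1 + sos)
puts stripped == soi + app0 + sos # true
```

Because both Exif and XMP travel in APP1, dropping the whole segment type covers both without inspecting the payload.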
Defined Under Namespace
Classes: FetchedImage, InvalidImageType
Constant Summary
- DEFAULT_ALLOWED_IMAGE_TYPES =
MIME types the bytes path accepts by default. Operators extend via Parse::Embeddings.allowed_image_types=. SVG is deliberately absent — it is active content (script-capable), not a bitmap.
%w[image/jpeg image/png image/gif image/webp].freeze
- EXTENSION_MIME =
URL-path extensions whose implied MIME type is cross-checked against the sniffed type. Extensions not listed here are ignored (the magic bytes alone govern).
{ ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg", ".jpe" => "image/jpeg", ".png" => "image/png", ".gif" => "image/gif", ".webp" => "image/webp", }.freeze
Class Method Summary
- .fetch!(url, allow_insecure: false, exif_strip: true, max_bytes: nil) ⇒ FetchedImage
  Download, verify, and (by default) EXIF-strip an image.
- .sniff_mime(bytes) ⇒ String?
  Determine an image's MIME type from its leading magic bytes.
- .strip_metadata(bytes, mime) ⇒ String
  Strip embedded metadata for the formats that carry it.
- .verify!(bytes, url: nil) ⇒ String
  Verify raw bytes: sniff the magic, check the allowlist, and cross-check the URL extension.
Class Method Details
.fetch!(url, allow_insecure: false, exif_strip: true, max_bytes: nil) ⇒ FetchedImage
Download, verify, and (by default) EXIF-strip an image.
The URL is validated through
Parse::Embeddings.validate_image_url! in :fetch mode — host
allowlist (Parse::Embeddings.allowed_image_hosts, deny-all when
empty), obfuscated-IP-literal screen, port allowlist, CIDR check
— but WITHOUT the Parse::Embeddings.trust_provider_url_fetch=
sentinel, because no URL is forwarded to a third party: the SDK
itself performs the fetch through File.safe_open_url.
# File 'lib/parse/embeddings/image_fetch.rb', line 140

    def fetch!(url, allow_insecure: false, exif_strip: true, max_bytes: nil)
      canonical = Parse::Embeddings.validate_image_url!(
        url, allow_insecure: allow_insecure, mode: :fetch,
      )
      io = Parse::File.safe_open_url(canonical)
      begin
        bytes = io.read
      ensure
        io.close if io.respond_to?(:close)
      end
      bytes = bytes.to_s.dup.force_encoding(Encoding::BINARY)
      if max_bytes && bytes.bytesize > Integer(max_bytes)
        raise ArgumentError,
              "Parse::Embeddings::ImageFetch: image exceeds max_bytes " \
              "(#{bytes.bytesize} > #{Integer(max_bytes)})."
      end
      mime = verify!(bytes, url: canonical)
      bytes = strip_metadata(bytes, mime) if exif_strip
      FetchedImage.new(bytes: bytes, mime_type: mime, url: canonical)
    end
.sniff_mime(bytes) ⇒ String?
Determine an image's MIME type from its leading magic bytes. The first ~16 bytes are sufficient for every supported format; inputs shorter than 12 bytes are never recognized. Returns nil for anything unrecognized: callers must treat nil as a refusal and never fall back to header/extension typing.
# File 'lib/parse/embeddings/image_fetch.rb', line 108

    def sniff_mime(bytes)
      return nil unless bytes.is_a?(String) && bytes.bytesize >= 12
      b = bytes.byteslice(0, 16).force_encoding(Encoding::BINARY)
      return "image/jpeg" if b.start_with?("\xFF\xD8\xFF".b)
      return "image/png" if b.start_with?("\x89PNG\r\n\x1A\n".b)
      return "image/gif" if b.start_with?("GIF87a".b) || b.start_with?("GIF89a".b)
      if b.start_with?("RIFF".b) && b.byteslice(8, 4) == "WEBP".b
        return "image/webp"
      end
      nil
    end
.strip_metadata(bytes, mime) ⇒ String
Strip embedded metadata for the formats that carry it. Unknown or metadata-free formats pass through unchanged. Never raises on a malformed container: it falls back to the original bytes (the provider will reject genuinely corrupt input). The fallback is not silent, however. A container the walker could not parse may still carry EXIF/XMP to a third-party provider, so a warning is emitted whenever the PII-egress protection did not run.
# File 'lib/parse/embeddings/image_fetch.rb', line 228

    def strip_metadata(bytes, mime)
      stripped = case mime
                 when "image/jpeg" then strip_jpeg_app1(bytes)
                 when "image/png" then strip_png_exif(bytes)
                 when "image/webp" then (bytes) # WebP walker; its name is missing from this listing
                 else return bytes
                 end
      # The format walkers return the *original object* when they bail
      # on a structure they cannot parse; a successful walk always
      # returns a fresh copy (even when nothing was removed).
      if stripped.equal?(bytes)
        warn "[Parse::Embeddings::ImageFetch] could not parse the #{mime} " \
             "container for metadata stripping; passing bytes through with " \
             "embedded EXIF/XMP (if any) intact."
      end
      stripped
    rescue StandardError
      warn "[Parse::Embeddings::ImageFetch] metadata stripping raised while " \
           "parsing the #{mime} container; passing bytes through with " \
           "embedded EXIF/XMP (if any) intact."
      bytes
    end
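The walker contract that strip_metadata relies on (the original object comes back on a bail, a fresh copy on success) can be illustrated with a PNG chunk walker. This is a sketch under stated assumptions rather than the SDK's strip_png_exif: drop_png_exif and png_chunk are hypothetical names, and the sketch does not validate CRCs.

```ruby
require "zlib"

PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Hypothetical walker: copy every chunk except eXIf. Returns the ORIGINAL
# object when the container cannot be parsed, a fresh String otherwise.
def drop_png_exif(bytes)
  data = bytes.b
  return bytes unless data.start_with?(PNG_SIGNATURE)

  out = data.byteslice(0, 8)
  pos = 8
  while pos < data.bytesize
    return bytes if pos + 12 > data.bytesize # no room for length/type/CRC
    length = data.byteslice(pos, 4).unpack1("N")
    type   = data.byteslice(pos + 4, 4)
    total  = 12 + length # 4 length + 4 type + payload + 4 CRC
    return bytes if pos + total > data.bytesize

    out << data.byteslice(pos, total) unless type == "eXIf".b
    pos += total
  end
  out
end

# Test fixture helper: length, type, payload, CRC-32 over type + payload.
def png_chunk(type, payload)
  body = type.b + payload.b
  [payload.bytesize].pack("N") + body + [Zlib.crc32(body)].pack("N")
end

png = PNG_SIGNATURE + png_chunk("IHDR", "\x00" * 13) +
      png_chunk("eXIf", "fake-exif") + png_chunk("IEND", "")

clean = drop_png_exif(png)
puts clean.include?("eXIf")          # false
puts clean.equal?(png)               # false (fresh copy on success)

junk = "garbage"
puts drop_png_exif(junk).equal?(junk) # true (original object on a bail)
```

Returning the same object on failure lets the caller detect a skipped strip with a cheap identity check (equal?) instead of comparing byte contents.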
.verify!(bytes, url: nil) ⇒ String
Verify raw bytes: sniff the magic, check the allowlist, and cross-check the URL extension. Public so the upload-side validation path can reuse the same check.
# File 'lib/parse/embeddings/image_fetch.rb', line 172

    def verify!(bytes, url: nil)
      if bytes.nil? || bytes.empty?
        raise InvalidImageType.new(:empty,
          "Parse::Embeddings::ImageFetch: downloaded body is empty.")
      end
      mime = sniff_mime(bytes)
      if mime.nil?
        raise InvalidImageType.new(:unknown_magic,
          "Parse::Embeddings::ImageFetch: leading bytes match no supported image " \
          "format (JPEG/PNG/GIF/WebP). The Content-Type header is not consulted — " \
          "unrecognized content is refused outright.")
      end
      allowed = Parse::Embeddings.allowed_image_types
      unless allowed.include?(mime)
        raise InvalidImageType.new(:type_not_allowed,
          "Parse::Embeddings::ImageFetch: sniffed type #{mime.inspect} is not in " \
          "Parse::Embeddings.allowed_image_types (#{allowed.inspect}).")
      end
      ext_mime = extension_mime(url)
      if ext_mime && ext_mime != mime
        raise InvalidImageType.new(:extension_mismatch,
          "Parse::Embeddings::ImageFetch: URL extension implies #{ext_mime.inspect} " \
          "but the magic bytes are #{mime.inspect} — refusing MIME-laundered content.")
      end
      mime
    end