Module: Parse::Embeddings::ImageFetch

Defined in:
lib/parse/embeddings/image_fetch.rb

Overview

SDK-side image download for the bytes-fetch embedding path (v5.5).

Where the URL-forwarding path (v5.1) hands a validated URL to the embedding provider and lets the provider issue its own fetch, the bytes path downloads the image itself, verifies the content, and forwards the bytes to the provider as a base64 data URI. The download goes through the SDK's existing SSRF-hardened primitive, File.safe_open_url (CIDR blocks, port allowlist, DNS-rebinding re-check, size caps, timeouts); no parallel SSRF mechanism is introduced here.
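
As a minimal sketch of the last step, the data URI could be assembled as shown below. The helper name to_data_uri is hypothetical and not part of the SDK; it only illustrates the data:<mime>;base64,<payload> shape.

```ruby
# Hypothetical helper (not part of the SDK): build a base64 data URI
# from verified image bytes and their sniffed MIME type.
def to_data_uri(bytes, mime)
  # pack("m0") produces strict base64 with no line breaks inserted.
  payload = [bytes.b].pack("m0")
  "data:#{mime};base64,#{payload}"
end

png_signature = "\x89PNG\r\n\x1A\n".b
puts to_data_uri(png_signature, "image/png")
```

Using pack("m0") keeps the sketch free of extra requires; Base64.strict_encode64 produces the same encoding where the base64 library is available.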

== Content verification (closes NEW-NET-4, "File MIME laundering")

The HTTP Content-Type header is never trusted. The MIME type is determined exclusively by magic-byte sniffing of the leading bytes (ImageFetch.sniff_mime), then:

  1. The sniffed type must be in allowed_image_types (default: JPEG / PNG / GIF / WebP).
  2. When the URL path carries a recognized image extension, the extension's implied type must AGREE with the sniffed type — a .png URL serving JPEG bytes (or an .html payload with an image extension) is refused as a laundering attempt.

Unknown magic bytes are always refused: there is no fallthrough to header- or extension-derived typing.
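
The extension cross-check in rule 2 can be sketched roughly as follows. The SDK's own extension lookup is not shown on this page, so extension_mime_sketch and extension_agrees? are hypothetical names, and the case-insensitive match and the handling of unparseable URLs are assumptions.

```ruby
require "uri"

# Local copy of the extension table documented on this page.
EXTENSION_MIME_SKETCH = {
  ".jpg"  => "image/jpeg",
  ".jpeg" => "image/jpeg",
  ".jpe"  => "image/jpeg",
  ".png"  => "image/png",
  ".gif"  => "image/gif",
  ".webp" => "image/webp",
}.freeze

# Hypothetical stand-in for the SDK's extension lookup: returns the MIME
# type implied by the URL path, or nil when the extension is not listed.
def extension_mime_sketch(url)
  return nil if url.nil?
  path = URI.parse(url).path.to_s          # the query string is ignored
  EXTENSION_MIME_SKETCH[File.extname(path).downcase]
rescue URI::InvalidURIError
  nil                                      # assumption: unparseable URL implies no type
end

# The sniffed type governs; a recognized extension may only veto it.
def extension_agrees?(url, sniffed_mime)
  implied = extension_mime_sketch(url)
  implied.nil? || implied == sniffed_mime
end

puts extension_agrees?("https://cdn.example.com/a/photo.png", "image/png")
puts extension_agrees?("https://cdn.example.com/a/photo.png", "image/jpeg")
puts extension_agrees?("https://cdn.example.com/download?id=7", "image/jpeg")
```

Note that the extension never selects the type: it can only contradict what the magic bytes already established.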

== EXIF stripping (default ON)

User-uploaded photos commonly carry GPS coordinates and device serial numbers in EXIF. Forwarding those to a third-party embedding provider is a PII egress, so metadata is stripped by default:

  • JPEG — APP1 segments (Exif and XMP) are removed.
  • PNG — eXIf chunks are removed.
  • WebP — EXIF / XMP RIFF chunks are removed and the VP8X EXIF/XMP flag bits cleared.
  • GIF — no EXIF container; pass-through.

Callers that need orientation metadata preserved opt out per call with exif_strip: false (the embed_image source: :bytes directive forwards its own exif_strip: declaration).
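
To make the PNG case concrete, here is a minimal sketch of a chunk walker that drops eXIf chunks. It is not the SDK's strip_png_exif (whose body is not shown on this page); the names strip_png_exif_sketch and png_chunk are hypothetical, and the sketch does not validate CRCs.

```ruby
require "zlib"

PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Test helper: serialize one PNG chunk (length, type, data, CRC-32).
def png_chunk(type, data)
  body = type.b + data.b
  [data.bytesize].pack("N") + body + [Zlib.crc32(body)].pack("N")
end

# Hypothetical walker: copies every chunk except eXIf. Returns the
# original object when the container cannot be parsed, and a fresh
# copy whenever the walk completes.
def strip_png_exif_sketch(bytes)
  data = bytes.b
  return bytes unless data.start_with?(PNG_SIGNATURE)
  out = PNG_SIGNATURE.dup
  pos = PNG_SIGNATURE.bytesize
  while pos < data.bytesize
    header = data.byteslice(pos, 8)
    return bytes if header.bytesize < 8          # truncated chunk header
    length, type = header.unpack("Na4")
    total = 12 + length                          # length + type + data + CRC
    return bytes if pos + total > data.bytesize  # truncated chunk body
    out << data.byteslice(pos, total) unless type == "eXIf"
    pos += total
  end
  out
end

ihdr   = png_chunk("IHDR", [1, 1, 8, 2, 0, 0, 0].pack("NNCCCCC"))
sample = PNG_SIGNATURE + ihdr + png_chunk("eXIf", "fake-exif") + png_chunk("IEND", "")
clean  = strip_png_exif_sketch(sample)
puts sample.bytesize - clean.bytesize  # size of the removed eXIf chunk
```

Because each chunk carries its own length and CRC, removing a whole chunk leaves the remaining chunks valid without any recomputation.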

Defined Under Namespace

Classes: FetchedImage, InvalidImageType

Constant Summary

DEFAULT_ALLOWED_IMAGE_TYPES =

MIME types the bytes path accepts by default. Operators extend via Parse::Embeddings.allowed_image_types=. SVG is deliberately absent — it is active content (script-capable), not a bitmap.

%w[image/jpeg image/png image/gif image/webp].freeze
EXTENSION_MIME =

URL-path extensions whose implied MIME type is cross-checked against the sniffed type. Extensions not listed here are ignored (the magic bytes alone govern).

{
  ".jpg"  => "image/jpeg",
  ".jpeg" => "image/jpeg",
  ".jpe"  => "image/jpeg",
  ".png"  => "image/png",
  ".gif"  => "image/gif",
  ".webp" => "image/webp",
}.freeze

Class Method Summary

Class Method Details

.fetch!(url, allow_insecure: false, exif_strip: true, max_bytes: nil) ⇒ FetchedImage

Download, verify, and (by default) EXIF-strip an image.

The URL is validated through Parse::Embeddings.validate_image_url! in :fetch mode — host allowlist (Parse::Embeddings.allowed_image_hosts, deny-all when empty), obfuscated-IP-literal screen, port allowlist, CIDR check — but WITHOUT the Parse::Embeddings.trust_provider_url_fetch= sentinel, because no URL is forwarded to a third party: the SDK itself performs the fetch through File.safe_open_url.

Parameters:

  • url (String)

    image URL (host must be allowlisted).

  • allow_insecure (Boolean) (defaults to: false)

    permit http:// (local dev only).

  • exif_strip (Boolean) (defaults to: true)

    strip EXIF/XMP metadata (default true).

  • max_bytes (Integer, nil) (defaults to: nil)

    additional size cap below File.max_remote_size; nil applies only the global cap.

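Returns:

  • (FetchedImage)

    the downloaded bytes (metadata-stripped unless exif_strip is false), their sniffed MIME type, and the canonical URL.

Raises:

  • (InvalidImageType)

    when content verification fails (see verify!).

  • (ArgumentError)

    when the downloaded body exceeds max_bytes.
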

# File 'lib/parse/embeddings/image_fetch.rb', line 140

def fetch!(url, allow_insecure: false, exif_strip: true, max_bytes: nil)
  canonical = Parse::Embeddings.validate_image_url!(
    url, allow_insecure: allow_insecure, mode: :fetch,
  )
  io = Parse::File.safe_open_url(canonical)
  begin
    bytes = io.read
  ensure
    io.close if io.respond_to?(:close)
  end
  bytes = bytes.to_s.dup.force_encoding(Encoding::BINARY)

  if max_bytes && bytes.bytesize > Integer(max_bytes)
    raise ArgumentError,
          "Parse::Embeddings::ImageFetch: image exceeds max_bytes " \
          "(#{bytes.bytesize} > #{Integer(max_bytes)})."
  end

  mime = verify!(bytes, url: canonical)
  bytes = strip_metadata(bytes, mime) if exif_strip
  FetchedImage.new(bytes: bytes, mime_type: mime, url: canonical)
end

.sniff_mime(bytes) ⇒ String?

Determine an image's MIME type from its leading magic bytes. Only the first 16 bytes are inspected, which is sufficient for every supported format; input shorter than 12 bytes is never recognized. Returns nil for anything unrecognized. Callers must treat nil as a refusal and never fall back to header- or extension-derived typing.

Parameters:

  • bytes (String)

    raw image bytes (at least the first 12; only the first 16 are read).

Returns:

  • (String, nil)

    sniffed MIME type, or nil when unknown.



# File 'lib/parse/embeddings/image_fetch.rb', line 108

def sniff_mime(bytes)
  return nil unless bytes.is_a?(String) && bytes.bytesize >= 12
  b = bytes.byteslice(0, 16).force_encoding(Encoding::BINARY)
  return "image/jpeg" if b.start_with?("\xFF\xD8\xFF".b)
  return "image/png"  if b.start_with?("\x89PNG\r\n\x1A\n".b)
  return "image/gif"  if b.start_with?("GIF87a".b) || b.start_with?("GIF89a".b)
  if b.start_with?("RIFF".b) && b.byteslice(8, 4) == "WEBP".b
    return "image/webp"
  end
  nil
end
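
The 12-byte guard follows from the WebP layout: "RIFF", a 4-byte little-endian size, then "WEBP" at offset 8. The sketch below repeats the logic of the listing as a standalone method (sniff_mime_sketch, a hypothetical name) so it runs without the SDK, and feeds it a few hand-built headers.

```ruby
# Standalone copy of the sniffing logic above, for illustration only.
def sniff_mime_sketch(bytes)
  return nil unless bytes.is_a?(String) && bytes.bytesize >= 12
  b = bytes.byteslice(0, 16).b
  return "image/jpeg" if b.start_with?("\xFF\xD8\xFF".b)
  return "image/png"  if b.start_with?("\x89PNG\r\n\x1A\n".b)
  return "image/gif"  if b.start_with?("GIF87a".b, "GIF89a".b)
  return "image/webp" if b.start_with?("RIFF".b) && b.byteslice(8, 4) == "WEBP".b
  nil
end

# "RIFF" + little-endian size + "WEBP" + first chunk tag = 16 bytes.
webp_header = "RIFF".b + [1024].pack("V") + "WEBP".b + "VP8 ".b
html_body   = "<!DOCTYPE html><html>".b
short_jpeg  = "\xFF\xD8\xFF\xE0".b  # correct magic, but under 12 bytes

p sniff_mime_sketch(webp_header)  # "image/webp"
p sniff_mime_sketch(html_body)    # nil
p sniff_mime_sketch(short_jpeg)   # nil
```

The last case shows that a correct JPEG prefix alone is not enough: anything shorter than 12 bytes is refused before any signature is compared.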

.strip_metadata(bytes, mime) ⇒ String

Strip embedded metadata for the formats that carry it. Unknown or metadata-free formats pass through unchanged. Never raises on a malformed container: it falls back to the original bytes (the provider will reject genuinely corrupt input). The fallback is not silent, however. A container the walker could not parse may still carry EXIF/XMP to a third-party provider, so a warning is emitted whenever the PII-egress protection did not run.

Parameters:

  • bytes (String)

    verified image bytes.

  • mime (String)

    sniffed MIME type.

Returns:

  • (String)

    bytes with metadata removed, or the original bytes when stripping did not apply or could not run.



# File 'lib/parse/embeddings/image_fetch.rb', line 228

def strip_metadata(bytes, mime)
  stripped =
    case mime
    when "image/jpeg" then strip_jpeg_app1(bytes)
    when "image/png"  then strip_png_exif(bytes)
    when "image/webp" then (bytes)
    else return bytes
    end
  # The format walkers return the *original object* when they bail
  # on a structure they cannot parse; a successful walk always
  # returns a fresh copy (even when nothing was removed).
  if stripped.equal?(bytes)
    warn "[Parse::Embeddings::ImageFetch] could not parse the #{mime} " \
         "container for metadata stripping; passing bytes through with " \
         "embedded EXIF/XMP (if any) intact."
  end
  stripped
rescue StandardError
  warn "[Parse::Embeddings::ImageFetch] metadata stripping raised while " \
       "parsing the #{mime} container; passing bytes through with " \
       "embedded EXIF/XMP (if any) intact."
  bytes
end
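
The equal? check above relies on object identity rather than content equality. A minimal sketch of that convention, using a hypothetical walker (walk_sketch) that treats any input starting with "OK" as parseable:

```ruby
# Hypothetical walker following the convention described above:
# bail by returning the same object, succeed by returning a copy.
def walk_sketch(bytes)
  return bytes unless bytes.start_with?("OK")  # could not parse: same object
  bytes.dup                                    # parsed: always a fresh copy
end

parsed   = "OK-payload"
unparsed = "??-payload"

p walk_sketch(parsed).equal?(parsed)      # false: a copy came back
p walk_sketch(parsed) == parsed           # true: content is unchanged
p walk_sketch(unparsed).equal?(unparsed)  # true: the walker bailed
```

Identity is what lets the caller tell "parsed, nothing to remove" apart from "could not parse", since both cases yield byte-identical output.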

.verify!(bytes, url: nil) ⇒ String

Verify raw bytes: sniff the magic, check the allowlist, and cross-check the URL extension. Public so the upload-side validation path can reuse the same check.

Parameters:

  • bytes (String)

    raw image bytes.

  • url (String, nil) (defaults to: nil)

    source URL for the extension cross-check (nil skips it — e.g. caller-supplied byte payloads).

Returns:

  • (String)

    the sniffed MIME type.

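Raises:

  • (InvalidImageType)

    when the bytes are empty, match no supported format, are not in the allowlist, or disagree with the URL extension.
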

# File 'lib/parse/embeddings/image_fetch.rb', line 172

def verify!(bytes, url: nil)
  if bytes.nil? || bytes.empty?
    raise InvalidImageType.new(:empty,
      "Parse::Embeddings::ImageFetch: downloaded body is empty.")
  end
  mime = sniff_mime(bytes)
  if mime.nil?
    raise InvalidImageType.new(:unknown_magic,
      "Parse::Embeddings::ImageFetch: leading bytes match no supported image " \
      "format (JPEG/PNG/GIF/WebP). The Content-Type header is not consulted — " \
      "unrecognized content is refused outright.")
  end
  allowed = Parse::Embeddings.allowed_image_types
  unless allowed.include?(mime)
    raise InvalidImageType.new(:type_not_allowed,
      "Parse::Embeddings::ImageFetch: sniffed type #{mime.inspect} is not in " \
      "Parse::Embeddings.allowed_image_types (#{allowed.inspect}).")
  end
  ext_mime = extension_mime(url)
  if ext_mime && ext_mime != mime
    raise InvalidImageType.new(:extension_mismatch,
      "Parse::Embeddings::ImageFetch: URL extension implies #{ext_mime.inspect} " \
      "but the magic bytes are #{mime.inspect} — refusing MIME-laundered content.")
  end
  mime
end
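
Each refusal above is raised with a symbolic code (:empty, :unknown_magic, :type_not_allowed, :extension_mismatch) as the first constructor argument. How InvalidImageType exposes that code is not shown on this page, so the sketch below uses a hypothetical stand-in class with an assumed reason reader and an assumed StandardError parent, purely to illustrate branching on the code.

```ruby
# Hypothetical stand-in for InvalidImageType. The `reason` reader and
# the StandardError superclass are assumptions, not the SDK's API.
class InvalidImageTypeSketch < StandardError
  attr_reader :reason

  def initialize(reason, message)
    super(message)
    @reason = reason
  end
end

# Map a verification failure to a caller-facing outcome.
def classify_failure
  yield
  :ok
rescue InvalidImageTypeSketch => e
  case e.reason
  when :empty, :unknown_magic then :not_an_image
  when :type_not_allowed      then :blocked_by_policy
  when :extension_mismatch    then :possible_laundering
  else :unknown
  end
end

p classify_failure { raise InvalidImageTypeSketch.new(:extension_mismatch, "mismatch") }
p classify_failure { "image/png" }
```

Branching on a symbol rather than on the message text keeps callers stable if the wording of the messages changes.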