Module: Parse::Embeddings

Defined in:
lib/parse/embeddings.rb,
lib/parse/embeddings/jina.rb,
lib/parse/embeddings/qwen.rb,
lib/parse/embeddings/cohere.rb,
lib/parse/embeddings/openai.rb,
lib/parse/embeddings/voyage.rb,
lib/parse/embeddings/fixture.rb,
lib/parse/embeddings/provider.rb,
lib/parse/embeddings/local_http.rb

Overview

Pluggable embedding-provider registry for :vector properties and the upcoming find_similar(text:) / Parse::Retrieval.retrieve surfaces.

Text-only providers shipped:

  • Fixture — deterministic, zero-network. Auto-registered as :fixture so tests can call Parse::Embeddings.provider(:fixture) with no setup.
  • OpenAI — text-embedding-3-small, text-embedding-3-large, and ada-002.
  • Cohere — embed-english-v3.0, embed-multilingual-v3.0, and the *-light-v3.0 variants. Distinguishes :search_query / :search_document at the wire.
  • Voyage — voyage-4 family (incl. open-weight voyage-4-nano), voyage-3 family, voyage-code-3, voyage-finance-2, voyage-law-2. Distinguishes input types.
  • Jina — jina-embeddings-v3/v4/v5 (text + omni-text mode), jina-code-embeddings-0.5b and jina-code-embeddings-1.5b. Matryoshka via dimensions:.
  • Qwen — qwen3-embedding-0.6b, qwen3-embedding-4b, and qwen3-embedding-8b via Alibaba Cloud DashScope compatible-mode. All Matryoshka. The same checkpoints are open-weight on Hugging Face (Apache 2.0) for self-hosting behind LocalHTTP.
  • LocalHTTP — generic OpenAI-compatible client for Ollama, LM Studio, vLLM, etc. Configure-time SSRF gate; requires allow_private_endpoint: true to talk to localhost.

Image / multimodal embedding (embed_image) is a forthcoming feature — the Provider#embed_image hook is defined but only the multimodal-capable providers will override it.

Registration

Two equivalent forms. Embeddings.register is the canonical one-liner and what every example in the gem uses; Embeddings.configure is the block form for registering several providers at once or for Rails-style initializers. Both end up at the same ProviderRegistry, so pick whichever reads better in context.

Examples:

canonical: register one provider

Parse::Embeddings.register(:openai,
  Parse::Embeddings::OpenAI.new(api_key: ENV.fetch("OPENAI_API_KEY")))

block form for several providers

Parse::Embeddings.configure do |c|
  c.providers[:openai] = Parse::Embeddings::OpenAI.new(api_key: ENV.fetch("OPENAI_API_KEY"))
  c.providers[:openai_large] = Parse::Embeddings::OpenAI.new(
    api_key: ENV.fetch("OPENAI_API_KEY"), model: "text-embedding-3-large")
end

lookup

Parse::Embeddings.provider(:openai)   # => the registered instance
Parse::Embeddings.provider(:fixture)  # => default Fixture, zero-config

Defined Under Namespace

Classes: Cohere, Configuration, ConfirmationRequired, Error, Fixture, InvalidImageURL, InvalidResponseError, Jina, LocalHTTP, OpenAI, Provider, ProviderNotRegistered, ProviderRegistry, Qwen, Voyage

Constant Summary

CONFIG_MUTEX =

Monitor guarding configuration memoization and register writes. MRI's GVL would normally absorb the race on @configuration ||= ..., but JRuby and TruffleRuby can produce two Configuration instances when two threads race at boot (and lose any provider written to the loser). A Monitor (rather than a Mutex) is used so that register — which holds the lock and then calls configuration — can re-enter without deadlocking on the first-touch allocation path.

Monitor.new
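
The re-entrancy argument above can be illustrated with a small standalone sketch. TinyRegistry and nested_mutex_error are hypothetical names used only for illustration; they are not part of the gem. The sketch assumes nothing beyond Ruby's standard monitor library.

```ruby
require "monitor"

# Hypothetical stand-in for the registry; not the gem's actual code.
module TinyRegistry
  LOCK = Monitor.new

  def self.configuration
    # Fast path: plain ivar read. Slow path: allocate under the lock.
    @configuration || LOCK.synchronize { @configuration ||= {} }
  end

  def self.register(name, value)
    # Holds LOCK, then calls configuration, which may take LOCK again
    # on first touch. A Monitor lets the same thread re-enter.
    LOCK.synchronize { configuration[name.to_sym] = value }
  end
end

# The same nesting with a Mutex raises ThreadError instead of re-entering.
def nested_mutex_error
  mutex = Mutex.new
  mutex.synchronize { mutex.synchronize { :unreachable } }
  nil
rescue ThreadError => e
  e
end

TinyRegistry.register(:demo, "provider")
```

With a Mutex in place of the Monitor, the first register call on an unallocated configuration would hit the nested lock and raise.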
TRUST_PROVIDER_URL_FETCH_SENTINEL =

The sentinel value that trust_provider_url_fetch= requires. An exact match unlocks validate_image_url! for URL forwarding to embedding providers. Any other value is refused with ConfirmationRequired. The constant is frozen so callers cannot mutate it in-place.

"PROVIDER_EGRESS_VERIFIED"

Class Method Details

.allowed_image_hosts ⇒ Array<String>

Returns currently-configured image-host allowlist (frozen).

Returns:

  • (Array<String>)

    currently-configured image-host allowlist (frozen).



# File 'lib/parse/embeddings.rb', line 297

def allowed_image_hosts
  @allowed_image_hosts ||= [].freeze
end

.allowed_image_hosts=(hosts) ⇒ Array<String>

Configure the host allowlist that validate_image_url! checks an incoming image URL's host against. Entries that begin with . match suffixes (.cdn.example.com matches images.cdn.example.com and cdn.example.com itself); entries without a leading . are exact-match.

Empty allowlist means "deny all". This is the opposite default from File.allowed_remote_hosts (where empty means "any public host"). The asymmetry is deliberate: image URLs that reach validate_image_url! typically originate from attacker-controlled inputs (chat queries, agent tool args, user-submitted document fields), so opening the surface requires an explicit operator declaration of which CDNs are trusted.

Examples:

Trust two CDN hostnames

Parse::Embeddings.allowed_image_hosts = [
  "images.example-cdn.com",
  ".cloudfront.net",   # any *.cloudfront.net host
]
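
The matching rules can be sketched in isolation as follows. host_permitted? is a hypothetical helper written for illustration only; it mirrors the suffix and exact-match behavior described above, assuming case-insensitive comparison.

```ruby
# Hypothetical helper; illustrates the allowlist matching rules only.
def host_permitted?(host, allowlist)
  allowlist.any? do |entry|
    if entry.start_with?(".")
      # Suffix entry: matches any subdomain, or the bare domain itself.
      host.downcase.end_with?(entry.downcase) || host.casecmp(entry[1..]).zero?
    else
      # Plain entry: exact (case-insensitive) match only.
      host.casecmp(entry).zero?
    end
  end
end

allowlist = ["images.example-cdn.com", ".cloudfront.net"]
host_permitted?("d111.cloudfront.net", allowlist)        # suffix match
host_permitted?("cloudfront.net", allowlist)             # bare domain of a suffix entry
host_permitted?("evilcloudfront.net", allowlist)         # refused: no dot boundary
host_permitted?("a.images.example-cdn.com", allowlist)   # refused: exact entries have no subdomains
```

Because the suffix entry keeps its leading dot during the end_with? comparison, a lookalike host such as evilcloudfront.net does not match .cloudfront.net.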

Parameters:

  • hosts (Array<String>)

    hostnames or .suffix patterns.

Returns:

  • (Array<String>)



# File 'lib/parse/embeddings.rb', line 287

def allowed_image_hosts=(hosts)
  unless hosts.is_a?(Array) && hosts.all? { |h| h.is_a?(String) && !h.empty? }
    raise ArgumentError,
          "Parse::Embeddings.allowed_image_hosts= expects Array<String> of " \
          "non-empty hostnames or '.suffix' patterns (got #{hosts.inspect})."
  end
  CONFIG_MUTEX.synchronize { @allowed_image_hosts = hosts.dup.freeze }
end

.configuration ⇒ Configuration

Returns the singleton configuration object.

Returns:

  • (Configuration)



# File 'lib/parse/embeddings.rb', line 172

def configuration
  # Double-checked memoization. The fast path is a single ivar
  # read; the slow path enters the mutex only when the
  # configuration is unallocated.
  @configuration || CONFIG_MUTEX.synchronize { @configuration ||= Configuration.new }
end

.configure {|config| ... } ⇒ Configuration

Block form for registering multiple providers at once. Prefer the one-liner register when adding a single provider; this form pays off when an initializer needs to set several or to mutate the registry conditionally.

Yield Parameters:

  • config (Configuration)

    the singleton configuration object.

Returns:

  • (Configuration)



# File 'lib/parse/embeddings.rb', line 166

def configure
  yield configuration if block_given?
  configuration
end

.ip_shaped_but_not_canonical?(host) ⇒ Boolean

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Return true when host looks like an obfuscated IP literal — rejecting hex (0x7f.0.0.1), octal-leading-zero (0177.0.0.1), decimal-blob (2130706433), and IPv4 short-forms (127.1, 127.0.1) BEFORE they reach DNS resolution. Anything that's clearly a hostname (contains a letter) falls through; canonical dotted-quad IPv4 and canonical IPv6 fall through; everything else is treated as obfuscated.

Round-2 audit identified two bypasses in the prior version: (1) 0x7f.0.0.1 passed the [a-zA-Z] early-out because of the x, and (2) bare-digit hostnames like 127.1 were accepted as DNS hostnames. This rewrite makes the check whitelist-shaped: explicit accept for canonical IPv4 / IPv6 / alpha-containing hostnames; explicit reject for hex prefix and any pure digits-and-dots that isn't a canonical 4-octet form.

Returns:

  • (Boolean)


# File 'lib/parse/embeddings.rb', line 509

def ip_shaped_but_not_canonical?(host)
  # Hex prefix anywhere in the host (`0x7f`, `0.0X7f.0.1`) →
  # obfuscated. Case-insensitive `x`.
  return true if host =~ /(\A|\.)0[xX]/

  # Strict canonical dotted-quad IPv4: exactly 4 decimal octets,
  # 0..255, no leading zeros (except `0` itself).
  if host =~ /\A\d+(?:\.\d+){3}\z/
    octets = host.split(".")
    return true if octets.any? { |s| s.length > 1 && s.start_with?("0") }  # octal
    return true if octets.map(&:to_i).any? { |o| o > 255 }                 # > 255
    return false
  end

  # Numeric-only with dots but not 4 octets (`127.1`, `1.2.3`,
  # `1.2.3.4.5`) → IPv4 short-form / oversized. Refuse.
  return true if host =~ /\A\d+(?:\.\d+)+\z/

  # Pure-digit single label (`2130706433`, `0`, `42`) → decimal
  # IP blob. Refuse.
  return true if host =~ /\A\d+\z/

  # Anything else: try parsing as IPv6 (canonical IPv6 literals
  # like `::1`, `2001:db8::1`, `::ffff:1.2.3.4` succeed; the
  # CIDR check downstream catches private ranges including
  # IPv4-mapped IPv6 of private IPv4).
  begin
    IPAddr.new(host)
    false
  rescue IPAddr::InvalidAddressError
    # Not an IP, not numeric-shaped → must be a hostname.
    # Resolver downstream will validate or reject.
    false
  end
end
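
To see why these shapes are refused, the following sketch decodes them the way a lenient IPv4 parser (for example the classic inet_aton conventions) would. It uses only Ruby's standard library; the constant names are illustrative and not part of the gem.

```ruby
require "ipaddr"
require "socket"

# Decimal blob: a single 32-bit integer read as an IPv4 address.
DECIMAL_FORM = IPAddr.new(2130706433, Socket::AF_INET).to_s

# Hex and leading-zero octal octets both decode to 127.
HEX_OCTET   = Integer("0x7f")   # hexadecimal prefix
OCTAL_OCTET = Integer("0177")   # a leading zero means octal

# Short form "127.1": the last part fills the remaining three bytes.
SHORT_FORM = IPAddr.new((127 << 24) | 1, Socket::AF_INET).to_s

puts DECIMAL_FORM   # 127.0.0.1
puts HEX_OCTET      # 127
puts OCTAL_OCTET    # 127
puts SHORT_FORM     # 127.0.0.1
```

Every one of these spellings lands on loopback, which is why the predicate treats them as obfuscation rather than as hostnames.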

.provider(name) ⇒ Provider

Look up a registered provider.

Zero-config fallback: :fixture returns a default Fixture instance (64-dim, deterministic) when nothing is registered. Every other name raises ProviderNotRegistered. Tests can rely on provider(:fixture) working out of the box; production code must register what it uses.

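Parameters:

  • name

    name the provider was registered under (Symbol or String).

Returns:

  • (Provider)

Raises:

  • (ProviderNotRegistered)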



# File 'lib/parse/embeddings.rb', line 208

def provider(name)
  # Avoid blindly `to_sym`-ing the caller's input. An LLM tool or
  # webhook handler that pipes its `name:` argument through here
  # would otherwise let a remote caller grow the symbol table at
  # will. Ruby 2.2+ GCs dynamic symbols so the practical impact is
  # small, but a string-matched lookup costs nothing and closes the gap.
  if name.is_a?(Symbol)
    return configuration.providers[name] if configuration.providers.key?(name)
    key_string = name.to_s
  else
    key_string = name.to_s
    found = configuration.providers.keys.find { |k| k.to_s == key_string }
    return configuration.providers[found] if found
  end
  if key_string == "fixture"
    CONFIG_MUTEX.synchronize do
      return configuration.providers[:fixture] ||= Fixture.new
    end
  end
  raise ProviderNotRegistered,
        "Parse::Embeddings.provider(#{name.inspect}): no provider registered. " \
        "Register one via Parse::Embeddings.register(#{name.inspect}, …)."
end
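
The string-matched lookup can be sketched on its own. REGISTRY and lookup are hypothetical names for illustration; the point is that a String argument is compared against the existing Symbol keys without ever being converted to a Symbol itself.

```ruby
# Hypothetical registry with Symbol keys, as register stores them.
REGISTRY = { openai: :openai_provider }.freeze

def lookup(registry, name)
  # Symbols already exist in the symbol table, so a direct read is safe.
  return registry[name] if name.is_a?(Symbol)

  key_string = name.to_s
  # Compare as Strings; the caller's input is never interned.
  found = registry.keys.find { |k| k.to_s == key_string }
  found && registry[found]
end

lookup(REGISTRY, :openai)          # direct Symbol hit
lookup(REGISTRY, "openai")         # String matched against Symbol keys
lookup(REGISTRY, "not-registered") # nil, and no new Symbol is created
```

The scan is linear in the number of registered providers, which is negligible for a registry that typically holds a handful of entries.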

.register(name, provider) ⇒ Provider

Canonical one-liner: register a single provider under name. Overwrites any previous registration. Use configure for multi-provider blocks.

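Parameters:

  • name

    name to register under (converted with to_sym).

  • provider (Provider)

    the provider instance to register.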

Returns:

  • (Provider)

    the registered provider.



# File 'lib/parse/embeddings.rb', line 186

def register(name, provider)
  unless provider.is_a?(Provider)
    raise ArgumentError,
          "Parse::Embeddings.register: #{name.inspect} expects a Parse::Embeddings::Provider " \
          "instance (got #{provider.class})."
  end
  CONFIG_MUTEX.synchronize do
    configuration.providers[name.to_sym] = provider
  end
end

.registered_provider_names ⇒ Array<Symbol>

Names of currently-registered providers (does NOT include the implicit :fixture fallback unless it's been instantiated).

Returns:

  • (Array<Symbol>)



# File 'lib/parse/embeddings.rb', line 236

def registered_provider_names
  configuration.providers.keys
end

.reset!

This method returns an undefined value.

Reset the entire registry — intended for test teardown only. Production code should never call this; use register to override a single provider.



# File 'lib/parse/embeddings.rb', line 245

def reset!
  CONFIG_MUTEX.synchronize do
    @configuration = nil
    @allowed_image_hosts = nil
    @trust_provider_url_fetch = nil
  end
end

.trust_provider_url_fetch=(value) ⇒ Object

Sentinel-gated opt-in for forwarding image URLs to embedding providers. Assign the exact TRUST_PROVIDER_URL_FETCH_SENTINEL String to unlock; any other value (including true, 1, "true", or a non-matching String) raises ConfirmationRequired. Reset to nil to disable.

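Parameters:

  • value

    the exact TRUST_PROVIDER_URL_FETCH_SENTINEL String, or nil to disable.

Raises:

  • (ConfirmationRequired)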



# File 'lib/parse/embeddings.rb', line 309

def trust_provider_url_fetch=(value)
  if value.nil?
    CONFIG_MUTEX.synchronize { @trust_provider_url_fetch = nil }
    return
  end
  unless value.is_a?(String) && value == TRUST_PROVIDER_URL_FETCH_SENTINEL
    raise ConfirmationRequired,
          "Parse::Embeddings.trust_provider_url_fetch= requires the exact sentinel " \
          "String #{TRUST_PROVIDER_URL_FETCH_SENTINEL.inspect}. Plain `true` and " \
          "other values are refused — forwarding image URLs to a third-party " \
          "provider lets that provider issue an HTTP request from its own network " \
          "with attacker-controllable host/path. Set the sentinel only after you " \
          "have configured Parse::Embeddings.allowed_image_hosts AND reviewed the " \
          "provider's documented egress behavior (DNS rebinding window, redirect " \
          "policy)."
  end
  CONFIG_MUTEX.synchronize { @trust_provider_url_fetch = value }
end
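
The sentinel-gated setter pattern can be sketched standalone as below. Gate, its SENTINEL constant, and the refused? helper are hypothetical stand-ins for illustration; only the sentinel String value is taken from the documentation above.

```ruby
# Hypothetical stand-in for the gate; not the gem's module.
module Gate
  SENTINEL = "PROVIDER_EGRESS_VERIFIED"

  class ConfirmationRequired < StandardError; end

  def self.trust=(value)
    # nil disables; anything except the exact sentinel String is refused.
    unless value.nil? || (value.is_a?(String) && value == SENTINEL)
      raise ConfirmationRequired, "expected #{SENTINEL.inspect}"
    end
    @trust = value
  end

  def self.trusted?
    @trust == SENTINEL
  end

  # Convenience for the demo: true when the setter refuses the value.
  def self.refused?(value)
    self.trust = value
    false
  rescue ConfirmationRequired
    true
  end
end

Gate.refused?(true)      # truthy values do not unlock the gate
Gate.refused?("true")    # neither do near-miss Strings
Gate.trust = Gate::SENTINEL
```

Requiring a specific String rather than true means the opt-in cannot be switched on by a generic boolean config flag; the operator has to type the acknowledgement out.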

.trust_provider_url_fetch? ⇒ Boolean

Returns whether image-URL forwarding is currently unlocked.

Returns:

  • (Boolean)

    whether image-URL forwarding is currently unlocked.



# File 'lib/parse/embeddings.rb', line 329

def trust_provider_url_fetch?
  @trust_provider_url_fetch == TRUST_PROVIDER_URL_FETCH_SENTINEL
end

.validate_image_url!(url, allow_insecure: false) ⇒ String

Validate an image URL for forwarding to an embedding provider. Returns the canonicalized URL String on success; raises InvalidImageURL or ConfirmationRequired on failure.

Validation layers (in order):

  1. trust_provider_url_fetch? sentinel must be set. Without it, no URL, public or private, is forwarded.
  2. URL is a non-empty String that parses as https:// (or http:// if allow_insecure: is true; only intended for local development).
  3. No userinfo (basic-auth credentials in the URL).
  4. Host is present and is not an obfuscated or non-canonical IP literal (see ip_shaped_but_not_canonical?).
  5. Host matches allowed_image_hosts. Empty allowlist denies every host; see allowed_image_hosts= for rationale. This pure-string check runs before any DNS lookup.
  6. Port is in File.allowed_remote_ports.
  7. Host resolves only to addresses NOT in File::BLOCKED_CIDRS (CIDR check via Parse::File.assert_host_allowed!). The same primitive is used by File.safe_open_url, so the SSRF mechanism is shared. This is the most expensive check, so it runs last.

The DNS-rebinding window between this validation and the provider's own fetch is the residual risk that trust_provider_url_fetch= forces the operator to acknowledge.
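
The scheme, userinfo, host, and port checks all read fields that Ruby's standard URI parser exposes. The sketch below shows those fields for a few inputs; url_parts is a hypothetical helper for illustration, and the example URLs are made up.

```ruby
require "uri"

# Hypothetical helper: collects the URI fields the validation layers inspect.
def url_parts(url)
  uri = URI.parse(url)
  {
    scheme:   uri.scheme,
    userinfo: uri.userinfo,
    host:     uri.host,      # keeps IPv6 brackets
    hostname: uri.hostname,  # strips IPv6 brackets
    port:     uri.port,      # falls back to the scheme's default port
  }
end

p url_parts("https://cdn.example.com/a.png")
p url_parts("https://user:pw@cdn.example.com/a.png")
p url_parts("http://cdn.example.com:8080/a.png")
p url_parts("https://[::1]/a.png")
```

The last call shows why the validator compares against hostname rather than host: the bracketed form "[::1]" would never equal an allowlist entry written as "::1".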

Parameters:

  • url (String)

    image URL.

  • allow_insecure (Boolean) (defaults to: false)

    permit http://. Only meaningful for local development or container-internal CDN proxies.

Returns:

  • (String)

    canonicalized URL (URI.parse(url).to_s).

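Raises:

  • (ConfirmationRequired)

    when the trust_provider_url_fetch? sentinel is not set.

  • (InvalidImageURL)

    when any later validation layer rejects the URL.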



# File 'lib/parse/embeddings.rb', line 363

def validate_image_url!(url, allow_insecure: false)
  unless trust_provider_url_fetch?
    hint =
      if allowed_image_hosts.empty?
        " First populate Parse::Embeddings.allowed_image_hosts with the CDN " \
        "hostnames you trust (currently empty — every host would be denied " \
        "even after the sentinel is set)."
      else
        ""
      end
    raise ConfirmationRequired,
          "Parse::Embeddings.validate_image_url! refused: image-URL forwarding is " \
          "disabled. Set Parse::Embeddings.trust_provider_url_fetch = " \
          "#{TRUST_PROVIDER_URL_FETCH_SENTINEL.inspect} to enable it.#{hint}"
  end

  unless url.is_a?(String) && !url.empty?
    raise InvalidImageURL.new(:parse,
      "Parse::Embeddings.validate_image_url!: url must be a non-empty String " \
      "(got #{url.class}).")
  end

  uri = begin
    URI.parse(url)
  rescue URI::InvalidURIError => e
    raise InvalidImageURL.new(:parse,
      "Parse::Embeddings.validate_image_url!: invalid URL (#{e.message}).")
  end

  valid_schemes = allow_insecure ? %w[http https] : %w[https]
  unless valid_schemes.include?(uri.scheme)
    raise InvalidImageURL.new(:scheme,
      "Parse::Embeddings.validate_image_url!: scheme must be #{valid_schemes.join(' or ')} " \
      "(got #{uri.scheme.inspect}). Forwarding non-HTTPS image URLs to a provider " \
      "leaks any embedded query-string secrets in cleartext.")
  end

  if uri.userinfo
    raise InvalidImageURL.new(:userinfo,
      "Parse::Embeddings.validate_image_url!: URL must not include userinfo " \
      "credentials. Embedding providers will forward the full URL in their fetch " \
      "and may log it.")
  end

  # `uri.hostname` returns the host WITHOUT IPv6 brackets,
  # where `uri.host` keeps the brackets. Using
  # `hostname` makes the allowlist comparison work uniformly for
  # IPv6 literals (operators write `::1`, not `[::1]`) and
  # matches the form `Parse::File.assert_host_allowed!` expects.
  host = uri.hostname
  if host.nil? || host.empty?
    raise InvalidImageURL.new(:parse,
      "Parse::Embeddings.validate_image_url!: URL is missing a host.")
  end

  # Reject non-canonical IPv4 forms (decimal `2130706433`,
  # octal `0177.0.0.1`, hex `0x7f.0.0.1`) before they reach
  # resolution. Most stacks' Resolv returns [] for these, so
  # they'd be blocked anyway — but via the resolution-failure
  # branch (`:parse` reason) rather than the CIDR branch, which
  # makes the failure mode look like a benign typo when it's
  # actually an obfuscated-localhost SSRF attempt. Explicitly
  # tagging the failure as `:host_blocked` keeps operator logs
  # honest. We allow exactly: dotted-quad IPv4 (4 decimal
  # octets), bracketed-or-bare IPv6 (parsed by IPAddr), and
  # DNS hostnames (anything containing a letter or non-numeric
  # character).
  if ip_shaped_but_not_canonical?(host)
    raise InvalidImageURL.new(:host_blocked,
      "Parse::Embeddings.validate_image_url!: host #{host.inspect} is an obfuscated " \
      "or non-canonical IP literal. Use dotted-quad IPv4 (a.b.c.d) or canonical IPv6. " \
      "Decimal/octal/hex IP forms are refused to prevent localhost-bypass attempts.")
  end

  # **Image-host allowlist runs BEFORE the resolver hop.** Round-2
  # audit (LOW finding #3) noted that a caller passing N URLs to
  # a public `embed_image` API could amplify DNS traffic at ~N×
  # before the allowlist filtered them out — the pure-string
  # match is cheap, the resolution is a syscall. Allowlist-first
  # ordering eliminates the amplification surface.
  allowed = allowed_image_hosts
  if allowed.empty?
    raise InvalidImageURL.new(:host_not_allowlisted,
      "Parse::Embeddings.validate_image_url!: Parse::Embeddings.allowed_image_hosts " \
      "is empty — every image URL is denied. Add the CDN hostnames you trust before " \
      "forwarding image URLs to a provider.")
  end
  permitted = allowed.any? do |entry|
    if entry.start_with?(".")
      host.downcase.end_with?(entry.downcase) ||
        host.casecmp(entry[1..]).zero?
    else
      host.casecmp(entry).zero?
    end
  end
  unless permitted
    raise InvalidImageURL.new(:host_not_allowlisted,
      "Parse::Embeddings.validate_image_url!: host #{host.inspect} not in " \
      "Parse::Embeddings.allowed_image_hosts (#{allowed.inspect}).")
  end

  # Port allowlist runs after the host allowlist (cheap string
  # check first). Reuses Parse::File's port allowlist — same
  # threat model (internal-port probing via DNS rebinding).
  port = uri.port || (uri.scheme == "https" ? 443 : 80)
  require_relative "model/file"
  unless Parse::File.allowed_remote_ports.include?(port)
    raise InvalidImageURL.new(:port,
      "Parse::Embeddings.validate_image_url!: port #{port} not in " \
      "Parse::File.allowed_remote_ports.")
  end

  # CIDR + DNS resolution last — most expensive (syscall). An
  # allowlisted CDN hostname pointing at a private IP (DNS
  # poisoning / hostile-allowlist-entry / first-party rebind)
  # is the residual surface this catches. Delegates to
  # Parse::File's shared SSRF primitive.
  begin
    Parse::File.assert_host_allowed!(host)
  rescue ArgumentError => e
    tag = e.message.include?("private/internal address") ? :host_blocked : :parse
    raise InvalidImageURL.new(tag,
      "Parse::Embeddings.validate_image_url!: #{e.message}")
  end

  # Return the canonicalized URL so callers store/forward
  # exactly what was validated, not the raw input.
  uri.to_s
end