Module: Parse::Embeddings
- Defined in:
- lib/parse/embeddings.rb,
lib/parse/embeddings/jina.rb,
lib/parse/embeddings/qwen.rb,
lib/parse/embeddings/cohere.rb,
lib/parse/embeddings/openai.rb,
lib/parse/embeddings/voyage.rb,
lib/parse/embeddings/fixture.rb,
lib/parse/embeddings/provider.rb,
lib/parse/embeddings/local_http.rb
Overview
Pluggable embedding-provider registry for :vector properties and
the upcoming find_similar(text:) / Parse::Retrieval.retrieve
surfaces.
Text-only providers shipped:
- Fixture: deterministic, zero-network. Auto-registered as :fixture so tests can call Parse::Embeddings.provider(:fixture) with no setup.
- OpenAI: text-embedding-3-small, text-embedding-3-large and ada-002.
- Cohere: embed-english-v3.0, embed-multilingual-v3.0 and *-light-v3.0. Distinguishes :search_query / :search_document at the wire.
- Voyage: voyage-4 family (incl. open-weight voyage-4-nano), voyage-3 family, voyage-code-3, voyage-finance-2, voyage-law-2. Distinguishes input types.
- Jina: jina-embeddings-v3/v4/v5 (text + omni-text mode), jina-code-embeddings-0.5b and -1.5b. Matryoshka via dimensions:.
- Qwen: qwen3-embedding-0.6b, -4b and -8b via Alibaba Cloud DashScope compatible-mode. All Matryoshka. The same checkpoints are open-weight on Hugging Face (Apache 2.0) for self-hosting behind LocalHTTP.
- LocalHTTP: generic OpenAI-compatible client for Ollama, LM Studio, vLLM, etc. Configure-time SSRF gate; requires allow_private_endpoint: true to talk to localhost.
Image / multimodal embedding (embed_image) is a forthcoming
feature — the Provider#embed_image hook is defined but only the
multimodal-capable providers will override it.
Registration
Two equivalent forms. Embeddings.register is the canonical one-liner and what every example in the gem uses; Embeddings.configure is the block form for registering several providers at once or for Rails-style initializers. Both end up at the same ProviderRegistry, so pick whichever reads better in context.
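The two forms can be sketched side by side. To stay runnable without the gem or network access, this example defines a small stand-in module that mirrors the register, configure, and configuration bodies listed under Class Method Details. RegistrySketch, StubProvider, and the provider names are hypothetical; with the real gem the calls would go to Parse::Embeddings and take a real Provider instance, whose constructor arguments are not shown on this page.

```ruby
require "monitor"

# Stand-in for Parse::Embeddings, reduced to the registration surface.
# RegistrySketch and StubProvider are hypothetical names for illustration.
module RegistrySketch
  LOCK = Monitor.new
  StubProvider = Struct.new(:label)

  class Configuration
    attr_reader :providers

    def initialize
      @providers = {}
    end
  end

  class << self
    def configuration
      @configuration || LOCK.synchronize { @configuration ||= Configuration.new }
    end

    # One-liner form: a single provider, overwriting any earlier entry.
    def register(name, provider)
      LOCK.synchronize { configuration.providers[name.to_sym] = provider }
    end

    # Block form: yields the same configuration object that register writes to.
    def configure
      yield configuration if block_given?
      configuration
    end
  end
end

# Single provider: the one-liner reads best.
RegistrySketch.register(:primary, RegistrySketch::StubProvider.new("a"))

# Several providers in an initializer: the block form groups them.
RegistrySketch.configure do |config|
  config.providers[:secondary] = RegistrySketch::StubProvider.new("b")
  config.providers[:tertiary]  = RegistrySketch::StubProvider.new("c")
end

puts RegistrySketch.configuration.providers.keys.inspect
```

Both calls write into the same providers hash, which is why the choice between them is purely about readability.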
Defined Under Namespace
Classes: Cohere, Configuration, ConfirmationRequired, Error, Fixture, InvalidImageURL, InvalidResponseError, Jina, LocalHTTP, OpenAI, Provider, ProviderNotRegistered, ProviderRegistry, Qwen, Voyage
Constant Summary
- CONFIG_MUTEX = Monitor.new
  Monitor guarding configuration memoization and register writes. MRI's GVL would normally absorb the race on @configuration ||= ..., but JRuby and TruffleRuby can produce two Configuration instances when two threads race at boot (and lose any provider written to the loser). A Monitor (rather than a Mutex) is used so that register, which holds the lock and then calls configuration, can re-enter without deadlocking on the first-touch allocation path.
- TRUST_PROVIDER_URL_FETCH_SENTINEL = "PROVIDER_EGRESS_VERIFIED"
  The sentinel value that trust_provider_url_fetch= requires. An exact match unlocks validate_image_url! for URL forwarding to embedding providers. Any other value is refused with ConfirmationRequired. The constant is frozen so callers cannot mutate it in-place.
Class Method Summary
- .allowed_image_hosts ⇒ Array<String>
  Currently-configured image-host allowlist (frozen).
- .allowed_image_hosts=(hosts) ⇒ Array<String>
  Configure the host allowlist that Embeddings.validate_image_url! checks an incoming image URL's host against.
- .configuration ⇒ Configuration
  The singleton configuration object.
- .configure {|config| ... } ⇒ Configuration
  Block form for registering multiple providers at once.
- .ip_shaped_but_not_canonical?(host) ⇒ Boolean (private)
  Return true when host looks like an obfuscated IP literal, rejecting hex (0x7f.0.0.1), octal-leading-zero (0177.0.0.1), decimal-blob (2130706433), and IPv4 short-forms (127.1, 127.0.1) BEFORE they reach DNS resolution.
- .provider(name) ⇒ Provider
  Look up a registered provider.
- .register(name, provider) ⇒ Provider
  Canonical one-liner: register a single provider under name.
- .registered_provider_names ⇒ Array<Symbol>
  Names of currently-registered providers (does NOT include the implicit :fixture fallback unless it's been instantiated).
- .reset!
  Reset the entire registry; intended for test teardown only.
- .trust_provider_url_fetch=(value) ⇒ Object
  Sentinel-gated opt-in for forwarding image URLs to embedding providers.
- .trust_provider_url_fetch? ⇒ Boolean
  Whether image-URL forwarding is currently unlocked.
- .validate_image_url!(url, allow_insecure: false) ⇒ String
  Validate an image URL for forwarding to an embedding provider.
Class Method Details
.allowed_image_hosts ⇒ Array<String>
Returns the currently configured image-host allowlist (frozen).
    # File 'lib/parse/embeddings.rb', line 297

    def allowed_image_hosts
      @allowed_image_hosts ||= [].freeze
    end
.allowed_image_hosts=(hosts) ⇒ Array<String>
Configure the host allowlist that validate_image_url! checks
an incoming image URL's host against. Entries that begin with
. match suffixes (.cdn.example.com matches
images.cdn.example.com and cdn.example.com itself);
entries without a leading . are exact-match.
Empty allowlist means "deny all". This is the opposite default from File.allowed_remote_hosts (where empty means "any public host"). The asymmetry is deliberate: image URLs that reach validate_image_url! typically originate from attacker-controlled inputs (chat queries, agent tool args, user-submitted document fields), so opening the surface requires an explicit operator declaration of which CDNs are trusted.
    # File 'lib/parse/embeddings.rb', line 287

    def allowed_image_hosts=(hosts)
      unless hosts.is_a?(Array) && hosts.all? { |h| h.is_a?(String) && !h.empty? }
        raise ArgumentError,
              "Parse::Embeddings.allowed_image_hosts= expects Array<String> of " \
              "non-empty hostnames or '.suffix' patterns (got #{hosts.inspect})."
      end
      CONFIG_MUTEX.synchronize { @allowed_image_hosts = hosts.dup.freeze }
    end
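The suffix and exact-match rules described above can be tried in isolation. This is a minimal sketch that lifts the matching expression from the validate_image_url! listing into a standalone helper; host_permitted? is a hypothetical name, not part of the gem.

```ruby
# Returns true when `host` matches any allowlist entry.
# Entries starting with "." match as suffixes and also match the bare
# domain; all other entries must match exactly (case-insensitively).
def host_permitted?(host, allowlist)
  allowlist.any? do |entry|
    if entry.start_with?(".")
      host.downcase.end_with?(entry.downcase) || host.casecmp(entry[1..]).zero?
    else
      host.casecmp(entry).zero?
    end
  end
end

allowlist = [".cdn.example.com", "images.example.org"]

%w[
  images.cdn.example.com
  cdn.example.com
  evilcdn.example.com
  images.example.org
  a.images.example.org
].each do |host|
  puts format("%-26s %s", host, host_permitted?(host, allowlist))
end

# An empty allowlist permits nothing, matching the "deny all" default.
puts host_permitted?("images.cdn.example.com", [])
```

Because the suffix entry keeps its leading dot in the end_with? comparison, a look-alike such as evilcdn.example.com does not match .cdn.example.com.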
.configuration ⇒ Configuration
Returns the singleton configuration object.
    # File 'lib/parse/embeddings.rb', line 172

    def configuration
      # Double-checked memoization. The fast path is a single ivar
      # read; the slow path enters the mutex only when the
      # configuration is unallocated.
      @configuration || CONFIG_MUTEX.synchronize { @configuration ||= Configuration.new }
    end
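The slow path above can be entered while the lock is already held (register synchronizes on CONFIG_MUTEX and then calls configuration), which is why the constant is a Monitor. The following sketch uses only the standard library to show the difference in re-entrancy; the local variable and constant names are illustrative.

```ruby
require "monitor"

monitor = Monitor.new
mutex   = Mutex.new

# Monitor: the owning thread may re-enter, so the nested block runs.
MONITOR_RESULT = monitor.synchronize { monitor.synchronize { :reentered } }

# Mutex: the same nested pattern raises ThreadError on MRI
# instead of re-entering.
MUTEX_RESULT =
  begin
    mutex.synchronize { mutex.synchronize { :reentered } }
  rescue ThreadError => e
    e.class
  end

puts MONITOR_RESULT.inspect
puts MUTEX_RESULT.inspect
```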
.configure {|config| ... } ⇒ Configuration
Block form for registering multiple providers at once. Prefer the one-liner register when adding a single provider; this form pays off when an initializer needs to set several or to mutate the registry conditionally.
    # File 'lib/parse/embeddings.rb', line 166

    def configure
      yield configuration if block_given?
      configuration
    end
.ip_shaped_but_not_canonical?(host) ⇒ Boolean
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Return true when host looks like an obfuscated IP literal —
rejecting hex (0x7f.0.0.1), octal-leading-zero (0177.0.0.1),
decimal-blob (2130706433), and IPv4 short-forms (127.1,
127.0.1) BEFORE they reach DNS resolution. Anything that's
clearly a hostname (contains a letter) falls through; canonical
dotted-quad IPv4 and canonical IPv6 fall through; everything
else is treated as obfuscated.
Round-2 audit identified two bypasses in the prior version:
(1) 0x7f.0.0.1 passed the [a-zA-Z] early-out because of
the x, and (2) bare-digit hostnames like 127.1 were
accepted as DNS hostnames. This rewrite makes the check
whitelist-shaped: explicit accept for canonical IPv4 / IPv6 /
alpha-containing hostnames; explicit reject for hex prefix and
any pure digits-and-dots that isn't a canonical 4-octet form.
    # File 'lib/parse/embeddings.rb', line 509

    def ip_shaped_but_not_canonical?(host)
      # Hex prefix anywhere in the host (`0x7f`, `0.0X7f.0.1`) →
      # obfuscated. Case-insensitive `x`.
      return true if host =~ /(\A|\.)0[xX]/

      # Strict canonical dotted-quad IPv4: exactly 4 decimal octets,
      # 0..255, no leading zeros (except `0` itself).
      if host =~ /\A\d+(?:\.\d+){3}\z/
        octets = host.split(".")
        return true if octets.any? { |s| s.length > 1 && s.start_with?("0") } # octal
        return true if octets.map(&:to_i).any? { |o| o > 255 } # > 255
        return false
      end

      # Numeric-only with dots but not 4 octets (`127.1`, `1.2.3`,
      # `1.2.3.4.5`) → IPv4 short-form / oversized. Refuse.
      return true if host =~ /\A\d+(?:\.\d+)+\z/

      # Pure-digit single label (`2130706433`, `0`, `42`) → decimal
      # IP blob. Refuse.
      return true if host =~ /\A\d+\z/

      # Anything else: try parsing as IPv6 (canonical IPv6 literals
      # like `::1`, `2001:db8::1`, `::ffff:1.2.3.4` succeed; the
      # CIDR check downstream catches private ranges including
      # IPv4-mapped IPv6 of private IPv4).
      begin
        IPAddr.new(host)
        false
      rescue IPAddr::InvalidAddressError
        # Not an IP, not numeric-shaped → must be a hostname.
        # Resolver downstream will validate or reject.
        false
      end
    end
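To see how the check classifies typical inputs, here is a condensed restatement that can be run on its own. It is a sketch, not the gem's method: obfuscated_ip_literal? is a hypothetical name, the two digits-and-dots patterns are merged into one, and the IPAddr branch is omitted because both of its outcomes return false in the listing above.

```ruby
# Condensed, standalone restatement of the listing's decision order.
def obfuscated_ip_literal?(host)
  # Hex prefix on any label.
  return true if host.match?(/(\A|\.)0[xX]/)

  # Exactly four decimal labels: canonical only if no label has a
  # leading zero and no label exceeds 255.
  if host.match?(/\A\d+(?:\.\d+){3}\z/)
    return host.split(".").any? do |s|
      (s.length > 1 && s.start_with?("0")) || s.to_i > 255
    end
  end

  # Any other digits-and-dots shape (short form, oversized, decimal blob).
  host.match?(/\A\d+(?:\.\d+)*\z/)
end

SAMPLES = %w[
  0x7f.0.0.1 0177.0.0.1 2130706433 127.1 256.0.0.1
  127.0.0.1 ::1 example.com
].freeze

VERDICTS = SAMPLES.to_h { |host| [host, obfuscated_ip_literal?(host)] }

VERDICTS.each { |host, verdict| puts format("%-12s %s", host, verdict) }
```

The first five samples are flagged as obfuscated; the canonical IPv4 literal, the IPv6 literal, and the ordinary hostname fall through to the later validation layers.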
.provider(name) ⇒ Provider
Look up a registered provider.
Zero-config fallback: :fixture returns a default
Fixture instance (64-dim, deterministic) when nothing is
registered. Every other name raises ProviderNotRegistered.
Tests can rely on provider(:fixture) working out of the box;
production code must register what it uses.
    # File 'lib/parse/embeddings.rb', line 208

    def provider(name)
      # Avoid blindly `to_sym`-ing the caller's input. An LLM tool or
      # webhook handler that pipes its `name:` argument through here
      # would otherwise let a remote caller grow the symbol table at
      # will. Ruby 3.2+ GCs symbols so the practical impact is small,
      # but a string-matched lookup costs nothing and closes the gap.
      if name.is_a?(Symbol)
        return configuration.providers[name] if configuration.providers.key?(name)
        key_string = name.to_s
      else
        key_string = name.to_s
        found = configuration.providers.keys.find { |k| k.to_s == key_string }
        return configuration.providers[found] if found
      end
      if key_string == "fixture"
        CONFIG_MUTEX.synchronize do
          return configuration.providers[:fixture] ||= Fixture.new
        end
      end
      raise ProviderNotRegistered,
            "Parse::Embeddings.provider(#{name.inspect}): no provider registered. " \
            "Register one via Parse::Embeddings.register(#{name.inspect}, …)."
    end
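The lookup strategy in the listing, comparing the caller's input against the string form of existing keys rather than interning it, can be sketched with a plain Hash. PROVIDERS and find_provider are hypothetical stand-ins, and the String values stand in for provider instances. Unlike the real method, this sketch returns nil for an unknown name instead of raising ProviderNotRegistered, and it leaves out the :fixture fallback.

```ruby
# Hypothetical registry contents; values stand in for provider instances.
PROVIDERS = { openai: "openai-provider", local: "local-provider" }.freeze

def find_provider(providers, name)
  # Symbols are already interned, so a direct key lookup is safe.
  return providers[name] if name.is_a?(Symbol) && providers.key?(name)

  # For anything else, compare string forms of the existing keys so the
  # caller's input is never converted to a Symbol.
  wanted = name.to_s
  key = providers.keys.find { |k| k.to_s == wanted }
  key ? providers[key] : nil
end

puts find_provider(PROVIDERS, :openai).inspect
puts find_provider(PROVIDERS, "local").inspect
puts find_provider(PROVIDERS, "not-registered").inspect
```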
.register(name, provider) ⇒ Provider
Canonical one-liner: register a single provider under name.
Overwrites any previous registration. Use configure for
multi-provider blocks.
    # File 'lib/parse/embeddings.rb', line 186

    def register(name, provider)
      unless provider.is_a?(Provider)
        raise ArgumentError,
              "Parse::Embeddings.register: #{name.inspect} expects a Parse::Embeddings::Provider " \
              "instance (got #{provider.class})."
      end
      CONFIG_MUTEX.synchronize do
        configuration.providers[name.to_sym] = provider
      end
    end
.registered_provider_names ⇒ Array<Symbol>
Names of currently-registered providers (does NOT include the
implicit :fixture fallback unless it's been instantiated).
    # File 'lib/parse/embeddings.rb', line 236

    def registered_provider_names
      configuration.providers.keys
    end
.reset!
This method returns an undefined value.
Reset the entire registry — intended for test teardown only. Production code should never call this; use register to override a single provider.
    # File 'lib/parse/embeddings.rb', line 245

    def reset!
      CONFIG_MUTEX.synchronize do
        @configuration = nil
        @allowed_image_hosts = nil
        @trust_provider_url_fetch = nil
      end
    end
.trust_provider_url_fetch=(value) ⇒ Object
Sentinel-gated opt-in for forwarding image URLs to embedding
providers. Assign the exact TRUST_PROVIDER_URL_FETCH_SENTINEL
String to unlock; any other value (including true, 1,
"true", or a non-matching String) raises
ConfirmationRequired. Reset to nil to disable.
    # File 'lib/parse/embeddings.rb', line 309

    def trust_provider_url_fetch=(value)
      if value.nil?
        CONFIG_MUTEX.synchronize { @trust_provider_url_fetch = nil }
        return
      end
      unless value.is_a?(String) && value == TRUST_PROVIDER_URL_FETCH_SENTINEL
        raise ConfirmationRequired,
              "Parse::Embeddings.trust_provider_url_fetch= requires the exact sentinel " \
              "String #{TRUST_PROVIDER_URL_FETCH_SENTINEL.inspect}. Plain `true` and " \
              "other values are refused — forwarding image URLs to a third-party " \
              "provider lets that provider issue an HTTP request from its own network " \
              "with attacker-controllable host/path. Set the sentinel only after you " \
              "have configured Parse::Embeddings.allowed_image_hosts AND reviewed the " \
              "provider's documented egress behavior (DNS rebinding window, redirect " \
              "policy)."
      end
      CONFIG_MUTEX.synchronize { @trust_provider_url_fetch = value }
    end
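The sentinel-gated pattern can be exercised without the gem. In this sketch FetchGateSketch is a hypothetical stand-in module that reproduces only the accept and refuse logic of the setter above (no locking, shortened error message); the sentinel string is the one documented in the Constant Summary.

```ruby
# Hypothetical stand-in for the sentinel gate; not the gem's module.
module FetchGateSketch
  SENTINEL = "PROVIDER_EGRESS_VERIFIED"

  class ConfirmationRequired < StandardError; end

  class << self
    def trust=(value)
      # nil always disables the gate.
      return @trust = nil if value.nil?

      # Only the exact sentinel String is accepted; truthy values are not.
      unless value.is_a?(String) && value == SENTINEL
        raise ConfirmationRequired, "exact sentinel String required (got #{value.inspect})"
      end
      @trust = value
    end

    def trusted?
      @trust == SENTINEL
    end
  end
end

# Truthy or near-miss values are all refused.
REFUSED = [true, 1, "true", "provider_egress_verified"].map do |candidate|
  begin
    FetchGateSketch.trust = candidate
    :accepted
  rescue FetchGateSketch::ConfirmationRequired
    :refused
  end
end

FetchGateSketch.trust = FetchGateSketch::SENTINEL
puts REFUSED.inspect
puts FetchGateSketch.trusted?
```

Requiring a specific string rather than true makes the opt-in searchable in a codebase and hard to enable by accident through a generic boolean setting.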
.trust_provider_url_fetch? ⇒ Boolean
Returns whether image-URL forwarding is currently unlocked.
    # File 'lib/parse/embeddings.rb', line 329

    def trust_provider_url_fetch?
      @trust_provider_url_fetch == TRUST_PROVIDER_URL_FETCH_SENTINEL
    end
.validate_image_url!(url, allow_insecure: false) ⇒ String
Validate an image URL for forwarding to an embedding provider. Returns the canonicalized URL String on success; raises InvalidImageURL or ConfirmationRequired on failure.
Validation layers (in the order the method applies them):
- trust_provider_url_fetch? sentinel must be set. Without it, no URL, public or private, is forwarded.
- URL is a non-empty String that parses with an https:// scheme (or http:// if allow_insecure: is true; only intended for local development).
- No userinfo (basic-auth credentials in the URL).
- Host is present and is not an obfuscated or non-canonical IP literal (see ip_shaped_but_not_canonical?).
- Host matches allowed_image_hosts. Empty allowlist denies every host; see allowed_image_hosts= for rationale. This string comparison runs before any DNS resolution.
- Port is in File.allowed_remote_ports.
- Host resolves only to addresses NOT in File::BLOCKED_CIDRS (CIDR check via Parse::File.assert_host_allowed!). The same primitive is used by File.safe_open_url, so the SSRF mechanism is shared.
The DNS-rebinding window between this validation and the provider's own fetch is the residual risk that trust_provider_url_fetch= forces the operator to acknowledge.
    # File 'lib/parse/embeddings.rb', line 363

    def validate_image_url!(url, allow_insecure: false)
      unless trust_provider_url_fetch?
        hint = if allowed_image_hosts.empty?
            " First populate Parse::Embeddings.allowed_image_hosts with the CDN " \
            "hostnames you trust (currently empty — every host would be denied " \
            "even after the sentinel is set)."
          else
            ""
          end
        raise ConfirmationRequired,
              "Parse::Embeddings.validate_image_url! refused: image-URL forwarding is " \
              "disabled. Set Parse::Embeddings.trust_provider_url_fetch = " \
              "#{TRUST_PROVIDER_URL_FETCH_SENTINEL.inspect} to enable it.#{hint}"
      end

      unless url.is_a?(String) && !url.empty?
        raise InvalidImageURL.new(:parse,
          "Parse::Embeddings.validate_image_url!: url must be a non-empty String " \
          "(got #{url.class}).")
      end

      uri = begin
          URI.parse(url)
        rescue URI::InvalidURIError => e
          raise InvalidImageURL.new(:parse,
            "Parse::Embeddings.validate_image_url!: invalid URL (#{e.message}).")
        end

      valid_schemes = allow_insecure ? %w[http https] : %w[https]
      unless valid_schemes.include?(uri.scheme)
        raise InvalidImageURL.new(:scheme,
          "Parse::Embeddings.validate_image_url!: scheme must be #{valid_schemes.join(' or ')} " \
          "(got #{uri.scheme.inspect}). Forwarding non-HTTPS image URLs to a provider " \
          "leaks any embedded query-string secrets in cleartext.")
      end

      if uri.userinfo
        raise InvalidImageURL.new(:userinfo,
          "Parse::Embeddings.validate_image_url!: URL must not include userinfo " \
          "credentials. Embedding providers will forward the full URL in their fetch " \
          "and may log it.")
      end

      # `uri.hostname` returns the IDNA-decoded form WITHOUT IPv6
      # brackets, where `uri.host` keeps the brackets. Using
      # `hostname` makes the allowlist comparison work uniformly for
      # IPv6 literals (operators write `::1`, not `[::1]`) and
      # matches the form `Parse::File.assert_host_allowed!` expects.
      host = uri.hostname
      if host.nil? || host.empty?
        raise InvalidImageURL.new(:parse,
          "Parse::Embeddings.validate_image_url!: URL is missing a host.")
      end

      # Reject non-canonical IPv4 forms (decimal `2130706433`,
      # octal `0177.0.0.1`, hex `0x7f.0.0.1`) before they reach
      # resolution. Most stacks' Resolv returns [] for these, so
      # they'd be blocked anyway — but via the resolution-failure
      # branch (`:parse` reason) rather than the CIDR branch, which
      # makes the failure mode look like a benign typo when it's
      # actually an obfuscated-localhost SSRF attempt. Explicitly
      # tagging the failure as `:host_blocked` keeps operator logs
      # honest. We allow exactly: dotted-quad IPv4 (4 decimal
      # octets), bracketed-or-bare IPv6 (parsed by IPAddr), and
      # DNS hostnames (anything containing a letter or non-numeric
      # character).
      if ip_shaped_but_not_canonical?(host)
        raise InvalidImageURL.new(:host_blocked,
          "Parse::Embeddings.validate_image_url!: host #{host.inspect} is an obfuscated " \
          "or non-canonical IP literal. Use dotted-quad IPv4 (a.b.c.d) or canonical IPv6. " \
          "Decimal/octal/hex IP forms are refused to prevent localhost-bypass attempts.")
      end

      # **Image-host allowlist runs BEFORE the resolver hop.** Round-2
      # audit (LOW finding #3) noted that a caller passing N URLs to
      # a public `embed_image` API could amplify DNS traffic at ~N×
      # before the allowlist filtered them out — the pure-string
      # match is cheap, the resolution is a syscall. Allowlist-first
      # ordering eliminates the amplification surface.
      allowed = allowed_image_hosts
      if allowed.empty?
        raise InvalidImageURL.new(:host_not_allowlisted,
          "Parse::Embeddings.validate_image_url!: Parse::Embeddings.allowed_image_hosts " \
          "is empty — every image URL is denied. Add the CDN hostnames you trust before " \
          "forwarding image URLs to a provider.")
      end
      permitted = allowed.any? do |entry|
        if entry.start_with?(".")
          host.downcase.end_with?(entry.downcase) || host.casecmp(entry[1..]).zero?
        else
          host.casecmp(entry).zero?
        end
      end
      unless permitted
        raise InvalidImageURL.new(:host_not_allowlisted,
          "Parse::Embeddings.validate_image_url!: host #{host.inspect} not in " \
          "Parse::Embeddings.allowed_image_hosts (#{allowed.inspect}).")
      end

      # Port allowlist runs after the host allowlist (cheap string
      # check first). Reuses Parse::File's port allowlist — same
      # threat model (internal-port probing via DNS rebinding).
      port = uri.port || (uri.scheme == "https" ? 443 : 80)
      require_relative "model/file"
      unless Parse::File.allowed_remote_ports.include?(port)
        raise InvalidImageURL.new(:port,
          "Parse::Embeddings.validate_image_url!: port #{port} not in " \
          "Parse::File.allowed_remote_ports.")
      end

      # CIDR + DNS resolution last — most expensive (syscall). An
      # allowlisted CDN hostname pointing at a private IP (DNS
      # poisoning / hostile-allowlist-entry / first-party rebind)
      # is the residual surface this catches. Delegates to
      # Parse::File's shared SSRF primitive.
      begin
        Parse::File.assert_host_allowed!(host)
      rescue ArgumentError => e
        tag = e.message.include?("private/internal address") ? :host_blocked : :parse
        raise InvalidImageURL.new(tag, "Parse::Embeddings.validate_image_url!: #{e.message}")
      end

      # Return the canonicalized URL so callers store/forward
      # exactly what was validated, not the raw input.
      uri.to_s
    end
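The purely syntactic layers of the method (scheme, userinfo, host extraction) depend only on Ruby's URI library and can be tried standalone. The helper below is a sketch under that assumption: precheck_image_url is a hypothetical name, it returns a reason Symbol instead of raising InvalidImageURL, and it skips the sentinel, allowlist, port, and CIDR layers entirely.

```ruby
require "uri"

# Returns a reason Symbol on rejection, or the extracted host and port
# when the URL passes the syntactic layers.
def precheck_image_url(url, allow_insecure: false)
  uri = URI.parse(url)

  schemes = allow_insecure ? %w[http https] : %w[https]
  return :scheme unless schemes.include?(uri.scheme)
  return :userinfo if uri.userinfo

  # `hostname` drops the brackets around IPv6 literals; `host` keeps them.
  host = uri.hostname
  return :parse if host.nil? || host.empty?

  { host: host, port: uri.port }
rescue URI::InvalidURIError
  :parse
end

[
  "http://cdn.example.com/a.png",
  "https://user:pw@cdn.example.com/a.png",
  "https://cdn.example.com:8443/a.png",
  "https://[::1]/a.png",
].each do |url|
  puts "#{url} => #{precheck_image_url(url).inspect}"
end
```

The IPv6 case shows why the listing reads uri.hostname: an operator's allowlist entry would be written as ::1, without brackets.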