Atlas Vector Search Guide
Parse Stack ships first-class support for MongoDB Atlas $vectorSearch
against Parse classes. This guide covers the full surface: declaring
:vector properties, registering embedding providers, running
find_similar queries, the embed and embed_image write-side
macros, Atlas index management, AS::N telemetry, and the constraint
and logging behavior callers need to know about.
v5.0 introduced the text-embedding path; v5.1 adds image embedding via
the new embed_image macro, Voyage#embed_image
(voyage-multimodal-3, 1024-dim), and Cohere#embed_image
(embed-v4.0, 1536-dim). Image inputs are URL-only in v5.1 (the SDK
forwards the file URL to the provider; the SDK does not fetch image
bytes) and are gated behind an explicit operator opt-in plus a CDN
allowlist — see §Image embedding below.
For the underlying mongo-direct enforcement model that vector search inherits, see mongodb_direct_guide.md.
When to use vector search
Use Atlas vector search when:
- You need semantic similarity ("articles about X" where "about X" is a meaning, not a substring) rather than substring / token matches.
- Your records have a natural text or image embedding source (title + body, transcript, caption, etc.).
- You are already running on MongoDB Atlas, or on a self-managed cluster with the search/vectorSearch extension available. Atlas Local works for development and integration tests.
Do NOT use vector search for:
- Exact / substring matching: use Parse's normal query operators or Atlas $search text indexes (see Atlas Search docs).
- Tiny corpora (< a few hundred docs) where a brute-force cosine in application code would be cheaper than maintaining an index.
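For the tiny-corpus case, a brute-force scan might look like the sketch below. It is plain Ruby for illustration only, not part of Parse Stack; cosine_similarity and brute_force_top_k are hypothetical helper names, and the corpus is assumed to fit in memory as a Hash of id to vector.

```ruby
# Illustrative only: rank an in-memory corpus by cosine similarity.
def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# corpus: Hash of id => Array<Float>. Returns the k best [id, score] pairs,
# highest score first.
def brute_force_top_k(query, corpus, k)
  corpus
    .map { |id, vec| [id, cosine_similarity(query, vec)] }
    .max_by(k) { |_, score| score }
end

corpus = { "a" => [1.0, 0.0], "b" => [0.0, 1.0], "c" => [0.7, 0.7] }
top = brute_force_top_k([1.0, 0.1], corpus, 2)
```

Every query touches every vector, so the cost grows linearly with corpus size; that is the point at which an index starts to pay for itself.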
Declaring a :vector property
:vector is a first-class Parse property type. The declaration
captures the vector's width, the provider that produces it, the model
name, and the similarity function the Atlas index will use.
class Document < Parse::Object
  property :title, :string
  property :body, :string
  property :body_embedding, :vector,
    dimensions: 1536,
    provider: :openai,
    model: "text-embedding-3-small",
    similarity: :cosine
end
- dimensions: (required) fixed output width. Must match what the registered provider returns and what the Atlas vectorSearch index declares. Mismatches raise Parse::Embeddings::InvalidResponseError on write or Parse::VectorSearch::InvalidQueryVector on read.
- provider: name registered via Parse::Embeddings.register or Parse::Embeddings.configure. Required for the embed macro and for the find_similar(text:) overload; optional if you only ever pass pre-computed vector: Arrays.
- model: stable identifier, persisted to embedding_meta and used in cache keys. Changing this on an existing field is a migration; see the §Re-embedding section below.
- similarity: one of :cosine, :dotProduct, :euclidean. Determines how the Atlas index ranks. Pick :cosine for normalized text embeddings; :dotProduct for raw OpenAI/Cohere output if you want to skip the unit-normalize step.
Storage shape
:vector properties serialize as plain BSON arrays of floats. There
is no Parse-side wrapper class on the wire. In memory they are
Parse::Vector instances which respond to to_a, dimensions, and
arithmetic helpers.
Constraint refusal
Vector fields are NOT general-purpose query targets. The query builder
refuses every operator on :vector columns except :exists and
:null. Attempting where(body_embedding: <Array>) or
where(:body_embedding.gt => 0.5) raises at query build time — semantic
similarity must go through find_similar, not the normal where DSL.
Body builder compaction
When Parse::Object#inspect or the request logger has to print a
record carrying a vector, the formatter replaces the array with a
compact <vector dims=N> placeholder once the length is ≥ 32. This
keeps multi-thousand-dim arrays out of error trackers and stack
traces. The wire payload itself is unchanged.
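The compaction rule can be pictured with a small stand-alone formatter. This is a sketch under assumptions, not the SDK's formatter: compact_for_log is a hypothetical name, and treating "all elements are Numeric" as the test for a vector is a guess made for the example.

```ruby
# Illustrative only: swap long numeric arrays for a placeholder before logging.
VECTOR_COMPACT_THRESHOLD = 32 # the guide's "length >= 32" rule

def compact_for_log(value)
  case value
  when Hash
    value.transform_values { |v| compact_for_log(v) }
  when Array
    if value.length >= VECTOR_COMPACT_THRESHOLD && value.all?(Numeric)
      "<vector dims=#{value.length}>"
    else
      value.map { |v| compact_for_log(v) }
    end
  else
    value
  end
end

record = { "title" => "hello", "body_embedding" => Array.new(1536, 0.1) }
printable = compact_for_log(record)
```

The input Hash is left untouched; only the copy handed to the logger is compacted, which matches the note that the wire payload does not change.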
Registering an embedding provider
Parse::Embeddings is a pluggable registry. v5.1 ships seven built-in
providers:
- Parse::Embeddings::OpenAI: text-only. text-embedding-3-small (1536-dim, default), text-embedding-3-large (3072-dim, Matryoshka via dimensions:), legacy text-embedding-ada-002. Forwards OpenAI-Organization / OpenAI-Project headers when supplied.
- Parse::Embeddings::Cohere: v3 family (embed-english-v3.0, embed-multilingual-v3.0, and -light-v3.0 siblings; 1024 / 384 dim) plus embed-v4.0 (1536 native, 128k token context, Matryoshka-truncatable to 512, 1024, 1536 via dimensions:). embed-v4.0 is Cohere's text+image multimodal endpoint; the text path routes through /v1/embed and the v5.1 image path routes through /v2/embed with OpenAI-style nested { type: "image_url", image_url: { url: ... } } content rows.
- Parse::Embeddings::Voyage: voyage-4 family (voyage-4-large 2048, Matryoshka; voyage-4 1024; voyage-4-lite 512; voyage-4-nano 256), voyage-3 family, domain models (voyage-code-3, voyage-finance-2, voyage-law-2), and voyage-multimodal-3 (1024-dim, 32k token context, routes to /v1/multimodalembeddings with the wrapped {inputs: [{content: [{type: "text", text: ...}]}]} envelope for text and {type: "image_url", image_url: <url>} content rows for the v5.1 embed_image path).
- Parse::Embeddings::Jina: jina-embeddings-v3 (1024, Matryoshka 32–1024), jina-embeddings-v4 (2048, Matryoshka), v5 family (jina-embeddings-v5-text-{small,nano}, jina-embeddings-v5-omni-{small,nano}; omni accepts plain-text here), and jina-code-embeddings-{0.5b,1.5b}. Distinguishes input_type: via Jina's task field (retrieval.query / retrieval.passage / classification / separation). Rerankers and image-only models are out of scope.
- Parse::Embeddings::Qwen: qwen3-embedding-0.6b (1024), qwen3-embedding-4b (2560), qwen3-embedding-8b (4096), all Matryoshka. Targets Alibaba Cloud DashScope's OpenAI-compatible endpoint; operators in mainland China override base_url: to https://dashscope.aliyuncs.com/compatible-mode/v1. Same checkpoints are open-weight on Hugging Face (Apache 2.0); self-host with LocalHTTP.
- Parse::Embeddings::LocalHTTP: generic OpenAI-compatible client for self-hosted gateways (Ollama, LM Studio, vLLM, Text Embeddings Inference, llama.cpp). Configure-time SSRF gate refuses loopback / RFC1918 / link-local / cloud-metadata bases unless opted in with allow_private_endpoint: true (emits a Kernel#warn audit line).
- Parse::Embeddings::Fixture: deterministic, zero-network. Used by the test suite. Auto-registered under :fixture, no setup required.
Production: OpenAI
Parse::Embeddings.configure do |c|
  c.providers[:openai] = Parse::Embeddings::OpenAI.new(
    api_key: ENV.fetch("OPENAI_API_KEY"),
    model: "text-embedding-3-small",
  )
end
The OpenAI provider self-bounds at 30 s read / 5 s connect with
capped exponential retry on 429 and 5xx. There is no implicit
wall-clock deadline imposed by find_similar or by the embed
macro — the provider is responsible for bounding its own request
time. Custom providers MUST follow the same convention.
Tests: Fixture
provider = Parse::Embeddings.provider(:fixture) # zero-config
vec = provider.(["hello"]).first # deterministic
Vectors are derived from SHA-256 over (model_name, input_type, input)
and unit-normalized. Same input always yields the same vector;
:search_query and :search_document yield different vectors for the
same string, so cache-key bugs and input-type confusion in higher
layers surface in tests rather than only against real providers in
production.
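To make the derivation concrete, here is one way a deterministic, unit-normalized vector could be produced from a SHA-256 digest. This is a sketch, not the Fixture provider's actual algorithm: fixture_vector is a hypothetical name, and the byte-to-float mapping, the counter used to stretch the digest, and the NUL separator are all assumptions.

```ruby
require "digest"

# Illustrative only: hash (model_name, input_type, input) into a unit vector.
def fixture_vector(model_name, input_type, input, dimensions: 8)
  raw = []
  counter = 0
  while raw.length < dimensions
    seed = [model_name, input_type, input, counter].join("\u0000")
    # Map each digest byte (0..255) onto a float in [-1.0, 1.0].
    Digest::SHA256.digest(seed).each_byte { |b| raw << ((b / 127.5) - 1.0) }
    counter += 1
  end
  raw = raw.first(dimensions)
  norm = Math.sqrt(raw.sum { |x| x * x })
  norm.zero? ? raw : raw.map { |x| x / norm }
end

query_vec = fixture_vector("fixture-model", :search_query, "hello")
doc_vec   = fixture_vector("fixture-model", :search_document, "hello")
```

Because input_type is part of the hashed seed, the same string embeds differently as a query and as a document, which is what lets input-type mix-ups show up in tests.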
Custom providers
Subclass Parse::Embeddings::Provider and override embed_text,
dimensions, and model_name. Call instrument_embed(input_count,
input_type) { ... } inside embed_text to emit the standard AS::N
event (see §Telemetry below). Always call validate_response! before
returning so off-by-one batches and NaN/±Inf poisoning surface as
typed InvalidResponseError at the provider boundary, not deep inside
a later $vectorSearch call.
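The kind of check validate_response! is described as performing can be sketched as follows. This is a stand-alone illustration, not the library method: check_embedding_batch! and SketchInvalidResponseError are hypothetical names, and the exact conditions the real validator enforces are assumed from the paragraph above.

```ruby
# Hypothetical stand-in for the library's typed error.
class SketchInvalidResponseError < StandardError; end

# Illustrative only: one vector per input, fixed width, finite numbers only.
def check_embedding_batch!(vectors, expected_count:, expected_dimensions:)
  unless vectors.is_a?(Array) && vectors.length == expected_count
    raise SketchInvalidResponseError, "expected #{expected_count} vectors"
  end

  vectors.each_with_index do |vec, i|
    unless vec.is_a?(Array) && vec.length == expected_dimensions
      raise SketchInvalidResponseError,
            "vector #{i}: expected #{expected_dimensions} dimensions"
    end
    unless vec.all? { |x| x.is_a?(Numeric) && x.to_f.finite? }
      raise SketchInvalidResponseError,
            "vector #{i}: contains NaN, Infinity, or a non-number"
    end
  end
  vectors
end

ok = check_embedding_batch!([[0.1, 0.2]], expected_count: 1, expected_dimensions: 2)
```

Failing here, at the provider boundary, keeps a malformed batch from being written and later surfacing as an opaque server-side error.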
Creating the Atlas vectorSearch index
find_similar requires a deployed Atlas vectorSearch index covering
the target field. Create one via Parse::AtlasSearch::IndexCatalog:
Parse::AtlasSearch::IndexCatalog.create_index(
  "Document",          # Parse class / collection name
  "body_embedding_v1", # index name (your choice)
  {
    type: "vectorSearch",
    fields: [
      {
        type: "vector",
        path: "body_embedding",
        numDimensions: 1536,
        similarity: "cosine",
      },
      # Optional: filter fields for pre-search $match acceleration.
      { type: "filter", path: "tag" },
      { type: "filter", path: "_rperm" },
    ],
  },
)
Including _rperm as a filter field lets the per-row ACL match
short-circuit at the index level — strongly recommended for any
field that ACL-scoped agents will search against.
Index creation runs asynchronously. Use wait_for_ready to block
until the index is queryable:
Parse::AtlasSearch::IndexCatalog.wait_for_ready(
  "Document", "body_embedding_v1", timeout: 600,
)
# => :ready | :failed | :timeout
Auto-discovery: when find_similar is called without an explicit
index: kwarg, the catalog scans the collection's vectorSearch
indexes for one whose definition covers the requested path. The
first match wins; pass index: explicitly when you have more than
one covering index and want a specific one.
Running similarity queries: find_similar
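# Pre-computed vector (query_vector stands for an Array<Float> or
# Parse::Vector of the declared width that you already hold)
hits = Document.find_similar(vector: query_vector, k: 10)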
# Auto-embed query text using the field's declared provider
hits = Document.find_similar(text: "ruby parse stack", k: 10)
hits.first.vector_score # => Float, Atlas vectorSearchScore
hits.first.title # => String, normal Parse attribute
Full kwarg surface:
- vector: Array<Float> or Parse::Vector. Mutually exclusive with text:.
- text: String. Embedded with input_type: :search_query using the field's declared provider:. Capped at 256 KiB; chunk client-side before calling if larger.
- k: number of hits to return (default 10).
- field: explicit :vector property. Auto-resolves when the class has exactly one; required when multiple are declared.
- filter: post-$vectorSearch $match. Use for ordinary Parse-side filtering (e.g. { status: "published" }).
- vector_filter: Atlas-native pre-search filter. Fields must be declared type: "filter" in the index. Faster than filter: when the field is filter-indexed.
- index: explicit vectorSearch index name. Skips auto-discovery.
- num_candidates: HNSW search width hint. Higher = better recall, slower. Default ~10×k.
- max_time_ms: server-side timeout; translates to Parse::MongoDB::ExecutionTimeout on cancel.
- raw: when true, return raw BSON::Document hashes (each carries _vscore). When false (default), build Parse::Object instances.
- session_token: / master: / acl_user: / acl_role: are scope kwargs forwarded to the underlying Parse::MongoDB.aggregate so the 5-layer enforcement (denylist, ACL _rperm match, CLP, protectedFields, master-key escape) runs against the result rows.
Dimension validation
find_similar compares the query vector's length to the property's
declared dimensions: before sending the pipeline. A mismatch raises
Parse::VectorSearch::InvalidQueryVector locally, before Atlas sees
it — callers get "expected 1536, got 768" instead of a server-side
error after a round-trip.
Index drift verification (v5.5)
On the first auto-discovered use of a vectorSearch index per
(class, field, index) per process, the SDK compares the deployed
index's latestDefinition against the model declaration:
- numDimensions vs the property's declared dimensions:. A mismatch means every query will be rejected or return nonsense (usually an index that predates a model change).
- similarity vs the property's declared similarity: (checked only when both sides declare one).
- When the class registers an agent_tenant_scope, the scope field must appear among the index's type: "filter" paths; without it, every tenant-scoped $vectorSearch.filter fails Atlas-side at query time.
Findings are computed once per (class, field, index) per process and
governed by Parse::VectorSearch.index_drift_policy:
Parse::VectorSearch.index_drift_policy = :warn # default — [Parse::VectorSearch:DRIFT] warning on first check
Parse::VectorSearch.index_drift_policy = :raise # IndexDriftError on EVERY query against a drifted index
Parse::VectorSearch.index_drift_policy = :ignore # skip verification
Under :raise the cached findings keep raising — strict mode means a
drifted index never serves results, not "fails once, then passes".
Auto-discovery verification costs no extra round-trip (the definition
is already in hand from index discovery). An explicit index: kwarg
is verified best-effort: when the catalog's covering index for the
field carries the same name, its definition is checked too; catalog
lookup failures never fail the query.
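The three comparisons listed above can be sketched as a pure function over a deployed index definition. This is an illustration, not the SDK's verifier: index_drift_findings is a hypothetical name, the declaration is modelled as a plain Hash, and the index definition is assumed to use string keys in the same shape as the create_index example earlier in this guide.

```ruby
# Illustrative only: compare a model declaration against an index definition.
# declared: { path:, dimensions:, similarity: }.
def index_drift_findings(declared, index_definition, tenant_scope_field: nil)
  fields = index_definition.fetch("fields", [])
  vector = fields.find { |f| f["type"] == "vector" && f["path"] == declared[:path] }
  return ["index does not cover #{declared[:path]}"] unless vector

  findings = []
  if vector["numDimensions"] != declared[:dimensions]
    findings << "numDimensions #{vector['numDimensions']} != declared #{declared[:dimensions]}"
  end
  # Similarity is compared only when both sides declare one.
  if declared[:similarity] && vector["similarity"] &&
     vector["similarity"] != declared[:similarity].to_s
    findings << "similarity #{vector['similarity']} != declared #{declared[:similarity]}"
  end
  if tenant_scope_field
    filter_paths = fields.select { |f| f["type"] == "filter" }.map { |f| f["path"] }
    unless filter_paths.include?(tenant_scope_field.to_s)
      findings << "tenant scope field #{tenant_scope_field} is not a filter path"
    end
  end
  findings
end

declared = { path: "body_embedding", dimensions: 1536, similarity: :cosine }
deployed = {
  "fields" => [
    { "type" => "vector", "path" => "body_embedding",
      "numDimensions" => 768, "similarity" => "cosine" },
    { "type" => "filter", "path" => "tag" },
  ],
}
findings = index_drift_findings(declared, deployed, tenant_scope_field: :tenant_id)
```

An empty result means no drift; what happens with a non-empty one (warn, raise, ignore) is the policy decision described above.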
Query-embed caching and spend caps (v5.5)
Every text:-overload query funnels through one embed path
(find_similar(text:), hybrid_search(text:),
Parse::Retrieval.retrieve all share it), which gives two controls:
# Opt-in query-embed cache: repeated identical queries skip the
# provider round-trip. Keyed by (provider, model, dimensions,
# input_type, SHA-256(input)) — plaintext never lands in the store.
Parse::Embeddings::Cache.enable!(max_entries: 2048, ttl: 600)
Parse::Embeddings::Cache.stats # => { enabled:, hits:, misses:, size: }
# Per-tenant spend cap now covers DIRECT callers too, not just the
# semantic_search agent tool. Tenant identity resolves to the ambient
# Parse.with_cache_tenant scope when set, else a shared default bucket.
# warn_at: adds a soft cap — crossing 80% of the limit emits a
# parse.embeddings.spend_cap_warning AS::N event (alert, never refuse).
Parse::Embeddings::SpendCap.configure(limit_tokens: 1_000_000, window: 3600,
                                      warn_at: 0.8)
Parse.with_cache_tenant("tenant_abc") do
  Document.find_similar(text: query) # charged against tenant_abc
end
Cache hits emit the standard parse.embeddings.embed notification
with cached: true, so existing spend subscribers see hits and misses
on one stream. The cache is in-process by default; for a persistent
layer shared across processes, wrap any Moneta-compatible backend in
the bundled adapter:
moneta = Moneta.new(:Redis, url: ENV["REDIS_URL"])
Parse::Embeddings::Cache.enable!(
  store: Parse::Embeddings::Cache::MonetaStore.new(moneta, ttl: 30 * 24 * 3600),
)
MonetaStore namespaces keys, forwards TTL via Moneta's expires:,
and fails open (a backend error is a cache miss, never a failed
embed). Keys are input hashes — plaintext queries never land in the
shared store; the VALUES are embeddings, so give the store the same
access controls as the database. A query the agent tool already
charged per-tenant is not double-billed (SpendCap.with_precharged
wraps the tool's retrieval).
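The "plaintext never lands in the store" property follows from how the key is composed. A minimal sketch, assuming the five components named above are simply joined with a colon (embed_cache_key and the join format are hypothetical, not the library's wire format):

```ruby
require "digest"

# Illustrative only: build a cache key that carries a hash of the input,
# never the input itself.
def embed_cache_key(provider:, model:, dimensions:, input_type:, input:)
  [provider, model, dimensions, input_type, Digest::SHA256.hexdigest(input)].join(":")
end

key = embed_cache_key(
  provider: :openai, model: "text-embedding-3-small", dimensions: 1536,
  input_type: :search_query, input: "confidential customer question",
)
```

Including model and dimensions in the key means a model or width change naturally misses the old entries instead of serving vectors from the previous configuration.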
ACL/CLP inheritance
Vector search routes through Parse::MongoDB.aggregate. Every layer
documented in mongodb_direct_guide.md §Security
applies to vector search result rows too:
- Pipeline-security denylist (always on).
- Row-level ACL _rperm match: scoped agents only.
- CLP read enforcement: scoped agents only.
- protectedFields stripping: scoped agents only.
- Master-key escape hatch.
REST /aggregate is NOT a valid path for vector search with a
scoped caller. Parse Server's REST aggregate endpoint is
master-key-only and would bypass every per-row ACL and CLP check. The
built-in agent tools auto-promote mongo_direct: false to true for any
agent carrying session_token, acl_user, acl_role, or a
non-master scope so this enforcement always runs.
Managing embeddings on write: embed macro
The embed class macro declares which source fields feed a managed
vector. The embedding is recomputed automatically on save whenever
the source fields change.
class Document < Parse::Object
  property :title, :string
  property :body, :string
  property :body_embedding, :vector, dimensions: 1536, provider: :openai
  embed :title, :body, into: :body_embedding
end
doc = Document.new(title: "hello", body: "world")
doc.save # provider :openai called once; body_embedding populated
doc.body = "updated body"
doc.save # provider called again; new embedding written
doc.save # no source field changed → zero provider calls
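The three outcomes in the example above (embed, re-embed, skip) come down to a digest comparison, detailed under Mechanics. The sketch below models that decision in isolation; it is not the library's before_save callback, and source_text and embed_decision are hypothetical names.

```ruby
require "digest"

# Illustrative only: join non-blank sources with a blank line between them.
def source_text(values)
  values
    .reject { |v| v.nil? || v.to_s.strip.empty? }
    .map(&:to_s)
    .join("\n\n")
end

# Returns :clear (no source text), :skip (digest unchanged and a vector is
# already stored), or :recompute (call the provider).
def embed_decision(values, stored_digest:, stored_vector:)
  text = source_text(values)
  return :clear if text.empty?

  digest = Digest::SHA256.hexdigest(text)
  return :skip if digest == stored_digest && !stored_vector.nil?

  :recompute
end

first_save = embed_decision(["hello", "world"], stored_digest: nil, stored_vector: nil)
```

Requiring a non-nil vector in addition to a matching digest matters: a row whose vector was cleared but whose digest survived would otherwise never be re-embedded.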
Mechanics:
- A <into>_digest :string sibling field is auto-declared (override with digest_field:). The before_save callback computes SHA-256 over the concatenated source text; if it matches the stored digest AND the target vector is non-nil, the callback returns without contacting the provider.
- The target :vector property is write-protected. Direct assignment (doc.body_embedding = some_vector) raises ProtectedFieldError. The guard lifts only inside the managed write path. This prevents silent desync between the stored vector and the digest.
- Source fields are concatenated with "\n\n", nil and blank values skipped. If every source is blank, the target and digest are both cleared on save.
Single vector per record
embed produces exactly one vector per record. Long source text whose
concatenation exceeds the provider's per-call token budget is truncated
provider-side, and the stored vector represents only the leading portion
of the document. Chunking happens at retrieval time, not embed time
(see Retrieval (RAG) below): the embedding stays
one-vector-per-record by design.
If you instead want each passage to have its OWN embedding (true embed-time chunking), use one of these patterns:
- Pre-chunk client-side and write each chunk as its own Parse::Object record with its own embed declaration.
- Dedicated chunk subclass that belongs_to the parent, with embed :content, into: :embedding on the chunk class itself. Run similarity search against the chunk collection, then hydrate parents as needed.
Retrieval (RAG)
For an end-to-end runnable script (managed embed, agent_searchable, semantic_search, and an OpenAI/Anthropic generation add-in), see examples/rag_chatbot.rb.
Parse::Retrieval (Parse::RAG is an alias) sits on top of
find_similar. Parse::Retrieval.retrieve embeds a natural-language
query, runs Atlas $vectorSearch through find_similar (so ACL/CLP are
enforced mongo-direct — there is no REST two-stage re-query), and splits
each retrieved document's text field into scored, citable chunks.
Chunking here is presentation-only: every chunk inherits its parent
document's single $vectorSearch score.
chunks = Parse::Retrieval.retrieve(
  query: "how do I reset my password?",
  klass: KnowledgeArticle,           # or "KnowledgeArticle"
  field: :embedding,                 # optional; auto-resolves a single :vector field
  k: 5,
  filter: { published: true },       # post-$vectorSearch $match
  vector_filter: nil,                # Atlas-native pre-filter (fields must be type:"filter")
  tenant_scope: nil,                 # { field:, value: } merged into vector_filter
  score_quantize: false,
  session_token: user.session_token, # ACL scope kwargs pass through to find_similar
)
# => Array<Parse::Retrieval::Chunk> — { id, score, content, source, metadata }
retrieve also accepts hybrid: (fuse a lexical branch with the vector
branch — see Hybrid search below) and
rerank: (reorder retrieved documents with a cross-encoder before
chunking — see Reranking). Both were reserved in earlier
releases and now ship in 5.4.0.
Pointer values in filters translate automatically (v5.5). A filter
like { owner: some_user } (a Parse::Pointer / Parse::Object, or a
wire-form {"__type" => "Pointer", ...} hash — including inside $in
/ $eq / $ne operator hashes) is rewritten to its MongoDB storage
form { "_p_owner" => "_User$abc123" } before the $match /
$vectorSearch.filter is built, so pointer filters match rows instead
of silently matching nothing. Translation runs after the
underscore-key gate (callers still cannot name _p_* columns
directly) and before the tenant-scope fold; the semantic_search
agent tool inherits it. For vector_filter: use, the pointer column
(_p_owner) must be declared type: "filter" in the index.
Hybrid search (vector + lexical)
Class.hybrid_search runs a lexical Atlas Search ($search) branch and a
$vectorSearch branch as two independent aggregations, then fuses
their ranked results with reciprocal-rank fusion (RRF). Two aggregations
(not a single $facet) is mandatory: $vectorSearch is prohibited inside
$facet / $lookup / $unionWith and must be stage 0 of its pipeline.
Each branch enforces ACL/CLP/protectedFields independently before
fusion (via Parse::AtlasSearch.search and Parse::VectorSearch.search),
so the fused rows are already access-filtered — there is no separate
hydration fetch.
hits = Article.hybrid_search(
  text: "how do I reset my password",   # embedded for the vector branch;
                                        # also the default lexical query
  lexical: { index: "article_search", fields: %w[title body] },
  vector: { index: "article_embedding_idx", num_candidates: 200 },
  k: 20,
  fusion: { k_constant: 60, weights: { lexical: 0.4, vector: 0.6 } },
  session_token: user.session_token,    # ACL scope, applied to BOTH branches
)
# => Array<Parse::Object>; each carries #hybrid_score, #hybrid_ranks,
# and #vector_score / #search_score when that branch contributed.
RRF math. fused_score(d) = Σ_b weight_b / (k_constant + rank_b(d)),
where rank_b(d) is the document's 1-based rank in branch b. A larger
k_constant (default 60) flattens the contribution curve. weights
defaults to 1.0 per branch. Parse::VectorSearch::Hybrid.rrf exposes the
pure fusion if you want to fuse pre-fetched ranked lists yourself.
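The formula is small enough to write out directly. The sketch below is a stand-alone implementation of weighted RRF for illustration; rrf_fuse is a hypothetical name and its signature is not claimed to match Parse::VectorSearch::Hybrid.rrf. Ties are broken by id here, which is an arbitrary choice made to keep the output deterministic.

```ruby
# Illustrative only: weighted reciprocal-rank fusion.
# branches: Hash of branch name => Array of ids, best first.
def rrf_fuse(branches, k_constant: 60, weights: {})
  scores = Hash.new(0.0)
  branches.each do |name, ranked_ids|
    weight = weights.fetch(name, 1.0).to_f
    ranked_ids.each_with_index do |id, index|
      rank = index + 1 # ranks are 1-based
      scores[id] += weight / (k_constant + rank)
    end
  end
  scores.sort_by { |id, score| [-score, id] }
end

fused = rrf_fuse(
  { lexical: %w[a b c], vector: %w[b a d] },
  weights: { lexical: 0.4, vector: 0.6 },
)
```

With these inputs, b edges out a because its first place is in the more heavily weighted vector branch, and d (vector-only, rank 3) outranks c (lexical-only, rank 3) for the same reason.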
Native $rankFusion (Atlas 8.0+).
Parse::VectorSearch::Hybrid.rank_fusion_supported?(collection) detects
the native server-side fusion stage via a cached behavioural probe (1-hour
TTL — not version-string parsing). Native execution is opt-in
(fusion: { method: :rrf_native }) and falls back to the client-side path
when the cluster does not support it; the default :rrf always fuses
client-side, which is the fully-enforced, deterministic path. $rankFusion
is admitted to PipelineSecurity::ALLOWED_STAGES for the native path.
Parse::Retrieval.retrieve(hybrid: true, ...) routes through
hybrid_search and chunks the fused results; pass hybrid: { lexical:,
vector:, fusion: } to configure the branches. Tenant scope is folded into
both branches (the vector Atlas pre-filter and the lexical
post-$search $match) so neither leaks cross-tenant document existence.
Reranking
A reranker reorders retrieved documents by a cross-encoder relevance score
before chunking. Pass any object answering
#rerank(query:, documents:, top_n:) — typically a
Parse::Retrieval::Reranker::Base subclass:
reranker = Parse::Retrieval::Reranker::Cohere.new(
  api_key: ENV.fetch("COHERE_API_KEY"), model: "rerank-v3.5",
)
chunks = Parse::Retrieval.retrieve(
  query: "reset my password", klass: Article, k: 30,
  rerank: reranker, rerank_top_n: 5, # keep the 5 most relevant docs
)
# Reranked chunks' score is the cross-encoder relevance_score.
Reranker::Fixture is a deterministic, zero-network reranker (lexical
token overlap) for tests. The Reranker::Base protocol validates inputs,
bounds top_n, rejects out-of-range indices, and sorts descending —
adapters implement only the network call (#rerank_scores).
Spend cap. The semantic_search agent tool charges the estimated query-embedding tokens against the caller's tenant budget via Parse::Embeddings::SpendCap (opt-in; configure(limit_tokens:, window:)). A breach hard-refuses (surfaced to the agent as a rate-limited tool error). Admin agents are exempt. Before v5.5, direct find_similar / retrieve callers were not metered; as of v5.5 the cap covers them too (see Query-embed caching and spend caps above).
Chunkers
The default is a fixed-size sliding window with overlap. Subclass
Parse::Retrieval::Chunker::Base (implement #chunk(text) -> Array<String>)
for semantic / sentence-aware strategies.
Parse::Retrieval::Chunker::FixedSizeOverlap.new(
  size: 800,                    # window width
  overlap: 100,                 # units shared between consecutive windows (must be < size)
  by: :chars,                   # :chars (default) or :tokens (whitespace tokens)
  max_chunks_per_document: 200, # amplification cap: TRUNCATES with a signal, never raises
)
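The sliding-window behaviour those options describe can be sketched in plain Ruby. This is an illustration of the technique, not the FixedSizeOverlap source: fixed_size_overlap is a hypothetical function, and returning a [chunks, truncated] pair is just one assumed way of surfacing the truncation signal.

```ruby
# Illustrative only: fixed-size windows that share `overlap` units.
def fixed_size_overlap(text, size: 800, overlap: 100, by: :chars, max_chunks: 200)
  raise ArgumentError, "overlap must be < size" unless overlap < size

  units  = by == :tokens ? text.split : text.chars
  joiner = by == :tokens ? " " : ""
  step   = size - overlap

  chunks = []
  start = 0
  while start < units.length
    chunks << units[start, size].join(joiner)
    break if start + size >= units.length # this window reached the end

    start += step
  end

  truncated = chunks.length > max_chunks
  [chunks.first(max_chunks), truncated]
end

chunks, truncated = fixed_size_overlap("abcdefghij", size: 4, overlap: 1)
# chunks => ["abcd", "defg", "ghij"], truncated => false
```

Each window starts size - overlap units after the previous one, so the last overlap units of one chunk reappear at the head of the next; that keeps a sentence split by a window boundary intact in at least one chunk.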
agent_searchable + the semantic_search agent tool
Opt a model in to agentic retrieval, declaring the vector field and the fields an agent may filter on:
class KnowledgeArticle < Parse::Object
  property :title, :string
  property :body, :string
  property :embedding, :vector, dimensions: 1536, provider: :openai
  embed :title, :body, into: :embedding
  agent_searchable field: :embedding, filter_fields: %i[published category]
end
Every property referenced by embed must be declared — omitting
property :title here raises InvalidEmbedDeclaration at class load.
Because this model embeds two text sources (:title and :body),
semantic_search cannot guess which one to chunk and return as the
result content. Pass text_field: to choose (it must name one of the
embedded sources); a single-source model infers it automatically and the
parameter is optional:
# via the agent tool (LLM-facing parameter)
semantic_search(class_name: "KnowledgeArticle", query: "vector indexes",
text_field: "body")
# or directly
Parse::Retrieval.retrieve(query: "vector indexes", klass: KnowledgeArticle,
text_field: :body)
The readonly, client_safe semantic_search tool then routes through
Parse::Retrieval.retrieve with the full agent security envelope:
searchable-class allowlist (MetadataRegistry.resolve_searchable!),
recursive underscore-key refusal + filter-field allowlist on caller
input, tenant scope folded into the Atlas pre-filter AND re-asserted on
every returned record, field_allowlist projection of each source, and
score quantization in non-admin contexts. In a tenant-aware deployment
(any class declares agent_tenant_scope), a searchable class without its
own tenant scope is refused at dispatch. See the
MCP guide for the agent-side wiring.
Result shape (token-economy). The tool returns
{ chunks:, documents:, count: }. Each chunk's parent record is hoisted
once into documents (keyed by objectId) rather than duplicated on
every chunk — map a chunk to its source via metadata.object_id. A
max_total_tokens: budget (default 20,000; estimated chars/4) trims the
lowest-ranked chunks so a few long documents can't silently blow the
context window, adding budget_truncated: true / budget_dropped: <n>
when it trims (pass 0 to disable). The library-level
Parse::Retrieval.retrieve still returns the flat Array<Chunk> with
source on each chunk — the dedup and budget live in the agent tool's
envelope. See the MCP guide's Token Economy section.
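The budget trim can be pictured as a greedy pass over chunks in rank order. The sketch below is an assumption-laden illustration, not the agent tool's code: trim_to_budget is a hypothetical name, chunks are modelled as plain Hashes with :score and :content, and rounding the chars/4 estimate up is a guess.

```ruby
# Illustrative only: keep the highest-ranked chunks that fit the token budget.
def trim_to_budget(chunks, max_total_tokens: 20_000)
  # A budget of 0 disables trimming.
  return { chunks: chunks, count: chunks.length } if max_total_tokens.zero?

  kept = []
  used = 0
  chunks.sort_by { |c| -c[:score] }.each do |chunk|
    cost = (chunk[:content].length / 4.0).ceil # estimated tokens = chars / 4
    break if used + cost > max_total_tokens

    kept << chunk
    used += cost
  end

  result = { chunks: kept, count: kept.length }
  dropped = chunks.length - kept.length
  if dropped.positive?
    result[:budget_truncated] = true
    result[:budget_dropped] = dropped
  end
  result
end

sample = [
  { score: 0.9, content: "a" * 40 },
  { score: 0.8, content: "b" * 40 },
  { score: 0.7, content: "c" * 40 },
]
trimmed = trim_to_budget(sample, max_total_tokens: 25)
```

Stopping at the first chunk that does not fit (rather than skipping it and trying smaller ones) keeps the result a strict prefix of the ranking, so a dropped chunk is never more relevant than a kept one.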
Image embedding: embed_image macro (v5.1 URL mode, v5.5 bytes mode)
embed_image is the image-source counterpart to embed. The source
property must be :file-typed; the target must be a :vector property
whose declared provider: supports multimodal input (currently
:voyage with voyage-multimodal-3, or :cohere with embed-v4.0).
Two fetch modes, selected per declaration with source::
- source: :url (default): the SDK validates the file's URL and forwards it; the provider performs the fetch from its own network. Requires the trust_provider_url_fetch sentinel (see operator setup below).
- source: :bytes (v5.5): the SDK downloads the image through Parse::File.safe_open_url, verifies the content by magic-byte sniff, strips EXIF/XMP metadata, and forwards the bytes to the provider as a base64 data URI. No provider-side URL fetch occurs, so the sentinel is NOT required; the allowed_image_hosts allowlist still is.
class Post < Parse::Object
  property :cover_image, :file
  property :cover_image_embedding, :vector,
    dimensions: 1024,
    provider: :voyage,
    model: "voyage-multimodal-3"
  embed_image :cover_image, into: :cover_image_embedding
end
Operator setup (required before any save)
Image embedding hands an attacker-influenced URL (a user-uploaded
Parse::File, a chat message, an agent tool argument) to a third-party
provider that will issue an HTTP request from its own network. The
provider's fetch happens after SDK-side validation, so DNS rebinding
and redirect-following are residual risks the SDK cannot eliminate.
The setup must happen in this exact order — skipping (1) or (2) raises a typed error at save time with a message naming the missing prerequisite:
# (1) Declare which CDNs the validator will accept. Empty allowlist
#     denies every host, the opposite of Parse::File.allowed_remote_hosts.
Parse::Embeddings.allowed_image_hosts = [
  ".cloudfront.net",   # suffix match (leading ".")
  "files.example.com", # exact match
]
# (2) Sentinel-gated opt-in. Only the exact frozen String unlocks;
#     `true`, `"true"`, `1`, or any other value raises
#     Parse::Embeddings::ConfirmationRequired.
Parse::Embeddings.trust_provider_url_fetch = "PROVIDER_EGRESS_VERIFIED"
# (3) Declare embed_image on the model.
class Post < Parse::Object
  embed_image :cover_image, into: :cover_image_embedding
end
URL validator (Parse::Embeddings.validate_image_url!)
Every embed_image save path routes through
Parse::Embeddings.validate_image_url!(url, allow_insecure:), which
runs layered cheap-first checks: sentinel set, https:// (or
http:// with allow_insecure: true), no userinfo, host not an
obfuscated-IP form (0x7f.0.0.1, 127.1, 2130706433), host in the
allowlist, port in Parse::File.allowed_remote_ports, host resolves
only to public addresses (delegated to
Parse::File.assert_host_allowed! so the SSRF mechanism is shared
with Parse::File rather than duplicated). Failures raise
Parse::Embeddings::InvalidImageURL with a :reason Symbol
(:scheme, :port, :userinfo, :host_blocked,
:host_not_allowlisted, :parse).
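The cheap, network-free layers of that pipeline can be sketched with the standard library. This is an illustration only, not validate_image_url!: image_url_rejection and its helpers are hypothetical, the sentinel, port, and DNS-resolution layers are omitted, and mapping an obfuscated-IP host to :host_blocked is an assumption about which reason tag applies.

```ruby
require "uri"

# Leading "." means suffix match; anything else is an exact match.
# An empty allowlist therefore denies every host.
def host_allowlisted?(host, allowlist)
  allowlist.any? do |entry|
    entry = entry.downcase
    entry.start_with?(".") ? host.end_with?(entry) : host == entry
  end
end

# Flags all-numeric hosts that are not a plain dotted quad of decimals,
# e.g. 0x7f.0.0.1, 127.1, 2130706433.
def obfuscated_ip_host?(host)
  labels = host.split(".")
  return false unless labels.all? { |l| l.match?(/\A(0x[0-9a-f]+|\d+)\z/i) }

  !(labels.length == 4 && labels.all? { |l| l.match?(/\A(0|[1-9]\d{0,2})\z/) })
end

# Returns nil when the URL passes these layers, else a reason Symbol.
def image_url_rejection(url, allowlist:, allow_insecure: false)
  uri = URI.parse(url)
  schemes = allow_insecure ? %w[https http] : %w[https]
  return :scheme unless schemes.include?(uri.scheme)
  return :userinfo if uri.userinfo

  host = uri.host.to_s.downcase
  return :host_blocked if host.empty? || obfuscated_ip_host?(host)
  return :host_not_allowlisted unless host_allowlisted?(host, allowlist)

  nil
rescue URI::InvalidURIError
  :parse
end

allowlist = [".cloudfront.net", "files.example.com"]
verdict = image_url_rejection("https://d111.cloudfront.net/cover.png", allowlist: allowlist)
```

Ordering the checks cheapest-first means a malformed or plainly hostile URL is rejected before any DNS lookup happens.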
Bytes mode (source: :bytes, v5.5)
# Operator setup: only the host allowlist is required (the sentinel
# applies to URL forwarding, not SDK-side fetches):
Parse::Embeddings.allowed_image_hosts = [".cloudfront.net"]
class Post < Parse::Object
  property :cover_image, :file
  property :cover_image_embedding, :vector,
    dimensions: 1024, provider: :voyage, model: "voyage-multimodal-3"
  embed_image :cover_image, into: :cover_image_embedding,
    source: :bytes # exif_strip: true is the default
end
What happens on each (digest-miss) save:
- The file URL is validated through Parse::Embeddings.validate_image_url!(url, mode: :fetch): the same host allowlist (deny-all when empty), obfuscated-IP screen, port allowlist, and CIDR resolution check as URL mode, minus the provider-egress sentinel.
- Parse::File.safe_open_url downloads the bytes: CIDR blocks, DNS-rebinding re-check, port allowlist, max_remote_size cap, timeouts. No parallel fetch mechanism exists.
- Magic-byte verification (Parse::Embeddings::ImageFetch): the MIME type is determined exclusively from the leading bytes (JPEG / PNG / GIF / WebP). The HTTP Content-Type header is never consulted. The sniffed type must be in Parse::Embeddings.allowed_image_types (default those four; SVG is deliberately excluded as script-capable active content), and when the URL carries a recognized image extension, the extension must AGREE with the magic bytes: a .jpg URL serving PNG bytes (or HTML) is refused as MIME laundering (ImageFetch::InvalidImageType, with a :reason tag).
- EXIF/XMP stripping, default ON. JPEG APP1 segments (Exif and XMP), PNG eXIf chunks, and WebP EXIF / XMP RIFF chunks (with the VP8X flag bits cleared) are removed before the bytes leave the process, because user photos commonly carry GPS coordinates and device serials. Opt out per declaration with exif_strip: false when orientation metadata must survive.
- The verified bytes ride to the provider as a base64 data URI (Voyage image_base64 content row; Cohere image_url data-URI form).
Direct provider calls accept the same shape:
provider.embed_image([Parse::Embeddings::ImageFetch.fetch!(url)]) —
FetchedImage sources and URL Strings may be mixed in one batch.
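The magic-byte step and the extension-agreement rule can be sketched as below. This is an illustration, not Parse::Embeddings::ImageFetch: sniff_image_type and verify_image! are hypothetical names, ArgumentError stands in for the library's typed error, and only the four default formats are recognized.

```ruby
# Signatures for the four default formats, matched against leading bytes.
MAGIC_SIGNATURES = {
  "image/jpeg" => ->(b) { b.start_with?("\xFF\xD8\xFF".b) },
  "image/png"  => ->(b) { b.start_with?("\x89PNG\r\n\x1A\n".b) },
  "image/gif"  => ->(b) { b.start_with?("GIF87a".b) || b.start_with?("GIF89a".b) },
  "image/webp" => ->(b) { b.byteslice(0, 4) == "RIFF".b && b.byteslice(8, 4) == "WEBP".b },
}.freeze

EXTENSION_TYPES = {
  ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg", ".png" => "image/png",
  ".gif" => "image/gif", ".webp" => "image/webp",
}.freeze

# Returns the MIME type implied by the leading bytes, or nil.
def sniff_image_type(bytes)
  data = bytes.b # compare as raw bytes regardless of the caller's encoding
  MAGIC_SIGNATURES.each { |mime, matcher| return mime if matcher.call(data) }
  nil
end

# Raises unless the bytes are an allowed image type and, when the URL path
# has a recognized extension, that extension agrees with the bytes.
def verify_image!(bytes, url_path:, allowed_types: MAGIC_SIGNATURES.keys)
  sniffed = sniff_image_type(bytes)
  unless sniffed && allowed_types.include?(sniffed)
    raise ArgumentError, "unrecognized or disallowed image type"
  end

  expected = EXTENSION_TYPES[File.extname(url_path).downcase]
  raise ArgumentError, "extension does not match content" if expected && expected != sniffed

  sniffed
end

png_bytes = "\x89PNG\r\n\x1A\n".b + "rest-of-file".b
mime = verify_image!(png_bytes, url_path: "/uploads/cover.png")
```

Nothing in this path reads a Content-Type header, so a server that mislabels its response gains nothing; the bytes alone decide the type.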
Save-side semantics
- Digest is the SHA-256 of the URL String, not the file bytes. Replacing the Parse::File with one pointing at a different URL re-embeds; resaving the same URL is a no-op (zero provider calls). Parse-managed file URLs are stable unless overwritten in place; if you PUT-replace bytes at the same URL (S3 without renaming), null the digest field to force re-embed.
- The same EmbedManaged write-guard applies: direct assignment to the managed vector raises ProtectedFieldError. The write path is the only way to populate the target vector.
- embed and embed_image can co-declare on the same record (different source properties → different :vector targets), so a record can have one text-embedding column and one image-embedding column queried by separate Atlas vectorSearch indexes.
Re-embedding existing rows
Provenance: the <into>_meta sibling (v5.5)
Every embed / embed_image declaration auto-declares an
<into>_meta :object sibling (override with meta_field:) stamped
on each recompute and cleared with the vector:
doc.body_embedding_meta
# => { "provider" => "openai",
# "model" => "text-embedding-3-small",
# "dimensions" => 1536,
# "modality" => "text",
# "embedded_at" => "2026-06-09T17:32:11Z" }
This is the record migration tooling reads to know which model produced any stored vector.
Same-shape migrations: Class.reembed! (v5.5)
When the new model has the same dimensions (e.g. swapping
text-embedding-3-small for a same-width replacement, or a provider
change at equal width), re-embed in place:
# Re-embed every row through the CURRENT provider/model declaration.
Document.reembed!(batch_size: 100)
# Resumable: skip rows whose <into>_meta already matches the current
# provider + model + dimensions (rows with no meta count as stale).
Document.reembed!(only_stale: true)
# Scope it to one field and a filtered subset of rows.
Document.reembed!(field: :body_embedding, where: { published: true }, limit: 10_000)
reembed! walks the class with objectId-cursor pagination, clears
each row's digest sibling (so the save-path recompute cannot elide the
provider call), and saves. Unlike embed_pending! — which only fills
NULL vectors — reembed! recomputes populated rows too. Run it with a
master-key client (or pass save_opts: with a session token that can
write every row). Each row's save makes one provider call; pace bulk
runs against provider rate limits (see BatchEmbedder below for the
pattern, or just throttle the loop).
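To make the only_stale: comparison and the objectId-cursor walk concrete, here is a minimal sketch over in-memory rows. It is not the SDK's implementation: stale_meta?, each_row_by_cursor, and the "meta" key on each row are names invented for this example, and the rows are plain Hashes rather than Parse objects.

```ruby
META_KEYS = %w[provider model dimensions].freeze

# A row is stale when it has no meta at all, or when any of
# provider / model / dimensions differs from the current declaration.
def stale_meta?(meta, current)
  return true if meta.nil? || meta.empty?

  META_KEYS.any? { |key| meta[key] != current[key] }
end

# Cursor pagination: each page holds the next batch_size rows whose
# objectId sorts after the last one seen, so the walk is resumable
# and is not disturbed by rows being rewritten behind the cursor.
def each_row_by_cursor(rows, batch_size:)
  cursor = nil
  loop do
    batch = rows.select { |row| cursor.nil? || row["objectId"] > cursor }
                .sort_by { |row| row["objectId"] }
                .first(batch_size)
    break if batch.empty?

    batch.each { |row| yield row }
    cursor = batch.last["objectId"]
  end
end
```

Filtering the walk with stale_meta? yields exactly the rows an interrupted migration still has to process; rows already stamped with the current provider, model, and dimensions are skipped.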
Changed-width migrations: dual-field workflow
Changing dimensions: is a different beast — the existing
vectorSearch index can't serve the new width. Use the shadow-field
workflow:
- Add the new property alongside the old one (property :body_embedding_v2, :vector, ...) and an embed or embed_image block targeting it.
- Backfill with embed_pending!(field: :body_embedding_v2). The new field is null everywhere, so the null-filling walk is exactly right.
- Deploy a new vectorSearch index covering the new field and migrate find_similar callers.
- Drop the old property and index.
Do NOT mutate a model's dimensions: in place — the digest mechanism
will see unchanged source text and skip recompute, leaving stale
vectors, and the drift verifier will flag every query against the old
index (index numDimensions=1536 but property declares ...). For
embed_image, also remember the digest is over the URL String: if you
replace bytes at the same URL (PUT-replace on S3 without renaming),
null the digest field — or run reembed! — to force re-embed.
Bulk embedding: BatchEmbedder (v5.5)
Provider#embed_text_batched only slices input into provider-sized
chunks; retry lives inside each provider's single HTTP call. For bulk
jobs (ingest pipelines, chunk-corpus embedding) use
Parse::Embeddings::BatchEmbedder, which adds batch-level pacing and
backoff:
= Parse::Embeddings::BatchEmbedder.new(
Parse::Embeddings.provider(:openai),
requests_per_minute: 60, # inter-batch pacing
max_attempts: 5, # per-batch tries (exponential backoff + jitter)
on_progress: ->(done:, total:, batch_index:, batch_count:) {
puts "#{done}/#{total}"
},
)
vectors = .(texts, input_type: :search_document)
Rate-limit and transient errors (any provider error class ending in
RateLimitError / TransientError; override with retry_on:) retry
with exponential backoff; other errors propagate immediately. A batch
that exhausts its attempts raises BatchEmbedder::BatchFailed
carrying batch_index and completed_count, so a resumable job knows
exactly where to pick up.
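The pacing-plus-backoff pattern described above can be sketched independently of the SDK. Everything in this block is a stand-in written for illustration: embed_in_batches, its keyword arguments, and the local RateLimitError / BatchFailed classes are assumptions, not BatchEmbedder's actual interface.

```ruby
# Stand-ins for a provider rate-limit error and the batch failure error.
class RateLimitError < StandardError; end

class BatchFailed < StandardError
  attr_reader :batch_index, :completed_count

  def initialize(batch_index:, completed_count:, last_error:)
    @batch_index = batch_index
    @completed_count = completed_count
    super("batch #{batch_index} failed: #{last_error.class}")
  end
end

# Yields each slice to the caller's block, which performs the provider call
# and returns one vector per input. `sleeper` is injectable so tests need
# not actually sleep.
def embed_in_batches(texts, batch_size:, max_attempts: 5, base_delay: 0.5,
                     min_interval: 0.0, sleeper: ->(seconds) { sleep(seconds) })
  results = []
  texts.each_slice(batch_size).with_index do |batch, index|
    attempt = 0
    begin
      attempt += 1
      results.concat(yield(batch))
    rescue RateLimitError => e
      if attempt >= max_attempts
        raise BatchFailed.new(batch_index: index,
                              completed_count: results.size, last_error: e)
      end
      # Exponential backoff with full jitter: a random delay in
      # [0, base_delay * 2^(attempt - 1)).
      sleeper.call(rand * base_delay * (2**(attempt - 1)))
      retry
    end
    # Inter-batch pacing, e.g. 60.0 / requests_per_minute.
    sleeper.call(min_interval) if min_interval.positive?
  end
  results
end
```

A resumable job can rescue BatchFailed, persist completed_count, and restart from that offset in the input array on the next run. Errors other than RateLimitError are not rescued and propagate immediately.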
Telemetry: parse.embeddings.embed AS::N
Every provider emits parse.embeddings.embed via
ActiveSupport::Notifications.instrument. Subscribe to track cost,
latency, and error rate across all embedding spend:
ActiveSupport::Notifications.subscribe("parse.embeddings.embed") do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
StatsD.increment(
"parse.embeddings.embed",
tags: [
"provider:#{event.payload[:provider]}",
"model:#{event.payload[:model]}",
"input_type:#{event.payload[:input_type]}",
"error:#{event.payload[:error] || 'none'}",
],
)
StatsD.histogram("parse.embeddings.tokens", event.payload[:total_tokens]) if event.payload[:total_tokens]
StatsD.timing("parse.embeddings.duration_ms", event.duration)
end
Payload contract (keys always present; values may be nil):
| Key | Type | Notes |
|---|---|---|
| :provider | String | provider.class.name (e.g. "Parse::Embeddings::OpenAI") |
| :model | String | provider.model_name |
| :dimensions | Integer | provider.dimensions |
| :input_count | Integer | batch size |
| :input_type | Symbol | :search_query / :search_document |
| :total_tokens | Integer/nil | provider-reported usage; nil for Fixture and providers without usage |
| :cached | Boolean | always false in v5.0; reserved for v5.1 embed cache |
| :error | String/nil | exception.class.name when the block raised (class name only) |
Notes:
- :error is the class name, never the message. Provider exceptions can contain user-supplied text from the API; surfacing only the class name keeps PII out of operator dashboards.
- Pre-validation failures (embed_text called with non-Array, or with non-String elements) do not emit an event. The validation runs before the instrument block so caller-shape errors aren't recorded as embed attempts.
- Subscribers run synchronously on the request thread. A slow subscriber blocks every embed call. Push to non-blocking sinks (StatsD-over-UDP, batched OTel exporters) rather than doing filesystem or HTTP I/O inside the subscriber.
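One way to keep a subscriber off the request thread's critical path is to enqueue a small event and let a background thread do the slow work. The class below is a hypothetical sketch (QueueingSubscriber is not part of Parse Stack); it relies only on Ruby's SizedQueue and on the five-argument form that ActiveSupport::Notifications passes to a callable subscriber.

```ruby
# Hands events to a background worker; the request thread only enqueues.
class QueueingSubscriber
  attr_reader :dropped

  def initialize(capacity: 1_000, &sink)
    @queue = SizedQueue.new(capacity)
    @dropped = 0
    # The worker drains the queue and runs the (possibly slow) sink.
    @worker = Thread.new do
      while (event = @queue.pop)
        sink.call(event)
      end
    end
  end

  # Same argument order AS::N uses for callable subscribers:
  # name, start time, finish time, notifier id, payload.
  def call(name, started, finished, _id, payload)
    event = {
      name: name,
      duration_ms: (finished - started) * 1000.0,
      provider: payload[:provider],
      error: payload[:error], # class name only, per the payload contract
    }
    @queue.push(event, true) # non-blocking push; raises ThreadError when full
  rescue ThreadError
    @dropped += 1 # shed load instead of stalling the embed call (approximate count)
  end

  # Signals the worker to stop after draining and waits for it.
  def shutdown
    @queue.push(nil)
    @worker.join
  end
end
```

An instance can be registered with ActiveSupport::Notifications.subscribe("parse.embeddings.embed", subscriber). Dropping events when the queue is full is a deliberate trade-off in this sketch: losing a metric is cheaper than blocking every embed call.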
Logging and PII considerations
When find_similar(text:) is called, the query text is sent over the
wire to the embedding provider. Operators with global Faraday request
logging enabled on the embedding connection will capture the full
query text in the JSON request body. Treat text: as user-visible
content for log-handling purposes; redact at the Faraday middleware
layer if your logging pipeline retains payloads.
The vector itself never appears in OpenAI request bodies (text in,
floats out). Vectors only flow through the Parse↔Mongo path, where
the body builder's <vector dims=N> compaction prevents them from
landing in stdout / error trackers.
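As a rough illustration of the redaction step, the function below scrubs the text inputs from a JSON request body before it is logged. It assumes an OpenAI-style embeddings body with a top-level "input" key; the function name is hypothetical, other providers may name the field differently, and wiring it into a Faraday logging middleware is left to the deployment.

```ruby
require "json"

# Replaces each embedding input with a size-only placeholder so logs
# keep the request shape (model, input count) without the query text.
def redact_embedding_body(json_body)
  body = JSON.parse(json_body)
  return "[REDACTED non-object body]" unless body.is_a?(Hash)

  inputs = Array(body["input"]) # a single String input becomes a 1-element Array
  body["input"] = inputs.map { |text| "[REDACTED #{text.to_s.bytesize} bytes]" }
  JSON.generate(body)
rescue JSON::ParserError
  # Never fall back to logging the raw body when parsing fails.
  "[REDACTED unparseable body]"
end
```

Keeping the byte count in the placeholder preserves enough signal to debug oversized inputs without retaining user content.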
When the embedded source is PII: deployment checklist
An embedding of PII is PII-equivalent. Inversion attacks reconstruct
substantial source text from dense embeddings, and a vector's nearest
neighbors leak the source's meaning even without reconstruction. If
the fields you embed contain personal data (names, addresses, health
or financial details, free-text user messages), treat the vector
column with the same handling as the source column:
- Provider contract. You are sending the raw source text (and in bytes mode, image content) to the embedding provider on every recompute. Confirm the provider's data-retention and training-use terms cover PII, and that a DPA is in place where required. Self-hosting via LocalHTTP (Ollama / vLLM / TEI) keeps the text in your network.
- Keep vectors off the wire. Leave vector_visibility at its :owner_only default so vectors are omitted from as_json and webhook payloads. Do not flip a PII class to :public.
- Row ACL still governs. Vector hits route mongo-direct with _rperm enforcement; verify your rows carry real ACLs and that callers use scoped credentials (session_token: / acl_user:), not blanket master key.
- Tenant isolation. Multi-tenant deployments must declare agent_tenant_scope on searchable classes; the scope folds into $vectorSearch.filter (and v5.5's drift verification confirms the index covers it). Without it, similarity scores leak cross-tenant document existence.
- Score exposure. Keep score quantization on for non-admin agent contexts (the default); full-precision scores enable membership-inference probing.
- EXIF stays stripped. For image embedding, keep the bytes-mode default exif_strip: true; user photos carry GPS coordinates and device serials that would otherwise reach the provider.
- Log and cache hygiene. Redact query text at the Faraday layer (above); if you enable the persistent L2 cache, note that cache KEYS are hashes (no plaintext) but cache VALUES are the embeddings themselves, so point MonetaStore at a store with the same access controls as the database.
- Deletion propagation. When a user exercises erasure rights, the vector, its <field>_digest, and its <field>_meta siblings live on the same row and delete with it, but check external copies: provider-side logs (their retention policy), your L2 embedding cache (TTL or explicit flush), and any analytics sink subscribed to embedding events.
- Migration hygiene. reembed! re-sends every row's source text to the provider, so schedule PII-class migrations under the same approvals as a data export.
Troubleshooting
NoVectorProperty: no :vector property declared on this class
The class has no field declared as :vector. Add one.
AmbiguousVectorField: class declares multiple :vector properties
Pass field: :which_one to disambiguate.
IndexNotResolved: no vectorSearch index found covering Class.field
Create the index (see §Creating the Atlas vectorSearch index) or pass
index: explicitly.
InvalidQueryVector: expected 1536, got 768
The query vector's length doesn't match the declared dimensions:.
Almost always means the query embedding came from a different model
than the stored embeddings.
EmbedderNotConfigured
The :vector property has no provider: declared but find_similar
was called with text:. Either declare a provider on the property, or
pass an explicit vector: Array.
ProtectedFieldError: <Class>#<field> is managed by 'embed'
User code tried to assign directly to a managed vector field. Update
the declared source fields instead and save.
InvalidResponseError: response length 5 != input count 4
The provider returned a different number of vectors than inputs. The
provider has a bug — the validation in
Parse::Embeddings::Provider#validate_response! caught it before the
misaligned vectors could be stored.
Atlas Local: index stays BUILDING forever
Atlas Local's internal supervisor periodically restarts mongod
during replica-set sync. Use IndexCatalog.wait_for_ready (which
bypasses the IndexManager's 300-second cache via force_refresh: true
on every poll) rather than a hand-rolled until index_ready?; sleep loop.
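For readers who want to see the shape of a deadline-bounded poll, here is a generic sketch. It is not IndexCatalog.wait_for_ready: poll_until and its parameters are invented for the example, and the caller's block stands in for a status check that reads fresh state on every call rather than a cached answer.

```ruby
# Calls the block until it returns truthy or the timeout elapses.
# Returns true on success, false on timeout. Uses the monotonic clock
# so wall-clock adjustments cannot shorten or extend the wait.
def poll_until(timeout:, interval:, sleeper: ->(seconds) { sleep(seconds) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleeper.call(interval)
  end
end
```

The important property is the one the paragraph above calls out: if the check inside the block is served from a cache, the loop can spin until timeout without ever observing the index becoming ready.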
Reference
Key files:
- lib/parse/embeddings.rb: registry, Configuration, register, provider, configure, validate_image_url! (mode: :forward | :fetch), trust_provider_url_fetch=, allowed_image_hosts=, allowed_image_types=.
- lib/parse/embeddings/provider.rb: abstract base, validate_response!, instrument_embed, AS::N payload contract.
- lib/parse/embeddings/image_fetch.rb: bytes-fetch path (ImageFetch.fetch!, magic-byte sniff_mime / verify!, EXIF/XMP stripping, FetchedImage).
- lib/parse/embeddings/batch_embedder.rb: BatchEmbedder bulk orchestration (pacing, batch-level backoff, BatchFailed).
- lib/parse/embeddings/cache.rb: opt-in query-embed cache (Cache.enable! / fetch_vector / stats).
- lib/parse/embeddings/spend_cap.rb: per-tenant token cap (charge!, charge_query!, with_precharged).
- lib/parse/embeddings/openai.rb: OpenAI provider.
- lib/parse/embeddings/cohere.rb: Cohere v3 + v4.0 text-mode provider.
- lib/parse/embeddings/voyage.rb: Voyage text + multimodal-3 text-mode provider.
- lib/parse/embeddings/jina.rb: Jina v3 / v4 / v5 / code provider.
- lib/parse/embeddings/qwen.rb: Qwen3-Embedding via DashScope.
- lib/parse/embeddings/local_http.rb: generic OpenAI-compatible local-gateway client.
- lib/parse/embeddings/fixture.rb: deterministic test provider.
- lib/parse/model/core/vector_searchable.rb: find_similar, hybrid_search, index drift verification (Parse::VectorSearch.index_drift_policy).
- lib/parse/model/core/embed_managed.rb: embed and embed_image macros, EmbedDirective (carries modality:, allow_insecure:, source_mode:, exif_strip:, meta_field:), embed_pending!, reembed!.
- lib/parse/vector_search.rb: low-level Parse::VectorSearch.search.
- lib/parse/atlas_search/index_manager.rb: IndexCatalog.create_index, find_vector_index, wait_for_ready.
- lib/parse/mongodb.rb: direct MongoDB access, 5-layer enforcement.