Class: Woods::Embedding::Provider::OpenAI

Inherits:
Object
Includes:
Interface
Defined in:
lib/woods/embedding/openai.rb

Overview

OpenAI adapter for cloud embeddings via the OpenAI HTTP API.

Uses the `/v1/embeddings` endpoint to generate embeddings. Requires a valid OpenAI API key.
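
The private request helper is not shown on this page, so the following is only a minimal sketch of what a call to this endpoint might look like with Ruby's standard Net::HTTP. The helper name `build_embedding_request`, the constant `EMBEDDINGS_ENDPOINT`, and the sample key are assumptions made for illustration; the request is built but never sent.

```ruby
require 'net/http'
require 'json'
require 'uri'

EMBEDDINGS_ENDPOINT = URI('https://api.openai.com/v1/embeddings')

# Hypothetical helper (not part of Woods): builds the POST request without sending it.
def build_embedding_request(api_key:, model:, input:)
  request = Net::HTTP::Post.new(EMBEDDINGS_ENDPOINT)
  request['Authorization'] = "Bearer #{api_key}" # API key goes in a Bearer token header
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ model: model, input: input })
  request
end

request = build_embedding_request(
  api_key: 'sk-example', # placeholder, not a real key
  model: 'text-embedding-3-small',
  input: 'hello'
)
```

Sending it would be a matter of passing `request` to a `Net::HTTP` session opened against the endpoint host with TLS enabled.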

Examples:

provider = Woods::Embedding::Provider::OpenAI.new(api_key: ENV['OPENAI_API_KEY'])
vector = provider.embed("class User < ApplicationRecord; end")
vectors = provider.embed_batch(["text1", "text2"])

Constant Summary

ENDPOINT =
URI('https://api.openai.com/v1/embeddings')
DEFAULT_MODEL =
'text-embedding-3-small'
DIMENSIONS =
{
  'text-embedding-3-small' => 1536,
  'text-embedding-3-large' => 3072
}.freeze
MAX_INPUT_TOKENS =

OpenAI embedding models share an 8191-token input cap across text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002. The chunker uses this as a hard ceiling; the actual chunk size lands well below it once chars-per-token estimation and the prefix allowance are factored in (see Builder#build_chunker).

8191
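
To make the relationship between the token cap and the final chunk size concrete, here is a rough sketch of the arithmetic. The chars-per-token ratio and prefix allowance below are assumed values chosen for illustration; the numbers Builder#build_chunker actually uses are not shown on this page, and `max_chunk_chars` is a hypothetical helper.

```ruby
MAX_INPUT_TOKENS = 8191

# Assumed values for illustration only; Woods' real estimates may differ.
CHARS_PER_TOKEN = 3.0    # conservative chars-per-token estimate
PREFIX_ALLOWANCE = 200   # tokens reserved for a per-chunk prefix

# Hypothetical helper: converts the token ceiling into a character budget.
def max_chunk_chars(max_tokens: MAX_INPUT_TOKENS,
                    chars_per_token: CHARS_PER_TOKEN,
                    prefix_tokens: PREFIX_ALLOWANCE)
  usable_tokens = max_tokens - prefix_tokens
  (usable_tokens * chars_per_token).floor
end

budget = max_chunk_chars
```

A lower chars-per-token estimate or a larger prefix allowance shrinks the budget, which is why the resulting chunk size sits below the raw 8191-token ceiling.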

Constructor Details

#initialize(api_key:, model: DEFAULT_MODEL) ⇒ OpenAI

Returns a new instance of OpenAI.

Parameters:

  • api_key (String)

    OpenAI API key

  • model (String) (defaults to: DEFAULT_MODEL)

    OpenAI embedding model name (default: text-embedding-3-small)



# File 'lib/woods/embedding/openai.rb', line 36

def initialize(api_key:, model: DEFAULT_MODEL)
  @api_key = api_key
  @model = model
end

Instance Method Details

#dimensions ⇒ Integer

Return the dimensionality of vectors produced by this model.

Uses the known dimensions for standard models. For any other model, it embeds a test string and returns the length of the resulting vector, which costs one API request.

Returns:

  • (Integer)

    number of dimensions



# File 'lib/woods/embedding/openai.rb', line 80

def dimensions
  DIMENSIONS[@model] || embed('test').length
end
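
The lookup-then-fallback behavior can be exercised without network access. The sketch below uses a hypothetical `FakeProvider` whose `embed` returns a fixed-size dummy vector and counts calls; it stands in for the real class and is not part of Woods.

```ruby
KNOWN_DIMENSIONS = {
  'text-embedding-3-small' => 1536,
  'text-embedding-3-large' => 3072
}.freeze

# Hypothetical stand-in for the provider. embed returns a dummy vector
# instead of calling the API and records how often it was invoked.
class FakeProvider
  attr_reader :embed_calls

  def initialize(model:, fake_size: 8)
    @model = model
    @fake_size = fake_size
    @embed_calls = 0
  end

  def embed(_text)
    @embed_calls += 1
    Array.new(@fake_size, 0.0)
  end

  # Same shape as the method above: table lookup first, test embedding second.
  def dimensions
    KNOWN_DIMENSIONS[@model] || embed('test').length
  end
end

known = FakeProvider.new(model: 'text-embedding-3-small')
unknown = FakeProvider.new(model: 'custom-model')
known_dims = known.dimensions     # resolved from the table, no embed call
unknown_dims = unknown.dimensions # resolved by embedding a test string
```

Because the method shown above does not cache the fallback result, each call for an unknown model triggers another embedding request; callers that need the value repeatedly may want to store it.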

#embed(text) ⇒ Array<Float>

Embed a single text string.

Parameters:

  • text (String)

    the text to embed

Returns:

  • (Array<Float>)

    the embedding vector

Raises:

  • (Woods::Error)

    if the API returns an error

  • (ArgumentError)

    if the text is nil, empty, or whitespace-only (OpenAI rejects these with 400)



# File 'lib/woods/embedding/openai.rb', line 47

def embed(text)
  raise ArgumentError, 'embed(text) requires a non-empty string' if text.nil? || text.to_s.strip.empty?

  response = post_request({ model: @model, input: text })
  response['data'].first['embedding']
end
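
As a small illustration of the two steps in this method, the sketch below isolates the guard condition and the response lookup. `blank_input?` is a hypothetical helper name, and the response hash is hand-written sample data shaped like a single-input result, not real API output.

```ruby
# Hypothetical helper mirroring the guard condition in #embed.
def blank_input?(text)
  text.nil? || text.to_s.strip.empty?
end

# Hand-written sample shaped like a single-input embeddings response.
response = { 'data' => [{ 'index' => 0, 'embedding' => [0.1, 0.2, 0.3] }] }

# A single input yields a single entry, so the first element holds the vector.
vector = response['data'].first['embedding']
```

Note that the guard treats whitespace-only strings such as `"  \n"` the same as an empty string.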

#embed_batch(texts) ⇒ Array<Array<Float>>

Embed multiple texts in a single request.

Sorts results by the index field to guarantee ordering matches input.

Parameters:

  • texts (Array<String>)

    the texts to embed

Returns:

  • (Array<Array<Float>>)

    array of embedding vectors

Raises:

  • (Woods::Error)

    if the API returns an error

  • (ArgumentError)

    if the array is nil or empty, or any element is nil, empty, or whitespace-only



# File 'lib/woods/embedding/openai.rb', line 62

def embed_batch(texts) # rubocop:disable Metrics/CyclomaticComplexity
  raise ArgumentError, 'embed_batch(texts) requires a non-empty array' if texts.nil? || texts.empty?
  if texts.any? { |t| t.nil? || t.to_s.strip.empty? }
    raise ArgumentError, 'embed_batch(texts) rejects nil/empty entries (OpenAI returns 400)'
  end

  response = post_request({ model: @model, input: texts })
  response['data']
    .sort_by { |item| item['index'] }
    .map { |item| item['embedding'] }
end
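
The reordering step can be seen in isolation with a response whose entries arrive out of order. The JSON below is hand-written sample data with made-up values, used only to show how sorting by `index` restores input order.

```ruby
require 'json'

# Hand-written sample shaped like a batch embeddings response,
# deliberately listed out of order.
raw = <<~JSON
  {
    "data": [
      { "index": 1, "embedding": [0.3, 0.4] },
      { "index": 0, "embedding": [0.1, 0.2] }
    ]
  }
JSON

response = JSON.parse(raw)

# Sort by the index field, then keep only the vectors.
vectors = response['data']
          .sort_by { |item| item['index'] }
          .map { |item| item['embedding'] }
```

After sorting, `vectors[i]` corresponds to the i-th input text regardless of the order in which the entries were returned.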

#max_input_tokens ⇒ Integer

Maximum input length, in tokens, that OpenAI will accept for a single embedding text. All current text-embedding-* models cap at 8191 tokens.

Returns:

  • (Integer)


# File 'lib/woods/embedding/openai.rb', line 95

def max_input_tokens
  MAX_INPUT_TOKENS
end

#model_name ⇒ String

Return the model name.

Returns:

  • (String)

    the OpenAI model name



# File 'lib/woods/embedding/openai.rb', line 87

def model_name
  @model
end