Class: Kiribi::Gemma4::E2B::VisionEncoder

Inherits:

Object

Defined in: lib/kiribi/gemma4/e2b/vision_encoder.rb

Constant Summary

PATCH_SIZE =

RESCALE_FACTOR = 1.0 / 255

MAX_SOFT_TOKENS =

POOLING_KERNEL =

MAX_PATCHES = MAX_SOFT_TOKENS * POOLING_KERNEL**2

SIDE_MULT = POOLING_KERNEL * PATCH_SIZE
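
The two derived constants follow directly from the base ones. The sketch below works the arithmetic through with placeholder numbers. The values 16, 3, and 280 are assumptions chosen for illustration only, since this page does not show the real values of PATCH_SIZE, POOLING_KERNEL, or MAX_SOFT_TOKENS. The module name ConstantsSketch is likewise hypothetical.

```ruby
module ConstantsSketch
  # ASSUMED placeholder values; not taken from the library source.
  PATCH_SIZE      = 16
  POOLING_KERNEL  = 3
  MAX_SOFT_TOKENS = 280

  # Same formulas as in the constant summary.
  MAX_PATCHES = MAX_SOFT_TOKENS * POOLING_KERNEL**2
  SIDE_MULT   = POOLING_KERNEL * PATCH_SIZE
end

puts ConstantsSketch::MAX_PATCHES # 2520 under the assumed values
puts ConstantsSketch::SIDE_MULT   # 48 under the assumed values
```

Read this way, MAX_PATCHES looks like the patch budget that corresponds to MAX_SOFT_TOKENS when patches are grouped in POOLING_KERNEL × POOLING_KERNEL blocks, and SIDE_MULT is the pixel length of one such block. That interpretation comes from the constant names and formulas, not from a statement in the source.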

Instance Method Summary

#encode(blob_rgb, width, height) ⇒ Object

blob_rgb: raw RGB24 bytes, already resized to width × height. Returns the image_features array.
#initialize ⇒ VisionEncoder constructor

A new instance of VisionEncoder.
#input_size_of(original_width, original_height) ⇒ Object

Takes the original image size (original_width, original_height) and returns the size [width, height] the image should be resized to before calling encode.

Constructor Details

#initialize ⇒ `VisionEncoder`

Returns a new instance of VisionEncoder.



# File 'lib/kiribi/gemma4/e2b/vision_encoder.rb', line 18

def initialize
  @model = OnnxRuntime::Model.new(VISION_ENCODER_FILEPATH)
end

Instance Method Details

#encode(blob_rgb, width, height) ⇒ `Object`

blob_rgb: raw RGB24 bytes, already resized to width × height. Returns the image_features array.

# File 'lib/kiribi/gemma4/e2b/vision_encoder.rb', line 46

def encode(blob_rgb, width, height)
  # Accept either a packed byte String or an Array of 0..255 integers.
  blob = blob_rgb.is_a?(String) ? blob_rgb.unpack("C*") : blob_rgb
  # Integer division: leftover pixels on the right and bottom edges are ignored.
  patches_w = width / PATCH_SIZE
  patches_h = height / PATCH_SIZE

  pixel_values = []
  pixel_position_ids = []

  # Walk the patch grid column by column. Each patch is flattened to
  # PATCH_SIZE * PATCH_SIZE * 3 floats, scaled by RESCALE_FACTOR.
  patches_w.times do |col|
    patches_h.times do |row|
      patch = []
      PATCH_SIZE.times do |dy|
        PATCH_SIZE.times do |dx|
          y = row * PATCH_SIZE + dy
          x = col * PATCH_SIZE + dx
          idx = (y * width + x) * 3
          patch << blob[idx] * RESCALE_FACTOR
          patch << blob[idx + 1] * RESCALE_FACTOR
          patch << blob[idx + 2] * RESCALE_FACTOR
        end
      end
      pixel_values << patch
      pixel_position_ids << [col, row]
    end
  end

  # Pad with all-zero patches at position [-1, -1] until MAX_PATCHES entries exist.
  while pixel_values.length < MAX_PATCHES
    pixel_values << Array.new(PATCH_SIZE**2 * 3, 0.0)
    pixel_position_ids << [-1, -1]
  end

  @model.predict({
    "pixel_values" => [pixel_values],
    "pixel_position_ids" => [pixel_position_ids],
  })["image_features"]
end
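
To make the patch layout easier to follow, here is a minimal standalone sketch of the same patch-extraction loop applied to a tiny 4 × 2 image. It does not call the ONNX model. The helper demo_patches and the constants DEMO_PATCH_SIZE and DEMO_RESCALE are hypothetical names introduced for this example, and the patch size of 2 is an assumption chosen to keep the output small.

```ruby
# Hypothetical stand-ins for PATCH_SIZE and RESCALE_FACTOR.
DEMO_PATCH_SIZE = 2
DEMO_RESCALE = 1.0 / 255

# Splits an RGB24 buffer into flattened patches, visiting the patch grid
# column by column in the same order as #encode.
def demo_patches(blob_rgb, width, height, patch_size = DEMO_PATCH_SIZE)
  bytes = blob_rgb.is_a?(String) ? blob_rgb.unpack("C*") : blob_rgb
  patches = []
  position_ids = []
  (width / patch_size).times do |col|
    (height / patch_size).times do |row|
      patch = []
      patch_size.times do |dy|
        patch_size.times do |dx|
          idx = ((row * patch_size + dy) * width + col * patch_size + dx) * 3
          3.times { |channel| patch << bytes[idx + channel] * DEMO_RESCALE }
        end
      end
      patches << patch
      position_ids << [col, row]
    end
  end
  [patches, position_ids]
end

# A 4x2 image: the left half is black, the right half is white.
pixel_row = [0, 0, 0] * 2 + [255, 255, 255] * 2
blob = (pixel_row * 2).pack("C*")

patches, position_ids = demo_patches(blob, 4, 2)
puts patches.length       # number of patches
puts position_ids.inspect # [col, row] pair for each patch
```

With these inputs the image splits into two patches of 12 floats each, at positions [0, 0] and [1, 0]. The real #encode then pads both arrays up to MAX_PATCHES before calling the model.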

#input_size_of(original_width, original_height) ⇒ `Object`

Takes the original image size (original_width, original_height) and returns the size [width, height] the image should be resized to before calling encode.

# File 'lib/kiribi/gemma4/e2b/vision_encoder.rb', line 24

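def input_size_of(original_width, original_height)
  # Pixel budget: the area covered by MAX_PATCHES patches.
  target_px = MAX_PATCHES * PATCH_SIZE**2
  # Uniform scale factor that brings the image area to the budget.
  factor = Math.sqrt(target_px.to_f / (original_height * original_width))

  # Round each side down to a multiple of SIDE_MULT.
  width  = (factor * original_width  / SIDE_MULT).floor * SIDE_MULT
  height = (factor * original_height / SIDE_MULT).floor * SIDE_MULT

  # Extreme aspect ratios can round one side down to zero. In that case the short
  # side is set to SIDE_MULT and the long side is capped at MAX_SOFT_TOKENS * SIDE_MULT.
  # The aspect ratio uses integer division when both arguments are Integers.
  if width == 0 && height == 0
    raise "Image too small to resize"
  elsif height == 0
    height = SIDE_MULT
    width = [(original_width / original_height) * SIDE_MULT, MAX_SOFT_TOKENS * SIDE_MULT].min
  elsif width == 0
    width = SIDE_MULT
    height = [(original_height / original_width) * SIDE_MULT, MAX_SOFT_TOKENS * SIDE_MULT].min
  end

  [width, height]
end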
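
As a worked example of the main path (neither side rounding to zero), the sketch below reproduces the scaling arithmetic for a 1920 × 1080 image. The constant values are assumptions for illustration only, because this page does not show the real values of PATCH_SIZE, POOLING_KERNEL, or MAX_SOFT_TOKENS. SizeSketch and main_path_size are hypothetical names.

```ruby
module SizeSketch
  # ASSUMED placeholder values; not taken from the library source.
  PATCH_SIZE      = 16
  POOLING_KERNEL  = 3
  MAX_SOFT_TOKENS = 280
  MAX_PATCHES     = MAX_SOFT_TOKENS * POOLING_KERNEL**2
  SIDE_MULT       = POOLING_KERNEL * PATCH_SIZE

  # Main path only: scale the image to the pixel budget, then round each
  # side down to a multiple of SIDE_MULT. The zero-side handling of
  # #input_size_of is left out.
  def self.main_path_size(original_width, original_height)
    target_px = MAX_PATCHES * PATCH_SIZE**2
    factor = Math.sqrt(target_px.to_f / (original_height * original_width))
    [original_width, original_height].map do |side|
      (factor * side / SIDE_MULT).floor * SIDE_MULT
    end
  end
end

width, height = SizeSketch.main_path_size(1920, 1080)
patch_count = (width / SizeSketch::PATCH_SIZE) * (height / SizeSketch::PATCH_SIZE)
puts "#{width}x#{height}, #{patch_count} patches"
# prints "1056x576, 2376 patches" under the assumed values
```

Because both sides are rounded down, the resulting patch count stays at or below MAX_PATCHES, which leaves room for the zero padding performed in #encode. The intended call order appears to be #input_size_of first, then an external resize to the returned size, then #encode. The resize step is not part of this class.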