Module: Kreuzberg::ExtractionAPI

Defined in:
lib/kreuzberg/extraction_api.rb

Instance Method Details

#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>

Asynchronously extract content from multiple byte data sources.

Non-blocking batch extraction from multiple in-memory binary documents. Results maintain the same order as the input data array. This method is preferred when processing multiple documents without blocking (e.g., handling multiple uploads).

Examples:

Batch extract uploaded documents asynchronously

# From a web request with multiple file uploads
uploaded_files = params[:files]  # Array of uploaded file objects
data = uploaded_files.map(&:read)
types = uploaded_files.map(&:content_type)

results = Kreuzberg.batch_extract_bytes(data_array: data, mime_types: types)
results.each { |r| puts r.content }

Batch extract with OCR

data = [scan_1_bytes, scan_2_bytes, scan_3_bytes]
types = ["image/png", "image/png", "image/png"]
config = Kreuzberg::Config::Extraction.new(force_ocr: true)
results = Kreuzberg.batch_extract_bytes(data_array: data, mime_types: types, config: config)

Parameters:

  • data_array (Array<String>)

    Array of binary document data. Each element can contain any byte values (e.g., PDF binary data).

  • mime_types (Array<String>)

    Array of MIME types corresponding to each data item. Must be the same length as data_array (e.g., ["application/pdf", "application/msword"]).

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all items. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input data. Array length matches the data_array length.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 349

def batch_extract_bytes(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
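
Because mime_types must line up with data_array element by element, it can help to check the two arrays before making a batch call. The following is a minimal sketch; check_batch_inputs! is a hypothetical helper and not part of Kreuzberg, and the byte strings are placeholders rather than real documents.

```ruby
# Hypothetical pre-flight check (not part of Kreuzberg): verifies that every
# document has a matching, non-empty MIME type before a batch call is made.
def check_batch_inputs!(data_array, mime_types)
  unless data_array.length == mime_types.length
    raise ArgumentError,
          "expected #{data_array.length} MIME types, got #{mime_types.length}"
  end

  mime_types.each_with_index do |type, idx|
    raise ArgumentError, "MIME type at index #{idx} is empty" if type.to_s.strip.empty?
  end

  true
end

# Placeholder byte strings standing in for real document data.
data  = ["%PDF-1.7 ...".b, "PK\x03\x04 ...".b]
types = ["application/pdf"] # one entry short on purpose

begin
  check_batch_inputs!(data, types)
rescue ArgumentError => e
  puts "Rejected batch: #{e.message}"
end
```

Failing early in Ruby keeps the error message specific to the caller's input instead of relying on whatever the native layer reports.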

#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>

Synchronously extract content from multiple byte data sources.

Processes multiple in-memory binary documents in a single batch operation. Results maintain the same order as the input data array. The mime_types array must have the same length as the data_array.

Examples:

Batch extract binary documents

pdf_data_1 = File.read("doc1.pdf", binmode: true)
pdf_data_2 = File.read("doc2.pdf", binmode: true)
docx_data = File.read("report.docx", binmode: true)

data = [pdf_data_1, pdf_data_2, docx_data]
types = ["application/pdf", "application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"]
results = Kreuzberg.batch_extract_bytes_sync(data_array: data, mime_types: types)
results.each { |r| puts r.content }

Parameters:

  • data_array (Array<String>)

    Array of binary document data. Each element can contain any byte values (e.g., PDF binary data).

  • mime_types (Array<String>)

    Array of MIME types corresponding to each data item. Must be the same length as data_array (e.g., ["application/pdf", "application/msword"]).

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all items. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input data. Array length matches the data_array length.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 305

def batch_extract_bytes_sync(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes_sync(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end

#batch_extract_files(paths:, config: nil) ⇒ Array<Result>

Asynchronously extract content from multiple files.

Non-blocking batch extraction from multiple files. Results maintain the same order as input paths. This is the preferred method for bulk processing when non-blocking I/O is required (e.g., in web servers or async applications).

Examples:

Batch extract multiple files asynchronously

paths = ["invoice_1.pdf", "invoice_2.pdf", "invoice_3.pdf"]
results = Kreuzberg.batch_extract_files(paths: paths)
results.each_with_index do |result, idx|
  puts "Invoice #{idx}: #{result.detected_languages}"
end

Batch extract with chunking

paths = Dir.glob("reports/*.docx")
config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(max_chars: 1000, max_overlap: 200)
)
results = Kreuzberg.batch_extract_files(paths: paths, config: config)

Parameters:

  • paths (Array<String, Pathname>)

    Array of file paths to extract. Each path is converted to a string and MIME type is auto-detected from extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all files. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input paths. Array length matches the input paths length.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 231

def batch_extract_files(paths:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_files(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
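
Since results come back in the same order as the input paths, each result can be paired with its source path by index. A minimal sketch under that assumption; results_by_path is a hypothetical helper, and the Struct stands in for real Result objects.

```ruby
# Hypothetical helper (not part of Kreuzberg): pairs each input path with the
# result at the same index, relying on the documented order guarantee.
def results_by_path(paths, results)
  unless paths.length == results.length
    raise ArgumentError, "paths and results differ in length"
  end

  paths.map(&:to_s).zip(results).to_h
end

# Stand-in result objects; a real call would return Kreuzberg Result instances.
fake_result = Struct.new(:content)
paths   = ["invoice_1.pdf", "invoice_2.pdf"]
results = [fake_result.new("first"), fake_result.new("second")]

results_by_path(paths, results).each do |path, result|
  puts "#{path}: #{result.content}"
end
```

Note that duplicate paths would collapse into a single Hash key, so an array of pairs (plain zip) is the safer shape when the input may contain repeats.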

#batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>

Synchronously extract content from multiple files.

Processes multiple files in a single batch operation. Files are extracted sequentially, and results maintain the same order as the input paths. This is useful for bulk processing multiple documents with consistent configuration.

Examples:

Batch extract multiple PDFs

paths = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = Kreuzberg.batch_extract_files_sync(paths: paths)
results.each_with_index do |result, idx|
  puts "File #{idx}: #{result.content.length} characters"
end

Batch extract with consistent configuration

paths = Dir.glob("documents/*.pdf")
config = Kreuzberg::Config::Extraction.new(force_ocr: true)
results = Kreuzberg.batch_extract_files_sync(paths: paths, config: config)

Parameters:

  • paths (Array<String, Pathname>)

    Array of file paths to extract. Each path is converted to a string and MIME type is auto-detected from extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all files. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input paths. Array length matches the input paths length.

Raises:

  • (Errors::IOError)

    If any path in paths does not exist.



# File 'lib/kreuzberg/extraction_api.rb', line 100

def batch_extract_files_sync(paths:, config: nil)
  # Validate that all files exist
  paths.each do |path|
    path_str = path.to_s
    raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)
  end

  opts = normalize_config(config)
  hashes = native_batch_extract_files_sync(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
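
As the source above shows, the method raises on the first missing path, so a single bad entry aborts the whole batch before any extraction starts. A caller that would rather skip missing files could filter the list first. A minimal sketch; split_by_existence is a hypothetical helper, not part of Kreuzberg, and the temporary files exist only to make the example runnable.

```ruby
require "tmpdir"
require "pathname"

# Hypothetical helper (not part of Kreuzberg): separates paths that exist on
# disk from those that do not, preserving input order within each group.
def split_by_existence(paths)
  paths.map(&:to_s).partition { |path| File.exist?(path) }
end

Dir.mktmpdir do |dir|
  present = File.join(dir, "doc1.pdf")
  File.binwrite(present, "%PDF-1.7")
  absent = Pathname.new(dir).join("doc2.pdf") # Pathname inputs work too

  existing, missing = split_by_existence([present, absent])
  puts "Will extract: #{existing.length}, skipped: #{missing.length}"
  # existing could now be passed as paths: to batch_extract_files_sync
end
```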

#embed(texts:, config: nil) ⇒ Array<Array<Float>>

Asynchronously generate embeddings for multiple texts.

Non-blocking embedding generation from a list of strings.

Examples:

Generate embeddings asynchronously

texts = ["Hello, world!", "Kreuzberg is awesome."]
embeddings = Kreuzberg.embed(texts: texts)
puts embeddings.first.length # 384

Parameters:

  • texts (Array<String>)

    List of strings to embed.

  • config (Config::Embedding, Hash, nil) (defaults to: nil)

    Embedding configuration.

Returns:

  • (Array<Array<Float>>)

    Array of embedding vectors.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 254

def embed(texts:, config: nil)
  opts = normalize_config(config)
  native_embed(texts: texts.map(&:to_s), config: opts)
end
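
The returned vectors are plain arrays of floats, so they can be compared with ordinary Ruby. The following is a minimal sketch of cosine similarity; cosine_similarity is a hypothetical helper, and the short vectors are made-up stand-ins for real embeddings.

```ruby
# Hypothetical helper (not part of Kreuzberg): cosine similarity between two
# equal-length vectors, i.e. dot(a, b) / (|a| * |b|).
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Made-up three-dimensional vectors standing in for real embeddings.
first  = [1.0, 0.0, 0.0]
second = [0.0, 1.0, 0.0]

puts cosine_similarity(first, first)
puts cosine_similarity(first, second)
```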

#embed_sync(texts:, config: nil) ⇒ Array<Array<Float>>

Synchronously generate embeddings for multiple texts.

Blocking embedding generation from a list of strings.

Parameters:

  • texts (Array<String>)

    List of strings to embed.

  • config (Config::Embedding, Hash, nil) (defaults to: nil)

    Embedding configuration.

Returns:

  • (Array<Array<Float>>)

    Array of embedding vectors.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 269

def embed_sync(texts:, config: nil)
  opts = normalize_config(config)
  native_embed_sync(texts: texts.map(&:to_s), config: opts)
end

#extract_bytes(data:, mime_type:, config: nil) ⇒ Result

Asynchronously extract content from byte data.

Non-blocking extraction from in-memory binary data. Like #extract_file, this performs extraction in the background, making it suitable for handling high-volume extraction workloads without blocking the main thread.

Examples:

Extract PDF from memory asynchronously

pdf_data = File.read("document.pdf", binmode: true)
result = Kreuzberg.extract_bytes(data: pdf_data, mime_type: "application/pdf")
puts result.content

Extract with image extraction

data = File.read("file.docx", binmode: true)
config = Kreuzberg::Config::Extraction.new(
  image_extraction: Kreuzberg::Config::ImageExtraction.new(extract_images: true)
)
result = Kreuzberg.extract_bytes(data: data, mime_type: "application/vnd.openxmlformats-officedocument.wordprocessingml.document", config: config)

Parameters:

  • data (String)

    Binary document data (can contain any byte values)

  • mime_type (String)

    MIME type of the data (e.g., 'application/pdf'). Required: it tells the extraction engine which format to parse.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 190

def extract_bytes(data:, mime_type:, config: nil)
  opts = normalize_config(config)
  hash = native_extract_bytes(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result

Synchronously extract content from byte data.

Performs document extraction directly from binary data in memory. Useful for extracting content from files already loaded into memory or from network streams.

Examples:

Extract PDF from memory

pdf_data = File.read("document.pdf", binmode: true)
result = Kreuzberg.extract_bytes_sync(data: pdf_data, mime_type: "application/pdf")
puts result.content

Extract from a network stream

require "httparty"

response = HTTParty.get("https://example.com/document.docx")
result = Kreuzberg.extract_bytes_sync(data: response.body, mime_type: "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

Parameters:

  • data (String)

    Binary document data (can contain any byte values)

  • mime_type (String)

    MIME type of the data (e.g., 'application/pdf'). Required: it tells the extraction engine which format to parse.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images

Raises:

  • (TypeError)

    If mime_type is nil.



# File 'lib/kreuzberg/extraction_api.rb', line 59

def extract_bytes_sync(data:, mime_type:, config: nil)
  raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

  opts = normalize_config(config)
  hash = native_extract_bytes_sync(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
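
Byte input carries no file name, so the caller has to supply the MIME type, and the guard above only rejects nil. When the original file name is still known (for example from an upload), a small lookup can provide the type. A minimal sketch; mime_type_for and its table are hypothetical and cover only the formats mentioned in this section.

```ruby
# Hypothetical helper (not part of Kreuzberg): maps a file name to a MIME
# type via its extension, raising for extensions the table does not know.
def mime_type_for(filename)
  known = {
    ".pdf"  => "application/pdf",
    ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".png"  => "image/png"
  }

  ext = File.extname(filename.to_s).downcase
  known.fetch(ext) { raise ArgumentError, "no MIME type known for #{ext.inspect}" }
end

puts mime_type_for("Report.DOCX")
puts mime_type_for("scan.png")
```

Raising on unknown extensions, rather than returning nil, keeps a missing type from reaching the extraction call unnoticed.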

#extract_file(path:, mime_type: nil, config: nil) ⇒ Result

Asynchronously extract content from a file.

Non-blocking extraction performed in the background using native threads or the Tokio runtime. As the signature and the source below show, the method itself returns a Result rather than a separate promise object. This method is preferred for I/O-bound operations and for integrating with async workflows.

Examples:

Extract a PDF file asynchronously

result = Kreuzberg.extract_file(path: "large_document.pdf")
puts result.content

Extract with custom OCR configuration

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(language: "deu")
)
result = Kreuzberg.extract_file(path: "document.pdf", config: config)

Parameters:

  • path (String, Pathname)

    Path to the document file to extract

  • mime_type (String, nil) (defaults to: nil)

    Optional MIME type for the file (e.g., 'application/pdf'). If omitted, the type is detected from the file extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images. In async contexts, this result is available upon method return.

Raises:

  • (Errors::IOError)

    If the file at path does not exist.



# File 'lib/kreuzberg/extraction_api.rb', line 144

def extract_file(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file(path_str, mime_type.to_s, **opts)
         else
           native_extract_file(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result

Synchronously extract content from a file.

Blocking extraction from a document on disk. Returns an extraction result containing content, metadata, tables, and images.

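Examples:

Extract a PDF file

result = Kreuzberg.extract_file_sync(path: "document.pdf")
puts result.content

Extract with explicit MIME type

result = Kreuzberg.extract_file_sync(path: "document.pdf", mime_type: "application/pdf")

Extract with OCR enabled

config = Kreuzberg::Config::Extraction.new(force_ocr: true)
result = Kreuzberg.extract_file_sync(path: "scanned.pdf", config: config)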

Parameters:

  • path (String, Pathname)

    Path to the document file to extract

  • mime_type (String, nil) (defaults to: nil)

    Optional MIME type for the file (e.g., 'application/pdf'). If omitted, the type is detected from the file extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images

Raises:

  • (Errors::IOError)

    If the file at path does not exist.



# File 'lib/kreuzberg/extraction_api.rb', line 17

def extract_file_sync(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file_sync(path_str, mime_type.to_s, **opts)
         else
           native_extract_file_sync(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

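#normalize_config(config) ⇒ Object

Normalize a configuration argument into a Hash. nil becomes an empty Hash, a Hash is returned unchanged, and any other object is converted with to_h.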



# File 'lib/kreuzberg/extraction_api.rb', line 394

def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end
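
The three branches can be exercised in isolation. The sketch below copies the method body shown above into a standalone script; the Struct is a stand-in for a real configuration object and is an assumption made only for illustration.

```ruby
# Standalone copy of the method shown above, for experimentation.
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end

# Stand-in for a config object; any object responding to to_h would do.
sketch_config = Struct.new(:force_ocr, keyword_init: true).new(force_ocr: true)

p normalize_config(nil)                 # nil branch
p normalize_config({ force_ocr: true }) # Hash branch
p normalize_config(sketch_config)       # to_h branch
```

A Hash argument is returned as the same object, not a copy, so later mutation of the result would also be visible to the caller.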

#render_pdf_page(path, page_index, dpi: 150) ⇒ String

Render a single PDF page as a PNG image.

Parameters:

  • path (String, Pathname)

    Path to the PDF file

  • page_index (Integer)

    Zero-based page index

  • dpi (Integer) (defaults to: 150)

    Rendering resolution in dots per inch

Returns:

  • (String)

    PNG-encoded binary string

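Raises:

  • (ArgumentError)

    If page_index is negative.

  • (Errors::IOError)

    If the file at path does not exist.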



# File 'lib/kreuzberg/extraction_api.rb', line 365

def render_pdf_page(path, page_index, dpi: 150)
  path_str = path.to_s
  raise ArgumentError, 'page_index must be non-negative' if page_index.negative?
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  native_render_pdf_page(path_str, page_index, dpi)
end
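
Because the return value is a PNG-encoded binary string, a quick sanity check is to look for the 8-byte PNG signature before saving or forwarding it. A minimal sketch; png_bytes? is a hypothetical helper, and the sample data is a fabricated stand-in for real rendered output.

```ruby
# Hypothetical helper (not part of Kreuzberg): checks for the 8-byte PNG
# signature at the start of a binary string.
def png_bytes?(bytes)
  signature = "\x89PNG\r\n\x1A\n".b
  bytes.to_s.b.start_with?(signature)
end

# Stand-in for the value returned by render_pdf_page: signature plus filler.
fake_png = "\x89PNG\r\n\x1A\n".b + "\x00".b * 16

puts png_bytes?(fake_png)
puts png_bytes?("%PDF-1.7")
```

When saving real output, File.binwrite writes the string in binary mode, which avoids newline or encoding translation.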

#render_pdf_pages_iter(path, dpi: 150) {|page_index, png_bytes| ... } ⇒ Enumerator

Iterate over pages of a PDF lazily, yielding each page as it is rendered.

Each page is rendered via the native FFI iterator, so only one page is in memory at a time.

Parameters:

  • path (String, Pathname)

    Path to the PDF file

  • dpi (Integer) (defaults to: 150)

    Rendering resolution in dots per inch

Yield Parameters:

  • page_index (Integer)

    Zero-based page index

  • png_bytes (String)

    PNG-encoded binary string for the page

Returns:

  • (Enumerator)

    if no block is given

Raises:

  • (Errors::IOError)

    If the file at path does not exist.



# File 'lib/kreuzberg/extraction_api.rb', line 385

def render_pdf_pages_iter(path, dpi: 150, &block)
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  return enum_for(:render_pdf_pages_iter, path, dpi: dpi) unless block

  native_render_pdf_pages_iter(path_str, dpi, &block)
end
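
The block-or-Enumerator behaviour comes from the enum_for idiom used in the source above. The sketch below reproduces that idiom with a made-up page source so it can run without a PDF; each_fake_page and its page contents are assumptions for illustration only.

```ruby
# Stand-in for the native iterator: yields (page_index, bytes) pairs one at a
# time. The method name and page contents are made up for illustration.
def each_fake_page(page_count)
  return enum_for(:each_fake_page, page_count) unless block_given?

  page_count.times do |page_index|
    yield page_index, "page-#{page_index}-png-bytes"
  end
end

# Block form: pages are handled as they are produced.
each_fake_page(3) { |index, bytes| puts "page #{index}: #{bytes.bytesize} bytes" }

# Enumerator form: lazy composition, here stopping after the first two pages.
first_two = each_fake_page(3).lazy.map { |index, _bytes| index }.first(2)
p first_two
```

In render_pdf_pages_iter itself, the File.exist? check runs before enum_for, so a missing file raises immediately even when no block is given.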