Module: Kreuzberg::ExtractionAPI

Defined in:
lib/kreuzberg/extraction_api.rb

Instance Method Details

#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>

Asynchronously extract content from multiple byte data sources.

Non-blocking batch extraction from multiple in-memory binary documents. Results maintain the same order as the input data array. This method is preferred when processing multiple documents without blocking (e.g., handling multiple uploads).

Examples:

Batch extract uploaded documents asynchronously

# From a web request with multiple file uploads
uploaded_files = params[:files]  # Array of uploaded file objects
data = uploaded_files.map(&:read)
types = uploaded_files.map(&:content_type)

results = Kreuzberg.batch_extract_bytes(data_array: data, mime_types: types)
results.each { |r| puts r.content }

Batch extract with OCR

data = [scan_1_bytes, scan_2_bytes, scan_3_bytes]
types = ["image/png", "image/png", "image/png"]
config = Kreuzberg::Config::Extraction.new(force_ocr: true)
results = Kreuzberg.batch_extract_bytes(data_array: data, mime_types: types, config: config)

Parameters:

  • data_array (Array<String>)

    Array of binary document data. Each element can contain any byte values (e.g., PDF binary data).

  • mime_types (Array<String>)

    Array of MIME types corresponding to each data item. Must be the same length as data_array (e.g., ["application/pdf", "application/msword"]).

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all items. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input data. Array length matches the data_array length.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 314

def batch_extract_bytes(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
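
Since mime_types must be the same length as data_array, a caller may want to check that before the batch call. A minimal sketch, assuming a hypothetical helper named `paired_batch_inputs` that is not part of Kreuzberg:

```ruby
# Hypothetical helper (not part of Kreuzberg): checks that every document has
# a MIME type before the two arrays are handed to a batch call.
def paired_batch_inputs(data_array, mime_types)
  unless data_array.length == mime_types.length
    raise ArgumentError,
          "expected #{data_array.length} MIME types, got #{mime_types.length}"
  end

  # Same coercion as the method body above: every element goes through to_s.
  { data_array: data_array.map(&:to_s), mime_types: mime_types.map(&:to_s) }
end

# Placeholder bytes stand in for real document data.
inputs = paired_batch_inputs(["%PDF-1.7".b, "\x89PNG".b], ["application/pdf", "image/png"])
```

The returned hash uses the documented keyword names, so it could be splatted into the call as `Kreuzberg.batch_extract_bytes(**inputs)`.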

#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>

Synchronously extract content from multiple byte data sources.

Processes multiple in-memory binary documents in a single batch operation. Results maintain the same order as the input data array. The mime_types array must have the same length as the data_array.

Examples:

Batch extract binary documents

pdf_data_1 = File.read("doc1.pdf", binmode: true)
pdf_data_2 = File.read("doc2.pdf", binmode: true)
docx_data = File.read("report.docx", binmode: true)

data = [pdf_data_1, pdf_data_2, docx_data]
types = ["application/pdf", "application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"]
results = Kreuzberg.batch_extract_bytes_sync(data_array: data, mime_types: types)
results.each { |r| puts r.content }

Parameters:

  • data_array (Array<String>)

    Array of binary document data. Each element can contain any byte values (e.g., PDF binary data).

  • mime_types (Array<String>)

    Array of MIME types corresponding to each data item. Must be the same length as data_array (e.g., ["application/pdf", "application/msword"]).

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all items. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input data. Array length matches the data_array length.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 270

def batch_extract_bytes_sync(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes_sync(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
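
Building the two parallel arrays from a list of paths might look like the sketch below. `MIME_BY_EXTENSION` and `load_batch` are hypothetical names introduced for illustration, and the lookup table covers only the two formats used in the example above:

```ruby
require "tmpdir"

# Hypothetical lookup table; extend it for the formats you handle.
MIME_BY_EXTENSION = {
  ".pdf"  => "application/pdf",
  ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}.freeze

# Reads each file as binary and resolves its MIME type from the extension.
# Returns [data, types], two arrays in the same order as paths.
def load_batch(paths)
  data = paths.map { |path| File.binread(path) }
  types = paths.map do |path|
    MIME_BY_EXTENSION.fetch(File.extname(path).downcase) do
      raise ArgumentError, "no MIME type known for #{path}"
    end
  end
  [data, types]
end

# Demonstration with placeholder files in a temporary directory.
batch_data, batch_types = Dir.mktmpdir do |dir|
  paths = %w[a.pdf b.docx].map { |name| File.join(dir, name) }
  paths.each { |path| File.binwrite(path, "placeholder bytes") }
  load_batch(paths)
end
```

`File.binread` returns a binary-encoded String, which matches the data_array element type documented above.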

#batch_extract_files(paths:, config: nil) ⇒ Array<Result>

Asynchronously extract content from multiple files.

Non-blocking batch extraction from multiple files. Results maintain the same order as input paths. This is the preferred method for bulk processing when non-blocking I/O is required (e.g., in web servers or async applications).

Examples:

Batch extract multiple files asynchronously

paths = ["invoice_1.pdf", "invoice_2.pdf", "invoice_3.pdf"]
results = Kreuzberg.batch_extract_files(paths: paths)
results.each_with_index do |result, idx|
  puts "Invoice #{idx}: #{result.detected_languages}"
end

Batch extract with chunking

paths = Dir.glob("reports/*.docx")
config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(max_chars: 1000, max_overlap: 200)
)
results = Kreuzberg.batch_extract_files(paths: paths, config: config)

Parameters:

  • paths (Array<String, Pathname>)

    Array of file paths to extract. Each path is converted to a string, and the MIME type is auto-detected from the file extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all files. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input paths. Array length matches the input paths length.

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 231

def batch_extract_files(paths:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_files(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
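
Because results come back in input order, they can be paired with their paths by position. A minimal sketch, where `StandInResult` is a hypothetical substitute for Kreuzberg's Result so the snippet runs without the gem:

```ruby
# Hypothetical stand-in for Kreuzberg's Result; it only needs #content here.
StandInResult = Struct.new(:content)

paths = ["invoice_1.pdf", "invoice_2.pdf"]

# With the gem loaded this would be:
#   results = Kreuzberg.batch_extract_files(paths: paths)
results = [StandInResult.new("first invoice"), StandInResult.new("second invoice")]

# Pair each path with the content extracted from it, relying on the
# documented guarantee that results keep the order of the input paths.
content_by_path = paths.zip(results).to_h { |path, result| [path, result.content] }
```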

#batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>

Synchronously extract content from multiple files.

Processes multiple files in a single batch operation. Files are extracted sequentially, and results maintain the same order as the input paths. This is useful for bulk processing multiple documents with consistent configuration.

Examples:

Batch extract multiple PDFs

paths = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = Kreuzberg.batch_extract_files_sync(paths: paths)
results.each_with_index do |result, idx|
  puts "File #{idx}: #{result.content.length} characters"
end

Batch extract with consistent configuration

paths = Dir.glob("documents/*.pdf")
config = Kreuzberg::Config::Extraction.new(force_ocr: true)
results = Kreuzberg.batch_extract_files_sync(paths: paths, config: config)

Parameters:

  • paths (Array<String, Pathname>)

    Array of file paths to extract. Each path is converted to a string, and the MIME type is auto-detected from the file extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration applied to all files. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Array<Result>)

    Array of extraction results in the same order as input paths. Array length matches the input paths length.

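Raises:

  • (Errors::IOError)

    If any path in paths does not exist. All paths are checked before extraction starts.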



# File 'lib/kreuzberg/extraction_api.rb', line 100

def batch_extract_files_sync(paths:, config: nil)
  # Validate that all files exist
  paths.each do |path|
    path_str = path.to_s
    raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)
  end

  opts = normalize_config(config)
  hashes = native_batch_extract_files_sync(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
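
The existence check above raises on the first missing path, so one bad entry fails the whole batch. A caller who would rather skip missing files could filter the list first. A minimal sketch, assuming a hypothetical helper named `split_by_existence`:

```ruby
require "tmpdir"

# Hypothetical helper (not part of Kreuzberg): returns [present, missing],
# using the same File.exist? test as the method body above.
def split_by_existence(paths)
  paths.map(&:to_s).partition { |path| File.exist?(path) }
end

# Demonstration: one placeholder file exists, the other does not.
present, missing = Dir.mktmpdir do |dir|
  real = File.join(dir, "doc1.pdf")
  File.binwrite(real, "placeholder")
  split_by_existence([real, File.join(dir, "doc2.pdf")])
end
```

Only the `present` list would then be passed as `paths:`, while `missing` can be logged or reported separately.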

#extract_bytes(data:, mime_type:, config: nil) ⇒ Result

Asynchronously extract content from byte data.

Non-blocking extraction from in-memory binary data. Like #extract_file, this performs extraction in the background, making it suitable for handling high-volume extraction workloads without blocking the main thread.

Examples:

Extract PDF from memory asynchronously

pdf_data = File.read("document.pdf", binmode: true)
result = Kreuzberg.extract_bytes(data: pdf_data, mime_type: "application/pdf")
puts result.content

Extract with image extraction

data = File.read("file.docx", binmode: true)
config = Kreuzberg::Config::Extraction.new(
  image_extraction: Kreuzberg::Config::ImageExtraction.new(extract_images: true)
)
result = Kreuzberg.extract_bytes(data: data, mime_type: "application/vnd.openxmlformats-officedocument.wordprocessingml.document", config: config)

Parameters:

  • data (String)

    Binary document data (can contain any byte values)

  • mime_type (String)

    MIME type of the data (e.g., 'application/pdf'). Required: it tells the extraction engine how to treat the data.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images

Raises:



# File 'lib/kreuzberg/extraction_api.rb', line 190

def extract_bytes(data:, mime_type:, config: nil)
  opts = normalize_config(config)
  hash = native_extract_bytes(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
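
The method body calls `data.to_s`, so an IO-like object such as an upload or a StringIO should be read into a String first; calling `to_s` on the object itself yields its inspect-style name, not its contents. A minimal sketch, assuming a hypothetical helper named `bytes_from`:

```ruby
require "stringio"

# Hypothetical helper (not part of Kreuzberg): returns the raw bytes of
# either an IO-like object (anything responding to #read) or a String.
def bytes_from(source)
  raw = source.respond_to?(:read) ? source.read : source.to_s
  raw.b # binary-encoded copy, safe for arbitrary byte values
end

upload = StringIO.new("%PDF-1.7 placeholder")
pdf_bytes = bytes_from(upload)
```

`pdf_bytes` can then be passed as `data:` together with the matching `mime_type:`.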

#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result

Synchronously extract content from byte data.

Performs document extraction directly from binary data in memory. Useful for extracting content from files already loaded into memory or from network streams.

Examples:

Extract PDF from memory

pdf_data = File.read("document.pdf", binmode: true)
result = Kreuzberg.extract_bytes_sync(data: pdf_data, mime_type: "application/pdf")
puts result.content

Extract from a network stream

response = HTTParty.get("https://example.com/document.docx")
result = Kreuzberg.extract_bytes_sync(data: response.body, mime_type: "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

Parameters:

  • data (String)

    Binary document data (can contain any byte values)

  • mime_type (String)

    MIME type of the data (e.g., 'application/pdf'). Required: it tells the extraction engine how to treat the data.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images

Raises:

  • (TypeError)

    If mime_type is nil.



# File 'lib/kreuzberg/extraction_api.rb', line 59

def extract_bytes_sync(data:, mime_type:, config: nil)
  raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

  opts = normalize_config(config)
  hash = native_extract_bytes_sync(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
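
The guard above rejects only nil; any other value is converted with `to_s` and passed on. A caller wanting a stricter check could validate the value first. A minimal sketch, where `checked_mime_type` and `MIME_PATTERN` are hypothetical names and the pattern is a loose type/subtype test, not a full media-type grammar:

```ruby
# Loose "type/subtype" shape: word characters, dots, plus signs and hyphens.
MIME_PATTERN = %r{\A[\w.+-]+/[\w.+-]+\z}

# Hypothetical caller-side check (not part of Kreuzberg).
def checked_mime_type(value)
  unless value.is_a?(String)
    raise TypeError, "mime_type must be a String, got #{value.inspect}"
  end
  unless value.match?(MIME_PATTERN)
    raise ArgumentError, "not a type/subtype pair: #{value.inspect}"
  end

  value
end

pdf_type = checked_mime_type("application/pdf")
```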

#extract_file(path:, mime_type: nil, config: nil) ⇒ Result

Asynchronously extract content from a file.

Non-blocking extraction performed in the background using native threads or the Tokio runtime. The method returns a Result (see the Returns entry for this method), not a separate promise object. This method is preferred for I/O-bound operations and for integrating with async workflows.

Examples:

Extract a PDF file asynchronously

result = Kreuzberg.extract_file(path: "large_document.pdf")
puts result.content

Extract with custom OCR configuration

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(language: "deu")
)
result = Kreuzberg.extract_file(path: "document.pdf", config: config)

Parameters:

  • path (String, Pathname)

    Path to the document file to extract

  • mime_type (String, nil) (defaults to: nil)

    Optional MIME type for the file (e.g., ‘application/pdf’). If omitted, type is detected from file extension.

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images. In async contexts, this result is available upon method return.

Raises:

  • (Errors::IOError)

    If the file at path does not exist.



# File 'lib/kreuzberg/extraction_api.rb', line 144

def extract_file(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file(path_str, mime_type.to_s, **opts)
         else
           native_extract_file(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result

Synchronously extract content from a file.

Returns an extraction result containing content, metadata, tables, and images.

Examples:

Extract a PDF file

Extract with explicit MIME type

Extract with OCR enabled
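
The three cases listed above can be sketched from the signature alone. `StandInExtractor` is a hypothetical stand-in that only mirrors the documented keywords so the snippet runs without the native extension; the `force_ocr:` option is taken from the other examples on this page:

```ruby
# Hypothetical stand-in that mirrors the documented keyword signature.
# With the gem loaded, call Kreuzberg.extract_file_sync with the same keywords.
module StandInExtractor
  def self.extract_file_sync(path:, mime_type: nil, config: nil)
    { path: path.to_s, mime_type: mime_type, config: config }
  end
end

extractor = StandInExtractor

# Extract a PDF file: no MIME type is given.
basic = extractor.extract_file_sync(path: "document.pdf")

# Extract with an explicit MIME type.
typed = extractor.extract_file_sync(path: "document.bin", mime_type: "application/pdf")

# Extract with OCR enabled, passing the configuration as a plain Hash.
ocr = extractor.extract_file_sync(path: "scan.pdf", config: { force_ocr: true })
```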

Parameters:

  • path (String, Pathname)

    Path to the document file to extract

  • mime_type (String, nil) (defaults to: nil)

    Optional MIME type for the file (e.g., ‘application/pdf’).

  • config (Config::Extraction, Hash, nil) (defaults to: nil)

    Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

  • (Result)

    Extraction result containing content, metadata, tables, and images

Raises:

  • (Errors::IOError)

    If the file at path does not exist.



# File 'lib/kreuzberg/extraction_api.rb', line 17

def extract_file_sync(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file_sync(path_str, mime_type.to_s, **opts)
         else
           native_extract_file_sync(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#normalize_config(config) ⇒ Object

Converts the config argument into an options hash: nil becomes an empty hash, a Hash is returned unchanged, and any other object is converted with to_h.



# File 'lib/kreuzberg/extraction_api.rb', line 322

def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end
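
The three branches can be exercised in isolation. The sketch below copies the method body shown above and uses `StandInConfig`, a hypothetical Struct in place of Config::Extraction, so it runs without the gem:

```ruby
# Copy of the method shown above, so the sketch is self-contained.
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end

# Hypothetical stand-in for Config::Extraction; only #to_h matters here.
StandInConfig = Struct.new(:force_ocr, keyword_init: true)

from_nil    = normalize_config(nil)                                # empty hash
given_hash  = { force_ocr: true }
from_hash   = normalize_config(given_hash)                         # the same Hash object
from_object = normalize_config(StandInConfig.new(force_ocr: true)) # result of #to_h
```

Note that a Hash is passed through as the same object, not a copy, so changes made to it later remain visible to the caller.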