Module: Kreuzberg::ExtractionAPI

Defined in: lib/kreuzberg/extraction_api.rb

Instance Method Summary

#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>

Asynchronously extract content from multiple byte data sources.
#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>

Synchronously extract content from multiple byte data sources.
#batch_extract_files(paths:, config: nil) ⇒ Array<Result>

Asynchronously extract content from multiple files.
#batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>

Synchronously extract content from multiple files.
#embed(texts:, config: nil) ⇒ Array<Array<Float>>

Asynchronously generate embeddings for multiple texts.
#embed_sync(texts:, config: nil) ⇒ Array<Array<Float>>

Synchronously generate embeddings for multiple texts.
#extract_bytes(data:, mime_type:, config: nil) ⇒ Result

Asynchronously extract content from byte data.
#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result

Synchronously extract content from byte data.
#extract_file(path:, mime_type: nil, config: nil) ⇒ Result

Asynchronously extract content from a file.
#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result

Synchronously extract content from a file.
#normalize_config(config) ⇒ Object
#render_pdf_page(path, page_index, dpi: 150) ⇒ String

Render a single PDF page as a PNG image.
#render_pdf_pages_iter(path, dpi: 150) {|page_index, png_bytes| ... } ⇒ Enumerator

Iterate over pages of a PDF lazily, yielding each page as it is rendered.

Instance Method Details

#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ `Array<Result>`

Asynchronously extract content from multiple byte data sources.

Non-blocking batch extraction from multiple in-memory binary documents. Results maintain the same order as the input data array. This method is preferred when processing multiple documents without blocking (e.g., handling multiple uploads).

Examples:

Batch extract uploaded documents asynchronously

# From a web request with multiple file uploads
uploaded_files = params[:files]  # Array of uploaded file objects
data = uploaded_files.map(&:read)
types = uploaded_files.map(&:content_type)

results = Kreuzberg.batch_extract_bytes(data_array: data, mime_types: types)
results.each { |r| puts r.content }

Batch extract with OCR

data = [scan_1_bytes, scan_2_bytes, scan_3_bytes]
types = ["image/png", "image/png", "image/png"]
config = Kreuzberg::Config::Extraction.new(force_ocr: true)
results = Kreuzberg.batch_extract_bytes(data_array: data, mime_types: types, config: config)

Parameters:

data_array (Array<String>) —

Array of binary document data. Each element can contain any byte values (e.g., PDF binary data).
mime_types (Array<String>) —

Array of MIME types corresponding to each data item. Must be the same length as data_array (e.g., ["application/pdf", "application/msword"]).
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration applied to all items. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Array<Result>) —

Array of extraction results in the same order as input data. Array length matches the data_array length.

Raises:

(ArgumentError) —

If data_array and mime_types have different lengths
(Errors::ParsingError) —

If any document parsing fails
(Errors::UnsupportedFormatError) —

If any MIME type is not supported
(Errors::OCRError) —

If OCR is enabled and fails on any document
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 349

def batch_extract_bytes(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
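
The length requirement on data_array and mime_types can also be checked on the caller's side before the batch is submitted. The following is a minimal sketch under that assumption; `check_batch_inputs!` is a hypothetical helper and not part of Kreuzberg.

```ruby
# Hypothetical pre-flight check (not part of Kreuzberg): verifies that the two
# parallel arrays line up and returns [data, mime_type] pairs in input order.
def check_batch_inputs!(data_array, mime_types)
  unless data_array.length == mime_types.length
    raise ArgumentError,
          "data_array has #{data_array.length} items but mime_types has #{mime_types.length}"
  end

  data_array.zip(mime_types)
end

# Stand-in byte strings; real data would be full document contents.
pairs = check_batch_inputs!(["%PDF-1.7", "PK\x03\x04"], ["application/pdf", "application/zip"])
pairs.each_with_index do |(data, type), idx|
  puts "#{idx}: #{type} (#{data.bytesize} bytes)"
end
```

Once the check passes, the same two arrays can be handed to batch_extract_bytes as data_array: and mime_types:.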

#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ `Array<Result>`

Synchronously extract content from multiple byte data sources.

Processes multiple in-memory binary documents in a single batch operation. Results maintain the same order as the input data array. The mime_types array must have the same length as the data_array.

Examples:

Batch extract binary documents

pdf_data_1 = File.read("doc1.pdf", binmode: true)
pdf_data_2 = File.read("doc2.pdf", binmode: true)
docx_data = File.read("report.docx", binmode: true)

data = [pdf_data_1, pdf_data_2, docx_data]
types = ["application/pdf", "application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"]
results = Kreuzberg.batch_extract_bytes_sync(data_array: data, mime_types: types)
results.each { |r| puts r.content }

Parameters:

data_array (Array<String>) —

Array of binary document data. Each element can contain any byte values (e.g., PDF binary data).
mime_types (Array<String>) —

Array of MIME types corresponding to each data item. Must be the same length as data_array (e.g., ["application/pdf", "application/msword"]).
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration applied to all items. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Array<Result>) —

Array of extraction results in the same order as input data. Array length matches the data_array length.

Raises:

(ArgumentError) —

If data_array and mime_types have different lengths
(Errors::ParsingError) —

If any document parsing fails
(Errors::UnsupportedFormatError) —

If any MIME type is not supported
(Errors::OCRError) —

If OCR is enabled and fails on any document
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 305

def batch_extract_bytes_sync(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes_sync(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end

#batch_extract_files(paths:, config: nil) ⇒ `Array<Result>`

Asynchronously extract content from multiple files.

Non-blocking batch extraction from multiple files. Results maintain the same order as input paths. This is the preferred method for bulk processing when non-blocking I/O is required (e.g., in web servers or async applications).

Examples:

Batch extract multiple files asynchronously

paths = ["invoice_1.pdf", "invoice_2.pdf", "invoice_3.pdf"]
results = Kreuzberg.batch_extract_files(paths: paths)
results.each_with_index do |result, idx|
  puts "Invoice #{idx}: #{result.detected_languages}"
end

Batch extract with chunking

paths = Dir.glob("reports/*.docx")
config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(max_chars: 1000, max_overlap: 200)
)
results = Kreuzberg.batch_extract_files(paths: paths, config: config)

Parameters:

paths (Array<String, Pathname>) —

Array of file paths to extract. Each path is converted to a string and MIME type is auto-detected from extension.
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration applied to all files. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Array<Result>) —

Array of extraction results in the same order as input paths. Array length matches the input paths length.

Raises:

(Errors::IOError) —

If any file cannot be read
(Errors::ParsingError) —

If any document parsing fails
(Errors::UnsupportedFormatError) —

If any file format is not supported
(Errors::OCRError) —

If OCR is enabled and fails on any document
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 231

def batch_extract_files(paths:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_files(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end

#batch_extract_files_sync(paths:, config: nil) ⇒ `Array<Result>`

Synchronously extract content from multiple files.

Processes multiple files in a single batch operation. Files are extracted sequentially, and results maintain the same order as the input paths. This is useful for bulk processing multiple documents with consistent configuration.

Examples:

Batch extract multiple PDFs

paths = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = Kreuzberg.batch_extract_files_sync(paths: paths)
results.each_with_index do |result, idx|
  puts "File #{idx}: #{result.content.length} characters"
end

Batch extract with consistent configuration

paths = Dir.glob("documents/*.pdf")
config = Kreuzberg::Config::Extraction.new(force_ocr: true)
results = Kreuzberg.batch_extract_files_sync(paths: paths, config: config)

Parameters:

paths (Array<String, Pathname>) —

Array of file paths to extract. Each path is converted to a string and MIME type is auto-detected from extension.
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration applied to all files. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Array<Result>) —

Array of extraction results in the same order as input paths. Array length matches the input paths length.

Raises:

(Errors::IOError) —

If any file cannot be read
(Errors::ParsingError) —

If any document parsing fails
(Errors::UnsupportedFormatError) —

If any file format is not supported
(Errors::OCRError) —

If OCR is enabled and fails on any document
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 100

def batch_extract_files_sync(paths:, config: nil)
  # Validate that all files exist
  paths.each do |path|
    path_str = path.to_s
    raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)
  end

  opts = normalize_config(config)
  hashes = native_batch_extract_files_sync(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
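
Because the method raises Errors::IOError on the first path that does not exist, a caller may want to filter the list beforehand so that one missing file does not abort the whole batch. This is a sketch of that idea only; `partition_existing` is a hypothetical helper, not a Kreuzberg API.

```ruby
require "tmpdir"

# Hypothetical helper (not part of Kreuzberg): splits paths into those that
# exist on disk and those that do not, keeping input order within each group.
def partition_existing(paths)
  paths.map(&:to_s).partition { |path| File.exist?(path) }
end

Dir.mktmpdir do |dir|
  present = File.join(dir, "doc1.pdf")
  File.binwrite(present, "%PDF-1.7") # placeholder content, not a real PDF
  absent = File.join(dir, "missing.pdf")

  existing, missing = partition_existing([present, absent])
  warn "Skipping missing files: #{missing.join(', ')}" unless missing.empty?
  # `existing` could then be passed as paths: to batch_extract_files_sync.
  puts "#{existing.length} file(s) ready for extraction"
end
```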

#embed(texts:, config: nil) ⇒ `Array<Array<Float>>`

Asynchronously generate embeddings for multiple texts.

Non-blocking embedding generation from a list of strings.

Examples:

Generate embeddings asynchronously

texts = ["Hello, world!", "Kreuzberg is awesome."]
embeddings = Kreuzberg.embed(texts: texts)
puts embeddings.first.length # 384

Parameters:

texts (Array<String>) —

List of strings to embed.
config (Config::Embedding, Hash, nil) (defaults to: nil) —

Embedding configuration.

Returns:

(Array<Array<Float>>) —

Array of embedding vectors.

Raises:

(Errors::EmbeddingError) —

If embedding generation fails.

# File 'lib/kreuzberg/extraction_api.rb', line 254

def embed(texts:, config: nil)
  opts = normalize_config(config)
  native_embed(texts: texts.map(&:to_s), config: opts)
end
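
Since the return value is an array of float vectors, a common next step is comparing two of them. The sketch below shows cosine similarity in plain Ruby. It uses stand-in vectors rather than real model output, and `cosine_similarity` is an illustrative helper, not something Kreuzberg is known to provide.

```ruby
# Cosine similarity between two equal-length embedding vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Stand-in vectors; real ones would come from Kreuzberg.embed(texts: [...]).
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
puts cosine_similarity(embeddings[0], embeddings[2]) # same direction
puts cosine_similarity(embeddings[0], embeddings[1]) # orthogonal
```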

#embed_sync(texts:, config: nil) ⇒ `Array<Array<Float>>`

Synchronously generate embeddings for multiple texts.

Blocking embedding generation from a list of strings.

Parameters:

texts (Array<String>) —

List of strings to embed.
config (Config::Embedding, Hash, nil) (defaults to: nil) —

Embedding configuration.

Returns:

(Array<Array<Float>>) —

Array of embedding vectors.

Raises:

(Errors::EmbeddingError) —

If embedding generation fails.

# File 'lib/kreuzberg/extraction_api.rb', line 269

def embed_sync(texts:, config: nil)
  opts = normalize_config(config)
  native_embed_sync(texts: texts.map(&:to_s), config: opts)
end

#extract_bytes(data:, mime_type:, config: nil) ⇒ `Result`

Asynchronously extract content from byte data.

Non-blocking extraction from in-memory binary data. Like #extract_file, this performs extraction in the background, making it suitable for handling high-volume extraction workloads without blocking the main thread.

Examples:

Extract PDF from memory asynchronously

pdf_data = File.read("document.pdf", binmode: true)
result = Kreuzberg.extract_bytes(data: pdf_data, mime_type: "application/pdf")
puts result.content

Extract with image extraction

data = File.read("file.docx", binmode: true)
config = Kreuzberg::Config::Extraction.new(
  image_extraction: Kreuzberg::Config::ImageExtraction.new(extract_images: true)
)
result = Kreuzberg.extract_bytes(data: data, mime_type: "application/vnd.openxmlformats-officedocument.wordprocessingml.document", config: config)

Parameters:

data (String) —

Binary document data (can contain any byte values)
mime_type (String) —

MIME type of the data (e.g., 'application/pdf'). Required: it tells the extraction engine how to interpret the bytes.
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Result) —

Extraction result containing content, metadata, tables, and images

Raises:

(Errors::ParsingError) —

If document parsing fails
(Errors::UnsupportedFormatError) —

If the MIME type is not supported
(Errors::OCRError) —

If OCR is enabled and fails
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 190

def extract_bytes(data:, mime_type:, config: nil)
  opts = normalize_config(config)
  hash = native_extract_bytes(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ `Result`

Synchronously extract content from byte data.

Performs document extraction directly from binary data in memory. Useful for extracting content from files already loaded into memory or from network streams.

Examples:

Extract PDF from memory

pdf_data = File.read("document.pdf", binmode: true)
result = Kreuzberg.extract_bytes_sync(data: pdf_data, mime_type: "application/pdf")
puts result.content

Extract from a network stream

require "httparty"

response = HTTParty.get("https://example.com/document.docx")
result = Kreuzberg.extract_bytes_sync(data: response.body, mime_type: "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

Parameters:

data (String) —

Binary document data (can contain any byte values)
mime_type (String) —

MIME type of the data (e.g., 'application/pdf'). Required: it tells the extraction engine how to interpret the bytes.
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Result) —

Extraction result containing content, metadata, tables, and images

Raises:

(TypeError) —

If mime_type is nil
(Errors::ParsingError) —

If document parsing fails
(Errors::UnsupportedFormatError) —

If the MIME type is not supported
(Errors::OCRError) —

If OCR is enabled and fails
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 59

def extract_bytes_sync(data:, mime_type:, config: nil)
  raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

  opts = normalize_config(config)
  hash = native_extract_bytes_sync(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#extract_file(path:, mime_type: nil, config: nil) ⇒ `Result`

Asynchronously extract content from a file.

Non-blocking extraction that returns a Result. Extraction is performed in the background using native threads or the Tokio runtime. This method is preferred for I/O-bound operations and for integrating with async workflows.

Examples:

Extract a PDF file asynchronously

result = Kreuzberg.extract_file(path: "large_document.pdf")
puts result.content

Extract with custom OCR configuration

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(language: "deu")
)
result = Kreuzberg.extract_file(path: "document.pdf", config: config)

Parameters:

path (String, Pathname) —

Path to the document file to extract
mime_type (String, nil) (defaults to: nil) —

Optional MIME type for the file (e.g., 'application/pdf'). If omitted, type is detected from file extension.
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Result) —

Extraction result containing content, metadata, tables, and images. In async contexts, this result is available upon method return.

Raises:

(Errors::IOError) —

If the file cannot be read or access is denied
(Errors::ParsingError) —

If document parsing fails
(Errors::UnsupportedFormatError) —

If the file format is not supported
(Errors::OCRError) —

If OCR is enabled and fails
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 144

def extract_file(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file(path_str, mime_type.to_s, **opts)
         else
           native_extract_file(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ `Result`

Synchronously extract content from a file.

Examples:

Extract a PDF file

Extract with explicit MIME type

Extract with OCR enabled
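
The three example titles above carry no code in this copy of the page. The following sketch shows what such calls might look like, based only on the keyword signature documented here. The `api` parameter and the `FakeAPI` stand-in are hypothetical and exist so that the snippet runs without the gem installed. The `force_ocr` key is taken from the Config::Extraction examples elsewhere on this page, and passing it as a plain hash is an assumption.

```ruby
# Sketch only: `api` is any object exposing extract_file_sync with the keyword
# signature documented here (in real use, presumably the Kreuzberg module).
def extract_examples(api, path, ocr_config)
  [
    # Extract a PDF file, leaving MIME type detection to the library.
    api.extract_file_sync(path: path),
    # Extract with an explicit MIME type.
    api.extract_file_sync(path: path, mime_type: "application/pdf"),
    # Extract with OCR enabled through the config argument.
    api.extract_file_sync(path: path, config: ocr_config)
  ]
end

# Stand-in used for illustration; it echoes its arguments instead of extracting.
class FakeAPI
  def extract_file_sync(path:, mime_type: nil, config: nil)
    { path: path, mime_type: mime_type, config: config }
  end
end

results = extract_examples(FakeAPI.new, "document.pdf", { force_ocr: true })
results.each { |r| p r }
```

With the gem loaded, the same helper could be called as extract_examples(Kreuzberg, "document.pdf", Kreuzberg::Config::Extraction.new(force_ocr: true)).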

Parameters:

path (String, Pathname) —

Path to the document file to extract
mime_type (String, nil) (defaults to: nil) —

Optional MIME type for the file (e.g., 'application/pdf'). If omitted, type is detected from file extension.
config (Config::Extraction, Hash, nil) (defaults to: nil) —

Extraction configuration. Accepts either a Config::Extraction object or a configuration hash.

Returns:

(Result) —

Extraction result containing content, metadata, tables, and images

Raises:

(Errors::IOError) —

If the file cannot be read or access is denied
(Errors::ParsingError) —

If document parsing fails
(Errors::UnsupportedFormatError) —

If the file format is not supported
(Errors::OCRError) —

If OCR is enabled and fails
(Errors::MissingDependencyError) —

If a required dependency is missing

# File 'lib/kreuzberg/extraction_api.rb', line 17

def extract_file_sync(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file_sync(path_str, mime_type.to_s, **opts)
         else
           native_extract_file_sync(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end

#normalize_config(config) ⇒ `Object`

# File 'lib/kreuzberg/extraction_api.rb', line 394

def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end
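
This method has no description on the page, but its three branches are easy to see in isolation. The sketch below copies the method body shown above and exercises it with nil, a Hash, and an object that responds to to_h. `FakeConfig` is a stand-in Struct, not a Kreuzberg class.

```ruby
# Stand-in for a config object; the real classes live under Kreuzberg::Config.
FakeConfig = Struct.new(:force_ocr, keyword_init: true)

# Same three branches as the method shown above.
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end

p normalize_config(nil)                              # nil becomes an empty hash
p normalize_config({ force_ocr: true })              # hashes pass through unchanged
p normalize_config(FakeConfig.new(force_ocr: true))  # objects are converted via to_h
```

The result is what the extraction methods splat into the native call as keyword options.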

#render_pdf_page(path, page_index, dpi: 150) ⇒ `String`

Render a single PDF page as a PNG image.

Parameters:

path (String, Pathname) —

Path to the PDF file
page_index (Integer) —

Zero-based page index
dpi (Integer) (defaults to: 150) —

Rendering resolution (default 150)

Returns:

(String) —

PNG-encoded binary string

Raises:

(ArgumentError) —

If page_index is negative
(Errors::IOError) —

If the file cannot be read
(Errors::ParsingError) —

If rendering fails

# File 'lib/kreuzberg/extraction_api.rb', line 365

def render_pdf_page(path, page_index, dpi: 150)
  path_str = path.to_s
  raise ArgumentError, 'page_index must be non-negative' if page_index.negative?
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  native_render_pdf_page(path_str, page_index, dpi)
end
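
The returned String holds raw PNG bytes, so it should be written in binary mode. The sketch below assumes a hypothetical `save_png` helper (not part of Kreuzberg) that checks the standard 8-byte PNG signature before writing. It uses stand-in bytes instead of a real rendered page.

```ruby
require "tmpdir"

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Hypothetical helper (not part of Kreuzberg): writes PNG bytes in binary mode
# after checking the signature, and returns the number of bytes written.
def save_png(png_bytes, out_path)
  bytes = png_bytes.b
  raise ArgumentError, "not PNG data" unless bytes.start_with?(PNG_SIGNATURE)

  File.binwrite(out_path, bytes)
end

Dir.mktmpdir do |dir|
  # Stand-in bytes; real ones would come from render_pdf_page(path, 0, dpi: 150).
  fake_png = PNG_SIGNATURE + "rest-of-image".b
  written = save_png(fake_png, File.join(dir, "page-0.png"))
  puts "wrote #{written} bytes"
end
```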

#render_pdf_pages_iter(path, dpi: 150) {|page_index, png_bytes| ... } ⇒ `Enumerator`

Iterate over pages of a PDF lazily, yielding each page as it is rendered.

Each page is rendered via the native FFI iterator, so only one page is in memory at a time.

Parameters:

path (String, Pathname) —

Path to the PDF file
dpi (Integer) (defaults to: 150) —

Rendering resolution (default 150)

Yield Parameters:

page_index (Integer) —

Zero-based page index
png_bytes (String) —

PNG-encoded binary string for the page

Returns:

(Enumerator) —

if no block is given

Raises:

(Errors::IOError) —

If the file cannot be read
(Errors::ParsingError) —

If rendering fails

# File 'lib/kreuzberg/extraction_api.rb', line 385

def render_pdf_pages_iter(path, dpi: 150, &block)
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  return enum_for(:render_pdf_pages_iter, path, dpi: dpi) unless block

  native_render_pdf_pages_iter(path_str, dpi, &block)
end
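
The block-or-Enumerator contract used here (`enum_for` when no block is given) is what lets callers stop early without rendering every page. The sketch below illustrates that pattern with a stand-in class. `FakeRenderer` and `each_page` are hypothetical and only mimic the shape of the method above.

```ruby
# Stand-in renderer (hypothetical): mimics the block/Enumerator contract of
# render_pdf_pages_iter without touching a real PDF.
class FakeRenderer
  attr_reader :rendered

  def initialize(page_count)
    @page_count = page_count
    @rendered = 0
  end

  def each_page(dpi: 150, &block)
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(:each_page, dpi: dpi) unless block

    @page_count.times do |page_index|
      @rendered += 1 # counts how many pages were actually produced
      yield page_index, "png-bytes-for-page-#{page_index}@#{dpi}dpi"
    end
  end
end

renderer = FakeRenderer.new(100)
first_two = renderer.each_page(dpi: 72).first(2)
puts first_two.map(&:first).inspect
puts renderer.rendered
```

Taking only the first two elements stops the iteration there, so the stand-in never produces the remaining pages.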
