Module: Kreuzberg::ExtractionAPI
- Defined in:
- lib/kreuzberg/extraction_api.rb
Instance Method Summary
-
#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>
Asynchronously extract content from multiple byte data sources.
-
#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>
Synchronously extract content from multiple byte data sources.
-
#batch_extract_files(paths:, config: nil) ⇒ Array<Result>
Asynchronously extract content from multiple files.
-
#batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>
Synchronously extract content from multiple files.
-
#embed(texts:, config: nil) ⇒ Array<Array<Float>>
Asynchronously generate embeddings for multiple texts.
-
#embed_sync(texts:, config: nil) ⇒ Array<Array<Float>>
Synchronously generate embeddings for multiple texts.
-
#extract_bytes(data:, mime_type:, config: nil) ⇒ Result
Asynchronously extract content from byte data.
-
#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result
Synchronously extract content from byte data.
-
#extract_file(path:, mime_type: nil, config: nil) ⇒ Result
Asynchronously extract content from a file.
-
#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result
Synchronously extract content from a file.
-
#normalize_config(config) ⇒ Object
-
#render_pdf_page(path, page_index, dpi: 150) ⇒ String
Render a single PDF page as a PNG image.
-
#render_pdf_pages_iter(path, dpi: 150) {|page_index, png_bytes| ... } ⇒ Enumerator
Iterate over pages of a PDF lazily, yielding each page as it is rendered.
Instance Method Details
#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>
Asynchronously extract content from multiple byte data sources.
Non-blocking batch extraction from multiple in-memory binary documents. Results maintain the same order as the input data array. This method is preferred when processing multiple documents without blocking (e.g., handling multiple uploads).
# File 'lib/kreuzberg/extraction_api.rb', line 349
def batch_extract_bytes(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>
Synchronously extract content from multiple byte data sources.
Processes multiple in-memory binary documents in a single batch operation. Results maintain the same order as the input data array. The mime_types array must have the same length as the data_array.
# File 'lib/kreuzberg/extraction_api.rb', line 305
def batch_extract_bytes_sync(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes_sync(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
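Since `mime_types` must match `data_array` in length, it can help to check the pairing before making a batch call. The following is a minimal sketch only; `check_batch_inputs!` is a hypothetical helper and not part of Kreuzberg.

```ruby
# Hypothetical helper (not part of Kreuzberg): verifies that every byte
# payload has a matching MIME type before a batch call is attempted.
def check_batch_inputs!(data_array, mime_types)
  unless data_array.length == mime_types.length
    raise ArgumentError,
          "expected #{data_array.length} MIME types, got #{mime_types.length}"
  end

  # Pair each payload with its MIME type; input order is preserved.
  data_array.zip(mime_types)
end

pairs = check_batch_inputs!(["%PDF-1.7", "hello"], ["application/pdf", "text/plain"])
pairs.each { |data, mime| puts "#{mime}: #{data.bytesize} bytes" }
```

A mismatch then surfaces as an `ArgumentError` in Ruby code, before anything is handed to the native layer.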
#batch_extract_files(paths:, config: nil) ⇒ Array<Result>
Asynchronously extract content from multiple files.
Non-blocking batch extraction from multiple files. Results maintain the same order as input paths. This is the preferred method for bulk processing when non-blocking I/O is required (e.g., in web servers or async applications).
# File 'lib/kreuzberg/extraction_api.rb', line 231
def batch_extract_files(paths:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_files(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
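Because results keep the order of the input paths, each path can be lined up with its own result by position. The sketch below uses a stand-in `FakeResult` struct and invented data so that it runs without the gem; the `content` reader is an assumption based on the result description in this module.

```ruby
# Stand-in for Kreuzberg's Result; only a content reader is assumed here.
FakeResult = Struct.new(:content)

paths = ["a.pdf", "b.docx", "c.txt"]
# Invented results, in the same order as the paths above.
results = paths.map { |path| FakeResult.new("text from #{path}") }

# Because order is preserved, zip lines each path up with its own result.
by_path = paths.zip(results).to_h
puts by_path["b.docx"].content
```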
#batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>
Synchronously extract content from multiple files.
Processes multiple files in a single batch operation. Files are extracted sequentially, and results maintain the same order as the input paths. This is useful for bulk processing multiple documents with consistent configuration.
# File 'lib/kreuzberg/extraction_api.rb', line 100
def batch_extract_files_sync(paths:, config: nil)
  # Validate that all files exist
  paths.each do |path|
    path_str = path.to_s
    raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)
  end

  opts = normalize_config(config)
  hashes = native_batch_extract_files_sync(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
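The listing above raises on the first missing path, which aborts the whole batch. One way to avoid that is to separate missing files beforehand. This is a sketch under that assumption; `partition_existing` is a hypothetical helper, not a Kreuzberg method.

```ruby
require "tempfile"

# Hypothetical helper (not part of Kreuzberg): splits candidate paths into
# those that exist and those that do not, using the same File.exist? test
# that the listing above applies.
def partition_existing(paths)
  paths.map(&:to_s).partition { |path| File.exist?(path) }
end

Tempfile.create("sample") do |file|
  present, missing = partition_existing([file.path, "/no/such/file.pdf"])
  puts "extractable: #{present.inspect}"
  puts "skipped:     #{missing.inspect}"
end
```

Only the first array would then be passed as `paths:`, while the second can be logged or reported to the caller.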
#embed(texts:, config: nil) ⇒ Array<Array<Float>>
Asynchronously generate embeddings for multiple texts.
Non-blocking embedding generation from a list of strings.
# File 'lib/kreuzberg/extraction_api.rb', line 254
def embed(texts:, config: nil)
  opts = normalize_config(config)
  (texts: texts.map(&:to_s), config: opts)
end
#embed_sync(texts:, config: nil) ⇒ Array<Array<Float>>
Synchronously generate embeddings for multiple texts.
Blocking embedding generation from a list of strings.
# File 'lib/kreuzberg/extraction_api.rb', line 269
def embed_sync(texts:, config: nil)
  opts = normalize_config(config)
  (texts: texts.map(&:to_s), config: opts)
end
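Both embedding methods return one `Array<Float>` per input text. A common next step is to compare two of those vectors. The sketch below computes cosine similarity in plain Ruby; the vectors are made up for illustration and do not come from Kreuzberg, and `cosine_similarity` is a hypothetical helper.

```ruby
# Cosine similarity between two equal-length embedding vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # A zero vector has no direction; treat its similarity as 0.0.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Made-up vectors standing in for the Array<Array<Float>> return value.
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
puts cosine_similarity(vectors[0], vectors[2]) # same direction
puts cosine_similarity(vectors[0], vectors[1]) # orthogonal
```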
#extract_bytes(data:, mime_type:, config: nil) ⇒ Result
Asynchronously extract content from byte data.
Non-blocking extraction from in-memory binary data. Like #extract_file, this performs extraction in the background, making it suitable for handling high-volume extraction workloads without blocking the main thread.
# File 'lib/kreuzberg/extraction_api.rb', line 190
def extract_bytes(data:, mime_type:, config: nil)
  opts = normalize_config(config)
  hash = native_extract_bytes(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result
Synchronously extract content from byte data.
Performs document extraction directly from binary data in memory. Useful for extracting content from files already loaded into memory or from network streams.
# File 'lib/kreuzberg/extraction_api.rb', line 59
def extract_bytes_sync(data:, mime_type:, config: nil)
  raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

  opts = normalize_config(config)
  hash = native_extract_bytes_sync(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
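The guard in the listing above rejects only `nil`; any other value is coerced with `to_s`. The sketch below reproduces that behavior in a stand-in module so it can run without the native extension. `FakeExtraction` and the Hash it returns are invented for illustration.

```ruby
# Stand-in for the documented wrapper so the sketch runs without the native
# extension. Only the nil guard and the to_s coercions mirror the listing
# above; the returned Hash is invented for illustration.
module FakeExtraction
  def self.extract_bytes_sync(data:, mime_type:, config: nil)
    raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

    { bytes: data.to_s.bytesize, mime_type: mime_type.to_s }
  end
end

begin
  FakeExtraction.extract_bytes_sync(data: "hello", mime_type: nil)
rescue TypeError => e
  puts "rejected: #{e.message}"
end

# A Symbol passes the guard because only nil is rejected before to_s is called.
p FakeExtraction.extract_bytes_sync(data: "hello", mime_type: :"text/plain")
```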
#extract_file(path:, mime_type: nil, config: nil) ⇒ Result
Asynchronously extract content from a file.
Non-blocking extraction that returns a Result. Extraction is performed in the background using native threads or the Tokio runtime. This method is preferred for I/O-bound operations and for integrating with async workflows.
# File 'lib/kreuzberg/extraction_api.rb', line 144
def extract_file(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file(path_str, mime_type.to_s, **opts)
         else
           native_extract_file(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result
Synchronously extract content from a file.
Returns the extraction result containing content, metadata, tables, and images.
# File 'lib/kreuzberg/extraction_api.rb', line 17
def extract_file_sync(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file_sync(path_str, mime_type.to_s, **opts)
         else
           native_extract_file_sync(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
#normalize_config(config) ⇒ Object
# File 'lib/kreuzberg/extraction_api.rb', line 394
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end
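The three branches of `normalize_config` can be tried in isolation. The sketch below copies the method body from the listing above into a top-level method; `OcrSettings` and its fields are hypothetical and stand in for any config object that responds to `to_h`.

```ruby
# Standalone copy of the normalization logic from the listing above.
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end

# Hypothetical config object; any object that responds to to_h would do.
OcrSettings = Struct.new(:language, :dpi, keyword_init: true)

p normalize_config(nil)
p normalize_config({ force_ocr: true })
p normalize_config(OcrSettings.new(language: "eng", dpi: 300))
```

Note that a Hash is returned as the same object rather than a copy, so later mutation of the caller's Hash is visible to anything still holding the normalized options.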
#render_pdf_page(path, page_index, dpi: 150) ⇒ String
Render a single PDF page as a PNG image.
# File 'lib/kreuzberg/extraction_api.rb', line 365
def render_pdf_page(path, page_index, dpi: 150)
  path_str = path.to_s
  raise ArgumentError, 'page_index must be non-negative' if page_index.negative?
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  native_render_pdf_page(path_str, page_index, dpi)
end
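The return value is a String of PNG bytes, so it should be handled as binary data. As a sanity check, the bytes can be compared against the 8-byte PNG file signature. The sketch below uses invented sample data rather than a real render, and `png_bytes?` is a hypothetical helper.

```ruby
# The 8-byte signature that begins every PNG file.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b.freeze

# Hypothetical helper (not part of Kreuzberg): true when the bytes start with
# the PNG signature.
def png_bytes?(bytes)
  bytes.to_s.b.start_with?(PNG_SIGNATURE)
end

# Invented sample data standing in for a render_pdf_page return value.
fake_png = PNG_SIGNATURE + "rest-of-image".b
puts png_bytes?(fake_png)
puts png_bytes?("%PDF-1.7")
```

When saving such bytes to disk, `File.binwrite` avoids any newline or encoding translation.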
#render_pdf_pages_iter(path, dpi: 150) {|page_index, png_bytes| ... } ⇒ Enumerator
Iterate over pages of a PDF lazily, yielding each page as it is rendered.
Each page is rendered via the native FFI iterator, so only one page is in memory at a time.
# File 'lib/kreuzberg/extraction_api.rb', line 385
def render_pdf_pages_iter(path, dpi: 150, &block)
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)
  return enum_for(:render_pdf_pages_iter, path, dpi: dpi) unless block

  native_render_pdf_pages_iter(path_str, dpi, &block)
end
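The listing above follows the usual Ruby block-or-Enumerator pattern: with a block it yields each page, and without one it returns an Enumerator via `enum_for`. The sketch below shows the same pattern with a stand-in class, so that the early-stop behavior can be observed without a PDF. `FakeRenderer`, `each_page`, and the page strings are invented for illustration.

```ruby
# Stand-in renderer (not Kreuzberg): yields invented page data, but follows
# the same block-or-Enumerator pattern as the listing above.
class FakeRenderer
  attr_reader :rendered

  def initialize(page_count)
    @page_count = page_count
    @rendered = 0
  end

  def each_page(dpi: 150, &block)
    return enum_for(:each_page, dpi: dpi) unless block

    @page_count.times do |index|
      @rendered += 1 # counts how many pages were actually produced
      block.call(index, "page-#{index}@#{dpi}dpi")
    end
  end
end

# Block form: pages are yielded one at a time.
renderer = FakeRenderer.new(3)
renderer.each_page(dpi: 72) { |index, bytes| puts "#{index}: #{bytes}" }

# Enumerator form: taking the first two pages stops the iteration early.
lazy_renderer = FakeRenderer.new(100)
first_two = lazy_renderer.each_page.first(2)
puts first_two.length
puts lazy_renderer.rendered
```

In the Enumerator form, pages beyond those requested are never produced, which is the benefit of lazy page iteration for large documents.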