Module: Kreuzberg::ExtractionAPI
- Defined in:
- lib/kreuzberg/extraction_api.rb
Instance Method Summary
- #batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>
  Asynchronously extract content from multiple byte data sources.
- #batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>
  Synchronously extract content from multiple byte data sources.
- #batch_extract_files(paths:, config: nil) ⇒ Array<Result>
  Asynchronously extract content from multiple files.
- #batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>
  Synchronously extract content from multiple files.
- #extract_bytes(data:, mime_type:, config: nil) ⇒ Result
  Asynchronously extract content from byte data.
- #extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result
  Synchronously extract content from byte data.
- #extract_file(path:, mime_type: nil, config: nil) ⇒ Result
  Asynchronously extract content from a file.
- #extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result
  Synchronously extract content from a file.
- #normalize_config(config) ⇒ Object
Instance Method Details
#batch_extract_bytes(data_array:, mime_types:, config: nil) ⇒ Array<Result>
Asynchronously extract content from multiple byte data sources.
Non-blocking batch extraction from multiple in-memory binary documents. Results maintain the same order as the input data array. This method is preferred when processing multiple documents without blocking (e.g., handling multiple uploads).
# File 'lib/kreuzberg/extraction_api.rb', line 314
def batch_extract_bytes(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
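Because results come back in the same order as the input data array, each one can be paired with its source by position. The following is a minimal sketch under stated assumptions: the `batch_extract_bytes` defined here is a stand-in with the same keyword signature, and `StubResult` and its echoed `content` string are hypothetical, so the example runs without the native extension.

```ruby
# Hypothetical stand-in for a Result; the real class is not reproduced here.
StubResult = Struct.new(:content)

# Stand-in with the same keyword signature as batch_extract_bytes. It only
# echoes the MIME type and byte count instead of extracting anything.
def batch_extract_bytes(data_array:, mime_types:, config: nil)
  data_array.zip(mime_types).map do |data, mime|
    StubResult.new("#{mime}: #{data.to_s.bytesize} bytes")
  end
end

uploads = [
  { name: 'notes.txt', data: 'hello',     mime: 'text/plain' },
  { name: 'page.html', data: '<p>hi</p>', mime: 'text/html' }
]

results = batch_extract_bytes(
  data_array: uploads.map { |upload| upload[:data] },
  mime_types: uploads.map { |upload| upload[:mime] }
)

# Pair each upload name with the result at the same index.
by_name = uploads.map { |upload| upload[:name] }.zip(results).to_h
```

Keeping the name list and the data list derived from the same `uploads` array is what makes the positional pairing safe.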
#batch_extract_bytes_sync(data_array:, mime_types:, config: nil) ⇒ Array<Result>
Synchronously extract content from multiple byte data sources.
Processes multiple in-memory binary documents in a single batch operation. Results maintain the same order as the input data array. The mime_types array must have the same length as the data_array.
# File 'lib/kreuzberg/extraction_api.rb', line 270
def batch_extract_bytes_sync(data_array:, mime_types:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_bytes_sync(data_array.map(&:to_s), mime_types.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
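The Ruby wrapper shown above does not itself compare the two array lengths, so a caller may want to check them before handing the arrays over. A minimal sketch, assuming a hypothetical helper named `check_batch_lengths!` that is not part of Kreuzberg::ExtractionAPI:

```ruby
# Hypothetical pre-flight check; not part of Kreuzberg::ExtractionAPI.
def check_batch_lengths!(data_array, mime_types)
  return if data_array.length == mime_types.length

  raise ArgumentError,
        "data_array has #{data_array.length} entries but mime_types has #{mime_types.length}"
end

documents  = ['plain text', '<p>markup</p>']
mime_types = ['text/plain'] # one entry short on purpose

mismatch = begin
  check_batch_lengths!(documents, mime_types)
  nil
rescue ArgumentError => e
  e.message
end
```

Failing early in Ruby gives a message that names both lengths, which is easier to act on than an error surfacing from the native layer.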
#batch_extract_files(paths:, config: nil) ⇒ Array<Result>
Asynchronously extract content from multiple files.
Non-blocking batch extraction from multiple files. Results maintain the same order as input paths. This is the preferred method for bulk processing when non-blocking I/O is required (e.g., in web servers or async applications).
# File 'lib/kreuzberg/extraction_api.rb', line 231
def batch_extract_files(paths:, config: nil)
  opts = normalize_config(config)
  hashes = native_batch_extract_files(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
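For very long path lists, one possible approach is to submit the paths in fixed-size slices and concatenate the per-slice results, which still preserves the overall input order. This is a sketch only: `extract_in_slices` and its `slice_size:` parameter are hypothetical, and the `batch_extract_files` below is a stand-in that returns labelled strings rather than Result objects.

```ruby
# Stand-in with the same keyword signature as batch_extract_files; it returns
# a labelled string per path so the ordering is visible without the native
# extension.
def batch_extract_files(paths:, config: nil)
  paths.map { |path| "extracted:#{path}" }
end

# Hypothetical helper: process paths slice by slice and flatten the
# per-slice result arrays into one array in input order.
def extract_in_slices(paths, slice_size:, config: nil)
  paths.each_slice(slice_size).flat_map do |slice|
    batch_extract_files(paths: slice, config: config)
  end
end

paths = %w[a.pdf b.docx c.html d.txt e.png]
results = extract_in_slices(paths, slice_size: 2)
```

Whether slicing helps in practice depends on how the native batch call schedules its work, which this page does not describe.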
#batch_extract_files_sync(paths:, config: nil) ⇒ Array<Result>
Synchronously extract content from multiple files.
Processes multiple files in a single batch operation. Files are extracted sequentially, and results maintain the same order as the input paths. This is useful for bulk processing multiple documents with consistent configuration.
# File 'lib/kreuzberg/extraction_api.rb', line 100
def batch_extract_files_sync(paths:, config: nil)
  # Validate that all files exist
  paths.each do |path|
    path_str = path.to_s
    raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)
  end

  opts = normalize_config(config)
  hashes = native_batch_extract_files_sync(paths.map(&:to_s), **opts)
  results = hashes.map { |hash| Result.new(hash) }
  record_cache_entry!(results, opts)
  results
end
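The validation loop above raises on the first missing path, before any extraction starts. If a caller would rather skip missing files than abort the whole batch, the paths can be split up front. A minimal sketch, assuming a hypothetical helper named `partition_by_existence` and a made-up missing path:

```ruby
require 'tempfile'

# Hypothetical helper: split paths into [existing, missing] using the same
# File.exist? test as the validation loop above.
def partition_by_existence(paths)
  paths.map(&:to_s).partition { |path| File.exist?(path) }
end

present_file = Tempfile.new('present')
present_path = present_file.path
missing_path = '/nonexistent/kreuzberg-demo/missing.pdf' # assumed not to exist

existing, missing = partition_by_existence([present_path, missing_path])
# `existing` could then be passed as paths: to batch_extract_files_sync,
# while `missing` is logged or reported separately.

present_file.close! # close and delete the temporary file
```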
#extract_bytes(data:, mime_type:, config: nil) ⇒ Result
Asynchronously extract content from byte data.
Non-blocking extraction from in-memory binary data. Like #extract_file, this performs extraction in the background, making it suitable for handling high-volume extraction workloads without blocking the main thread.
# File 'lib/kreuzberg/extraction_api.rb', line 190
def extract_bytes(data:, mime_type:, config: nil)
  opts = normalize_config(config)
  hash = native_extract_bytes(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
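The listing returns a Result directly, so the calling code resumes once the native call has handed back its hash. One way to overlap an extraction with other Ruby work is to run the call on a Thread and collect the value later. This is a sketch under stated assumptions: `extract_bytes` here is a stand-in with the same keyword signature, and whether the real native call releases Ruby's global VM lock is not stated on this page.

```ruby
# Stand-in with the same keyword signature as extract_bytes; it returns a
# descriptive string instead of a Result.
def extract_bytes(data:, mime_type:, config: nil)
  "#{mime_type}: #{data.to_s.bytesize} bytes"
end

worker = Thread.new do
  extract_bytes(data: 'hello world', mime_type: 'text/plain')
end

other_work = (1..3).sum # the caller keeps going while the worker runs

result = worker.value # joins the thread and returns the block's value
```

`Thread#value` also re-raises any exception from the block, so extraction errors still reach the caller.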
#extract_bytes_sync(data:, mime_type:, config: nil) ⇒ Result
Synchronously extract content from byte data.
Performs document extraction directly from binary data in memory. Useful for extracting content from files already loaded into memory or from network streams. Raises TypeError if mime_type is nil.
# File 'lib/kreuzberg/extraction_api.rb', line 59
def extract_bytes_sync(data:, mime_type:, config: nil)
  raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

  opts = normalize_config(config)
  hash = native_extract_bytes_sync(data.to_s, mime_type.to_s, **opts)
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
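To illustrate the stream use case and the nil guard, here is a minimal sketch. The `extract_bytes_sync` below is a stand-in that reproduces only the guard and the `data.to_s` coercion from the listing; the returned hash is hypothetical, and a StringIO plays the role of a network stream.

```ruby
require 'stringio'

# Stand-in that reproduces only the nil guard and the data.to_s coercion
# from the listing above; the native call is omitted.
def extract_bytes_sync(data:, mime_type:, config: nil)
  raise TypeError, "mime_type must be a String, got #{mime_type.inspect}" if mime_type.nil?

  { bytes: data.to_s.bytesize, mime_type: mime_type.to_s }
end

# StringIO stands in for a network stream or an upload body.
stream = StringIO.new('%PDF-1.7 demo')
payload = stream.read.b # binary (ASCII-8BIT) copy of the bytes read

ok = extract_bytes_sync(data: payload, mime_type: 'application/pdf')

error = begin
  extract_bytes_sync(data: payload, mime_type: nil)
rescue TypeError => e
  e.message
end
```

Note that the guard only rejects nil. Any other value is converted with `to_s`, so a Symbol such as `:pdf` would pass the check and reach the native call as the string "pdf".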
#extract_file(path:, mime_type: nil, config: nil) ⇒ Result
Asynchronously extract content from a file.
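Non-blocking extraction. Extraction is performed in the background using native threads or the Tokio runtime. This method is preferred for I/O-bound operations and for integrating with async workflows. As the source listing shows, the Ruby method returns a Result once the native call completes, and raises Errors::IOError if the path does not exist.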
# File 'lib/kreuzberg/extraction_api.rb', line 144
def extract_file(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file(path_str, mime_type.to_s, **opts)
         else
           native_extract_file(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
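Since the path is converted with `to_s` before the existence check, a Pathname works as well as a String, and a missing file surfaces as an error that can be rescued. A minimal sketch, assuming a stand-in `Demo` namespace: `Demo::Errors::IOError` and `Demo.checked_path` are hypothetical names that mirror only the validation step, and the missing path is made up.

```ruby
require 'pathname'

# Stand-in namespace mirroring the error class named in the listing above.
module Demo
  module Errors
    class IOError < StandardError; end
  end

  # Reproduces only the existence check; the native call is omitted.
  def self.checked_path(path)
    path_str = path.to_s
    raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

    path_str
  end
end

missing = Pathname.new('/nonexistent/kreuzberg-demo') / 'report.pdf'

message = begin
  Demo.checked_path(missing)
rescue Demo::Errors::IOError => e
  e.message
end
```

With the real module, the same `rescue` pattern would name the library's own Errors::IOError class instead of the stand-in.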
#extract_file_sync(path:, mime_type: nil, config: nil) ⇒ Result
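Synchronously extract content from a file.
Blocks until extraction completes and returns an extraction result containing content, metadata, tables, and images. Raises Errors::IOError if the file does not exist.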
# File 'lib/kreuzberg/extraction_api.rb', line 17
def extract_file_sync(path:, mime_type: nil, config: nil)
  # Validate that the file exists
  path_str = path.to_s
  raise Errors::IOError, "File not found: #{path_str}" unless File.exist?(path_str)

  opts = normalize_config(config)
  hash = if mime_type
           native_extract_file_sync(path_str, mime_type.to_s, **opts)
         else
           native_extract_file_sync(path_str, **opts)
         end
  result = Result.new(hash)
  record_cache_entry!(result, opts)
  result
end
#normalize_config(config) ⇒ Object
# File 'lib/kreuzberg/extraction_api.rb', line 322
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end
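The three branches can be exercised in isolation. The sketch below copies the method body verbatim so it runs standalone. `DemoConfig` and its `use_cache` and `force_ocr` fields are hypothetical, standing in for any object that responds to `to_h`; the gem's own config class is not shown on this page.

```ruby
# Standalone copy of the normalization logic shown above.
def normalize_config(config)
  return {} if config.nil?
  return config if config.is_a?(Hash)

  config.to_h
end

# Hypothetical config object; any object responding to to_h would do.
DemoConfig = Struct.new(:use_cache, :force_ocr, keyword_init: true)

nil_opts    = normalize_config(nil)
hash_opts   = normalize_config({ use_cache: false })
struct_opts = normalize_config(DemoConfig.new(use_cache: true, force_ocr: false))
```

A Hash is returned as the same object rather than a copy, so a later mutation of the options would also be visible to the caller's hash.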