Class: PdfOxide::AutoExtractor

Inherits: Object

Defined in: lib/pdf_oxide/auto_extractor.rb

Overview

v0.3.51 #519 — auto-extraction with typed reasons.

Mirrors `fyi.oxide.pdf.AutoExtractor`. Given a PdfDocument, returns recoverable text (native or OCR), per page or for the whole document, with a typed reason naming any degraded outcome. When OCR is needed but unavailable, it returns the native text layer with the reason `:ocr_requested_but_unavailable` instead of raising, because extraction is not a security operation (per `feedback_extraction_graceful_fallback`).

Examples:

doc = PdfOxide::PdfDocument.open('sample.pdf')
ax  = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
puts result[:text]
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])

Constant Summary

REASONS = Typed reasons mirror the Rust serde-emitted snake_case tokens at the FFI JSON boundary. Renaming would break cross-binding parity with PHP / Python / Java.

%i[
  ok
  native_text_high_confidence
  no_text_layer_present
  text_layer_below_threshold
  glyph_mapping_missing
  encrypted_no_extract_permission
  image_table_reconstructed
  image_table_no_structure
  chart_not_transcribed
  ocr_requested_but_unavailable
  ocr_low_confidence_fallback
  empty
].freeze
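
Because the reasons are snake_case tokens at the JSON boundary, a caller that receives one as a string (for example from a log line or another binding) may want to validate it before converting it to a symbol. A minimal sketch, assuming a local copy of the list above; `KNOWN_REASONS` and `parse_reason` are hypothetical names and not part of PdfOxide:

```ruby
# Local copy of the documented reason tokens (an assumption for this sketch;
# real code would reference PdfOxide::AutoExtractor::REASONS instead).
KNOWN_REASONS = %i[
  ok native_text_high_confidence no_text_layer_present
  text_layer_below_threshold glyph_mapping_missing
  encrypted_no_extract_permission image_table_reconstructed
  image_table_no_structure chart_not_transcribed
  ocr_requested_but_unavailable ocr_low_confidence_fallback empty
].freeze

# Hypothetical helper: turn a wire token into a known reason symbol,
# or return nil when the token is not in the list.
def parse_reason(token)
  sym = token.to_s.strip.to_sym
  KNOWN_REASONS.include?(sym) ? sym : nil
end

p parse_reason('glyph_mapping_missing') # a known token
p parse_reason('GlyphMappingMissing')   # wrong casing, so not a known token
```

Rejecting unknown tokens rather than blindly calling `to_sym` keeps a typo or a renamed token from silently flowing into later comparisons.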

PAGE_KINDS = Per-page kinds from the auto-classifier (Rust's `PageKind` enum).

%i[text_layer scanned image_text mixed empty].freeze

Instance Attribute Summary

#document ⇒ PdfDocument readonly

Class Method Summary

.prefetch_available? ⇒ Boolean

Whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).

Instance Method Summary

#classify_document ⇒ Hash

Whole-document classifier.

#classify_page(page_index) ⇒ Hash

Cheap per-page classifier: no OCR, no rasterisation.

#extract_page(page_index, options: nil) ⇒ Object

Rich per-page extraction: returns the full PageExtraction JSON envelope (text + per-region bbox + reason + confidence) merged into a Hash.

#extract_text(page_index) ⇒ Hash

Extract a page's text via the v0.3.51 auto-router (text-vs-OCR decision with graceful native fallback).

#initialize(document) ⇒ AutoExtractor constructor

A new instance of AutoExtractor.

#ocr_fallback?(reason) ⇒ Boolean

True when the OCR-unavailable graceful-fallback path engaged.

#ok?(reason) ⇒ Boolean

True when the reason represents a clean extract.

Constructor Details

#initialize(document) ⇒ `AutoExtractor`

Returns a new instance of AutoExtractor.

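Raises:

(::PdfOxide::ArgumentError) if document is nil
(::PdfOxide::StateError) if document responds to closed? and reports that it is closed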

# File 'lib/pdf_oxide/auto_extractor.rb', line 47

def initialize(document)
  raise ::PdfOxide::ArgumentError, 'document cannot be nil' if document.nil?
  raise ::PdfOxide::StateError, 'document has been closed' if document.respond_to?(:closed?) && document.closed?

  @document = document
end

Instance Attribute Details

#document ⇒ `PdfDocument` (readonly)

Returns:

(PdfDocument)

# File 'lib/pdf_oxide/auto_extractor.rb', line 45

def document
  @document
end

Class Method Details

.prefetch_available? ⇒ `Boolean`

Returns whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).

Returns:

(Boolean) —

whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).

# File 'lib/pdf_oxide/auto_extractor.rb', line 118

def self.prefetch_available?
  Bindings.pdf_oxide_prefetch_available != 0
end

Instance Method Details

#classify_document ⇒ `Hash`

Whole-document classifier.

Returns:

(Hash) —

decoded JSON envelope.

# File 'lib/pdf_oxide/auto_extractor.rb', line 65

def classify_document
  call_json('classify_document') do |err|
    Bindings.pdf_document_classify_document(@document.handle, err)
  end
end

#classify_page(page_index) ⇒ `Hash`

Cheap per-page classifier — no OCR, no rasterisation.

Returns:

(Hash) —

{ reason:, kind:, confidence:, classification: }

# File 'lib/pdf_oxide/auto_extractor.rb', line 56

def classify_page(page_index)
  json = call_json('classify_page') do |err|
    Bindings.pdf_document_classify_page(@document.handle, page_index, err)
  end
  build_classification(json)
end
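
Since the classifier does no OCR and no rasterisation, it can serve as a pre-pass to find the pages that would need OCR before any extraction is attempted. The sketch below is an illustration only: `pages_needing_ocr` and `FakeExtractor` are hypothetical, the page count is passed in by the caller because this page does not show how to query it, and treating only `:scanned` as "needs OCR" follows the check in `#extract_text` rather than any documented rule.

```ruby
# Hypothetical helper: classify every page and return the indices whose kind
# is in `ocr_kinds`. `extractor` only has to respond to classify_page(index)
# and return a Hash with a :kind key.
def pages_needing_ocr(extractor, page_count, ocr_kinds: %i[scanned])
  (0...page_count).select do |index|
    ocr_kinds.include?(extractor.classify_page(index)[:kind])
  end
end

# Stand-in extractor so the sketch runs without PdfOxide installed.
FakeExtractor = Struct.new(:kinds) do
  def classify_page(index)
    { reason: :ok, kind: kinds.fetch(index), confidence: 1.0 }
  end
end

fake = FakeExtractor.new(%i[text_layer scanned mixed image_text])
p pages_needing_ocr(fake, 4)
p pages_needing_ocr(fake, 4, ocr_kinds: %i[scanned image_text])
```

With a real `PdfOxide::AutoExtractor` the same helper should work unchanged, as long as `classify_page` returns the `{ reason:, kind:, confidence:, classification: }` Hash described above.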

#extract_page(page_index, options: nil) ⇒ `Object`

Rich per-page extraction — returns the full PageExtraction JSON envelope (text + per-region bbox + reason + confidence) merged into a Hash.

Parameters:

page_index (Integer)
options (Hash, nil) (defaults to: nil) —

auto-extract options serialised to JSON.

# File 'lib/pdf_oxide/auto_extractor.rb', line 96

def extract_page(page_index, options: nil)
  options_json = options.nil? ? nil : JSON.generate(options)
  json = call_json('extract_page_auto') do |err|
    Bindings.pdf_document_extract_page_auto(@document.handle, page_index, options_json, err)
  end
  cls = build_classification(json)
  cls.merge(text: json['text'] || '', classification: json)
end
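
The listing shows that `options` is passed through `JSON.generate` unless it is nil, so any Hash supplied must be JSON-serialisable. A small sketch of that step in isolation; `serialise_options` is a hypothetical name, and `example_flag` is a made-up key, since the real option names are not listed on this page:

```ruby
require 'json'

# Mirrors the first line of extract_page: nil stays nil, anything else is
# serialised with JSON.generate before crossing the FFI boundary.
def serialise_options(options)
  options.nil? ? nil : JSON.generate(options)
end

# `example_flag` is for illustration only and is not a documented option.
p serialise_options(nil)
puts serialise_options({ example_flag: true })
puts serialise_options({})
```

Note also that the returned Hash carries the full decoded envelope under `:classification` and defaults `:text` to an empty string when the envelope has no `text` field, as the last line of the method shows.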

#extract_text(page_index) ⇒ `Hash`

Extract a page’s text via the v0.3.51 auto-router (text-vs-OCR decision with graceful native fallback). Surfaces a typed reason describing the quality.

Returns:

(Hash) —

{ text:, reason:, kind:, confidence:, classification: }

# File 'lib/pdf_oxide/auto_extractor.rb', line 75

def extract_text(page_index)
  text = call_text('extract_text_auto') do |err|
    Bindings.pdf_document_extract_text_auto(@document.handle, page_index, err)
  end
  cls = begin
    classify_page(page_index)
  rescue StandardError
    { reason: :ok, kind: :mixed, confidence: 0.0 }
  end
  # Graceful fallback: if classifier wants OCR and the build can't
  # supply it, surface OCR_REQUESTED_BUT_UNAVAILABLE regardless of
  # native-side state.
  cls[:reason] = :ocr_requested_but_unavailable if cls[:kind] == :scanned && !self.class.prefetch_available?
  cls.merge(text: text)
end
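
The reason override at the end of this method can be restated as a pure function, which may help when reasoning about what callers will see. This is a sketch of the logic in the listing, not the library's code; `apply_ocr_fallback` is a hypothetical name and the `ocr_available:` keyword stands in for `self.class.prefetch_available?`.

```ruby
# A :scanned page on a build without OCR is reported as
# :ocr_requested_but_unavailable, whatever reason the classifier gave.
# Any other combination leaves the classification untouched.
def apply_ocr_fallback(classification, ocr_available:)
  return classification unless classification[:kind] == :scanned && !ocr_available

  classification.merge(reason: :ocr_requested_but_unavailable)
end

scanned = { reason: :no_text_layer_present, kind: :scanned, confidence: 0.9 }
p apply_ocr_fallback(scanned, ocr_available: false)[:reason]
p apply_ocr_fallback(scanned, ocr_available: true)[:reason]
```

One consequence of the listing worth keeping in mind: if `classify_page` raises, the method substitutes `{ reason: :ok, kind: :mixed, confidence: 0.0 }`, so a result with reason `:ok` and confidence `0.0` may mean the classifier failed rather than that the extract was clean.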

#ocr_fallback?(reason) ⇒ `Boolean`

Returns true when the OCR-unavailable graceful-fallback path engaged.

Returns:

(Boolean) —

true when the OCR-unavailable graceful-fallback path engaged.

# File 'lib/pdf_oxide/auto_extractor.rb', line 112

def ocr_fallback?(reason)
  %i[ocr_requested_but_unavailable ocr_low_confidence_fallback].include?(reason)
end

#ok?(reason) ⇒ `Boolean`

Returns true when the reason represents a clean extract.

Returns:

(Boolean) —

true when the reason represents a clean extract.

# File 'lib/pdf_oxide/auto_extractor.rb', line 106

def ok?(reason)
  %i[ok native_text_high_confidence].include?(reason)
end
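
Taken together, `#ok?` and `#ocr_fallback?` split the reasons into three groups: clean, OCR fallback, and everything else. A minimal sketch of a triage built on that split, with the two predicates copied from the listings above so it runs without PdfOxide installed; `clean_reason?`, `ocr_fallback_reason?` and `triage` are hypothetical names:

```ruby
# Same membership test as #ok? above.
def clean_reason?(reason)
  %i[ok native_text_high_confidence].include?(reason)
end

# Same membership test as #ocr_fallback? above. As listed, it accepts both
# the "unavailable" and the "low confidence" tokens.
def ocr_fallback_reason?(reason)
  %i[ocr_requested_but_unavailable ocr_low_confidence_fallback].include?(reason)
end

# Hypothetical triage: map a reason to one of three handling buckets.
def triage(reason)
  return :clean if clean_reason?(reason)
  return :ocr_fallback if ocr_fallback_reason?(reason)

  :degraded
end

%i[ok ocr_low_confidence_fallback glyph_mapping_missing].each do |reason|
  puts "#{reason} -> #{triage(reason)}"
end
```

A caller might log the `:ocr_fallback` bucket separately from `:degraded`, since the former depends on how the library was built and the latter on the document itself.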
