Class: PdfOxide::AutoExtractor

Inherits:
Object
Defined in:
lib/pdf_oxide/auto_extractor.rb

Overview

v0.3.51 #519 — auto-extraction with typed reasons.

Mirrors `fyi.oxide.pdf.AutoExtractor`. Given a PdfDocument, returns recoverable text (native or OCR), per page or for the whole document, with a typed reason naming any degraded outcome. When OCR is needed but unavailable, returns the native text layer with `:ocr_requested_but_unavailable` instead of raising, because extraction is not a security operation (per `feedback_extraction_graceful_fallback`).

Examples:

doc = PdfOxide::PdfDocument.open('sample.pdf')
ax  = PdfOxide::AutoExtractor.new(doc)
result = ax.extract_page(0)
puts result[:text]
warn "degraded: #{result[:reason]}" unless ax.ok?(result[:reason])

Constant Summary

REASONS =

Typed reasons mirror the Rust serde-emitted snake_case tokens at the FFI JSON boundary. Renaming would break cross-binding parity with PHP / Python / Java.

%i[
  ok
  native_text_high_confidence
  no_text_layer_present
  text_layer_below_threshold
  glyph_mapping_missing
  encrypted_no_extract_permission
  image_table_reconstructed
  image_table_no_structure
  chart_not_transcribed
  ocr_requested_but_unavailable
  ocr_low_confidence_fallback
  empty
].freeze
PAGE_KINDS =

Per-page kinds from the auto-classifier (Rust's `PageKind` enum).

%i[text_layer scanned image_text mixed empty].freeze
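
Because the REASONS tokens cross the FFI boundary as snake_case strings, a caller that handles a decoded envelope directly may want to map the string back to a known symbol and reject anything unexpected. The following is a minimal sketch, not part of PdfOxide: `decode_reason` is a hypothetical helper, and the assumption that the envelope carries the token under a `"reason"` key is ours.

```ruby
require 'json'

# Copied from the REASONS constant above so the sketch runs on its own.
REASONS = %i[
  ok
  native_text_high_confidence
  no_text_layer_present
  text_layer_below_threshold
  glyph_mapping_missing
  encrypted_no_extract_permission
  image_table_reconstructed
  image_table_no_structure
  chart_not_transcribed
  ocr_requested_but_unavailable
  ocr_low_confidence_fallback
  empty
].freeze

# Hypothetical helper, not part of PdfOxide: map the snake_case string in a
# decoded JSON envelope to one of the known reason symbols.
# Assumes the token lives under a "reason" key.
def decode_reason(envelope)
  token = envelope['reason'].to_s
  REASONS.find { |reason| reason.to_s == token } ||
    raise(ArgumentError, "unknown reason token: #{token.inspect}")
end

envelope = JSON.parse('{"reason": "no_text_layer_present", "confidence": 0.12}')
p decode_reason(envelope) # prints :no_text_layer_present
```

Comparing against the frozen list, rather than calling `to_sym` on whatever arrives, keeps an unrecognised token from silently becoming a new reason.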

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(document) ⇒ AutoExtractor

Returns a new instance of AutoExtractor.



# File 'lib/pdf_oxide/auto_extractor.rb', line 47

def initialize(document)
  raise ::PdfOxide::ArgumentError, 'document cannot be nil' if document.nil?
  raise ::PdfOxide::StateError, 'document has been closed' if document.respond_to?(:closed?) && document.closed?

  @document = document
end

Instance Attribute Details

#document ⇒ PdfDocument (readonly)

Returns:

  • (PdfDocument)

    the document passed to the constructor.



# File 'lib/pdf_oxide/auto_extractor.rb', line 45

def document
  @document
end

Class Method Details

.prefetch_available? ⇒ Boolean

Returns whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).

Returns:

  • (Boolean)

    whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).



# File 'lib/pdf_oxide/auto_extractor.rb', line 118

def self.prefetch_available?
  Bindings.pdf_oxide_prefetch_available != 0
end

Instance Method Details

#classify_document ⇒ Hash

Whole-document classifier.

Returns:

  • (Hash)

    decoded JSON envelope.



# File 'lib/pdf_oxide/auto_extractor.rb', line 65

def classify_document
  call_json('classify_document') do |err|
    Bindings.pdf_document_classify_document(@document.handle, err)
  end
end

#classify_page(page_index) ⇒ Hash

Cheap per-page classifier — no OCR, no rasterisation.

Returns:

  • (Hash)

    { reason:, kind:, confidence:, classification: }



# File 'lib/pdf_oxide/auto_extractor.rb', line 56

def classify_page(page_index)
  json = call_json('classify_page') do |err|
    Bindings.pdf_document_classify_page(@document.handle, page_index, err)
  end
  build_classification(json)
end
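
Since `classify_page` involves no OCR or rasterisation, one plausible use is to survey a document before committing to extraction. The sketch below works on plain Hashes shaped like the documented return value. `kind_counts` and `scanned_page_indexes` are hypothetical helpers, and the sample reasons, kinds and confidence values are illustrative rather than real output.

```ruby
# Hypothetical helper, not part of PdfOxide: count pages per classifier kind.
# Each element is assumed to look like the Hash documented above,
# e.g. { reason: :ok, kind: :text_layer, confidence: 0.98 }.
def kind_counts(classifications)
  classifications.each_with_object(Hash.new(0)) do |cls, counts|
    counts[cls[:kind]] += 1
  end
end

# Hypothetical helper: indexes of pages the classifier marked as :scanned,
# which is the kind extract_text treats as needing OCR.
def scanned_page_indexes(classifications)
  classifications.each_index.select { |i| classifications[i][:kind] == :scanned }
end

# Illustrative data. With the real binding, each Hash would come from
# ax.classify_page(i) for every page index i in the document.
sample = [
  { reason: :native_text_high_confidence, kind: :text_layer, confidence: 0.97 },
  { reason: :no_text_layer_present,       kind: :scanned,    confidence: 0.91 },
  { reason: :empty,                       kind: :empty,      confidence: 1.0 }
]

p kind_counts(sample)
p scanned_page_indexes(sample)
```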

#extract_page(page_index, options: nil) ⇒ Object

Rich per-page extraction — returns the full PageExtraction JSON envelope (text + per-region bbox + reason + confidence) merged into a Hash.

Parameters:

  • page_index (Integer)
  • options (Hash, nil) (defaults to: nil)

    auto-extract options serialised to JSON.



# File 'lib/pdf_oxide/auto_extractor.rb', line 96

def extract_page(page_index, options: nil)
  options_json = options.nil? ? nil : JSON.generate(options)
  json = call_json('extract_page_auto') do |err|
    Bindings.pdf_document_extract_page_auto(@document.handle, page_index, options_json, err)
  end
  cls = build_classification(json)
  cls.merge(text: json['text'] || '', classification: json)
end
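
The `options` handling above can be illustrated in isolation: `nil` is passed through untouched, while a Hash is serialised with `JSON.generate`, so symbol keys arrive on the native side as JSON string keys. This is a sketch only. `serialise_options` is a hypothetical name for the first line of `extract_page`, and `hypothetical_option` is a placeholder because the real option names are not listed on this page.

```ruby
require 'json'

# Restates the first line of extract_page: nil stays nil, a Hash becomes JSON.
def serialise_options(options)
  options.nil? ? nil : JSON.generate(options)
end

# `hypothetical_option` is a placeholder key, not a documented option.
p serialise_options(nil)
puts serialise_options({ hypothetical_option: true })
```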

#extract_text(page_index) ⇒ Hash

Extract a page’s text via the v0.3.51 auto-router (text-vs-OCR decision with graceful native fallback). Surfaces a typed reason describing the quality.

Returns:

  • (Hash)

    { text:, reason:, kind:, confidence:, classification: }



# File 'lib/pdf_oxide/auto_extractor.rb', line 75

def extract_text(page_index)
  text = call_text('extract_text_auto') do |err|
    Bindings.pdf_document_extract_text_auto(@document.handle, page_index, err)
  end
  cls = begin
    classify_page(page_index)
  rescue StandardError
    { reason: :ok, kind: :mixed, confidence: 0.0 }
  end
  # Graceful fallback: if classifier wants OCR and the build can't
  # supply it, surface OCR_REQUESTED_BUT_UNAVAILABLE regardless of
  # native-side state.
  cls[:reason] = :ocr_requested_but_unavailable if cls[:kind] == :scanned && !self.class.prefetch_available?
  cls.merge(text: text)
end
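
The override at the end of `extract_text` can be restated as a small pure function, which makes the rule easier to test without a PDF. This is a sketch under stated assumptions: `apply_ocr_fallback` is a hypothetical name, the `ocr_available` argument stands in for `AutoExtractor.prefetch_available?`, and unlike the method above it returns a new Hash instead of mutating its input.

```ruby
# Stand-alone restatement of the override in extract_text.
# `ocr_available` stands in for AutoExtractor.prefetch_available?.
def apply_ocr_fallback(classification, ocr_available)
  needs_ocr = classification[:kind] == :scanned
  return classification unless needs_ocr && !ocr_available

  # The page wants OCR but the build cannot supply it: report that,
  # whatever reason the classifier originally gave.
  classification.merge(reason: :ocr_requested_but_unavailable)
end

# Illustrative classifications, not real output.
scanned = { reason: :no_text_layer_present, kind: :scanned, confidence: 0.9 }
native  = { reason: :native_text_high_confidence, kind: :text_layer, confidence: 0.97 }

p apply_ocr_fallback(scanned, false)[:reason] # prints :ocr_requested_but_unavailable
p apply_ocr_fallback(scanned, true)[:reason]  # prints :no_text_layer_present
p apply_ocr_fallback(native, false)[:reason]  # prints :native_text_high_confidence
```

Note that when `classify_page` raises, `extract_text` substitutes `{ reason: :ok, kind: :mixed, confidence: 0.0 }`, so a reason of `:ok` paired with a confidence of `0.0` may indicate a failed classification rather than a clean page.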

#ocr_fallback?(reason) ⇒ Boolean

Returns true when the OCR-unavailable graceful-fallback path engaged.

Returns:

  • (Boolean)

    true when the OCR-unavailable graceful-fallback path engaged.



# File 'lib/pdf_oxide/auto_extractor.rb', line 112

def ocr_fallback?(reason)
  %i[ocr_requested_but_unavailable ocr_low_confidence_fallback].include?(reason)
end

#ok?(reason) ⇒ Boolean

Returns true when the reason represents a clean extract.

Returns:

  • (Boolean)

    true when the reason represents a clean extract.



# File 'lib/pdf_oxide/auto_extractor.rb', line 106

def ok?(reason)
  %i[ok native_text_high_confidence].include?(reason)
end
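
Taken together, `ok?` and `ocr_fallback?` split the reasons into three groups: clean, OCR fallback, and everything else. A minimal sketch of that triage, using the same symbol lists as the two predicates, might look like the following. `triage` and the two constants are hypothetical names introduced here, not part of PdfOxide.

```ruby
# Same symbol lists as the bodies of ok? and ocr_fallback? above.
CLEAN_REASONS = %i[ok native_text_high_confidence].freeze
OCR_FALLBACK_REASONS = %i[ocr_requested_but_unavailable ocr_low_confidence_fallback].freeze

# Hypothetical helper, not part of PdfOxide: bucket a reason symbol.
def triage(reason)
  return :clean if CLEAN_REASONS.include?(reason)
  return :ocr_fallback if OCR_FALLBACK_REASONS.include?(reason)

  :degraded
end

p triage(:native_text_high_confidence)   # prints :clean
p triage(:ocr_requested_but_unavailable) # prints :ocr_fallback
p triage(:glyph_mapping_missing)         # prints :degraded
```

With the real binding, the same decision would use `ax.ok?(result[:reason])` and `ax.ocr_fallback?(result[:reason])` on the Hash returned by `extract_page` or `extract_text`.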