Class: PdfOxide::AutoExtractor
- Inherits: Object
  - Object
  - PdfOxide::AutoExtractor
- Defined in: lib/pdf_oxide/auto_extractor.rb
Overview
v0.3.51 #519 — auto-extraction with typed reasons.
Mirrors `fyi.oxide.pdf.AutoExtractor`. Given a PdfDocument, returns recoverable text (native or OCR), per page or for the whole document, with a typed reason naming any degraded outcome. When OCR is needed but unavailable, returns the native text layer with `:ocr_requested_but_unavailable` instead of raising, because extraction is not a security operation (per `feedback_extraction_graceful_fallback`).
Constant Summary
- REASONS =
Typed reasons mirror the Rust serde-emitted snake_case tokens at the FFI JSON boundary. Renaming would break cross-binding parity with PHP / Python / Java.
%i[
  ok native_text_high_confidence no_text_layer_present
  text_layer_below_threshold glyph_mapping_missing
  encrypted_no_extract_permission image_table_reconstructed
  image_table_no_structure chart_not_transcribed
  ocr_requested_but_unavailable ocr_low_confidence_fallback empty
].freeze
- PAGE_KINDS =
Per-page kinds from the auto-classifier (Rust's `PageKind` enum).
%i[text_layer scanned image_text mixed empty].freeze
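As a minimal sketch of how the snake_case tokens could be mapped onto these symbols, the snippet below keeps local copies of the two constants so it runs without the gem. The `known_symbol` helper and the payload shape are assumptions for illustration, not part of the documented API.

```ruby
require 'json'

# Local copies of the documented constants so the sketch is self-contained.
REASONS = %i[
  ok native_text_high_confidence no_text_layer_present
  text_layer_below_threshold glyph_mapping_missing
  encrypted_no_extract_permission image_table_reconstructed
  image_table_no_structure chart_not_transcribed
  ocr_requested_but_unavailable ocr_low_confidence_fallback empty
].freeze
PAGE_KINDS = %i[text_layer scanned image_text mixed empty].freeze

# Hypothetical helper: turn a snake_case token into a known symbol,
# or return nil when the token is not in the allowed list.
def known_symbol(token, allowed)
  sym = token.to_s.to_sym
  allowed.include?(sym) ? sym : nil
end

# Assumed payload shape; the real FFI JSON may carry more fields.
payload = JSON.parse('{"reason":"no_text_layer_present","kind":"scanned"}')
reason = known_symbol(payload['reason'], REASONS)
kind = known_symbol(payload['kind'], PAGE_KINDS)
```

Rejecting unknown tokens with `nil` rather than raising is one possible design; it keeps the caller in control of how to treat a token added by a newer native build.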
Instance Attribute Summary
- #document ⇒ PdfDocument readonly
Class Method Summary
-
.prefetch_available? ⇒ Boolean
Whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).
Instance Method Summary
-
#classify_document ⇒ Hash
Whole-document classifier.
-
#classify_page(page_index) ⇒ Hash
Cheap per-page classifier — no OCR, no rasterisation.
-
#extract_page(page_index, options: nil) ⇒ Object
Rich per-page extraction — returns the full PageExtraction JSON envelope (text + per-region bbox + reason + confidence) merged into a Hash.
-
#extract_text(page_index) ⇒ Hash
Extract a page’s text via the v0.3.51 auto-router (text-vs-OCR decision with graceful native fallback).
-
#initialize(document) ⇒ AutoExtractor
constructor
A new instance of AutoExtractor.
-
#ocr_fallback?(reason) ⇒ Boolean
True when the OCR-unavailable graceful-fallback path engaged.
-
#ok?(reason) ⇒ Boolean
True when the reason represents a clean extract.
Constructor Details
#initialize(document) ⇒ AutoExtractor
Returns a new instance of AutoExtractor.
# File 'lib/pdf_oxide/auto_extractor.rb', line 47
def initialize(document)
  raise ::PdfOxide::ArgumentError, 'document cannot be nil' if document.nil?
  raise ::PdfOxide::StateError, 'document has been closed' if document.respond_to?(:closed?) && document.closed?
  @document = document
end
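The two constructor guards can be exercised in isolation. The sketch below restates them against stand-in classes: `SketchOxide`, `FakeDocument`, and `outcome_for` are hypothetical names introduced here so the example runs without the gem, and they only approximate the real error classes and document type.

```ruby
# Stand-ins for the gem's error classes (assumptions, not the real ones).
module SketchOxide
  class ArgumentError < StandardError; end
  class StateError < StandardError; end

  # Restates the two guards from the constructor source.
  def self.check_document(document)
    raise ArgumentError, 'document cannot be nil' if document.nil?
    raise StateError, 'document has been closed' if document.respond_to?(:closed?) && document.closed?

    document
  end
end

# Hypothetical document double that only knows whether it is closed.
FakeDocument = Struct.new(:closed) do
  def closed?
    closed
  end
end

# Map each guard outcome to a symbol so callers can branch on it.
def outcome_for(document)
  SketchOxide.check_document(document)
  :accepted
rescue SketchOxide::ArgumentError
  :nil_document
rescue SketchOxide::StateError
  :closed_document
end

results = [
  outcome_for(nil),
  outcome_for(FakeDocument.new(true)),
  outcome_for(FakeDocument.new(false))
]
```

Because the closed check is wrapped in `respond_to?(:closed?)`, an object that does not define `closed?` passes the guard.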
Instance Attribute Details
#document ⇒ PdfDocument (readonly)
# File 'lib/pdf_oxide/auto_extractor.rb', line 45
def document
  @document
end
Class Method Details
.prefetch_available? ⇒ Boolean
Returns whether the build supports OCR provisioning (i.e. the `ocr` feature is compiled in).
# File 'lib/pdf_oxide/auto_extractor.rb', line 118
def self.prefetch_available?
  Bindings.pdf_oxide_prefetch_available != 0
end
Instance Method Details
#classify_document ⇒ Hash
Whole-document classifier.
# File 'lib/pdf_oxide/auto_extractor.rb', line 65
def classify_document
  call_json('classify_document') do |err|
    Bindings.pdf_document_classify_document(@document.handle, err)
  end
end
#classify_page(page_index) ⇒ Hash
Cheap per-page classifier — no OCR, no rasterisation.
# File 'lib/pdf_oxide/auto_extractor.rb', line 56
def classify_page(page_index)
  json = call_json('classify_page') do |err|
    Bindings.pdf_document_classify_page(@document.handle, page_index, err)
  end
  build_classification(json)
end
#extract_page(page_index, options: nil) ⇒ Object
Rich per-page extraction — returns the full PageExtraction JSON envelope (text + per-region bbox + reason + confidence) merged into a Hash.
# File 'lib/pdf_oxide/auto_extractor.rb', line 96
def extract_page(page_index, options: nil)
  options_json = options.nil? ? nil : JSON.generate(options)
  json = call_json('extract_page_auto') do |err|
    Bindings.pdf_document_extract_page_auto(@document.handle, page_index, options_json, err)
  end
  cls = build_classification(json)
  cls.merge(text: json['text'] || '', classification: json)
end
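The method passes `nil` through untouched and otherwise serialises the options with `JSON.generate` before handing them to the native call. A minimal sketch of that rule follows; `serialize_options` is a hypothetical helper and `'example_key'` is a placeholder, since the accepted option names are not listed on this page.

```ruby
require 'json'

# Hypothetical helper restating the nil-or-JSON rule used by extract_page.
def serialize_options(options)
  options.nil? ? nil : JSON.generate(options)
end

# 'example_key' is a placeholder, not a documented option name.
encoded = serialize_options({ 'example_key' => true })
absent = serialize_options(nil)
```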
#extract_text(page_index) ⇒ Hash
Extract a page’s text via the v0.3.51 auto-router (text-vs-OCR decision with graceful native fallback). Surfaces a typed reason describing the quality.
# File 'lib/pdf_oxide/auto_extractor.rb', line 75
def extract_text(page_index)
  text = call_text('extract_text_auto') do |err|
    Bindings.pdf_document_extract_text_auto(@document.handle, page_index, err)
  end
  cls = begin
    classify_page(page_index)
  rescue StandardError
    { reason: :ok, kind: :mixed, confidence: 0.0 }
  end
  # Graceful fallback: if classifier wants OCR and the build can't
  # supply it, surface OCR_REQUESTED_BUT_UNAVAILABLE regardless of
  # native-side state.
  cls[:reason] = :ocr_requested_but_unavailable if cls[:kind] == :scanned && !self.class.prefetch_available?
  cls.merge(text: text)
end
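The fallback rule in this method can be restated as a small pure function, which makes it easier to see when the reason is overridden. This is a sketch under assumptions: `apply_ocr_fallback` and its `ocr_available:` keyword are hypothetical stand-ins for the classifier result and `prefetch_available?`, and unlike the source it copies the hash instead of mutating it.

```ruby
# Hypothetical pure-Ruby restatement of the fallback rule; not the gem's API.
def apply_ocr_fallback(classification, text, ocr_available:)
  result = classification.dup
  # A scanned page needs OCR; without it, report the degraded outcome.
  if result[:kind] == :scanned && !ocr_available
    result[:reason] = :ocr_requested_but_unavailable
  end
  result.merge(text: text)
end

# Illustrative classifier output for a scanned page.
scanned = { reason: :no_text_layer_present, kind: :scanned, confidence: 0.9 }

without_ocr = apply_ocr_fallback(scanned, '', ocr_available: false)
with_ocr = apply_ocr_fallback(scanned, 'recognised text', ocr_available: true)
```

Only the `:scanned` kind triggers the override; pages classified as `:mixed` or `:image_text` keep whatever reason the classifier reported.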
#ocr_fallback?(reason) ⇒ Boolean
Returns true when the OCR-unavailable graceful-fallback path engaged.
# File 'lib/pdf_oxide/auto_extractor.rb', line 112
def ocr_fallback?(reason)
  %i[ocr_requested_but_unavailable ocr_low_confidence_fallback].include?(reason)
end
#ok?(reason) ⇒ Boolean
Returns true when the reason represents a clean extract.
# File 'lib/pdf_oxide/auto_extractor.rb', line 106
def ok?(reason)
  %i[ok native_text_high_confidence].include?(reason)
end
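One way the two predicates might be combined is to sort per-page results into buckets. The sketch below copies the two reason lists from the method bodies so it runs without the gem; the `triage` helper, the bucket names, and the sample page data are all hypothetical.

```ruby
# Reason lists copied from ok? and ocr_fallback? so the sketch is standalone.
CLEAN_REASONS = %i[ok native_text_high_confidence].freeze
OCR_FALLBACK_REASONS = %i[ocr_requested_but_unavailable ocr_low_confidence_fallback].freeze

# Hypothetical triage: the bucket names are illustrative only.
def triage(reason)
  return :accept if CLEAN_REASONS.include?(reason)
  return :retry_with_ocr if OCR_FALLBACK_REASONS.include?(reason)

  :review
end

# Illustrative per-page results, shaped like the hashes extract_text returns.
pages = [
  { page: 0, reason: :native_text_high_confidence },
  { page: 1, reason: :ocr_requested_but_unavailable },
  { page: 2, reason: :chart_not_transcribed }
]

buckets = pages.group_by { |entry| triage(entry[:reason]) }
```

Any reason that is neither clean nor an OCR fallback lands in the `:review` bucket, so a newly added reason is not silently accepted.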