Class: PdfOxide::PdfDocument
- Inherits: Object
  - Object
    - PdfOxide::PdfDocument
- Defined in: lib/pdf_oxide/pdf_document.rb
Overview
The primary read-only entry point to a PDF.
Mirrors `fyi.oxide.pdf.PdfDocument`.
Lifecycle: a PdfDocument owns native memory and **must be closed** when no longer in use. The idiomatic Ruby pattern is the block form `PdfDocument.open(path) do |doc| … end`, which closes automatically. For parity with the Java `AutoCloseable` contract, an explicit `#close` is also supported and is idempotent (a second call is a no-op, not a crash).
A `Finalizer` backstop frees leaked handles on GC; callers must not rely on it for timely cleanup.
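The two lifecycle styles described above can be sketched as a pair of small helpers. This is an illustration only: `first_page_text` and `first_page_text_explicit` are hypothetical names, and `document_class` stands for any class that honours the `open`/`close` contract documented here (in real use, presumably `PdfOxide::PdfDocument`).

```ruby
# Block form: `open` yields the document, closes it when the block exits,
# and returns the block's value.
def first_page_text(document_class, source)
  document_class.open(source) { |doc| doc.extract_text(0) }
end

# Explicit form: the caller owns the document and closes it in `ensure`,
# so the native handle is released even if extraction raises.
def first_page_text_explicit(document_class, source)
  doc = document_class.open(source)
  begin
    doc.extract_text(0)
  ensure
    doc.close # documented as idempotent, so an extra close elsewhere is harmless
  end
end

# Assumed usage (the file name is a placeholder):
#   first_page_text(PdfOxide::PdfDocument, 'report.pdf')
```

Both helpers do the same work; the block form is shorter and makes it harder to forget the close.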
Instance Attribute Summary
- #path ⇒ String (readonly)
  Absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).
Class Method Summary
- .extract_text(source, page: 0) ⇒ String
  One-shot: open + extract page text + close.
- .finalizer(tracker) ⇒ Object (private)
  Finalizer for GC cleanup.
- .open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object
  Open a PDF from disk or in-memory bytes.
Instance Method Summary
- #authenticate(password) ⇒ Boolean
  Authenticate against this document’s encryption.
- #auto_extractor ⇒ AutoExtractor
  Convenience accessor: get the configured AutoExtractor for this doc.
- #close ⇒ Object
  Free the native handle.
- #closed? ⇒ Boolean
  True after #close.
- #encrypted? ⇒ Boolean
  Whether this PDF carries an encryption dictionary.
- #extract_structured(page) ⇒ Hash
  Extract a structured representation of a single page (#536).
- #extract_text(page_index) ⇒ String
  Extract plain text from a single page.
- #extract_text_auto(page_index) ⇒ String
  Auto-routed extraction for a single page (v0.3.51 #517).
- #form_fields ⇒ Array<Hash>
  AcroForm fields as an array of `{ name:, value:, type:, page: }` hashes.
- #handle ⇒ FFI::Pointer
  Raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.
- #initialize(source, password: nil) ⇒ PdfDocument (constructor)
  Open a PDF.
- #open? ⇒ Boolean
  True if #close has not been called.
- #page(index) ⇒ PdfPage
  A lightweight view of the page at `index`.
- #page_count ⇒ Integer
  Number of pages.
- #pages ⇒ Array<PdfPage>
  Every page in the document (eager).
- #pdf_version ⇒ String
  PDF version string (e.g. “1.7”).
- #render(page_index, dpi: 150) ⇒ String
  Render a single page to PNG bytes at the supplied DPI.
- #search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>
  Search this document.
- #to_html(page_index = nil) ⇒ String
  Convert one page to HTML.
- #to_markdown(page_index = nil) ⇒ String
  Convert one page to Markdown.
Constructor Details
#initialize(source, password: nil) ⇒ PdfDocument
Open a PDF. See open for the block-form factory.
    # File 'lib/pdf_oxide/pdf_document.rb', line 65

    def initialize(source, password: nil)
      raise ::PdfOxide::ArgumentError, 'source cannot be nil' if source.nil?

      @path, @handle = open_native(source)
      @closed = false
      # Mutable tracker lets an explicit `#close` defuse the finalizer
      # so the GC pass doesn't double-free.
      @tracker = [@handle]
      ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))
      authenticate(password) if password
    end
Instance Attribute Details
#path ⇒ String (readonly)
Returns the absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).

    # File 'lib/pdf_oxide/pdf_document.rb', line 31

    def path
      @path
    end
Class Method Details
.extract_text(source, page: 0) ⇒ String
One-shot: open + extract page text + close.
    # File 'lib/pdf_oxide/pdf_document.rb', line 58

    def self.extract_text(source, page: 0)
      # rubocop:disable Security/Open -- PdfDocument.open opens a PDF, not a process.
      open(source) { |d| d.extract_text(page) }
      # rubocop:enable Security/Open
    end
.finalizer(tracker) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Finalizer for GC cleanup. The mutable tracker lets explicit `#close` zero out the handle so a follow-up GC pass doesn’t double-free (the cdylib’s `pdf_document_free` is not idempotent on the same pointer).

    # File 'lib/pdf_oxide/pdf_document.rb', line 312

    def self.finalizer(tracker)
      proc do
        handle = tracker[0]
        if handle && !handle.null?
          Bindings.pdf_document_free(handle)
          tracker[0] = nil
        end
      end
    end
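The tracker idea can be tried without any native code. The sketch below is a stand-in, not the library's implementation: `TrackedResource` and `FREED` are hypothetical names, a symbol plays the role of the native pointer, and "freeing" just appends to an array so that a double free would be visible.

```ruby
# Records every "free" call; a handle appearing twice would be a double free.
FREED = []

class TrackedResource
  # The proc closes over the tracker array only, never over the object
  # itself, so the object remains eligible for garbage collection.
  def self.finalizer(tracker)
    proc do
      handle = tracker[0]
      if handle
        FREED << handle
        tracker[0] = nil
      end
    end
  end

  def initialize(handle)
    @tracker = [handle]
    ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))
  end

  # Explicit release: empty the tracker first so a later finalizer run
  # finds nothing left to free.
  def close
    handle = @tracker[0]
    return unless handle

    @tracker[0] = nil
    FREED << handle
  end
end

# Simulate two cleanup passes over the same tracker by calling the proc by hand.
tracker = [:handle_a]
cleanup = TrackedResource.finalizer(tracker)
cleanup.call # frees :handle_a and clears the slot
cleanup.call # slot is already nil, so nothing is freed again
```

Sharing a one-element array rather than the handle itself is what lets two independent code paths agree on whether the free has already happened.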
.open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object
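Open a PDF from disk or in-memory bytes. With a block, yields the document, closes it when the block exits (even if the block raises), and returns the block’s value. Without a block, returns the open document, which the caller must close.

    # File 'lib/pdf_oxide/pdf_document.rb', line 43

    def self.open(source, password: nil, &block)
      doc = new(source, password: password)
      return doc unless block_given?

      begin
        yield doc
      ensure
        doc.close
      end
    end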
Instance Method Details
#authenticate(password) ⇒ Boolean
Authenticate against this document’s encryption. As of v0.3.55 the method is fail-closed: it returns `true` for unencrypted documents and `false` for encrypted ones, because the cdylib exposes no stable unlock entry point. Raises `PdfOxide::ArgumentError` when `password` is nil.

    # File 'lib/pdf_oxide/pdf_document.rb', line 91

    def authenticate(password)
      raise ::PdfOxide::ArgumentError, 'password cannot be nil' if password.nil?
      return true unless encrypted?

      # v0.3.55 cdylib doesn't expose a stable 3-arg unlock entry;
      # the legacy `pdf_document_unlock_with_password` is a phantom
      # (REMOVED) and `pdf_document_authenticate` only has the
      # 8-pointer placeholder shape. Return false on encrypted docs
      # rather than crash; Java's PdfDocument#authenticate has the
      # same fail-closed contract.
      false
    end
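Because `#authenticate` reports failure by returning `false` rather than raising, callers may want a guard that turns that into an explicit error before extracting anything. A minimal sketch, assuming only that `doc` responds to `#encrypted?` and `#authenticate`; the helper name `ensure_readable!` and its error messages are hypothetical.

```ruby
# Returns the document when its content can be read, otherwise raises.
def ensure_readable!(doc, password: nil)
  return doc unless doc.encrypted?
  raise ArgumentError, 'document is encrypted; a password is required' if password.nil?

  # With the v0.3.55 behaviour shown above, authenticate returns false for
  # every encrypted document, so this branch is the expected outcome there.
  raise 'could not unlock encrypted document' unless doc.authenticate(password)

  doc
end
```

Checking `encrypted?` first keeps the nil-password `ArgumentError` from `#authenticate` out of the unencrypted path.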
#auto_extractor ⇒ AutoExtractor
Convenience accessor: get the configured AutoExtractor for this doc. The instance is created on first use and memoized.

    # File 'lib/pdf_oxide/pdf_document.rb', line 280

    def auto_extractor
      @auto_extractor ||= AutoExtractor.new(self)
    end
#close ⇒ Object
Free the native handle. Idempotent: calling more than once is a no-op, not a crash. Safe to call from an ensure block.

    # File 'lib/pdf_oxide/pdf_document.rb', line 286

    def close
      return if @closed

      h = @handle
      @handle = nil
      @closed = true
      # Defuse the finalizer (was @tracker[0] == @handle).
      @tracker[0] = nil if @tracker
      Bindings.pdf_document_free(h) if h && !h.null?
    end
#closed? ⇒ Boolean
Returns true after #close.
    # File 'lib/pdf_oxide/pdf_document.rb', line 303

    def closed?
      @closed
    end
#encrypted? ⇒ Boolean
Returns whether this PDF carries an encryption dictionary.
    # File 'lib/pdf_oxide/pdf_document.rb', line 123

    def encrypted?
      # bool pdf_document_is_encrypted(const PdfDocument *handle); no err arg.
      # The cdylib silently swallowed the extra err pointer pre-v0.3.55, so
      # encryption-detection failures were never surfaced.
      Bindings.pdf_document_is_encrypted(handle)
    end
#extract_structured(page) ⇒ Hash
Extract a structured representation of a single page (#536). Returns the parsed `StructuredPage` JSON as a Hash: `{ "page_index", "page_width", "page_height", "regions" => [ { "kind", "text", "bbox", "spans", "column_index" } ] }`.

    # File 'lib/pdf_oxide/pdf_document.rb', line 147

    def extract_structured(page)
      validate_page_index(page)
      err = ::FFI::MemoryPointer.new(:int32)
      ptr = Bindings.pdf_document_extract_structured_to_json(handle, page, err)
      raise_for_code(err.read_int32, 'extract_structured')
      json = StringMarshaller.from_c_string(ptr) || ''
      require 'json'
      JSON.parse(json)
    end
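Since the result is a plain Hash with string keys, ordinary Enumerable methods are enough to work with it. The sketch below groups region text by `"kind"`. It relies only on the `"regions"`, `"kind"` and `"text"` keys named above; the helper name `text_by_kind`, the sample data, and the specific kind values are invented for illustration.

```ruby
# Groups the text of each region under its "kind".
def text_by_kind(structured)
  structured.fetch('regions', [])
            .group_by { |region| region['kind'] }
            .transform_values { |regions| regions.map { |region| region['text'] } }
end

# Hand-written sample in the documented shape. Only the keys used by the
# helper are filled in; the kind names here are assumptions.
sample = {
  'page_index' => 0,
  'regions' => [
    { 'kind' => 'heading',   'text' => 'Quarterly Report' },
    { 'kind' => 'paragraph', 'text' => 'Revenue grew.' },
    { 'kind' => 'paragraph', 'text' => 'Costs fell.' }
  ]
}

grouped = text_by_kind(sample)
```

Using `fetch('regions', [])` keeps the helper usable on a page whose hash carries no regions.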
#extract_text(page_index) ⇒ String
Extract plain text from a single page.
    # File 'lib/pdf_oxide/pdf_document.rb', line 133

    def extract_text(page_index)
      validate_page_index(page_index)
      err = ::FFI::MemoryPointer.new(:int32)
      ptr = Bindings.pdf_document_extract_text(handle, page_index, err)
      raise_for_code(err.read_int32, 'extract_text')
      StringMarshaller.from_c_string(ptr) || ''
    end
#extract_text_auto(page_index) ⇒ String
Auto-routed extraction for a single page (v0.3.51 #517). Returns native text where present, and OCR’d text for scanned regions when the `ocr` feature is available. When OCR is not available, it falls back gracefully to the native text, which may be empty or partial, and never raises an “OCR unavailable” error on this path.

    # File 'lib/pdf_oxide/pdf_document.rb', line 165

    def extract_text_auto(page_index)
      validate_page_index(page_index)
      err = ::FFI::MemoryPointer.new(:int32)
      ptr = Bindings.pdf_document_extract_text_auto(handle, page_index, err)
      raise_for_code(err.read_int32, 'extract_text_auto')
      StringMarshaller.from_c_string(ptr) || ''
    end
#form_fields ⇒ Array<Hash>
Returns AcroForm fields as an array of `{ name:, value:, type:, page: }` hashes. v0.3.55 limitation: per-field `page` is -1 because pdf_oxide’s form extractor doesn’t yet surface per-field page placement; each field is identified by `name`. When the cdylib build lacks the form-extract accessor, returns `[]` rather than raising, since the simple-PDF case is “no form fields”.

    # File 'lib/pdf_oxide/pdf_document.rb', line 215

    def form_fields
      return [] unless Bindings.respond_to?(:pdf_document_get_form_fields)

      err = ::FFI::MemoryPointer.new(:int32)
      ptr = begin
        Bindings.pdf_document_get_form_fields(handle, err)
      rescue ::ArgumentError
        # Phantom 8-pointer skeleton: graceful empty.
        return []
      end
      raise_for_code(err.read_int32, 'form_fields')
      return [] if ptr.nil? || ptr.null?

      json = StringMarshaller.from_c_string(ptr) || ''
      return [] if json.empty?

      require 'json'
      arr = JSON.parse(json)
      Array(arr).map do |f|
        {
          name: f['name'],
          value: f['value'],
          type: f['type'],
          page: f.fetch('page', -1)
        }
      end
    rescue JSON::ParserError
      []
    end
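Since fields are identified by `name`, a name-to-value lookup is often the most convenient view of the result. A small sketch over hashes in the shape built by the `map` block above; the helper names and the sample field data (including the `type` strings) are hypothetical.

```ruby
# Builds a { name => value } lookup from #form_fields-style hashes.
def field_values(fields)
  fields.to_h { |field| [field[:name], field[:value]] }
end

# Selects fields whose page placement is unknown (reported as -1).
def unplaced_fields(fields)
  fields.select { |field| field[:page] == -1 }
end

# Sample data in the documented shape; names, values and types are made up.
fields = [
  { name: 'full_name', value: 'Ada Lovelace', type: 'text',     page: -1 },
  { name: 'subscribe', value: 'Yes',          type: 'checkbox', page: -1 }
]

values = field_values(fields)
```

An empty array, which is what the method returns when the accessor is missing, simply yields an empty lookup.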
#handle ⇒ FFI::Pointer
Returns the raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls. Raises `InvalidStateError` once the document has been closed.

    # File 'lib/pdf_oxide/pdf_document.rb', line 82

    def handle
      raise InvalidStateError, 'PdfDocument has been closed' if @closed || @handle.nil?

      @handle
    end
#open? ⇒ Boolean
Returns true if #close has not been called.
    # File 'lib/pdf_oxide/pdf_document.rb', line 298

    def open?
      !@closed
    end
#page(index) ⇒ PdfPage
Returns a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.

    # File 'lib/pdf_oxide/pdf_document.rb', line 267

    def page(index)
      validate_page_index(index)
      PdfPage.new(self, index)
    end
#page_count ⇒ Integer
Returns the number of pages.

    # File 'lib/pdf_oxide/pdf_document.rb', line 105

    def page_count
      err = ::FFI::MemoryPointer.new(:int32)
      n = Bindings.pdf_document_get_page_count(handle, err)
      raise_for_code(err.read_int32, 'page_count')
      n
    end
#pages ⇒ Array<PdfPage>
Returns every page in the document (eager).
    # File 'lib/pdf_oxide/pdf_document.rb', line 273

    def pages
      n = page_count
      Array.new(n) { |i| PdfPage.new(self, i) }
    end
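`#pages` builds the whole array up front. When only the first few pages are needed, pages can instead be produced on demand from `#page_count` and `#page`. A minimal sketch, assuming `doc` responds to those two methods; `lazy_pages` is a hypothetical helper, not part of the library.

```ruby
# Returns an Enumerator that creates each page view only when it is requested.
def lazy_pages(doc)
  Enumerator.new do |yielder|
    doc.page_count.times { |index| yielder << doc.page(index) }
  end
end

# Assumed usage: take the first two pages without touching the rest.
#   lazy_pages(doc).first(2)
```

As with `#page`, the yielded views borrow from the document, so they should be consumed before it is closed.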
#pdf_version ⇒ String
Returns the PDF version string (e.g. “1.7”), or `'unknown'` when the version accessor is not found in the cdylib.

    # File 'lib/pdf_oxide/pdf_document.rb', line 113

    def pdf_version
      maj = ::FFI::MemoryPointer.new(:uint8)
      min = ::FFI::MemoryPointer.new(:uint8)
      Bindings.pdf_document_get_version(handle, maj, min)
      "#{maj.read_uint8}.#{min.read_uint8}"
    rescue ::FFI::NotFoundError
      'unknown'
    end
#render(page_index, dpi: 150) ⇒ String
Render a single page to PNG bytes at the supplied DPI. The DPI is converted to a zoom factor of `dpi / 72.0` before the native call, and the result is returned as a binary-encoded String.

    # File 'lib/pdf_oxide/pdf_document.rb', line 249

    def render(page_index, dpi: 150)
      validate_page_index(page_index)
      err = ::FFI::MemoryPointer.new(:int32)
      img_ptr = Bindings.pdf_render_page_zoom(handle, page_index, dpi.to_f / 72.0, 0, err)
      raise_for_code(err.read_int32, 'render')
      raise InternalError, 'render returned null' if img_ptr.nil? || img_ptr.null?

      # Read length + bytes via rendered image helpers. The cdylib
      # exposes `pdf_oxide_rendered_image_*` accessors; the simpler
      # path is the byte-buffer accessor introduced for v0.3.5x.
      bytes = read_rendered_image_bytes(img_ptr)
      Bindings.pdf_rendered_image_free(img_ptr) if Bindings.respond_to?(:pdf_rendered_image_free)
      bytes.force_encoding(Encoding::BINARY)
    end
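Two details of the listing are easy to check in isolation: the DPI-to-zoom arithmetic (`dpi.to_f / 72.0`) and the fact that the returned String is raw binary, so it should be written in binary mode. The helpers below are a sketch with hypothetical names; the only outside fact they rely on is the standard 8-byte PNG file signature.

```ruby
# The standard PNG file signature: 89 50 4E 47 0D 0A 1A 0A.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Mirrors the conversion in the listing: 72 DPI corresponds to zoom 1.0.
def zoom_for_dpi(dpi)
  dpi.to_f / 72.0
end

# True when the byte string starts with the PNG signature.
def png?(bytes)
  bytes.b.start_with?(PNG_SIGNATURE)
end

# Writes rendered bytes in binary mode, refusing data that is not PNG.
def save_png(bytes, path)
  raise ArgumentError, 'data does not look like a PNG' unless png?(bytes)

  File.binwrite(path, bytes)
end

# Assumed usage (the output path is a placeholder):
#   save_png(doc.render(0, dpi: 144), 'page-1.png')
```

`File.binwrite` avoids newline translation and encoding conversion, which would otherwise corrupt image data on some platforms.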
#search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>
Search this document for `query`. Raises `PdfOxide::ArgumentError` when `query` is nil, and `UnsupportedFeatureError` when `regex: true` is requested but the cdylib build does not provide regex search.

    # File 'lib/pdf_oxide/pdf_document.rb', line 193

    def search(query, case_sensitive: false, regex: false)
      raise ::PdfOxide::ArgumentError, 'query cannot be nil' if query.nil?
      raise UnsupportedFeatureError, 'regex search not supported by this cdylib build' \
        if regex && !Bindings.respond_to?(:pdf_document_search_regex)

      err = ::FFI::MemoryPointer.new(:int32)
      query_utf8 = StringMarshaller.to_utf8(query)
      results = if regex
                  Bindings.pdf_document_search_regex(handle, query_utf8, case_sensitive, err)
                else
                  Bindings.pdf_document_search_all(handle, query_utf8, case_sensitive, err)
                end
      raise_for_code(err.read_int32, 'search')
      parse_search_results(results)
    end
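Because regex support depends on the cdylib build, a caller may prefer to degrade to a literal search instead of failing. The sketch below is one way to do that under stated assumptions: `search_with_fallback` is a hypothetical helper, and the error class is passed in as a parameter because its fully qualified name is not shown in this listing.

```ruby
# Tries a regex search first; if the build reports regex as unsupported,
# retries with a caller-supplied literal query.
def search_with_fallback(doc, pattern, literal, unsupported_error:)
  doc.search(pattern, regex: true)
rescue unsupported_error
  doc.search(literal)
end

# Assumed usage (the error constant's namespace is a guess):
#   search_with_fallback(doc, 'inv(oice)?', 'invoice',
#                        unsupported_error: PdfOxide::UnsupportedFeatureError)
```

The literal query is a separate argument because a regex pattern rarely makes sense as a plain search string.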
#to_html(page_index = nil) ⇒ String
Convert one page to HTML. When `page_index` is nil (the default), the call is delegated to `MarkdownConverter.to_html(self)` with no page argument.

    # File 'lib/pdf_oxide/pdf_document.rb', line 183

    def to_html(page_index = nil)
      page_index.nil? ? MarkdownConverter.to_html(self) : MarkdownConverter.to_html(self, page_index)
    end
#to_markdown(page_index = nil) ⇒ String
Convert one page to Markdown. When `page_index` is nil (the default), the call is delegated to `MarkdownConverter.to_markdown(self)` with no page argument.

    # File 'lib/pdf_oxide/pdf_document.rb', line 176

    def to_markdown(page_index = nil)
      page_index.nil? ? MarkdownConverter.to_markdown(self) : MarkdownConverter.to_markdown(self, page_index)
    end