Class: PdfOxide::PdfDocument
- Inherits: Object
- Inheritance chain: Object → PdfOxide::PdfDocument
- Defined in: lib/pdf_oxide/pdf_document.rb
Overview
The primary read-only entry point to a PDF.
Mirrors `fyi.oxide.pdf.PdfDocument`. Lifecycle: a PdfDocument owns native memory and **must be closed** when no longer in use. The idiomatic Ruby pattern is the block form `PdfDocument.open(path) do |doc| … end`, which closes automatically; for parity with the Java `AutoCloseable` contract, an explicit `#close` is also supported and is idempotent (a second call is a no-op, not a crash).
A `Finalizer` backstop frees leaked handles on GC; callers must not rely on it for timely cleanup.
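The lifecycle contract above can be sketched without the native library. `SketchDocument` below is a hypothetical stand-in, not part of PdfOxide: it only mirrors the shape of `.open` and `#close` as listed further down this page, and a counter takes the place of the native free call.

```ruby
# Hypothetical stand-in for PdfDocument: no native memory, just a counter
# that records how many times the "handle" was released.
class SketchDocument
  attr_reader :free_calls

  def self.open(source)
    doc = new(source)
    return doc unless block_given?

    begin
      yield doc
    ensure
      doc.close # runs even if the block raises
    end
  end

  def initialize(source)
    @source = source
    @closed = false
    @free_calls = 0
  end

  def close
    return if @closed # a second call is a no-op

    @closed = true
    @free_calls += 1
  end

  def closed?
    @closed
  end
end

# Block form: the document is closed when the block exits, and the
# block's value is what .open returns.
seen_in_block = nil
result = SketchDocument.open('report.pdf') do |doc|
  seen_in_block = doc
  :block_value
end

# Explicit form: the caller owns the close call.
manual = SketchDocument.open('report.pdf')
begin
  # ... work with manual ...
ensure
  manual.close
  manual.close # idempotent: the release still happens only once
end
```

With the real class, the block form is likely the safer default, since it removes the need to remember the `ensure`.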
Instance Attribute Summary
-
#path ⇒ String
readonly
Absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).
Class Method Summary
-
.extract_text(source, page: 0) ⇒ String
One-shot: open + extract page text + close.
-
.finalizer(tracker) ⇒ Object
private
Finalizer for GC cleanup.
-
.open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object
Open a PDF from disk or in-memory bytes.
Instance Method Summary
-
#authenticate(password) ⇒ Boolean
Authenticate against this document’s encryption.
-
#auto_extractor ⇒ AutoExtractor
Convenience accessor: get the configured AutoExtractor for this doc.
-
#close ⇒ Object
Free the native handle.
-
#closed? ⇒ Boolean
True after #close.
-
#encrypted? ⇒ Boolean
Whether this PDF carries an encryption dictionary.
-
#extract_text(page_index) ⇒ String
Extract plain text from a single page.
-
#extract_text_auto(page_index) ⇒ String
Auto-routed extraction for a single page (v0.3.51 #517).
-
#form_fields ⇒ Array<Hash>
AcroForm fields as an array of `{name:, value:, type:, page:}` hashes.
-
#handle ⇒ FFI::Pointer
Raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.
-
#initialize(source, password: nil) ⇒ PdfDocument
constructor
Open a PDF.
-
#open? ⇒ Boolean
True if #close has not been called.
-
#page(index) ⇒ PdfPage
A lightweight view of the page at `index`.
-
#page_count ⇒ Integer
Number of pages.
-
#pages ⇒ Array<PdfPage>
Every page in the document (eager).
-
#pdf_version ⇒ String
PDF version string (e.g. “1.7”).
-
#render(page_index, dpi: 150) ⇒ String
Render a single page to PNG bytes at the supplied DPI.
-
#search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>
Search this document.
-
#to_html(page_index = nil) ⇒ String
Convert the whole document to HTML, or a single page when `page_index` is given.
-
#to_markdown(page_index = nil) ⇒ String
Convert the whole document to Markdown, or a single page when `page_index` is given.
Constructor Details
#initialize(source, password: nil) ⇒ PdfDocument
Open a PDF. See `.open` for the block-form factory.

    # File 'lib/pdf_oxide/pdf_document.rb', line 65

    def initialize(source, password: nil)
      raise ::PdfOxide::ArgumentError, 'source cannot be nil' if source.nil?

      @path, @handle = open_native(source)
      @closed = false
      # Mutable tracker lets an explicit `#close` defuse the finalizer
      # so the GC pass doesn't double-free.
      @tracker = [@handle]
      ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))

      authenticate(password) if password
    end
Instance Attribute Details
#path ⇒ String (readonly)
Returns the absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).

    # File 'lib/pdf_oxide/pdf_document.rb', line 31

    def path
      @path
    end
Class Method Details
.extract_text(source, page: 0) ⇒ String
One-shot: open + extract page text + close.

    # File 'lib/pdf_oxide/pdf_document.rb', line 58

    def self.extract_text(source, page: 0)
      # rubocop:disable Security/Open — PdfDocument.open opens a PDF, not a process.
      open(source) { |d| d.extract_text(page) }
      # rubocop:enable Security/Open
    end
.finalizer(tracker) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Finalizer for GC cleanup. The mutable tracker lets explicit `#close` zero out the handle so a follow-up GC pass doesn't double-free (the cdylib's `pdf_document_free` is not idempotent on the same pointer).

    # File 'lib/pdf_oxide/pdf_document.rb', line 295

    def self.finalizer(tracker)
      proc do
        handle = tracker[0]
        if handle && !handle.null?
          Bindings.pdf_document_free(handle)
          tracker[0] = nil
        end
      end
    end
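The interplay between the tracker, `#close`, and the finalizer proc can be traced with plain Ruby objects. In this sketch the symbols `:handle_a` and `:handle_b` stand in for native pointers, `free_handle` is a hypothetical substitute for `Bindings.pdf_document_free`, and the finalizer is invoked by hand to simulate a GC pass rather than waiting for a real one.

```ruby
# Hypothetical stand-in for the native free function; it records each call
# so a double free would show up as a repeated entry.
freed = []
free_handle = ->(handle) { freed << handle }

# Same shape as .finalizer(tracker): the proc closes over the tracker array,
# never over the document object itself.
make_finalizer = lambda do |tracker|
  proc do
    handle = tracker[0]
    if handle
      free_handle.call(handle)
      tracker[0] = nil
    end
  end
end

# Case 1: explicit close happens first, then the GC-time proc runs.
tracker = [:handle_a]
finalizer = make_finalizer.call(tracker)
handle = tracker[0]
tracker[0] = nil          # what #close does to defuse the finalizer
free_handle.call(handle)  # the one real free
finalizer.call            # simulated GC pass: sees nil, does nothing

# Case 2: the caller forgot to close; the finalizer is the backstop.
leaked_tracker = [:handle_b]
leaked_finalizer = make_finalizer.call(leaked_tracker)
leaked_finalizer.call     # frees once
leaked_finalizer.call     # tracker already cleared, no second free
```

Each handle ends up in `freed` exactly once. Keeping the handle in a separate array also means the proc does not hold a reference to the document itself, which would otherwise keep it from being collected.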
.open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object
Open a PDF from disk or in-memory bytes.

    # File 'lib/pdf_oxide/pdf_document.rb', line 43

    def self.open(source, password: nil, &block)
      doc = new(source, password: password)
      return doc unless block_given?

      begin
        yield doc
      ensure
        doc.close
      end
    end
Instance Method Details
#authenticate(password) ⇒ Boolean
Authenticate against this document's encryption. As of v0.3.55 this returns `true` for unencrypted documents and `false` for every encrypted document, because the cdylib does not expose a stable unlock entry point (see the comment in the source below).

    # File 'lib/pdf_oxide/pdf_document.rb', line 91

    def authenticate(password)
      raise ::PdfOxide::ArgumentError, 'password cannot be nil' if password.nil?
      return true unless encrypted?

      # v0.3.55 cdylib doesn't expose a stable 3-arg unlock entry;
      # the legacy `pdf_document_unlock_with_password` is a phantom
      # (REMOVED) and `pdf_document_authenticate` only has the
      # 8-pointer placeholder shape. Return false on encrypted docs
      # rather than crash — Java's PdfDocument#authenticate has the
      # same fail-closed contract.
      false
    end
#auto_extractor ⇒ AutoExtractor
Convenience accessor: get the configured AutoExtractor for this doc.

    # File 'lib/pdf_oxide/pdf_document.rb', line 263

    def auto_extractor
      @auto_extractor ||= AutoExtractor.new(self)
    end
#close ⇒ Object
Free the native handle. Idempotent — calling more than once is a no-op, not a crash. Safe to call from an ensure block.

    # File 'lib/pdf_oxide/pdf_document.rb', line 269

    def close
      return if @closed

      h = @handle
      @handle = nil
      @closed = true
      # Defuse the finalizer (was @tracker[0] == @handle).
      @tracker[0] = nil if @tracker
      Bindings.pdf_document_free(h) if h && !h.null?
    end
#closed? ⇒ Boolean
Returns true after #close.

    # File 'lib/pdf_oxide/pdf_document.rb', line 286

    def closed?
      @closed
    end
#encrypted? ⇒ Boolean
Returns whether this PDF carries an encryption dictionary.

    # File 'lib/pdf_oxide/pdf_document.rb', line 123

    def encrypted?
      # bool pdf_document_is_encrypted(const PdfDocument *handle) — no err arg.
      # The cdylib silently swallowed the extra err pointer pre-v0.3.55, so
      # encryption-detection failures were never surfaced.
      Bindings.pdf_document_is_encrypted(handle)
    end
#extract_text(page_index) ⇒ String
Extract plain text from a single page.

    # File 'lib/pdf_oxide/pdf_document.rb', line 133

    def extract_text(page_index)
      validate_page_index(page_index)
      err = ::FFI::MemoryPointer.new(:int32)
      ptr = Bindings.pdf_document_extract_text(handle, page_index, err)
      raise_for_code(err.read_int32, 'extract_text')
      StringMarshaller.from_c_string(ptr) || ''
    end
#extract_text_auto(page_index) ⇒ String
Auto-routed extraction for a single page (v0.3.51 #517). Returns native text where present and OCR'd text for scanned regions when the `ocr` feature is available. When OCR is not available, it falls back gracefully to native text plus empty or partial text for scanned regions; this path never raises an “OCR unavailable” error.

    # File 'lib/pdf_oxide/pdf_document.rb', line 148

    def extract_text_auto(page_index)
      validate_page_index(page_index)
      err = ::FFI::MemoryPointer.new(:int32)
      ptr = Bindings.pdf_document_extract_text_auto(handle, page_index, err)
      raise_for_code(err.read_int32, 'extract_text_auto')
      StringMarshaller.from_c_string(ptr) || ''
    end
#form_fields ⇒ Array<Hash>
Returns AcroForm fields as an array of `{name:, value:, type:, page:}` hashes. v0.3.55 limitation: per-field `page` is -1 because pdf_oxide's form extractor doesn't yet surface per-field page placement; each field is identified by `name`. When the cdylib build lacks the form-extract accessor, returns `[]` rather than raising, since the simple-PDF case is “no form fields”.

    # File 'lib/pdf_oxide/pdf_document.rb', line 198

    def form_fields
      return [] unless Bindings.respond_to?(:pdf_document_get_form_fields)

      err = ::FFI::MemoryPointer.new(:int32)
      ptr = begin
        Bindings.pdf_document_get_form_fields(handle, err)
      rescue ::ArgumentError
        # Phantom 8-pointer skeleton — graceful empty.
        return []
      end
      raise_for_code(err.read_int32, 'form_fields')
      return [] if ptr.nil? || ptr.null?

      json = StringMarshaller.from_c_string(ptr) || ''
      return [] if json.empty?

      require 'json'
      arr = JSON.parse(json)
      Array(arr).map do |f|
        {
          name: f['name'],
          value: f['value'],
          type: f['type'],
          page: f.fetch('page', -1)
        }
      end
    rescue JSON::ParserError
      []
    end
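The JSON-to-hash mapping at the end of `#form_fields` can be tried in isolation. The payload below is invented for illustration: the JSON the cdylib actually emits is not shown on this page, so only the four keys read by the mapping code are assumed. `map_form_fields` is a hypothetical helper name.

```ruby
require 'json'

# Invented sample payload. The first field omits "page" to exercise the
# -1 default; the second carries an explicit page to show it is passed through.
payload = '[{"name":"email","value":"a@example.com","type":"text"},' \
          '{"name":"agree","value":"true","type":"checkbox","page":2}]'

# Hypothetical helper reproducing only the parsing half of #form_fields.
def map_form_fields(json)
  return [] if json.nil? || json.empty?

  Array(JSON.parse(json)).map do |f|
    { name: f['name'], value: f['value'], type: f['type'], page: f.fetch('page', -1) }
  end
rescue JSON::ParserError
  [] # malformed payloads degrade to "no form fields"
end

fields = map_form_fields(payload)
by_name = fields.to_h { |f| [f[:name], f] }
```

Since `page` may be -1, callers would probably do better to look fields up by `:name`, as `by_name` does here, than to group them by page.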
#handle ⇒ FFI::Pointer
Returns the raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

    # File 'lib/pdf_oxide/pdf_document.rb', line 82

    def handle
      raise InvalidStateError, 'PdfDocument has been closed' if @closed || @handle.nil?

      @handle
    end
#open? ⇒ Boolean
Returns true if #close has not been called.

    # File 'lib/pdf_oxide/pdf_document.rb', line 281

    def open?
      !@closed
    end
#page(index) ⇒ PdfPage
Returns a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.

    # File 'lib/pdf_oxide/pdf_document.rb', line 250

    def page(index)
      validate_page_index(index)
      PdfPage.new(self, index)
    end
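A minimal sketch of the borrow relationship, assuming a page keeps only a reference to its document plus an index and reaches the data through a closed-state guard like the one in `#handle`. `SketchDoc`, `SketchPage`, and `SketchStateError` are hypothetical stand-ins; the real `PdfPage` internals are not shown on this page.

```ruby
# Hypothetical stand-ins; only the borrow relationship mirrors this page.
class SketchStateError < StandardError; end

class SketchDoc
  def initialize(texts)
    @texts = texts
    @closed = false
  end

  def close
    @closed = true
  end

  # Every read goes through this guard, in the spirit of #handle.
  def text_at(index)
    raise SketchStateError, 'document has been closed' if @closed

    @texts.fetch(index)
  end

  def page(index)
    SketchPage.new(self, index)
  end
end

# A page holds a reference to its document and an index, nothing else.
class SketchPage
  def initialize(doc, index)
    @doc = doc
    @index = index
  end

  def text
    @doc.text_at(@index)
  end
end

doc = SketchDoc.new(['first page', 'second page'])
page = doc.page(1)
before_close = page.text

doc.close
after_close = begin
  page.text
rescue SketchStateError => e
  e.message
end
```

Under this assumption a page object is cheap to create, but any value needed after the document closes has to be read out first.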
#page_count ⇒ Integer
Returns the number of pages.

    # File 'lib/pdf_oxide/pdf_document.rb', line 105

    def page_count
      err = ::FFI::MemoryPointer.new(:int32)
      n = Bindings.pdf_document_get_page_count(handle, err)
      raise_for_code(err.read_int32, 'page_count')
      n
    end
#pages ⇒ Array<PdfPage>
Returns every page in the document (eager).

    # File 'lib/pdf_oxide/pdf_document.rb', line 256

    def pages
      n = page_count
      Array.new(n) { |i| PdfPage.new(self, i) }
    end
#pdf_version ⇒ String
Returns the PDF version string (e.g. “1.7”).

    # File 'lib/pdf_oxide/pdf_document.rb', line 113

    def pdf_version
      maj = ::FFI::MemoryPointer.new(:uint8)
      min = ::FFI::MemoryPointer.new(:uint8)
      Bindings.pdf_document_get_version(handle, maj, min)
      "#{maj.read_uint8}.#{min.read_uint8}"
    rescue ::FFI::NotFoundError
      'unknown'
    end
#render(page_index, dpi: 150) ⇒ String
Render a single page to PNG bytes at the supplied DPI.

    # File 'lib/pdf_oxide/pdf_document.rb', line 232

    def render(page_index, dpi: 150)
      validate_page_index(page_index)
      err = ::FFI::MemoryPointer.new(:int32)
      img_ptr = Bindings.pdf_render_page_zoom(handle, page_index, dpi.to_f / 72.0, 0, err)
      raise_for_code(err.read_int32, 'render')
      raise InternalError, 'render returned null' if img_ptr.nil? || img_ptr.null?

      # Read length + bytes via rendered image helpers. The cdylib
      # exposes `pdf_oxide_rendered_image_*` accessors; the simpler
      # path is the byte-buffer accessor introduced for v0.3.5x.
      bytes = read_rendered_image_bytes(img_ptr)
      Bindings.pdf_rendered_image_free(img_ptr) if Bindings.respond_to?(:pdf_rendered_image_free)
      bytes.force_encoding(Encoding::BINARY)
    end
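The `dpi.to_f / 72.0` expression in `#render` converts a DPI into a zoom factor, since PDF user space defaults to 72 units per inch. The arithmetic is worked through below; `zoom_for` and `pixel_size` are hypothetical helper names, and the pixel dimensions assume the renderer scales linearly and rounds to whole pixels, which this page does not confirm.

```ruby
# PDF user space is 72 points per inch, so zoom = dpi / 72.
def zoom_for(dpi)
  dpi.to_f / 72.0
end

# Assumption: output size is the page size in points times the zoom factor,
# rounded to whole pixels. The cdylib's actual rounding is not shown.
def pixel_size(width_pt, height_pt, dpi)
  zoom = zoom_for(dpi)
  [(width_pt * zoom).round, (height_pt * zoom).round]
end

letter_at_default = pixel_size(612, 792, 150) # US Letter at the default DPI
letter_at_72 = pixel_size(612, 792, 72)       # zoom 1.0: one pixel per point
double_zoom = zoom_for(144)
```

Under these assumptions a US Letter page at the default 150 DPI comes out at roughly 1275 by 1650 pixels, which helps when estimating the memory a batch render might need.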
#search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>
Search this document.

    # File 'lib/pdf_oxide/pdf_document.rb', line 176

    def search(query, case_sensitive: false, regex: false)
      raise ::PdfOxide::ArgumentError, 'query cannot be nil' if query.nil?
      raise UnsupportedFeatureError, 'regex search not supported by this cdylib build' \
        if regex && !Bindings.respond_to?(:pdf_document_search_regex)

      err = ::FFI::MemoryPointer.new(:int32)
      query_utf8 = StringMarshaller.to_utf8(query)
      results = if regex
                  Bindings.pdf_document_search_regex(handle, query_utf8, case_sensitive, err)
                else
                  Bindings.pdf_document_search_all(handle, query_utf8, case_sensitive, err)
                end
      raise_for_code(err.read_int32, 'search')
      parse_search_results(results)
    end
#to_html(page_index = nil) ⇒ String
Convert the whole document to HTML, or a single page when `page_index` is given.

    # File 'lib/pdf_oxide/pdf_document.rb', line 166

    def to_html(page_index = nil)
      page_index.nil? ? MarkdownConverter.to_html(self) : MarkdownConverter.to_html(self, page_index)
    end
#to_markdown(page_index = nil) ⇒ String
Convert the whole document to Markdown, or a single page when `page_index` is given.

    # File 'lib/pdf_oxide/pdf_document.rb', line 159

    def to_markdown(page_index = nil)
      page_index.nil? ? MarkdownConverter.to_markdown(self) : MarkdownConverter.to_markdown(self, page_index)
    end