Class: PdfOxide::PdfDocument

Inherits:
Object
Defined in:
lib/pdf_oxide/pdf_document.rb

Overview

The primary read-only entry point to a PDF.

Mirrors `fyi.oxide.pdf.PdfDocument`. Lifecycle: a PdfDocument owns native memory and **must be closed** when no longer in use. The idiomatic Ruby pattern is the block form `PdfDocument.open(path) do |doc| … end`, which closes automatically; for parity with the Java `AutoCloseable` contract, an explicit `#close` is also supported and is idempotent (a second call is a no-op, not a crash).

A `Finalizer` backstop frees leaked handles on GC; callers must not rely on it for timely cleanup.

Examples:

block form (recommended)

PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
  puts doc.extract_text(0)
end

explicit close

doc = PdfOxide::PdfDocument.open('invoice.pdf')
begin
  puts doc.extract_text(0)
ensure
  doc.close
end

Constructor Details

#initialize(source, password: nil) ⇒ PdfDocument

Open a PDF. See .open for the block-form factory.



# File 'lib/pdf_oxide/pdf_document.rb', line 65

def initialize(source, password: nil)
  raise ::PdfOxide::ArgumentError, 'source cannot be nil' if source.nil?

  @path, @handle = open_native(source)
  @closed = false
  # Mutable tracker lets an explicit `#close` defuse the finalizer
  # so the GC pass doesn't double-free.
  @tracker = [@handle]
  ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))

  authenticate(password) if password
end

Instance Attribute Details

#path ⇒ String (readonly)

Returns absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).

Returns:

  • (String)

    absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).



# File 'lib/pdf_oxide/pdf_document.rb', line 31

def path
  @path
end

Class Method Details

.extract_text(source, page: 0) ⇒ String

One-shot: open + extract page text + close.

Parameters:

  • source (String)

    path or bytes (see .open).

  • page (Integer) (defaults to: 0)

    0-based page index (default 0).

Returns:

  • (String)

    extracted text.



# File 'lib/pdf_oxide/pdf_document.rb', line 58

def self.extract_text(source, page: 0)
  # rubocop:disable Security/Open — PdfDocument.open opens a PDF, not a process.
  open(source) { |d| d.extract_text(page) }
  # rubocop:enable Security/Open
end

.finalizer(tracker) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Finalizer for GC cleanup. The mutable tracker lets explicit `#close` zero out the handle so a follow-up GC pass doesn’t double-free (the cdylib’s `pdf_document_free` is not idempotent on the same pointer).



# File 'lib/pdf_oxide/pdf_document.rb', line 312

def self.finalizer(tracker)
  proc do
    handle = tracker[0]
    if handle && !handle.null?
      Bindings.pdf_document_free(handle)
      tracker[0] = nil
    end
  end
end
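
The tracker pattern can be tried in isolation. The sketch below is not part of pdf_oxide: `FakeNative` and `Resource` are hypothetical stand-ins for `Bindings` and `PdfDocument`, and the finalizer proc is invoked by hand to simulate a later GC pass.

```ruby
# Hypothetical stand-in for the native library: counts frees per handle.
module FakeNative
  @freed = Hash.new(0)

  class << self
    attr_reader :freed

    def free(handle)
      @freed[handle] += 1
    end
  end
end

# Hypothetical stand-in for a class that owns a native handle.
class Resource
  # Built in a class method so the proc closes over the tracker only,
  # not over the instance (which would keep the object alive).
  def self.finalizer(tracker)
    proc do
      handle = tracker[0]
      if handle
        FakeNative.free(handle)
        tracker[0] = nil
      end
    end
  end

  attr_reader :tracker

  def initialize(handle)
    @handle = handle
    @closed = false
    @tracker = [handle]
    ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))
  end

  def close
    return if @closed

    h = @handle
    @handle = nil
    @closed = true
    @tracker[0] = nil # defuse the finalizer before freeing
    FakeNative.free(h) if h
  end
end

res = Resource.new(:handle_a)
res.close
res.close                             # second call is a no-op
Resource.finalizer(res.tracker).call  # simulated GC pass finds nothing to free
puts FakeNative.freed[:handle_a]      # prints 1
```

Because `#close` and the finalizer share one mutable array, whichever runs first clears the slot and the other sees `nil`.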

.open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object

Open a PDF from disk or in-memory bytes.

Parameters:

  • source (String)

    either a filesystem path or raw PDF bytes (auto-detected via `%PDF-` magic on BINARY-encoded input).

  • password (String, nil) (defaults to: nil)

    optional password for encrypted PDFs.

Yields:

  • (PdfDocument)

    the opened document; it is closed when the block returns or raises.

Returns:

  • (PdfDocument, Object)

    the document, or the block’s return value.

Raises:

  • (PdfOxide::ArgumentError)

    if source is nil.



# File 'lib/pdf_oxide/pdf_document.rb', line 43

def self.open(source, password: nil, &block)
  doc = new(source, password: password)
  return doc unless block_given?

  begin
    yield doc
  ensure
    doc.close
  end
end
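
The private `open_native` helper that tells paths from bytes is not shown on this page. As a rough sketch of the rule described for `source`, and assuming the check is simply "BINARY encoding plus a `%PDF-` prefix", a hypothetical helper `pdf_bytes?` might look like this:

```ruby
PDF_MAGIC = '%PDF-'.b # the five-byte header every PDF file starts with

# Hypothetical helper (not part of pdf_oxide): true when `source` looks
# like raw PDF bytes rather than a filesystem path.
def pdf_bytes?(source)
  source.encoding == Encoding::BINARY && source.start_with?(PDF_MAGIC)
end

puts pdf_bytes?('invoice.pdf')        # a path string
puts pdf_bytes?("%PDF-1.7\n...".b)    # BINARY-encoded bytes
```

Under this assumption, bytes read with `File.binread` (which returns a BINARY string) would be treated as document content, while a UTF-8 string would be treated as a path even if it happened to begin with `%PDF-`.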

Instance Method Details

#authenticate(password) ⇒ Boolean

Authenticate against this document’s encryption.

Parameters:

  • password (String)

Returns:

  • (Boolean)

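    true on success or for an unencrypted document; false on a wrong password. In the v0.3.55 source shown here, every encrypted document returns false because the cdylib exposes no stable unlock entry point.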

Raises:

  • (PdfOxide::ArgumentError)

    if password is nil.



# File 'lib/pdf_oxide/pdf_document.rb', line 91

def authenticate(password)
  raise ::PdfOxide::ArgumentError, 'password cannot be nil' if password.nil?
  return true unless encrypted?

  # v0.3.55 cdylib doesn't expose a stable 3-arg unlock entry;
  # the legacy `pdf_document_unlock_with_password` is a phantom
  # (REMOVED) and `pdf_document_authenticate` only has the
  # 8-pointer placeholder shape.  Return false on encrypted docs
  # rather than crash — Java's PdfDocument#authenticate has the
  # same fail-closed contract.
  false
end

#auto_extractor ⇒ AutoExtractor

Convenience accessor: get the configured AutoExtractor for this doc.

Returns:

  • (AutoExtractor)



# File 'lib/pdf_oxide/pdf_document.rb', line 280

def auto_extractor
  @auto_extractor ||= AutoExtractor.new(self)
end

#close ⇒ Object

Free the native handle. Idempotent — calling more than once is a no-op, not a crash. Safe to call from an ensure block.



# File 'lib/pdf_oxide/pdf_document.rb', line 286

def close
  return if @closed

  h = @handle
  @handle = nil
  @closed = true
  # Defuse the finalizer (was @tracker[0] == @handle).
  @tracker[0] = nil if @tracker
  Bindings.pdf_document_free(h) if h && !h.null?
end

#closed? ⇒ Boolean

Returns true after #close.

Returns:

  • (Boolean)

    true after #close.



# File 'lib/pdf_oxide/pdf_document.rb', line 303

def closed?
  @closed
end

#encrypted? ⇒ Boolean

Returns whether this PDF carries an encryption dictionary.

Returns:

  • (Boolean)

    whether this PDF carries an encryption dictionary.



# File 'lib/pdf_oxide/pdf_document.rb', line 123

def encrypted?
  # bool pdf_document_is_encrypted(const PdfDocument *handle) — no err arg.
  # The cdylib silently swallowed the extra err pointer pre-v0.3.55, so
  # encryption-detection failures were never surfaced.
  Bindings.pdf_document_is_encrypted(handle)
end

#extract_structured(page) ⇒ Hash

Extract a structured representation of a single page (#536). Returns the parsed `StructuredPage` JSON as a Hash: `{ "page_index", "page_width", "page_height", "regions" => [ { "kind", "text", "bbox", "spans", "column_index" } ] }`.

Parameters:

  • page (Integer)

    0-based page index.

Returns:

  • (Hash)

    parsed structured page.



# File 'lib/pdf_oxide/pdf_document.rb', line 147

def extract_structured(page)
  validate_page_index(page)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_structured_to_json(handle, page, err)
  raise_for_code(err.read_int32, 'extract_structured')
  json = StringMarshaller.from_c_string(ptr) || ''

  require 'json'
  JSON.parse(json)
end
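
Because the result comes straight from `JSON.parse`, its keys are Strings, not Symbols. The sketch below walks a Hash of that shape; the sample values and the `kind` names are made up for illustration, and `text_by_column` is a hypothetical helper rather than a pdf_oxide method.

```ruby
# Illustrative input only: real values come from #extract_structured.
structured = {
  'page_index' => 0,
  'regions' => [
    { 'kind' => 'heading',   'text' => 'Invoice',   'column_index' => 0 },
    { 'kind' => 'paragraph', 'text' => 'Total due', 'column_index' => 1 },
    { 'kind' => 'paragraph', 'text' => 'Bill to',   'column_index' => 0 }
  ]
}

# Hypothetical helper: collect region text per column, preserving order.
def text_by_column(structured)
  structured.fetch('regions', [])
            .group_by { |region| region['column_index'] }
            .transform_values { |regions| regions.map { |region| region['text'] } }
end

columns = text_by_column(structured)
columns.each { |index, texts| puts "column #{index}: #{texts.join(' | ')}" }
```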

#extract_text(page_index) ⇒ String

Extract plain text from a single page.

Parameters:

  • page_index (Integer)

    0-based page index.

Returns:

  • (String)

    extracted text (empty for pages with no text layer).



# File 'lib/pdf_oxide/pdf_document.rb', line 133

def extract_text(page_index)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_text(handle, page_index, err)
  raise_for_code(err.read_int32, 'extract_text')
  StringMarshaller.from_c_string(ptr) || ''
end

#extract_text_auto(page_index) ⇒ String

Auto-routed extraction for a single page (v0.3.51 #517). Returns native text where present, OCR’d text for scanned regions when the `ocr` feature is available, and gracefully falls back to native + empty/partial text when OCR is not available; it never raises an “OCR unavailable” error on this path.

Parameters:

  • page_index (Integer)

    0-based.

Returns:

  • (String)

    extracted text.



# File 'lib/pdf_oxide/pdf_document.rb', line 165

def extract_text_auto(page_index)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_text_auto(handle, page_index, err)
  raise_for_code(err.read_int32, 'extract_text_auto')
  StringMarshaller.from_c_string(ptr) || ''
end

#form_fields ⇒ Array<Hash>

Returns AcroForm fields as an array of `{name:, value:, type:, page:}` hashes. v0.3.55 limitation: per-field `page` is -1 because pdf_oxide’s form extractor doesn’t yet surface per-field page placement; each field is identified by `name`. When the cdylib build lacks the form-extract accessor, returns `[]` rather than raising, since the simple-PDF case is “no form fields”.

Returns:

  • (Array<Hash>)

    AcroForm fields as an array of `{name:, value:, type:, page:}` hashes; `[]` when the cdylib build lacks the form-extract accessor.



# File 'lib/pdf_oxide/pdf_document.rb', line 215

def form_fields
  return [] unless Bindings.respond_to?(:pdf_document_get_form_fields)

  err = ::FFI::MemoryPointer.new(:int32)
  ptr = begin
    Bindings.pdf_document_get_form_fields(handle, err)
  rescue ::ArgumentError
    # Phantom 8-pointer skeleton — graceful empty.
    return []
  end
  raise_for_code(err.read_int32, 'form_fields')
  return [] if ptr.nil? || ptr.null?

  json = StringMarshaller.from_c_string(ptr) || ''
  return [] if json.empty?

  require 'json'
  arr = JSON.parse(json)
  Array(arr).map do |f|
    {
      name: f['name'],
      value: f['value'],
      type: f['type'],
      page: f.fetch('page', -1)
    }
  end
rescue JSON::ParserError
  []
end
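
Since fields are identified by `name`, a caller will often want a name-to-value lookup. A minimal sketch over data in the documented shape follows; the sample field names, values, and `type` strings are invented for illustration.

```ruby
# Illustrative data only: real hashes come from #form_fields.
fields = [
  { name: 'customer', value: 'ACME', type: 'text',     page: -1 },
  { name: 'agree',    value: nil,    type: 'checkbox', page: -1 }
]

# Index values by field name.
values_by_name = fields.to_h { |field| [field[:name], field[:value]] }

# Fields whose page placement is unknown (-1 in v0.3.55).
unplaced = fields.count { |field| field[:page] == -1 }

puts values_by_name['customer']
puts unplaced
```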

#handle ⇒ FFI::Pointer

Returns raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

Returns:

  • (FFI::Pointer)

    raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

Raises:

  • (InvalidStateError)

    if the document has been closed.



# File 'lib/pdf_oxide/pdf_document.rb', line 82

def handle
  raise InvalidStateError, 'PdfDocument has been closed' if @closed || @handle.nil?

  @handle
end

#open? ⇒ Boolean

Returns true if #close has not been called.

Returns:

  • (Boolean)

    true if #close has not been called.



# File 'lib/pdf_oxide/pdf_document.rb', line 298

def open?
  !@closed
end

#page(index) ⇒ PdfPage

Returns a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.

Returns:

  • (PdfPage)

    a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.



# File 'lib/pdf_oxide/pdf_document.rb', line 267

def page(index)
  validate_page_index(index)
  PdfPage.new(self, index)
end

#page_count ⇒ Integer

Returns number of pages.

Returns:

  • (Integer)

    number of pages.



# File 'lib/pdf_oxide/pdf_document.rb', line 105

def page_count
  err = ::FFI::MemoryPointer.new(:int32)
  n = Bindings.pdf_document_get_page_count(handle, err)
  raise_for_code(err.read_int32, 'page_count')
  n
end

#pages ⇒ Array<PdfPage>

Returns every page in the document (eager).

Returns:

  • (Array<PdfPage>)

    every page in the document (eager).



# File 'lib/pdf_oxide/pdf_document.rb', line 273

def pages
  n = page_count
  Array.new(n) { |i| PdfPage.new(self, i) }
end
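
`#pages` builds the whole array up front. Where only the first few pages are needed, a caller could build page views on demand from `#page_count` and `#page` instead. The sketch below shows the idea with a hypothetical `FakeDoc` stand-in, so it runs without the native library; `each_page_lazily` is likewise an illustrative name.

```ruby
# Hypothetical stand-in exposing the two methods the sketch relies on.
FakeDoc = Struct.new(:page_count) do
  def page(index)
    "page-#{index}"
  end
end

# Lazily map page indexes to page views; nothing is built until consumed.
def each_page_lazily(doc)
  (0...doc.page_count).lazy.map { |index| doc.page(index) }
end

doc = FakeDoc.new(1000)
first_two = each_page_lazily(doc).first(2)
p first_two
```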

#pdf_version ⇒ String

Returns PDF version string (e.g. “1.7”).

Returns:

  • (String)

    PDF version string (e.g. “1.7”).



# File 'lib/pdf_oxide/pdf_document.rb', line 113

def pdf_version
  maj = ::FFI::MemoryPointer.new(:uint8)
  min = ::FFI::MemoryPointer.new(:uint8)
  Bindings.pdf_document_get_version(handle, maj, min)
  "#{maj.read_uint8}.#{min.read_uint8}"
rescue ::FFI::NotFoundError
  'unknown'
end

#render(page_index, dpi: 150) ⇒ String

Render a single page to PNG bytes at the supplied DPI.

Parameters:

  • page_index (Integer)
  • dpi (Integer) (defaults to: 150)

    resolution (default 150).

Returns:

  • (String)

    PNG-encoded image bytes (BINARY).

Raises:

  • (InternalError)

    if the native renderer returns a null image pointer.



# File 'lib/pdf_oxide/pdf_document.rb', line 249

def render(page_index, dpi: 150)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  img_ptr = Bindings.pdf_render_page_zoom(handle, page_index, dpi.to_f / 72.0, 0, err)
  raise_for_code(err.read_int32, 'render')
  raise InternalError, 'render returned null' if img_ptr.nil? || img_ptr.null?

  # Read length + bytes via rendered image helpers.  The cdylib
  # exposes `pdf_oxide_rendered_image_*` accessors; the simpler
  # path is the byte-buffer accessor introduced for v0.3.5x.
  bytes = read_rendered_image_bytes(img_ptr)
  Bindings.pdf_rendered_image_free(img_ptr) if Bindings.respond_to?(:pdf_rendered_image_free)
  bytes.force_encoding(Encoding::BINARY)
end
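
The listing converts DPI to a zoom factor by dividing by 72, the number of PDF points per inch. The sketch below reproduces that arithmetic and estimates the resulting image size. `zoom_for_dpi` and `pixel_size` are hypothetical helpers, and the renderer's own rounding is not shown on this page, so treat the pixel dimensions as an approximation.

```ruby
POINTS_PER_INCH = 72.0 # default PDF user-space unit: 1/72 inch

# Same conversion as `dpi.to_f / 72.0` in #render.
def zoom_for_dpi(dpi)
  dpi.to_f / POINTS_PER_INCH
end

# Estimated pixel dimensions for a page measured in points.
def pixel_size(width_pt, height_pt, dpi)
  zoom = zoom_for_dpi(dpi)
  [(width_pt * zoom).round, (height_pt * zoom).round]
end

puts zoom_for_dpi(144)         # prints 2.0
p pixel_size(612, 792, 150)    # US Letter at the default 150 DPI: [1275, 1650]
```

The returned PNG bytes are BINARY-encoded, so writing them with `File.binwrite` avoids any encoding conversion.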

#search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>

Search this document.

Parameters:

  • query (String)

    literal text (or regex when `regex: true`).

  • case_sensitive (Boolean) (defaults to: false)
  • regex (Boolean) (defaults to: false)

    interpret query as a regex.

Returns:

  • (Array<Hash>)

    each match has keys :page, :text, :bbox (where :bbox is a Hash with :x, :y, :width, :height).

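Raises:

  • (PdfOxide::ArgumentError)

    if query is nil.

  • (UnsupportedFeatureError)

    if `regex: true` and the cdylib build lacks `pdf_document_search_regex`.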



# File 'lib/pdf_oxide/pdf_document.rb', line 193

def search(query, case_sensitive: false, regex: false)
  raise ::PdfOxide::ArgumentError, 'query cannot be nil' if query.nil?
  raise UnsupportedFeatureError, 'regex search not supported by this cdylib build' \
    if regex && !Bindings.respond_to?(:pdf_document_search_regex)

  err = ::FFI::MemoryPointer.new(:int32)
  query_utf8 = StringMarshaller.to_utf8(query)
  results = if regex
              Bindings.pdf_document_search_regex(handle, query_utf8, case_sensitive, err)
            else
              Bindings.pdf_document_search_all(handle, query_utf8, case_sensitive, err)
            end
  raise_for_code(err.read_int32, 'search')
  parse_search_results(results)
end
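
Each match is a Hash with `:page`, `:text`, and `:bbox` keys, so results are easy to aggregate. A minimal sketch over hand-written matches in that shape follows; the coordinates and text are invented, and the private `parse_search_results` helper is not reproduced here.

```ruby
# Illustrative matches only: real ones come from #search.
matches = [
  { page: 0, text: 'Total', bbox: { x: 72.0,  y: 700.0, width: 30.0, height: 12.0 } },
  { page: 2, text: 'Total', bbox: { x: 72.0,  y: 120.0, width: 30.0, height: 12.0 } },
  { page: 2, text: 'total', bbox: { x: 300.0, y: 120.0, width: 28.0, height: 12.0 } }
]

# Number of hits on each page.
hits_per_page = matches.group_by { |match| match[:page] }
                       .transform_values(&:size)

# Right-hand edge of each match, derived from :x and :width.
right_edges = matches.map { |match| match[:bbox][:x] + match[:bbox][:width] }

p hits_per_page
p right_edges
```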

#to_html(page_index = nil) ⇒ String

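Convert to HTML. With a `page_index`, converts that page; with no argument, delegates to `MarkdownConverter.to_html(self)` without a page index.

Parameters:

  • page_index (Integer, nil) (defaults to: nil)

    0-based page index, or nil to omit it.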

Returns:

  • (String)

    HTML.



# File 'lib/pdf_oxide/pdf_document.rb', line 183

def to_html(page_index = nil)
  page_index.nil? ? MarkdownConverter.to_html(self) : MarkdownConverter.to_html(self, page_index)
end

#to_markdown(page_index = nil) ⇒ String

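Convert to Markdown. With a `page_index`, converts that page; with no argument, delegates to `MarkdownConverter.to_markdown(self)` without a page index.

Parameters:

  • page_index (Integer, nil) (defaults to: nil)

    0-based page index, or nil to omit it.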

Returns:

  • (String)

    Markdown.



# File 'lib/pdf_oxide/pdf_document.rb', line 176

def to_markdown(page_index = nil)
  page_index.nil? ? MarkdownConverter.to_markdown(self) : MarkdownConverter.to_markdown(self, page_index)
end