Class: PdfOxide::PdfDocument

Inherits:
  • Object
Defined in:
lib/pdf_oxide/pdf_document.rb

Overview

The primary read-only entry point to a PDF.

Mirrors `fyi.oxide.pdf.PdfDocument`. Lifecycle: a PdfDocument owns native memory and **must be closed** when no longer in use. The idiomatic Ruby pattern is the block form `PdfDocument.open(path) do |doc| … end`, which closes automatically; for parity with the Java `AutoCloseable` contract, an explicit `#close` is also supported and is idempotent (a second call is a no-op, not a crash).

A `Finalizer` backstop frees leaked handles on GC; callers must not rely on it for timely cleanup.

Examples:

block form (recommended)

PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
  puts doc.extract_text(0)
end

explicit close

doc = PdfOxide::PdfDocument.open('invoice.pdf')
begin
  puts doc.extract_text(0)
ensure
  doc.close
end

Constructor Details

#initialize(source, password: nil) ⇒ PdfDocument

Open a PDF. See .open for the block-form factory.



# File 'lib/pdf_oxide/pdf_document.rb', line 65

def initialize(source, password: nil)
  raise ::PdfOxide::ArgumentError, 'source cannot be nil' if source.nil?

  @path, @handle = open_native(source)
  @closed = false
  # Mutable tracker lets an explicit `#close` defuse the finalizer
  # so the GC pass doesn't double-free.
  @tracker = [@handle]
  ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))

  authenticate(password) if password
end
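
The `open_native` helper called above is not shown on this page. As a minimal sketch of how a single `source` String could be classified as a path or as raw bytes, assuming the `%PDF-` magic check on BINARY-encoded input described for `.open`, consider the following. The helper name `pdf_bytes?` is hypothetical and is not part of PdfOxide.

```ruby
# Hypothetical helper (not a PdfOxide API): decides whether a String
# holds raw PDF bytes or a filesystem path.
def pdf_bytes?(source)
  # Only BINARY (ASCII-8BIT) strings are treated as byte payloads, so a
  # UTF-8 path that happens to start with "%PDF-" is still a path.
  source.encoding == Encoding::BINARY && source.start_with?('%PDF-')
end

pdf_bytes?('%PDF-1.7 ...'.b)  # => true
pdf_bytes?('invoice.pdf')     # => false
pdf_bytes?('%PDF-notes.txt')  # => false (UTF-8 encoded, so treated as a path)
```

Under this assumption, callers who read a file themselves would pass `File.binread(path)`, which returns a BINARY-encoded String.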

Instance Attribute Details

#path ⇒ String (readonly)

Returns the absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).

Returns:

  • (String)

    the absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).



# File 'lib/pdf_oxide/pdf_document.rb', line 31

def path
  @path
end

Class Method Details

.extract_text(source, page: 0) ⇒ String

One-shot: open + extract page text + close.

Parameters:

  • source (String)

    path or bytes (see .open).

  • page (Integer) (defaults to: 0)

    0-based page index (default 0).

Returns:

  • (String)

    extracted text.



# File 'lib/pdf_oxide/pdf_document.rb', line 58

def self.extract_text(source, page: 0)
  # rubocop:disable Security/Open — PdfDocument.open opens a PDF, not a process.
  open(source) { |d| d.extract_text(page) }
  # rubocop:enable Security/Open
end

.finalizer(tracker) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Finalizer for GC cleanup. The mutable tracker lets explicit `#close` zero out the handle so a follow-up GC pass doesn’t double-free (the cdylib’s `pdf_document_free` is not idempotent on the same pointer).



# File 'lib/pdf_oxide/pdf_document.rb', line 295

def self.finalizer(tracker)
  proc do
    handle = tracker[0]
    if handle && !handle.null?
      Bindings.pdf_document_free(handle)
      tracker[0] = nil
    end
  end
end
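
The tracker pattern can be tried in isolation. The sketch below is an illustration only: `FakeNative` and `TrackedResource` are hypothetical stand-ins, and the free call is replaced by a counter so that a double-free would show up as a count above 1.

```ruby
# Stand-in for the native library: counts how often each handle is freed.
module FakeNative
  @frees = Hash.new(0)

  class << self
    attr_reader :frees

    def free(handle)
      @frees[handle] += 1
    end
  end
end

class TrackedResource
  def initialize(handle)
    @handle = handle
    @closed = false
    # The finalizer proc captures this array, not `self`, so the object
    # stays collectable while `close` can still clear slot 0.
    @tracker = [handle]
    ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))
  end

  def self.finalizer(tracker)
    proc do
      handle = tracker[0]
      if handle
        FakeNative.free(handle)
        tracker[0] = nil
      end
    end
  end

  def close
    return if @closed

    handle = @handle
    @handle = nil
    @closed = true
    @tracker[0] = nil # defuse the finalizer before freeing
    FakeNative.free(handle) if handle
  end
end

res = TrackedResource.new(:handle_a)
res.close
res.close # second call is a no-op
FakeNative.frees[:handle_a] # => 1
```

Because the finalizer only ever sees the array, clearing `tracker[0]` in `close` is enough to make a later GC pass a no-op.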

.open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object

Open a PDF from disk or in-memory bytes.

Parameters:

  • source (String)

    either a filesystem path or raw PDF bytes (auto-detected via `%PDF-` magic on BINARY-encoded input).

  • password (String, nil) (defaults to: nil)

    optional password for encrypted PDFs.

Yields:

  • (PdfDocument)

    the opened document; it is closed when the block returns or raises.

Returns:

  • (PdfDocument, Object)

    the document, or the block’s return value.

Raises:



# File 'lib/pdf_oxide/pdf_document.rb', line 43

def self.open(source, password: nil, &block)
  doc = new(source, password: password)
  return doc unless block_given?

  begin
    yield doc
  ensure
    doc.close
  end
end
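
The two return modes of this factory are easy to confuse, so here is a small self-contained sketch of the same open/yield/ensure shape. `Session` is a hypothetical stand-in class, not a PdfOxide type.

```ruby
# Hypothetical stand-in with the same open/close shape as the factory above.
class Session
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  def self.open
    session = new
    return session unless block_given?

    begin
      yield session
    ensure
      session.close # runs on normal return and when the block raises
    end
  end
end

# Block form: the block's value is returned and the session is closed.
result = Session.open { |_s| :block_value } # => :block_value

# The ensure clause still closes the session when the block raises.
leaked = nil
begin
  Session.open do |s|
    leaked = s
    raise 'boom'
  end
rescue RuntimeError
  leaked.closed # => true
end

# No block: the caller owns the object and must close it.
manual = Session.open
manual.close
```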

Instance Method Details

#authenticate(password) ⇒ Boolean

Authenticate against this document’s encryption.

Parameters:

  • password (String)

Returns:

  • (Boolean)

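    true when the document is unencrypted. In v0.3.55 the cdylib exposes no stable unlock entry point, so this method returns false for every encrypted document, whatever the password.

Raises:

  • (PdfOxide::ArgumentError)

    if `password` is nil.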



# File 'lib/pdf_oxide/pdf_document.rb', line 91

def authenticate(password)
  raise ::PdfOxide::ArgumentError, 'password cannot be nil' if password.nil?
  return true unless encrypted?

  # v0.3.55 cdylib doesn't expose a stable 3-arg unlock entry;
  # the legacy `pdf_document_unlock_with_password` is a phantom
  # (REMOVED) and `pdf_document_authenticate` only has the
  # 8-pointer placeholder shape.  Return false on encrypted docs
  # rather than crash — Java's PdfDocument#authenticate has the
  # same fail-closed contract.
  false
end

#auto_extractor ⇒ AutoExtractor

Convenience accessor: returns the AutoExtractor for this doc, created on first use and memoized.

Returns:

  • (AutoExtractor)



# File 'lib/pdf_oxide/pdf_document.rb', line 263

def auto_extractor
  @auto_extractor ||= AutoExtractor.new(self)
end

#close ⇒ Object

Free the native handle. Idempotent — calling more than once is a no-op, not a crash. Safe to call from an ensure block.



# File 'lib/pdf_oxide/pdf_document.rb', line 269

def close
  return if @closed

  h = @handle
  @handle = nil
  @closed = true
  # Defuse the finalizer (was @tracker[0] == @handle).
  @tracker[0] = nil if @tracker
  Bindings.pdf_document_free(h) if h && !h.null?
end

#closed? ⇒ Boolean

Returns true after #close.

Returns:

  • (Boolean)

    true after #close.



# File 'lib/pdf_oxide/pdf_document.rb', line 286

def closed?
  @closed
end

#encrypted? ⇒ Boolean

Returns whether this PDF carries an encryption dictionary.

Returns:

  • (Boolean)

    whether this PDF carries an encryption dictionary.



# File 'lib/pdf_oxide/pdf_document.rb', line 123

def encrypted?
  # bool pdf_document_is_encrypted(const PdfDocument *handle) — no err arg.
  # The cdylib silently swallowed the extra err pointer pre-v0.3.55, so
  # encryption-detection failures were never surfaced.
  Bindings.pdf_document_is_encrypted(handle)
end

#extract_text(page_index) ⇒ String

Extract plain text from a single page.

Parameters:

  • page_index (Integer)

    0-based page index.

Returns:

  • (String)

    extracted text (empty for pages with no text layer).



# File 'lib/pdf_oxide/pdf_document.rb', line 133

def extract_text(page_index)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_text(handle, page_index, err)
  raise_for_code(err.read_int32, 'extract_text')
  StringMarshaller.from_c_string(ptr) || ''
end
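
Since `#extract_text` works one page at a time, whole-document extraction needs a loop over `#page_count`. A minimal sketch follows; the helper name `extract_all_text` and the form-feed separator are assumptions made for illustration, not part of PdfOxide.

```ruby
# Hypothetical helper: works with any object that responds to
# `page_count` and `extract_text(index)`, such as an open PdfDocument.
def extract_all_text(doc, separator: "\f")
  # Page indexes are 0-based, so iterate 0...page_count (exclusive end).
  (0...doc.page_count).map { |i| doc.extract_text(i) }.join(separator)
end

# Intended use with the block form, so the handle is released afterwards:
#   text = PdfOxide::PdfDocument.open('invoice.pdf') { |doc| extract_all_text(doc) }
```

Pages without a text layer contribute an empty string, so the separators still mark every page boundary.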

#extract_text_auto(page_index) ⇒ String

Auto-routed extraction for a single page (v0.3.51 #517). Returns native text where present and OCR’d text for scanned regions when the `ocr` feature is available. When OCR is not available it falls back gracefully to native text, which may be empty or partial, and never raises an “OCR unavailable” error on this path.

Parameters:

  • page_index (Integer)

    0-based.

Returns:

  • (String)

    extracted text.



# File 'lib/pdf_oxide/pdf_document.rb', line 148

def extract_text_auto(page_index)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_text_auto(handle, page_index, err)
  raise_for_code(err.read_int32, 'extract_text_auto')
  StringMarshaller.from_c_string(ptr) || ''
end

#form_fields ⇒ Array<Hash>

Returns AcroForm fields as an array of `name:, value:, type:, page:` hashes. v0.3.55 limitation: per-field `page` is -1 because pdf_oxide’s form extractor doesn’t yet surface per-field page placement; each field is identified by `name`. When the cdylib build lacks the form-extract accessor, returns `[]` rather than raising, which matches the simple-PDF case of “no form fields”.

Returns:

  • (Array<Hash>)

    AcroForm fields as an array of `name:, value:, type:, page:` hashes. v0.3.55 limitation: per-field `page` is -1 because pdf_oxide’s form extractor doesn’t yet surface per-field page placement; each field is identified by `name`. When the cdylib build lacks the form-extract accessor, returns `[]` rather than raising, which matches the simple-PDF case of “no form fields”.



# File 'lib/pdf_oxide/pdf_document.rb', line 198

def form_fields
  return [] unless Bindings.respond_to?(:pdf_document_get_form_fields)

  err = ::FFI::MemoryPointer.new(:int32)
  ptr = begin
    Bindings.pdf_document_get_form_fields(handle, err)
  rescue ::ArgumentError
    # Phantom 8-pointer skeleton — graceful empty.
    return []
  end
  raise_for_code(err.read_int32, 'form_fields')
  return [] if ptr.nil? || ptr.null?

  json = StringMarshaller.from_c_string(ptr) || ''
  return [] if json.empty?

  require 'json'
  arr = JSON.parse(json)
  Array(arr).map do |f|
    {
      name: f['name'],
      value: f['value'],
      type: f['type'],
      page: f.fetch('page', -1)
    }
  end
rescue JSON::ParserError
  []
end
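
The JSON-to-Hash step in the listing above can be exercised without the native library. The sketch below repeats that mapping in a standalone helper; `normalize_fields` and the sample payload are hypothetical, since the real cdylib output is not shown on this page.

```ruby
require 'json'

# Hypothetical helper mirroring the mapping step of #form_fields.
def normalize_fields(json)
  return [] if json.nil? || json.empty?

  Array(JSON.parse(json)).map do |f|
    # `page` falls back to -1 when the payload carries no placement.
    { name: f['name'], value: f['value'], type: f['type'], page: f.fetch('page', -1) }
  end
rescue JSON::ParserError
  # Malformed payloads degrade to "no fields" instead of raising.
  []
end

payload = '[{"name":"email","value":"a@b.example","type":"text"}]'
normalize_fields(payload)
# => [{ name: "email", value: "a@b.example", type: "text", page: -1 }]
normalize_fields('not json') # => []
```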

#handle ⇒ FFI::Pointer

Returns raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

Returns:

  • (FFI::Pointer)

    raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

Raises:

  • (InvalidStateError)

    if the document has been closed.



# File 'lib/pdf_oxide/pdf_document.rb', line 82

def handle
  raise InvalidStateError, 'PdfDocument has been closed' if @closed || @handle.nil?

  @handle
end

#open? ⇒ Boolean

Returns true if #close has not been called.

Returns:

  • (Boolean)

    true if #close has not been called.



# File 'lib/pdf_oxide/pdf_document.rb', line 281

def open?
  !@closed
end

#page(index) ⇒ PdfPage

Returns a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.

Returns:

  • (PdfPage)

    a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.



# File 'lib/pdf_oxide/pdf_document.rb', line 250

def page(index)
  validate_page_index(index)
  PdfPage.new(self, index)
end
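
`PdfPage` itself is not shown on this page. As a rough sketch of how a borrowed view could surface `InvalidStateError` after the parent closes, assuming the view reaches the native handle through the document on every access, consider the stand-in classes below. `Doc`, `PageView` and `describe` are hypothetical names.

```ruby
class InvalidStateError < StandardError; end

# Hypothetical stand-in for the document: guards its handle like #handle does.
class Doc
  def initialize
    @handle = :native_handle
  end

  def handle
    raise InvalidStateError, 'document has been closed' if @handle.nil?

    @handle
  end

  def close
    @handle = nil
  end

  def page(index)
    PageView.new(self, index)
  end
end

# Holds only the parent and an index; every access goes back through
# `doc.handle`, so a closed parent is detected at use time.
class PageView
  def initialize(doc, index)
    @doc = doc
    @index = index
  end

  def describe
    "page #{@index} via #{@doc.handle.inspect}"
  end
end

doc = Doc.new
view = doc.page(0)
view.describe # => "page 0 via :native_handle"
doc.close
begin
  view.describe
rescue InvalidStateError => e
  e.message # => "document has been closed"
end
```

The practical consequence for callers is to finish all page work inside the `open` block rather than returning page objects from it.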

#page_count ⇒ Integer

Returns number of pages.

Returns:

  • (Integer)

    number of pages.



# File 'lib/pdf_oxide/pdf_document.rb', line 105

def page_count
  err = ::FFI::MemoryPointer.new(:int32)
  n = Bindings.pdf_document_get_page_count(handle, err)
  raise_for_code(err.read_int32, 'page_count')
  n
end

#pages ⇒ Array<PdfPage>

Returns every page in the document (eager).

Returns:

  • (Array<PdfPage>)

    every page in the document (eager).



# File 'lib/pdf_oxide/pdf_document.rb', line 256

def pages
  n = page_count
  Array.new(n) { |i| PdfPage.new(self, i) }
end

#pdf_version ⇒ String

Returns PDF version string (e.g. “1.7”).

Returns:

  • (String)

    PDF version string (e.g. “1.7”).



# File 'lib/pdf_oxide/pdf_document.rb', line 113

def pdf_version
  maj = ::FFI::MemoryPointer.new(:uint8)
  min = ::FFI::MemoryPointer.new(:uint8)
  Bindings.pdf_document_get_version(handle, maj, min)
  "#{maj.read_uint8}.#{min.read_uint8}"
rescue ::FFI::NotFoundError
  'unknown'
end

#render(page_index, dpi: 150) ⇒ String

Render a single page to PNG bytes at the supplied DPI.

Parameters:

  • page_index (Integer)
  • dpi (Integer) (defaults to: 150)

    resolution (default 150).

Returns:

  • (String)

    PNG-encoded image bytes (BINARY).

Raises:

  • (InternalError)

    if the native renderer returns a null image pointer.



# File 'lib/pdf_oxide/pdf_document.rb', line 232

def render(page_index, dpi: 150)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  img_ptr = Bindings.pdf_render_page_zoom(handle, page_index, dpi.to_f / 72.0, 0, err)
  raise_for_code(err.read_int32, 'render')
  raise InternalError, 'render returned null' if img_ptr.nil? || img_ptr.null?

  # Read length + bytes via rendered image helpers.  The cdylib
  # exposes `pdf_oxide_rendered_image_*` accessors; the simpler
  # path is the byte-buffer accessor introduced for v0.3.5x.
  bytes = read_rendered_image_bytes(img_ptr)
  Bindings.pdf_rendered_image_free(img_ptr) if Bindings.respond_to?(:pdf_rendered_image_free)
  bytes.force_encoding(Encoding::BINARY)
end
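
The listing converts DPI to a zoom factor by dividing by 72, the number of PDF points per inch. The arithmetic can be sketched as follows. The helper names are hypothetical, and the pixel-size estimate assumes the renderer simply scales the page box and rounds, which this page does not confirm.

```ruby
# PDF user space is 72 points per inch, so zoom = dpi / 72.
def zoom_for_dpi(dpi)
  dpi.to_f / 72.0
end

# Estimated output size in pixels for a page measured in points
# (assumption: plain scaling followed by rounding).
def estimated_pixels(width_pt, height_pt, dpi)
  zoom = zoom_for_dpi(dpi)
  [(width_pt * zoom).round, (height_pt * zoom).round]
end

zoom_for_dpi(72)                # => 1.0
zoom_for_dpi(144)               # => 2.0
estimated_pixels(612, 792, 150) # => [1275, 1650] for US Letter at 150 DPI
```

Because `#render` returns BINARY-encoded bytes, `File.binwrite('page.png', bytes)` is the matching way to save them.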

#search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>

Search this document.

Parameters:

  • query (String)

    literal text (or regex when `regex: true`).

  • case_sensitive (Boolean) (defaults to: false)
  • regex (Boolean) (defaults to: false)

    interpret query as a regex.

Returns:

  • (Array<Hash>)

    each match has keys :page, :text, :bbox (where :bbox is a Hash with :x, :y, :width, :height).

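Raises:

  • (PdfOxide::ArgumentError)

    if `query` is nil.

  • (UnsupportedFeatureError)

    if `regex: true` is requested but the cdylib build lacks regex search.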



# File 'lib/pdf_oxide/pdf_document.rb', line 176

def search(query, case_sensitive: false, regex: false)
  raise ::PdfOxide::ArgumentError, 'query cannot be nil' if query.nil?
  raise UnsupportedFeatureError, 'regex search not supported by this cdylib build' \
    if regex && !Bindings.respond_to?(:pdf_document_search_regex)

  err = ::FFI::MemoryPointer.new(:int32)
  query_utf8 = StringMarshaller.to_utf8(query)
  results = if regex
              Bindings.pdf_document_search_regex(handle, query_utf8, case_sensitive, err)
            else
              Bindings.pdf_document_search_all(handle, query_utf8, case_sensitive, err)
            end
  raise_for_code(err.read_int32, 'search')
  parse_search_results(results)
end
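
Each match is a plain Hash, so results can be post-processed with ordinary Enumerable calls. The sketch below counts hits per page over sample data in the documented `:page`, `:text`, `:bbox` shape. The values are made up for illustration and `hits_per_page` is a hypothetical helper.

```ruby
# Sample data in the documented shape; values are invented for illustration.
matches = [
  { page: 0, text: 'Total', bbox: { x: 72.0, y: 700.0, width: 30.0, height: 12.0 } },
  { page: 2, text: 'total', bbox: { x: 90.0, y: 120.0, width: 28.0, height: 12.0 } },
  { page: 0, text: 'TOTAL', bbox: { x: 72.0, y: 650.0, width: 34.0, height: 12.0 } }
]

# Hypothetical helper: number of matches on each 0-based page index.
def hits_per_page(matches)
  matches.group_by { |m| m[:page] }.transform_values(&:size)
end

hits_per_page(matches) # => { 0 => 2, 2 => 1 }
```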

#to_html(page_index = nil) ⇒ String

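Convert one page to HTML. When `page_index` is nil, `MarkdownConverter.to_html` is called without a page index.

Parameters:

  • page_index (Integer, nil) (defaults to: nil)

    page to convert; when nil, no index is passed to the converter.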

Returns:

  • (String)

    HTML.



# File 'lib/pdf_oxide/pdf_document.rb', line 166

def to_html(page_index = nil)
  page_index.nil? ? MarkdownConverter.to_html(self) : MarkdownConverter.to_html(self, page_index)
end

#to_markdown(page_index = nil) ⇒ String

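Convert one page to Markdown. When `page_index` is nil, `MarkdownConverter.to_markdown` is called without a page index.

Parameters:

  • page_index (Integer, nil) (defaults to: nil)

    page to convert; when nil, no index is passed to the converter.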

Returns:

  • (String)

    Markdown.



# File 'lib/pdf_oxide/pdf_document.rb', line 159

def to_markdown(page_index = nil)
  page_index.nil? ? MarkdownConverter.to_markdown(self) : MarkdownConverter.to_markdown(self, page_index)
end