Class: PdfOxide::PdfDocument

Inherits:

Object

Defined in: lib/pdf_oxide/pdf_document.rb

Overview

The primary read-only entry point to a PDF.

Mirrors `fyi.oxide.pdf.PdfDocument`. Lifecycle: a PdfDocument owns native memory and **must be closed** when no longer in use. The idiomatic Ruby pattern is the block form `PdfDocument.open(path) do |doc| … end`, which closes automatically. For parity with the Java `AutoCloseable` contract, an explicit `#close` is also supported and is idempotent (a second call is a no-op, not a crash).

A `Finalizer` backstop frees leaked handles on GC; callers must not rely on it for timely cleanup.

Examples:

block form (recommended)

PdfOxide::PdfDocument.open('invoice.pdf') do |doc|
  puts doc.extract_text(0)
end

explicit close

doc = PdfOxide::PdfDocument.open('invoice.pdf')
begin
  puts doc.extract_text(0)
ensure
  doc.close
end
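
The lifecycle contract above can be sketched without the native library. The `SketchDocument` class below is a hypothetical stand-in, not part of PdfOxide; it only imitates the block-form factory and the idempotent `#close`, and shows that the document is closed even when the block raises.

```ruby
# Stand-in for a native-backed document; NOT part of PdfOxide.
class SketchDocument
  attr_reader :free_calls

  # Same shape as the block-form factory: yield, then always close.
  def self.open(source)
    doc = new(source)
    return doc unless block_given?

    begin
      yield doc
    ensure
      doc.close
    end
  end

  def initialize(source)
    @source = source
    @closed = false
    @free_calls = 0
  end

  def close
    return if @closed

    @closed = true
    @free_calls += 1 # stands in for the single native free
  end

  def closed?
    @closed
  end
end

leaked = nil
begin
  SketchDocument.open('invoice.pdf') do |doc|
    leaked = doc
    raise 'extraction failed'
  end
rescue RuntimeError
  # The ensure clause has already closed the document by this point.
end

leaked.close           # second close is a no-op
puts leaked.closed?    # true
puts leaked.free_calls # 1
```

Keeping a reference to the document outside the block, as `leaked` does here, is only for demonstration; in real code the point of the block form is that nothing outlives it.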

Instance Attribute Summary

#path ⇒ String readonly

Absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).

Class Method Summary

.extract_text(source, page: 0) ⇒ String

One-shot: open + extract page text + close.
.finalizer(tracker) ⇒ Object private

Finalizer for GC cleanup.
.open(source, password: nil) {|PdfDocument| ... } ⇒ PdfDocument, Object

Open a PDF from disk or in-memory bytes.

Instance Method Summary

#authenticate(password) ⇒ Boolean

Authenticate against this document’s encryption.
#auto_extractor ⇒ AutoExtractor

Convenience accessor: get the configured AutoExtractor for this doc.
#close ⇒ Object

Free the native handle.
#closed? ⇒ Boolean

True after #close.
#encrypted? ⇒ Boolean

Whether this PDF carries an encryption dictionary.
#extract_structured(page) ⇒ Hash

Extract a structured representation of a single page (#536).
#extract_text(page_index) ⇒ String

Extract plain text from a single page.
#extract_text_auto(page_index) ⇒ String

Auto-routed extraction for a single page (v0.3.51 #517).
#form_fields ⇒ Array<Hash>

AcroForm fields as an array of `{name:, value:, type:, page:}` hashes.
#handle ⇒ FFI::Pointer

Raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.
#initialize(source, password: nil) ⇒ PdfDocument constructor

Open a PDF.
#open? ⇒ Boolean

True if #close has not been called.
#page(index) ⇒ PdfPage

A lightweight view of the page at `index`.
#page_count ⇒ Integer

Number of pages.
#pages ⇒ Array<PdfPage>

Every page in the document (eager).
#pdf_version ⇒ String

PDF version string (e.g. “1.7”).
#render(page_index, dpi: 150) ⇒ String

Render a single page to PNG bytes at the supplied DPI.
#search(query, case_sensitive: false, regex: false) ⇒ Array<Hash>

Search this document.
#to_html(page_index = nil) ⇒ String

Convert one page to HTML (with no `page_index`, defers to `MarkdownConverter.to_html(self)`).
#to_markdown(page_index = nil) ⇒ String

Convert one page to Markdown (with no `page_index`, defers to `MarkdownConverter.to_markdown(self)`).

Constructor Details

#initialize(source, password: nil) ⇒ `PdfDocument`

Open a PDF. See open for the block-form factory.

Raises:

(::PdfOxide::ArgumentError) —

`source` is nil.

# File 'lib/pdf_oxide/pdf_document.rb', line 65

def initialize(source, password: nil)
  raise ::PdfOxide::ArgumentError, 'source cannot be nil' if source.nil?

  @path, @handle = open_native(source)
  @closed = false
  # Mutable tracker lets an explicit `#close` defuse the finalizer
  # so the GC pass doesn't double-free.
  @tracker = [@handle]
  ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))

  authenticate(password) if password
end

Instance Attribute Details

#path ⇒ `String` (readonly)

Returns the absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).

Returns:

(String) —

absolute path the document was opened from (or a synthetic `<in-memory>` token for byte-opened docs).




# File 'lib/pdf_oxide/pdf_document.rb', line 31

def path
  @path
end

Class Method Details

.extract_text(source, page: 0) ⇒ `String`

One-shot: open + extract page text + close.

Parameters:

source (String) —

path or bytes (see .open).
page (Integer) (defaults to: 0) —

0-based page index (default 0).

Returns:

(String) —

extracted text.

# File 'lib/pdf_oxide/pdf_document.rb', line 58

def self.extract_text(source, page: 0)
  # rubocop:disable Security/Open — PdfDocument.open opens a PDF, not a process.
  open(source) { |d| d.extract_text(page) }
  # rubocop:enable Security/Open
end

.finalizer(tracker) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Finalizer for GC cleanup. The mutable tracker lets explicit `#close` zero out the handle so a follow-up GC pass doesn’t double-free (the cdylib’s `pdf_document_free` is not idempotent on the same pointer).

# File 'lib/pdf_oxide/pdf_document.rb', line 312

def self.finalizer(tracker)
  proc do
    handle = tracker[0]
    if handle && !handle.null?
      Bindings.pdf_document_free(handle)
      tracker[0] = nil
    end
  end
end
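
The tracker idea can be exercised in isolation. The sketch below is a minimal stand-in under stated assumptions: `FakeBindings` and `TrackedResource` are hypothetical names, a Symbol plays the role of the native pointer, and the GC pass is simulated by calling an equivalent finalizer proc by hand. It shows that once `close` has cleared `tracker[0]`, the finalizer finds nothing left to free.

```ruby
# Hypothetical stand-in for the native free function; records each call.
module FakeBindings
  @freed = []
  class << self
    attr_reader :freed

    def free(handle)
      @freed << handle
    end
  end
end

class TrackedResource
  # Built in a class method so the proc closes over the tracker only,
  # never over the instance it is attached to.
  def self.finalizer(tracker)
    proc do
      handle = tracker[0]
      if handle
        FakeBindings.free(handle)
        tracker[0] = nil
      end
    end
  end

  attr_reader :tracker

  def initialize(handle)
    @handle = handle
    @tracker = [handle]
    ObjectSpace.define_finalizer(self, self.class.finalizer(@tracker))
  end

  def close
    return if @handle.nil?

    h = @handle
    @handle = nil
    @tracker[0] = nil # defuse the finalizer before freeing
    FakeBindings.free(h)
  end
end

res = TrackedResource.new(:handle_a)
res.close
# Simulate the later GC pass: the tracker slot is already nil, so no free.
TrackedResource.finalizer(res.tracker).call
p FakeBindings.freed # [:handle_a]
```

A finalizer proc that captured the instance itself would keep that instance reachable, so it would never be collected; building the proc from a class method and sharing only the small array avoids that.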

.open(source, password: nil) {|PdfDocument| ... } ⇒ `PdfDocument`, `Object`

Open a PDF from disk or in-memory bytes.

Parameters:

source (String) —

either a filesystem path or raw PDF bytes (auto-detected via `%PDF-` magic on BINARY-encoded input).
password (String, nil) (defaults to: nil) —

optional password for encrypted PDFs.

Yields:

(PdfDocument) —

block form auto-closes on return.

Returns:

(PdfDocument, Object) —

the document, or the block’s return value.

Raises:

(FileNotFoundError) —

path doesn’t exist.
(ParseError) —

malformed PDF.
(EncryptedError) —

wrong password / authentication failed.

# File 'lib/pdf_oxide/pdf_document.rb', line 43

def self.open(source, password: nil, &block)
  doc = new(source, password: password)
  return doc unless block_given?

  begin
    yield doc
  ensure
    doc.close
  end
end
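
The path-versus-bytes rule described for `source` might look like the sketch below. This is an assumption about the behavior, not the library's code: the real check lives in a private helper (`open_native`) whose source is not shown on this page, and `pdf_bytes?` is a hypothetical name.

```ruby
# Hypothetical helper; illustrates the documented rule only:
# BINARY encoding plus a leading "%PDF-" means "treat as bytes".
def pdf_bytes?(source)
  source.encoding == Encoding::BINARY && source.start_with?('%PDF-')
end

path  = 'invoice.pdf'
bytes = "%PDF-1.7\n...".b # String#b returns a BINARY (ASCII-8BIT) copy
text  = "%PDF-1.7\n..."   # same characters, but UTF-8 encoded

p pdf_bytes?(path)  # false
p pdf_bytes?(bytes) # true
p pdf_bytes?(text)  # false
```

In practice `File.binread` already returns a BINARY-encoded String, so bytes read that way should need no re-encoding before being passed as `source`.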

Instance Method Details

#authenticate(password) ⇒ `Boolean`

Authenticate against this document’s encryption.

Parameters:

password (String)

Returns:

(Boolean) —

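true when the document is unencrypted; false otherwise. As the source below shows, the v0.3.55 binding has no working unlock entry point, so an encrypted document yields false whatever password is supplied.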

Raises:

(::PdfOxide::ArgumentError) —

`password` is nil.

# File 'lib/pdf_oxide/pdf_document.rb', line 91

def authenticate(password)
  raise ::PdfOxide::ArgumentError, 'password cannot be nil' if password.nil?
  return true unless encrypted?

  # v0.3.55 cdylib doesn't expose a stable 3-arg unlock entry;
  # the legacy `pdf_document_unlock_with_password` is a phantom
  # (REMOVED) and `pdf_document_authenticate` only has the
  # 8-pointer placeholder shape.  Return false on encrypted docs
  # rather than crash — Java's PdfDocument#authenticate has the
  # same fail-closed contract.
  false
end

#auto_extractor ⇒ `AutoExtractor`

Convenience accessor: get the configured AutoExtractor for this doc.

Returns:

(AutoExtractor)




# File 'lib/pdf_oxide/pdf_document.rb', line 280

def auto_extractor
  @auto_extractor ||= AutoExtractor.new(self)
end

#close ⇒ `Object`

Free the native handle. Idempotent — calling more than once is a no-op, not a crash. Safe to call from an ensure block.

# File 'lib/pdf_oxide/pdf_document.rb', line 286

def close
  return if @closed

  h = @handle
  @handle = nil
  @closed = true
  # Defuse the finalizer (was @tracker[0] == @handle).
  @tracker[0] = nil if @tracker
  Bindings.pdf_document_free(h) if h && !h.null?
end

#closed? ⇒ `Boolean`

Returns true after #close.

Returns:

(Boolean) —

true after #close.




# File 'lib/pdf_oxide/pdf_document.rb', line 303

def closed?
  @closed
end

#encrypted? ⇒ `Boolean`

Returns whether this PDF carries an encryption dictionary.

Returns:

(Boolean) —

whether this PDF carries an encryption dictionary.

# File 'lib/pdf_oxide/pdf_document.rb', line 123

def encrypted?
  # bool pdf_document_is_encrypted(const PdfDocument *handle) — no err arg.
  # The cdylib silently swallowed the extra err pointer pre-v0.3.55, so
  # encryption-detection failures were never surfaced.
  Bindings.pdf_document_is_encrypted(handle)
end

#extract_structured(page) ⇒ `Hash`

Extract a structured representation of a single page (#536). Returns the parsed `StructuredPage` JSON as a Hash: `{ "page_index", "page_width", "page_height", "regions" => [ { "kind", "text", "bbox", "spans", "column_index" } ] }`.

Parameters:

page (Integer) —

0-based page index.

Returns:

(Hash) —

parsed structured page.

# File 'lib/pdf_oxide/pdf_document.rb', line 147

def extract_structured(page)
  validate_page_index(page)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_structured_to_json(handle, page, err)
  raise_for_code(err.read_int32, 'extract_structured')
  json = StringMarshaller.from_c_string(ptr) || ''

  require 'json'
  JSON.parse(json)
end
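
A short sketch of consuming the returned Hash follows. The payload is invented for illustration: only the key names come from the shape documented above, while the `kind` values, the dimensions and the null `bbox` are placeholders, since this page does not specify them.

```ruby
require 'json'

# Illustrative payload only; values are invented, keys follow the
# documented shape. "bbox" is left null because its layout is not
# described on this page.
sample = <<~JSON
  {
    "page_index": 0,
    "page_width": 612.0,
    "page_height": 792.0,
    "regions": [
      { "kind": "heading", "text": "Invoice", "bbox": null, "spans": [], "column_index": 0 },
      { "kind": "paragraph", "text": "Due in 30 days.", "bbox": null, "spans": [], "column_index": 0 }
    ]
  }
JSON

page = JSON.parse(sample)

# Keys are Strings, not Symbols, because JSON.parse is called without
# symbolize_names in the method above.
by_kind = page.fetch('regions').group_by { |r| r['kind'] }
by_kind.each do |kind, regions|
  puts "#{kind}: #{regions.map { |r| r['text'] }.join(' ')}"
end
```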

#extract_text(page_index) ⇒ `String`

Extract plain text from a single page.

Parameters:

page_index (Integer) —

0-based page index.

Returns:

(String) —

extracted text (empty for pages with no text layer).

# File 'lib/pdf_oxide/pdf_document.rb', line 133

def extract_text(page_index)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_text(handle, page_index, err)
  raise_for_code(err.read_int32, 'extract_text')
  StringMarshaller.from_c_string(ptr) || ''
end

#extract_text_auto(page_index) ⇒ `String`

Auto-routed extraction for a single page (v0.3.51 #517). Returns native text where present and OCR’d text for scanned regions when the `ocr` feature is available. When OCR is not available it falls back to native text only, which may be empty or partial, and never raises an “OCR unavailable” error on this path.

Parameters:

page_index (Integer) —

0-based.

Returns:

(String) —

extracted text.

# File 'lib/pdf_oxide/pdf_document.rb', line 165

def extract_text_auto(page_index)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  ptr = Bindings.pdf_document_extract_text_auto(handle, page_index, err)
  raise_for_code(err.read_int32, 'extract_text_auto')
  StringMarshaller.from_c_string(ptr) || ''
end

#form_fields ⇒ `Array<Hash>`

Returns AcroForm fields as an array of `{name:, value:, type:, page:}` hashes. v0.3.55 limitation: per-field `page` is -1 because pdf_oxide’s form extractor doesn’t yet surface per-field page placement; identify a field by its `name`. When the cdylib build lacks the form-extract accessor, returns `[]` rather than raising, since the simple-PDF case is “no form fields”.

Returns:

(Array<Hash>) —

one `{name:, value:, type:, page:}` hash per AcroForm field, or `[]` when the accessor is unavailable or the returned JSON cannot be parsed.

# File 'lib/pdf_oxide/pdf_document.rb', line 215

def form_fields
  return [] unless Bindings.respond_to?(:pdf_document_get_form_fields)

  err = ::FFI::MemoryPointer.new(:int32)
  ptr = begin
    Bindings.pdf_document_get_form_fields(handle, err)
  rescue ::ArgumentError
    # Phantom 8-pointer skeleton — graceful empty.
    return []
  end
  raise_for_code(err.read_int32, 'form_fields')
  return [] if ptr.nil? || ptr.null?

  json = StringMarshaller.from_c_string(ptr) || ''
  return [] if json.empty?

  require 'json'
  arr = JSON.parse(json)
  Array(arr).map do |f|
    {
      name: f['name'],
      value: f['value'],
      type: f['type'],
      page: f.fetch('page', -1)
    }
  end
rescue JSON::ParserError
  []
end
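
The JSON-to-Hash mapping at the end of the method can be tried on its own. In the sketch below, `normalize_fields` is a hypothetical helper that copies only that mapping step, and the payload is an invented sample; it shows the `-1` default for a missing `page` and the empty-array fallback for unparseable input.

```ruby
require 'json'

# Hypothetical helper mirroring the mapping step of #form_fields.
def normalize_fields(json)
  return [] if json.nil? || json.empty?

  Array(JSON.parse(json)).map do |f|
    { name: f['name'], value: f['value'], type: f['type'], page: f.fetch('page', -1) }
  end
rescue JSON::ParserError
  [] # malformed JSON is treated as "no form fields"
end

payload = '[{"name":"customer","value":"ACME","type":"text"}]' # invented sample
p normalize_fields(payload)
p normalize_fields('not json')
p normalize_fields('')
```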

#handle ⇒ `FFI::Pointer`

Returns raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

Returns:

(FFI::Pointer) —

raw handle for sibling classes (MarkdownConverter, AutoExtractor, PdfValidator, PdfSigner) that need to pass the pointer to their own FFI calls.

Raises:

(InvalidStateError) —

document has been closed.

# File 'lib/pdf_oxide/pdf_document.rb', line 82

def handle
  raise InvalidStateError, 'PdfDocument has been closed' if @closed || @handle.nil?

  @handle
end

#open? ⇒ `Boolean`

Returns true if #close has not been called.

Returns:

(Boolean) —

true if #close has not been called.




# File 'lib/pdf_oxide/pdf_document.rb', line 298

def open?
  !@closed
end

#page(index) ⇒ `PdfPage`

Returns a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.

Returns:

(PdfPage) —

a lightweight view of the page at `index`. The page borrows from this document; using it after the doc closes raises `InvalidStateError`.

# File 'lib/pdf_oxide/pdf_document.rb', line 267

def page(index)
  validate_page_index(index)
  PdfPage.new(self, index)
end
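
One way a borrowed page view can end up raising after the document closes is for every page operation to go through the owner's guarded `#handle`. The sketch below illustrates that idea with stand-in classes; `SketchDoc`, `SketchPage` and `SketchClosedError` are hypothetical, and how `PdfPage` is actually implemented is not shown on this page.

```ruby
# Stand-in classes; names are hypothetical and NOT part of PdfOxide.
class SketchClosedError < StandardError; end

class SketchDoc
  def initialize
    @handle = Object.new # stands in for the FFI pointer
    @closed = false
  end

  # Guarded accessor: refuses to hand out a pointer after close.
  def handle
    raise SketchClosedError, 'document has been closed' if @closed || @handle.nil?

    @handle
  end

  def close
    @handle = nil
    @closed = true
  end
end

class SketchPage
  def initialize(doc, index)
    @doc = doc
    @index = index
  end

  # Every operation asks the owning document for its handle first.
  def text
    @doc.handle
    "text of page #{@index}"
  end
end

doc = SketchDoc.new
page = SketchPage.new(doc, 0)
puts page.text
doc.close
begin
  page.text
rescue SketchClosedError => e
  puts "rejected: #{e.message}"
end
```

The practical consequence for callers is that page objects should not be returned out of an `open` block; extract the data needed inside the block instead.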

#page_count ⇒ `Integer`

Returns number of pages.

Returns:

(Integer) —

number of pages.

# File 'lib/pdf_oxide/pdf_document.rb', line 105

def page_count
  err = ::FFI::MemoryPointer.new(:int32)
  n = Bindings.pdf_document_get_page_count(handle, err)
  raise_for_code(err.read_int32, 'page_count')
  n
end

#pages ⇒ `Array<PdfPage>`

Returns every page in the document (eager).

Returns:

(Array<PdfPage>) —

every page in the document (eager).

# File 'lib/pdf_oxide/pdf_document.rb', line 273

def pages
  n = page_count
  Array.new(n) { |i| PdfPage.new(self, i) }
end

#pdf_version ⇒ `String`

Returns PDF version string (e.g. “1.7”).

Returns:

(String) —

PDF version string (e.g. “1.7”).

# File 'lib/pdf_oxide/pdf_document.rb', line 113

def pdf_version
  maj = ::FFI::MemoryPointer.new(:uint8)
  min = ::FFI::MemoryPointer.new(:uint8)
  Bindings.pdf_document_get_version(handle, maj, min)
  "#{maj.read_uint8}.#{min.read_uint8}"
rescue ::FFI::NotFoundError
  'unknown'
end

#render(page_index, dpi: 150) ⇒ `String`

Render a single page to PNG bytes at the supplied DPI.

Parameters:

page_index (Integer)
dpi (Integer) (defaults to: 150) —

resolution (default 150).

Returns:

(String) —

PNG-encoded image bytes (BINARY).

Raises:

(InternalError) —

the native renderer returned a null image pointer.

# File 'lib/pdf_oxide/pdf_document.rb', line 249

def render(page_index, dpi: 150)
  validate_page_index(page_index)
  err = ::FFI::MemoryPointer.new(:int32)
  img_ptr = Bindings.pdf_render_page_zoom(handle, page_index, dpi.to_f / 72.0, 0, err)
  raise_for_code(err.read_int32, 'render')
  raise InternalError, 'render returned null' if img_ptr.nil? || img_ptr.null?

  # Read length + bytes via rendered image helpers.  The cdylib
  # exposes `pdf_oxide_rendered_image_*` accessors; the simpler
  # path is the byte-buffer accessor introduced for v0.3.5x.
  bytes = read_rendered_image_bytes(img_ptr)
  Bindings.pdf_rendered_image_free(img_ptr) if Bindings.respond_to?(:pdf_rendered_image_free)
  bytes.force_encoding(Encoding::BINARY)
end
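
The `dpi.to_f / 72.0` expression converts a DPI value into the zoom factor handed to the native renderer; PDF user space defaults to 72 units per inch, so 72 DPI corresponds to a zoom of 1.0. The small sketch below works through the arithmetic. `zoom_for` is a hypothetical helper, and the pixel sizes are estimates, since the renderer's own rounding is not documented here.

```ruby
# Hypothetical helper: the same conversion #render performs inline.
def zoom_for(dpi)
  dpi.to_f / 72.0
end

p zoom_for(72)            # 1.0
p zoom_for(144)           # 2.0
p zoom_for(150).round(3)  # 2.083

# A US Letter page is 612 x 792 points; at 150 DPI that is roughly:
p [(612 * zoom_for(150)).round, (792 * zoom_for(150)).round] # [1275, 1650]
```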

#search(query, case_sensitive: false, regex: false) ⇒ `Array<Hash>`

Search this document.

Parameters:

query (String) —

literal text (or regex when `regex: true`).
case_sensitive (Boolean) (defaults to: false)
regex (Boolean) (defaults to: false) —

interpret query as a regex.

Returns:

(Array<Hash>) —

each match has keys :page, :text, :bbox (where :bbox is a Hash with :x, :y, :width, :height).

Raises:

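(::PdfOxide::ArgumentError) —

`query` is nil.
(UnsupportedFeatureError) —

`regex: true` was requested but this cdylib build has no regex search entry point.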

# File 'lib/pdf_oxide/pdf_document.rb', line 193

def search(query, case_sensitive: false, regex: false)
  raise ::PdfOxide::ArgumentError, 'query cannot be nil' if query.nil?
  raise UnsupportedFeatureError, 'regex search not supported by this cdylib build' \
    if regex && !Bindings.respond_to?(:pdf_document_search_regex)

  err = ::FFI::MemoryPointer.new(:int32)
  query_utf8 = StringMarshaller.to_utf8(query)
  results = if regex
              Bindings.pdf_document_search_regex(handle, query_utf8, case_sensitive, err)
            else
              Bindings.pdf_document_search_all(handle, query_utf8, case_sensitive, err)
            end
  raise_for_code(err.read_int32, 'search')
  parse_search_results(results)
end
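
Because each match is a plain Hash with `:page`, `:text` and `:bbox`, ordinary Enumerable methods are enough to post-process results. The sketch below uses an invented sample in the documented shape rather than a real search call.

```ruby
# Invented sample in the documented shape: :page, :text, :bbox.
matches = [
  { page: 0, text: 'Total', bbox: { x: 72.0, y: 120.0, width: 40.0, height: 12.0 } },
  { page: 2, text: 'total', bbox: { x: 90.0, y: 300.0, width: 38.0, height: 12.0 } },
  { page: 2, text: 'Total', bbox: { x: 90.0, y: 540.0, width: 40.0, height: 12.0 } }
]

# Count hits per page.
hits_per_page = matches.group_by { |m| m[:page] }.transform_values(&:size)
p hits_per_page

# Pull the position of the first hit on page index 2.
first_on_page_two = matches.find { |m| m[:page] == 2 }
p first_on_page_two[:bbox].values_at(:x, :y)
```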

#to_html(page_index = nil) ⇒ `String`

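Convert one page to HTML. With no `page_index`, the call is forwarded to `MarkdownConverter.to_html(self)` without a page argument.

Parameters:

page_index (Integer, nil) (defaults to: nil)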

Returns:

(String) —

HTML.




# File 'lib/pdf_oxide/pdf_document.rb', line 183

def to_html(page_index = nil)
  page_index.nil? ? MarkdownConverter.to_html(self) : MarkdownConverter.to_html(self, page_index)
end

#to_markdown(page_index = nil) ⇒ `String`

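Convert one page to Markdown. With no `page_index`, the call is forwarded to `MarkdownConverter.to_markdown(self)` without a page argument.

Parameters:

page_index (Integer, nil) (defaults to: nil)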

Returns:

(String) —

Markdown.




# File 'lib/pdf_oxide/pdf_document.rb', line 176

def to_markdown(page_index = nil)
  page_index.nil? ? MarkdownConverter.to_markdown(self) : MarkdownConverter.to_markdown(self, page_index)
end
