Class: Rpdfium::Document

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/rpdfium/document.rb

Overview

Document-level wrapper. Exposes:

  • opening from a path, an IO, or raw bytes

  • page access by index

  • metadata (Title, Author, etc.)

  • permissions

  • outline (bookmarks)

  • attachments

  • form environment (lazy)

Constant Summary

META_KEYS =
%w[Title Author Subject Keywords Creator Producer
CreationDate ModDate Trapped].freeze
PERMISSIONS =

Permission bits according to the PDF spec (Table 22 §7.6.3.2)

{
  print:       1 << 2,
  modify:      1 << 3,
  copy:        1 << 4,
  annotate:    1 << 5,
  fill_forms:  1 << 8,
  extract_acc: 1 << 9,
  assemble:    1 << 10,
  print_hq:    1 << 11
}.freeze
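
As a minimal sketch of how these masks are read, the snippet below decodes a raw permission word the same way `#permissions` does. The masks are copied from the constant above so the sketch runs without the gem; the `bits` value is a made-up example, not output from a real document.

```ruby
# Same masks as PERMISSIONS above, duplicated so this runs standalone.
PERMISSIONS = {
  print:       1 << 2,
  modify:      1 << 3,
  copy:        1 << 4,
  annotate:    1 << 5,
  fill_forms:  1 << 8,
  extract_acc: 1 << 9,
  assemble:    1 << 10,
  print_hq:    1 << 11
}.freeze

# Hypothetical permission word: printing (low and high quality) and
# copying are allowed, everything else is denied.
bits = (1 << 2) | (1 << 4) | (1 << 11)

# A permission is granted only when its bit is set in the word.
decoded = PERMISSIONS.transform_values { |mask| (bits & mask) == mask }
allowed = decoded.select { |_name, ok| ok }.keys

puts allowed.inspect # prints [:print, :copy, :print_hq]
```

Every key is always present in the decoded Hash, so callers can test `decoded[:modify]` directly instead of checking for a missing key.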
FORM_TYPES =

Form types: maps the PDFium FORMTYPE_* codes to symbols.

{
  Raw::FORMTYPE_NONE      => :none,
  Raw::FORMTYPE_ACRO_FORM => :acroform,
  Raw::FORMTYPE_XFA_FULL  => :xfa_full,
  Raw::FORMTYPE_XFA_FOREGROUND => :xfa_foreground
}.freeze

Constructor Details

#initialize(input, password: nil) ⇒ Document

Returns a new instance of Document.



# File 'lib/rpdfium/document.rb', line 30

def initialize(input, password: nil)
  Rpdfium.init!
  @password = password
  @source   = input
  handle, retain_buffer = load_handle(input, password)
  if handle.null?
    code = Rpdfium.last_error_code
    msg  = Rpdfium.last_error_message
    raise PasswordError, msg if code == 4

    raise LoadError, "Failed to load PDF: #{msg}"
  end
  # State shared between the instance and the finalizer. Wrapped in a
  # mutable Hash because the finalizer closure and the explicit
  # close() must see the same :closed flag — otherwise whichever
  # arrives second calls FPDF_CloseDocument on an already-freed
  # handle and PDFium segfaults.
  @state = {
    handle: handle,
    retain_buffer: retain_buffer,
    closed: false
  }
  @form_env = nil
  @page_cache = {}
  # IMPORTANT: the finalizer captures @state (Hash), NOT self.
  # Capturing self would prevent the GC from collecting the Document.
  # Moreover the finalizer does NOT touch @page_cache: Pages have
  # their own individual finalizer, and the execution order among
  # finalizers is non-deterministic in Ruby.
  ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
end
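
The shared-state finalizer pattern described in the comments above can be sketched in isolation. `FakeNative` and `Resource` are hypothetical stand-ins invented for this example (they are not part of Rpdfium); the point is only that the finalizer proc and `close` both consult the same mutable Hash, so the native release happens once.

```ruby
# Stand-in for the native library: counts how often a handle is released.
module FakeNative
  @releases = 0
  class << self
    attr_reader :releases

    def release(_handle)
      @releases += 1
    end
  end
end

class Resource
  # Built in a class method so the proc closes over `state` only,
  # never over the Resource instance itself.
  def self.finalizer(state)
    proc do
      next if state[:closed]

      FakeNative.release(state[:handle])
      state[:closed] = true
    end
  end

  def initialize(handle)
    @state = { handle: handle, closed: false }
    ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
  end

  def close
    return if @state[:closed]

    FakeNative.release(@state[:handle])
    @state[:closed] = true
    ObjectSpace.undefine_finalizer(self)
  end
end

# Explicit close: the second call is a no-op.
res = Resource.new(:handle_1)
res.close
res.close

# Finalizer path, invoked by hand here: the second run sees the flag.
state = { handle: :handle_2, closed: false }
fin = Resource.finalizer(state)
fin.call(0)
fin.call(0)

puts FakeNative.releases # prints 2 (one release per handle)
```

Had the proc been created inside `initialize` with a plain block, it would have captured `self` and kept the object reachable, so the finalizer would never run.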

Instance Attribute Details

#source ⇒ Object (readonly)

Returns the value of attribute source.



# File 'lib/rpdfium/document.rb', line 17

def source
  @source
end

Class Method Details

.finalizer(state) ⇒ Object



# File 'lib/rpdfium/document.rb', line 62

def self.finalizer(state)
  proc do
    next if state[:closed]
    next if state[:handle].null?

    Raw.FPDF_CloseDocument(state[:handle])
    state[:closed] = true
    state[:retain_buffer] = nil
  end
end

.open(input, password: nil, &block) ⇒ Object



# File 'lib/rpdfium/document.rb', line 19

def self.open(input, password: nil, &block)
  doc = new(input, password: password)
  return doc unless block_given?

  begin
    yield doc
  ensure
    doc.close
  end
end
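
A small sketch of the block-form contract, using a hypothetical `FakeDocument` stand-in rather than the real class: with a block, the document is closed even when the block raises, and the block's value is returned; without a block, the caller owns the document and must close it.

```ruby
# Hypothetical stand-in that mirrors the shape of .open shown above.
class FakeDocument
  attr_reader :closed

  def self.open(name)
    doc = new(name)
    return doc unless block_given?

    begin
      yield doc
    ensure
      doc.close
    end
  end

  def initialize(name)
    @name = name
    @closed = false
  end

  def close
    @closed = true
  end
end

# Block form with an error: close still runs before the error propagates.
leaked = nil
begin
  FakeDocument.open("a.pdf") do |doc|
    leaked = doc
    raise "boom"
  end
rescue RuntimeError
  # error reaches the caller only after the ensure clause has closed the doc
end

# Block form without an error: the block's value is the return value.
result = FakeDocument.open("b.pdf") { |_doc| :value_from_block }

# No block: the document is returned open and the caller must close it.
manual = FakeDocument.open("c.pdf")
```

With the real class the same shape would read `Rpdfium::Document.open(path) { |doc| doc.page_count }`, where `path` is whatever input the caller has.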

Instance Method Details

#attachments ⇒ Object

Returns the document's embedded attachments as an Array of Attachment objects.


# File 'lib/rpdfium/document.rb', line 199

def attachments
  n = Raw.FPDFDoc_GetAttachmentCount(@state[:handle])
  Array.new(n) { |i| Attachment.new(self, i) }
end

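#close ⇒ Object

Closes the form environment and any cached pages, then the document itself. Calling it again on an already closed document does nothing.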


# File 'lib/rpdfium/document.rb', line 206

def close
  return if @state[:closed]

  # Order: close form env and cached pages first, then the document.
  @form_env&.close
  @page_cache.each_value(&:close)
  @page_cache.clear
  Raw.FPDF_CloseDocument(@state[:handle]) unless @state[:handle].null?
  @state[:handle] = FFI::Pointer::NULL
  @state[:retain_buffer] = nil
  @state[:closed] = true
  ObjectSpace.undefine_finalizer(self)
end

#closed? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rpdfium/document.rb', line 220

def closed?
  @state[:closed]
end

#each ⇒ Object



# File 'lib/rpdfium/document.rb', line 97

def each
  return enum_for(:each) unless block_given?

  page_count.times { |i| yield page(i) }
end

#each_page_streaming ⇒ Object

Iterates over the pages WITHOUT retaining them in the page cache: each page is closed (its native FPDF_PAGE and text page handles, plus the per-page char and line-segment caches) as soon as the block returns.

`#each` caches every visited page for the document's whole lifetime. That is ideal for interactive, random-access use, but for a single linear pass over a large document it makes peak memory grow with the page count (each page keeps thousands of char hashes alive). The batch helpers (`Rpdfium.extract_text`, `.extract_tables`, `.render_to_pngs`) visit each page exactly once, so they stream instead: only one page is alive at a time and peak RSS stays flat in the number of pages.



# File 'lib/rpdfium/document.rb', line 114

def each_page_streaming
  return enum_for(:each_page_streaming) unless block_given?

  ensure_open!
  page_count.times do |i|
    pg = Page.new(self, i)
    begin
      yield pg
    ensure
      pg.close
    end
  end
end
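
To make the memory argument concrete, here is a rough sketch that counts live pages under both strategies. `FakePage`, `caching_pass`, and `streaming_pass` are hypothetical names invented for the illustration; the counters stand in for native handles and are not a measurement of real memory use.

```ruby
# Hypothetical page that tracks how many instances are open at once.
class FakePage
  @alive = 0
  @peak = 0
  class << self
    attr_accessor :alive, :peak
  end

  def initialize
    self.class.alive += 1
    self.class.peak = [self.class.peak, self.class.alive].max
  end

  def close
    self.class.alive -= 1
  end
end

# Caching pass: every visited page stays alive until the cache is cleared.
def caching_pass(count)
  cache = {}
  count.times { |i| cache[i] ||= FakePage.new }
  cache
end

# Streaming pass: each page is closed as soon as the block returns.
def streaming_pass(count)
  count.times do
    pg = FakePage.new
    begin
      yield pg
    ensure
      pg.close
    end
  end
end

streaming_pass(100) { |pg| pg }
streaming_peak = FakePage.peak

# Reset the counters before measuring the caching strategy.
FakePage.alive = 0
FakePage.peak = 0
cache = caching_pass(100)
caching_peak = FakePage.peak
cache.each_value(&:close)

puts "streaming peak: #{streaming_peak}, caching peak: #{caching_peak}"
# prints "streaming peak: 1, caching peak: 100"
```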

#file_version ⇒ Object



# File 'lib/rpdfium/document.rb', line 141

def file_version
  buf = FFI::MemoryPointer.new(:int)
  return nil if Raw.FPDF_GetFileVersion(@state[:handle], buf) == 0

  v = buf.read_int
  # PDFium returns 14 → 1.4, 17 → 1.7
  "#{v / 10}.#{v % 10}"
end
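
The integer-to-string conversion at the end of the method can be checked on its own. `format_pdf_version` is a hypothetical helper name used only for this sketch; it repeats the same arithmetic as the last line above.

```ruby
# Splits a two-digit version integer into "major.minor".
def format_pdf_version(v)
  "#{v / 10}.#{v % 10}"
end

puts format_pdf_version(14) # prints 1.4
puts format_pdf_version(17) # prints 1.7
puts format_pdf_version(20) # prints 2.0
```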

#form_env ⇒ Object

Lazy form environment. Required to:

  • read FormFieldType/Value/Name on widget annotations

  • render the form fields over the page (FFLDraw)



# File 'lib/rpdfium/document.rb', line 187

def form_env
  @form_env ||= Form::Environment.new(self) if has_forms?
end

#form_type ⇒ Object



# File 'lib/rpdfium/document.rb', line 176

def form_type
  FORM_TYPES[Raw.FPDF_GetFormType(@state[:handle])] || :unknown
end

#handle ⇒ Object



# File 'lib/rpdfium/document.rb', line 73

def handle
  @state[:handle]
end

#has_forms? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rpdfium/document.rb', line 180

def has_forms?
  form_type != :none
end

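#metadata ⇒ Object

Returns a Hash of the non-empty document information entries listed in META_KEYS, keyed by the downcased entry name as a Symbol.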


# File 'lib/rpdfium/document.rb', line 134

def metadata
  META_KEYS.each_with_object({}) do |key, h|
    v = Raw.read_utf16_string(:FPDF_GetMetaText, @state[:handle], key)
    h[key.downcase.to_sym] = v unless v.empty?
  end
end
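
The shape of the resulting Hash is easiest to see with sample data. In this sketch the `raw` Hash is a hypothetical stand-in for the native lookup, and it assumes (as the `v.empty?` check suggests) that a missing entry comes back as an empty string.

```ruby
META_KEYS = %w[Title Author Subject Keywords Creator Producer
               CreationDate ModDate Trapped].freeze

# Made-up values; keys absent from this Hash behave like empty entries.
raw = {
  "Title"        => "Annual Report",
  "Author"       => "",
  "CreationDate" => "D:20240101120000Z"
}

metadata = META_KEYS.each_with_object({}) do |key, h|
  v = raw.fetch(key, "")
  # Empty values are skipped, so the result holds only populated entries.
  h[key.downcase.to_sym] = v unless v.empty?
end

puts metadata.keys.inspect # prints [:title, :creationdate]
```

Note that multi-word keys are simply downcased, so `CreationDate` becomes `:creationdate`, not `:creation_date`.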

#outline ⇒ Object

Returns the document outline (bookmarks), built by Outline.from_document.


# File 'lib/rpdfium/document.rb', line 193

def outline
  Outline.from_document(self)
end

#page(index) ⇒ Object Also known as: []

Raises:

  • (PageError) if index is outside 0...page_count



# File 'lib/rpdfium/document.rb', line 86

def page(index)
  ensure_open!
  raise PageError, "Page index #{index} out of range" unless (0...page_count).cover?(index)

  # Pages are cacheable: reloading them is expensive and the objects
  # are immutable from the application's point of view (in read-only
  # mode).
  @page_cache[index] ||= Page.new(self, index)
end

#page_count ⇒ Object Also known as: size, length

Returns the number of pages in the document.


# File 'lib/rpdfium/document.rb', line 79

def page_count
  ensure_open!
  Raw.FPDF_GetPageCount(@state[:handle])
end

#page_label(index) ⇒ Object



# File 'lib/rpdfium/document.rb', line 128

def page_label(index)
  Raw.read_utf16_string(:FPDF_GetPageLabel, @state[:handle], index)
end

#permissions ⇒ Object



# File 'lib/rpdfium/document.rb', line 162

def permissions
  bits = Raw.FPDF_GetDocPermissions(@state[:handle])
  PERMISSIONS.transform_values { |mask| (bits & mask) == mask }
end