Class: Rpdfium::Document
- Inherits:
- Object
- Includes:
- Enumerable
- Defined in:
- lib/rpdfium/document.rb
Overview
Document-level wrapper. Exposes:
- opening from a path, an IO, or raw bytes, and page access by index
- metadata (Title, Author, etc.)
- permissions
- outline (bookmarks)
- attachments
- form environment (lazy)
Constant Summary
- META_KEYS =
%w[Title Author Subject Keywords Creator Producer CreationDate ModDate Trapped].freeze
- PERMISSIONS =
Permission bits according to the PDF spec (Table 22 §7.6.3.2)
{ print: 1 << 2, modify: 1 << 3, copy: 1 << 4, annotate: 1 << 5, fill_forms: 1 << 8, extract_acc: 1 << 9, assemble: 1 << 10, print_hq: 1 << 11 }.freeze
- FORM_TYPES =
{ Raw::FORMTYPE_NONE => :none, Raw::FORMTYPE_ACRO_FORM => :acroform, Raw::FORMTYPE_XFA_FULL => :xfa_full, Raw::FORMTYPE_XFA_FOREGROUND => :xfa_foreground }.freeze
Instance Attribute Summary
- #source ⇒ Object (readonly)
Returns the value of attribute source.
Class Method Summary
- .finalizer(state) ⇒ Object
- .open(input, password: nil, &block) ⇒ Object
Instance Method Summary
- #attachments ⇒ Object
- #close ⇒ Object
- #closed? ⇒ Boolean
- #each ⇒ Object
- #each_page_streaming ⇒ Object
Iterates the pages WITHOUT retaining them in the page cache: each page is closed (native FPDF_PAGE / text page handles and the per-page char and line-segment caches) as soon as the block returns.
- #file_version ⇒ Object
- #form_env ⇒ Object
Lazy form environment.
- #form_type ⇒ Object
- #handle ⇒ Object
- #has_forms? ⇒ Boolean
- #initialize(input, password: nil) ⇒ Document (constructor)
A new instance of Document.
- #metadata ⇒ Object
- #outline ⇒ Object
- #page(index) ⇒ Object (also: #[])
- #page_count ⇒ Object (also: #size, #length)
- #page_label(index) ⇒ Object
- #permissions ⇒ Object
Constructor Details
#initialize(input, password: nil) ⇒ Document
Returns a new instance of Document.
# File 'lib/rpdfium/document.rb', line 30
def initialize(input, password: nil)
  Rpdfium.init!
  @password = password
  @source = input
  handle, retain_buffer = load_handle(input, password)
  if handle.null?
    code = Rpdfium.last_error_code
    msg = Rpdfium.…  # method name missing in the source listing
    raise PasswordError, msg if code == 4
    raise LoadError, "Failed to load PDF: #{msg}"
  end

  # State shared between the instance and the finalizer. Wrapped in a
  # mutable Hash because the finalizer closure and the explicit
  # close() must see the same :closed flag; otherwise whichever
  # arrives second calls FPDF_CloseDocument on an already-freed
  # handle and PDFium segfaults.
  @state = { handle: handle, retain_buffer: retain_buffer, closed: false }
  @form_env = nil
  @page_cache = {}

  # IMPORTANT: the finalizer captures @state (Hash), NOT self.
  # Capturing self would prevent the GC from collecting the Document.
  # Moreover the finalizer does NOT touch @page_cache: Pages have
  # their own individual finalizer, and the execution order among
  # finalizers is non-deterministic in Ruby.
  ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
end
Instance Attribute Details
#source ⇒ Object (readonly)
Returns the value of attribute source.
# File 'lib/rpdfium/document.rb', line 17
def source
  @source
end
Class Method Details
.finalizer(state) ⇒ Object
# File 'lib/rpdfium/document.rb', line 62
def self.finalizer(state)
  proc do
    next if state[:closed]
    next if state[:handle].null?

    Raw.FPDF_CloseDocument(state[:handle])
    state[:closed] = true
    state[:retain_buffer] = nil
  end
end
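The shared-state idea behind `.finalizer` can be tried without PDFium. The sketch below is only an illustration: `FakeNative` and `Wrapper` are hypothetical stand-ins, not part of Rpdfium. It shows that when the explicit close and the finalizer proc consult the same Hash, the resource is released exactly once, whichever runs first.

```ruby
# Hypothetical stand-in for a native handle; counts how often it is released.
class FakeNative
  attr_reader :release_count

  def initialize
    @release_count = 0
  end

  def release
    @release_count += 1
  end
end

# Hypothetical wrapper that mirrors the state-Hash pattern described above.
class Wrapper
  # Built in a class method so the proc closes over `state` only,
  # never over the Wrapper instance itself.
  def self.finalizer(state)
    proc do
      next if state[:closed]

      state[:native].release
      state[:closed] = true
    end
  end

  def initialize(native)
    @state = { native: native, closed: false }
    ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
  end

  def close
    return if @state[:closed]

    @state[:native].release
    @state[:closed] = true
    ObjectSpace.undefine_finalizer(self)
  end
end

# Case 1: explicit close happens first, then a late finalizer call
# on the same state Hash. The finalizer sees :closed and does nothing.
native = FakeNative.new
state = { native: native, closed: false }
late_finalizer = Wrapper.finalizer(state)
state[:native].release
state[:closed] = true
late_finalizer.call
puts native.release_count # 1

# Case 2: closing a wrapper twice is a no-op the second time.
w_native = FakeNative.new
w = Wrapper.new(w_native)
w.close
w.close
puts w_native.release_count # 1
```

If the flag lived in an instance variable instead, the proc would have to capture the instance to read it, which is exactly what the comment in `#initialize` warns against.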
.open(input, password: nil, &block) ⇒ Object
# File 'lib/rpdfium/document.rb', line 19
def self.open(input, password: nil, &block)
  doc = new(input, password: password)
  return doc unless block_given?

  begin
    yield doc
  ensure
    doc.close
  end
end
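The block form of `.open` follows the usual Ruby `File.open` convention: with a block, the object is closed when the block exits, even on an exception, and the block's value is returned; without a block, the caller owns the object and must close it. A minimal sketch of that control flow, using a hypothetical `Resource` class in place of a real document:

```ruby
# Hypothetical closable resource, standing in for Rpdfium::Document.
class Resource
  def self.open
    res = new
    return res unless block_given?

    begin
      yield res
    ensure
      res.close
    end
  end

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

# Block form: closed even though the block raises.
leaked = nil
begin
  Resource.open do |res|
    leaked = res
    raise ArgumentError, "simulated failure inside the block"
  end
rescue ArgumentError
  # The exception still propagates; the resource was closed on the way out.
end
puts leaked.closed? # true

# Block form: the block's value is the return value of .open.
value = Resource.open { |_res| 42 }
puts value # 42

# Non-block form: the caller is responsible for closing.
manual = Resource.open
puts manual.closed? # false
manual.close
```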
Instance Method Details
#attachments ⇒ Object
# File 'lib/rpdfium/document.rb', line 199
def attachments
  n = Raw.FPDFDoc_GetAttachmentCount(@state[:handle])
  Array.new(n) { |i| Attachment.new(self, i) }
end
#close ⇒ Object
# File 'lib/rpdfium/document.rb', line 206
def close
  return if @state[:closed]

  # Order: close form env and cached pages first, then the document.
  @form_env&.close
  @page_cache.each_value(&:close)
  @page_cache.clear
  Raw.FPDF_CloseDocument(@state[:handle]) unless @state[:handle].null?
  @state[:handle] = FFI::Pointer::NULL
  @state[:retain_buffer] = nil
  @state[:closed] = true
  ObjectSpace.undefine_finalizer(self)
end
#closed? ⇒ Boolean
# File 'lib/rpdfium/document.rb', line 220
def closed?
  @state[:closed]
end
#each ⇒ Object
# File 'lib/rpdfium/document.rb', line 97
def each
  return enum_for(:each) unless block_given?

  page_count.times { |i| yield page(i) }
end
#each_page_streaming ⇒ Object
Iterates the pages WITHOUT retaining them in the page cache: each page is closed (native FPDF_PAGE / text page handles and the per-page char and line-segment caches) as soon as the block returns.
`#each` caches every visited page for the document's whole lifetime. That is ideal for interactive, random-access use, but for a single linear pass over a large document it makes peak memory grow with the page count (each page keeps thousands of char hashes alive). The batch helpers (`Rpdfium.extract_text`, `.extract_tables`, `.render_to_pngs`) visit each page exactly once, so they stream instead: only one page is alive at a time and peak RSS stays flat in the number of pages.
# File 'lib/rpdfium/document.rb', line 114
def each_page_streaming
  return enum_for(:each_page_streaming) unless block_given?

  ensure_open!
  page_count.times do |i|
    pg = Page.new(self, i)
    begin
      yield pg
    ensure
      pg.close
    end
  end
end
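The memory difference between the two iteration styles can be illustrated with stand-ins. In the sketch below, `TinyPage` and `TinyDoc` are hypothetical classes (not Rpdfium's) that count how many pages are open at once; the two iteration methods copy the shape of `#each` and `#each_page_streaming` shown on this page.

```ruby
# Hypothetical page: tracks the number of simultaneously open pages.
class TinyPage
  @live = 0
  @peak = 0
  class << self
    attr_accessor :live, :peak
  end

  def initialize(index)
    @index = index
    self.class.live += 1
    self.class.peak = [self.class.peak, self.class.live].max
  end

  def close
    self.class.live -= 1
  end
end

# Hypothetical document with a page cache, like the one described above.
class TinyDoc
  include Enumerable

  def initialize(page_count)
    @page_count = page_count
    @page_cache = {}
  end

  # Caching iteration: every visited page stays in @page_cache.
  def each
    return enum_for(:each) unless block_given?

    @page_count.times { |i| yield(@page_cache[i] ||= TinyPage.new(i)) }
  end

  # Streaming iteration: each page is closed when the block returns.
  def each_page_streaming
    return enum_for(:each_page_streaming) unless block_given?

    @page_count.times do |i|
      pg = TinyPage.new(i)
      begin
        yield pg
      ensure
        pg.close
      end
    end
  end
end

TinyDoc.new(100).each_page_streaming { |_pg| }
streaming_peak = TinyPage.peak

# Reset the counters before measuring the caching variant.
TinyPage.live = 0
TinyPage.peak = 0
TinyDoc.new(100).each { |_pg| }
caching_peak = TinyPage.peak

puts "streaming peak: #{streaming_peak}, caching peak: #{caching_peak}"
# streaming peak: 1, caching peak: 100
```

One consequence of this shape: anything that collects the yielded pages for later use (for example calling `to_a` on the enumerator returned without a block) would hold pages that have already been closed, so the streaming form suits work that finishes inside the block.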
#file_version ⇒ Object
# File 'lib/rpdfium/document.rb', line 141
def file_version
  buf = FFI::MemoryPointer.new(:int)
  return nil if Raw.FPDF_GetFileVersion(@state[:handle], buf) == 0

  v = buf.read_int
  # PDFium returns 14 → 1.4, 17 → 1.7
  "#{v / 10}.#{v % 10}"
end
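The version formatting is plain integer arithmetic and can be checked in isolation. The helper name `format_pdf_version` below is made up for this sketch; it reproduces only the last line of `#file_version`, not the PDFium call.

```ruby
# Formats the integer PDFium reports (14, 17, ...) as a dotted version string.
# Hypothetical helper; mirrors the final expression of #file_version.
def format_pdf_version(v)
  "#{v / 10}.#{v % 10}"
end

puts format_pdf_version(14) # 1.4
puts format_pdf_version(17) # 1.7
puts format_pdf_version(20) # 2.0
```

Note that `#file_version` itself returns `nil` rather than a string when `FPDF_GetFileVersion` reports failure, so callers should be prepared for both.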
#form_env ⇒ Object
Lazy form environment. Required to:
- read FormFieldType/Value/Name on widget annotations
- render the form fields over the page (FFLDraw)
# File 'lib/rpdfium/document.rb', line 187
def form_env
  @form_env ||= Form::Environment.new(self) if has_forms?
end
#form_type ⇒ Object
# File 'lib/rpdfium/document.rb', line 176
def form_type
  FORM_TYPES[Raw.FPDF_GetFormType(@state[:handle])] || :unknown
end
#handle ⇒ Object
# File 'lib/rpdfium/document.rb', line 73
def handle
  @state[:handle]
end
#has_forms? ⇒ Boolean
# File 'lib/rpdfium/document.rb', line 180
def has_forms?
  form_type != :none
end
#metadata ⇒ Object
# File 'lib/rpdfium/document.rb', line 134
def metadata
  META_KEYS.each_with_object({}) do |key, h|
    v = Raw.read_utf16_string(:FPDF_GetMetaText, @state[:handle], key)
    h[key.downcase.to_sym] = v unless v.empty?
  end
end
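The shape of the Hash that `#metadata` builds follows from its key handling: each `META_KEYS` entry is downcased and symbolized, and empty values are skipped. The sketch below reproduces that loop with a hypothetical `raw` Hash standing in for the `FPDF_GetMetaText` lookups; the sample values are invented.

```ruby
META_KEYS = %w[Title Author Subject Keywords Creator Producer CreationDate ModDate Trapped].freeze

# Hypothetical raw values standing in for what PDFium would return per key.
raw = {
  "Title" => "Annual Report",
  "Author" => "",
  "CreationDate" => "D:20240101120000Z"
}

meta = META_KEYS.each_with_object({}) do |key, h|
  v = raw.fetch(key, "")
  # Same normalization as #metadata: downcased symbol key, empty values dropped.
  h[key.downcase.to_sym] = v unless v.empty?
end

p meta.keys
```

With this input the keys are `:title` and `:creationdate`. The compound names are downcased as a whole (`:creationdate`, `:moddate`), not converted to snake_case, and a key with an empty value such as `Author` is absent rather than mapped to `""`.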
#outline ⇒ Object
# File 'lib/rpdfium/document.rb', line 193
def outline
  Outline.from_document(self)
end
#page(index) ⇒ Object Also known as: []
# File 'lib/rpdfium/document.rb', line 86
def page(index)
  ensure_open!
  raise PageError, "Page index #{index} out of range" unless (0...page_count).cover?(index)

  # Pages are cacheable: reloading them is expensive and the objects
  # are immutable from the application's point of view (in read-only
  # mode).
  @page_cache[index] ||= Page.new(self, index)
end
#page_count ⇒ Object Also known as: size, length
# File 'lib/rpdfium/document.rb', line 79
def page_count
  ensure_open!
  Raw.FPDF_GetPageCount(@state[:handle])
end
#page_label(index) ⇒ Object
# File 'lib/rpdfium/document.rb', line 128
def page_label(index)
  Raw.read_utf16_string(:FPDF_GetPageLabel, @state[:handle], index)
end
#permissions ⇒ Object
# File 'lib/rpdfium/document.rb', line 162
def permissions
  bits = Raw.FPDF_GetDocPermissions(@state[:handle])
  PERMISSIONS.transform_values { |mask| (bits & mask) == mask }
end
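The bitmask decoding in `#permissions` can be exercised without a PDF. The sketch below copies the `PERMISSIONS` table from this page and applies the same `transform_values` test to a hand-built integer; `decode_permissions` and `sample_bits` are names invented for the example, and the sample value is not taken from any real document.

```ruby
# Stand-alone copy of the PERMISSIONS table shown on this page.
PERMISSIONS = {
  print: 1 << 2, modify: 1 << 3, copy: 1 << 4, annotate: 1 << 5,
  fill_forms: 1 << 8, extract_acc: 1 << 9, assemble: 1 << 10, print_hq: 1 << 11
}.freeze

# Same decoding as #permissions, but taking the raw integer as an argument
# instead of calling FPDF_GetDocPermissions.
def decode_permissions(bits)
  PERMISSIONS.transform_values { |mask| (bits & mask) == mask }
end

# Hypothetical permission word: printing and copying allowed, nothing else.
sample_bits = (1 << 2) | (1 << 4)
flags = decode_permissions(sample_bits)

p flags.select { |_name, allowed| allowed }.keys # [:print, :copy]
```

The result always contains all eight keys, each mapped to `true` or `false`, so callers can write `doc.permissions[:copy]` without checking for missing entries.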