Class: Rpdfium::Structure::Tree
Inherits: Object
Defined in: lib/rpdfium/structure/tree.rb
Overview
StructTree of a tagged PDF page.
For tagged PDFs (PDF/UA, accessibility-friendly exports from Word/LibreOffice/InDesign), it exposes the logical structure of the document: Document → P, H1, Table, TR, TH, TD, Figure, etc.
For non-tagged PDFs, `Page#struct_tree` returns nil. For “tagged but empty” PDFs (e.g. CR Banca d’Italia, where the StructTreeRoot is present but contains only placeholder elements without type or MCID), `Tree#empty?` returns true.
Lifecycle: the Tree holds an “owning” PDFium handle, meaning that calling `FPDF_StructTree_Close` deallocates the tree. PDFium also deallocates the struct tree automatically when the document is closed, so in practice:
- If you never close the tree explicitly, PDFium frees it at `FPDF_CloseDocument`. Nothing leaks permanently, but the tree stays in memory until the document is closed, and it can be on the order of megabytes.
- For deterministic control (immediate release), use the block form:
    page.struct_tree do |tree|
      tree.walk { |el| ... }
    end
  On exit from the block the tree is closed, even if an exception is raised.
As a design choice we do NOT use `ObjectSpace.define_finalizer`: if the GC were to call `FPDF_StructTree_Close` after the document had already been closed, this would cause a use-after-free and therefore a segfault. Closing via the Document is always safe; closing via `Tree#close` (explicitly or through the block form) requires the document to still be alive.
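The block form described above can be sketched with an `ensure` clause. This is a minimal illustration using stand-ins: `FakeTree` and `with_tree` are hypothetical names, not part of Rpdfium, and the real `Page#struct_tree` may be implemented differently. It only shows that the tree ends up closed when the block raises, and that a second `close` is a no-op.

```ruby
# Hypothetical stand-in for the real Tree: it only records close calls.
class FakeTree
  attr_reader :close_calls

  def initialize
    @closed = false
    @close_calls = 0
  end

  def closed?
    @closed
  end

  # Idempotent: the (simulated) native release runs only on the first call.
  def close
    return if @closed

    @close_calls += 1
    @closed = true
  end
end

# Hypothetical helper: yields the tree and closes it on the way out,
# even if the block raises.
def with_tree(tree)
  yield tree
ensure
  tree.close
end

tree = FakeTree.new
begin
  with_tree(tree) { |_t| raise ArgumentError, "boom" }
rescue ArgumentError
  # The exception still propagates, but `ensure` has already closed the tree.
end
tree.close # second close does nothing
```

In real code the document must still be open when the block exits, since closing the tree after the document would touch freed memory.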
Instance Attribute Summary
- #handle ⇒ Object (readonly)
  Returns the value of attribute handle.
- #page ⇒ Object (readonly)
  Returns the value of attribute page.
Class Method Summary
- .for_page(page) ⇒ Object
  Returns nil if the page is not tagged.
Instance Method Summary
- #close ⇒ Object
  Explicit close (idempotent).
- #closed? ⇒ Boolean
- #empty? ⇒ Boolean
  True if the tree is structurally empty (no element with a readable type among the roots).
- #find_all(type:) ⇒ Object
  Finds all the elements of the specified type (e.g. “Table”, “P”, “Figure”).
- #initialize(page, handle) ⇒ Tree (constructor)
  A new instance of Tree.
- #mcid_text_map ⇒ Object
  Page objects grouped by Marked Content ID, to allow Element#text to resolve the text of its MCIDs.
- #root_count ⇒ Object
  Number of root elements (direct children of the StructTreeRoot for this page).
- #roots ⇒ Object
  Root elements (direct children of the StructTreeRoot).
- #tables ⇒ Object
  Returns all the elements of type “Table”.
- #to_s ⇒ Object (also: #inspect)
- #walk(&block) ⇒ Object
  Depth-first walk of ALL the elements of the tree.
Constructor Details
#initialize(page, handle) ⇒ Tree
Returns a new instance of Tree.
    # File 'lib/rpdfium/structure/tree.rb', line 46
    def initialize(page, handle)
      @page = page
      @handle = handle
      @closed = false
      @mcid_text_cache = nil
      # NOTE: no finalizer. FPDF_StructTree_Close is "owning": it calls
      # ~CPDF_StructTree() which frees the object. If the PDF document
      # is closed before the tree, the GC finalizer would call Close on
      # already-freed memory → segfault. Safe lifetime:
      # - explicit close via `tree.close` or via the block
      #   `page.struct_tree { |tree| ... }`
      # - if nobody closes it explicitly, PDFium frees the tree
      #   together with the document at `FPDF_CloseDocument` (no
      #   persistent leak, only memory held until the doc is closed)
    end
Instance Attribute Details
#handle ⇒ Object (readonly)
Returns the value of attribute handle.
    # File 'lib/rpdfium/structure/tree.rb', line 36
    def handle
      @handle
    end
#page ⇒ Object (readonly)
Returns the value of attribute page.
    # File 'lib/rpdfium/structure/tree.rb', line 36
    def page
      @page
    end
Class Method Details
.for_page(page) ⇒ Object
Returns nil if the page is not tagged; otherwise returns a Tree.
    # File 'lib/rpdfium/structure/tree.rb', line 39
    def self.for_page(page)
      h = Raw.FPDF_StructTree_GetForPage(page.handle)
      return nil if h.null?

      new(page, h)
    end
Instance Method Details
#close ⇒ Object
Explicit close (idempotent). After closing, do not call methods on this Tree or on any Element it produced.
    # File 'lib/rpdfium/structure/tree.rb', line 69
    def close
      return if @closed

      Raw.FPDF_StructTree_Close(@handle)
      @closed = true
      @mcid_text_cache = nil
    end
#closed? ⇒ Boolean
    # File 'lib/rpdfium/structure/tree.rb', line 63
    def closed?
      @closed
    end
#empty? ⇒ Boolean
True if the tree is structurally empty: there are no roots, or no root has a readable type or any children. A common case for “fake-tagged” PDFs such as CR Banca d’Italia: the StructTreeRoot exists but the elements are empty placeholders.
    # File 'lib/rpdfium/structure/tree.rb', line 98
    def empty?
      return true if root_count.zero?

      roots.none? { |r| r.type || r.children.any? }
    end
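To make the emptiness rule concrete, here is a small sketch that applies the same predicate to plain Ruby objects. `StubElement` and `structurally_empty?` are hypothetical stand-ins introduced for illustration; the real method works on `Element` objects backed by PDFium handles.

```ruby
# Hypothetical stand-in for Element: a type (String or nil) and child nodes.
StubElement = Struct.new(:type, :children)

# Same predicate as Tree#empty?, applied to a plain array of roots.
def structurally_empty?(roots)
  return true if roots.empty?

  roots.none? { |r| r.type || r.children.any? }
end

# Only untyped, childless placeholders: counts as empty.
placeholders = Array.new(3) { StubElement.new(nil, []) }

# A typed root: not empty.
tagged = [StubElement.new("Document", [StubElement.new("P", [])])]

# An untyped root that still has children: not empty either.
untyped_wrapper = [StubElement.new(nil, [StubElement.new("P", [])])]

results = {
  no_roots: structurally_empty?([]),
  placeholders: structurally_empty?(placeholders),
  tagged: structurally_empty?(tagged),
  untyped_wrapper: structurally_empty?(untyped_wrapper)
}
```

Note that an untyped root with children is treated as non-empty, so the check looks one level below the roots but no deeper.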
#find_all(type:) ⇒ Object
Finds all the elements of the specified type (e.g. “Table”, “P”, “Figure”). Case-sensitive comparison (PDF types are “Table”, “P”, “H1”, etc.).
    # File 'lib/rpdfium/structure/tree.rb', line 115
    def find_all(type:)
      walk.select { |el| el.type == type }
    end
#mcid_text_map ⇒ Object
Page objects grouped by Marked Content ID, to allow Element#text to resolve the text of its MCIDs. The map is built only once per Tree and cached.
Public but intended for internal use; not part of the stable API.
    # File 'lib/rpdfium/structure/tree.rb', line 130
    def mcid_text_map
      @mcid_text_cache ||= build_mcid_text_map
    end
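The build-once behaviour comes from Ruby's `||=` memoization idiom. The sketch below uses a hypothetical `CachedMap` class with made-up map contents, not the real `build_mcid_text_map`, whose return shape is not shown here. It counts how many times the builder runs and mirrors how `close` clears the cache by resetting the instance variable to nil.

```ruby
# Hypothetical class illustrating `||=` memoization; not part of Rpdfium.
class CachedMap
  attr_reader :builds

  def initialize
    @cache = nil
    @builds = 0
  end

  # Builds the map on first access, then returns the cached object.
  def map
    @cache ||= build_map
  end

  # Drops the cache, similar to what Tree#close does with @mcid_text_cache.
  def reset
    @cache = nil
  end

  private

  # Made-up data keyed by MCID, for illustration only.
  def build_map
    @builds += 1
    { 0 => "Title", 1 => "Body text" }
  end
end

cache = CachedMap.new
first = cache.map
second = cache.map
builds_before_reset = cache.builds

cache.reset
cache.map
builds_after_reset = cache.builds
```

One caveat of this idiom: `||=` rebuilds whenever the cached value is nil or false, so it only memoizes builders that return a truthy object such as a Hash.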
#root_count ⇒ Object
Number of root elements (direct children of the StructTreeRoot for this page). Typically 1 (`<Document>`), but it can be arbitrarily high on odd PDFs (e.g. cu.pdf: 717 placeholders).
    # File 'lib/rpdfium/structure/tree.rb', line 80
    def root_count
      n = Raw.FPDF_StructTree_CountChildren(@handle)
      [n, 0].max
    end
#roots ⇒ Object
Root elements (direct children of the StructTreeRoot). Typically 1 (`<Document>`).
    # File 'lib/rpdfium/structure/tree.rb', line 87
    def roots
      (0...root_count).filter_map do |i|
        h = Raw.FPDF_StructTree_GetChildAtIndex(@handle, i)
        h.null? ? nil : Element.new(self, h)
      end
    end
#tables ⇒ Object
Returns all the elements of type “Table”. Convenient for semantic table extraction.
    # File 'lib/rpdfium/structure/tree.rb', line 121
    def tables
      find_all(type: "Table")
    end
#to_s ⇒ Object Also known as: inspect
    # File 'lib/rpdfium/structure/tree.rb', line 134
    def to_s
      "#<Rpdfium::Structure::Tree roots=#{root_count}#{empty? ? ' empty' : ''}>"
    end
#walk(&block) ⇒ Object
Depth-first walk of ALL the elements of the tree. Equivalent to `roots.flat_map(&:walk)`. Without a block it returns an Enumerator.
    # File 'lib/rpdfium/structure/tree.rb', line 106
    def walk(&block)
      return enum_for(:walk) unless block

      roots.each { |r| r.walk(&block) }
    end
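The walk and the type filter built on top of it can be tried out without a PDF. The following sketch uses hypothetical `Node` and `MiniTree` classes; in particular, `Node#walk` visiting a node before its children (pre-order) is an assumption, since `Element#walk` is not shown on this page.

```ruby
# Hypothetical stand-in for Element; pre-order traversal is an assumption.
class Node
  attr_reader :type, :children

  def initialize(type, children = [])
    @type = type
    @children = children
  end

  # Yields this node, then descends into each child in order.
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Hypothetical stand-in for Tree, reusing the walk/find_all shape shown above.
class MiniTree
  attr_reader :roots

  def initialize(roots)
    @roots = roots
  end

  def walk(&block)
    return enum_for(:walk) unless block

    roots.each { |r| r.walk(&block) }
  end

  # Case-sensitive match on the element type.
  def find_all(type:)
    walk.select { |el| el.type == type }
  end
end

tree = MiniTree.new([
  Node.new("Document", [
    Node.new("H1"),
    Node.new("Table", [Node.new("TR", [Node.new("TH"), Node.new("TD")])]),
    Node.new("P")
  ])
])

order = tree.walk.map(&:type)
tables = tree.find_all(type: "Table")
lowercase = tree.find_all(type: "table")
```

Because `walk` returns an Enumerator when called without a block, any `Enumerable` method (`map`, `select`, `count`, `first`) can be chained onto it, which is what `find_all` relies on.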