Class: Rpdfium::Structure::Tree

Inherits:
Object
Defined in:
lib/rpdfium/structure/tree.rb

Overview

StructTree of a tagged PDF page.

For tagged PDFs (PDF/UA, accessibility-friendly exports from Word/LibreOffice/InDesign), it exposes the logical structure of the document: Document → P, H1, Table, TR, TH, TD, Figure, etc.

For non-tagged PDFs, `Page#struct_tree` returns nil. For “tagged but empty” PDFs (e.g. CR Banca d’Italia, where the StructTreeRoot is present but holds only placeholder elements without type or MCID), `Tree#empty?` returns true.

Lifecycle: the Tree holds an “owning” PDFium handle, meaning that calling `FPDF_StructTree_Close` deallocates it. PDFium automatically deallocates the struct tree when the document is closed, so in practice:

- If you never close the tree explicitly, PDFium frees it at
  `FPDF_CloseDocument` (no persistent leak, but the tree stays
  in memory until the document is closed, and it may occupy
  on the order of megabytes).
- For deterministic control (immediate release), use the block form:

    page.struct_tree do |tree|
      tree.walk { |el| ... }
    end

  On exit from the block the tree is closed, even if an exception is raised.

As a design choice we do NOT use `ObjectSpace.define_finalizer`: if the GC were to call `FPDF_StructTree_Close` after the document had already been closed, this would cause a use-after-free → segfault. Closing via the Document is always safe; closing via `Tree#close` (explicitly or through a block) requires the document to still be alive.
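
The block-scoped lifetime described above can be sketched with `ensure`. This is a minimal illustration, not the actual `Page#struct_tree` implementation: `SketchTree` and `SketchPage` are hypothetical stand-ins, the native close is simulated by a counter, and the nil return for non-tagged pages is not modelled.

```ruby
# Stand-in classes for illustration only; not the Rpdfium implementation.
class SketchTree
  attr_reader :close_calls

  def initialize
    @closed = false
    @close_calls = 0
  end

  def closed?
    @closed
  end

  # Idempotent: the (simulated) native close runs at most once.
  def close
    return if @closed

    @close_calls += 1 # where the real class calls FPDF_StructTree_Close
    @closed = true
  end
end

class SketchPage
  # Without a block: return the tree and leave closing to the caller.
  # With a block: yield the tree and close it on the way out.
  def struct_tree
    tree = SketchTree.new
    return tree unless block_given?

    begin
      yield tree
    ensure
      tree.close # runs on normal exit and when the block raises
    end
  end
end

seen = nil
begin
  SketchPage.new.struct_tree do |tree|
    seen = tree
    raise ArgumentError, "boom"
  end
rescue ArgumentError
  # The exception still propagates; the tree has already been closed.
end

puts seen.closed?
```

Because the close happens in `ensure`, it runs while the caller still holds the page (and therefore the document), which is the ordering the paragraph above requires.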


Constructor Details

#initialize(page, handle) ⇒ Tree

Returns a new instance of Tree.



# File 'lib/rpdfium/structure/tree.rb', line 46

def initialize(page, handle)
  @page = page
  @handle = handle
  @closed = false
  @mcid_text_cache = nil

  # NOTE: no finalizer. FPDF_StructTree_Close is "owning": it calls
  # ~CPDF_StructTree() which frees the object. If the PDF document
  # is closed before the tree, the GC finalizer would call Close on
  # already-freed memory → segfault. Safe lifetime:
  #   - explicit close via `tree.close` or via the block
  #     `page.struct_tree { |tree| ... }`
  #   - if nobody closes it explicitly, PDFium frees the tree
  #     together with the document at `FPDF_CloseDocument` (no
  #     persistent leak, only memory held until the doc is closed)
end

Instance Attribute Details

#handle ⇒ Object (readonly)

Returns the value of attribute handle.



# File 'lib/rpdfium/structure/tree.rb', line 36

def handle
  @handle
end

#page ⇒ Object (readonly)

Returns the value of attribute page.



# File 'lib/rpdfium/structure/tree.rb', line 36

def page
  @page
end

Class Method Details

.for_page(page) ⇒ Object

Returns nil if the page is not tagged; otherwise returns a Tree.



# File 'lib/rpdfium/structure/tree.rb', line 39

def self.for_page(page)
  h = Raw.FPDF_StructTree_GetForPage(page.handle)
  return nil if h.null?

  new(page, h)
end

Instance Method Details

#close ⇒ Object

Explicit close (idempotent). After closing, do not call methods on this Tree or on any Element it produced.



# File 'lib/rpdfium/structure/tree.rb', line 69

def close
  return if @closed

  Raw.FPDF_StructTree_Close(@handle)
  @closed = true
  @mcid_text_cache = nil
end

#closed? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rpdfium/structure/tree.rb', line 63

def closed?
  @closed
end

#empty? ⇒ Boolean

True if the tree is structurally empty (no element with a readable type among the roots). A common case for “fake-tagged” PDFs such as CR Banca d’Italia: the StructTreeRoot exists but the elements are empty placeholders.

Returns:

  • (Boolean)


# File 'lib/rpdfium/structure/tree.rb', line 98

def empty?
  return true if root_count.zero?

  roots.none? { |r| r.type || r.children.any? }
end
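
To make the emptiness rule concrete, here is a small sketch of the same predicate applied to plain Ruby data. `SketchElement` and `sketch_empty?` are hypothetical stand-ins for `Element` and `Tree#empty?`; only the `type` and `children` accessors used in the method above are assumed.

```ruby
# Stand-in for Rpdfium::Structure::Element (illustration only).
SketchElement = Struct.new(:type, :children)

# Same rule as the method above: the tree is empty when there are no
# roots, or when no root has a type and no root has children.
def sketch_empty?(roots)
  return true if roots.empty?

  roots.none? { |r| r.type || r.children.any? }
end

placeholders    = Array.new(3) { SketchElement.new(nil, []) }
document        = SketchElement.new("Document", [SketchElement.new("P", [])])
untyped_wrapper = SketchElement.new(nil, [SketchElement.new("P", [])])

puts sketch_empty?([])                # no roots at all
puts sketch_empty?(placeholders)      # only untyped, childless placeholders
puts sketch_empty?([document])        # a typed root
puts sketch_empty?([untyped_wrapper]) # an untyped root that still has children
```

Note that only the roots are inspected: an untyped root with children counts as non-empty without looking further down the tree.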

#find_all(type:) ⇒ Object

Finds all the elements of the specified type (e.g. “Table”, “P”, “Figure”). Case-sensitive comparison (PDF types are “Table”, “P”, “H1”, etc.).



# File 'lib/rpdfium/structure/tree.rb', line 115

def find_all(type:)
  walk.select { |el| el.type == type }
end

#mcid_text_map ⇒ Object

Page objects grouped by Marked Content ID, to allow Element#text to resolve the text of its MCIDs. The map is built only once per Tree and cached.

Public but intended for internal use; not part of the stable API.



# File 'lib/rpdfium/structure/tree.rb', line 130

def mcid_text_map
  @mcid_text_cache ||= build_mcid_text_map
end
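
The build-once behaviour comes from Ruby's `||=` memoization idiom. The sketch below shows the idiom in isolation; `SketchCache`, its `reset` method, and the map contents are hypothetical and say nothing about what `build_mcid_text_map` actually returns.

```ruby
# Illustration of `||=` memoization; not the Rpdfium implementation.
class SketchCache
  attr_reader :builds

  def initialize
    @builds = 0
    @map = nil
  end

  # Builds the map on first access and reuses it afterwards.
  def map
    @map ||= build_map
  end

  # Dropping the cached value forces a rebuild on the next access.
  def reset
    @map = nil
  end

  private

  def build_map
    @builds += 1
    { 0 => "Heading", 1 => "Body text" } # hypothetical MCID => text pairs
  end
end

cache = SketchCache.new
cache.map
cache.map
puts cache.builds # the builder ran only once for the two calls above
```

One property of the idiom worth keeping in mind: `||=` rebuilds whenever the stored value is nil or false, so it only caches builders that return a truthy object such as a Hash.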

#root_count ⇒ Object

Number of root elements (direct children of the StructTreeRoot for this page). Typically 1 (`<Document>`), but it can be arbitrarily high on odd PDFs (e.g. cu.pdf: 717 placeholders).



# File 'lib/rpdfium/structure/tree.rb', line 80

def root_count
  n = Raw.FPDF_StructTree_CountChildren(@handle)
  [n, 0].max
end

#roots ⇒ Object

Root elements (direct children of the StructTreeRoot). Typically 1 (`<Document>`).



# File 'lib/rpdfium/structure/tree.rb', line 87

def roots
  (0...root_count).filter_map do |i|
    h = Raw.FPDF_StructTree_GetChildAtIndex(@handle, i)
    h.null? ? nil : Element.new(self, h)
  end
end

#tables ⇒ Object

Returns all the elements of type “Table”. Convenient for semantic table extraction.



# File 'lib/rpdfium/structure/tree.rb', line 121

def tables
  find_all(type: "Table")
end

#to_s ⇒ Object (also known as: inspect)



# File 'lib/rpdfium/structure/tree.rb', line 134

def to_s
  "#<Rpdfium::Structure::Tree roots=#{root_count}#{empty? ? ' empty' : ''}>"
end

#walk(&block) ⇒ Object

Depth-first walk of ALL the elements of the tree. Visits the same elements, in the same order, as `roots.flat_map { |r| r.walk.to_a }`. Without a block it returns an Enumerator.



# File 'lib/rpdfium/structure/tree.rb', line 106

def walk(&block)
  return enum_for(:walk) unless block

  roots.each { |r| r.walk(&block) }
end
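
The traversal and the `enum_for` fallback can be tried out on a toy tree. In this sketch `SketchNode` and `SketchForest` are hypothetical stand-ins for `Element` and `Tree`. It assumes that `Element#walk` is pre-order (the element itself first, then its children); that ordering is an assumption, since the source of `Element#walk` is not shown here.

```ruby
# Stand-ins for illustration only; not the Rpdfium classes.
class SketchNode
  attr_reader :type, :children

  def initialize(type, children = [])
    @type = type
    @children = children
  end

  # Assumed pre-order depth-first: this node, then each child subtree.
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

class SketchForest
  def initialize(roots)
    @roots = roots
  end

  # Same shape as the method above: delegate to each root in turn.
  def walk(&block)
    return enum_for(:walk) unless block

    @roots.each { |r| r.walk(&block) }
  end

  # Case-sensitive type filter built on the blockless Enumerator.
  def find_all(type:)
    walk.select { |el| el.type == type }
  end
end

table  = SketchNode.new("Table", [SketchNode.new("TR", [SketchNode.new("TD")])])
forest = SketchForest.new([SketchNode.new("Document", [SketchNode.new("H1"), table])])

puts forest.walk.map(&:type).join(" ")     # types in visiting order
puts forest.find_all(type: "Table").size   # exact-case match
puts forest.find_all(type: "table").size   # lowercase does not match
```

Returning an Enumerator when no block is given is what lets `find_all` chain `select` directly onto `walk` without collecting the elements into an intermediate array first.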