Class: Rpdfium::Structure::Element

Inherits:
Object
Defined in:
lib/rpdfium/structure/element.rb

Overview

Element of a tagged PDF StructTree.

An Element represents a node of the document’s logical structure: `Document`, `P` (paragraph), `H1`..`H6` (headings), `Table`, `TR`, `TH`, `TD`, `Figure`, `Span`, `Lbl`, `LI`, `Caption`, etc. See PDF spec §14.8 for the complete taxonomy.

Elements have no independent lifetime: they belong to the Tree that produced them. When the Tree is closed, its elements become invalid. Do not call methods on an element after `tree.close`.

All methods are read-only: PDFium exposes no API to modify the StructTree (it is a “read-only” structure even in its public C API).
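
Because elements are only valid while their Tree is open, one option is to copy the data you need into plain Ruby objects before closing it. The following is a minimal sketch under stated assumptions: `StubElement` is a hypothetical stand-in for a real element (so the example runs without a PDF), and `snapshot` is an illustrative helper, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element,
# used so the sketch runs without opening a PDF.
StubElement = Struct.new(:type, :text, :children)

# Recursively copy the fields of interest into plain Hashes. The result
# holds only Ruby Strings, Hashes and Arrays, so it stays usable after
# the tree has been closed.
def snapshot(element)
  {
    type: element.type,
    text: element.text,
    children: element.children.map { |child| snapshot(child) }
  }
end

para = StubElement.new("P", "Hello", [])
root = StubElement.new("Document", "Hello", [para])
data = snapshot(root)
```

With a real tree, the same helper would be called on the root elements before `tree.close`, and only `data` would be kept afterwards.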


Constructor Details

#initialize(tree, handle) ⇒ Element

Returns a new instance of Element.



# File 'lib/rpdfium/structure/element.rb', line 22

def initialize(tree, handle)
  @tree = tree
  @handle = handle
end

Instance Attribute Details

#handle ⇒ Object (readonly)

Returns the value of attribute handle.



# File 'lib/rpdfium/structure/element.rb', line 20

def handle
  @handle
end

#tree ⇒ Object (readonly)

Returns the value of attribute tree.



# File 'lib/rpdfium/structure/element.rb', line 20

def tree
  @tree
end

Instance Method Details

#actual_text ⇒ Object

ActualText: replacement (“logical”) text for the element. It resolves ligatures (the page shows the single glyph `ﬁ` but actual_text says “fi”), math symbols (“∫” → “integral”), and abbreviations. When present, it takes precedence over the rendered text for accessibility and search.



# File 'lib/rpdfium/structure/element.rb', line 63

def actual_text
  read_utf16_string(:FPDF_StructElement_GetActualText)
end
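
The precedence rule described above can be sketched as follows. This is an illustration only: `StubElement`, its `glyph_text` field (standing for whatever text was extracted from the page) and `logical_text` are hypothetical names, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element;
# glyph_text stands for the text extracted from the rendered page.
StubElement = Struct.new(:actual_text, :glyph_text)

# Prefer ActualText when it is present and non-empty; otherwise fall
# back to the text taken from the page.
def logical_text(element)
  override = element.actual_text
  (override.nil? || override.empty?) ? element.glyph_text : override
end

ligature = StubElement.new("fi", "\uFB01") # the page shows the single ﬁ glyph
plain    = StubElement.new(nil, "file")    # no ActualText declared

puts logical_text(ligature)
puts logical_text(plain)
```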

#alt_text ⇒ Object

AltText: alternative text for Figure / Formula / images. PDF/UA requires every Figure to have a non-empty alt_text.



# File 'lib/rpdfium/structure/element.rb', line 69

def alt_text
  read_utf16_string(:FPDF_StructElement_GetAltText)
end
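
The PDF/UA requirement mentioned above suggests a simple check: list every `Figure` whose alt text is missing or blank. The sketch below assumes a hypothetical `StubElement` whose `walk` follows the same contract as the `walk` method documented on this page; `figures_missing_alt` is an illustrative helper, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:type, :alt_text, :children) do
  # Depth-first walk: self first, then the children (same contract as
  # the walk method documented on this page).
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Return all Figure elements in the sub-tree without usable alt text.
def figures_missing_alt(root)
  root.walk.select do |el|
    el.type == "Figure" && (el.alt_text.nil? || el.alt_text.strip.empty?)
  end
end

good = StubElement.new("Figure", "Sales chart", [])
bad  = StubElement.new("Figure", "", [])
doc  = StubElement.new("Document", nil, [good, bad])

missing = figures_missing_alt(doc) # contains only `bad`
```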

#attributes ⇒ Object

Structural PDF attributes. Returns a Hash { name => value } with all attributes declared on this element (RowSpan, ColSpan, Scope, Headers, BBox, etc.). Values are Ruby-native: Integer, Float, String, true/false, or Array for “Headers” attributes that contain lists of IDs.



# File 'lib/rpdfium/structure/element.rb', line 169

def attributes
  result = {}
  attr_count = Raw.FPDF_StructElement_GetAttributeCount(@handle)
  return result if attr_count <= 0

  (0...attr_count).each do |ai|
    attr = Raw.FPDF_StructElement_GetAttributeAtIndex(@handle, ai)
    next if attr.null?

    key_count = Raw.FPDF_StructElement_Attr_GetCount(attr)
    (0...key_count).each do |ki|
      name = read_attr_name(attr, ki)
      next if name.nil? || name.empty?

      value = read_attr_value(attr, name)
      result[name] = value unless value.nil?
    end
  end
  result
end
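
As an illustration of consuming the returned Hash, the sketch below reads the span of a table cell, defaulting to 1 when an attribute is absent. It assumes attribute names are String keys such as "RowSpan" and "ColSpan"; `StubElement` and `cell_span` are hypothetical names, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:type, :attributes)

# Read the row/column span of a cell. Values may arrive as Integer,
# Float or String, so they are normalised with Integer().
# Assumption: attribute names are String keys.
def cell_span(cell)
  attrs = cell.attributes
  {
    rows: Integer(attrs.fetch("RowSpan", 1)),
    cols: Integer(attrs.fetch("ColSpan", 1))
  }
end

wide_cell  = StubElement.new("TD", { "ColSpan" => 2 })
plain_cell = StubElement.new("TD", {})

span = cell_span(wide_cell)
```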

#children ⇒ Object

Direct children of the element, in the order declared in the PDF. In a well-tagged document this is the logical reading order (typically top-to-bottom, left-to-right).



# File 'lib/rpdfium/structure/element.rb', line 108

def children
  n = Raw.FPDF_StructElement_CountChildren(@handle)
  return [] if n <= 0

  (0...n).filter_map do |i|
    child_handle = Raw.FPDF_StructElement_GetChildAtIndex(@handle, i)
    child_handle.null? ? nil : Element.new(@tree, child_handle)
  end
end

#expansion ⇒ Object

Expansion text for abbreviations (e.g. an element of type “Span” with content “Dr.” and expansion “Doctor”). Used for text-to-speech.



# File 'lib/rpdfium/structure/element.rb', line 75

def expansion
  read_utf16_string(:FPDF_StructElement_GetExpansion)
end

#id ⇒ Object

Unique ID of the element, if one is declared (the /ID entry of the structure element, which the StructTreeRoot indexes through its /IDTree). Enables cross-element references (e.g. the Headers attribute of a TD cell pointing to a TH by id).



# File 'lib/rpdfium/structure/element.rb', line 49

def id
  read_utf16_string(:FPDF_StructElement_GetID)
end
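
The cross-reference use case above can be sketched by indexing a sub-tree by id and then resolving the Headers attribute of a cell. Everything named here is an assumption made for illustration: `StubElement`, `index_by_id`, `header_cells`, and the "Headers" String key.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:type, :id, :attributes, :children) do
  # Depth-first walk: self first, then the children.
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Build a Hash { id => element } for every element that declares an id.
def index_by_id(root)
  root.walk.each_with_object({}) do |el, index|
    index[el.id] = el unless el.id.nil? || el.id.empty?
  end
end

# Resolve the Headers attribute of a cell to the referenced elements,
# silently skipping ids that are not in the index.
def header_cells(cell, index)
  Array(cell.attributes["Headers"]).filter_map { |id| index[id] }
end

th    = StubElement.new("TH", "h-name", {}, [])
td    = StubElement.new("TD", nil, { "Headers" => ["h-name"] }, [])
table = StubElement.new("Table", nil, {}, [th, td])

headers = header_cells(td, index_by_id(table)) # contains only `th`
```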

#inspect ⇒ Object



# File 'lib/rpdfium/structure/element.rb', line 201

def inspect
  "#<Rpdfium::Structure::Element #{self}>"
end

#lang ⇒ Object

Language declared on the element (e.g. “it-IT”, “en-US”). Inherited from the parent if not overridden. Useful for language-aware pipelines.



# File 'lib/rpdfium/structure/element.rb', line 55

def lang
  read_utf16_string(:FPDF_StructElement_GetLang)
end
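
If a pipeline needs an effective language even when `lang` comes back nil or empty for a given element, one option is to walk up the `parent` chain until a value is found. A minimal sketch; `StubElement` and `effective_lang` are hypothetical names, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:lang, :parent)

# Return the nearest non-empty language, looking at the element itself
# and then at each ancestor; fall back to `default` if none declares one.
def effective_lang(element, default = nil)
  node = element
  while node
    return node.lang unless node.lang.nil? || node.lang.empty?

    node = node.parent
  end
  default
end

doc  = StubElement.new("en-US", nil)
para = StubElement.new(nil, doc)      # nothing declared: resolves via doc
span = StubElement.new("it-IT", para) # declares its own language

language = effective_lang(para)
```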

#leaves ⇒ Object

Leaves of the sub-tree (elements without children). These are the nodes that typically hold the direct MCID.



# File 'lib/rpdfium/structure/element.rb', line 138

def leaves
  return [self] if children.empty?

  children.flat_map(&:leaves)
end

#marked_content_ids ⇒ Object

Marked Content IDs linked to this element. An element typically has 1 MCID (e.g. a `<P>` holds all the paragraph text inside a BDC with mcid=N) or 0 (a pure structural element such as `<Document>`, `<Table>`, `<TR>`, whose MCIDs reside in the leaf children).

To link an MCID to the page text: read the page objects and group by `FPDFPageObj_GetMarkedContentID`. See `Element#text`.



# File 'lib/rpdfium/structure/element.rb', line 86

def marked_content_ids
  first = Raw.FPDF_StructElement_GetMarkedContentID(@handle)
  count = Raw.FPDF_StructElement_GetMarkedContentIdCount(@handle)
  # Cases: GetMarkedContentIdCount returns -1 when there are no direct
  # MCIDs (structural element). GetMarkedContentID returns -1 in the
  # same case.
  return [] if count <= 0 && first < 0

  # When a single MCID exists, GetMarkedContentIdCount may return
  # 0 or -1 while GetMarkedContentID provides the value. Coalesce:
  if count <= 0
    first >= 0 ? [first] : []
  else
    (0...count).filter_map do |i|
      mcid = Raw.FPDF_StructElement_GetMarkedContentIdAtIndex(@handle, i)
      mcid >= 0 ? mcid : nil
    end
  end
end
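
The grouping step described above (page objects grouped by their marked content ID) can be sketched as follows. `StubPageObject` is a hypothetical stand-in whose `mcid` plays the role of the value returned by `FPDFPageObj_GetMarkedContentID`, assumed here to be negative when the object carries no MCID; `group_text_by_mcid` is illustrative and not the library's implementation.

```ruby
# StubPageObject is a hypothetical stand-in for a page object: mcid is
# its marked content ID (assumed negative when it has none) and text is
# the text it contributes to the page.
StubPageObject = Struct.new(:mcid, :text)

# Build a Hash { mcid => text }, concatenating the text of all page
# objects that share the same MCID, in the order they are given.
def group_text_by_mcid(page_objects)
  page_objects.each_with_object({}) do |obj, map|
    next if obj.mcid.negative? # not part of any marked-content sequence

    (map[obj.mcid] ||= +"") << obj.text
  end
end

objects = [
  StubPageObject.new(0, "Hello "),
  StubPageObject.new(0, "world"),
  StubPageObject.new(-1, "page header"),
  StubPageObject.new(1, "Bye")
]

text_by_mcid = group_text_by_mcid(objects)
```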

#obj_type ⇒ Object

Type of the underlying PDF object: usually “StructElem”, but may be “MCR” (Marked Content Reference) or “OBJR” (Object Reference) for specialized nodes. Most callers want `type` instead.



# File 'lib/rpdfium/structure/element.rb', line 36

def obj_type
  read_utf16_string(:FPDF_StructElement_GetObjType)
end

#parent ⇒ Object

Parent. Nil for root elements (direct children of the StructTree).



# File 'lib/rpdfium/structure/element.rb', line 119

def parent
  h = Raw.FPDF_StructElement_GetParent(@handle)
  return nil if h.null?

  Element.new(@tree, h)
end
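
Since `parent` returns nil at the top of the tree, following it repeatedly yields the path from the root down to an element. A minimal sketch; `StubElement` and `structure_path` are hypothetical names, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:type, :parent)

# Follow parent links up to the root and render the path root-first.
def structure_path(element)
  path = []
  node = element
  while node
    path.unshift(node.type || "?")
    node = node.parent
  end
  path.join(" > ")
end

doc   = StubElement.new("Document", nil)
table = StubElement.new("Table", doc)
row   = StubElement.new("TR", table)
cell  = StubElement.new("TD", row)

path = structure_path(cell)
```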

#text ⇒ Object

Text of the element, reconstructed from the page via MCID. Resolution:

  1. If `actual_text` is present and non-empty, use it (handles ligatures/abbreviations).

  2. Otherwise collect all MCIDs of the sub-tree (this element plus, recursively, its children) and concatenate the text of the page objects with those MCIDs, in depth-first structure order.

For pure structural elements (`Table`, `TR`) the text is the concatenation of the text of all descendants, which is useful as a “summary”.



# File 'lib/rpdfium/structure/element.rb', line 152

def text
  return actual_text if actual_text && !actual_text.empty?

  # Collect MCIDs of the entire sub-tree depth-first
  all_mcids = []
  walk { |el| all_mcids.concat(el.marked_content_ids) }
  return "" if all_mcids.empty?

  mcid_map = @tree.send(:mcid_text_map)
  all_mcids.filter_map { |id| mcid_map[id] }.join
end
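
The two resolution steps can be illustrated without a PDF. In this sketch `StubElement` is a hypothetical stand-in, `resolve_text` is an illustrative helper, and the `mcid_map` Hash ({ mcid => text }) is supplied by hand rather than read from a page.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:actual_text, :marked_content_ids, :children) do
  # Depth-first walk: self first, then the children.
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Step 1: a non-empty ActualText wins.
# Step 2: otherwise gather the MCIDs of the whole sub-tree depth-first
# and join the text mapped to each one, skipping unknown MCIDs.
def resolve_text(element, mcid_map)
  override = element.actual_text
  return override if override && !override.empty?

  mcids = element.walk.flat_map(&:marked_content_ids)
  mcids.filter_map { |id| mcid_map[id] }.join
end

name_cell = StubElement.new(nil, [3], [])
age_cell  = StubElement.new(nil, [4], [])
row       = StubElement.new(nil, [], [name_cell, age_cell])
mcid_map  = { 3 => "Name", 4 => "Age" }

row_text = resolve_text(row, mcid_map)
```

The row itself has no MCIDs, so its text comes entirely from its cells, which is the “summary” behaviour described above.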

#title ⇒ Object

Title attribute (rare, used in some documents to give the element a descriptive name, e.g. “Chapter 1”).



# File 'lib/rpdfium/structure/element.rb', line 42

def title
  read_utf16_string(:FPDF_StructElement_GetTitle)
end

#to_s ⇒ Object



# File 'lib/rpdfium/structure/element.rb', line 190

def to_s
  parts = ["<#{type || obj_type || '?'}>"]
  mcids = marked_content_ids
  parts << "mcid=#{mcids.first}" if mcids.size == 1
  parts << "mcids=#{mcids.inspect}" if mcids.size > 1
  parts << "lang=#{lang.inspect}" if lang
  parts << "actual_text=#{actual_text.inspect[0, 30]}" if actual_text
  parts << "alt_text=#{alt_text.inspect[0, 30]}" if alt_text
  parts.join(" ")
end

#type ⇒ Object

Structural type of the element (e.g. “P”, “H1”, “Table”, “TR”, “TD”). Nil if PDFium cannot read it (placeholder element).



# File 'lib/rpdfium/structure/element.rb', line 29

def type
  read_utf16_string(:FPDF_StructElement_GetType)
end

#walk {|_self| ... } ⇒ Object

Depth-first walk of the entire sub-tree starting from this element. Visits self first, then recursively the children. Without a block returns an Enumerator.

Yields:

  • (_self)

Yield Parameters:



# File 'lib/rpdfium/structure/element.rb', line 129

def walk(&block)
  return enum_for(:walk) unless block

  yield self
  children.each { |c| c.walk(&block) }
end
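
Because `walk` returns an Enumerator when called without a block, it composes with the usual Enumerable methods. The sketch below collects the headings of a sub-tree; `StubElement` is a hypothetical stand-in that reuses the `walk` body shown above, and `headings` is an illustrative helper, not part of Rpdfium.

```ruby
# StubElement is a hypothetical stand-in for Rpdfium::Structure::Element.
StubElement = Struct.new(:type, :children) do
  # Same body as the walk method shown above.
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Select the H1..H6 elements of the sub-tree, in depth-first order.
def headings(root)
  root.walk.select { |el| el.type.to_s.match?(/\AH[1-6]\z/) }
end

section = StubElement.new("Sect", [
  StubElement.new("H2", []),
  StubElement.new("P", [])
])
doc = StubElement.new("Document", [
  StubElement.new("H1", []),
  StubElement.new("P", []),
  section
])

outline = headings(doc).map(&:type)
```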