Class: Rpdfium::Structure::Element
- Inherits: Object
- Defined in: lib/rpdfium/structure/element.rb
Overview
Element of a tagged PDF StructTree.
An Element represents a node of the document’s logical structure: `Document`, `P` (paragraph), `H1`..`H6` (headings), `Table`, `TR`, `TH`, `TD`, `Figure`, `Span`, `Lbl`, `LI`, `Caption`, etc. See PDF spec §14.8 for the complete taxonomy.
Elements have no independent lifetime: they belong to the Tree that produced them. When the Tree is closed, the elements become invalid. Do not call methods on an element after `tree.close`.
All methods are read-only: PDFium exposes no API to modify the StructTree (it is a “read-only” structure even in its public C API).
Instance Attribute Summary
- #handle ⇒ Object (readonly)
  Returns the value of attribute handle.
- #tree ⇒ Object (readonly)
  Returns the value of attribute tree.
Instance Method Summary
- #actual_text ⇒ Object
  ActualText: override of the “logical” text for the element.
- #alt_text ⇒ Object
  AltText: alternative text for Figure / Formula / images.
- #attributes ⇒ Object
  Structural PDF attributes.
- #children ⇒ Object
  Direct children of the element.
- #expansion ⇒ Object
  Expansion text for abbreviations (e.g. an element of type “Span” with content “Dr.” and expansion “Doctor”).
- #id ⇒ Object
  Unique ID of the element (if declared in the /ID entry of the structure element).
- #initialize(tree, handle) ⇒ Element (constructor)
  A new instance of Element.
- #inspect ⇒ Object
- #lang ⇒ Object
  Language declared on the element (e.g. “it-IT”, “en-US”).
- #leaves ⇒ Object
  Leaves of the sub-tree (elements without children).
- #marked_content_ids ⇒ Object
  Marked Content IDs linked to this element.
- #obj_type ⇒ Object
  Type of the underlying PDF object: usually “StructElem”, but may be “MCR” (Marked Content Reference) or “OBJR” (Object Reference) for specialized nodes.
- #parent ⇒ Object
  Parent.
- #text ⇒ Object
  Text of the element, reconstructed from the page via MCID.
- #title ⇒ Object
  Title attribute (rare, used in some documents to give the element a descriptive name, e.g. “Capitolo 1”).
- #to_s ⇒ Object
- #type ⇒ Object
  Structural type of the element (e.g. “P”, “H1”, “Table”, “TR”, “TD”).
- #walk {|_self| ... } ⇒ Object
  Depth-first walk of the entire sub-tree starting from this element.
Constructor Details
#initialize(tree, handle) ⇒ Element
Returns a new instance of Element.
# File 'lib/rpdfium/structure/element.rb', line 22
def initialize(tree, handle)
  @tree = tree
  @handle = handle
end
Instance Attribute Details
#handle ⇒ Object (readonly)
Returns the value of attribute handle.
# File 'lib/rpdfium/structure/element.rb', line 20
def handle
  @handle
end
#tree ⇒ Object (readonly)
Returns the value of attribute tree.
# File 'lib/rpdfium/structure/element.rb', line 20
def tree
  @tree
end
Instance Method Details
#actual_text ⇒ Object
ActualText: override of the “logical” text for the element. Resolves ligatures (the PDF shows the single glyph `ﬁ` but actual_text says “fi”), math symbols (“∫” → “integral”), and abbreviations. When present, it takes precedence over the graphical text for accessibility and search.
# File 'lib/rpdfium/structure/element.rb', line 63
def actual_text
  read_utf16_string(:FPDF_StructElement_GetActualText)
end
#alt_text ⇒ Object
AltText: alternative text for Figure / Formula / images. PDF/UA requires every Figure to have a non-empty alt_text.
# File 'lib/rpdfium/structure/element.rb', line 69
def alt_text
  read_utf16_string(:FPDF_StructElement_GetAltText)
end
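As a rough illustration of the PDF/UA rule above, the sketch below collects the Figure elements whose alt text is missing or empty. `FakeFigureNode` and `figures_missing_alt` are hypothetical names, not part of Rpdfium: the stand-in only mimics `type`, `alt_text`, `children` and `walk`, on the assumption that a real Element behaves the same way for these readers.

```ruby
# Hypothetical stand-in mimicking the Element readers used in this sketch.
FakeFigureNode = Struct.new(:type, :alt_text, :children) do
  # Same contract as Element#walk: self first, then the children, depth-first.
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Returns every Figure in the sub-tree whose alt text is nil or empty.
def figures_missing_alt(root)
  root.walk.select { |el| el.type == "Figure" && el.alt_text.to_s.empty? }
end

doc = FakeFigureNode.new("Document", nil, [
  FakeFigureNode.new("Figure", "Sales chart, 2023", []),
  FakeFigureNode.new("P", nil, [FakeFigureNode.new("Figure", nil, [])]),
  FakeFigureNode.new("Figure", "", [])
])

missing = figures_missing_alt(doc)
puts "#{missing.size} figure(s) without alt text"
```

Treating `nil` and the empty string alike is a choice of this sketch; a stricter audit might also reject whitespace-only values.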
#attributes ⇒ Object
Structural PDF attributes. Returns a Hash { name => value } with all attributes declared on this element (RowSpan, ColSpan, Scope, Headers, BBox, etc.). Values are Ruby-native: Integer, Float, String, true/false, or Array for “Headers” attributes that contain lists of IDs.
# File 'lib/rpdfium/structure/element.rb', line 169
def attributes
  result = {}
  attr_count = Raw.FPDF_StructElement_GetAttributeCount(@handle)
  return result if attr_count <= 0

  (0...attr_count).each do |ai|
    attr = Raw.FPDF_StructElement_GetAttributeAtIndex(@handle, ai)
    next if attr.null?

    key_count = Raw.FPDF_StructElement_Attr_GetCount(attr)
    (0...key_count).each do |ki|
      name = read_attr_name(attr, ki)
      next if name.nil? || name.empty?

      value = read_attr_value(attr, name)
      result[name] = value unless value.nil?
    end
  end
  result
end
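One possible use of the returned Hash is computing the logical width of a table row from its `ColSpan` values. This is a minimal sketch, assuming attribute names are String keys and `ColSpan` is an Integer as described above; `FakeCell` and `column_count` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in: only `type`, `attributes` and `children` are mimicked.
FakeCell = Struct.new(:type, :attributes, :children)

# Logical column count of a TR: each TH/TD spans ColSpan columns (default 1).
def column_count(row)
  row.children
     .select { |c| %w[TH TD].include?(c.type) }
     .sum { |c| c.attributes.fetch("ColSpan", 1).to_i }
end

row = FakeCell.new("TR", {}, [
  FakeCell.new("TH", { "Scope" => "Row" }, []),
  FakeCell.new("TD", { "ColSpan" => 2 }, []),
  FakeCell.new("TD", {}, [])
])

puts column_count(row)
```

Using `fetch` with a default keeps cells that declare no `ColSpan` at a width of one column.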
#children ⇒ Object
Direct children of the element, in the order declared in the PDF. This is the logical reading order (typically top-to-bottom, left-to-right).
# File 'lib/rpdfium/structure/element.rb', line 108
def children
  n = Raw.FPDF_StructElement_CountChildren(@handle)
  return [] if n <= 0

  (0...n).filter_map do |i|
    child_handle = Raw.FPDF_StructElement_GetChildAtIndex(@handle, i)
    child_handle.null? ? nil : Element.new(@tree, child_handle)
  end
end
#expansion ⇒ Object
Expansion text for abbreviations (e.g. an element of type “Span” with content “Dr.” and expansion “Doctor”). Used for text-to-speech.
# File 'lib/rpdfium/structure/element.rb', line 75
def expansion
  read_utf16_string(:FPDF_StructElement_GetExpansion)
end
#id ⇒ Object
Unique ID of the element (if declared in the /ID entry of the structure element). Enables cross-element references (e.g. the Headers attribute of a TD cell pointing to a TH by id).
# File 'lib/rpdfium/structure/element.rb', line 49
def id
  read_utf16_string(:FPDF_StructElement_GetID)
end
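The cross-reference described above (a TD pointing at its TH cells through `Headers`) could be resolved along these lines. This is only a sketch: `FakeIdNode`, `index_by_id` and `header_cells` are hypothetical, and it assumes `Headers` arrives as an Array of id Strings under the String key "Headers".

```ruby
# Hypothetical stand-in mimicking `type`, `id`, `attributes`, `children`, `walk`.
FakeIdNode = Struct.new(:type, :id, :attributes, :children) do
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Builds { id => element } for every element in the sub-tree that declares an id.
def index_by_id(root)
  root.walk.each_with_object({}) do |el, index|
    index[el.id] = el unless el.id.nil? || el.id.empty?
  end
end

# Header cells referenced by a cell through its "Headers" attribute.
# Ids that do not resolve to an element are silently skipped.
def header_cells(cell, index)
  Array(cell.attributes["Headers"]).filter_map { |hid| index[hid] }
end

td = FakeIdNode.new("TD", nil, { "Headers" => %w[h1 h2] }, [])
table = FakeIdNode.new("Table", nil, {}, [
  FakeIdNode.new("TR", nil, {}, [
    FakeIdNode.new("TH", "h1", {}, []),
    FakeIdNode.new("TH", "h2", {}, [])
  ]),
  FakeIdNode.new("TR", nil, {}, [td])
])

index = index_by_id(table)
puts header_cells(td, index).map(&:id).inspect
```

Building the index once per tree avoids walking the whole structure again for every cell.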
#inspect ⇒ Object
# File 'lib/rpdfium/structure/element.rb', line 201
def inspect
  "#<Rpdfium::Structure::Element #{self}>"
end
#lang ⇒ Object
Language declared on the element (e.g. “it-IT”, “en-US”). Inherited from the parent if not overridden. Useful for language-aware pipelines.
# File 'lib/rpdfium/structure/element.rb', line 55
def lang
  read_utf16_string(:FPDF_StructElement_GetLang)
end
#leaves ⇒ Object
Leaves of the sub-tree (elements without children). These are the nodes that typically hold the direct MCID.
# File 'lib/rpdfium/structure/element.rb', line 138
def leaves
  return [self] if children.empty?

  children.flat_map(&:leaves)
end
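To make the recursion concrete, here is a small sketch that runs the same `leaves` logic on a hypothetical stand-in (`FakeLeafNode` is not part of Rpdfium and only mimics `type` and `children`).

```ruby
# Hypothetical stand-in carrying the same recursion as Element#leaves.
FakeLeafNode = Struct.new(:type, :children) do
  def leaves
    return [self] if children.empty?

    children.flat_map(&:leaves)
  end
end

table = FakeLeafNode.new("Table", [
  FakeLeafNode.new("TR", [
    FakeLeafNode.new("TH", []),
    FakeLeafNode.new("TD", [])
  ]),
  FakeLeafNode.new("Caption", [])
])

leaf_types = table.leaves.map(&:type)
puts leaf_types.inspect
```

The leaves come back in document order, and an element with no children is its own only leaf.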
#marked_content_ids ⇒ Object
Marked Content IDs linked to this element. An element typically has 1 MCID (e.g. a `<P>` holds all the paragraph text inside a BDC with mcid=N) or 0 (a pure structural element: `<Document>`, `<Table>`, `<TR>`; their MCIDs reside in the leaf children).
To link an MCID to the page text: read the page objects and group by `FPDFPageObj_GetMarkedContentID`. See `Element#text`.
# File 'lib/rpdfium/structure/element.rb', line 86
def marked_content_ids
  first = Raw.FPDF_StructElement_GetMarkedContentID(@handle)
  count = Raw.FPDF_StructElement_GetMarkedContentIdCount(@handle)
  # Cases: GetMarkedContentIdCount returns -1 when there are no direct
  # MCIDs (structural element). GetMarkedContentID returns -1 in the
  # same case.
  return [] if count <= 0 && first < 0

  # When a single MCID exists, GetMarkedContentIdCount may return
  # 0 or -1 while GetMarkedContentID provides the value. Coalesce:
  if count <= 0
    first >= 0 ? [first] : []
  else
    (0...count).filter_map do |i|
      mcid = Raw.FPDF_StructElement_GetMarkedContentIdAtIndex(@handle, i)
      mcid >= 0 ? mcid : nil
    end
  end
end
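The coalescing rule in the method above can be isolated as a pure function, which makes the three cases easy to check without PDFium. This is a sketch only: `coalesce_mcids` is a hypothetical helper, and the `at_index` callable stands in for the per-index PDFium lookup.

```ruby
# first:    what the single-MCID call would return (-1 when there is none)
# count:    what the MCID-count call would return (<= 0 when there is none)
# at_index: callable standing in for the per-index MCID lookup
def coalesce_mcids(first, count, at_index)
  # No usable count: fall back to the single value, if there is one.
  return (first >= 0 ? [first] : []) if count <= 0

  # Usable count: read each index and drop negative (invalid) entries.
  (0...count).filter_map do |i|
    mcid = at_index.call(i)
    mcid if mcid >= 0
  end
end

none = ->(_i) { -1 }
puts coalesce_mcids(-1, -1, none).inspect
puts coalesce_mcids(7, 0, none).inspect
puts coalesce_mcids(3, 3, ->(i) { [3, -1, 5][i] }).inspect
```

The three calls cover a pure structural element, a single MCID reported only by the first-value call, and a multi-MCID element with one invalid slot.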
#obj_type ⇒ Object
Type of the underlying PDF object: usually “StructElem”, but may be “MCR” (Marked Content Reference) or “OBJR” (Object Reference) for specialized nodes. Most callers want `type` instead.
# File 'lib/rpdfium/structure/element.rb', line 36
def obj_type
  read_utf16_string(:FPDF_StructElement_GetObjType)
end
#parent ⇒ Object
Parent element, or nil for root elements (direct children of the StructTree).
# File 'lib/rpdfium/structure/element.rb', line 119
def parent
  h = Raw.FPDF_StructElement_GetParent(@handle)
  return nil if h.null?

  Element.new(@tree, h)
end
#text ⇒ Object
Text of the element, reconstructed from the page via MCID. Resolution:
- If `actual_text` is present, use it (handles ligatures/abbreviations).
- Otherwise collect all MCIDs of the sub-tree (this element + recursively the children) and concatenate the text of the page objects with those MCIDs, in document order.
For pure structural elements (`Table`, `TR`) the text is the concatenation of all descendants, which is useful as a “summary”.
# File 'lib/rpdfium/structure/element.rb', line 152
def text
  return actual_text if actual_text && !actual_text.empty?

  # Collect MCIDs of the entire sub-tree depth-first
  all_mcids = []
  walk { |el| all_mcids.concat(el.marked_content_ids) }
  return "" if all_mcids.empty?

  mcid_map = @tree.send(:mcid_text_map)
  all_mcids.filter_map { |id| mcid_map[id] }.join
end
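The resolution order can be traced on a toy tree. In this sketch `FakeTextNode`, `resolve_text` and the literal `mcid_map` are hypothetical: in the real class the MCID-to-text map comes from the Tree, whereas here it is a plain Hash passed in by hand.

```ruby
# Hypothetical stand-in: `mcids` plays the role of Element#marked_content_ids.
FakeTextNode = Struct.new(:actual_text, :mcids, :children) do
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

# Mirrors the two-step resolution: ActualText wins, otherwise the text of
# every MCID in the sub-tree is concatenated in depth-first order.
def resolve_text(element, mcid_map)
  return element.actual_text if element.actual_text && !element.actual_text.empty?

  element.walk.flat_map(&:mcids).filter_map { |id| mcid_map[id] }.join
end

mcid_map = { 0 => "Total ", 1 => "42", 2 => "Dr." }

row = FakeTextNode.new(nil, [], [
  FakeTextNode.new(nil, [0], []),
  FakeTextNode.new(nil, [1], [])
])
abbreviation = FakeTextNode.new("Doctor", [2], [])

puts resolve_text(row, mcid_map)
puts resolve_text(abbreviation, mcid_map)
```

The row has no MCIDs of its own, so its text is assembled from its cells, while the abbreviation never consults the map because its ActualText is set.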
#title ⇒ Object
Title attribute (rare, used in some documents to give the element a descriptive name, e.g. “Capitolo 1”).
# File 'lib/rpdfium/structure/element.rb', line 42
def title
  read_utf16_string(:FPDF_StructElement_GetTitle)
end
#to_s ⇒ Object
# File 'lib/rpdfium/structure/element.rb', line 190
def to_s
  parts = ["<#{type || obj_type || '?'}>"]
  mcids = marked_content_ids
  parts << "mcid=#{mcids.first}" if mcids.size == 1
  parts << "mcids=#{mcids.inspect}" if mcids.size > 1
  parts << "lang=#{lang.inspect}" if lang
  parts << "actual_text=#{actual_text.inspect[0, 30]}" if actual_text
  parts << "alt_text=#{alt_text.inspect[0, 30]}" if alt_text
  parts.join(" ")
end
#type ⇒ Object
Structural type of the element (e.g. “P”, “H1”, “Table”, “TR”, “TD”). Nil if PDFium cannot read it (placeholder element).
# File 'lib/rpdfium/structure/element.rb', line 29
def type
  read_utf16_string(:FPDF_StructElement_GetType)
end
#walk {|_self| ... } ⇒ Object
Depth-first walk of the entire sub-tree starting from this element. Visits self first, then recursively the children (pre-order). Without a block, returns an Enumerator.
# File 'lib/rpdfium/structure/element.rb', line 129
def walk(&block)
  return enum_for(:walk) unless block

  yield self
  children.each { |c| c.walk(&block) }
end
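Because `walk` returns an Enumerator when called without a block, the usual Enumerable methods can be chained on it. The sketch below shows this with a hypothetical stand-in (`FakeWalkNode` and its `label` member are illustrative only) that carries the same `walk` body.

```ruby
# Hypothetical stand-in with the same walk contract as Element#walk.
FakeWalkNode = Struct.new(:type, :label, :children) do
  def walk(&block)
    return enum_for(:walk) unless block

    yield self
    children.each { |c| c.walk(&block) }
  end
end

doc = FakeWalkNode.new("Document", nil, [
  FakeWalkNode.new("H1", "Intro", []),
  FakeWalkNode.new("P", "Some text", []),
  FakeWalkNode.new("H2", "Details", []),
  FakeWalkNode.new("Table", nil, [])
])

# Block form: visit every node in pre-order.
visited = []
doc.walk { |el| visited << el.type }

# Enumerator form: filter headings, or stop at the first match.
headings = doc.walk.select { |el| el.type.to_s.match?(/\AH[1-6]\z/) }
first_table = doc.walk.find { |el| el.type == "Table" }

puts visited.inspect
puts headings.map(&:label).inspect
```

The `to_s` guard is there because the `type` reader is documented as possibly returning nil for placeholder elements.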