Class: Uniword::Visitor::TextExtractor
- Inherits: BaseVisitor
  - Object
  - BaseVisitor
  - Uniword::Visitor::TextExtractor
- Defined in: lib/uniword/visitor/text_extractor.rb
Overview
Concrete visitor that extracts all text content from a document.
This visitor demonstrates the visitor pattern by traversing the document structure and collecting text from all text-containing elements.
Instance Method Summary
- #initialize(separator: "\n") ⇒ TextExtractor (constructor)
  Initialize a new text extractor.
- #text ⇒ String
  The extracted text joined by separator.
- #visit_document(document) ⇒ void
  Visit a document and extract text from all its elements.
- #visit_image(image) ⇒ void
  Visit an image element.
- #visit_paragraph(paragraph) ⇒ void
  Visit a paragraph and extract its text content.
- #visit_run(run) ⇒ void
  Visit a run and extract its text content.
- #visit_table(table) ⇒ void
  Visit a table and extract text from all its cells.
- #visit_table_cell(table_cell) ⇒ void
  Visit a table cell and extract its text content.
- #visit_table_row(table_row) ⇒ void
  Visit a table row and extract text from all its cells.
Constructor Details
#initialize(separator: "\n") ⇒ TextExtractor
Initialize a new text extractor.
    # File 'lib/uniword/visitor/text_extractor.rb', line 23

    def initialize(separator: "\n")
      super()
      @separator = separator
      @text_parts = []
    end
Instance Method Details
#text ⇒ String
Returns the extracted text joined by separator.
    # File 'lib/uniword/visitor/text_extractor.rb', line 30

    def text
      @text_parts.join(@separator)
    end
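As a rough illustration of how the separator shapes the result, the sketch below mirrors only the constructor and `#text` from the listings above in a stand-in class. `SeparatorSketch` and its `add` helper are hypothetical and not part of Uniword; `add` simply stands in for the `visit_*` methods that fill `@text_parts`.

```ruby
# Hypothetical stand-in mirroring only #initialize and #text from the listing.
class SeparatorSketch
  def initialize(separator: "\n")
    @separator = separator
    @text_parts = []
  end

  # Hypothetical helper: plays the role of the visit_* methods,
  # which append one entry per top-level element.
  def add(part)
    @text_parts << part
    self
  end

  def text
    @text_parts.join(@separator)
  end
end

default = SeparatorSketch.new.add("First paragraph").add("Second paragraph")
spaced  = SeparatorSketch.new(separator: " ").add("First paragraph").add("Second paragraph")

puts default.text # parts separated by a newline (the default)
puts spaced.text  # parts separated by a single space
```

The separator is applied only between top-level entries; joins inside a paragraph, cell, or row are fixed by the respective `visit_*` methods.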
#visit_document(document) ⇒ void
This method returns an undefined value.
Visit a document and extract text from all its elements.
    # File 'lib/uniword/visitor/text_extractor.rb', line 38

    def visit_document(document)
      # Visit both paragraphs and tables from body
      if document.body.respond_to?(:elements)
        document.body.elements.each do |element|
          element.accept(self) if element.respond_to?(:accept)
        end
      else
        # Legacy support: visit paragraphs only
        document.elements.each do |element|
          element.accept(self) if element.respond_to?(:accept)
        end
      end
    end
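The branching in `visit_document` can be sketched with stand-in objects. All class and struct names below are hypothetical, and the two branches are condensed into one ternary for brevity; the duck-typing checks (`respond_to?(:elements)` and `respond_to?(:accept)`) are the ones from the listing.

```ruby
# Stand-in element that accepts a visitor (hypothetical, not a Uniword class).
class SketchParagraph
  attr_reader :content

  def initialize(content)
    @content = content
  end

  def accept(visitor)
    visitor.visit_paragraph(self)
  end
end

# Stand-in containers: a body exposing #elements, and two document shapes.
BodyWithElements = Struct.new(:elements)
ModernDocument   = Struct.new(:body)
LegacyDocument   = Struct.new(:body, :elements)

class TraversalSketch
  attr_reader :seen

  def initialize
    @seen = []
  end

  # Prefer body.elements; fall back to document.elements otherwise.
  def visit_document(document)
    source = document.body.respond_to?(:elements) ? document.body : document
    source.elements.each do |element|
      # Elements that cannot accept a visitor are skipped silently.
      element.accept(self) if element.respond_to?(:accept)
    end
  end

  def visit_paragraph(paragraph)
    @seen << paragraph.content
  end
end

modern = ModernDocument.new(
  BodyWithElements.new([SketchParagraph.new("from body"), Object.new])
)
legacy = LegacyDocument.new(nil, [SketchParagraph.new("from document")])

visitor = TraversalSketch.new
visitor.visit_document(modern)
visitor.visit_document(legacy)
p visitor.seen
```

In this sketch the plain `Object.new` element is ignored because it does not respond to `accept`, and the legacy document is handled through the fallback branch because its `body` is `nil`.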
#visit_image(image) ⇒ void
This method returns an undefined value.
Visit an image element. Images don’t contain text, so this is a no-op.
    # File 'lib/uniword/visitor/text_extractor.rb', line 114

    def visit_image(image)
      # Images don't contain text, intentionally empty
    end
#visit_paragraph(paragraph) ⇒ void
This method returns an undefined value.
Visit a paragraph and extract its text content.
    # File 'lib/uniword/visitor/text_extractor.rb', line 56

    def visit_paragraph(paragraph)
      paragraph_text = paragraph.runs.filter_map do |run|
        run.accept(self)
        @text_parts.pop # Get the last added text
      end.join

      @text_parts << paragraph_text unless paragraph_text.empty?
    end
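The push-then-pop idiom in `visit_paragraph` relies on every `run.accept(self)` call pushing exactly one entry, while `visit_run` pushes nothing when `run.text` is `nil`. The sketch below reproduces both methods with stand-in structs to show what that combination does, and contrasts it with a guarded variant. `SketchRun`, `SketchPara`, `PopSketch`, `GuardedPopSketch`, and the `take_pushed` helper are all hypothetical, and whether real Uniword runs can report `nil` text is an assumption that this page does not confirm.

```ruby
# Stand-ins for run and paragraph objects (hypothetical, not Uniword classes).
SketchRun = Struct.new(:text) do
  def accept(visitor)
    visitor.visit_run(self)
  end
end

SketchPara = Struct.new(:runs) do
  def accept(visitor)
    visitor.visit_paragraph(self)
  end
end

# Mirrors visit_paragraph and visit_run from the listings.
class PopSketch
  def initialize(separator: "\n")
    @separator = separator
    @text_parts = []
  end

  def text
    @text_parts.join(@separator)
  end

  def visit_paragraph(paragraph)
    paragraph_text = paragraph.runs.filter_map do |run|
      take_pushed { run.accept(self) }
    end.join
    @text_parts << paragraph_text unless paragraph_text.empty?
  end

  def visit_run(run)
    @text_parts << run.text if run.text
  end

  private

  # As in the listing: pop unconditionally after the visit.
  def take_pushed
    yield
    @text_parts.pop
  end
end

# Hypothetical variant: pop only when the visit actually pushed something.
class GuardedPopSketch < PopSketch
  private

  def take_pushed
    size_before = @text_parts.size
    yield
    @text_parts.pop if @text_parts.size > size_before
  end
end

paragraphs = [
  SketchPara.new([SketchRun.new("Hello, "), SketchRun.new("world")]),
  SketchPara.new([SketchRun.new(nil), SketchRun.new("Next")])
]

plain = PopSketch.new
paragraphs.each { |para| para.accept(plain) }

guarded = GuardedPopSketch.new
paragraphs.each { |para| para.accept(guarded) }

p plain.text
p guarded.text
```

In the unguarded sketch, the `nil` run makes `pop` take the previous paragraph's entry, so the two paragraphs merge and the separator between them is lost. The guarded variant leaves earlier entries alone. The same consideration applies to `visit_table_cell`, where an empty paragraph pushes nothing.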
#visit_run(run) ⇒ void
This method returns an undefined value.
Visit a run and extract its text content.
    # File 'lib/uniword/visitor/text_extractor.rb', line 105

    def visit_run(run)
      @text_parts << run.text if run.text
    end
#visit_table(table) ⇒ void
This method returns an undefined value.
Visit a table and extract text from all its cells.
    # File 'lib/uniword/visitor/text_extractor.rb', line 69

    def visit_table(table)
      table.rows.each do |row|
        row.accept(self)
      end
    end
#visit_table_cell(table_cell) ⇒ void
This method returns an undefined value.
Visit a table cell and extract its text content.
    # File 'lib/uniword/visitor/text_extractor.rb', line 92

    def visit_table_cell(table_cell)
      cell_text = table_cell.paragraphs.filter_map do |paragraph|
        paragraph.accept(self)
        @text_parts.pop # Get the last added text
      end.join

      @text_parts << cell_text
    end
#visit_table_row(table_row) ⇒ void
This method returns an undefined value.
Visit a table row and extract text from all its cells.
    # File 'lib/uniword/visitor/text_extractor.rb', line 79

    def visit_table_row(table_row)
      row_text = table_row.cells.filter_map do |cell|
        cell.accept(self)
        @text_parts.pop # Get the last added text
      end.join(" | ")

      @text_parts << row_text unless row_text.empty?
    end
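Taken together, the table methods produce one entry per row: paragraph texts inside a cell are joined with no separator, cells are joined with `" | "`, and rows whose joined text is empty are dropped. The sketch below imitates only that output shape using nested arrays as stand-in data (table, then rows, then cells, then paragraph texts); it does not call Uniword and does not model the push-then-pop mechanics.

```ruby
# Nested arrays stand in for table -> rows -> cells -> paragraph texts.
table = [
  [["Item"], ["Note"]],
  [["Apple"], ["Fresh", "Local"]], # second cell holds two paragraphs
  [["Pear"], []]                   # second cell holds no paragraphs
]

row_lines = table.map do |cells|
  # Paragraphs in a cell are concatenated directly; cells are joined with " | ".
  cells.map { |paragraph_texts| paragraph_texts.join }.join(" | ")
end.reject(&:empty?) # a row with empty joined text is not recorded

puts row_lines.join("\n") # rows are then joined by the extractor's separator
```

An empty cell still occupies its position in the row, so column alignment survives, while a multi-paragraph cell loses the boundary between its paragraphs.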