Class: Uniword::Visitor::TextExtractor

Inherits:
BaseVisitor
Defined in:
lib/uniword/visitor/text_extractor.rb

Overview

Concrete visitor that extracts all text content from a document.

This visitor demonstrates the visitor pattern by traversing the document structure and collecting text from all text-containing elements.

Examples:

Extract text from a document

extractor = Uniword::Visitor::TextExtractor.new
document.accept(extractor)
text = extractor.text

Extract text with a custom separator

extractor = Uniword::Visitor::TextExtractor.new(separator: "\n\n")
document.accept(extractor)
text = extractor.text
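
The double dispatch behind these examples can be sketched without Uniword loaded. The classes below (SketchRun, SketchParagraph, SketchDocument, SketchTextExtractor) are hypothetical stand-ins, not the library's model classes, and the extractor is deliberately simplified: each node's accept calls back the matching visit_* method, and the visitor accumulates one entry per non-empty paragraph.

```ruby
# Stand-in node classes (hypothetical, not Uniword's real model classes).
# Each node implements accept, which calls back the matching visit_* method.
SketchRun = Struct.new(:text) do
  def accept(visitor)
    visitor.visit_run(self)
  end
end

SketchParagraph = Struct.new(:runs) do
  def accept(visitor)
    visitor.visit_paragraph(self)
  end
end

SketchDocument = Struct.new(:paragraphs) do
  def accept(visitor)
    visitor.visit_document(self)
  end
end

# A simplified extractor: one entry per non-empty paragraph.
class SketchTextExtractor
  def initialize(separator: "\n")
    @separator = separator
    @text_parts = []
  end

  def text
    @text_parts.join(@separator)
  end

  def visit_document(document)
    document.paragraphs.each { |paragraph| paragraph.accept(self) }
  end

  def visit_paragraph(paragraph)
    @current = +""
    paragraph.runs.each { |run| run.accept(self) }
    @text_parts << @current unless @current.empty?
  end

  def visit_run(run)
    @current << run.text if run.text
  end
end

sketch_document = SketchDocument.new([
  SketchParagraph.new([SketchRun.new("Hello, "), SketchRun.new("world")]),
  SketchParagraph.new([SketchRun.new("Second paragraph")])
])

sketch_extractor = SketchTextExtractor.new(separator: "\n\n")
sketch_document.accept(sketch_extractor)
puts sketch_extractor.text
```

The nodes never inspect the visitor's type and the visitor never inspects node types; adding a new operation means adding a new visitor class rather than touching the nodes.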

Constructor Details

#initialize(separator: "\n") ⇒ TextExtractor

Initialize a new text extractor.

Parameters:

  • separator (String) (defaults to: "\n")

    the separator to use between text elements



# File 'lib/uniword/visitor/text_extractor.rb', line 23

def initialize(separator: "\n")
  super()
  @separator = separator
  @text_parts = []
end

Instance Method Details

#text ⇒ String

Returns the extracted text joined by separator.

Returns:

  • (String)

    the extracted text joined by separator



# File 'lib/uniword/visitor/text_extractor.rb', line 30

def text
  @text_parts.join(@separator)
end
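
As a small illustration of how the separator shapes the result, the following uses a plain array in place of the extractor's internal buffer (the variable names are illustrative only):

```ruby
# A plain array standing in for the collected text parts.
text_parts = ["First paragraph", "Second paragraph"]

single_spaced = text_parts.join("\n")
double_spaced = text_parts.join("\n\n")

# Joining an empty array yields an empty string, so an extractor that
# visited nothing would return "" rather than nil.
nothing_visited = [].join("\n")
```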

#visit_document(document) ⇒ void

This method returns an undefined value.

Visit a document and extract text from all its elements.

Parameters:

  • document (Document)

    The document to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 38

def visit_document(document)
  # Visit both paragraphs and tables from body
  if document.body.respond_to?(:elements)
    document.body.elements.each do |element|
      element.accept(self) if element.respond_to?(:accept)
    end
  else
    # Legacy support: visit paragraphs only
    document.elements.each do |element|
      element.accept(self) if element.respond_to?(:accept)
    end
  end
end
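
The branching above can be exercised in isolation. In this sketch, FallbackElement, FallbackBody, ModernDocument, LegacyDocument and each_visitable are hypothetical stand-ins (Uniword's real classes differ), and a plain array plays the visitor. It shows that elements are read from the body when the body exposes them, from the document otherwise, and that objects without accept are skipped.

```ruby
# Hypothetical stand-ins; Uniword's real Document and body classes differ.
FallbackElement = Struct.new(:name) do
  def accept(visitor)
    visitor << name
  end
end

FallbackBody = Struct.new(:elements)
ModernDocument = Struct.new(:body)
LegacyDocument = Struct.new(:body, :elements)

# Same branching as visit_document: prefer body.elements, else document.elements.
def each_visitable(document, visitor)
  source = document.body.respond_to?(:elements) ? document.body : document
  source.elements.each do |element|
    element.accept(visitor) if element.respond_to?(:accept)
  end
end

# The string element has no accept method, so it is skipped.
modern_seen = []
each_visitable(
  ModernDocument.new(FallbackBody.new([FallbackElement.new("paragraph"), "not visitable"])),
  modern_seen
)

# A nil body does not respond to elements, so the legacy branch is taken.
legacy_seen = []
each_visitable(LegacyDocument.new(nil, [FallbackElement.new("paragraph")]), legacy_seen)
```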

#visit_image(image) ⇒ void

This method returns an undefined value.

Visit an image element. Images don’t contain text, so this is a no-op.

Parameters:

  • image (Image)

    The image to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 114

def visit_image(image)
  # Images don't contain text, intentionally empty
end

#visit_paragraph(paragraph) ⇒ void

This method returns an undefined value.

Visit a paragraph and extract its text content.

Parameters:

  • paragraph (Paragraph)

    The paragraph to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 56

def visit_paragraph(paragraph)
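  paragraph_text = paragraph.runs.filter_map do |run|
    size_before = @text_parts.length
    run.accept(self)
    # Pop only what this run added; a run without text adds nothing,
    # and an unguarded pop would remove an earlier element's text.
    @text_parts.pop if @text_parts.length > size_before
  end.join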

  @text_parts << paragraph_text unless paragraph_text.empty?
end
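
The push-then-pop collection above is only safe when each pop removes what the current child pushed. As a minimal sketch (GuardedRun and collect_paragraph_text are hypothetical names, not Uniword API), the following shows one way to enforce that with a length check, so a run whose text is nil cannot remove an earlier paragraph's entry from the shared buffer:

```ruby
# Hypothetical stand-in for a run; text may be nil.
GuardedRun = Struct.new(:text)

# Collects run texts through a shared buffer, popping only what the
# current run pushed.
def collect_paragraph_text(runs, text_parts)
  paragraph_text = runs.filter_map do |run|
    size_before = text_parts.length
    text_parts << run.text if run.text # mirrors what visit_run does
    text_parts.pop if text_parts.length > size_before
  end.join

  text_parts << paragraph_text unless paragraph_text.empty?
  text_parts
end

# The buffer already holds an earlier paragraph; the first run has no text.
buffer = ["Earlier paragraph"]
collect_paragraph_text([GuardedRun.new(nil), GuardedRun.new("Next")], buffer)
```

With the guard in place the earlier entry stays separate from the new paragraph; without it, the pop for the nil run would pull "Earlier paragraph" into the new paragraph's text.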

#visit_run(run) ⇒ void

This method returns an undefined value.

Visit a run and extract its text content.

Parameters:

  • run (Run)

    The run to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 105

def visit_run(run)
  @text_parts << run.text if run.text
end

#visit_table(table) ⇒ void

This method returns an undefined value.

Visit a table and extract text from all its cells.

Parameters:

  • table (Table)

    The table to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 69

def visit_table(table)
  table.rows.each do |row|
    row.accept(self)
  end
end

#visit_table_cell(table_cell) ⇒ void

This method returns an undefined value.

Visit a table cell and extract its text content.

Parameters:

  • table_cell (TableCell)

    The table cell to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 92

def visit_table_cell(table_cell)
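  cell_text = table_cell.paragraphs.filter_map do |paragraph|
    size_before = @text_parts.length
    paragraph.accept(self)
    # Pop only what this paragraph added; an empty paragraph adds nothing
    @text_parts.pop if @text_parts.length > size_before
  end.join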

  @text_parts << cell_text
end

#visit_table_row(table_row) ⇒ void

This method returns an undefined value.

Visit a table row and extract text from all its cells.

Parameters:

  • table_row (TableRow)

    The table row to visit



# File 'lib/uniword/visitor/text_extractor.rb', line 79

def visit_table_row(table_row)
  row_text = table_row.cells.filter_map do |cell|
    cell.accept(self)
    @text_parts.pop # Get the last added text
  end.join(" | ")

  @text_parts << row_text unless row_text.empty?
end
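
To make the resulting layout concrete, here is a rough sketch that uses nested arrays in place of table, row, cell and paragraph objects (the data and variable names are hypothetical). It follows the joining rules in the listings above: paragraph texts inside a cell are concatenated with no separator, cells are joined with " | ", and a row is kept only if its joined text is non-empty.

```ruby
# Hypothetical nested arrays: table -> rows -> cells -> paragraph texts.
sketch_table = [
  [["Name"], ["Role"]],
  [["Ada"], ["Engineer", " (lead)"]],
  [[]] # a single-cell row with no text
]

row_lines = sketch_table.filter_map do |row|
  # Concatenate each cell's paragraphs, then join the cells with " | ".
  row_text = row.map { |cell_paragraphs| cell_paragraphs.join }.join(" | ")
  row_text unless row_text.empty?
end

puts row_lines.join("\n")
```

Because every cell contributes an entry even when it is empty, a row with several cells still produces the " | " delimiters between them; only a row whose joined text is entirely empty, such as the single-cell row here, is dropped.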