Class: VivlioStarter::Pdf::Reader::PageTextCollector

Inherits:
HexaPDF::Content::Processor
  • Object
Defined in:
lib/vivlio_starter/cli/pdf/reader.rb

Overview

Walks a HexaPDF content stream and collects the text fragments and image occurrences that fall within the page's type area.

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(resources, bounds:, line_merge_tolerance:) ⇒ PageTextCollector

Returns a new instance of PageTextCollector.

Parameters:

  • resources (HexaPDF::Type::Resources)

    Page resources

  • bounds (Hash, nil)

    Coordinate bounds of the text extraction region

  • line_merge_tolerance (Float)

    Threshold (in pt) for the Y-coordinate difference within which fragments are treated as being on the same line



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 63

def initialize(resources, bounds:, line_merge_tolerance:)
  super(resources)
  @bounds = bounds
  @line_merge_tolerance = line_merge_tolerance.to_f
  @fragments = []
  @image_occurrences = []
end
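
The shape of `bounds` is not shown in this section. As a minimal sketch, assuming it is a Hash with the hypothetical keys `:left`, `:right`, `:bottom`, and `:top` (in PDF points) and that `nil` means "no restriction", a region test might look like this:

```ruby
# Hypothetical helper: the key names are assumptions, not taken from reader.rb.
def within_bounds?(x, y, bounds)
  return true if bounds.nil? # nil is assumed to mean "no restriction"

  x.between?(bounds[:left], bounds[:right]) &&
    y.between?(bounds[:bottom], bounds[:top])
end

bounds = { left: 50.0, right: 545.0, bottom: 60.0, top: 780.0 }
puts within_bounds?(100.0, 700.0, bounds) # inside the region
puts within_bounds?(100.0, 20.0, bounds)  # below the bottom edge
puts within_bounds?(100.0, 20.0, nil)     # no bounds given
```

Passing `nil` keeps every fragment, which matches the `(Hash, nil)` type documented for the parameter.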

Instance Attribute Details

#fragments ⇒ Object (readonly)

Returns the value of attribute fragments.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 58

def fragments
  @fragments
end

#image_occurrences ⇒ Object (readonly)

Returns the value of attribute image_occurrences.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 58

def image_occurrences
  @image_occurrences
end

Instance Method Details

#lines ⇒ Object

Groups the collected fragments by Y coordinate and returns an array of Line objects.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 82

def lines
  build_lines
end
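
`build_lines` is not shown in this section. The sketch below illustrates one way Y-coordinate grouping with a tolerance could work. It uses plain Hashes (`{ x:, y:, text: }`) as stand-ins for the collector's fragment and Line types, and the helper names `group_into_lines` and `line_text` are hypothetical.

```ruby
# Group fragment hashes into lines. A fragment joins the current line when its
# Y coordinate is within `tolerance` points of that line's first fragment.
def group_into_lines(fragments, tolerance)
  # PDF's Y axis points up, so sort descending to read top to bottom.
  sorted = fragments.sort_by { |f| -f[:y] }
  sorted.each_with_object([]) do |fragment, lines|
    current = lines.last
    if current && (current[:y] - fragment[:y]).abs <= tolerance
      current[:fragments] << fragment
    else
      lines << { y: fragment[:y], fragments: [fragment] }
    end
  end
end

# Join a line's fragments left to right.
def line_text(line)
  line[:fragments].sort_by { |f| f[:x] }.map { |f| f[:text] }.join
end

fragments = [
  { x: 120.0, y: 700.4, text: 'world' },
  { x: 72.0,  y: 700.0, text: 'Hello, ' },
  { x: 72.0,  y: 686.0, text: 'second line' }
]
lines = group_into_lines(fragments, 2.0)
puts lines.map { |l| line_text(l) }.join("\n")
```

Comparing against the first fragment of each line, rather than the previous fragment, keeps a slowly drifting baseline from merging unrelated lines. The actual implementation may choose differently.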

#paint_xobject(name) ⇒ Object

PDF operator Do: paints an XObject. If the XObject is an image, records where it occurs.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 92

def paint_xobject(name)
  xobject = resources.xobject(name)
  collect_image_occurrence(xobject) if image_object?(xobject)
  super
rescue StandardError
  nil
end
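
`image_object?` is not shown in this section. In PDF, an image XObject is a stream dictionary whose `/Subtype` entry is `/Image`, while form XObjects use `/Form`. A minimal sketch of such a predicate follows. It uses plain Hashes as stand-ins for HexaPDF objects, and the name `image_xobject?` is hypothetical.

```ruby
# Hypothetical predicate: true only for dictionary-like objects whose
# :Subtype entry is :Image.
def image_xobject?(xobject)
  return false unless xobject.respond_to?(:[])

  xobject[:Subtype] == :Image
end

image = { Type: :XObject, Subtype: :Image, Width: 640, Height: 480 }
form  = { Type: :XObject, Subtype: :Form }
puts image_xobject?(image)
puts image_xobject?(form)
puts image_xobject?(nil) # a missing XObject is not an image
```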

#show_text(data) ⇒ Object

PDF operator Tj: shows text.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 72

def show_text(data)
  collect_text_box(decode_text_with_positioning(data))
end

#show_text_with_positioning(data) ⇒ Object

PDF operator TJ: shows text with positioning adjustments.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 77

def show_text_with_positioning(data)
  collect_text_box(decode_text_with_positioning(data))
end

#text ⇒ Object

Returns the text of all lines joined with newlines.



# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 87

def text
  lines.map(&:text).join("\n")
end