Class: VivlioStarter::Pdf::Reader

Inherits: Object

Defined in: lib/vivlio_starter/cli/pdf/reader.rb

Overview

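A high-precision PDF reader built on HexaPDF.

Combines text coordinate analysis, image extraction, OCR integration, and automatic detection of illustration regions to provide a PDF-to-Markdown conversion pipeline.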

Defined Under Namespace

Classes: Fragment, IllustrationRegion, ImageAsset, ImageOccurrence, Line, OcrResult, OcrSettings, PageContent, PageTextCollector, RenderedPageCrop, ResolvedPage

Constant Summary

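MIN_ILLUSTRATION_AREA_RATIO = 0.02

Illustration-detection constant. Minimum area ratio for a region to be treated as an illustration.

ILLUSTRATION_TEXT_ASPECT_MAX = 6.0

Upper bound on the aspect ratio used to distinguish illustration regions from text regions.

FOREGROUND_THRESHOLD =

Luminance threshold for foreground detection (pixels below this value are treated as foreground).

ROW_ACTIVITY_STDDEV_SCALE = 0.5

Standard-deviation scale factor for the row-profile threshold.

COLUMN_ACTIVITY_THRESHOLD = 0.05

Per-column foreground activity threshold.

PROFILE_SMOOTHING_SIGMA_RATIO = 0.012

Sigma ratio for Gaussian smoothing (as a fraction of the image height).

MIN_PROFILE_SMOOTHING_SIGMA =

MAX_PROFILE_SMOOTHING_SIGMA =

Lower and upper bounds for the smoothing sigma.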
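The smoothing constants suggest that sigma is derived from the image height and then kept within a lower and upper bound. The following is a minimal sketch of that idea, not the Reader's actual implementation: the method names `profile_smoothing_sigma` and `gaussian_smooth` are hypothetical, and the bounds are passed in as arguments because their real values are not shown on this page.

```ruby
# Hypothetical helpers: not part of the documented VivlioStarter::Pdf::Reader API.
PROFILE_SMOOTHING_SIGMA_RATIO = 0.012 # value taken from the constant summary

# Returns a Gaussian sigma proportional to the image height,
# clamped to caller-supplied bounds (the real bounds are not shown here).
def profile_smoothing_sigma(image_height, min_sigma:, max_sigma:)
  (image_height * PROFILE_SMOOTHING_SIGMA_RATIO).clamp(min_sigma, max_sigma)
end

# Smooths a 1-D profile with a normalized Gaussian kernel.
# Indices past either end reuse the nearest edge value.
def gaussian_smooth(profile, sigma)
  radius = (sigma * 3).ceil
  kernel = (-radius..radius).map { |i| Math.exp(-(i**2) / (2.0 * sigma**2)) }
  total = kernel.sum
  profile.each_index.map do |idx|
    acc = 0.0
    kernel.each_with_index do |weight, k|
      j = (idx + k - radius).clamp(0, profile.length - 1)
      acc += weight * profile[j]
    end
    acc / total
  end
end
```

Scaling sigma with the image height keeps the amount of smoothing roughly comparable across render resolutions, while the clamp prevents very small or very large pages from producing a degenerate kernel.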

Instance Method Summary

#execute ⇒ Hash

Parses the PDF and returns a Hash containing the Markdown text, image assets, and metadata.

#initialize(pdf_path, page_separator: true, text_area: nil, line_merge_tolerance: 2.0, images_dir: nil, image_reference_dir: nil, ocr: nil) ⇒ Reader (constructor)

A new instance of Reader.

Constructor Details

#initialize(pdf_path, page_separator: true, text_area: nil, line_merge_tolerance: 2.0, images_dir: nil, image_reference_dir: nil, ocr: nil) ⇒ `Reader`

Returns a new instance of Reader.

Parameters:

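pdf_path (String): Path to the input PDF.

page_separator (Boolean) (defaults to: true): Whether to insert a "---" separator between pages.

text_area (Hash, nil) (defaults to: nil): Margins of the text extraction area, in points (pt).

line_merge_tolerance (Float) (defaults to: 2.0): Threshold on the Y-coordinate difference, in points (pt), within which text is treated as being on the same line.

images_dir (String, nil) (defaults to: nil): Directory in which extracted images are saved.

image_reference_dir (String, nil) (defaults to: nil): Base path used for image references inside the Markdown.

ocr (Hash, nil) (defaults to: nil): OCR settings.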
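To illustrate what a Y-coordinate tolerance such as `line_merge_tolerance` does, here is a minimal sketch of grouping positioned text fragments into lines. It is an assumption about the general technique, not the Reader's actual algorithm: `TextFragment` and `group_into_lines` are hypothetical names, and the real `Fragment` and `Line` classes may be shaped differently.

```ruby
# Hypothetical stand-in for a positioned text fragment; the real Fragment class may differ.
TextFragment = Struct.new(:text, :x, :y)

# Groups fragments into lines: a fragment joins the current line when its Y
# coordinate is within `tolerance` points of that line's first fragment.
def group_into_lines(fragments, tolerance: 2.0)
  lines = []
  # PDF Y coordinates grow upward, so sort top-to-bottom, then left-to-right.
  fragments.sort_by { |f| [-f.y, f.x] }.each do |fragment|
    current = lines.last
    if current && (current.first.y - fragment.y).abs <= tolerance
      current << fragment
    else
      lines << [fragment]
    end
  end
  # Within each line, order fragments by X and join their text.
  lines.map { |line| line.sort_by(&:x).map(&:text).join }
end

fragments = [
  TextFragment.new("World", 60.0, 699.2),
  TextFragment.new("Hello ", 10.0, 700.0),
  TextFragment.new("Next line", 10.0, 680.0)
]
puts group_into_lines(fragments)
```

With the default tolerance of 2.0 pt, the first two fragments (0.8 pt apart vertically) end up on one line, while the third starts a new line. A larger tolerance merges more aggressively, which can help with superscripts or slightly misaligned glyph runs at the cost of occasionally fusing tightly spaced lines.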

# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 233

def initialize(pdf_path, page_separator: true, text_area: nil, line_merge_tolerance: 2.0, images_dir: nil, image_reference_dir: nil, ocr: nil)
  @pdf_path = pdf_path
  @page_separator = page_separator != false
  @text_area = normalize_text_area(text_area)
  @line_merge_tolerance = line_merge_tolerance.to_f
  @images_dir = images_dir&.to_s&.strip
  @image_reference_dir = image_reference_dir&.to_s&.strip
  @ocr = normalize_ocr(ocr)
end
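Two of the coercions in the constructor have consequences that are easy to miss: `page_separator != false` treats `nil` as enabled, and `&.to_s&.strip` leaves `nil` untouched while stripping whitespace from everything else. The standalone sketch below re-creates just those two expressions; the method names `coerce_page_separator` and `coerce_dir` are hypothetical and exist only for illustration.

```ruby
# Standalone re-creation of two constructor coercions, for illustration only.

# Only an explicit false disables the separator; nil (or any other value) keeps it on.
def coerce_page_separator(value)
  value != false
end

# nil stays nil; anything else is converted to a String and stripped.
def coerce_dir(value)
  value&.to_s&.strip
end

p coerce_page_separator(nil)
p coerce_page_separator(false)
p coerce_dir("  out/images ")
p coerce_dir(nil)
```

Note that a blank string passed as a directory becomes `""` rather than `nil` under this coercion, so callers that want "no directory" should pass `nil` explicitly.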

Instance Method Details

#execute ⇒ `Hash`

Parses the PDF and returns a Hash containing the Markdown text, image assets, and metadata.

Returns:

(Hash): with keys :markdown, :page_texts, :page_chunks, :pages, :images

# File 'lib/vivlio_starter/cli/pdf/reader.rb', line 245

def execute
  document = HexaPDF::Document.open(pdf_path)
  page_texts = []
  page_chunks = []
  images = []

  document.pages.each_with_index do |page, index|
    resolution = nil

    begin
      # Extract the raw page content, then resolve it into the content used below.
      content = extract_page_content(page, index)
      resolution = resolve_page_content(page, index, content)
      page_images = extract_page_images(page, resolution.content.image_occurrences, index,
                                        suppress_full_page_scans: resolution.ocr_applied)
      page_lines, image_captions = apply_inline_image_text_policy(resolution.content.lines, page_images)
      page_text = build_page_text(page_lines, resolution.content.text, image_captions)
      page_texts << page_text
      page_chunks << build_page_chunk(page_lines, page_images, page_text, image_captions:)
      images.concat(page_images)
    ensure
      # Runs even if a step above raised; resolution may still be nil at this point.
      cleanup_ocr_temp_dir(resolution&.ocr_temp_dir)
    end
  end

  # Assemble the result Hash from the per-page data.
  {
    markdown: build_markdown(page_chunks),
    page_texts:,
    page_chunks:,
    pages: document.pages.count,
    images: images.map(&:to_h)
  }
end
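A caller would typically obtain the result via `VivlioStarter::Pdf::Reader.new(pdf_path).execute` and then read the documented keys. The sketch below shows one way to consume such a Hash without depending on the library: `summarize_result` is a hypothetical helper, and the stub data (including the `:path` key inside the image entry) is made up for illustration, since the exact shape of the image hashes is not documented here.

```ruby
# Hypothetical summary helper; `result` is assumed to carry the keys
# :markdown, :page_texts, :page_chunks, :pages, and :images.
def summarize_result(result)
  non_empty = result[:page_texts].count { |text| !text.to_s.strip.empty? }
  "#{result[:pages]} pages, #{non_empty} with text, #{result[:images].size} images"
end

# Stub data standing in for Reader#execute output (values are invented).
result = {
  markdown: "# Title\n\n---\n\nBody",
  page_texts: ["# Title", "Body", ""],
  page_chunks: [],
  pages: 3,
  images: [{ path: "images/p1.png" }]
}
puts summarize_result(result)
```

Because `:page_texts` gets one entry per page, comparing the number of non-empty entries against `:pages` is a quick way to spot pages where neither text extraction nor OCR produced any text.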