Class: Vivlio::Starter::PDF::MecabNewlineCleaner

Inherits:
Object
Defined in:
lib/vivlio/starter/cli/pdf/mecab_newline_cleaner.rb

Overview

A cleaner that uses MeCab to remove unwanted line breaks from text extracted from PDFs.

Defined Under Namespace

Classes: Token

Constant Summary

CONNECTIVE_PREFIXES =
%w[   ので のに けど けれど けれども ものの  すると そして しかし でも].freeze
LIST_MARKER_REGEX =
/\A(?:[-*・]|\d+\.)/
HEADING_REGEX =
/\A#+/
SECTION_HEADING_PREFIX_REGEX =
/\A(?:第[一二三四五六七八九十百千0-9]+章|[♣♠♥♦]?\s*\d+(?:-\d+)*(?:\.)?)\z/
SECTION_HEADING_REGEX =
/\A(?:第[一二三四五六七八九十百千0-9]+章\s*.+|[♣♠♥♦]?\s*\d+(?:-\d+)*(?:\.)?\s*.+)\z/
CHAPTER_HEADING_ONLY_REGEX =
/\A第[一二三四五六七八九十百千0-9]+章\z/
PUNCTUATION_REGEX =
/[。.!?!?…]+[))]】」』》]*\z/
SMALL_KANA_START_REGEX =
/\A[ゃゅょぁぃぅぇぉゎっァィゥェォヵヶッャュョヮ]/
MIDWORD_END_REGEX =
/[A-Za-z0-9a-zA-Z0-9ァ-ヶぁ-ゖ一-龯ー々〆ヵヶ]\z/
MIDWORD_START_REGEX =
/\A[A-Za-z0-9a-zA-Z0-9ァ-ヶぁ-ゖ一-龯ー々〆ヵヶ]/
ENDING_AUXILIARIES =
%w[です ます  だった でした でしょう だろう である ません でしたら].freeze
ENDING_PARTICLES =
%w[  よね    かな かしら].freeze
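As a rough illustration of how the line-level patterns above behave, the sketch below copies three of the constants verbatim and applies them to sample lines. The `classify_line` helper and the sample lines are hypothetical and are not part of this class; how the cleaner actually combines these patterns is not shown on this page.

```ruby
# Constants copied from the summary above.
LIST_MARKER_REGEX = /\A(?:[-*・]|\d+\.)/
HEADING_REGEX = /\A#+/
CHAPTER_HEADING_ONLY_REGEX = /\A第[一二三四五六七八九十百千0-9]+章\z/

# Hypothetical helper (not in the library): label a line by the first
# pattern it matches, falling back to :body.
def classify_line(line)
  return :chapter_heading if line.match?(CHAPTER_HEADING_ONLY_REGEX)
  return :heading if line.match?(HEADING_REGEX)
  return :list_item if line.match?(LIST_MARKER_REGEX)

  :body
end

SAMPLE_LINES = ['第3章', '第十二章', '## Setup', '- item', '1. first', 'plain text'].freeze
LABELS = SAMPLE_LINES.map { |line| classify_line(line) }
```

Note that `CHAPTER_HEADING_ONLY_REGEX` is anchored at both ends, so a line such as `第3章 はじめに` would not match it; `SECTION_HEADING_REGEX` is the pattern that allows trailing title text.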

Instance Method Summary

Constructor Details

#initialize(config = nil) ⇒ MecabNewlineCleaner

Returns a new instance of MecabNewlineCleaner.



# File 'lib/vivlio/starter/cli/pdf/mecab_newline_cleaner.rb', line 26

def initialize(config = nil)
  cfg = config || {}
  @mecab_command = cfg[:mecab_command] || 'mecab'
  @mecab_cache = {}
end
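The fallback logic in the constructor can be tried in isolation. The sketch below uses a hypothetical stand-in class, `CleanerConfigSketch`, that mirrors only the two configuration lines shown above; the path `/usr/local/bin/mecab` is an assumed example value, not a documented default.

```ruby
# Hypothetical stand-in that mirrors only the initializer shown above.
class CleanerConfigSketch
  attr_reader :mecab_command

  def initialize(config = nil)
    cfg = config || {}
    @mecab_command = cfg[:mecab_command] || 'mecab'
  end
end

# No config at all: falls back to the bare command name.
default_cmd = CleanerConfigSketch.new.mecab_command

# Symbol key: the value is used as given (the path is an assumed example).
custom_cmd = CleanerConfigSketch.new({ mecab_command: '/usr/local/bin/mecab' }).mecab_command

# String key: not looked up by cfg[:mecab_command], so the default applies.
string_key_cmd = CleanerConfigSketch.new({ 'mecab_command' => '/usr/local/bin/mecab' }).mecab_command
```

Because the lookup uses the symbol key `:mecab_command`, a plain Hash loaded with string keys (for example from YAML) would need its keys symbolized before being passed in.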

Instance Method Details

#clean(text) ⇒ Object



# File 'lib/vivlio/starter/cli/pdf/mecab_newline_cleaner.rb', line 32

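def clean(text)
  str = normalize_pdf_extracted_text(text)
  return str if str.empty?

  # Split on runs of two or more newlines. The capture group keeps each gap
  # in the result, so even indices hold paragraphs and odd indices hold gaps.
  segments = str.split(/(\n{2,})/, -1)
  rebuilt = []

  segments.each_with_index do |segment, idx|
    if idx.even?
      rebuilt << clean_paragraph(segment)
    else
      prev_block = rebuilt.last
      next_block = segments[idx + 1]

      if chapter_heading_gap?(prev_block, next_block)
        # Collapse the gap to a single newline.
        rebuilt << "\n"
      elsif chapter_title_gap?(rebuilt, prev_block, next_block)
        rebuilt << "\n"
      elsif should_merge_gap?(prev_block, next_block)
        # Drop the gap entirely so the two blocks are joined.
      else
        # Keep the original blank-line gap unchanged.
        rebuilt << segment
      end
    end
  end

  rebuilt.join
end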
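The even/odd indexing in `#clean` relies on how `String#split` treats a capture group and a negative limit. The following sketch, using a made-up sample string, shows the resulting segment layout; it exercises only core Ruby and none of this class's private helpers.

```ruby
# Sample input (made up for illustration): three paragraphs, two gaps.
text = "line one\nline two\n\nnext paragraph\n\n\nlast"

# The capture group makes split return the separators too, and the -1 limit
# preserves a trailing empty string when the text ends with a gap.
segments = text.split(/(\n{2,})/, -1)

# Even indices are paragraphs, odd indices are the blank-line gaps.
paragraphs = segments.each_slice(2).map(&:first)
gaps = segments.each_slice(2).map { |pair| pair[1] }.compact

# Text that ends with a gap keeps a trailing empty paragraph with -1.
trailing = "a\n\n".split(/(\n{2,})/, -1)
```

Since nothing is discarded by the split, `segments.join` reproduces the input exactly, which is why `#clean` can decide gap by gap whether to keep, shrink, or drop each separator.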