Class: Coelacanth::Extractor::MorphologicalAnalyzer
- Inherits:
- Object
- Coelacanth::Extractor::MorphologicalAnalyzer < Object
- Defined in:
- lib/coelacanth/extractor/morphological_analyzer.rb
Overview
Scores candidate morphemes extracted from article content. The implementation follows a lightweight heuristic approach that approximates the specification shared in the user instructions. It prioritises noun-like phrases in both Japanese and English text, applies positional boosts, and returns the highest-scoring terms.
Defined Under Namespace
Classes: Term
Constant Summary
- TOKEN_PATTERN =
/
  \p{Han}+ |                        # Kanji sequences
  \p{Hiragana}+ |                   # Hiragana sequences
  [\p{Katakana}ー]+ |               # Katakana sequences including the choonpu
  [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*    # Latin alphanumerics keeping inner hyphen
/x.freeze
- MARKDOWN_CONTROL_PATTERN =
/[`*_>#\[\]\(\)\{\}!\+=|~]/.freeze
- EN_STOPWORDS =
%w[ a an and are as at be but by for if in into is it its of on or such that the their then there these they this to was were will with ].freeze
- EN_GENERIC_TERMS =
%w[ page pages article articles category categories tag tags image images video videos click home link links read more author authors post posts ].freeze
- JA_GENERIC_TERMS =
%w[カテゴリ カテゴリー 記事 画像 写真 まとめ サイト 投稿 ブログ 最新 人気 関連].freeze
- FULLWIDTH_ALPHA =
"ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ".freeze
- HALF_WIDTH_ALPHA =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".freeze
- FULLWIDTH_DIGITS =
"０１２３４５６７８９".freeze
- HALF_WIDTH_DIGITS =
"0123456789".freeze
- FULLWIDTH_HYPHENS =
"-―ーー".freeze
- TOP_K =
8
- MAX_POSITION_BOOST =
3.0
- LENGTH_BONUS_FACTOR =
0.15
- MAX_LENGTH_BONUS =
1.6
- POSITION_WEIGHTS =
{ body: 1.0, title: 2.2, h1: 1.6, h2: 1.3, accent: 1.1 }.freeze
- CATEGORY_ALIASES =
{ "kanji" => :kanji, "hiragana" => :hiragana, "katakana" => :katakana, "latin" => :latin }.freeze
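As a rough illustration of how the width constants and TOKEN_PATTERN could work together, the sketch below folds fullwidth letters and digits to ASCII with String#tr and then splits the text with String#scan. It assumes the FULLWIDTH_* constants hold the fullwidth forms (U+FF21 to U+FF3A, U+FF41 to U+FF5A, U+FF10 to U+FF19); `WidthSketch`, `normalize_width`, and `tokenize` are hypothetical names for this example and are not part of the documented class.

```ruby
# Sketch only: the module and helper names are hypothetical, and the fullwidth
# code point ranges are an assumption about what the FULLWIDTH_* constants contain.
module WidthSketch
  TOKEN_PATTERN = /
    \p{Han}+ |                        # Kanji sequences
    \p{Hiragana}+ |                   # Hiragana sequences
    [\p{Katakana}ー]+ |               # Katakana sequences including the choonpu
    [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*    # Latin alphanumerics keeping inner hyphen
  /x.freeze

  # Build the fullwidth strings from code points so both sides of the
  # translation are guaranteed to have the same length and order.
  FULLWIDTH_ALPHA   = ((0xFF21..0xFF3A).to_a + (0xFF41..0xFF5A).to_a).pack("U*").freeze
  HALF_WIDTH_ALPHA  = (("A".."Z").to_a + ("a".."z").to_a).join.freeze
  FULLWIDTH_DIGITS  = (0xFF10..0xFF19).to_a.pack("U*").freeze
  HALF_WIDTH_DIGITS = ("0".."9").to_a.join.freeze

  # Map each fullwidth letter or digit to its ASCII counterpart.
  def self.normalize_width(text)
    text.tr(FULLWIDTH_ALPHA + FULLWIDTH_DIGITS, HALF_WIDTH_ALPHA + HALF_WIDTH_DIGITS)
  end

  # Normalize first, then return every run matched by TOKEN_PATTERN.
  def self.tokenize(text)
    normalize_width(text).scan(TOKEN_PATTERN)
  end
end

tokens = WidthSketch.tokenize("Ｒｕｂｙ３の state-of-the-art カメラ東京")
puts tokens.inspect
```

Because the pattern uses only non-capturing groups, `scan` returns a flat array of matched strings, one per script run or hyphenated Latin word.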
Instance Method Summary
- #call(node:, title:, markdown:) ⇒ Object
- #call_text(text, title: nil) ⇒ Object
- #initialize(config: Coelacanth.config) ⇒ MorphologicalAnalyzer (constructor): A new instance of MorphologicalAnalyzer.
Constructor Details
#initialize(config: Coelacanth.config) ⇒ MorphologicalAnalyzer
Returns a new instance of MorphologicalAnalyzer.
# File 'lib/coelacanth/extractor/morphological_analyzer.rb', line 63
def initialize(config: Coelacanth.config)
  @config = config
end
Instance Method Details
#call(node:, title:, markdown:) ⇒ Object
# File 'lib/coelacanth/extractor/morphological_analyzer.rb', line 71
def call(node:, title:, markdown:)
  stats = Hash.new do |hash, key|
    hash[key] = { token: nil, components: 1, body_count: 0, pos_bonus: 0.0, language: nil }
  end

  body_terms = extract_terms(markdown)
  contexts = [
    [POSITION_WEIGHTS[:title], extract_terms(title)],
    [POSITION_WEIGHTS[:h1], extract_terms(text_for(node, "h1"))],
    [POSITION_WEIGHTS[:h2], extract_terms(text_for(node, "h2"))],
    [
      POSITION_WEIGHTS[:accent],
      extract_terms(
        [
          text_for(node, "a"),
          text_for(node, "strong"),
          text_for(node, "b"),
          text_for(node, "img", attribute: "alt")
        ].compact.join(" ")
      )
    ],
    [POSITION_WEIGHTS[:body], body_terms]
  ]

  contexts.each do |weight, terms|
    next if terms.empty?

    grouped = terms.group_by(&:key)
    grouped.each_value do |occurrences|
      representative = occurrences.max_by(&:components)
      entry = stats[representative.key]
      entry[:token] ||= representative.token
      entry[:components] = [entry[:components], representative.components].max
      entry[:language] ||= representative.language
      bonus = weight - 1.0
      entry[:pos_bonus] += bonus if bonus.positive?
    end
  end

  body_terms.each do |term|
    entry = stats[term.key]
    entry[:token] ||= term.token
    entry[:components] = [entry[:components], term.components].max
    entry[:language] ||= term.language
    entry[:body_count] += 1
  end

  scored = stats.values.map do |entry|
    next if entry[:body_count].zero?

    tf = Math.log(entry[:body_count] + 1.0)
    pos_boost = [1.0 + entry[:pos_bonus], MAX_POSITION_BOOST].min
    len_bonus = [1.0 + LENGTH_BONUS_FACTOR * (entry[:components] - 1), MAX_LENGTH_BONUS].min
    score = tf * pos_boost * len_bonus

    entry.merge(score: score)
  end.compact

  return [] if scored.empty?

  sorted = scored.sort_by { |entry| [-entry[:score], entry[:token]] }
  pruned = prune_inclusions(sorted)

  max_score = pruned.first[:score]
  threshold = max_score * 0.35

  selected = pruned.select { |entry| entry[:score] >= threshold }
  if selected.length < TOP_K
    pruned.each do |entry|
      next if selected.include?(entry)

      selected << entry
      break if selected.length >= TOP_K
    end
  end

  selected = selected.take(TOP_K)

  selected.map do |entry|
    { token: entry[:token], score: entry[:score], count: entry[:body_count] }
  end
end
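The scoring arithmetic and the selection step in #call can be tried in isolation. The following is a minimal sketch that copies the numeric constants from the listing; `ScoringSketch`, `score_for`, and `select_top` are hypothetical names introduced here, and the `prune_inclusions` step is left out because its body is not shown in this page.

```ruby
# Sketch only: module and method names are hypothetical; constants are copied
# from the Constant Summary. prune_inclusions is deliberately omitted.
module ScoringSketch
  TOP_K = 8
  MAX_POSITION_BOOST = 3.0
  LENGTH_BONUS_FACTOR = 0.15
  MAX_LENGTH_BONUS = 1.6

  # Log-scaled body frequency multiplied by capped position and length factors.
  def self.score_for(body_count:, pos_bonus:, components:)
    tf = Math.log(body_count + 1.0)
    pos_boost = [1.0 + pos_bonus, MAX_POSITION_BOOST].min
    len_bonus = [1.0 + LENGTH_BONUS_FACTOR * (components - 1), MAX_LENGTH_BONUS].min
    tf * pos_boost * len_bonus
  end

  # Expects entries already sorted by descending score. Keeps those scoring at
  # least 35% of the best, tops up to TOP_K from the rest, then caps at TOP_K.
  def self.select_top(sorted_entries)
    return [] if sorted_entries.empty?

    threshold = sorted_entries.first[:score] * 0.35
    selected = sorted_entries.select { |entry| entry[:score] >= threshold }
    sorted_entries.each do |entry|
      break if selected.length >= TOP_K

      selected << entry unless selected.include?(entry)
    end
    selected.take(TOP_K)
  end
end

# A two-component term seen 3 times in the body that also appears in the
# title (2.2 - 1.0) and an h1 (1.6 - 1.0), giving a positional bonus of 1.8.
score = ScoringSketch.score_for(body_count: 3, pos_bonus: 1.2 + 0.6, components: 2)
puts score
```

One consequence visible in the sketch: when fewer than TOP_K entries clear the 35% threshold, the top-up loop adds lower-scoring entries anyway, so the threshold only changes the result when it is combined with other filtering.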
#call_text(text, title: nil) ⇒ Object
# File 'lib/coelacanth/extractor/morphological_analyzer.rb', line 67
def call_text(text, title: nil)
  call(node: nil, title: title, markdown: text)
end