Class: Ace::Docs::Atoms::TerminologyExtractor
- Inherits: Object
  - Object
    - Ace::Docs::Atoms::TerminologyExtractor
- Defined in: lib/ace/docs/atoms/terminology_extractor.rb
Overview
Extracts and analyzes terminology from documents to find conflicts, such as near-identical terms or inconsistent usage of the same term across documents.
Constant Summary
- COMMON_WORDS =
Common words to exclude from terminology analysis
%w[ a an and are as at be but by for from has have i in is it of on or that the this to was will with you your we our us their them they can could should would may might must shall will do does did done get got gets getting make makes made making take takes took taken use uses used using go goes went gone going come comes came coming see sees saw seen seeing know knows knew known knowing think thinks thought thinking want wants wanted wanting need needs needed needing give gives gave given giving find finds found finding tell tells told telling work works worked working call calls called calling try tries tried trying ask asks asked asking feel feels felt feeling become becomes became becoming leave leaves left leaving put puts putting keep keeps kept keeping let lets letting begin begins began beginning seem seems seemed seeming help helps helped helping talk talks talked talking turn turns turned turning start starts started starting show shows showed shown showing hear hears heard hearing play plays played playing run runs ran running move moves moved moving like likes liked liking live lives lived living believe believes believed believing bring brings brought bringing happen happens happened happening write writes wrote written writing provide provides provided providing sit sits sat sitting stand stands stood standing lose loses lost losing pay pays paid paying meet meets met meeting include includes included including continue continues continued continuing set sets setting learn learns learned learning change changes changed changing lead leads led leading understand understands understood understanding watch watches watched watching follow follows followed following stop stops stopped stopping create creates created creating speak speaks spoke spoken speaking read reads reading allow allows allowed allowing add adds added adding spend spends spent spending grow grows grew grown growing open opens opened opening walk walks walked walking win wins won 
winning offer offers offered offering remember remembers remembered remembering love loves loved loving consider considers considered considering appear appears appeared appearing buy buys bought buying wait waits waited waiting serve serves served serving die dies died dying send sends sent sending expect expects expected expecting build builds built building stay stays stayed staying fall falls fell fallen falling cut cuts cutting reach reaches reached reaching kill kills killed killing remain remains remained remaining suggest suggests suggested suggesting raise raises raised raising pass passes passed passing sell sells sold selling require requires required requiring report reports reported reporting decide decides decided deciding pull pulls pulled pulling one two three four five six seven eight nine ten first second third last next new old good bad best worst more most less least very much many few some any all no not yes other another each every either neither both such own same different various certain several many most few little much enough only just still already yet even also too quite rather almost nearly always usually often sometimes rarely never again further then once now here there where when why how what which who whom whose if unless until while although though because since before after during within without through across beyond behind below beneath beside between above over under around among against along toward towards upon down up out off away back forward backward forwards backwards inside outside onto into about for from with without by at in on to as of ].freeze
Instance Method Summary
- #extract_terms(content, doc_path = nil) ⇒ Hash
  Extract key terms from document content with frequency counts.
- #filter_common_words(terms) ⇒ Array
  Filter out common words from a list of terms.
- #find_conflicts(documents) ⇒ Array
  Find terminology conflicts across multiple documents.
Instance Method Details
#extract_terms(content, doc_path = nil) ⇒ Hash
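Extract key terms from document content with frequency counts. Each key of the returned Hash is a lowercased term mapped to :count, :locations (one entry per occurrence, with :file and a 1-based :line), and :variations (a Set of original casings). Only terms that occur more than once or have more than one recorded variation are returned.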
    # File 'lib/ace/docs/atoms/terminology_extractor.rb', line 69

    def extract_terms(content, doc_path = nil)
      terms = {}
      lines = content.lines

      lines.each_with_index do |line, index|
        # Skip code blocks and front matter
        next if line.strip.start_with?("```", "---")

        # Extract words and normalize them
        words = line.downcase.scan(/\b[a-z]+(?:-[a-z]+)*\b/)

        words.each do |word|
          # Skip common words and very short words
          next if COMMON_WORDS.include?(word) || word.length < 3

          # Track term frequency and locations
          terms[word] ||= {count: 0, locations: [], variations: Set.new}
          terms[word][:count] += 1
          terms[word][:locations] << {file: doc_path, line: index + 1}

          # Track original variations (case)
          original = line[/\b#{Regexp.escape(word)}\b/i]
          terms[word][:variations] << original if original
        end
      end

      # Filter to meaningful terms (appears multiple times or has variations)
      terms.select do |_term, data|
        data[:count] > 1 || data[:variations].size > 1
      end
    end
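To make the return shape concrete, here is a minimal standalone sketch of the same extraction logic. It does not load the library: `extract_terms_sketch`, `STOP_WORDS`, and the sample document are hypothetical names introduced for illustration, and the stop-word list is a tiny stand-in for COMMON_WORDS.

```ruby
require "set"

# Hypothetical, abbreviated stand-in for COMMON_WORDS.
STOP_WORDS = %w[the and for with are this that].freeze
# Fence marker built at runtime so this example contains no literal fence.
FENCE = "`" * 3

# Standalone copy of the logic in extract_terms, for experimentation only.
def extract_terms_sketch(content, doc_path = nil)
  terms = {}
  content.lines.each_with_index do |line, index|
    # Only the delimiter lines themselves are skipped here.
    next if line.strip.start_with?(FENCE, "---")

    line.downcase.scan(/\b[a-z]+(?:-[a-z]+)*\b/).each do |word|
      next if STOP_WORDS.include?(word) || word.length < 3

      entry = (terms[word] ||= {count: 0, locations: [], variations: Set.new})
      entry[:count] += 1
      entry[:locations] << {file: doc_path, line: index + 1}

      # The regex returns the first match in the line, so one casing
      # is recorded per line even if the word appears twice.
      original = line[/\b#{Regexp.escape(word)}\b/i]
      entry[:variations] << original if original
    end
  end
  terms.select { |_term, data| data[:count] > 1 || data[:variations].size > 1 }
end

doc = "The Workflow engine loads each workflow file.\nA task runs once.\n"
result = extract_terms_sketch(doc, "guide.md")

puts result.keys.inspect                      # ["workflow"]
puts result["workflow"][:count]               # 2
puts result["workflow"][:variations].to_a.inspect  # ["Workflow"]
```

Note that both occurrences of "workflow" sit on the same line, so only the first casing ("Workflow") lands in :variations; the term is kept because its count exceeds 1.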
#filter_common_words(terms) ⇒ Array
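Filter out common words from a list of terms. Each term is downcased before the lookup, so matching against COMMON_WORDS is case-insensitive; the surviving terms keep their original casing.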
    # File 'lib/ace/docs/atoms/terminology_extractor.rb', line 143

    def filter_common_words(terms)
      terms.reject { |term| COMMON_WORDS.include?(term.downcase) }
    end
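As an illustration of the filtering behavior, the following standalone sketch applies the same reject-by-downcased-lookup idea. `SAMPLE_STOP_WORDS`, `SAMPLE_STOP_SET`, and `filter_common_words_sketch` are hypothetical names, and the Set-based lookup is an optional variation on the Array lookup used in the source, not something the library is known to do.

```ruby
require "set"

# Hypothetical, abbreviated stand-in for COMMON_WORDS.
SAMPLE_STOP_WORDS = %w[the and for with are this that].freeze
# A Set gives constant-time membership checks; the source uses an Array.
SAMPLE_STOP_SET = SAMPLE_STOP_WORDS.to_set.freeze

def filter_common_words_sketch(terms)
  # Downcase only for the lookup; returned terms keep their casing.
  terms.reject { |term| SAMPLE_STOP_SET.include?(term.downcase) }
end

filtered = filter_common_words_sketch(%w[The Workflow and Task])
puts filtered.inspect  # ["Workflow", "Task"]
```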
#find_conflicts(documents) ⇒ Array
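Find terminology conflicts across multiple documents. The documents argument is iterated as (path, content) pairs, so a Hash of path => content fits. Two terms are reported when their similarity score is above 0.7 but below 1.0, or when are_variants? returns true; inconsistent usage of a single base term is also collected. The helpers calculate_similarity, are_variants?, build_conflict, and find_inconsistent_usage are not shown on this page.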
    # File 'lib/ace/docs/atoms/terminology_extractor.rb', line 104

    def find_conflicts(documents)
      all_terms = {}
      conflicts = []

      # Extract terms from each document
      documents.each do |path, content|
        doc_terms = extract_terms(content, path)
        doc_terms.each do |term, data|
          all_terms[term] ||= {}
          all_terms[term][path] = data
        end
      end

      # Find similar terms that might be conflicts
      term_list = all_terms.keys
      term_list.each_with_index do |term1, i|
        term_list[(i + 1)..-1].each do |term2|
          similarity = calculate_similarity(term1, term2)

          # Check for potential conflicts (similar but not identical)
          if similarity > 0.7 && similarity < 1.0
            conflicts << build_conflict(term1, term2, all_terms)
          elsif are_variants?(term1, term2)
            conflicts << build_conflict(term1, term2, all_terms)
          end
        end
      end

      # Also find inconsistent usage of the same base term
      find_inconsistent_usage(all_terms, conflicts)

      conflicts.compact
    end
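The pairwise comparison above depends on calculate_similarity, whose implementation is not shown here. The sketch below is one plausible stand-in, assuming a normalized Levenshtein similarity; `similarity_sketch` and `similar_pairs_sketch` are hypothetical names, and the real helper may use a different metric. Only the 0.7 and 1.0 bounds come from the source.

```ruby
# Hypothetical stand-in for calculate_similarity: 1.0 minus the
# Levenshtein distance divided by the longer string's length.
def similarity_sketch(a, b)
  return 1.0 if a == b

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # Minimum of deletion, insertion, and substitution costs.
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  1.0 - prev.last.to_f / [a.length, b.length].max
end

# Visits each unordered pair once, like the nested loop in find_conflicts,
# and keeps pairs that are similar but not identical.
def similar_pairs_sketch(terms, threshold = 0.7)
  terms.combination(2).select do |t1, t2|
    score = similarity_sketch(t1, t2)
    score > threshold && score < 1.0
  end
end

pairs = similar_pairs_sketch(%w[workflow workflows task])
puts pairs.inspect  # [["workflow", "workflows"]]
```

Under this metric, "workflow" and "workflows" differ by one edit over nine characters (about 0.89), so the pair would be flagged, while "task" is too far from either to qualify.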