Class: Ace::Docs::Atoms::TerminologyExtractor

Inherits:
Object
Defined in:
lib/ace/docs/atoms/terminology_extractor.rb

Overview

Extracts and analyzes terminology from documents to find conflicts

Constant Summary

COMMON_WORDS =

Common words to exclude from terminology analysis

%w[
  a an and are as at be but by for from has have i in is it of on or
  that the this to was will with you your we our us their them they
  can could should would may might must shall will do does did done
  get got gets getting make makes made making take takes took taken
  use uses used using go goes went gone going come comes came coming
  see sees saw seen seeing know knows knew known knowing think thinks
  thought thinking want wants wanted wanting need needs needed needing
  give gives gave given giving find finds found finding tell tells told
  telling work works worked working call calls called calling try tries
  tried trying ask asks asked asking feel feels felt feeling become
  becomes became becoming leave leaves left leaving put puts putting
  keep keeps kept keeping let lets letting begin begins began beginning
  seem seems seemed seeming help helps helped helping talk talks talked
  talking turn turns turned turning start starts started starting show
  shows showed shown showing hear hears heard hearing play plays played
  playing run runs ran running move moves moved moving like likes liked
  liking live lives lived living believe believes believed believing
  bring brings brought bringing happen happens happened happening write
  writes wrote written writing provide provides provided providing sit
  sits sat sitting stand stands stood standing lose loses lost losing
  pay pays paid paying meet meets met meeting include includes included
  including continue continues continued continuing set sets setting
  learn learns learned learning change changes changed changing lead
  leads led leading understand understands understood understanding
  watch watches watched watching follow follows followed following stop
  stops stopped stopping create creates created creating speak speaks
  spoke spoken speaking read reads reading allow allows allowed allowing
  add adds added adding spend spends spent spending grow grows grew
  grown growing open opens opened opening walk walks walked walking win
  wins won winning offer offers offered offering remember remembers
  remembered remembering love loves loved loving consider considers
  considered considering appear appears appeared appearing buy buys
  bought buying wait waits waited waiting serve serves served serving
  die dies died dying send sends sent sending expect expects expected
  expecting build builds built building stay stays stayed staying fall
  falls fell fallen falling cut cuts cutting reach reaches reached
  reaching kill kills killed killing remain remains remained remaining
  suggest suggests suggested suggesting raise raises raised raising
  pass passes passed passing sell sells sold selling require requires
  required requiring report reports reported reporting decide decides
  decided deciding pull pulls pulled pulling one two three four five
  six seven eight nine ten first second third last next new old good
  bad best worst more most less least very much many few some any all
  no not yes other another each every either neither both such own same
  different various certain several many most few little much enough
  only just still already yet even also too quite rather almost nearly
  always usually often sometimes rarely never again further then once
  now here there where when why how what which who whom whose if unless
  until while although though because since before after during within
  without through across beyond behind below beneath beside between
  above over under around among against along toward towards upon down
  up out off away back forward backward forwards backwards inside
  outside onto into about for from with without by at in on to as of
].freeze
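
The methods on this page test membership with `Array#include?`, which scans the list on every call. The following is a minimal sketch, not the library's implementation, of the same check backed by a `Set`. The names `SAMPLE_COMMON_WORDS`, `SAMPLE_COMMON_WORD_SET`, and `sample_common_word?` are hypothetical, and the word list is a shortened stand-in for `COMMON_WORDS`.

```ruby
require "set"

# Hypothetical, shortened stand-in for COMMON_WORDS.
SAMPLE_COMMON_WORDS = %w[a an and the of to in].freeze

# Build the Set once; Set#include? is a hash lookup rather than a linear scan.
SAMPLE_COMMON_WORD_SET = SAMPLE_COMMON_WORDS.to_set.freeze

# Case-insensitive membership test, mirroring the downcase step used below.
def sample_common_word?(word)
  SAMPLE_COMMON_WORD_SET.include?(word.downcase)
end

is_common = sample_common_word?("The")
is_term = !sample_common_word?("workflow")
```

Converting to a `Set` also collapses repeated entries; the list above repeats a few words (for example `will`, `many`, and `for`), which does not affect membership tests either way.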

Instance Method Summary

Instance Method Details

#extract_terms(content, doc_path = nil) ⇒ Hash

Extract key terms from document content with frequency counts

Parameters:

  • content (String)

    the document content

  • doc_path (String) (defaults to: nil)

    the document path for reference

Returns:

  • (Hash)

    terms mapped to { count:, locations:, variations: }; only terms that occur more than once or appear in more than one casing are returned



# File 'lib/ace/docs/atoms/terminology_extractor.rb', line 69

def extract_terms(content, doc_path = nil)
  terms = {}
  lines = content.lines

  lines.each_with_index do |line, index|
    # Skip code-fence and front-matter delimiter lines (lines between them are still scanned)
    next if line.strip.start_with?("```", "---")

    # Extract words and normalize them
    words = line.downcase.scan(/\b[a-z]+(?:-[a-z]+)*\b/)

    words.each do |word|
      # Skip common words and very short words
      next if COMMON_WORDS.include?(word) || word.length < 3

      # Track term frequency and locations
      terms[word] ||= {count: 0, locations: [], variations: Set.new}
      terms[word][:count] += 1
      terms[word][:locations] << {file: doc_path, line: index + 1}

      # Track original variations (case)
      original = line[/\b#{Regexp.escape(word)}\b/i]
      terms[word][:variations] << original if original
    end
  end

  # Filter to meaningful terms (appears multiple times or has variations)
  terms.select do |_term, data|
    data[:count] > 1 || data[:variations].size > 1
  end
end
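
To make the return shape concrete, here is a small self-contained sketch. It restates the method body from the listing above as a standalone function so it can run without the gem. `demo_extract_terms` and `DEMO_EXTRACT_STOP_WORDS` are hypothetical names, and the stop-word list is a tiny stand-in for `COMMON_WORDS`.

```ruby
require "set"

# Hypothetical, shortened stand-in for COMMON_WORDS.
DEMO_EXTRACT_STOP_WORDS = %w[the and for are with this that].freeze

# Standalone restatement of extract_terms, for illustration only.
def demo_extract_terms(content, doc_path = nil)
  terms = {}

  content.lines.each_with_index do |line, index|
    # Skips only the delimiter lines themselves.
    next if line.strip.start_with?("```", "---")

    line.downcase.scan(/\b[a-z]+(?:-[a-z]+)*\b/).each do |word|
      next if DEMO_EXTRACT_STOP_WORDS.include?(word) || word.length < 3

      entry = (terms[word] ||= {count: 0, locations: [], variations: Set.new})
      entry[:count] += 1
      entry[:locations] << {file: doc_path, line: index + 1}

      # First case-insensitive match on the line, in its original casing.
      original = line[/\b#{Regexp.escape(word)}\b/i]
      entry[:variations] << original if original
    end
  end

  # Keep terms seen more than once or seen in more than one casing.
  terms.select { |_term, data| data[:count] > 1 || data[:variations].size > 1 }
end

content = <<~DOC
  The Workflow runs nightly.
  Each workflow has a single owner.
DOC

result = demo_extract_terms(content, "docs/guide.md")
term_names = result.keys
workflow = result["workflow"]
```

With this input only `workflow` is kept: it occurs twice (lines 1 and 2) and in two casings, `Workflow` and `workflow`. Every other word occurs once in a single casing and is filtered out. Note that the `start_with?` check skips the fence and front-matter delimiter lines only; text between a pair of delimiters is still scanned.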

#filter_common_words(terms) ⇒ Array

Filter out common words from a list of terms

Parameters:

  • terms (Array)

    list of terms to filter

Returns:

  • (Array)

    filtered list without common words



# File 'lib/ace/docs/atoms/terminology_extractor.rb', line 143

def filter_common_words(terms)
  terms.reject { |term| COMMON_WORDS.include?(term.downcase) }
end
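
A short usage sketch of the filter, written as a standalone function so it runs without the gem. `demo_filter_common_words` and `DEMO_FILTER_STOP_WORDS` are hypothetical names, and the list is a shortened stand-in for `COMMON_WORDS`.

```ruby
# Hypothetical, shortened stand-in for COMMON_WORDS.
DEMO_FILTER_STOP_WORDS = %w[the and for with].freeze

# Same logic as filter_common_words: the comparison is case-insensitive,
# but surviving terms keep their original casing.
def demo_filter_common_words(terms)
  terms.reject { |term| DEMO_FILTER_STOP_WORDS.include?(term.downcase) }
end

filtered = demo_filter_common_words(%w[The Workflow AND pipeline])
```

Here `The` and `AND` are dropped because their lowercase forms are in the list, while `Workflow` and `pipeline` are returned unchanged. Each element must respond to `downcase`, so the input is expected to be an array of strings.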

#find_conflicts(documents) ⇒ Array

Find terminology conflicts across multiple documents

Parameters:

  • documents (Hash)

    hash of { path => content }

Returns:

  • (Array)

    array of conflict hashes



# File 'lib/ace/docs/atoms/terminology_extractor.rb', line 104

def find_conflicts(documents)
  all_terms = {}
  conflicts = []

  # Extract terms from each document
  documents.each do |path, content|
    doc_terms = extract_terms(content, path)

    doc_terms.each do |term, data|
      all_terms[term] ||= {}
      all_terms[term][path] = data
    end
  end

  # Find similar terms that might be conflicts
  term_list = all_terms.keys

  term_list.each_with_index do |term1, i|
    term_list[(i + 1)..-1].each do |term2|
      similarity = calculate_similarity(term1, term2)

      # Check for potential conflicts (similar but not identical)
      if similarity > 0.7 && similarity < 1.0
        conflicts << build_conflict(term1, term2, all_terms)
      elsif are_variants?(term1, term2)
        conflicts << build_conflict(term1, term2, all_terms)
      end
    end
  end

  # Also find inconsistent usage of the same base term
  find_inconsistent_usage(all_terms, conflicts)

  conflicts.compact
end
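
The pairwise comparison above depends on `calculate_similarity`, `are_variants?`, `build_conflict`, and `find_inconsistent_usage`, none of which are shown on this page. The sketch below illustrates only the thresholding step (similarity above 0.7 and below 1.0), and it substitutes a normalized Levenshtein ratio as the similarity measure. That substitution is an assumption made for illustration; the library's actual measure may differ. All `demo_` names are hypothetical.

```ruby
# Classic dynamic-programming edit distance, keeping one row at a time.
def demo_levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# ASSUMPTION: stand-in for calculate_similarity, which is not shown here.
# Returns 1.0 for identical strings and approaches 0.0 as they diverge.
def demo_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - demo_levenshtein(a, b).to_f / longest
end

# Visit every unordered pair once and keep the similar-but-not-identical ones.
def demo_similar_pairs(terms, threshold: 0.7)
  terms.combination(2).select do |term1, term2|
    score = demo_similarity(term1, term2)
    score > threshold && score < 1.0
  end
end

pairs = demo_similar_pairs(%w[workflow workflows pipeline database])
```

Under this stand-in measure, `workflow` and `workflows` differ by one edit over nine characters, so they score about 0.89 and are flagged; the other pairs share almost no characters and fall well below the threshold. `Array#combination(2)` visits the same pairs as the nested `each_with_index` loop in the listing.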